
Option to not save the current state of the object #1099

Closed
volkanunsal opened this issue Jun 1, 2018 · 23 comments

Comments

@volkanunsal

Is your feature suggestion related to a problem? Please describe.

I'd like to use PaperTrail for an event sourcing project. I only want to keep the object_changes and I don't want to store the current state of the object in the events table.

Describe the solution you'd like to build

I'd like a configuration option that tells PaperTrail to skip the object column, similar to the existing option to not record object_changes:

has_paper_trail save_object: false

Describe alternatives you've considered

Monkey patching the library, which I'm already doing.

@volkanunsal volkanunsal changed the title Option to not store current state of the object Option to not save the current state of the object Jun 1, 2018
@jaredbeck
Member

@seanlinsley and I discussed this recently in #1046. Can you address some of the concerns I raised there? Of course, Sean suggested omitting object only for updates, and you suggest omitting object entirely, so your suggestion is simpler.

How would you handle the situation where someone turned on this config option for one week, then turned it off, then tried to reify something? How would you handle reify in general? It's harder to reify from object_changes than from object, but it could be done.

How would you handle compatibility with the experimental association tracking feature? I think it would be OK to say that this option is incompatible, but we have to raise a helpful error if people try to combine them.

Would you have people drop their object column? Requiring the column to be dropped would make people stop and think. It would also make it really clear that the data would come from object_changes in the case of a reify.

@volkanunsal
Author

volkanunsal commented Jun 1, 2018

@jaredbeck Reification seems to be equivalent to projection. The way to derive the projection of an event-sourced record is to start from the beginning and apply the changes of all events in the order in which they were created. This can be an expensive operation when a record has a lot of changes, so derived state is usually stored in a separate table.
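
A minimal sketch of that replay (hypothetical helper, not PaperTrail's API; assumes every event stores [old, new] pairs in object_changes, ordered oldest first):

def project(events)
  # fold each version's object_changes into the accumulated state
  events.reduce({}) do |state, version|
    version.object_changes.each_with_object(state) do |(attr, (_old, new_value)), memo|
      memo[attr] = new_value
    end
  end
end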

How would you handle the situation where someone turned on this config option for one week, then turned it off, then tried to reify something?

If reify were to apply changes rather than using the materialized object for a version, it would still work. For this to work completely, a prerequisite is that the user opts in to storing all events.

Would you have people drop their object column?

No, this column can be used like a projection table and increase reification performance.

@volkanunsal
Author

Thinking over the object column a bit more: it might be a little hard to use it as a projection table, but it's still needed, I think, because some other model might be using it. They could drop it safely only if the option were implemented as a global configuration.

@jaredbeck
Member

If reify were to apply [incremental] changes, rather than using .. object ..

It seems to me we are talking about two possible designs for a versioning library:

  1. "snapshots" - what PT currently does, vs.
  2. "incremental changes"

So, I feel like we're discussing a big change. In fact, it feels like the biggest and most fundamental change. You can see why I am nervous about it :)

I think it would be cleaner to fork, or just write a different library. I'm concerned about maintaining the complexity of having both "patterns" in one library.

In my mind, these two patterns are not compatible. So, if we were to support both patterns in the same library, I would want a really clear separation between them. That is why I suggested dropping the object column.

@volkanunsal
Author

I agree with that approach. I don't use it very often myself, because schema changes between the snapshots can cause problems.

@jaredbeck
Member

@seanlinsley what do you think?

@seanlinsley
Member

Over the years of running Rails applications, there have only been a handful of times when reify + save has been needed, and even fewer instances where it was that simple (schema changes, issues with associations, callbacks, and APIs got in the way). Almost every time, it has required writing a custom restore script that inlined the data to restore instead of relying on the version history (often that data would be gathered from database backups, or generated by hand from a UI showing PaperTrail's history).

The ability to accurately reify records that have gone through years of application changes hasn't been that useful for me. What's much more important on a multi-year time scale is storing as few extra bytes as possible.

For simplicity's sake I think it'd be fine to disable reify if the object column doesn't exist.

@jaredbeck
Member

Thanks for the thoughtful analysis, Sean.

It sounds like we agree on dropping the object column. Then, if anyone tries to use a feature that depends on the object column, like reify, or where_object, a friendly error should be raised.
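
A sketch of that guard (names and message hypothetical, not the shipped code):

def reify(options = {})
  unless self.class.column_names.include?("object")
    raise "reify requires the versions.object column, which appears to have been dropped"
  end
  # ... existing reify logic ...
end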

@jaredbeck
Member

It sounds like we agree on dropping the object column.

Volkan, sound good? What steps are you going to take next in your implementation of this new feature?

@volkanunsal
Author

@jaredbeck Yup, that's good by me.

As far as the implementation goes, I have a patch for the RecordTrail module that eliminates the use of the object method. I could submit that as a PR, but it would also need to be accompanied by a configuration option and other niceties, like an error message. Given my time constraints and lack of familiarity with the codebase, I may not be able to do all of that. But let me know how I can help!

@seanlinsley
Member

It appears that while Paper Trail populates both object and object_changes on create events, it only populates object on destroy events.

It's common for us to perform analytics on destroyed objects based on e.g. a foreign key. Dropping the object column would drastically increase the operational cost of performing those types of queries.

With that in mind, it seems there are two paths forward:

  • add a setting so object isn't populated on update events (proposed in Skip object on update #1046)
  • make it so destroy events store everything in object_changes as if they're being set to null:
{
  "id": [1, null]
}
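
For illustration, a full destroy payload under that scheme might look like this (attribute names and values invented):

{
  "id": [1, null],
  "name": ["Widget", null],
  "user_id": [42, null]
}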

Changing the behavior of destroy events might be considered a breaking change, so I think adding a setting to disable use of object by update events makes the most sense.

@seanlinsley
Member

Once we agree on a course of action, I'd be happy to submit a PR. Alternatively, @volkanunsal I'd be more than willing to pair with you :octocat:

@jaredbeck
Member

add a setting so object isn't populated on update events (proposed in #1046)

I'm opposed to this approach. I've stated my concerns here and in #1046. I believe such a setting would confuse users. I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent. Then, they'll try to reify and that won't work of course.

By contrast, dropping the object column is unambiguous.

make it so destroy events store everything in object_changes as if they're being set to null:

That sounds like a great idea. It should be controlled by a per-request setting (in PaperTrail::Request), and that setting should be off by default.
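
Usage might look something like this (PaperTrail.request's block form exists today, but the option name here is invented to illustrate the proposal):

# hypothetical option; not an existing PaperTrail setting
PaperTrail.request(record_object_changes_on_destroy: true) do
  widget.destroy
end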

Alternatively, @volkanunsal I'd be more than willing to pair with you :octocat:

Volkan, I hope you get the chance to pair with Sean on this. Sean has contributed to PT many times, and has worked on some other big ruby projects. You'll have fun.

@seanlinsley
Member

seanlinsley commented Jul 7, 2018

I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent.

That scenario seems unlikely to me, but I'm imagining a world where we've properly documented the tradeoffs of both approaches so the option for any given application should be clear.

However, seeing how the object_changes column is already optional, it may be a more consistent API to make it so that the object column is also optional.

make it so destroy events store everything in object_changes as if they're being set to null

That sounds like a great idea. It should be controlled by a per-request setting (in PaperTrail::Request), and that setting should be off by default.

Looking through the code, it seems like PaperTrail::Config is used for global configuration, and PaperTrail::Request is used for runtime metadata + model-specific configuration. I was imagining that this would be a global setting; what reason would anyone have to enable it if they hadn't removed the object column?

Another option would be for object_changes to be automatically populated if the object column doesn't exist.
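
A sketch of that automatic behavior, hinging on a plain column check (hypothetical helper name):

# populate object_changes on destroy only when the snapshot column is gone
def populate_object_changes_on_destroy?(version_class)
  !version_class.column_names.include?("object")
end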

That's persuasive for me b/c it presents a clear choice with simple steps to take.

Want to be able to reify? Keep object.
Want to see incremental changes? Keep object_changes.
Worried about disk usage? Drop object.

Which raises the question... why do we currently have the ability to disable object_changes as a per-model setting? From #507:

For certain models especially with large TEXT columns I don't want to store object_changes but still need object_changes stored on other models.

Which is odd... b/c object would still be storing those large text columns even if object_changes doesn't.

That's a big issue when maintaining a long-lived OSS project: past decisions make today's API design much more difficult than it needs to be 💀

@seanlinsley
Member

Expanding out the feature matrix, you could:

  • drop object
  • keep object, but set save_object: false on some models
  • drop object_changes
  • keep object_changes, but set save_changes: false on some models

That seems reasonable. The remaining question becomes what to do with storing object_changes data on destroy events. Doing it automatically makes sense when the object column doesn't exist, but could be weird with save_object: false. It seems like it'd be pretty surprising for developers to find that some destroy events have data in object_changes while others don't, purely b/c of a setting on the model.

For simplicity's sake I think I'd rather do a major version bump and start storing data in object_changes on all destroy events. As a beginner dev I remember being confused that object_changes was missing on destroy events but not on creation events, so doing this might lower the amount of surprising behavior in Paper Trail. In the changelog we could include a migration that existing users could run to backfill the column (at least for JSON/B users on Postgres).

@jaredbeck
Member

I like your analysis, Sean.

Another option would be for object_changes to be automatically populated if the object column doesn't exist.

Makes sense to me, but I think I prefer the simplicity of your other suggestion: to start storing data in object_changes on all destroy events (see below).

.. why do we currently have the ability to disable object_changes as a per-model setting? .. [the object column] would still be storing those large text columns even if object_changes doesn't.

I guess it allows them to save some disk space. I'd like to keep the save_changes option if we can, if for no other reason than backwards compatibility.

If the object column were dropped and save_changes: false were set, should we raise an error? Maybe not, because they might be content just storing metadata.

That's a big issue when maintaining a long-lived OSS project: past decisions make today's API design much more difficult than it needs to be 💀

Yeah, it's a challenge. Generally speaking, I'm not opposed to breaking changes as long as we provide a convenient, well-documented (and well-deprecated) path to upgrade.

Expanding out the feature matrix, you could:

  • drop object
  • keep object, but set save_object: false on some models
  • drop object_changes
  • keep object_changes, but set save_changes: false on some models

I don't like the idea of introducing a save_object option, because I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent. I have the same concern about save_changes, but short of removing that feature, there's not much we can do about that.

That said, maybe there's a person out there who wants to keep the object column so they can reify some models, but needs to save disk space by not storing it for other models? Is that likely? Does that person's hypothetical need outweigh my concern about novice users shooting themselves in the foot? WDYT?

For simplicity's sake I think I'd rather do a major version bump and start storing data in object_changes on all destroy events.

I'm OK with that. I'm guessing we don't store object_changes on destroy for two reasons. First, we wanted to save disk space. Second, we did not want where_object_changes to return the "destroy" event. Neither of those reasons is hugely compelling to me, though. If advanced users need to save every byte possible in the object_changes column, they can write their own object_changes_adapter. If they don't want where_object_changes to return the "destroy" event, they can add where("event <> 'destroy'") to their call sites.
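
For example, a call site opting out of destroy events might look like (attribute and value invented for illustration):

PaperTrail::Version.where_object_changes(name: "Joan").where("event <> 'destroy'")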

I'm hoping this can be broken up into a few PRs to make review easier:

  1. start storing data in object_changes on all destroy events
  2. allow people to drop the object column
  3. (undecided) add the suggested save_object option

@seanlinsley
Member

I don't like the idea of introducing a save_object option, because I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent. I have the same concern about save_changes, but short of removing that feature, there's not much we can do about that.

That said, maybe there's a person out there who wants to keep the object column so they can reify some models, but needs to save disk space by not storing it for other models? Is that likely? Does that person's hypothetical need outweigh my concern about novice users shooting themselves in the foot? WDYT?

I'm not so worried about the novice user situation b/c even if that does happen, it seems like it'd happen early enough in their trial usage of Paper Trail that they wouldn't have lost anything important.

I think #507, which introduced save_changes, was misguided: they were trying to reduce the amount of space used by a large text column, but didn't seem to realize that object would duplicate that column on every save.

If we agree that save_changes isn't a feature that makes sense, perhaps the direction forward is:

  1. in a minor version: deprecate save_changes (a sketch of the warning follows this list), and introduce a setting to enable object_changes on destroy
  2. in the next major version: remove save_changes, and track changes on destroy by default
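
A sketch of the step-1 warning, using ActiveSupport's standard deprecation mechanism (wording invented):

ActiveSupport::Deprecation.warn(
  "save_changes is deprecated and will be removed in the next major version of PaperTrail"
)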

Whichever direction we take, these two things would definitely need PRs:

  1. allow people to drop the object column
  2. add a setting to enable object_changes on destroy

So we can at least work on those in the meantime.

@jaredbeck
Member

Closing via #1122, #1123, and #1130.

@seanlinsley
Member

seanlinsley commented Sep 4, 2018

Here's the migration we ran. After pg_repack, our versions table went from 53 GB down to 23 GB.

Note that disable_statement_timeout! is from this gem; remove that line if you don't have it installed.

There's at least one spot where this migration is buggy: approx_count should really be doing something like select max(id) b/c otherwise the migration could end up skipping more recent records (in the case where a good number of version records have been deleted).

class DropObjectFromVersions < ActiveRecord::Migration[5.1]
  disable_ddl_transaction!
  disable_statement_timeout!

  def up
    # we perform these updates outside of a transaction (& in batches) to avoid locking the table
    batch_size = 1_000_000
    start = 0

# NOTE: reltuples is only an estimate (see the caveat above); max(id) would be safer
approx_count = execute(<<-SQL.squish).to_a.dig 0, 'reltuples'
      select reltuples::int from pg_class where relname = 'versions'
    SQL

    loop do
      break if start > approx_count
      stop = start + batch_size

      # 50% of the table size is `object`, which we don't use
      # https://github.com/paper-trail-gem/paper_trail/issues/1046
      #
      # before dropping the `object` column, we backfill `object_changes` on destroy
      # events so we'll still be able to do analytics queries on destroyed objects
      # https://github.com/paper-trail-gem/paper_trail/pull/1123
      update <<-SQL.squish
        update versions set object_changes = changes
        from (
          select id,
            json_object_agg(object.key, array[object.value, null]) as changes
          from versions
          join lateral jsonb_each(object) as object on true
          where event = 'destroy'
            and object_changes is null
            and id >= #{start} and id < #{stop}
          group by id
        ) as v
        where versions.id = v.id
      SQL

      # updated_at represents 21% of `object_changes` space usage, but isn't useful for us
      # https://github.com/paper-trail-gem/paper_trail/issues/1129
      update <<-SQL.squish
        update versions
        set object_changes = object_changes - 'updated_at'
        where object_changes ? 'updated_at'
          and id >= #{start} and id < #{stop}
      SQL

      start += batch_size
    end

    remove_column :versions, :object

    execute 'create extension pg_repack'

    # then after the deploy: `pg_repack -j 2 -t versions your_db`
  end

  def down
    fail ActiveRecord::IrreversibleMigration
  end
end
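
Per the caveat above, a safer upper bound would swap the reltuples estimate for max(id), something like:

# unaffected by deleted rows, unlike the reltuples estimate
approx_count = execute('select max(id) from versions').to_a.dig 0, 'max'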

This code should be added to an initializer to prevent updated_at from being stored going forward:

# PaperTrail calls the adapter's `diff` hook (when defined) before storing changes
class SkipUpdatedAt
  def diff(changes)
    changes.except! 'updated_at'
  end
end

PaperTrail.config.object_changes_adapter = SkipUpdatedAt.new

@jaredbeck
Member

Here's the migration we ran.

What, no down method?! j/k j/k 😁

After pg_repack, our versions table went from 53 GB down to 23 GB.

Nice!

.. backfill ..

That query is really interesting. I think I understand like 20% of it. I wonder if it's possible in MySQL/MariaDB.

@seanlinsley
Member

Yeah, the lateral join is a little mind-bending. It basically splits each key/value pair in object into its own row, then we re-combine them into a new object with json_object_agg.
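
A stripped-down version of the same technique, wrapped the way the migration wraps its SQL (illustrative only):

execute <<-SQL.squish
  select id, json_object_agg(kv.key, array[kv.value, null]) as changes
  from versions
  join lateral jsonb_each(object) as kv on true
  group by id
SQL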

@jaredbeck
Member

I think we should have a new top-level section in the readme titled "Large Databases". Can you paste all this stuff in there? We can't re-number readme sections because that breaks links, so this new section would be number 11.

@jeromedalbert

jeromedalbert commented May 8, 2020

Thank you to everyone who made this possible!

I can now make paper_trail use an incremental-change design just by omitting the object field, and I have the peace of mind of knowing that the DB size won't get out of control too quickly, even if I don't set version limits. I'm OK with the tradeoff of reify not working (although it would be a nice-to-have), since for the current app I'm working on we only care about change history.

This makes paper_trail viable for big databases. 👍
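
For anyone replicating that setup, the schema change itself is just one column (a sketch; adjust the migration superclass version to match your app):

class RemoveObjectFromVersions < ActiveRecord::Migration[6.0]
  def change
    # `object` is the snapshot column PaperTrail's generator creates by default
    remove_column :versions, :object, :text
  end
end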
