
Option to not save the current state of the object #1099

Closed
volkanunsal opened this issue Jun 1, 2018 · 23 comments

Comments

@volkanunsal

Is your feature suggestion related to a problem? Please describe.

I'd like to use PaperTrail for an event sourcing project. I only want to keep the object_changes and I don't want to store the current state of the object in the events table.

Describe the solution you'd like to build

I'd like a configuration option that tells PaperTrail to skip the object column, similar to the existing option to not record object_changes:

has_paper_trail save_object: false

Describe alternatives you've considered

Monkey patching the library, which I'm already doing.

@volkanunsal volkanunsal changed the title Option to not store current state of the object Option to not save the current state of the object Jun 1, 2018
@jaredbeck
Member

@seanlinsley and I discussed this recently in #1046. Can you address some of the concerns I raised there? Of course, Sean suggested omitting object only for updates, and you suggest omitting object entirely, so your suggestion is simpler.

How would you handle the situation where someone turned on this config option for one week, then turned it off, then tried to reify something? How would you handle reify in general? It's harder to reify from object_changes than from object, but it could be done.

How would you handle compatibility with the experimental association tracking feature? I think it would be OK to say that this option is incompatible, but we have to raise a helpful error if people try to combine them.

Would you have people drop their object column? Requiring the column to be dropped would make people stop and think. It would also make it really clear that the data would come from object_changes in the case of a reify.

@volkanunsal
Author

volkanunsal commented Jun 1, 2018

@jaredbeck Reification seems to be equivalent to projection. The way to derive the projection of an event-sourced record is to start from the beginning and apply the changes of all events in the order in which they were created. This can be an expensive operation when a record has a lot of changes, so derived state is usually stored in a separate table.
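
A minimal sketch of that replay (hypothetical helper, not PaperTrail's API; assumes every event stores [old, new] pairs in object_changes, ordered oldest first):

def project(events)
  # fold each version's object_changes into the accumulated state
  events.reduce({}) do |state, version|
    version.object_changes.each_with_object(state) do |(attr, (_old, new_value)), memo|
      memo[attr] = new_value
    end
  end
end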

How would you handle the situation where someone turned on this config option for one week, then turned it off, then tried to reify something?

If reify were to apply changes rather than using the materialized object for a version, it would still work. For this to work completely, a prerequisite is that the user opts in to storing all events.

Would you have people drop their object column?

No, this column can be used like a projection table and increase reification performance.

@volkanunsal
Author

Thinking over the object column a bit more: it might be a little hard to use it as a projection table, but it's still needed, I think, because some other model might be using it. They could drop it safely only if the option were implemented as a global configuration.

@jaredbeck
Member

If reify were to apply [incremental] changes, rather than using .. object ..

It seems to me we are talking about two possible designs for a versioning library:

  1. "snapshots" - what PT currently does, vs.
  2. "incremental changes"

So, I feel like we're discussing a big change. In fact, it feels like the biggest and most fundamental change. You can see why I am nervous about it :)

I think it would be cleaner to fork, or just write a different library. I'm concerned about maintaining the complexity of having both "patterns" in one library.

In my mind, these two patterns are not compatible. So, if we were to support both patterns in the same library, I would want a really clear separation between them. That is why I suggested dropping the object column.

@volkanunsal
Author

I agree with that approach. I don't use it very often myself, because schema changes between the snapshots can cause problems.

@jaredbeck
Member

@seanlinsley what do you think?

@seanlinsley
Member

Over the years of running Rails applications, there have only been a handful of times when reify + save has been needed, and even fewer instances where it was that simple (schema changes, issues with associations, callbacks, and APIs got in the way). Almost every time, it has required writing a custom restore script that inlined the data to restore instead of relying on the version history (often that data would be gathered from database backups, or generated by hand from a UI showing PaperTrail's history).

The ability to accurately reify records that have gone through years of application changes hasn't been that useful for me. What's much more important on a multi-year time scale is storing as few extra bytes as possible.

For simplicity's sake I think it'd be fine to disable reify if the object column doesn't exist.

@jaredbeck
Member

Thanks for the thoughtful analysis, Sean.

It sounds like we agree on dropping the object column. Then, if anyone tries to use a feature that depends on the object column, like reify, or where_object, a friendly error should be raised.
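
A sketch of that guard (names and message hypothetical, not the shipped code):

def reify(options = {})
  unless self.class.column_names.include?("object")
    raise "reify requires the versions.object column, which appears to have been dropped"
  end
  # ... existing reify logic ...
end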

@jaredbeck
Member

It sounds like we agree on dropping the object column.

Volkan, sound good? What steps are you going to take next in your implementation of this new feature?

@volkanunsal
Author

@jaredbeck Yup, that's good by me.

As far as the implementation goes, I have a patch for the RecordTrail module that eliminates the use of the object method. I could submit that as a PR, but it would also need to be accompanied by a configuration option and other niceties, like an error message. Given my time constraints and lack of familiarity with the codebase, I may not be able to do all of that. But let me know how I can help!

@seanlinsley
Member

It appears that while Paper Trail populates both object and object_changes on create events, it only populates object on destroy events.

It's common for us to perform analytics on destroyed objects based on e.g. a foreign key. Dropping the object column would drastically increase the operational cost of performing those types of queries.

With that in mind, it seems there are two paths forward:

  • add a setting so object isn't populated on update events (proposed in Skip object on update #1046)
  • make it so destroy events store everything in object_changes as if they're being set to null:
{
  "id": [1, null]
}
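
For illustration, a full destroy payload under that scheme might look like this (attribute names and values invented):

{
  "id": [1, null],
  "name": ["Widget", null],
  "user_id": [42, null]
}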

Changing the behavior of destroy events might be considered a breaking change, so I think adding a setting to disable use of object by update events makes the most sense.

@seanlinsley
Member

Once we agree on a course of action, I'd be happy to submit a PR. Alternatively, @volkanunsal I'd be more than willing to pair with you :octocat:

@jaredbeck
Member

add a setting so object isn't populated on update events (proposed in #1046)

I'm opposed to this approach. I've stated my concerns here and in #1046. I believe such a setting would confuse users. I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent. Then, they'll try to reify and that won't work of course.

By contrast, dropping the object column is unambiguous.

make it so destroy events store everything in object_changes as if they're being set to null:

That sounds like a great idea. It should be controlled by a per-request setting (in PaperTrail::Request), and that setting should be off by default.
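
Usage might look something like this (PaperTrail.request's block form exists today, but the option name here is invented to illustrate the proposal):

# hypothetical option; not an existing PaperTrail setting
PaperTrail.request(record_object_changes_on_destroy: true) do
  widget.destroy
end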

Alternatively, @volkanunsal I'd be more than willing to pair with you :octocat:

Volkan, I hope you get the chance to pair with Sean on this. Sean has contributed to PT many times, and has worked on some other big ruby projects. You'll have fun.

@seanlinsley
Member

seanlinsley commented Jul 7, 2018

I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent.

That scenario seems unlikely to me, but I'm imagining a world where we've properly documented the tradeoffs of both approaches so the option for any given application should be clear.

However, seeing how the object_changes column is already optional, it may be a more consistent API to make it so that the object column is also optional.

make it so destroy events store everything in object_changes as if they're being set to null

That sounds like a great idea. It should be controlled by a per-request setting (in PaperTrail::Request), and that setting should be off by default.

Looking through the code, it seems like PaperTrail::Config is used for global configuration, and PaperTrail::Request is used for runtime metadata + model-specific configuration. I was imagining that this would be a global setting; what reason would anyone have to enable it if they hadn't removed the object column?

Another option would be for object_changes to be automatically populated if the object column doesn't exist.
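
A sketch of that automatic behavior, hinging on a plain column check (hypothetical helper name):

# populate object_changes on destroy only when the snapshot column is gone
def populate_object_changes_on_destroy?(version_class)
  !version_class.column_names.include?("object")
end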

That's persuasive for me b/c it presents a clear choice with simple steps to take.

Want to be able to reify? Keep object.
Want to see incremental changes? Keep object_changes.
Worried about disk usage? Drop object.

Which raises the question... why do we currently have the ability to disable object_changes as a per-model setting? From #507:

For certain models especially with large TEXT columns I don't want to store object_changes but still need object_changes stored on other models.

Which is odd... b/c object would still be storing those large text columns even if object_changes doesn't.

That's a big issue when maintaining a long-lived OSS project: past decisions make today's API design much more difficult than it needs to be 💀

@seanlinsley
Member

Expanding out the feature matrix, you could:

  • drop object
  • keep object, but set save_object: false on some models
  • drop object_changes
  • keep object_changes, but set save_changes: false on some models

That seems reasonable. The remaining question becomes what to do with storing object_changes data on destroy events. Doing it automatically makes sense when the object column doesn't exist, but could be weird with save_object: false. It seems like it'd be pretty surprising for developers to find that some destroy events have data in object_changes while others don't, purely b/c of a setting on the model.

For simplicity's sake I think I'd rather do a major version bump and start storing data in object_changes on all destroy events. As a beginner dev I remember being confused that object_changes was missing on destroy events but not on creation events, so doing this might lower the amount of surprising behavior in Paper Trail. In the changelog we could include a migration that existing users could run to backfill the column (at least for JSON/B users on Postgres).

@jaredbeck
Member

I like your analysis, Sean.

Another option would be for object_changes to be automatically populated if the object column doesn't exist.

Makes sense to me, but I think I prefer the simplicity of your other suggestion: to start storing data in object_changes on all destroy events (see below).

.. why do we currently have the ability to disable object_changes as a per-model setting? .. [the object column] would still be storing those large text columns even if object_changes doesn't.

I guess it allows them to save some disk space. I'd like to keep the save_changes option if we can, if for no other reason than backwards compatibility.

If the object column were dropped and save_changes: false were set, should we raise an error? Maybe not, because they might be content just storing metadata.

That's a big issue when maintaining a long-lived OSS project: past decisions make today's API design much more difficult than it needs to be 💀

Yeah, it's a challenge. Generally speaking, I'm not opposed to breaking changes as long as we provide a convenient, well-documented (and well-deprecated) path to upgrade.

Expanding out the feature matrix, you could:

  • drop object
  • keep object, but set save_object: false on some models
  • drop object_changes
  • keep object_changes, but set save_changes: false on some models

I don't like the idea of introducing a save_object option, because I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent. I have the same concern about save_changes, but short of removing that feature, there's not much we can do about that.

That said, maybe there's a person out there who wants to keep the object column so they can reify some models, but needs to save disk space by not storing it for other models? Is that likely? Does that person's hypothetical need outweigh my concern about novice users shooting themselves in the foot? WDYT?

For simplicity's sake I think I'd rather do a major version bump and start storing data in object_changes on all destroy events.

I'm OK with that. I'm guessing we don't store object_changes on destroy for two reasons. First, we wanted to save disk space. Second, we did not want where_object_changes to return the "destroy" event. Neither of those reasons is hugely compelling to me, though. If advanced users need to save every byte possible in the object_changes column, they can write their own object_changes_adapter. If they don't want where_object_changes to return the "destroy" event, they can add where("event <> 'destroy'") to their call sites.
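
For example, a call site opting out of destroy events might look like (attribute and value invented for illustration):

PaperTrail::Version.where_object_changes(name: "Joan").where("event <> 'destroy'")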

I'm hoping this can be broken up into a few PRs to make review easier:

  1. start storing data in object_changes on all destroy events
  2. allow people to drop the object column
  3. (undecided) add the suggested save_object option

@seanlinsley
Member

I don't like the idea of introducing a save_object option, because I'm worried about people turning it on for a little while, then off, after which their database will be inconsistent. I have the same concern about save_changes, but short of removing that feature, there's not much we can do about that.

That said, maybe there's a person out there who wants to keep the object column so they can reify some models, but needs to save disk space by not storing it for other models? Is that likely? Does that person's hypothetical need outweigh my concern about novice users shooting themselves in the foot? WDYT?

I'm not so worried about the novice user situation b/c even if that does happen, it seems like it'd happen early enough in their trial usage of Paper Trail that they wouldn't have lost anything important.

I think #507, which introduced save_changes, was misguided: they were trying to reduce the amount of space used by a large text column, but didn't seem to realize that object would duplicate that column on every save.

If we agree that save_changes isn't a feature that makes sense, perhaps the direction forward is:

  1. in a minor version: deprecate save_changes (a sketch of the warning follows this list), and introduce a setting to enable object_changes on destroy
  2. in the next major version: remove save_changes, and track changes on destroy by default
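
A sketch of the step-1 warning, using ActiveSupport's standard deprecation mechanism (wording invented):

ActiveSupport::Deprecation.warn(
  "save_changes is deprecated and will be removed in the next major version of PaperTrail"
)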

Whichever direction we take, these two things would definitely need PRs:

  1. allow people to drop the object column
  2. add a setting to enable object_changes on destroy

So we can at least work on those in the meantime.

@jaredbeck
Member

Closing via #1122, #1123, and #1130.

@seanlinsley
Member

seanlinsley commented Sep 4, 2018

Here's the migration we ran. After pg_repack, our versions table went from 53 GB down to 23 GB.

Note that disable_statement_timeout! is from this gem; remove that line if you don't have it installed.

There's at least one spot where this migration is buggy: approx_count should really be doing something like select max(id) b/c otherwise the migration could end up skipping more recent records (in the case where a good number of version records have been deleted).

class DropObjectFromVersions < ActiveRecord::Migration[5.1]
  disable_ddl_transaction!
  disable_statement_timeout!

  def up
    # we perform these updates outside of a transaction (& in batches) to avoid locking the table
    batch_size = 1_000_000
    start = 0

# NOTE: reltuples is only an estimate (see the caveat above); max(id) would be safer
approx_count = execute(<<-SQL.squish).to_a.dig 0, 'reltuples'
      select reltuples::int from pg_class where relname = 'versions'
    SQL

    loop do
      break if start > approx_count
      stop = start + batch_size

      # 50% of the table size is `object`, which we don't use
      # https://github.com/paper-trail-gem/paper_trail/issues/1046
      #
      # before dropping the `object` column, we backfill `object_changes` on destroy
      # events so we'll still be able to do analytics queries on destroyed objects
      # https://github.com/paper-trail-gem/paper_trail/pull/1123
      update <<-SQL.squish
        update versions set object_changes = changes
        from (
          select id,
            json_object_agg(object.key, array[object.value, null]) as changes
          from versions
          join lateral jsonb_each(object) as object on true
          where event = 'destroy'
            and object_changes is null
            and id >= #{start} and id < #{stop}
          group by id
        ) as v
        where versions.id = v.id
      SQL

      # updated_at represents 21% of `object_changes` space usage, but isn't useful for us
      # https://github.com/paper-trail-gem/paper_trail/issues/1129
      update <<-SQL.squish
        update versions
        set object_changes = object_changes - 'updated_at'
        where object_changes ? 'updated_at'
          and id >= #{start} and id < #{stop}
      SQL

      start += batch_size
    end

    remove_column :versions, :object

    execute 'create extension pg_repack'

    # then after the deploy: `pg_repack -j 2 -t versions your_db`
  end

  def down
    fail ActiveRecord::IrreversibleMigration
  end
end
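
Per the caveat above, a safer upper bound would swap the reltuples estimate for max(id), something like:

# unaffected by deleted rows, unlike the reltuples estimate
approx_count = execute('select max(id) from versions').to_a.dig 0, 'max'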

This code should be added to an initializer to prevent updated_at from being stored going forward:

# PaperTrail calls the adapter's `diff` hook (when defined) before storing changes
class SkipUpdatedAt
  def diff(changes)
    changes.except! 'updated_at'
  end
end

PaperTrail.config.object_changes_adapter = SkipUpdatedAt.new

@jaredbeck
Member

Here's the migration we ran.

What, no down method?! j/k j/k 😁

After pg_repack, our versions table went from 53 GB down to 23 GB.

Nice!

.. backfill ..

That query is really interesting. I think I understand like 20% of it. I wonder if it's possible in MySQL/MariaDB.

@seanlinsley
Member

Yeah, the lateral join is a little mind-bending. It basically splits each key/value pair in object into its own row, then we re-combine them into a new object with json_object_agg.
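
A stripped-down version of the same technique, wrapped the way the migration wraps its SQL (illustrative only):

execute <<-SQL.squish
  select id, json_object_agg(kv.key, array[kv.value, null]) as changes
  from versions
  join lateral jsonb_each(object) as kv on true
  group by id
SQL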

@jaredbeck
Member

I think we should have a new top-level section in the readme titled "Large Databases". Can you paste all this stuff in there? We can't re-number readme sections because that breaks links, so this new section would be number 11.

@jeromedalbert

jeromedalbert commented May 8, 2020

Thank you to everyone who made this possible!

I can now make paper_trail use an incremental-change design just by omitting the object field, and I have the peace of mind of knowing that the DB size won't get out of control too quickly, even if I don't set version limits. I'm OK with the tradeoff of reify not working (although it would be a nice-to-have), since for the current app I'm working on we only care about change history.

This makes paper_trail viable for big databases. 👍
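
For anyone replicating that setup, the schema change itself is just one column (a sketch; adjust the migration superclass version to match your app):

class RemoveObjectFromVersions < ActiveRecord::Migration[6.0]
  def change
    # `object` is the snapshot column PaperTrail's generator creates by default
    remove_column :versions, :object, :text
  end
end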
