Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: timestamps waste a lot of space #1129

Closed
seanlinsley opened this issue Aug 12, 2018 · 12 comments
Closed

Discussion: timestamps waste a lot of space #1129

seanlinsley opened this issue Aug 12, 2018 · 12 comments

Comments

@seanlinsley
Copy link
Member

After #1046 / #1099, I thought it'd be interesting to see how much space updated_at ends up using. Turns out it represents 21% of the space used by object_changes (41% for updates).

The total amount is even higher, of course, for those that continue to use the now-optional object column.

I put "discussion" in the title b/c I'm not sure what Paper Trail the project should do with this information. Are there scenarios where it's important to preserve historical changes to updated_at?

The two options would be to change Paper Trail's default behavior to exclude updated_at, or to document this issue, making it easy to resolve.

It seems unlikely that we'd change the default behavior if only b/c of momentum. So: what should the documentation look like? I'd worry that the readme is getting a bit hard to follow w/ tips like this scattered about.

with data as (
  select event,
    count(*) / 1000000 * 1000000 as count,
    sum(pg_column_size(object_changes)) as changes_size,
    sum(
      pg_column_size(jsonb_build_object('updated_at', object_changes->'updated_at'))
        - pg_column_size('{}'::jsonb)
    ) as updated_at_size
  from versions group by event
)
select event, count,
  pg_size_pretty(changes_size) as changes_size,
  pg_size_pretty(updated_at_size) as updated_at_size,
  round(100.0 * updated_at_size / changes_size, 1) as percent
from data
union all
select '(all)' as event,
  sum(count) as count,
  pg_size_pretty(sum(changes_size)) as changes_size,
  pg_size_pretty(sum(updated_at_size)) as updated_at_size,
  round(100.0 * sum(updated_at_size) / sum(changes_size), 1) as percent
from data;

  event  |  count   | changes_size | updated_at_size | percent 
---------+----------+--------------+-----------------+---------
 create  | 23000000 | 11 GB        | 1339 MB         |    11.9
 destroy |  2000000 | 1740 MB      | 126 MB          |     7.2
 update  | 31000000 | 6633 MB      | 2726 MB         |    41.1
 (all)   | 56000000 | 19 GB        | 4191 MB         |    21.4
@jaredbeck
Copy link
Member

Are there scenarios where it's important to preserve historical changes to updated_at?

Knowing when a change was made is very important, but we can get that information from versions.created_at. If you want to explore omitting updated_at, I'd be interested to hear what you find. Two potential problems:

  1. Is versions.created_at exactly the same as the record's updated_at? I'd expect it's either exactly the same value, or only a dozen milliseconds after updated_at. Hmmm, though perhaps in the case of PT-AT it could be much later. I guess you'd have to do some reverse engineering to figure out how much precision we'd potentially be losing.
  2. When we reify a version record, would you set its updated_at from versions.created_at?

PS: 56M is a lot of version records!
PPS: nice use of CTE, and really nice use of multiple aggregate functions

@seanlinsley
Copy link
Member Author

When we reify a version record, would you set its updated_at from versions.created_at?

Well in this codebase we'll be dropping the object column, so we wouldn't be relying on reify. With the size of the versions table and the rarity of needing things like reify, it's much more important that we make use of disk space efficiently.

Which leads me back to documentation: I feel like there's got to be a better way of organizing these subjects in the docs.

@jaredbeck
Copy link
Member

When we reify a version record, would you set its updated_at from versions.created_at?

Well in this codebase we'll be dropping the object column, so we wouldn't be relying on reify.

Sure, but other people will still use reify and they will expect updated_at to be .. historically accurate, if you know what I mean.

@lorint
Copy link
Contributor

lorint commented Aug 27, 2018

We've been pondering over removing created_at for creates / updated_at for updates for a little while now. This would save substantial space for us, 2 orders of magnitude different than these numbers.

@jaredbeck
Copy link
Member

Have y'all considered writing an object_changes_adapter for this?

@seanlinsley
Copy link
Member Author

Looking at that currently. The first thing I noticed is the docs don't describe what happens if you don't define all three methods.

@seanlinsley
Copy link
Member Author

seanlinsley commented Aug 29, 2018

Looks like this ends up causing an error when calling changeset:

class SkipUpdatedAt
  def diff(changes)
    changes.except! 'updated_at'
  end
end

PaperTrail.config.object_changes_adapter = SkipUpdatedAt.new

>> u = User.last
>> u.update first_name: 'b'
>> u.versions.last.object_changes
 => {"first_name"=>["a", "b"]} 
>> u.versions.last.changeset
NoMethodError (undefined method `load_changeset' for #<SkipUpdatedAt:0x00007fbfc1b274a0>)

load_changeset ended up being easy enough to implement, but where_object_changes required reinstantiating the object that called it, to produce the original behavior.

class SkipUpdatedAt
  def diff(changes)
    changes.except! 'updated_at'
  end

  def load_changeset(version)
    version.changeset
  end

  def where_object_changes(*args)
    handler = PaperTrail::Queries::Versions::WhereObjectChanges.new(*args)
    handler.send :jsonb
  end
end

@seanlinsley
Copy link
Member Author

Would there be any issue with changing Paper Trail so it only calls the adapter if the given method exists on it?

Another option would be to signal to Paper Trail that we intend to keep the original behavior. Probably the best way to do that would be with throw/catch:

def where_object_changes(*)
  throw :default
end

That'd ensure that people writing adapters don't forget to implement a method accidentally.

@jaredbeck
Copy link
Member

Would there be any issue with changing Paper Trail so it only calls the adapter if the given method exists on it?

Fine by me.

throw :default

That's a neat idea, but let's wait to see if we really need it.

@seanlinsley
Copy link
Member Author

¯\(ツ)/¯ okay. in that case, would this pattern be acceptable?

def execute
  #                                          ⬇️ added code
  if PaperTrail.config.object_changes_adapter&.respond_to?(:where_object_changes)
    return PaperTrail.config.object_changes_adapter.where_object_changes(
      @version_model_class, @attributes
    )
  end

@jaredbeck
Copy link
Member

object_changes_adapter&.respond_to?

Looks good to me 👍

@jaredbeck
Copy link
Member

It sounds like object_changes_adapter is going to work for this (#1099 (comment)) so I'll close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@jaredbeck @seanlinsley @lorint and others