Discussion: timestamps waste a lot of space #1129

seanlinsley · 2018-08-12T22:15:16Z

After #1046 / #1099, I thought it'd be interesting to see how much space updated_at ends up using. Turns out it represents 21% of the space used by object_changes (41% for updates).

The total amount is even higher, of course, for those that continue to use the now-optional object column.

I put "discussion" in the title b/c I'm not sure what Paper Trail the project should do with this information. Are there scenarios where it's important to preserve historical changes to updated_at?

The two options would be to change Paper Trail's default behavior to exclude updated_at, or to document this issue, making it easy to resolve.

It seems unlikely that we'd change the default behavior if only b/c of momentum. So: what should the documentation look like? I'd worry that the readme is getting a bit hard to follow w/ tips like this scattered about.

with data as (
  select event,
    count(*) / 1000000 * 1000000 as count,
    sum(pg_column_size(object_changes)) as changes_size,
    sum(
      pg_column_size(jsonb_build_object('updated_at', object_changes->'updated_at'))
        - pg_column_size('{}'::jsonb)
    ) as updated_at_size
  from versions group by event
)
select event, count,
  pg_size_pretty(changes_size) as changes_size,
  pg_size_pretty(updated_at_size) as updated_at_size,
  round(100.0 * updated_at_size / changes_size, 1) as percent
from data
union all
select '(all)' as event,
  sum(count) as count,
  pg_size_pretty(sum(changes_size)) as changes_size,
  pg_size_pretty(sum(updated_at_size)) as updated_at_size,
  round(100.0 * sum(updated_at_size) / sum(changes_size), 1) as percent
from data;

  event  |  count   | changes_size | updated_at_size | percent 
---------+----------+--------------+-----------------+---------
 create  | 23000000 | 11 GB        | 1339 MB         |    11.9
 destroy |  2000000 | 1740 MB      | 126 MB          |     7.2
 update  | 31000000 | 6633 MB      | 2726 MB         |    41.1
 (all)   | 56000000 | 19 GB        | 4191 MB         |    21.4

The text was updated successfully, but these errors were encountered:

jaredbeck · 2018-08-13T05:53:21Z

Are there scenarios where it's important to preserve historical changes to updated_at?

Knowing when a change was made is very important, but we can get that information from versions.created_at. If you want to explore omitting updated_at, I'd be interested to hear what you find. Two potential problems:

Is versions.created_at exactly the same as the record's updated_at? I'd expect it's either exactly the same value, or only a dozen milliseconds after updated_at. Hmmm, though perhaps in the case of PT-AT it could be much later. I guess you'd have to do some reverse engineering to figure out how much precision we'd potentially be losing.
When we reify a version record, would you set its updated_at from versions.created_at?

PS: 56M is a lot of version records!
PPS: nice use of CTE, and really nice use of multiple aggregate functions

seanlinsley · 2018-08-13T14:23:11Z

When we reify a version record, would you set its updated_at from versions.created_at?

Well in this codebase we'll be dropping the object column, so we wouldn't be relying on reify. With the size of the versions table and the rarity of needing things like reify, it's much more important that we make use of disk space efficiently.

Which leads me back to documentation: I feel like there's got to be a better way of organizing these subjects in the docs.

jaredbeck · 2018-08-14T01:16:04Z

When we reify a version record, would you set its updated_at from versions.created_at?

Well in this codebase we'll be dropping the object column, so we wouldn't be relying on reify.

Sure, but other people will still use reify and they will expect updated_at to be .. historically accurate, if you know what I mean.

lorint · 2018-08-27T02:35:43Z

We've been pondering over removing created_at for creates / updated_at for updates for a little while now. This would save substantial space for us, 2 orders of magnitude different than these numbers.

jaredbeck · 2018-08-29T03:59:48Z

Have y'all considered writing an object_changes_adapter for this?

seanlinsley · 2018-08-29T14:35:36Z

Looking at that currently. The first thing I noticed is the docs don't describe what happens if you don't define all three methods.

seanlinsley · 2018-08-29T14:53:55Z

Looks like this ends up causing an error when calling changeset:

class SkipUpdatedAt
  def diff(changes)
    changes.except! 'updated_at'
  end
end

PaperTrail.config.object_changes_adapter = SkipUpdatedAt.new

>> u = User.last
>> u.update first_name: 'b'
>> u.versions.last.object_changes
 => {"first_name"=>["a", "b"]} 
>> u.versions.last.changeset
NoMethodError (undefined method `load_changeset' for #<SkipUpdatedAt:0x00007fbfc1b274a0>)

load_changeset ended up being easy enough to implement, but where_object_changes required reinstantiating the object that called it, to produce the original behavior.

class SkipUpdatedAt
  def diff(changes)
    changes.except! 'updated_at'
  end

  def load_changeset(version)
    version.changeset
  end

  def where_object_changes(*args)
    handler = PaperTrail::Queries::Versions::WhereObjectChanges.new(*args)
    handler.send :jsonb
  end
end

seanlinsley · 2018-08-29T14:59:56Z

Would there be any issue with changing Paper Trail so it only calls the adapter if the given method exists on it?

Another option would be to signal to Paper Trail that we intend to keep the original behavior. Probably the best way to do that would be with throw/catch:

def where_object_changes(*)
  throw :default
end

That'd ensure that people writing adapters don't forget to implement a method accidentally.

jaredbeck · 2018-08-29T16:25:36Z

Would there be any issue with changing Paper Trail so it only calls the adapter if the given method exists on it?

Fine by me.

throw :default

That's a neat idea, but let's wait to see if we really need it.

seanlinsley · 2018-08-29T16:36:32Z

¯\(ツ)/¯ okay. in that case, would this pattern be acceptable?

def execute
  #                                          ⬇️ added code
  if PaperTrail.config.object_changes_adapter&.respond_to?(:where_object_changes)
    return PaperTrail.config.object_changes_adapter.where_object_changes(
      @version_model_class, @attributes
    )
  end

jaredbeck · 2018-08-29T16:45:29Z

object_changes_adapter&.respond_to?

Looks good to me 👍

jaredbeck · 2018-09-04T19:26:14Z

It sounds like object_changes_adapter is going to work for this (#1099 (comment)) so I'll close it.

jaredbeck added the Not a bug label Aug 17, 2018

seanlinsley mentioned this issue Aug 29, 2018

Allow object_changes_adapter to use the default behavior #1148

Merged

jaredbeck closed this as completed Sep 4, 2018

tlynam mentioned this issue Jun 1, 2020

Feature: S3 backend for object and object_changes to reduce database storage #1246

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: timestamps waste a lot of space #1129

Discussion: timestamps waste a lot of space #1129

seanlinsley commented Aug 12, 2018

jaredbeck commented Aug 13, 2018

seanlinsley commented Aug 13, 2018

jaredbeck commented Aug 14, 2018

lorint commented Aug 27, 2018

jaredbeck commented Aug 29, 2018

seanlinsley commented Aug 29, 2018

seanlinsley commented Aug 29, 2018 •

edited

Loading

seanlinsley commented Aug 29, 2018

jaredbeck commented Aug 29, 2018

seanlinsley commented Aug 29, 2018

jaredbeck commented Aug 29, 2018

jaredbeck commented Sep 4, 2018

Discussion: timestamps waste a lot of space #1129

Discussion: timestamps waste a lot of space #1129

Comments

seanlinsley commented Aug 12, 2018

jaredbeck commented Aug 13, 2018

seanlinsley commented Aug 13, 2018

jaredbeck commented Aug 14, 2018

lorint commented Aug 27, 2018

jaredbeck commented Aug 29, 2018

seanlinsley commented Aug 29, 2018

seanlinsley commented Aug 29, 2018 • edited Loading

seanlinsley commented Aug 29, 2018

jaredbeck commented Aug 29, 2018

seanlinsley commented Aug 29, 2018

jaredbeck commented Aug 29, 2018

jaredbeck commented Sep 4, 2018

seanlinsley commented Aug 29, 2018 •

edited

Loading