
Feather file with compression and larger than RAM #340

Open
Moelf opened this issue Oct 6, 2022 · 3 comments

Comments

@Moelf
Contributor

Moelf commented Oct 6, 2022

Last time I checked, mmap breaks down for files with compression. This is understandable because the compressed buffers clearly can't be re-interpreted without inflation.

But the larger a file is, the more likely it is to be compressed. Can we decompress only a single "row group" (and only the relevant columns, of course) on the fly yet? This is for the case where a user is doing per-row iteration.

If the user accesses the table via tbl[range, range], then clearly we might need to read more than one row group and chop off the head/tail depending on where the overlap is.
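The head/tail chopping described above amounts to mapping a global row range onto per-group local slices. A minimal sketch (names and layout are hypothetical, not Arrow.jl's actual API):

```python
def locate_row_groups(group_sizes, start, stop):
    """Map the half-open global row range [start, stop) onto row groups.

    group_sizes: number of rows in each row group, in file order.
    Returns (group_index, local_start, local_stop) triples, so only the
    overlapping groups need to be decompressed, with head/tail trimmed.
    """
    spans = []
    offset = 0
    for gi, size in enumerate(group_sizes):
        lo, hi = offset, offset + size
        if lo < stop and hi > start:  # this group overlaps the request
            spans.append((gi, max(start, lo) - lo, min(stop, hi) - lo))
        offset = hi
    return spans

# Rows 5..12 of four groups of 4 rows each touch groups 1 and 2:
print(locate_row_groups([4, 4, 4, 4], 5, 12))
# → [(1, 1, 4), (2, 0, 4)]
```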

@quinnj
Member

quinnj commented Oct 6, 2022

Hmmmm......we'll have to see what we can do here. I've had the idea for a while as a Tables.jl-wide feature to support projection/filter push down for sources in a generic way. That would translate really well to Arrow and would allow us to more easily avoid decompressing when not necessary. There's probably more we can do in the short-term though to avoid materializing when not needed.
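The projection push-down idea can be illustrated in miniature: if each column is compressed independently, a projection-aware reader only ever inflates the columns the consumer asked for. A hedged sketch (the dict-of-compressed-blobs layout is illustrative, not how Arrow stores data):

```python
import zlib

def project(compressed_columns, wanted):
    """Projection push-down in miniature: compressed_columns maps column
    name -> zlib-compressed bytes; only columns in `wanted` are inflated."""
    return {name: zlib.decompress(compressed_columns[name])
            for name in wanted}

# Build a toy "table" of two compressed columns, then project one.
table = {name: zlib.compress(payload)
         for name, payload in {"x": b"1,2,3", "y": b"4,5,6"}.items()}
print(project(table, ["x"]))  # column "y" is never decompressed
```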

@Moelf
Contributor Author

Moelf commented Oct 6, 2022

This is exactly what we do in UnROOT.jl for TTree, a format used only by the physics community; their next-gen storage, RNTuple, is basically Apache Feather: https://indico.cern.ch/event/1208767/contributions/5083082/attachments/2523220/4340111/PPP_uproot_RNTuple.pdf#page=13

While we will get there eventually, in UnROOT I already have the whole machinery, basically:

- getindex -> find the row group ->
  - if not in cache, decompress and put in cache
  - if in cache, directly locate the slot

This way, at most one row group's worth of data ever lives in RAM. In fact, that's the minimal amount you need in RAM, because the start/end row numbers are only known for an entire row group, and you have to count inside it to find a specific row.
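The getindex flow above can be sketched as a cache holding a single decompressed row group at a time. This is only a toy model of the UnROOT-style scheme described here; the class, blob layout, and comma-separated payloads are all hypothetical:

```python
import zlib

class RowGroupCache:
    """Per-row access with at most one decompressed row group resident.

    compressed_groups: list of zlib-compressed blobs, one per row group,
    each holding `group_size` comma-separated values (toy layout).
    """
    def __init__(self, compressed_groups, group_size):
        self.groups = compressed_groups
        self.group_size = group_size
        self.cached_index = None   # which group is currently inflated
        self.cached_rows = None

    def __getitem__(self, row):
        gi, local = divmod(row, self.group_size)
        if gi != self.cached_index:  # cache miss: inflate exactly one group
            self.cached_rows = zlib.decompress(self.groups[gi]).split(b",")
            self.cached_index = gi
        return self.cached_rows[local]

groups = [zlib.compress(b"0,1,2,3"), zlib.compress(b"4,5,6,7")]
cache = RowGroupCache(groups, 4)
print(cache[5])  # inflates only group 1 → b"5"
```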

But yeah, this is a whole subsystem in UnROOT.jl, and it's mission-critical because our data are routinely O(100) GB compressed.

@JoaoAparicio
Contributor

> Hmmmm......we'll have to see what we can do here. I've had the idea for a while as a Tables.jl-wide feature to support projection/filter push down for sources in a generic way. That would translate really well to Arrow and would allow us to more easily avoid decompressing when not necessary. There's probably more we can do in the short-term though to avoid materializing when not needed.

Ping for comments on #412.
This isn't the most general filter push down, but it does allow us to avoid unnecessary decompression.
