[BUG] Not all _delta_log file operations go through the LogStore interface #4175
Bug
Which Delta project/connector is this regarding?
Describe the problem
While trying to implement a custom `LogStore`, I realized that not all operations accessing files in the `_delta_log` directory go through the store.

Specifically, reads of the delta log entries (`<version>.json`) don't use the `LogStore` at all; they are read directly using the Hadoop FileSystem (for example, in the `Snapshot` code). I assume this is because we want to read the delta log entries in parallel.

Question: is this the intended behavior of `LogStore`, or is it a bug? I need to extend how we access the delta log a bit, and I am wondering whether I should just do this at the FileSystem layer.

Steps to reproduce
I reproduced this by doing the following:

1. Create a custom `LogStore` implementation.
2. Set `spark.delta.logStore.class` to the classname of my custom `LogStore`.
3. Add logging to the `LogStore` to find out when each file is accessed.

Observed results
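The instrumentation described in the steps above can be illustrated with a toy model. The sketch below is Python rather than Delta's Java/Scala, and every name in it (`InMemoryLogStore`, `RecordingLogStore`, `unseen_paths`) is hypothetical; it does not use Delta's actual `LogStore` interface, whose signatures differ. It only shows the idea: wrap the store, record each path read through it, and compare against the files known to exist, so that any file read through another route shows up as unseen.

```python
from typing import Dict, List


class InMemoryLogStore:
    """Hypothetical stand-in for a storage backend: maps path -> lines."""

    def __init__(self, files: Dict[str, List[str]]) -> None:
        self._files = files

    def read(self, path: str) -> List[str]:
        return list(self._files[path])


class RecordingLogStore:
    """Hypothetical wrapper that records every path read through the store."""

    def __init__(self, inner: InMemoryLogStore) -> None:
        self._inner = inner
        self.reads: List[str] = []

    def read(self, path: str) -> List[str]:
        self.reads.append(path)  # the "logging" from step 3
        return self._inner.read(path)


def unseen_paths(all_paths: List[str], store: RecordingLogStore) -> List[str]:
    """Return the paths that were never read through the recording store."""
    seen = set(store.reads)
    return [p for p in all_paths if p not in seen]


# Toy _delta_log contents; file names and contents are placeholders.
files = {
    "_delta_log/00000000000000000000.json": ["<commit placeholder>"],
    "_delta_log/00000000000000000000.crc": ["<checksum placeholder>"],
}
store = RecordingLogStore(InMemoryLogStore(files))

# Simulate a reader that goes through the store for the .crc file but
# bypasses it for the .json commit, mirroring the behavior reported here.
store.read("_delta_log/00000000000000000000.crc")
bypassed = files["_delta_log/00000000000000000000.json"]  # direct access

print(unseen_paths(sorted(files), store))
```

In this model the `.json` path is the only one reported as unseen, which is the same signal the logging in the custom `LogStore` gave for the real table.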
Some files are read using the `LogStore`, like the checkpoint `<version>.crc` files, but not the `<version>.json` files.

Expected results
Expected all files in `_delta_log` to be read from the `LogStore`.

Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?