Columnar json writer for arrow-json #6411

Closed
adriangb opened this issue Sep 17, 2024 · 5 comments
Labels
arrow Changes to the arrow crate

Comments

@adriangb
Contributor

adriangb commented Sep 17, 2024

To get an output like:

{
  "a": [1, 2, 3],
  "b": ["foo", "bar", null]
}

The idea is that I can attach a schema to this and it will be much more compact (and possibly faster to serialize / deserialize?).

Does this sound like a good idea that would be accepted into the package?
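For illustration only, here is a minimal Python sketch of the two layouts being compared: one JSON object per record versus a single object with one array per column. It uses only the standard `json` module and does not reflect the arrow-json API; the sample `rows` and the helper names `to_row_json` and `to_columnar_json` are assumptions made up for this example.

```python
import json

# Hypothetical sample data matching the example output above.
rows = [
    {"a": 1, "b": "foo"},
    {"a": 2, "b": "bar"},
    {"a": 3, "b": None},
]

def to_row_json(rows):
    """Row-oriented, line-delimited JSON: one object per record."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in rows)

def to_columnar_json(rows):
    """Columnar JSON: a single object holding one array per column."""
    columns = {key: [r.get(key) for r in rows] for key in rows[0]}
    return json.dumps(columns, separators=(",", ":"))

row_json = to_row_json(rows)
col_json = to_columnar_json(rows)

print(col_json)
print("row bytes:", len(row_json), "columnar bytes:", len(col_json))
```

The columnar form is shorter here because the field names appear once rather than once per record.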

@adriangb adriangb added the enhancement Any new improvement worthy of a entry in the changelog label Sep 17, 2024
@adriangb
Contributor Author

cc @tustvold

@tustvold
Contributor

tustvold commented Sep 18, 2024

How would this encode nested types like ListArray, StructArray or MapArray?

This would also not lend itself to streaming reads, which are normally important for bounding memory usage.
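A rough sketch of the streaming concern, in plain Python rather than arrow-json: line-delimited records can be consumed a bounded batch at a time, whereas in a columnar document the first row is only complete once every earlier column array has been read in full. The helper `read_batches` and the sample inputs are hypothetical.

```python
import io
import json

def read_batches(lines, batch_size):
    """Yield lists of at most batch_size records from line-delimited JSON."""
    batch = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch

# Row-oriented input: memory use is bounded by batch_size records.
ndjson = io.StringIO('{"a":1}\n{"a":2}\n{"a":3}\n')
batches = list(read_batches(ndjson, batch_size=2))

# Columnar input: the whole of column "a" precedes the first value of "b",
# so row 0 cannot be assembled until (almost) the entire document is parsed.
columnar = json.loads('{"a":[1,2,3],"b":["foo","bar",null]}')
first_row = {name: values[0] for name, values in columnar.items()}
```

An incremental parser could avoid building the full document tree, but it would still have to buffer every column except the last before emitting any row.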

@adriangb
Contributor Author

I'm not claiming to have thought it all the way through, but something like {"list": [["a","b"],[],["c"]]}?

@tustvold
Contributor

That would only work for a list of primitives. A list of structs would need to encode the structs as a list of records to preserve the multiple levels of nullability, at which point you're effectively back to the current format, just exploded by one level.
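To make the nullability point concrete, here is a small Python sketch, not tied to any arrow API. It assumes a naive columnar encoding that pushes the list level down into each struct field; the helper `explode` and the field name `x` are hypothetical. Under that assumption, a null struct and a valid struct whose field is null produce the same output.

```python
def explode(list_of_structs, field):
    """Naively project one struct field out of a list<struct> value."""
    if list_of_structs is None:
        return None  # the list itself is null
    return [None if s is None else s[field] for s in list_of_structs]

# Second element: the struct itself is null.
null_struct = [{"x": 1}, None]
# Second element: the struct is valid, but its field x is null.
null_field = [{"x": 1}, {"x": None}]

exploded_a = explode(null_struct, "x")
exploded_b = explode(null_field, "x")

# Both inputs collapse to the same value, so the distinction is lost.
print(exploded_a, exploded_b, exploded_a == exploded_b)
```

Keeping the two cases apart requires encoding each struct as its own record (or adding a separate validity array), which is the "back to the current format" outcome described above.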

I think given:

  • There would be little ability to share code between the two formats
  • It would not be compatible with other arrow implementations
  • It is unlikely to perform significantly differently: the major overhead of JSON is tokenising and integer/float parsing, which would be unchanged
  • There are open questions about supporting nested types
  • It can't be decoded a batch at a time

It is hard for me to recommend including it in this repository.

Perhaps we could take a step back and ascertain what the desired outcome is? If it is just to reduce the size, running the current JSON format through lz4 will likely yield far greater returns for very little additional overhead compared to the cost of JSON parsing.
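A quick way to try the compression suggestion is sketched below. Python's standard library has no lz4 binding, so this uses `zlib` as a stand-in; that swap is an assumption of the example, and lz4 will give different ratios and speeds. The synthetic `rows` are also made up for illustration.

```python
import json
import zlib

# Synthetic data: 1000 records with an integer and a low-cardinality string.
n = 1000
rows = [{"a": i, "b": f"value-{i % 10}"} for i in range(n)]

# Row-oriented, line-delimited JSON.
row_json = "\n".join(
    json.dumps(r, separators=(",", ":")) for r in rows
).encode()

# Columnar JSON: one array per column.
col_json = json.dumps(
    {"a": [r["a"] for r in rows], "b": [r["b"] for r in rows]},
    separators=(",", ":"),
).encode()

sizes = {
    "row": len(row_json),
    "columnar": len(col_json),
    "row+zlib": len(zlib.compress(row_json)),  # stand-in for lz4
    "columnar+zlib": len(zlib.compress(col_json)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

On this synthetic input the compressed row-oriented bytes come out smaller than the uncompressed columnar bytes, since the repeated field names compress well. Real data will behave differently, so measure on a representative sample.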

@adriangb
Contributor Author

Closing this as wontfix

@adriangb adriangb closed this as not planned Sep 26, 2024
@alamb alamb removed the enhancement Any new improvement worthy of a entry in the changelog label Oct 2, 2024
@alamb alamb added the arrow Changes to the arrow crate label Oct 15, 2024

3 participants