GH-45457: [Python] Add pyarrow.ArrayStatistics
#45550
base: main
@github-actions crossbow submit -g python
@pitrou @jorisvandenbossche Could you take a look at this?
I'll merge this in a few days if nobody objects.
Thanks a lot @kou! Some minor comments below, but LGTM in general.
if null_count.has_value():
    return null_count.value()
else:
    return None
For the record, I've opened a Cython feature request to make this more automatic.
Thanks! I've added a comment that refers to the issue.
std::optional<arrow::ArrayStatistics::ValueType>> data.

arrow::ArrayStatistics::ValueType is
std::variant<bool, int64_t, uint64_t, double, std::string>.
`uint64_t` isn't handled below. Should the docstring or the code be fixed?
Oh... The code was wrong... I've added the `uint64_t` case.
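A minimal Python sketch of why the `uint64_t` alternative needs its own branch: a statistic from an unsigned 64-bit column (for example a maximum) can exceed the signed 64-bit range, so it cannot be routed through the `int64_t` case. The helper names `fits_int64` and `fits_uint64` are hypothetical and are not part of this PR; they only check ranges with the standard `struct` module.

```python
import struct

INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_int64(value: int) -> bool:
    """Return True if value can be packed as a signed 64-bit integer."""
    try:
        struct.pack("<q", value)
        return True
    except struct.error:
        return False


def fits_uint64(value: int) -> bool:
    """Return True if value can be packed as an unsigned 64-bit integer."""
    try:
        struct.pack("<Q", value)
        return True
    except struct.error:
        return False


# A value just above the signed range only fits the unsigned alternative.
big = INT64_MAX + 1
print(fits_int64(big), fits_uint64(big))
```

Python integers are unbounded, so once the correct variant alternative is selected on the C++ side, the value can be returned as a plain `int` in either case.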
python/pyarrow/array.pxi
Outdated
    raise TypeError("Do not call {}'s constructor directly"
                    .format(self.__class__.__name__))

cdef void init(self, const shared_ptr[CArrayStatistics]& sp_statistics) except *:
`except *` means it could raise Python exceptions, but it doesn't here, so perhaps you can remove that annotation (though it's not really a problem either).
Thanks! I didn't know much about `except` in Cython...
These are the bindings for `arrow::ArrayStatistics`. You can get an instance via `pyarrow.Array.statistics()`.
Co-authored-by: Antoine Pitrou <[email protected]>
54a3e44 to bf7bdda
@github-actions crossbow submit -g python
@github-actions crossbow submit -g python
assert statistics.min == -1
assert statistics.is_min_exact
assert statistics.max == 3
assert statistics.is_max_exact
Can we have a test for `repr(statistics)` to make sure that the string representation works?
It's a good idea. I've added it.
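A rough sketch of the kind of `repr` smoke test discussed here. The actual test added to the PR and the real `repr` format of `pyarrow.ArrayStatistics` are not shown in this conversation, so the example uses a hypothetical stand-in class, `FakeStatistics`, whose field names are taken from the assertions quoted above. It assumes only that the representation is a non-empty string mentioning the `min` and `max` fields.

```python
from dataclasses import dataclass
from typing import Optional, Union

Value = Union[bool, int, float, str, None]


@dataclass(frozen=True)
class FakeStatistics:
    """Hypothetical stand-in for illustration; NOT pyarrow.ArrayStatistics."""
    null_count: Optional[int] = None
    min: Value = None
    is_min_exact: bool = False
    max: Value = None
    is_max_exact: bool = False


def check_repr(statistics) -> str:
    """Smoke-test that repr() works and mentions the key fields."""
    text = repr(statistics)
    assert isinstance(text, str) and text
    # Assumption: the representation names the min and max fields.
    for name in ("min", "max"):
        assert name in text
    return text


stats = FakeStatistics(null_count=1, min=-1, is_min_exact=True,
                       max=3, is_max_exact=True)
print(check_repr(stats))
```

A test of this shape mainly guards against `repr` raising an exception, for example when a value type is not handled.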
@github-actions crossbow submit -g python
Revision: e3a20b5 Submitted crossbow builds: ursacomputing/crossbow @ actions-747dbaddf2
Rationale for this change
Apache Arrow C++ can attach statistics read from Apache Parquet data to `arrow::Array`. If we have bindings for this feature in Python, Python users can also use the attached statistics.
What changes are included in this PR?
`pyarrow.ArrayStatistics`
`pyarrow.Array.statistics()`
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
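For the user-facing change, a hedged usage sketch: write a small table to an in-memory Parquet file, read it back, and look at the statistics attached to the first chunk. The helper `read_back_statistics` is hypothetical. The sketch assumes `pyarrow` may be missing or may predate this PR, and it does not assume whether `statistics` ends up exposed as a method or as a property, so it handles both. Whether statistics are actually attached depends on the pyarrow version and reader settings.

```python
def read_back_statistics(values):
    """Round-trip values through Parquet and return the chunk statistics.

    Returns None when pyarrow is unavailable, when this pyarrow version has
    no statistics bindings, or when no statistics were attached.
    """
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        return None  # pyarrow is not installed

    # Write to an in-memory buffer instead of a file on disk.
    sink = pa.BufferOutputStream()
    pq.write_table(pa.table({"x": values}), sink)
    table = pq.read_table(pa.BufferReader(sink.getvalue()))

    chunk = table["x"].chunk(0)
    stats = getattr(chunk, "statistics", None)  # absent in older versions
    if callable(stats):  # assumption: may be a method or a property
        stats = stats()
    return stats


stats = read_back_statistics([1, None, -1, 3])
if stats is not None:
    # Attribute names as used in the assertions quoted in the review.
    print(stats.null_count, stats.min, stats.is_min_exact,
          stats.max, stats.is_max_exact)
```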
`arrow::ArrayStatistics` bindings #45457