GH-45457: [Python] Add pyarrow.ArrayStatistics #45550

Open
wants to merge 10 commits into
base: main
from

kou
Member

kou commented Feb 17, 2025

Rationale for this change

Apache Arrow C++ can attach statistics read from Apache Parquet data to arrow::Array. With Python bindings for this feature, Python users can use the attached statistics too.

What changes are included in this PR?

  • Add pyarrow.ArrayStatistics
  • Add pyarrow.Array.statistics()
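
A minimal sketch of how the new bindings might be used, assuming a pyarrow build that includes this change. The attribute names (`null_count`, `min`, `is_min_exact`, `max`, `is_max_exact`) are taken from the diff hunks quoted in this conversation; the helper name `parquet_roundtrip_statistics` is hypothetical. Because the description writes `statistics()` as a call but the final shape of the accessor is not confirmed here, the sketch accepts either a method or a plain attribute, and it returns None when statistics are unavailable.

```python
import os
import tempfile


def parquet_roundtrip_statistics(values):
    """Write values to Parquet, read them back, and return attached statistics.

    Hypothetical helper. Returns None when pyarrow is not installed, when the
    installed pyarrow does not expose statistics, or when none were attached.
    """
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        return None

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.parquet")
        pq.write_table(pa.table({"value": pa.array(values)}), path)
        # Statistics are attached to arrays read from Parquet, so read it back.
        chunk = pq.read_table(path)["value"].chunk(0)

    # Assumption: the accessor may be a method or an attribute.
    stats = getattr(chunk, "statistics", None)
    if callable(stats):
        stats = stats()
    if stats is None:
        return None

    names = ("null_count", "min", "is_min_exact", "max", "is_max_exact")
    return {name: getattr(stats, name) for name in names}


if __name__ == "__main__":
    print(parquet_roundtrip_statistics([1, None, -1, 3]))
```

The input `[1, None, -1, 3]` is chosen to match the values asserted in the test hunk later in this conversation (`min == -1`, `max == 3`).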

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou
Member Author

kou commented Feb 17, 2025

@github-actions crossbow submit -g python


⚠️ GitHub issue #45457 has been automatically assigned in GitHub to PR creator.

kou requested review from pitrou and jorisvandenbossche and removed request for pitrou February 18, 2025 00:45
@kou
Member Author

kou commented Feb 20, 2025

@pitrou @jorisvandenbossche Could you take a look at this?

@kou
Member Author

kou commented Feb 24, 2025

I'll merge this in a few days if nobody objects.

Member

pitrou left a comment

Thanks a lot @kou ! Some minor comments below, but LGTM in general.

    if null_count.has_value():
        return null_count.value()
    else:
        return None
Member

For the record, I've opened a Cython feature request to make this more automatic.

Member Author

Thanks! I've added a comment that refers to the issue.

std::optional<arrow::ArrayStatistics::ValueType> data.

arrow::ArrayStatistics::ValueType is
std::variant<bool, int64_t, uint64_t, double, std::string>.
Member

uint64_t isn't handled below. Should the docstring or the code be fixed?

Member Author

Oh... The code was wrong... I've added the uint64_t case.

        raise TypeError("Do not call {}'s constructor directly"
                        .format(self.__class__.__name__))

    cdef void init(self, const shared_ptr[CArrayStatistics]& sp_statistics) except *:
Member

except * means it could raise Python exceptions, but it doesn't here, so perhaps you can remove that annotation (though it's not really a problem either).

Member Author

Thanks! I didn't know much about except in Cython...

kou and others added 4 commits February 25, 2025 14:15
It's the bindings of `arrow::ArrayStatistics`. You can get it by
`pyarrow.Array.statistics()`.
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
@kou
Member Author

kou commented Feb 25, 2025

@github-actions crossbow submit -g python

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label Feb 25, 2025
@kou
Member Author

kou commented Feb 25, 2025

@github-actions crossbow submit -g python

github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Feb 25, 2025

assert statistics.min == -1
assert statistics.is_min_exact
assert statistics.max == 3
assert statistics.is_max_exact
Member

Can we have a test for repr(statistics) to make sure that the string representation works?

Member Author

It's a good idea. I've added it.
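
As a rough illustration of what a repr test can look like, here is a sketch that uses a hypothetical stand-in class rather than pyarrow itself. `FakeArrayStatistics` and `check_repr` are assumptions made for this example; the real `pyarrow.ArrayStatistics` repr format is not shown in this conversation and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union

Value = Union[bool, int, float, str]


@dataclass(frozen=True)
class FakeArrayStatistics:
    """Hypothetical stand-in with the attribute names used in the test hunk."""

    null_count: Optional[int] = None
    min: Optional[Value] = None
    is_min_exact: bool = False
    max: Optional[Value] = None
    is_max_exact: bool = False


def check_repr(statistics):
    """Assert that repr() works and mentions the min and max values."""
    text = repr(statistics)
    assert f"min={statistics.min!r}" in text
    assert f"max={statistics.max!r}" in text
    return text


if __name__ == "__main__":
    stats = FakeArrayStatistics(
        null_count=1, min=-1, is_min_exact=True, max=3, is_max_exact=True
    )
    print(check_repr(stats))
```

Checking for substrings rather than the full string keeps such a test from breaking when unrelated fields are added to the representation.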

@kou
Member Author

kou commented Feb 25, 2025

@github-actions crossbow submit -g python

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label Feb 25, 2025

Revision: e3a20b5

Submitted crossbow builds: ursacomputing/crossbow @ actions-747dbaddf2

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions
