Implement nan_value_counts && distinct_counts metrics in parquet writer #417

Open
ZENOTME opened this issue Jun 23, 2024 · 8 comments · May be fixed by #907

@ZENOTME
Contributor

ZENOTME commented Jun 23, 2024

The parquet writer still does not populate the following fields in DataFile:

  • nan_value_counts
  • distinct_counts
@liurenjie1024 liurenjie1024 changed the title miss nan_value_counts && distinct_counts in parquet writer Missing nan_value_counts && distinct_counts in parquet writer Jun 26, 2024
@liurenjie1024 liurenjie1024 changed the title Missing nan_value_counts && distinct_counts in parquet writer Implement nan_value_counts && distinct_counts metrics in parquet writer Jun 26, 2024
@vaibhawvipul
Contributor

I can take this up @liurenjie1024

@liurenjie1024
Contributor

I can take this up @liurenjie1024

Thanks!

@Xuanwo
Member

Xuanwo commented Jul 9, 2024

I can take this up @liurenjie1024

Welcome!

@Fokko
Contributor

Fokko commented Nov 27, 2024

Just checking in @vaibhawvipul if you're still interested in adding this :)

@Fokko Fokko mentioned this issue Nov 27, 2024
28 tasks
@feniljain
Contributor

feniljain commented Dec 1, 2024

Hey @Fokko! 👋🏻

As the original author has not replied, I am interested in taking it up :)

A few points regardless of who this gets assigned to:

  • I couldn't find distinct_counts in the Java or Python documentation. Am I missing it somewhere? If it is documented, could someone please point me to it? Also, from what I understand, distinct counts are available at the ColumnChunk level, but they cannot be aggregated at the DataFile level because the same values can appear in two different ColumnChunks. Am I understanding this correctly?
  • For NaN value counts, the Javadoc says:
"Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers. This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing exceptions when they are accessed."

So we will have to track NaN counts ourselves. I think we would walk each field of each column in the supplied RecordBatch, find the float values, and count the NaNs among them. Is this understanding correct?
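
To make the proposed approach concrete, here is a minimal std-only sketch of a per-field NaN counter. It is not the project's implementation: `FloatColumn` and `NanCounter` are hypothetical names, and plain vectors stand in for the Arrow arrays a real writer would read from a RecordBatch. `None` models a null slot, which is never counted as NaN.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one float column of a record batch.
/// `None` models a null slot; a real writer would read Arrow arrays instead.
pub enum FloatColumn {
    F32(Vec<Option<f32>>),
    F64(Vec<Option<f64>>),
}

/// Accumulates NaN counts per field id across all batches written to one file.
#[derive(Default)]
pub struct NanCounter {
    counts: HashMap<i32, u64>,
}

impl NanCounter {
    /// Adds the NaNs found in one column of one batch to that field's running total.
    pub fn update(&mut self, field_id: i32, column: &FloatColumn) {
        let nans = match column {
            // `flatten` skips nulls, so only present values are tested.
            FloatColumn::F32(values) => values.iter().flatten().filter(|v| v.is_nan()).count(),
            FloatColumn::F64(values) => values.iter().flatten().filter(|v| v.is_nan()).count(),
        };
        *self.counts.entry(field_id).or_insert(0) += nans as u64;
    }

    /// Returns the map from field id to NaN count once the file is closed.
    pub fn finish(self) -> HashMap<i32, u64> {
        self.counts
    }
}
```

The writer would call `update` once per float column for every batch and call `finish` when closing the file. Non-float columns never get an entry, while a float column with no NaNs gets an entry of 0.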

@vaibhawvipul vaibhawvipul removed their assignment Dec 1, 2024
@liurenjie1024
Contributor

Hi @feniljain, I also couldn't find how distinct counts are implemented in Java, but according to the spec it's supposed to be an estimated value computed using a sketch. I think we could start with NaN value counts and leave distinct counts for later.

Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts
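
A small std-only sketch of why the spec rules out merging existing distinct counts: summing per-chunk counts overcounts whenever two chunks share a value, so the count has to be derived from the values themselves. The function names are hypothetical, `i64` values and `Vec` chunks are assumptions made for illustration, and an exact `HashSet` is used here where a real implementation might prefer a sketch to bound memory.

```rust
use std::collections::HashSet;

/// Exact distinct count for a file, derived by counting over every value.
pub fn distinct_count(chunks: &[Vec<i64>]) -> usize {
    let mut seen = HashSet::new();
    for chunk in chunks {
        seen.extend(chunk.iter().copied());
    }
    seen.len()
}

/// Naive merge: sums each chunk's own distinct count.
/// Overcounts whenever the same value appears in more than one chunk.
pub fn summed_chunk_distinct_counts(chunks: &[Vec<i64>]) -> usize {
    chunks
        .iter()
        .map(|chunk| chunk.iter().collect::<HashSet<_>>().len())
        .sum()
}
```

For the chunks `[1, 2, 3]` and `[3, 4]`, counting over the values gives 4, while summing the per-chunk counts gives 5 because 3 is counted twice.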

@feniljain
Contributor

but according to the spec it's supposed to be an estimated value using sketch

That sounds interesting, thanks for the link to the spec!

I think we could start with nan values and ignore distinct counts first.

Yup, let me work on a PR for nan_value_counts first. Also, just to confirm: is the approach I described above correct for counting NaN values?

@liurenjie1024
Contributor

Yup, let me work on a PR for nan_value_counts first. Also, just to confirm: is the approach I described above correct for counting NaN values?

Yes, exactly.

@feniljain feniljain linked a pull request Jan 22, 2025 that will close this issue