feat: nan_value_counts support #907

Open
wants to merge 1 commit into
base: main


feniljain
Contributor

Issue

Fixes #417

Description

  • We compute upper and lower bounds from Parquet statistics, but those statistics do not include a NaN value count, so the library has to compute nan_value_count itself as Arrow record batches are received.
  • The count is tracked at the ParquetWriter level because write can be called multiple times.
  • Added a new assert for nan_val_count in test_parquet_writer.
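The accumulation described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `Column`, `nan_count`, and `NanCounter` are hypothetical names, and `Column` is a simplified stand-in for an Arrow array in which `None` marks a null slot.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow column; `None` marks a null slot.
pub enum Column {
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
    Int64(Vec<Option<i64>>),
}

/// Count NaN values in one column.
/// Returns `None` for non-floating-point columns, which cannot hold NaN.
pub fn nan_count(col: &Column) -> Option<u64> {
    match col {
        // `flatten` skips nulls, so only present values are tested.
        Column::Float32(v) => Some(v.iter().flatten().filter(|x| x.is_nan()).count() as u64),
        Column::Float64(v) => Some(v.iter().flatten().filter(|x| x.is_nan()).count() as u64),
        Column::Int64(_) => None,
    }
}

/// Hypothetical writer-side state: running NaN counts keyed by field id.
#[derive(Default)]
pub struct NanCounter {
    counts: HashMap<i32, u64>,
}

impl NanCounter {
    /// Fold one batch of (field id, column) pairs into the running totals.
    /// Calling this once per `write` keeps the totals correct across batches.
    pub fn observe(&mut self, batch: &[(i32, Column)]) {
        for (field_id, col) in batch {
            if let Some(n) = nan_count(col) {
                *self.counts.entry(*field_id).or_insert(0) += n;
            }
        }
    }

    /// Totals accumulated so far, one entry per floating-point field seen.
    pub fn counts(&self) -> &HashMap<i32, u64> {
        &self.counts
    }
}
```

Keeping the map on the writer rather than recomputing per batch is what lets the final metrics reflect every call to write, which is the motivation given in the second bullet.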

@feniljain force-pushed the feat-nan-value-counts branch from 281daa1 to 7e76ade on January 22, 2025 20:17
Contributor

@liurenjie1024 left a comment


Thanks @feniljain for this PR! I left some comments to resolve.

let dt = col.data_type();

let nan_val_cnt: u64 = match dt {
DataType::Float32 => {
Contributor


I think we also need to handle Float64 here?

Contributor


Also, this is incorrect: it ignores nested primitive types.
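To illustrate the two review points (both float widths, and floats nested inside structs or lists), here is a rough std-only sketch. It is not the PR's code and does not use Arrow's API: `Array` and `count_nans` are hypothetical names, and the enum is a deliberately simplified model of a column's value tree.

```rust
/// Hypothetical, simplified model of a column's value tree (not Arrow's API).
/// `None` marks a null slot in a primitive array.
pub enum Array {
    Float32(Vec<Option<f32>>),
    Float64(Vec<Option<f64>>),
    Int32(Vec<Option<i32>>),
    /// Struct: one child array per field.
    Struct(Vec<Array>),
    /// List: all element values flattened into a single child array.
    List(Box<Array>),
}

/// Count NaN values in `arr`, descending into nested children.
pub fn count_nans(arr: &Array) -> u64 {
    match arr {
        // Both float widths are handled; nulls are skipped by `flatten`.
        Array::Float32(v) => v.iter().flatten().filter(|x| x.is_nan()).count() as u64,
        Array::Float64(v) => v.iter().flatten().filter(|x| x.is_nan()).count() as u64,
        // Non-float primitives cannot hold NaN.
        Array::Int32(_) => 0,
        // Recurse so floats inside nested types are not missed.
        Array::Struct(children) => children.iter().map(count_nans).sum(),
        Array::List(values) => count_nans(values),
    }
}
```

This sketch returns a single total for brevity. A real writer would presumably keep a separate count per leaf field rather than summing across them, since the metric is reported per field.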

Development

Successfully merging this pull request may close these issues.

Implement nan_value_counts && distinct_counts metrics in parquet writer