
[Bug] Null Character Suffix in Iceberg Manifest Files Due to toByteBuffer Invocation #5007

Closed
2 tasks done
0dunay0 opened this issue Feb 3, 2025 · 0 comments
Labels
bug Something isn't working


Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

1.1-SNAPSHOT

Compute Engine

  • Using Flink to write data to Paimon with Iceberg compatibility.
  • AWS Athena, Spark, Flink SQL to query the Iceberg table.

Minimal reproduce step

Create a Paimon table with Iceberg compatibility enabled, partitioned by a string field. Querying the Iceberg table with a predicate on the partition field matches no data, while the same predicate against the Paimon table itself works as expected.

  • Create a Paimon table with Iceberg compatibility enabled with string partitioning, such as event_date.
  • Write data into the table, ensuring values like 2024-12-30 are stored in the partition.
  • Query the Iceberg table using equality predicates, such as: SELECT * FROM iceberg_table WHERE event_date = '2024-12-30';
  • Observe that no results are returned due to the null character suffix in the manifest files.
  • Rerun the query applying the TRIM function or wildcard filtering, e.g.:
    • `SELECT * FROM iceberg_table WHERE TRIM(event_date) = '2024-12-30';`
    • `SELECT * FROM iceberg_table WHERE event_date LIKE '2024-12-30%';`

What doesn't meet your expectations?

The expectation is that the Iceberg table should accurately reflect the partitions defined in the underlying Paimon table, without the values being altered during serialization. The null character suffix in the manifest files prevents successful querying by various client applications (Spark, Flink, Athena).

Anything else?

Concretely, when we inspect the Avro manifest files, we see that the column stats and partition summary values carry a \u0000 suffix, e.g.

Snapshot metadata file:

{
    "manifest_path": "s3a://some-bucket/some-prefix/warehouse/some_db/some_table/metadata/dc8e3d96-4144-4853-8ad6-1959ccac318e-m1.avro",
    "manifest_length": 10207,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 1,
    "added_data_files_count": 26,
    "existing_data_files_count": 0,
    "deleted_data_files_count": 0,
    "added_rows_count": 3011776,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": "[{\"contains_null\": false, \"contains_nan\": false, \"lower_bound\": \"2024-12-16\\u0000\", \"upper_bound\": \"2025-01-20\\u0000\"}]"
  }

Manifest file column stats:

{
  "key": 13,
  "value": "2025-01-04\u0000"
}

As a result, when a client (e.g. Spark/Athena) performs a scan of the Iceberg table, it'll skip all the data files after failing to find any manifests that match the predicate given in the query.
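The precise cause lives in Paimon's Iceberg metadata writer, but the symptom matches a classic `ByteBuffer` pitfall: an over-allocated, zero-filled buffer is serialized via its full backing array instead of only the bytes actually written. A minimal sketch of that pattern and its fix (class and method names are hypothetical, not Paimon code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class NullSuffixDemo {
    // Bug pattern: read the buffer's entire backing array, ignoring how
    // many bytes were actually written. allocate() zero-fills the array,
    // so the unused tail shows up as trailing \u0000 characters.
    static String fromFullArray(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(utf8.length + 1); // over-allocated
        buf.put(utf8);
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Correct pattern: flip() and copy only the written range [0, limit).
    static String fromWrittenRange(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(utf8.length + 1);
        buf.put(utf8);
        buf.flip();
        byte[] exact = new byte[buf.remaining()];
        buf.get(exact);
        return new String(exact, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fromFullArray("2024-12-30").length());    // 11: trailing \u0000
        System.out.println(fromWrittenRange("2024-12-30").length()); // 10
    }
}
```

This reproduces exactly the observed `"2025-01-04\u0000"` shape in the column stats: the stored bytes are one longer than the partition value.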

Are you willing to submit a PR?
