
[Bug] Null Character Suffix in Iceberg Manifest Files Due to toByteBuffer Invocation #5007

Closed
2 tasks done
0dunay0 opened this issue Feb 3, 2025 · 0 comments
Labels
bug Something isn't working


Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

1.1-SNAPSHOT

Compute Engine

  • Using Flink to write data to Paimon with Iceberg compatibility.
  • AWS Athena, Spark, Flink SQL to query the Iceberg table.

Minimal reproduce step

Create a Paimon table with Iceberg compatibility enabled, partitioned by a string field. Querying the Iceberg table with a predicate on the partition field matches no data, while the same predicate against the Paimon table itself works as expected.

  • Create a Paimon table with Iceberg compatibility enabled with string partitioning, such as event_date.
  • Write data into the table, ensuring values like 2024-12-30 are stored in the partition.
  • Query the Iceberg table using equality predicates, such as: SELECT * FROM iceberg_table WHERE event_date = '2024-12-30';
  • Observe that no results are returned due to the null character suffix in the manifest files.
  • Rerun the query applying the TRIM function or wildcard filtering, e.g.:
    • `SELECT * FROM iceberg_table WHERE TRIM(event_date) = '2024-12-30';`
    • `SELECT * FROM iceberg_table WHERE event_date LIKE '2024-12-30%';`

What doesn't meet your expectations?

The expectation is that the Iceberg table should accurately reflect the partitions defined in the underlying Paimon table, without the values being altered during serialization. The null character suffix in the manifest files prevents successful querying by various client applications (Spark, Flink, Athena).

Anything else?

Concretely, when we inspect the Avro manifest files, we see that the column stats and partition summary values carry a \u0000 suffix, e.g.

Snapshot metadata file:

{
    "manifest_path": "s3a://some-bucket/some-prefix/warehouse/some_db/some_table/metadata/dc8e3d96-4144-4853-8ad6-1959ccac318e-m1.avro",
    "manifest_length": 10207,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 1,
    "added_data_files_count": 26,
    "existing_data_files_count": 0,
    "deleted_data_files_count": 0,
    "added_rows_count": 3011776,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": "[{\"contains_null\": false, \"contains_nan\": false, \"lower_bound\": \"2024-12-16\\u0000\", \"upper_bound\": \"2025-01-20\\u0000\"}]"
  }

Manifest file column stats:

{
  "key": 13,
  "value": "2025-01-04\u0000"
}

As a result, when a client (e.g. Spark/Athena) performs a scan of the Iceberg table, it'll skip all the data files after failing to find any manifests that match the predicate given in the query.
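The precise cause lives in Paimon's Iceberg metadata writer, but the symptom matches a classic `ByteBuffer` pitfall: an over-allocated, zero-filled buffer is serialized via its full backing array instead of only the bytes actually written. A minimal sketch of that pattern and its fix (class and method names are hypothetical, not Paimon code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class NullSuffixDemo {
    // Bug pattern: read the buffer's entire backing array, ignoring how
    // many bytes were actually written. allocate() zero-fills the array,
    // so the unused tail shows up as trailing \u0000 characters.
    static String fromFullArray(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(utf8.length + 1); // over-allocated
        buf.put(utf8);
        return new String(buf.array(), StandardCharsets.UTF_8);
    }

    // Correct pattern: flip() and copy only the written range [0, limit).
    static String fromWrittenRange(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(utf8.length + 1);
        buf.put(utf8);
        buf.flip();
        byte[] exact = new byte[buf.remaining()];
        buf.get(exact);
        return new String(exact, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fromFullArray("2024-12-30").length());    // 11: trailing \u0000
        System.out.println(fromWrittenRange("2024-12-30").length()); // 10
    }
}
```

This reproduces exactly the observed `"2025-01-04\u0000"` shape in the column stats: the stored bytes are one longer than the partition value.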

Are you willing to submit a PR?
