
support all_entries in pyiceberg #1608


Open
wants to merge 6 commits into main
Conversation

Contributor

@amitgilad3 commented Feb 4, 2025

Implements the all_entries metadata table - #1053
This is my initial PR towards supporting all_entries in PyIceberg. I'm currently running into an issue where the number of entries returned by Spark is lower than what PyIceberg returns, and I'm not sure why.

PyIceberg results: (screenshot)

Spark results: (screenshot)

I'm also attaching the CSV for investigation purposes:
py_iceberg vs spark.csv

Update: issue resolved.

@amitgilad3 marked this pull request as draft February 4, 2025 17:11
@kevinjqliu
Contributor

> the number of entries returned by Spark is lower than what PyIceberg returns, and I'm not sure why

Looks like there are a lot of duplicates in PyIceberg. If you de-dup and sort, it should match.
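A minimal sketch of that comparison, assuming both engines' entries have been exported to CSV (the file names and sort keys here are hypothetical):

```python
import pandas as pd

# Hypothetical exports of the entries from each engine.
lhs = pd.read_csv("pyiceberg_entries.csv")
rhs = pd.read_csv("spark_entries.csv")

# Drop duplicate rows and sort on a stable key so that ordering and
# repeated manifest entries do not cause false mismatches.
key = ["status", "snapshot_id", "sequence_number"]
lhs = lhs.drop_duplicates().sort_values(key).reset_index(drop=True)
rhs = rhs.drop_duplicates().sort_values(key).reset_index(drop=True)

pd.testing.assert_frame_equal(lhs, rhs)
```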

@amitgilad3
Contributor Author

amitgilad3 commented Feb 4, 2025

Thanks @kevinjqliu - wasn't sure how the Iceberg Java library does it :)

@amitgilad3 marked this pull request as ready for review February 5, 2025 05:54
@amitgilad3
Contributor Author

Ready for review

Contributor

@kevinjqliu left a comment


Thanks for the PR! I added some comments

Comment on lines 989 to 990
```python
for snapshot_id in df["snapshot_id"]:
    assert isinstance(snapshot_id.as_py(), int)
```
Contributor


This is redundant, right? It's already included in the above.

"readable_metrics",
]

# Make sure that they are filled properly
Contributor


nit: add a comment to signal that this checks the first 4 columns, and that the rest of the test checks the last 2, data_file and readable_metrics.
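A possible wording for that comment (hypothetical; the column names follow the entries schema):

```python
# Check that the first 4 columns (status, snapshot_id, sequence_number,
# file_sequence_number) are filled properly; the last 2 (data_file,
# readable_metrics) are covered by the assertions further down.
```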

Comment on lines 993 to 997
lhs["content"] = lhs["data_file"].apply(lambda x: x.get("content"))
lhs["file_path"] = lhs["data_file"].apply(lambda x: x.get("file_path"))
lhs = lhs.sort_values(["status", "snapshot_id", "sequence_number", "content", "file_path"]).drop(
columns=["file_path", "content"]
)
Contributor


nit: maybe inline these operations to show that the same operations are performed on lhs and rhs
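One way to express this, assuming `lhs` and `rhs` are pandas DataFrames whose `data_file` column holds dicts (`sort_entries` is a hypothetical helper name):

```python
import pandas as pd

def sort_entries(df: pd.DataFrame) -> pd.DataFrame:
    # Derive temporary sort keys from the data_file struct, sort on them,
    # then drop them again, so both sides get the exact same normalization.
    df = df.assign(
        content=df["data_file"].apply(lambda x: x.get("content")),
        file_path=df["data_file"].apply(lambda x: x.get("file_path")),
    )
    return df.sort_values(
        ["status", "snapshot_id", "sequence_number", "content", "file_path"]
    ).drop(columns=["content", "file_path"])

# lhs, rhs = sort_entries(lhs), sort_entries(rhs)
```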

"partition": partition_record_dict,
"record_count": entry.data_file.record_count,
"file_size_in_bytes": entry.data_file.file_size_in_bytes,
"column_sizes": dict(entry.data_file.column_sizes) if entry.data_file.column_sizes is not None else None,
Contributor


how about

"column_sizes": dict(entry.data_file.column_sizes) or None

```python
def _get_entries(self, manifest: ManifestFile, discard_deleted: bool = True) -> "pa.Table":
    import pyarrow as pa

    schema = self.tbl.metadata.schema()
```
Contributor


Something I'm concerned about with the all_* metadata tables is taking the table's schema/partition evolution into account.

Since we're looking across all snapshots, it might not be right to use the current snapshot's schema here.
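A hedged sketch of the alternative: resolve the schema that was current when each snapshot was written, via the `schema_id` that PyIceberg's `Snapshot` records (the helper name is hypothetical):

```python
from pyiceberg.schema import Schema
from pyiceberg.table.snapshots import Snapshot

def _schema_for_snapshot(self, snapshot: Snapshot) -> Schema:
    # Prefer the schema recorded on the snapshot over the table's current
    # schema, so dropped or renamed columns are reported as they existed
    # at write time.
    if snapshot.schema_id is not None:
        for schema in self.tbl.metadata.schemas:
            if schema.schema_id == snapshot.schema_id:
                return schema
    # Older snapshots may not carry a schema_id; fall back to current.
    return self.tbl.metadata.schema()
```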

@amitgilad3
Contributor Author

Ready for review

Contributor

@kevinjqliu left a comment


Generally LGTM. I've added some more comments about using the table's current metadata.

````diff
@@ -716,6 +716,8 @@ readable_metrics: [
     [6.0989]]
 ```
+
+To show all the table's current manifest entries for both data and delete files, use `table.inspect.all_entries()`.
````
Contributor


It'd be great to add this to its own section, similar to https://iceberg.apache.org/docs/nightly/spark-queries/#all-metadata-tables
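For such a docs section, a minimal usage sketch might look like this (catalog and table names are hypothetical):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # hypothetical catalog name
table = catalog.load_table("db.events")  # hypothetical table identifier

# Returns a pyarrow.Table with one row per manifest entry,
# across all snapshots, for both data and delete files.
entries = table.inspect.all_entries()
print(entries.column_names)
```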

Comment on lines +212 to +220

```python
"value_counts": dict(entry.data_file.value_counts) if entry.data_file.value_counts is not None else None,
"null_value_counts": dict(entry.data_file.null_value_counts)
if entry.data_file.null_value_counts is not None
else None,
"nan_value_counts": dict(entry.data_file.nan_value_counts)
if entry.data_file.nan_value_counts is not None
else None,
```
Contributor


nit: can we use the same `dict() or None` pattern here?

Contributor Author


I used the same logic you used in lines L645 - L650. If I use `dict() or None`, the code breaks when trying to do `dict(None)`.
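A quick illustration of the failure mode, plus a helper that captures the intended behavior (`to_dict_or_none` is a hypothetical name):

```python
from typing import Mapping, Optional

def to_dict_or_none(value: Optional[Mapping]) -> Optional[dict]:
    # dict(None) raises TypeError, so the None check has to come first.
    # Like the suggested `dict(x) or None`, this also maps {} to None.
    return (dict(value) or None) if value is not None else None

# dict(None)               -> TypeError: 'NoneType' object is not iterable
# to_dict_or_none(None)    -> None
# to_dict_or_none({})      -> None
# to_dict_or_none({1: 2})  -> {1: 2}
```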

```
@@ -157,74 +158,96 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType:
        pa.field("readable_metrics", pa.struct(readable_metrics_struct), nullable=True),
    ]
)
return entries_schema

def _get_entries(self, schema: "pa.Schema", manifest: ManifestFile, discard_deleted: bool = True) -> "pa.Table":
```
Contributor


The schema here should be an Iceberg schema instead of a pa.Schema.
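A sketch of the suggested signature, taking PyIceberg's `Schema` and converting to Arrow internally (conversion step elided):

```python
from pyiceberg.manifest import ManifestFile
from pyiceberg.schema import Schema

def _get_entries(self, schema: Schema, manifest: ManifestFile, discard_deleted: bool = True) -> "pa.Table":
    # Accept the Iceberg schema so callers can pass the snapshot-specific
    # schema; convert to a pyarrow schema only where Arrow needs it.
    ...
```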

Contributor


There are still a couple of places in this function using self.tbl., which is the table's current metadata (L189 & L195).
We should use the schema and partition spec at the time of the snapshot instead.
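And a companion sketch for the partition spec, resolved via the `spec_id` recorded on each data file rather than the table's current default (the helper name is hypothetical):

```python
from pyiceberg.manifest import ManifestEntry
from pyiceberg.partitioning import PartitionSpec

def _spec_for_entry(self, entry: ManifestEntry) -> PartitionSpec:
    # Each data file records the id of the spec it was written with;
    # prefer that over the table's current default spec.
    for spec in self.tbl.metadata.partition_specs:
        if spec.spec_id == entry.data_file.spec_id:
            return spec
    return self.tbl.spec()  # fall back to the current default spec
```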

```
@@ -94,7 +94,7 @@ def snapshots(self) -> "pa.Table":
         schema=snapshots_schema,
     )

-    def entries(self, snapshot_id: Optional[int] = None) -> "pa.Table":
+    def _get_entries_schema(self) -> "pa.Schema":
```
Contributor


Same problem here using the table's current metadata (self.tbl.).

@soumya-ghosh
Contributor

@amitgilad3 @Fokko @kevinjqliu Let's get this PR over the line and complete the remaining metadata tables.

@amitgilad3
Contributor Author

Fixed the merge conflict, so I think we are good to go @soumya-ghosh
