feat: delete orphaned files #1958
base: main
Conversation
Thanks for working on this @jayceslesar, sorry for the late review.
I think this is a great start, I left some comments, let me know what you think!
Thanks for the PR @jayceslesar, using InspectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄
Thanks for the PR! I added a few comments. ptal :)
A meta question: wydt of moving the orphan file function to its own file/namespace? I like the idea of having all the table maintenance functions together, similar to Delta table's optimize.
I think that makes sense -- would #1880 end up there too? Also, ideally there is a CLI that exposes all the maintenance actions too, right? I think moving things to a new file/namespace makes sense.
That's a good point. However, I think we should be able to run them separately as well. For example, deleting orphan files won't affect the speed of the table, so it is more of a maintenance feature to reduce object storage costs. Deleting orphan files can also be pretty costly because of the list operation; ideally you would delegate this to the catalog, which could use, for example, S3 Inventory.
@@ -678,6 +685,28 @@ def all_manifests(self) -> "pa.Table":
        )
        return pa.concat_tables(manifests_by_snapshots)

    def _all_known_files(self) -> dict[str, set[str]]:
Do we still need this now that we have all_files?
I don't think that all_files includes manifest lists or statistics files, right?
It should reference deletes, but thinking about it, this is probably much more efficient because it does not pull in all the other data.
Any updates here? What else is blocking this PR -- happy to address
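(For context, a minimal sketch of the idea under review: the per-category known-file sets are unioned and then diffed against whatever the storage listing returns. The names orphan_candidates, all_known_files, and listed_files are placeholders for illustration, not identifiers from this PR.)

from typing import Dict, Set

def orphan_candidates(all_known_files: Dict[str, Set[str]], listed_files: Set[str]) -> Set[str]:
    # Union the per-category sets ("manifests", "manifest_lists", "statistics", data files, ...),
    # then keep only the listed paths that no metadata references.
    known: Set[str] = set()
    for paths in all_known_files.values():
        known |= paths
    return listed_files - known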
_all_known_files = {}
_all_known_files["manifests"] = set(self.all_manifests(snapshots)["path"].to_pylist())
_all_known_files["manifest_lists"] = {snapshot.manifest_list for snapshot in snapshots}
_all_known_files["statistics"] = {statistic.statistics_path for statistic in self.tbl.metadata.statistics}
Nice!
Looking at the Java side, there seems to be a focus (especially in tests) on partition statistics files not being deleted by the procedure either. I don't think this is a concern here because PyIceberg doesn't support reading tables with partition statistics in the first place, but I think it will soon (I put up #2034 / #2033), in which case this logic would have to include them.
Nice, if yours merges first I can add that here.
hahah nevermind -- yours definitely needs to go first
Nice work, left some minor comments. Looking forward to this feature :)
executor = ExecutorFactory.get_or_create()
snapshot_ids = [snapshot.snapshot_id for snapshot in snapshots]
files_by_snapshots: Iterator[Set[str]] = executor.map(
    lambda snapshot_id: set(self.files(snapshot_id)["file_path"].to_pylist()), snapshot_ids
)
might be nice if InspectTable.files or InspectTable._files took an Optional[Union[int, Snapshot]] so we didn't have to get the id from a snapshot and then turn it back into a Snapshot inside InspectTable._files
Yeah, I think there are a lot of places where we arbitrarily use one over the other, and imo it would be nice to standardize. Probably out of scope for this PR, but I think it would definitely clean things up.
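(As a rough illustration of the suggestion: a single normalization step could accept either form. snapshot_by_id and current_snapshot are existing Table methods, but the helper name, placement, and None handling below are assumptions, not code from this PR.)

from typing import Optional, Union

def _resolve_snapshot(self, snapshot: Optional[Union[int, "Snapshot"]] = None) -> Optional["Snapshot"]:
    # Accept a snapshot id, a Snapshot object, or None (meaning the current snapshot),
    # and resolve it once so callers never have to juggle both representations.
    if snapshot is None:
        return self.tbl.current_snapshot()
    if isinstance(snapshot, int):
        return self.tbl.snapshot_by_id(snapshot)
    return snapshot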
pyiceberg/table/maintenance.py (Outdated)
as_of = datetime.now(timezone.utc) - older_than
all_files = [
    f.path for f in fs.get_file_info(selector) if f.type == FileType.File and (as_of is None or (f.mtime < as_of))
]
when would as_of be None? Also can we construct a set directly here?
Good catch, cleaner now
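(A sketch of what the cleaned-up fragment could look like, wrapped in a standalone function for illustration; the function name is made up and this is not necessarily the exact code that landed. The None check disappears because as_of is always derived from older_than, and the comprehension builds the set directly.)

from datetime import datetime, timedelta, timezone
from typing import Set

from pyarrow.fs import FileSelector, FileSystem, FileType

def _files_older_than(fs: FileSystem, selector: FileSelector, older_than: timedelta) -> Set[str]:
    # as_of is always defined here, so no None check is needed, and paths are
    # collected straight into a set instead of a list.
    as_of = datetime.now(timezone.utc) - older_than
    return {f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of}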
pyiceberg/table/maintenance.py (Outdated)
except ModuleNotFoundError as e:
    raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e

def _orphaned_files(self, location: str, older_than: timedelta = timedelta(days=3)) -> Set[str]:
nit: could we get rid of the default here since it's in remove_orphan_files? Could also make this default to None and update the handling of as_of below to support None.
This should be implemented
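(A small sketch of the suggested None handling; the helper name is made up for illustration and the PR may wire this differently. older_than=None would mean "no age filter".)

from datetime import datetime, timedelta, timezone
from typing import Optional

def _is_old_enough(mtime: datetime, older_than: Optional[timedelta]) -> bool:
    # None means "consider every file regardless of age"; otherwise only files
    # last modified before now - older_than count as orphan candidates.
    if older_than is None:
        return True
    return mtime < datetime.now(timezone.utc) - older_than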
@Fokko we probably also want pyiceberg to have some idea about https://iceberg.apache.org/spec/#delete-formats right? Is it currently aware of those files?
@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by all_files. The only part that's missing is the partition statistics files.
Sounds good, I will add the partition statistics files when that is merged!
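(For illustration only, a guess at what that addition to _all_known_files could look like once partition statistics are readable; the partition_statistics attribute name is an assumption based on the table-metadata spec, not current PyIceberg API.)

_all_known_files["partition_statistics"] = {
    ps.statistics_path for ps in self.tbl.metadata.partition_statistics  # assumed attribute
}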
Closes #1200
Rationale for this change
Ability to do more table maintenance from pyiceberg (iceberg-python?)
Are these changes tested?
Added a test!
Are there any user-facing changes?
Yes, this is a new method on the Table class.
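(A usage sketch under the API proposed in this PR; the entry point and defaults may still change in review, and the catalog and table names below are made up.)

from datetime import timedelta

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")                         # hypothetical catalog name
table = catalog.load_table("analytics.events")            # hypothetical table identifier
table.remove_orphan_files(older_than=timedelta(days=3))   # method proposed in this PR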