feat: delete orphaned files #1958

Open
wants to merge 27 commits into
base: main

jayceslesar
Contributor

Closes #1200

Rationale for this change

Ability to do more table maintenance from pyiceberg (iceberg-python?)

Are these changes tested?

Added a test!

Are there any user-facing changes?

Yes, this is a new method on the Table class.

Contributor

@Fokko left a comment

Thanks for working on this @jayceslesar, sorry for the late review.

I think this is a great start, I left some comments, let me know what you think!

Contributor

@smaheshwar-pltr left a comment

Thanks for the PR @jayceslesar, using InspectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄

Contributor

@kevinjqliu left a comment

Thanks for the PR! I added a few comments. ptal :)

@kevinjqliu
Contributor

a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how .inspect is used?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

@jayceslesar
Contributor Author

jayceslesar commented May 4, 2025

> a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how .inspect is used?
>
> i like the idea of having all the table maintenance functions together, similar to delta table's optimize

I think that makes sense -- would #1880 end up there too?

Also ideally there is a CLI that exposes all the maintenance actions too right?

I think moving things to a new OptimizeTable class in a new module, optimize.py, makes a lot of sense. It can be modeled very similarly to InspectTable and generally makes things cleaner. I think it still makes sense to keep all_known_files inside inspect, though; the new OptimizeTable can still use it.
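For illustration, the namespace idea discussed here could look roughly like the sketch below. It is not the code from this PR: `Table` is a minimal stand-in, the `optimize` property name is an assumption, and the method body is a placeholder. Only the `OptimizeTable` name comes from the comment above.

```python
from __future__ import annotations

from datetime import timedelta


class OptimizeTable:
    """Hypothetical home for maintenance actions, wrapping a table the way an inspect wrapper would."""

    def __init__(self, tbl: Table) -> None:
        self.tbl = tbl

    def remove_orphan_files(self, older_than: timedelta = timedelta(days=3)) -> str:
        # Placeholder body: a real implementation would list and delete files.
        return f"would remove orphans older than {older_than} from {self.tbl.name}"


class Table:
    """Stand-in for the real Table class; it holds only what the sketch needs."""

    def __init__(self, name: str) -> None:
        self.name = name

    @property
    def optimize(self) -> OptimizeTable:
        # Hypothetical accessor: builds a new wrapper on each access.
        return OptimizeTable(self)
```

With this shape, a call would read `Table("db.events").optimize.remove_orphan_files()`, and further maintenance actions could be added as methods on the same class.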

@Fokko
Contributor

Fokko commented May 13, 2025

> i like the idea of having all the table maintenance functions together, similar to delta table's optimize

That's a good point. However, I think we should also be able to run them separately. For example, deleting orphan files won't affect the speed of the table, so it is more of a maintenance feature to reduce object storage costs. Deleting orphan files can also be pretty costly because of the list operation; ideally you would delegate this to a catalog that uses, for example, S3 Inventory.

@@ -678,6 +685,28 @@ def all_manifests(self) -> "pa.Table":
        )
        return pa.concat_tables(manifests_by_snapshots)

    def _all_known_files(self) -> dict[str, set[str]]:
Contributor

Do we still need this now we have all_files?

Contributor Author

I don't think all_files includes manifest lists or statistics files, right?

Contributor

Fokko commented May 16, 2025

It should reference deletes, but thinking of it, this is probably much more efficient because it does not pull in all the other data.

Contributor Author

Any updates here? What else is blocking this PR -- happy to address

        _all_known_files = {}
        _all_known_files["manifests"] = set(self.all_manifests(snapshots)["path"].to_pylist())
        _all_known_files["manifest_lists"] = {snapshot.manifest_list for snapshot in snapshots}
        _all_known_files["statistics"] = {statistic.statistics_path for statistic in self.tbl.metadata.statistics}
Contributor

Nice!

Looking at the Java side, there seems to be a focus (especially in tests) on partition statistics files not being deleted by the procedure either. I don't think this is a concern here because PyIceberg doesn't support reading tables with partition statistics in the first place. But I think it should soon (I put up #2034 / #2033), in which case this logic would have to include them.

Contributor Author

Nice, if yours merges first I can add that here

Contributor Author

hahah nevermind -- yours definitely needs to go first
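As a rough sketch of how a dictionary like the one built in `_all_known_files` could be used, orphan detection reduces to a set difference between the files listed under the table location and the union of every known-file category. The function name `find_orphans` and the `"data_files"` key in the usage note are assumptions for illustration, not taken from the PR.

```python
from typing import Dict, Iterable, Set


def find_orphans(listed_files: Iterable[str], known_files: Dict[str, Set[str]]) -> Set[str]:
    """Return the listed paths that no category of known files references."""
    # Union every category (manifests, manifest lists, statistics, ...) into one set.
    referenced: Set[str] = set().union(*known_files.values())
    return set(listed_files) - referenced
```

Because the categories are only ever unioned, supporting partition statistics later would mean adding one more key to the dictionary; the difference step itself would not change. For example, with `known = {"manifests": {"m1.avro"}, "data_files": {"d1.parquet"}}`, calling `find_orphans(["m1.avro", "d1.parquet", "stray.parquet"], known)` leaves only `stray.parquet`.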

Contributor

@Anton-Tarazi left a comment

Nice work, left some minor comments. Looking forward to this feature :)

        executor = ExecutorFactory.get_or_create()
        snapshot_ids = [snapshot.snapshot_id for snapshot in snapshots]
        files_by_snapshots: Iterator[Set[str]] = executor.map(
            lambda snapshot_id: set(self.files(snapshot_id)["file_path"].to_pylist()), snapshot_ids
Contributor

might be nice if InspectTable.files or InspectTable._files took an Optional[Union[int, Snapshot]] so we didn't have to get the id from a snapshot and then turn it back into a Snapshot inside InspectTable._files

Contributor Author

Yeah, I think there are a lot of places where we arbitrarily use one over the other, and imo it would be nice to standardize. Probably out of scope for this PR, but I think it would definitely clean things up.
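One possible shape for the standardization suggested in this thread is a small helper that accepts either form and always returns a snapshot object. This is only a sketch: the `Snapshot` dataclass is a stand-in for the real model, and `resolve_snapshot` with its `by_id` and `current` parameters is a hypothetical helper, not an existing PyIceberg function.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Snapshot:
    """Stand-in for the real snapshot model; only the fields the sketch needs."""

    snapshot_id: int
    manifest_list: str


def resolve_snapshot(
    snapshot: Optional[Union[int, Snapshot]],
    by_id: Dict[int, Snapshot],
    current: Optional[Snapshot],
) -> Snapshot:
    """Accept a snapshot, a snapshot ID, or None (meaning the current snapshot)."""
    if snapshot is None:
        if current is None:
            raise ValueError("Table has no current snapshot")
        return current
    if isinstance(snapshot, Snapshot):
        # Already resolved: no lookup by ID needed.
        return snapshot
    try:
        return by_id[snapshot]
    except KeyError:
        raise ValueError(f"Snapshot not found: {snapshot}") from None
```

Callers that already hold a snapshot object would then skip the round trip from object to ID and back.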

        as_of = datetime.now(timezone.utc) - older_than
        all_files = [
            f.path for f in fs.get_file_info(selector) if f.type == FileType.File and (as_of is None or (f.mtime < as_of))
        ]
Contributor

when would as_of be None? Also can we construct a set directly here?

Contributor Author

Good catch, cleaner now
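The cleanup suggested in this thread might look something like the following: since `as_of` is always computed from `older_than`, the `None` check can go, and a set comprehension builds the result directly. This sketch uses a hypothetical `ListedFile` dataclass in place of the filesystem listing entries so that it runs without PyArrow; the `now` parameter is also an addition for testability and is not part of the PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class ListedFile:
    """Stand-in for one entry of a recursive file listing."""

    path: str
    is_file: bool
    mtime: datetime  # timezone-aware modification time


def files_older_than(
    entries: Iterable[ListedFile],
    older_than: timedelta,
    now: Optional[datetime] = None,
) -> Set[str]:
    """Return paths of regular files last modified before now - older_than."""
    as_of = (now or datetime.now(timezone.utc)) - older_than
    # as_of is never None here, so no extra guard is needed.
    return {e.path for e in entries if e.is_file and e.mtime < as_of}
```

Returning a set directly also avoids a later conversion when the result is compared against the set of known files.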

        except ModuleNotFoundError as e:
            raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e

    def _orphaned_files(self, location: str, older_than: timedelta = timedelta(days=3)) -> Set[str]:
Contributor

nit: could we get rid of the default here since it's in remove_orphan_files? We could also make this default to None and update the handling of as_of below to support None.

Contributor Author

This should be implemented
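For the second option in the nit above (defaulting to `None`), one way to keep the handling explicit is to resolve the cutoff once and let `None` mean "no age filter". The helper names `resolve_cutoff` and `is_old_enough` are hypothetical, chosen only for this sketch, and the `now` parameter exists purely to make the behavior testable.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_cutoff(older_than: Optional[timedelta], now: Optional[datetime] = None) -> Optional[datetime]:
    """Turn an optional age threshold into an optional absolute cutoff time."""
    if older_than is None:
        # No threshold given: every file is a candidate regardless of age.
        return None
    return (now or datetime.now(timezone.utc)) - older_than


def is_old_enough(mtime: datetime, as_of: Optional[datetime]) -> bool:
    """True when no cutoff applies or the file was modified before the cutoff."""
    return as_of is None or mtime < as_of
```

Under this variant the `as_of is None` check in the filter becomes meaningful again, because `None` is now a value callers can actually pass.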

@jayceslesar
Contributor Author

@Fokko we probably also want pyiceberg to have some idea about https://iceberg.apache.org/spec/#delete-formats right? Is it currently aware of those files?

@Fokko
Contributor

Fokko commented Jun 24, 2025

@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

@jayceslesar
Contributor Author

> @jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

Sounds good, I will add the partition statistics files when that is merged!

Successfully merging this pull request may close these issues.

Delete orphan files
5 participants