Python: add support to optimize, analyze tables and expire snapshots, remove orphan files #8183

Closed
alaturqua opened this issue Jul 29, 2023 · 12 comments

@alaturqua
Contributor

alaturqua commented Jul 29, 2023

Feature Request / Improvement

As a PyIceberg user, I would like to be able to do the following with PyIceberg:

  • expire snapshots
  • remove orphan files
  • optimize tables
  • analyze tables

Regards.

Query engine

Other

@Fokko Fokko added the python label Aug 1, 2023
@Fokko
Contributor

Fokko commented Aug 1, 2023

Thanks @alaturqua for raising this. It would be cool to add this. Can you elaborate on what you mean by analyze tables?

@alaturqua
Contributor Author

alaturqua commented Aug 1, 2023

@Fokko

I mean letting it gather statistics. I don't know exactly whether the metadata files hold statistics.

For example, Trino has this statement:
ANALYZE table_name
https://trino.io/docs/current/connector/iceberg.html#table-statistics

Additionally, being able to roll back snapshots or register Iceberg tables would be nice as well.

@Fokko
Contributor

Fokko commented Aug 1, 2023

@alaturqua That's an interesting thought. I think that should be quite straightforward. We could collect all the column metrics, and combine them to show count, null_count, nan_count, min and max.
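
The combination step described above could look roughly like the following sketch. It is not PyIceberg code: `FileColumnMetrics`, `ColumnSummary`, and `combine_metrics` are hypothetical names, and the sketch assumes the per-file lower and upper bounds have already been decoded into comparable Python values.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class FileColumnMetrics:
    """Hypothetical per-file metrics for a single column (not a PyIceberg class)."""
    value_count: int
    null_count: int
    nan_count: int
    lower_bound: Optional[Any]  # assumed to be already decoded
    upper_bound: Optional[Any]  # assumed to be already decoded


@dataclass
class ColumnSummary:
    """Table-level summary for a single column."""
    count: int = 0
    null_count: int = 0
    nan_count: int = 0
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None


def combine_metrics(per_file: Iterable[FileColumnMetrics]) -> ColumnSummary:
    """Fold per-file metrics into one summary: sum the counts, widen the bounds."""
    summary = ColumnSummary()
    for m in per_file:
        summary.count += m.value_count
        summary.null_count += m.null_count
        summary.nan_count += m.nan_count
        # A file without a bound contributes nothing to min/max.
        if m.lower_bound is not None and (
            summary.min_value is None or m.lower_bound < summary.min_value
        ):
            summary.min_value = m.lower_bound
        if m.upper_bound is not None and (
            summary.max_value is None or m.upper_bound > summary.max_value
        ):
            summary.max_value = m.upper_bound
    return summary
```

Because only counts and bounds are merged, the result is only as precise as the per-file metrics themselves; truncated bounds, for instance, would yield an approximate min and max.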

@Fokko
Contributor

Fokko commented Aug 1, 2023

@alaturqua which catalog are you using? Rollback of snapshots is also not that much work.

@alaturqua
Contributor Author

@Fokko

We are using the Hive metastore, but we plan to switch to the REST catalog if it supports views in the future.

@alaturqua
Contributor Author

alaturqua commented Aug 1, 2023

@Fokko

I have a use case where we copy Iceberg table folders from one blob storage location to another.

Updating the metadata and file locations in the metadata files is really cumbersome. Being able to update the metadata location and file locations, and then register the table via PyIceberg, would be a great improvement as well.

It could be used for migrations between vendors, storage systems, etc.
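
To illustrate the path-rewriting part of this use case, here is a minimal sketch of a prefix rewrite for a single file location. `rewrite_location` is a hypothetical helper, not a PyIceberg function, and the bucket names in the usage note are made up. Reading the paths out of the metadata and manifest files and writing them back is deliberately out of scope.

```python
def rewrite_location(path: str, old_prefix: str, new_prefix: str) -> str:
    """Return `path` with `old_prefix` replaced by `new_prefix`.

    Paths outside `old_prefix` are returned unchanged. The match is made on
    a path-segment boundary, so a prefix ending in ".../t" does not match a
    sibling directory such as ".../t2".
    """
    old = old_prefix.rstrip("/")
    new = new_prefix.rstrip("/")
    if path == old:
        return new
    if path.startswith(old + "/"):
        return new + path[len(old):]
    return path
```

For example, `rewrite_location("s3://old-bucket/db/t/data/f.parquet", "s3://old-bucket/db/t", "abfs://new/db/t")` yields `"abfs://new/db/t/data/f.parquet"`.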

@Fokko
Contributor

Fokko commented Aug 1, 2023

This would be quite straightforward. You would need to implement _commit_table for the Hive catalog so that it updates the table properties; this is already done for the REST catalog.

If you're interested in contributing, this might be an interesting PR; otherwise, I can try to squeeze it in at some point.

@onerishabh
Contributor

Hey @Fokko, happy to start looking into some features here one by one. It would be great if each task were its own issue with a brief description of how to tackle it. 😃

@netanelm-upstream

@alaturqua Hi, can you please also add the rewriteManifests operation?
It could be very useful.

@alaturqua
Contributor Author

alaturqua commented Aug 9, 2023

Hi @netanelm-upstream,

I created the feature request but unfortunately I do not have time to work on these in the near future.

I was hoping @Fokko would take these into his backlog.

@Fokko
Contributor

Fokko commented Aug 9, 2023

Thanks everyone for jumping in here.

  • Expire snapshots and optimize tables: These require write support, which we are working on! This will also include rewriteManifest operations.
  • Remove orphan files: This is something that could be done today. It would require listing all the files referenced by the table, and also listing all the files in a specific directory. We could do this in the CLI, where you get a prompt listing the files that need to be deleted.
  • Analyze tables: We currently have listing capabilities in the CLI; maybe it would be cool to also include the min/max/nulls in there.
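
The orphan-file check described in the list above boils down to a set difference between the files found in storage and the files the table references. A minimal sketch follows, assuming both listings are already available as plain path strings. `find_orphan_files` and the `older_than` cutoff are hypothetical; the cutoff is an added safety assumption so that files from in-flight writes are not flagged, and it is not something proposed in this thread.

```python
from typing import Iterable, List, Mapping


def find_orphan_files(
    referenced: Iterable[str],
    listed: Mapping[str, float],
    older_than: float,
) -> List[str]:
    """Return paths present in storage but not referenced by the table.

    referenced: every file path reachable from the table metadata.
    listed: path -> last-modified timestamp for files found in the directory.
    older_than: only files modified before this timestamp are reported.
    """
    referenced_set = set(referenced)
    return sorted(
        path
        for path, modified_at in listed.items()
        if path not in referenced_set and modified_at < older_than
    )
```

A CLI command could print the returned list and ask for confirmation before deleting anything, as suggested above.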

> I was hoping @Fokko would take these into his backlog.

Yes, I'm happy to, but we first need to get write support in :)

@Fokko
Contributor

Fokko commented Oct 2, 2023

I've migrated this issue to the new repository: apache/iceberg-python#31. We definitely don't want to lose track of this! 👍

@Fokko Fokko closed this as completed Oct 2, 2023

4 participants