Support get partition table with filter #24

Closed
Fokko opened this issue Oct 2, 2023 · 8 comments

@Fokko
Contributor

Fokko commented Oct 2, 2023

Feature Request / Improvement

Migration of issue apache/iceberg#8619

@puchengy
Contributor

puchengy commented Oct 2, 2023

Hello @Fokko, here is my use case:

  1. Given a table, find out all the partitions it has.
  2. Given a partition filter, check whether a matching partition exists in the table.

Thank you!

@Fokko
Contributor Author

Fokko commented Oct 2, 2023

@puchengy The problem with Iceberg is that the partition is more of a logical concept, rather than a physical path like in a Hive table. What do you think of passing in a predicate, and letting the Airflow sensor pass if there are rows?

For example, you could go from a daily to an hourly partition. Then you would get:

2023-01-01T00:00:00
2023-01-02T00:00:00
2023-01-03T00:00:00
2023-01-03T23:00:00 # Changed from daily to hourly
2023-01-04T00:00:00
2023-01-04T01:00:00
2023-01-04T02:00:00
2023-01-04T03:00:00
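
The predicate idea above can be sketched as a small helper that a sensor could poll. This is only an illustration, not an existing Airflow or PyIceberg component: the name `partition_has_data` is hypothetical, and it assumes `table` behaves like a PyIceberg table whose `scan(row_filter=...)` returns a scan with a `plan_files()` method.

```python
from typing import Any


def partition_has_data(table: Any, row_filter: str) -> bool:
    """Return True if the scan plans at least one data file for the predicate."""
    # Assumption: `table` behaves like a PyIceberg Table, so that
    # table.scan(row_filter=...) returns a scan whose plan_files()
    # yields one task per candidate data file.
    scan = table.scan(row_filter=row_filter)
    # Stop at the first planned file; we only need to know that one exists.
    return next(iter(scan.plan_files()), None) is not None
```

Because the check is driven by a predicate on the data rather than a partition path, it would not care whether the table was partitioned daily or hourly when the rows were written. Note that file planning prunes on metadata, so a planned file means rows may match; a sensor that needs certainty would have to read the rows.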

@puchengy
Contributor

puchengy commented Oct 2, 2023

> What do you think of passing in a predicate, and letting the Airflow sensor pass if there are rows?

@Fokko That works. This is actually what we are doing (but for legacy_python) in pinterest/iceberg@7d8d65d

Would this be something we can implement upstream? Thanks.

@puchengy
Contributor

puchengy commented Oct 5, 2023

@Fokko gentle ping, thanks ^

@Fokko
Contributor Author

Fokko commented Oct 5, 2023

@puchengy Yes, certainly. Would this be something that you're interested in working on? From the snapshot, we can load the manifest list, and from there the manifests themselves, which contain the partition information.
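
A rough sketch of that walk (snapshot, then manifest list, then manifest entries) is below. The helper name `list_partitions` is hypothetical. It assumes the PyIceberg-style calls `table.current_snapshot()`, `snapshot.manifests(io)`, `manifest.fetch_manifest_entry(io)`, and the attributes `manifest.partition_spec_id` and `entry.data_file.partition`; treat those as assumptions to verify against the installed version.

```python
from typing import Any, Dict, List, Tuple


def list_partitions(table: Any) -> List[Tuple[int, Any]]:
    """Collect distinct (spec_id, partition) pairs from the current snapshot."""
    snapshot = table.current_snapshot()
    if snapshot is None:  # a table without snapshots has no data files yet
        return []
    seen: Dict[Tuple[int, str], Tuple[int, Any]] = {}
    # The snapshot points at a manifest list; each manifest lists data files.
    for manifest in snapshot.manifests(table.io):
        spec_id = manifest.partition_spec_id
        for entry in manifest.fetch_manifest_entry(table.io):
            partition = entry.data_file.partition
            # Key on repr() so partition values need not be hashable.
            seen.setdefault((spec_id, repr(partition)), (spec_id, partition))
    return list(seen.values())
```

Keeping the spec id next to each partition value matters when the partitioning has evolved, as in the daily-to-hourly example: the same table can then hold partition tuples of different shapes.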

@puchengy
Contributor

puchengy commented Oct 6, 2023

@Fokko Yes, I can help. Thanks.

@pp-akursar

I was looking for something comparable to Spark's partitions metadata table, which lets me determine if and when a partition was updated with a query like this:

SELECT partition, last_updated_snapshot_id, last_updated_at
FROM prod.db.table.partitions
WHERE partition.foo='bar'

While searching, I came across this issue. It sounds like this Feature Request could provide that if it includes a ManifestReader with filtering features like the linked code in legacy_python. Is that correct? If not, I will raise it as a separate issue.

Fokko mentioned this issue May 13, 2024

@sungwy
Collaborator

sungwy commented Jun 15, 2024

Partitions table was added in: #603
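
For readers arriving with the Spark-style query from earlier in this thread, a minimal sketch of the same filter in plain Python follows. It assumes the partitions table can be read as a list of dict rows (for example via `table.inspect.partitions().to_pylist()`, which is an assumption about the API added in #603) and that each row carries `partition`, `last_updated_snapshot_id`, and `last_updated_at` fields as in the Spark table. The helper name `partition_updates` is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def partition_updates(rows: Iterable[Dict[str, Any]], **wanted: Any) -> List[Dict[str, Any]]:
    """Keep rows whose `partition` struct matches every field=value in `wanted`."""
    matches = []
    for row in rows:
        partition = row.get("partition") or {}
        # Equivalent in spirit to: WHERE partition.foo = 'bar'
        if all(partition.get(field) == value for field, value in wanted.items()):
            matches.append({
                "partition": partition,
                "last_updated_snapshot_id": row.get("last_updated_snapshot_id"),
                "last_updated_at": row.get("last_updated_at"),
            })
    return matches
```

Calling `partition_updates(rows, foo="bar")` returns one dict per matching partition; an empty list means no such partition exists in the rows supplied.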

sungwy closed this as completed Jun 15, 2024