[feat] add missing metadata tables #1053
Comments
@kevinjqliu I would like to work on this one.
Hey @soumya-ghosh - if you want to split the workload between us, I would love to also give this a try.
Sure @amitgilad3, most likely there will be separate PRs for each of the above metadata tables.
Thanks for volunteering to contribute! I was thinking we could do something similar to #511, where each metadata table can be assigned one at a time. And feel free to work on another after the first is done!
@kevinjqliu we can group the tasks in the following way:
What do you think?
That makes sense to me, thanks @soumya-ghosh
@kevinjqliu added PR #1066.
Hey @kevinjqliu, any thoughts on how to implement this one?
What is the difference between your implementation's output vs Spark's? From the Spark docs: "To show all files, data files and delete files across all tracked snapshots, query prod.db.table.all_files"
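For reference, a minimal sketch of the query the Spark docs describe; the `prod.db.table` identifier and the catalog setup are assumptions for illustration, not part of this thread:

```python
from pyspark.sql import SparkSession

# Assumes Spark is already configured with an Iceberg catalog named `prod`
# (the Iceberg catalog configuration itself is omitted here).
spark = SparkSession.builder.getOrCreate()

# All files, data files and delete files across all tracked snapshots:
spark.sql("SELECT * FROM prod.db.table.all_files").show()

# For comparison, only the files reachable from the current snapshot:
spark.sql("SELECT * FROM prod.db.table.files").show()
```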
This sounds right to me. Maybe Spark gets rid of duplicate rows?
From the Spark docs:
So, here's my approach (pseudo-code):

```
metadata = load_table_metadata()
for snapshot in metadata["snapshots"]:
    manifest_list = read manifest list from snapshot
    for manifest_file in manifest_list:
        manifest = read manifest file
        for file in manifest:
            process file (data_file or delete_file)
```

With this approach, the number of files in the output is much higher than the corresponding output from Spark.
I see. So if I have a new table and append to it 5 times, I expect 5 snapshots and 5 manifest list files. I think each manifest list file will repeatedly refer to the same underlying manifest file, which will be read over and over, causing duplicates. What if you just return all unique (data+delete) files?
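A rough sketch of that "unique files" idea using PyIceberg's lower-level manifest APIs; the method names (`Snapshot.manifests`, `ManifestFile.fetch_manifest_entry`) and the catalog/table identifiers are assumptions for illustration, not the implementation from the PRs discussed here:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.table")   # placeholder table identifier

seen_paths = set()
unique_files = []
for snapshot in table.metadata.snapshots:
    # Different manifest lists often point at the same manifest files,
    # so deduplicate entries by data/delete file path.
    for manifest in snapshot.manifests(table.io):
        for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=True):
            path = entry.data_file.file_path
            if path not in seen_paths:
                seen_paths.add(path)
                unique_files.append(entry.data_file)

print(f"unique data/delete files across all snapshots: {len(unique_files)}")
```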
In this case, the output will not match Spark's. Will that be okay? Also found this PR from Iceberg.
@soumya-ghosh I wonder if that's still the case today; that PR is from 2020.
@kevinjqliu added PR #1241. Will get on with the remaining tables next.
Thanks for your contribution here @soumya-ghosh. I just merged #1241.
Yes, I will start working on that soon; have been busy the last few weeks so couldn't make any progress.
Hey @soumya-ghosh & @kevinjqliu, would love to contribute. I don't want to step on your work, so I was wondering what I can take from this list: positional_deletes, all_files, all_data_files and all_delete_files?
Sure @amitgilad3, you can work on positional_deletes first.
@soumya-ghosh, I'll start with positional_deletes and see how fast I can finish it; once I'm done we can see about the rest.
@kevinjqliu added PR #1626.
Awesome work!! @soumya-ghosh - if all goes well, the next release will have all metadata tables accessible from PyIceberg 🚀
@amitgilad3 Right back at you!
Thanks for the contribution!! Appreciate it.
Other than those, I think we're good to include this in the next release! 🥳
For point 1 - will raise a separate PR covering documentation updates for these metadata tables. For points 2, 3 and 4 - as per the current Iceberg code, such operations are not supported on these tables.
What I mean is that I guess this can also occur for the rest of the metadata tables too. For example, there's a bug in the partitions metadata table right now for partition evolution (#1120). I just want to double-check these things before calling this done :)
I understand that. I did a test to see the behavior in Spark; observations are in the attached gist. It appears that Spark constructs readable_metrics from the current schema as well. Thoughts @kevinjqliu?
Hey @soumya-ghosh @kevinjqliu - just so I understand: since I already implemented support for a specific snapshot in all_entries and in position_deletes, do we want to support this or not?
@amitgilad3 were you able to test this in Spark?
@soumya-ghosh - when I run it in Spark I get an error, so I guess we should not support it for all_entries. position_deletes works with Spark, so I'll keep it.
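For context, the snapshot-specific behavior under discussion would look roughly like this on the PyIceberg side; it assumes the inspect methods accept a snapshot_id argument, and the catalog/table names are placeholders:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.table")   # placeholder table identifier

# Pick an older snapshot to inspect.
snapshot_id = table.metadata.snapshots[0].snapshot_id

# Assumption: entries() accepts a snapshot_id for snapshot-specific output,
# mirroring the support discussed above.
entries_at_snapshot = table.inspect.entries(snapshot_id=snapshot_id)
print(entries_at_snapshot.num_rows)
```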
@kevinjqliu awaiting your thoughts on the above 4 comments.
Hey folks, sorry for the late response. I think there are a couple of different things going on here.
@soumya-ghosh @amitgilad3 does that make sense? Please let me know if I misinterpreted anything.
@kevinjqliu I agree with your point that when time-travelling to older snapshots, metadata tables should adhere to the schema as of that snapshot.
In the output files of the attached gist, we can see that columns like lower_bound, upper_bound, and value_counts adhere to the schema because they are obtained from the data files themselves, whereas readable_metrics does not, since it derives its structure from the current schema. My opinion is we should merge this PR and initiate a discussion / PR to correct this behavior in main Iceberg. WDYT, @kevinjqliu?
@soumya-ghosh Thanks for pointing this out with the gist. I agree that it would be good to raise this upstream. Could you create an issue at apache/iceberg?
After playing a bit more with it, I don't think it makes sense to always show the readable metrics based on the current schema. I think they should be derived from the metrics that are already there. Also, having all the readable metrics show up for a Puffin file (with a deletion vector) does not make any sense.
Implements the below metadata tables from #1053:

- `all_files`
- `all_data_files`
- `all_delete_files`

Refactored code for files metadata for better reusability.
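A hedged usage sketch of how these new tables would be exposed through the inspect API, assuming the method names mirror the table names; the catalog and table identifiers are placeholders:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.table")   # placeholder table identifier

# Assumed to return pyarrow Tables, like the existing inspect methods.
all_files = table.inspect.all_files()
all_data_files = table.inspect.all_data_files()
all_delete_files = table.inspect.all_delete_files()

print(all_files.num_rows, all_data_files.num_rows, all_delete_files.num_rows)
```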
Feature Request / Improvement
Looks like there are a few more metadata tables currently missing in PyIceberg.
Source of truth for metadata tables: https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/MetadataTableType.html
Done: https://py.iceberg.apache.org/api/#inspecting-tables
Missing:
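For reference, a minimal sketch of how the already-documented metadata tables (the "Done" link above) are queried through the inspect API; the catalog and table names are placeholders:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("db.table")   # placeholder table identifier

# A few of the metadata tables already exposed today:
print(table.inspect.snapshots())
print(table.inspect.manifests())
print(table.inspect.files())
```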