Logic to determine the partitions #728

Closed

Fokko opened this issue Nov 27, 2024 · 5 comments

Comments
@Fokko
Contributor

Fokko commented Nov 27, 2024

Given an Arrow Buffer, we want to apply a specific partition spec that determines the partitions to which the data is appended.
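For illustration, here is a minimal sketch of what "determining the partitions" could look like for a batch with a single Int32 identity-partition column: group the row indices by partition value so each group can be appended to its own partition. This is not iceberg-rust's API; the function name and signature are made up for this example.

```rust
use std::collections::HashMap;

use arrow::array::{Array, AsArray, RecordBatch};
use arrow::datatypes::Int32Type;

/// Group row indices by the value of a single Int32 identity-partition
/// column; each group would be appended to its own partition.
fn group_rows_by_partition(
    batch: &RecordBatch,
    partition_col: usize,
) -> HashMap<i32, Vec<usize>> {
    let col = batch.column(partition_col).as_primitive::<Int32Type>();
    let mut groups: HashMap<i32, Vec<usize>> = HashMap::new();
    for row in 0..batch.num_rows() {
        // Rows with a NULL partition value are skipped in this sketch.
        if col.is_valid(row) {
            groups.entry(col.value(row)).or_default().push(row);
        }
    }
    groups
}
```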

@liurenjie1024
Contributor

I think you mean an Arrow RecordBatch? Could you elaborate on the use case for this method?

@Fokko
Contributor Author

Fokko commented Nov 28, 2024

I think you mean an Arrow RecordBatch?

Yes :)

This could be a broader discussion on where the responsibilities lie between iceberg-rust and the query engine.

On the read side, Tasks are passed to the query engine, which I think is a nice and clean boundary between the engine and the library. I would love to have a similar API for writes. For example, a table is passed in, and iceberg-rust does all the checks to make sure that the input is compatible. It could even make the table compatible, e.g. by applying schema evolution when needed; this is very easy using UnionByName. Based on the input (an Arrow Table or equivalent), and similar to the read path, the library comes up with a set of write tasks that are passed back to the query engine, which writes out the files and returns the DataFiles with all the statistics and such.
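To make the idea concrete, here is a rough sketch of what such a write path could look like. None of these types or functions exist in iceberg-rust today; they are purely illustrative.

```rust
use arrow::array::RecordBatch;

/// One unit of write work handed from the library to the query engine:
/// the rows already routed to a single target partition.
pub struct WriteTask {
    pub batch: RecordBatch,
    /// Encoded partition values for the target partition.
    pub partition_values: Vec<String>,
}

/// What the engine hands back after writing a file, analogous to an
/// Iceberg DataFile with its statistics.
pub struct WrittenFile {
    pub file_path: String,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

/// Planning step owned by the library: validate (and possibly evolve) the
/// schema, split the input by partition, and return the tasks. The engine
/// then writes the files and returns `WrittenFile`s to be committed.
pub fn plan_append(batches: Vec<RecordBatch>) -> Vec<WriteTask> {
    batches
        .into_iter()
        .map(|batch| WriteTask {
            batch,
            // Partition routing is omitted in this sketch.
            partition_values: Vec::new(),
        })
        .collect()
}
```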

The current focus is on appending DataFiles, which is reasonable for engines to take control over. As a next step we're also going to add delete operations. Here it gets more complicated, since the delete can sometimes be performed purely on Iceberg metadata (e.g. dropping a partition), but it can also be that certain Parquet files need to be rewritten. In that case, the old DataFile is dropped, and one or more DataFiles are added once the engine has rewritten the Parquet files, excluding the rows that need to be dropped.
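As a sketch of the two cases described above (again illustrative types only, not an existing iceberg-rust API):

```rust
/// Illustrative only: the two ways a delete can be resolved.
pub enum DeletePlan {
    /// The delete is expressed purely in Iceberg metadata, e.g. dropping a
    /// whole partition: the matching DataFiles are removed from the table.
    MetadataOnly { dropped_files: Vec<String> },
    /// Certain Parquet files must be rewritten by the engine without the
    /// deleted rows; the old DataFiles are dropped and the rewritten ones
    /// are added back.
    RewriteFiles { files_to_rewrite: Vec<String> },
}
```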

Thoughts?

@Fokko
Contributor Author

Fokko commented Nov 28, 2024

Let me close this one for now

Fokko closed this as completed on Nov 28, 2024
@liurenjie1024
Contributor

I think @ZENOTME did some solid work on appending data files, but currently we are missing a transaction API for adding data files.

@Fokko
Contributor Author

Fokko commented Nov 28, 2024

@liurenjie1024 I think this is what I meant by this issue: #342

but currently we are missing a transaction API for adding data files

That one has been merged this morning :)
