Logic to determine the partitions #728

Closed

Fokko opened this issue Nov 27, 2024 · 5 comments

Comments
@Fokko
Contributor

Fokko commented Nov 27, 2024

Given an Arrow Buffer, we want to apply a specific partition spec that determines the partitions to which the data is appended.
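For illustration, here is a minimal sketch of what "determining the partitions" could look like for a batch with a single Int32 identity-partition column: group the row indices by partition value so each group can be appended to its own partition. This is not iceberg-rust's API; the function name and signature are made up for this example.

```rust
use std::collections::HashMap;

use arrow::array::{Array, AsArray, RecordBatch};
use arrow::datatypes::Int32Type;

/// Group row indices by the value of a single Int32 identity-partition
/// column; each group would be appended to its own partition.
fn group_rows_by_partition(
    batch: &RecordBatch,
    partition_col: usize,
) -> HashMap<i32, Vec<usize>> {
    let col = batch.column(partition_col).as_primitive::<Int32Type>();
    let mut groups: HashMap<i32, Vec<usize>> = HashMap::new();
    for row in 0..batch.num_rows() {
        // Rows with a NULL partition value are skipped in this sketch.
        if col.is_valid(row) {
            groups.entry(col.value(row)).or_default().push(row);
        }
    }
    groups
}
```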

@liurenjie1024
Contributor

I think you mean an Arrow RecordBatch? Could you elaborate on the use case for this method?

@Fokko
Contributor Author

Fokko commented Nov 28, 2024

I think you mean an Arrow RecordBatch?

Yes :)

This could be a broader discussion on where the responsibilities lie between iceberg-rust and the query engine.

On the read side, Tasks are passed to the query engine, which I think is a nice and clean boundary between the engine and the library. I would love to have a similar API for writes. For example, a table is passed in, and iceberg-rust does all the checks to make sure that the input is compatible. It could even make the table compatible, e.g. by applying schema evolution when needed; this is very easy using UnionByName. Based on the input (an Arrow Table or equivalent), and similar to the read path, the library comes up with a set of write tasks that are passed back to the query engine, which writes out the files and returns the DataFiles with all the statistics and such.
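To make the idea concrete, here is a rough sketch of what such a write path could look like. None of these types or functions exist in iceberg-rust today; they are purely illustrative.

```rust
use arrow::array::RecordBatch;

/// One unit of write work handed from the library to the query engine:
/// the rows already routed to a single target partition.
pub struct WriteTask {
    pub batch: RecordBatch,
    /// Encoded partition values for the target partition.
    pub partition_values: Vec<String>,
}

/// What the engine hands back after writing a file, analogous to an
/// Iceberg DataFile with its statistics.
pub struct WrittenFile {
    pub file_path: String,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

/// Planning step owned by the library: validate (and possibly evolve) the
/// schema, split the input by partition, and return the tasks. The engine
/// then writes the files and returns `WrittenFile`s to be committed.
pub fn plan_append(batches: Vec<RecordBatch>) -> Vec<WriteTask> {
    batches
        .into_iter()
        .map(|batch| WriteTask {
            batch,
            // Partition routing is omitted in this sketch.
            partition_values: Vec::new(),
        })
        .collect()
}
```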

The current focus is on appending DataFiles, which is reasonable for engines to take control over. As a next step we're also going to add delete operations. Here it gets more complicated, since the delete can sometimes be performed purely on Iceberg metadata (e.g. dropping a partition), but it can also be that certain Parquet files need to be rewritten. In that case, the old DataFile is dropped, and one or more DataFiles are added once the engine has rewritten the Parquet files, excluding the rows that need to be dropped.
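As a sketch of the two cases described above (again illustrative types only, not an existing iceberg-rust API):

```rust
/// Illustrative only: the two ways a delete can be resolved.
pub enum DeletePlan {
    /// The delete is expressed purely in Iceberg metadata, e.g. dropping a
    /// whole partition: the matching DataFiles are removed from the table.
    MetadataOnly { dropped_files: Vec<String> },
    /// Certain Parquet files must be rewritten by the engine without the
    /// deleted rows; the old DataFiles are dropped and the rewritten ones
    /// are added back.
    RewriteFiles { files_to_rewrite: Vec<String> },
}
```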

Thoughts?

@Fokko
Contributor Author

Fokko commented Nov 28, 2024

Let me close this one for now

Fokko closed this as completed on Nov 28, 2024
@liurenjie1024
Contributor

I think @ZENOTME did some solid work on appending data files, but currently we are missing a transaction API for adding data files.

@Fokko
Contributor Author

Fokko commented Nov 28, 2024

@liurenjie1024 I think this is what I meant by this issue: #342

but currently we are missing a transaction API for adding data files

That one has been merged this morning :)
