Predicate Pushdown for scans #2

rustyconover · 2023-07-07T03:11:12Z

It currently appears that predicate pushdown isn't applied before calculating what Parquet files are included in scanning the table's data.

The AVRO manifest files include statistics about each Parquet file. Those should be leveraged to reduce the number of files that need to be read. Additionally there is the byte offset for each row group in each Parquet file. That can prevent the need to read the footer of the Parquet field and allow better concurrent scanning.

samansmink · 2023-07-17T13:22:42Z

@rustyconover Hey thanks for all the input! All your submitted issues are all very valid points and are definitely on our to-do list in getting this extension beyond its initial proof-of-concept state!

rustyconover · 2023-07-17T13:56:06Z

Thanks Sam I look forward to contributing

…

On Mon, Jul 17, 2023 at 09:22 Sam Ansmink ***@***.***> wrote: @rustyconover <https://github.com/rustyconover> Hey thanks for all the input! All your submitted issues are all very valid points and are definitely on our to-do list in getting this extension beyond its initial proof-of-concept state! — Reply to this email directly, view it on GitHub <#2 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFSWJJXOMZXPPBO26U2J73XQU4CZANCNFSM6AAAAAA2BHSCTE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Fokko · 2023-07-18T13:32:38Z

How we do this on Iceberg (both Java and Python) is using visitors.

When a query is executed against a snapshot, that has a reference to a ManifestList, which references one or more Manifests. The manifests contain information that can be used to prune datafiles.

RewriteNot: To simplify the expression, we need to make sure that we rewrite the not predicates. For example, not a = b, will be rewritten to a != b (Python). In PyIceberg this can be done both on a bound and an unbound predicate (binding it to a schema).
BindSchema: binds the predicate to the field (Python]. It will lookup the field using the field name.
InclusiveMetricsVisitor: We first construct an InclusiveMetricsEvaluator (Java, Python). Because it is inclusive, it will return true if the file may match. There is also a very extensive test-suite from Java that is highly recommended to port as well since we don't want to skip files that may contain relevant data. In Python, we decided to return a function that accepts a manifest_entry (that carries all the statistics, such as value_count, null_count, nan_counts, lower_bound, upper_bound). If this evaluates to MIGHT_MATCH, it is included in the scan planning.

dev-goyal · 2024-01-06T02:29:31Z

Hey guys! Just wanted to check in and see if there might be updates on this - i.e., is this still prioritized? I wonder if it might be possible to have an intermediate solution in the meantime, where similar to read_parquet(), a list of files (both data and delete, obtained using PyIceberg say) may be passed to iceberg_scan(), greatly reducing the complexity of the scan.

rustyconover · 2024-04-10T02:08:51Z

@Fokko can you please update your links to the existing Java and Python implementations since the references have moved when the Iceberg repos were split out?

Fokko · 2024-04-12T19:37:58Z

@rustyconover Certainly, updated!

rustyconover · 2024-04-13T15:54:57Z

Hi @samansmink and @Fokko,

I spent a few hours looking at this extension, DuckDB itself and the Iceberg spec to implement a plan for this. Right now, I want to query Iceberg tables that consist of 100,000+ of parquet files that are highly partitioned, but without DuckDB supporting partitions and data file metrics it will be impossible. So I'm trying to figure out how to make this happen.

Here is what I've noticed:

TableFunction's bind_replace prevents access to query filters.

The use of the bind_replace function for iceberg_scan(), means that the init calls that contain the filter information for the query are never called. The bind_replace function currently read the Iceberg metadata then substitutes itself in the query plan with a call to parquet_scan() or a sub query that applies the delete file and still calls parquet_scan().

I don't see a way to get access to the DuckDB query filters since they are passed in the "init" phase of a TableFunction rather than the bind stage. Once the function has been changed to a parquet_scan(), it appears the init function of the iceberg_scan() is never called.

@samansmink What do you think is a reasonable approach to enable iceberg_scan() to enable access to the DuckDB query filter's structure? If we have access to the requested filters at the same time we are reading the manifest and determining the list of Parquet files to scan, we could keep using parquet_scan(). @Mytherin did I overlook the ability to obtain the query filters in the context of the bind function of a TableFunction?

A thought I had would be populating the metrics from the Iceberg manifests into the existing statistics infrastructure of parquet_scan(), but that would require a lot of reworking and could lead to a lot of memory usage.

Manifest metrics and partition information

It appears the manifest metrics and partition information aren't being parsed from the Avro files. This seems to just require some work to the generated code for parsing the Avro files. This can happen in parallel to the other improvements.

rustyconover · 2024-04-24T12:21:31Z

Seems like some progress for the necessary DuckDB infrastructure is being started here: duckdb/duckdb#11806

peterboncz · 2024-05-21T22:44:26Z

@rustyconover: AFAIK the filter-pushdown with iceberg gets triggered.

Note pushdown happens

in bind (the "complex-filter" mechanism for hive-partitioned file collections) and
during query optimization (pushdown of (non)equality and range predicates).

And note that immediately after bind_replace, the generated replaced piece of query plan is bound. So then, hive-partitioned file pruning based on pushdown happens.

However, when iceberg support was initially released, the second mechanism (in the query optimization) did suffer from absence of a pushdown rewrite-rule for ANTI-joins. Note that bind_replace generates an anti-join between the main data files and the files with deleted row-ids. But, late 2023 these rules were implemented, so this should now work.

rustyconover · 2024-05-22T19:00:28Z

Hi @peterboncz,

Thanks for replying.

The SQL query filter pushdown does get passed to the parquet_scan() nodes created by iceberg_scan(). To apply the query filters, DuckDB reads the footers from each Parquet file.

When an Iceberg table has millions of files, reading all these footers takes a long time. Iceberg addressed this issue by adding column metrics in the manifest files and detailed "Scan Planning" in their spec:

https://iceberg.apache.org/spec/#scan-planning

I'm looking forward to DuckDB using Iceberg's statistics to plan the scan of the table, right now the current code includes every file in the manifest. By reading the manifest files and using their info to plan a scan, scanning an Iceberg table with a million files can be reduced to scanning just 1 or 2 files especially if the query is referencing a small number of partition predicates.

Fokko · 2024-05-27T21:17:58Z

Thanks for chiming in here @peterboncz. I believe @rustyconover is right here. #8 focuses on the partitioning (called hidden-partitioning in Iceberg that's slightly different than Hive-style partitioning).

Next to that, Iceberg had on a file-level basis additional statistics. For example, there are upper- and lower-bounds to easily skip files without having to read the footer of the Parquet file. But I don't see the upper_bound field being referenced in the code: https://github.com/search?q=repo%3Aduckdb%2Fduckdb_iceberg+upper_bound&type=code

JichaoS · 2024-07-11T20:11:10Z

Hey @rustyconover it looks like the blocker you mentioned: duckdb/duckdb#11806 got merged. Do you think the current status ready for implementation?

febinsathar · 2024-09-30T12:39:26Z

@rustyconover are you planning to pick this up? thanks

rustyconover · 2024-09-30T12:40:59Z

Not in the short term.

mike-luabase · 2024-10-02T15:26:02Z

@febinsathar I have a version partially working here

febinct · 2024-12-05T10:09:01Z

#72 finally will this get merged ? @mike-luabase or any blocker left

harel-e mentioned this issue Jun 4, 2024

Question: how to speed up iceberg reads #55

Open

mike-luabase mentioned this issue Oct 9, 2024

Add predicate pushdown #72

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Predicate Pushdown for scans #2

Predicate Pushdown for scans #2

rustyconover commented Jul 7, 2023

samansmink commented Jul 17, 2023

rustyconover commented Jul 17, 2023 via email

Fokko commented Jul 18, 2023 •

edited

Loading

dev-goyal commented Jan 6, 2024

rustyconover commented Apr 10, 2024

Fokko commented Apr 12, 2024

rustyconover commented Apr 13, 2024

rustyconover commented Apr 24, 2024

peterboncz commented May 21, 2024

rustyconover commented May 22, 2024

Fokko commented May 27, 2024

JichaoS commented Jul 11, 2024

febinsathar commented Sep 30, 2024

rustyconover commented Sep 30, 2024

mike-luabase commented Oct 2, 2024

febinct commented Dec 5, 2024 •

edited

Loading

Predicate Pushdown for scans #2

Predicate Pushdown for scans #2

Comments

rustyconover commented Jul 7, 2023

samansmink commented Jul 17, 2023

rustyconover commented Jul 17, 2023 via email

Fokko commented Jul 18, 2023 • edited Loading

dev-goyal commented Jan 6, 2024

rustyconover commented Apr 10, 2024

Fokko commented Apr 12, 2024

rustyconover commented Apr 13, 2024

TableFunction's bind_replace prevents access to query filters.

Manifest metrics and partition information

rustyconover commented Apr 24, 2024

peterboncz commented May 21, 2024

rustyconover commented May 22, 2024

Fokko commented May 27, 2024

JichaoS commented Jul 11, 2024

febinsathar commented Sep 30, 2024

rustyconover commented Sep 30, 2024

mike-luabase commented Oct 2, 2024

febinct commented Dec 5, 2024 • edited Loading

Fokko commented Jul 18, 2023 •

edited

Loading

febinct commented Dec 5, 2024 •

edited

Loading