Aggregate statistics across multiple blobs for filter optimization#2950
Conversation
The ask may not have been clear; I meant correlated filters on joins. If we are joining on a field and we have statistics about the range of values in each column, we can create filters on the other table to reduce reads. For `a JOIN b ON a.id = b.id`, if `a.id` has the range 1 to 10, we can filter `b.id` to that range on read. We partially do this now, but not consistently.
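The idea above can be sketched in a few lines. This is purely illustrative (the function name and predicate format are hypothetical, not Opteryx's actual optimizer internals):

```python
# Sketch: derive a filter for the probe side of an equi-join from the
# build side's column statistics. Names here are illustrative only.

def correlated_filter(join_column: str, lower, upper) -> str:
    """Build a predicate for the other side of an equi-join,
    given the min/max observed for the join key on one side."""
    return f"{join_column} BETWEEN {lower} AND {upper}"

# a JOIN b ON a.id = b.id, where stats say a.id spans 1..10:
predicate = correlated_filter("b.id", 1, 10)
print(predicate)  # b.id BETWEEN 1 AND 10
```

Pushing that predicate into the read of `b` means blobs whose `b.id` range falls entirely outside 1..10 never need to be scanned.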
@copilot I added a comment but forgot to @ you
Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>
@copilot Iceberg, and I believe on second read the blob stores, should have statistics we can use. I think the SQL connector has statistics too.
Yes, that's exactly what this PR enables. The implementation aggregates statistics (including min/max bounds) from all blobs, which then flow through to the optimizer. Before this change, only single-blob datasets had statistics available for the optimization.
The blob store connectors (AWS S3, GCP Cloud Storage, Disk) are already updated in this PR to aggregate statistics across multiple blobs. The Iceberg connector already aggregates statistics correctly; it loops through all files, calling `merge()` on each. The SQL connector collects statistics from database metadata (a single source), so it doesn't need multi-blob aggregation logic.
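The per-blob aggregation described here can be sketched as follows. This is an illustrative plain-Python structure, not the actual `RelationStatistics` Cython class in `opteryx/compiled/structures/relation_statistics.pyx`:

```python
# Sketch of merging per-blob column statistics into one relation-level
# summary: lower bounds take the min, upper bounds take the max, and
# record counts sum. Illustrative structure only.

class RelationStats:
    def __init__(self):
        self.record_count = 0
        self.lower_bounds: dict[bytes, int] = {}
        self.upper_bounds: dict[bytes, int] = {}

    def merge(self, other: "RelationStats") -> None:
        """Fold another blob's statistics into this summary."""
        self.record_count += other.record_count
        for col, lo in other.lower_bounds.items():
            self.lower_bounds[col] = min(self.lower_bounds.get(col, lo), lo)
        for col, hi in other.upper_bounds.items():
            self.upper_bounds[col] = max(self.upper_bounds.get(col, hi), hi)

# Two blobs with different ranges for the same column:
blob_a, blob_b = RelationStats(), RelationStats()
blob_a.record_count, blob_a.lower_bounds, blob_a.upper_bounds = 100, {b"id": 1}, {b"id": 10}
blob_b.record_count, blob_b.lower_bounds, blob_b.upper_bounds = 50, {b"id": 5}, {b"id": 25}

blob_a.merge(blob_b)
print(blob_a.record_count, blob_a.lower_bounds[b"id"], blob_a.upper_bounds[b"id"])
# 150 1 25
```

Note the bytes keys (`b"id"`): as discussed later in the thread, this keying convention is what the `map_statistics()` lookup has to match.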
@copilot can you add in tests that the filters are being created? These should be very simple tests, because a lot of things work on filters, so tests may be brittle if they try to be too clever. You probably want to run a query and then do text searches of the executed plan to test.
Added tests for correlated filter optimization in commit f16b7ee:
Both tests follow the pattern of running a query and checking the stats dictionary, keeping them simple and non-brittle as requested.
@copilot we have two regression failures:

```
FAILED tests/unit/planner/test_correlated_filters_multi_blob.py::test_correlated_filters_on_join - AssertionError: Correlated filter optimization was not invoked. Stats: {'blobs_read': 2, 'bytes_processed': 3547, 'columns_read': 2, 'feature_bloom_filter': 1, 'optimization_inner_join_smallest_table_left': 1, 'optimization_predicate_pushdown_into_join': 1, 'optimization_remove_redundant_operators_project': 1, 'rows_eliminated_by_bloom_filter': 0, 'rows_read': 186, 'time_bloom_filtering': 0.000130792, 'time_build_bloom_filter': 6.525e-05, 'time_exit': 1.3041e-05, 'time_nested_loop_join': 0.000412209, 'time_planning': 0.005181, 'time_planning_ast_rewriter': 0.000102167, 'time_planning_binder': 0.003604542, 'time_planning_optimizer': 0.000719709, 'time_planning_physical_planner': 6.2084e-05, 'time_planning_sql_rewriter': 0.000288417, 'time_read': 0.004528584, 'time_reading_blobs': 0.004470584, 'time_total': 0.01044, 'version': '0.26.2-beta.1907', 'messages': [], 'executed_plan': '+- EXIT\n   +- NESTED LOOP JOIN (s.planetId = p.id) [name, name]\n      |- READ ($satellites AS s) [planetId, name]\n      +- READ ($planets AS p) [id, name]\n'}
```
Fixed in commit 6304a3a. The issue was that `map_statistics()` looked up statistics with string column names, while `RelationStatistics` stores its keys as bytes. The fix encodes column names to bytes before looking them up in the statistics dictionaries.
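The bytes-vs-str key mismatch is easy to reproduce in isolation. A minimal sketch (the dictionary and function names are illustrative, not the actual `map_statistics()` code):

```python
# Sketch of the fix: statistics dictionaries are keyed by bytes
# (e.g. b'planetId'), so lookups must encode the column name first.
# Names are illustrative.

lower_bounds = {b"planetId": 1}
upper_bounds = {b"planetId": 10}

def bounds_for(column_name: str):
    key = column_name.encode()  # the fix: str -> bytes before lookup
    return lower_bounds.get(key), upper_bounds.get(key)

# Before the fix, the str key silently found nothing:
print(lower_bounds.get("planetId"))  # None
# After encoding, the bounds are found:
print(bounds_for("planetId"))  # (1, 10)
```

A silent `None` rather than a `KeyError` is why this manifested as "optimization not invoked" instead of a crash: with no bounds, no correlated filter could be built.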
@copilot can you see the test failures? |
I don't have access to the CI logs or test output. Could you share the specific error messages or test failures you're seeing? That will help me identify and fix the issues.
Plan: Use stats to create filters for multiple blobs
Implementation Complete ✅
Changes Made
Core Implementation:
1. `opteryx/compiled/structures/relation_statistics.pyx`: added a `merge()` method with proper parameter documentation

Connector Updates - aggregate stats from all blobs:

2. `aws_s3_connector.py`, `gcp_cloudstorage_connector.py`, `disk_connector.py`, `file_connector.py`

3. `opteryx/connectors/capabilities/statistics.py`: updated `map_statistics()` to encode column names to bytes when looking up in statistics dictionaries

Tests:
4. `tests/unit/core/test_relation_statistics.py`: unit tests for merge functionality

5. `tests/unit/planner/test_correlated_filters_multi_blob.py`: test that correlated filters are created on joins

6. `tests/unit/planner/test_optimizations_invoked.py`: added test case for `optimization_inner_join_correlated_filter`

Bug Fix
The regression was caused by a mismatch between string column names and bytes keys in statistics dictionaries. The `RelationStatistics` class stores bounds/counts with bytes keys (e.g., `b'planetId'`), but `map_statistics()` was using string column names to look them up, resulting in `None` values. This prevented correlated filter optimization from working.

Testing Results
Benefits
Enables the `correlated_filters` optimization to work with multi-blob datasets and virtual datasets, improving query performance on joins over partitioned/multi-file datasets through better predicate pushdown.