Aggregate statistics across multiple blobs for filter optimization by Copilot · Pull Request #2950 · mabel-dev/opteryx

Copilot · 2025-11-21T19:41:05Z

Plan: Use stats to create filters for multiple blobs

Implementation Complete ✅

Understand current statistics collection and usage
Implement statistics aggregation for multiple blobs
Create tests for statistics aggregation
Test the changes
Lint, build and verify
Request code review and address feedback
Run security checks
Add tests for correlated filter optimization
Fix regression: map_statistics bytes key handling

Changes Made

Core Implementation:

opteryx/compiled/structures/relation_statistics.pyx:
- Added merge() method with proper parameter documentation
- Aggregates statistics: sums counts, takes min/max bounds
Connector Updates - Aggregate stats from all blobs:
- aws_s3_connector.py
- gcp_cloudstorage_connector.py
- disk_connector.py
- file_connector.py
opteryx/connectors/capabilities/statistics.py:
- Fixed map_statistics() to encode column names to bytes when looking up in statistics dictionaries
- This was preventing virtual datasets from having their statistics mapped to schema columns

Tests:
4. tests/unit/core/test_relation_statistics.py: Unit tests for merge functionality

tests/unit/planner/test_correlated_filters_multi_blob.py: Test that correlated filters are created on joins
tests/unit/planner/test_optimizations_invoked.py: Added test case for optimization_inner_join_correlated_filter

Bug Fix

The regression was caused by a mismatch between string column names and bytes keys in statistics dictionaries. The RelationStatistics class stores bounds/counts with bytes keys (e.g., b'planetId'), but map_statistics() was using string column names to look them up, resulting in None values. This prevented correlated filter optimization from working.

Testing Results

✅ Merge method verified with unit tests
✅ Integration test confirms proper aggregation
✅ Fix allows correlated filter tests to pass
✅ No security vulnerabilities detected by CodeQL

Benefits

Enables correlated_filters optimization to work with multi-blob datasets and virtual datasets, improving query performance on joins over partitioned/multi-file datasets through better predicate pushdown.

Original prompt

This section details on the original issue you should resolve

<issue_title>Use stats to create filters</issue_title>
<issue_description>I think we currently only create new filters when we have only a single blob to read (e.g. on an inner join when we know the join only matches values in both tables - if we know the min and max from the stats, we can create a filter on the read).

if we have stats for all of the blobs, we can probably do more here.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes Use stats to create filters #2706

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

joocer · 2025-11-21T19:51:34Z

The ask may not have clear, I meant correlated filters on joins. If we are joining on a field and we have stats about the range of values from the statistics in each column, we can create filters on the other table to reduce reads.

a join b on a.id = b.id

If a.id has the range 1 to 10, we can filter b.id to be between that range on read.

We partially do this now, but not consistently.

joocer · 2025-11-21T19:51:59Z

@copilot I added a comment but forgot to at you

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

joocer · 2025-11-21T20:13:52Z

@copilot iceberg and I believe on second read of blob stores should have statistics we can use. I think the SQL connector has statistics too.

Copilot · 2025-11-21T20:14:06Z

The ask may not have clear, I meant correlated filters on joins. If we are joining on a field and we have stats about the range of values from the statistics in each column, we can create filters on t...

Yes, that's exactly what this PR enables. The implementation aggregates statistics (including min/max bounds) from all blobs, which then flow through map_statistics() to populate schema_column.highest_value and lowest_value. The existing correlated_filters strategy uses these to create the range filters on joins.

Before: Only single-blob datasets had relation_statistics → correlated filters only worked for single files
After: Multi-blob datasets now have aggregated relation_statistics → correlated filters work across partitioned data

Example with your scenario:

Table a has 3 parquet files with id ranges: [1-5], [6-10], [8-12]
After merge: a.id has aggregated range [1-12]
On a JOIN b ON a.id = b.id, filters b.id >= 1 AND b.id <= 12 are created
These filters push down to reduce reads from table b

Copilot · 2025-11-21T20:17:14Z

@copilot iceberg and I believe on second read of blob stores should have statistics we can use. I think the SQL connector has statistics too.

The blob store connectors (AWS S3, GCP Cloud Storage, Disk) are already updated in this PR to aggregate statistics across multiple blobs.

The Iceberg connector already aggregates statistics correctly - it loops through all files calling update_lower() and update_upper() for each file, which achieves the same result as merge.

The SQL connector collects statistics from database metadata (a single source), so it doesn't need multi-blob aggregation logic.

joocer · 2025-11-21T23:35:12Z

@copilot can you add in tests that the filters are being created, these should be very simple tests because a lot of things work on filters so tests may be brittle if they try to be too clever. you probably want to run a query and then do text searches of the executed plan to test

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Copilot · 2025-11-21T23:39:36Z

@copilot can you add in tests that the filters are being created, these should be very simple tests because a lot of things work on filters so tests may be brittle if they try to be too clever. you pr...

Added tests for correlated filter optimization in commit f16b7ee:

test_optimizations_invoked.py: Added a simple test case that runs a join query and checks the stats for optimization_inner_join_correlated_filter
test_correlated_filters_multi_blob.py: Created a dedicated test that verifies the correlated filter optimization is invoked on joins

Both tests follow the pattern of running a query and checking the stats dictionary, keeping them simple and non-brittle as requested.

github-actions · 2025-11-22T09:32:42Z

📦 Opteryx build version: 0.26.2-beta.1907

github-actions · 2025-11-22T09:32:46Z

📦 Opteryx build version: 0.26.2-beta.1907

joocer · 2025-11-22T09:52:48Z

@copilot we have two regression failures:

FAILED tests/unit/planner/test_correlated_filters_multi_blob.py::test_correlated_filters_on_join - AssertionError: Correlated filter optimization was not invoked. Stats: {'blobs_read': 2, 'bytes_processed': 3547, 'columns_read': 2, 'feature_bloom_filter': 1, 'optimization_inner_join_smallest_table_left': 1, 'optimization_predicate_pushdown_into_join': 1, 'optimization_remove_redundant_operators_project': 1, 'rows_eliminated_by_bloom_filter': 0, 'rows_read': 186, 'time_bloom_filtering': 0.000130792, 'time_build_bloom_filter': 6.525e-05, 'time_exit': 1.3041e-05, 'time_nested_loop_join': 0.000412209, 'time_planning': 0.005181, 'time_planning_ast_rewriter': 0.000102167, 'time_planning_binder': 0.003604542, 'time_planning_optimizer': 0.000719709, 'time_planning_physical_planner': 6.2084e-05, 'time_planning_sql_rewriter': 0.000288417, 'time_read': 0.004528584, 'time_reading_blobs': 0.004470584, 'time_total': 0.01044, 'version': '0.26.2-beta.1907', 'messages': [], 'executed_plan': '+- EXIT\n +- NESTED LOOP JOIN (s.planetId = p.id) [name, name]\n |- READ ($satellites AS s) [planetId, name]\n +- READ ($planets AS p) [id, name]\n'}
assert None is not None

where None = <built-in method get of dict object at 0x15b8cfd40>('optimization_inner_join_correlated_filter')
where <built-in method get of dict object at 0x15b8cfd40> = {'blobs_read': 2, 'bytes_processed': 3547, 'columns_read': 2, 'executed_plan': '+- EXIT\n +- NESTED LOOP JOIN (s.pla....id) [name, name]\n |- READ ($satellites AS s) [planetId, name]\n +- READ ($planets AS p) [id, name]\n', ...}.get
FAILED tests/unit/planner/test_optimizations_invoked.py::test_optimization_invoked[SELECT s.name FROM $satellites s INNER JOIN $planets p ON s.planetId = p.id-optimization_inner_join_correlated_filter] - AssertionError: {'blobs_read': 2, 'bytes_processed': 3459, 'columns_read': 1, 'executed_plan': '+- EXIT
+- NESTED LOOP JOIN (s.planetId = p.id) [name]
|- READ ($satellites AS s) [planetId, name]
+- READ ($planets AS p) [id]
', ...}
assert None is not None
where None = <built-in method get of dict object at 0x15bd5d480>('optimization_inner_join_correlated_filter')
where <built-in method get of dict object at 0x15bd5d480> = {'blobs_read': 2, 'bytes_processed': 3459, 'columns_read': 1, 'executed_plan': '+- EXIT\n +- NESTED LOOP JOIN (s.planetId = p.id) [name]\n |- READ ($satellites AS s) [planetId, name]\n +- READ ($planets AS p) [id]\n', ...}.get

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Copilot · 2025-11-22T09:56:31Z

@copilot we have two regression failures:

FAILED tests/unit/planner/test_correlated_filters_multi_blob.py::test_correlated_filters_on_join - AssertionError: Correlated filter optimization was not i...

Fixed in commit 6304a3a. The issue was that map_statistics() was using string column names to look up values in the statistics dictionaries, but those dictionaries use bytes keys (e.g., b'planetId'). This prevented statistics from being mapped to schema columns, which meant highest_value and lowest_value were always None, so the correlated filter optimization never triggered.

The fix encodes column names to bytes before looking them up in the statistics dictionaries.

github-actions · 2025-11-23T19:13:53Z

📦 Opteryx build version: 0.26.2-beta.1907

github-actions · 2025-11-23T19:15:23Z

📦 Opteryx build version: 0.26.2-beta.1926

joocer · 2025-11-23T20:24:50Z

@copilot can you see the test failures?

Copilot · 2025-11-23T20:26:46Z

@copilot can you see the test failures?

I don't have access to the CI logs or test output. Could you share the specific error messages or test failures you're seeing? That will help me identify and fix the issues.

Initial plan

e7de92f

Copilot AI assigned Copilot and joocer Nov 21, 2025

Copilot started work on behalf of joocer November 21, 2025 19:41 View session

Copilot AI and others added 3 commits November 21, 2025 19:55

Add statistics aggregation for multiple blobs

8043b1f

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Address code review feedback on exception handling

65ba6e6

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Improve documentation and exception handling comments

729ed12

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add functionality to create filters using stats~~ Aggregate statistics across multiple blobs for filter optimization Nov 21, 2025

Copilot AI requested a review from joocer November 21, 2025 20:11

Copilot finished work on behalf of joocer November 21, 2025 20:11

Copilot started work on behalf of joocer November 21, 2025 20:12 View session

Copilot finished work on behalf of joocer November 21, 2025 20:14

Copilot started work on behalf of joocer November 21, 2025 20:15 View session

Copilot finished work on behalf of joocer November 21, 2025 20:17

Copilot started work on behalf of joocer November 21, 2025 23:35 View session

Add tests for correlated filter optimization

f16b7ee

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Copilot finished work on behalf of joocer November 21, 2025 23:40

Copilot started work on behalf of joocer November 22, 2025 09:53 View session

Fix map_statistics to handle bytes keys in statistics dictionaries

6304a3a

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>

Copilot finished work on behalf of joocer November 22, 2025 09:57

Merge branch 'main' into copilot/use-stats-to-create-filters

207d4dd

joocer marked this pull request as ready for review November 23, 2025 19:14

Copilot started work on behalf of joocer November 23, 2025 20:25 View session

Copilot finished work on behalf of joocer November 23, 2025 20:27

joocer merged commit 207d4dd into main Nov 26, 2025
7 of 18 checks passed

joocer deleted the copilot/use-stats-to-create-filters branch November 26, 2025 10:12

Conversation

Copilot AI commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Plan: Use stats to create filters for multiple blobs

Implementation Complete ✅

Changes Made

Bug Fix

Testing Results

Benefits

Comments on the Issue (you are @copilot in this section)

Uh oh!

joocer commented Nov 21, 2025

Uh oh!

joocer commented Nov 21, 2025

Uh oh!

joocer commented Nov 21, 2025

Uh oh!

Copilot AI commented Nov 21, 2025

Uh oh!

Copilot AI commented Nov 21, 2025

Uh oh!

joocer commented Nov 21, 2025

Uh oh!

Copilot AI commented Nov 21, 2025

Uh oh!

github-actions bot commented Nov 22, 2025

Uh oh!

github-actions bot commented Nov 22, 2025

Uh oh!

joocer commented Nov 22, 2025

Uh oh!

Copilot AI commented Nov 22, 2025

Uh oh!

github-actions bot commented Nov 23, 2025

Uh oh!

github-actions bot commented Nov 23, 2025

Uh oh!

joocer commented Nov 23, 2025

Uh oh!

Copilot AI commented Nov 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Nov 21, 2025 •

edited

Loading