Summary
When executing a SQL query that selects only coordinate columns with a LIMIT clause, the Rust CLI returns fewer rows than expected. The issue appears to be that LIMIT is applied to the DictionaryArray’s unique values rather than the expanded Cartesian product rows.
Bug revealed in #9
Steps to Reproduce
- Generate test data:
./scripts/generate_data.sh
- Run the following query via zarr-cli:
CREATE EXTERNAL TABLE data STORED AS ZARR LOCATION 'data/synthetic_v3.zarr'
SELECT lat FROM data LIMIT 11
- Expected: 11 rows (from the 700-row Cartesian product: 7 time × 10 lat × 10 lon)
- Actual: 10 rows (the number of unique lat values in the DictionaryArray)
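The expected count can be checked with a small model of the flattened table. This is only a sketch: the dimension sizes come from the report, while the integer coordinate values and the row order (time, then lat, then lon, with lon varying fastest) are assumptions made for illustration.

```python
from itertools import product

# Dimension sizes are from the report; the integer values are placeholders.
times = range(7)
lats = range(10)
lons = range(10)

# Assumption: rows are flattened in time, lat, lon order (lon varies fastest).
rows = list(product(times, lats, lons))

limit = 11
# Project the lat coordinate from the first `limit` rows of the full table.
lat_column = [lat for _, lat, _ in rows[:limit]]

print(len(rows))        # 700 rows in the flattened table
print(len(lat_column))  # 11 rows expected from LIMIT 11
print(lat_column)       # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Under this ordering the first 11 rows contain repeated lat values, so a correct result cannot be derived from the 10 unique values alone.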
Minimal Reproduction
# Build CLI
cargo build --bin zarr-cli
# Run query
echo "CREATE EXTERNAL TABLE data STORED AS ZARR LOCATION 'data/synthetic_v3.zarr'
SELECT lat FROM data LIMIT 11
quit" | ./target/debug/zarr-cli
Analysis
The Rust implementation uses 'DictionaryArray' for coordinate columns like 'lat', 'lon', and 'time'. When a query selects only coordinate columns:
- 'lat' has 10 unique dictionary values [0, 1, 2, ..., 9]
- The full table has 700 rows where each lat value appears 70 times
- 'SELECT lat FROM data LIMIT 11' should return 11 rows
The LIMIT appears to be incorrectly applied to the dictionary values (10 unique) rather than the expanded index array (700 rows).
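The suspected mix-up can be illustrated with a toy dictionary-encoded column. This is a minimal sketch, not the Arrow API or the project's code: `DictColumn`, `limit_rows`, and `limit_dictionary` are hypothetical names, and the key order is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DictColumn:
    """Toy stand-in for a dictionary-encoded column (not the Arrow API)."""
    values: list  # unique dictionary values
    keys: list    # one index into `values` per logical row

    def __len__(self):
        # The logical length is the number of rows, not of unique values.
        return len(self.keys)

    def to_list(self):
        return [self.values[k] for k in self.keys]


def limit_rows(col, n):
    """Intended behaviour: truncate the per-row keys, keep the dictionary."""
    return DictColumn(col.values, col.keys[:n])


def limit_dictionary(col, n):
    """Suspected faulty behaviour: truncate the dictionary values instead."""
    return col.values[:n]


# 10 unique lat values over 700 rows, each value appearing 70 times.
# Assumption: lon varies fastest, so the lat key changes every 10 rows.
lat = DictColumn(
    values=list(range(10)),
    keys=[(row // 10) % 10 for row in range(700)],
)

print(len(limit_rows(lat, 11)))        # 11
print(len(limit_dictionary(lat, 11)))  # 10
```

Slicing the keys yields the 11 rows the query asks for, while slicing the values can never yield more than 10, which matches the observed output.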
Workaround
Include at least one data variable column in the SELECT to force full row expansion:
SELECT lat, temperature FROM data LIMIT 11 -- Works correctly
Environment
- Found via hypothesis property-based testing
- Test file: 'python/tests/test_integration.py'
- Relevant Rust files: 'src/reader/zarr_reader.rs', 'src/physical_plan/zarr_exec.rs'