ORC RowSelection API Design

## Overview

This document describes the proposed RowSelection API for orc-rust, modeled after arrow-rs Parquet's implementation. This API will enable fine-grained row filtering at the decoder level, significantly reducing I/O and improving query performance.

## Motivation

Currently, ORC readers must decode all rows in a stripe, even if only a subset is needed. RowSelection allows:

- **Skip rows before decoding**: Avoid reading and decompressing data for filtered rows
- **Reduce I/O**: Skip data pages that contain only filtered rows  
- **Improve performance**: Lower CPU and memory usage by decoding only selected rows
- **Page-level pruning**: Leverage ORC row indexes to skip entire row groups

## API Design

### Core Types

```rust
/// A single selector specifying rows to select or skip
#[derive(Debug, Clone, Copy, Eq, PartialEq)]
pub struct RowSelector {
    /// The number of rows
    pub row_count: usize,
    
    /// If true, skip `row_count` rows
    pub skip: bool,
}

impl RowSelector {
    /// Select `row_count` rows
    pub fn select(row_count: usize) -> Self {
        Self {
            row_count,
            skip: false,
        }
    }
    
    /// Skip `row_count` rows
    pub fn skip(row_count: usize) -> Self {
        Self {
            row_count,
            skip: true,
        }
    }
}

/// A collection of [`RowSelector`] used to skip rows when scanning an ORC file
///
/// Invariants:
/// * Contains no [`RowSelector`] of 0 rows
/// * Consecutive [`RowSelector`]s alternate between skipping and selecting
#[derive(Debug, Clone, Default, Eq, PartialEq)]
pub struct RowSelection {
    selectors: Vec<RowSelector>,
}
```

## Usage Examples

### Example 1: Skip First N Rows

```rust
use orc_rust::{ArrowReaderBuilder, RowSelection, RowSelector};

let selection = RowSelection::from(vec![
    RowSelector::skip(10000),      // Skip first 10k rows
    RowSelector::select(1000000),  // Read next 1M rows
]);

let reader = ArrowReaderBuilder::try_new(file_data)?
    .with_row_selection(selection)
    .build();

for batch in reader {
    // Process only rows 10000-1010000
}
```

### Example 2: From Filter Results

```rust
use arrow::array::BooleanArray;
use orc_rust::RowSelection;

// After evaluating predicates against row indexes
let filter1 = BooleanArray::from(vec![true, false, true, false, true]);
let filter2 = BooleanArray::from(vec![false, true, true, false, true]);

let selection = RowSelection::from_filters(&[filter1, filter2]);

let reader = ArrowReaderBuilder::try_new(file_data)?
    .with_row_selection(selection)
    .build();
```

### Example 3: Combining with Stripe Filtering

```rust
use datafusion_orc::stripe_filter::StripeAccessPlanFilter;
use orc_rust::RowSelection;

// 1. First, filter stripes by statistics
let mut stripe_filter = StripeAccessPlanFilter::new(access_plan);
stripe_filter.prune_by_statistics(&schema, metadata, &predicate, &metrics);
let stripe_plan = stripe_filter.build();

// 2. For selected stripes, create row-level selection
let mut row_selections = Vec::new();
for stripe_idx in stripe_plan.stripe_indexes() {
    // Use ORC row index to create fine-grained selection
    let row_index = metadata.stripe_metadatas()[stripe_idx].row_index()?;
    let selection = evaluate_predicate_on_row_index(&row_index, &predicate)?;
    row_selections.push(selection);
}

// 3. Combine all selections
let combined = row_selections.into_iter()
    .fold(RowSelection::default(), |acc, sel| acc.union(&sel));

// 4. Read with both stripe and row filtering
let reader = ArrowReaderBuilder::try_new(file_data)?
    .with_row_groups(stripe_plan.stripe_indexes())  // Stripe filtering
    .with_row_selection(combined)                    // Row filtering
    .build();
```

### Example 4: Two-Stage Filtering

```rust
// Stage 1: Filter based on column A
let filter_a = evaluate_predicate_a(&row_index)?;  // BooleanArray
let selection_a = RowSelection::from_filters(&[filter_a]);

// Stage 2: Among selected rows, filter based on column B
let filter_b = evaluate_predicate_b(&row_index)?;  // BooleanArray
let selection_b = RowSelection::from_filters(&[filter_b]);

// Combine: only rows where both A and B match
let final_selection = selection_a.and_then(&selection_b);

let reader = ArrowReaderBuilder::try_new(file_data)?
    .with_row_selection(final_selection)
    .build();
```

## Implementation Phases

### Phase 1: Basic API (MVP)
- [x] Define `RowSelector` and `RowSelection` types
- [x] Implement `FromIterator` and basic conversions
- [x] Add `with_row_selection()` to `ArrowReaderBuilder`
- [x] Pass selection through to stripe decoder
- [x] Implement basic skip in decoders (decode and discard)

### Phase 2: Efficient Skipping
- [x] Implement efficient `skip_rows()` for each decoder type
- [x] Skip without decoding for:
  - Integer types (RLE skip)
  - Boolean (skip bit runs)
  - Strings (skip dictionary entries)
  - Timestamps, decimals, etc.

### Phase 3: Row Index Integration
- [x] Parse ORC row indexes
- [x] Implement `row_selection_from_row_index()`
- [x] Integrate with stripe filtering
- [ ] Add benchmarks

### Phase 4: Advanced Features
- [ ] `scan_ranges()` for page-level I/O planning
- [ ] `and_then()` for multi-stage filtering
- [ ] `intersection()` and `union()` operations
- [ ] Expand to batch boundaries for caching

## References

- **Parquet RowSelection**: `arrow-rs/parquet/src/arrow/arrow_reader/selection.rs`
- **ORC Specification**: https://orc.apache.org/specification/ORCv1/
- **ORC Row Index**: https://orc.apache.org/docs/indexes.html
- **DataFusion Pruning**: `datafusion/pruning/src/lib.rs`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ORC RowSelection API Design #58

Overview

Motivation

API Design

Core Types

Usage Examples

Example 1: Skip First N Rows

Example 2: From Filter Results

Example 3: Combining with Stripe Filtering

Example 4: Two-Stage Filtering

Implementation Phases

Phase 1: Basic API (MVP)

Phase 2: Efficient Skipping

Phase 3: Row Index Integration

Phase 4: Advanced Features

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ORC RowSelection API Design #58

Description

Overview

Motivation

API Design

Core Types

Usage Examples

Example 1: Skip First N Rows

Example 2: From Filter Results

Example 3: Combining with Stripe Filtering

Example 4: Two-Stage Filtering

Implementation Phases

Phase 1: Basic API (MVP)

Phase 2: Efficient Skipping

Phase 3: Row Index Integration

Phase 4: Advanced Features

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions