
Support "A column is known to be entirely NULL" in PruningPredicate #9171

Closed
@alamb

Is your feature request related to a problem or challenge?

This is broken out from #7869, which describes a slightly different problem.

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

Columns that contain only NULL occur in tables that have "schema evolution" -- for example, if you have two files such as:

File 1: col_a
File 2: col_a, col_b (col_b was added later)

A predicate like col_a != A AND col_b='bananas' cannot be true for File 1 (as col_b is logically NULL for all rows).

This is subtly but importantly different from the case where nothing is known about the column, which, confusingly, is encoded by returning NULL from PruningStatistics::min_values().

Describe the solution you'd like

  1. Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
  2. Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
  3. Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)
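
Steps 1 and 2 above could be sketched roughly as follows. This is a minimal illustration, not DataFusion's actual API: `ColumnCounts` and `is_entirely_null` are hypothetical names, and the real `PruningStatistics` methods return Arrow arrays rather than plain integers.

```rust
/// Hypothetical, simplified stand-in for the statistics of one column in one
/// container. `None` means the statistic is not known.
#[derive(Debug, Clone, Copy)]
pub struct ColumnCounts {
    /// What a null_counts()-style method would report for this container.
    pub null_count: Option<u64>,
    /// What the proposed row_counts()-style method would report.
    pub row_count: Option<u64>,
}

/// Returns true only when both counts are known and every row is NULL.
/// If either count is unknown, nothing can be concluded, so return false.
pub fn is_entirely_null(counts: ColumnCounts) -> bool {
    match (counts.null_count, counts.row_count) {
        (Some(nulls), Some(rows)) => nulls == rows,
        _ => false,
    }
}

/// Applies the check to every container, yielding one flag per container.
pub fn entirely_null_containers(counts: &[ColumnCounts]) -> Vec<bool> {
    counts.iter().copied().map(is_entirely_null).collect()
}
```

The key design point is that an unknown count never marks a column as entirely NULL, which keeps "all NULL" distinct from "nothing is known".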

For the example in this ticket's description, with predicate col_a != A AND col_b='bananas', where col_b is not known and the relevant container has 100 rows:

  1. The relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}.
  2. PruningPredicate::prune would determine that col_b is entirely NULL, and would rewrite the predicate to col_a != A AND NULL = 'bananas'.
  3. The pruning rewrite would happen again, and this time it would not try to fetch min/max statistics for col_b, so the predicate could be proven to be not true for that container.
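
The rewrite in this example can be illustrated with a toy expression type. This is only a sketch under stated assumptions: `Expr`, `replace_null_columns`, `simplify`, and `can_be_true` are hypothetical names rather than DataFusion's expression or simplifier APIs, and `A` is treated as a string literal.

```rust
/// Hypothetical toy expression type, just large enough for this example.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(String),
    Null,
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

pub fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

pub fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}

/// Replaces references to columns known to be entirely NULL with a NULL literal.
pub fn replace_null_columns(expr: Expr, null_cols: &[&str]) -> Expr {
    let go = |e: Box<Expr>| Box::new(replace_null_columns(*e, null_cols));
    match expr {
        Expr::Column(name) if null_cols.contains(&name.as_str()) => Expr::Null,
        Expr::Eq(l, r) => Expr::Eq(go(l), go(r)),
        Expr::NotEq(l, r) => Expr::NotEq(go(l), go(r)),
        Expr::And(l, r) => Expr::And(go(l), go(r)),
        other => other,
    }
}

/// Folds comparisons that have a NULL operand into NULL (NULL = 5 --> NULL).
pub fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            let (l, r) = (simplify(*l), simplify(*r));
            if l == Expr::Null || r == Expr::Null {
                Expr::Null
            } else {
                Expr::Eq(Box::new(l), Box::new(r))
            }
        }
        Expr::NotEq(l, r) => {
            let (l, r) = (simplify(*l), simplify(*r));
            if l == Expr::Null || r == Expr::Null {
                Expr::Null
            } else {
                Expr::NotEq(Box::new(l), Box::new(r))
            }
        }
        Expr::And(l, r) => Expr::And(Box::new(simplify(*l)), Box::new(simplify(*r))),
        other => other,
    }
}

/// Conservative check: returns false only when the expression provably cannot
/// evaluate to TRUE. `x AND NULL` is either NULL or FALSE, never TRUE.
pub fn can_be_true(expr: &Expr) -> bool {
    match expr {
        Expr::Null => false,
        Expr::And(l, r) => can_be_true(l) && can_be_true(r),
        _ => true,
    }
}
```

With `col_b` in the list of entirely NULL columns, the predicate becomes `col_a != 'A' AND NULL`, and `can_be_true` reports that the container can be skipped without consulting min/max statistics for `col_b`.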

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Labels

enhancement: New feature or request
