Description
Is your feature request related to a problem or challenge?
This is broken out from #7869, which describes a slightly different problem.
PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via PruningStatistics::null_counts()).
Columns that contain only NULL occur in tables that have "schema evolution" -- for example if you have two files such as
File 1: col_a
File 2: col_a, col_b (col_b was added later)
A predicate like col_a != A AND col_b='bananas' cannot be true for File 1 (as col_b is logically NULL for all rows).
This is subtly, but importantly, different from the case when nothing is known about the column, which confusingly is encoded by returning NULL from PruningStatistics::min_values().
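To illustrate why the predicate can never be true for File 1, here is a minimal sketch of SQL three-valued logic in plain Rust. It does not use any DataFusion API: the function names (sql_and, sql_eq, row_passes) are hypothetical, and NULL is modeled as None.

```rust
/// SQL three-valued AND. `None` stands for NULL (unknown).
fn sql_and(lhs: Option<bool>, rhs: Option<bool>) -> Option<bool> {
    match (lhs, rhs) {
        // FALSE dominates: FALSE AND anything is FALSE.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Every remaining combination involves a NULL and no FALSE.
        _ => None,
    }
}

/// SQL equality: comparing against NULL always yields NULL.
fn sql_eq(lhs: Option<&str>, rhs: Option<&str>) -> Option<bool> {
    match (lhs, rhs) {
        (Some(l), Some(r)) => Some(l == r),
        _ => None,
    }
}

/// Evaluates `col_a != A AND col_b = 'bananas'` for a single row.
/// `col_a_ne_a` is the already-computed result of `col_a != A`.
fn row_passes(col_a_ne_a: Option<bool>, col_b: Option<&str>) -> Option<bool> {
    sql_and(col_a_ne_a, sql_eq(col_b, Some("bananas")))
}
```

Whatever col_a contains, a row whose col_b is NULL yields NULL or FALSE, never TRUE, so a container where col_b is entirely NULL holds no matching rows.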
Describe the solution you'd like
- Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
- Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
- Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)
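The second step could look roughly like the following sketch. It is an assumption-laden simplification: plain slices of Option<u64> stand in for whatever array types the real trait methods return, and the function name all_null_containers is hypothetical.

```rust
/// Returns, for each container, whether the column is known to be entirely NULL.
///
/// `row_counts[i]` and `null_counts[i]` describe container `i`; `None` means the
/// statistic is unknown for that container. (Hypothetical helper, not a DataFusion API.)
fn all_null_containers(row_counts: &[Option<u64>], null_counts: &[Option<u64>]) -> Vec<bool> {
    row_counts
        .iter()
        .zip(null_counts.iter())
        .map(|(rows, nulls)| match (rows, nulls) {
            // Both statistics known: all NULL exactly when every row is a NULL.
            // An empty container (0 rows, 0 nulls) counts as all NULL, which is
            // harmless because it has no rows that could match.
            (Some(r), Some(n)) => r == n,
            // If either statistic is missing, nothing can be concluded.
            _ => false,
        })
        .collect()
}
```

The conservative `false` for missing statistics matters: an unknown count must never be treated as proof that a column is entirely NULL.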
For the example in this ticket's description with predicate col_a != A AND col_b='bananas', where col_b is not known and the relevant container had 100 rows,
- the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
- PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
- The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b, and thus the predicate could be proven to be not true.
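The walkthrough above can be sketched with a toy expression tree. This is not DataFusion's Expr type or its simplifier: the enum, the helpers col and lit, and the functions replace_null_columns, simplify, and never_true are all hypothetical, and A is assumed to be a string literal.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Str(String),
    Null,
    Eq(Box<Expr>, Box<Expr>),
    NotEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: &str) -> Expr {
    Expr::Str(value.to_string())
}

/// Replaces references to columns known to be entirely NULL with a NULL literal.
fn replace_null_columns(expr: Expr, all_null: &[&str]) -> Expr {
    let rec = |e: Box<Expr>| Box::new(replace_null_columns(*e, all_null));
    match expr {
        Expr::Column(name) if all_null.iter().any(|c| *c == name.as_str()) => Expr::Null,
        Expr::Eq(l, r) => Expr::Eq(rec(l), rec(r)),
        Expr::NotEq(l, r) => Expr::NotEq(rec(l), rec(r)),
        Expr::And(l, r) => Expr::And(rec(l), rec(r)),
        other => other,
    }
}

/// Folds comparisons against NULL into NULL (e.g. `NULL = 'bananas'` --> `NULL`).
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Null, _) | (_, Expr::Null) => Expr::Null,
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        Expr::NotEq(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Null, _) | (_, Expr::Null) => Expr::Null,
            (l, r) => Expr::NotEq(Box::new(l), Box::new(r)),
        },
        Expr::And(l, r) => Expr::And(Box::new(simplify(*l)), Box::new(simplify(*r))),
        other => other,
    }
}

/// True when the expression provably never evaluates to TRUE:
/// it is NULL, or it is an AND with at least one such operand.
fn never_true(expr: &Expr) -> bool {
    match expr {
        Expr::Null => true,
        Expr::And(l, r) => never_true(l) || never_true(r),
        _ => false,
    }
}
```

Applying replace_null_columns with col_b marked as all NULL and then simplify turns the predicate into col_a != 'A' AND NULL, which never_true accepts without consulting min/max statistics for col_b.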
Describe alternatives you've considered
No response
Additional context
No response