@@ -149,11 +149,12 @@ pub trait PruningStatistics {
149
149
/// for any row in the Row Group, the entire Row Group is skipped during query
150
150
/// execution.
151
151
///
152
- /// The `PruningPredicate` API is designed to be general, so it can used for
153
- /// pruning other types of containers (e.g. files) based on statistics that may
154
- /// be known from external catalogs (e.g. Delta Lake) or other sources.
152
+ /// The `PruningPredicate` API is general, and can be used for pruning other
153
+ /// types of containers (e.g. files) based on statistics that may be known from
154
+ /// external catalogs (e.g. Delta Lake) or other sources. How this works is a
155
+ /// subtle topic. See the Background and Implementation section for details.
155
156
///
156
- /// It currently supports:
157
+ /// `PruningPredicate` supports:
157
158
///
158
159
/// 1. Arbitrary expressions (including user defined functions)
159
160
///
@@ -190,6 +191,188 @@ pub trait PruningStatistics {
190
191
/// ```
191
192
///
192
193
/// See [`PruningPredicate::try_new`] and [`PruningPredicate::prune`] for more information.
194
+ ///
195
+ /// # Background
196
+ ///
197
+ /// ## Boolean Tri-state logic
198
+ ///
199
+ /// To understand the details of the rest of this documentation, it is important
200
+ /// to understand how the tri-state boolean logic in SQL works. As this is
201
+ /// somewhat esoteric, we review it here.
202
+ ///
203
+ /// SQL has a notion of `NULL` that represents the value is `“unknown”` and this
204
+ /// uncertainty propagates through expressions. SQL `NULL` behaves very
205
+ /// differently than the `NULL` in most other languages where it is a special,
206
+ /// sentinel value (e.g. `0` in `C/C++`). While representing uncertainty with
207
+ /// `NULL` is powerful and elegant, SQL `NULL`s are often deeply confusing when
208
+ /// first encountered as they behave differently than most programmers may
209
+ /// expect.
210
+ ///
211
+ /// In most other programming languages,
212
+ /// * `a == NULL` evaluates to `true` if `a` also had the value `NULL`
213
+ /// * `a == NULL` evaluates to `false` if `a` has any other value
214
+ ///
215
+ /// However, in SQL `a = NULL` **always** evaluates to `NULL` (never `true` or
216
+ /// `false`):
217
+ ///
218
+ /// Expression | Result
219
+ /// ------------- | ---------
220
+ /// `1 = NULL` | `NULL`
221
+ /// `NULL = NULL` | `NULL`
222
+ ///
223
+ /// Also important is how `AND` and `OR` works with tri-state boolean logic as
224
+ /// (perhaps counterintuitively) the result is **not** always NULL. While
225
+ /// consistent with the notion of `NULL` representing “unknown”, this is again,
226
+ /// often deeply confusing 🤯 when first encountered.
227
+ ///
228
+ /// Expression | Result | Intuition
229
+ /// --------------- | --------- | -----------
230
+ /// `NULL AND true` | `NULL` | The `NULL` stands for “unknown” and if it were `true` or `false` the overall expression value could change
231
+ /// `NULL AND false` | `false` | If the `NULL` was either `true` or `false` the overall expression is still `false`
232
+ /// `NULL AND NULL` | `NULL` |
233
+ ///
234
+ /// Expression | Result | Intuition
235
+ /// --------------- | --------- | ----------
236
+ /// `NULL OR true` | `true` | If the `NULL` was either `true` or `false` the overall expression is still `true`
237
+ /// `NULL OR false` | `NULL` | The `NULL` stands for “unknown” and if it were `true` or `false` the overall expression value could change
238
+ /// `NULL OR NULL` | `NULL` |
239
+ ///
240
+ /// ## SQL Filter Semantics
241
+ ///
242
+ /// The SQL `WHERE` clause has a boolean expression, often called a filter or
243
+ /// predicate. The semantics of this predicate are that the query evaluates the
244
+ /// predicate for each row in the input tables and:
245
+ ///
246
+ /// * Rows that evaluate to `true` are returned in the query results
247
+ ///
248
+ /// * Rows that evaluate to `false` are not returned (“filtered out” or “pruned” or “skipped”).
249
+ ///
250
+ /// * Rows that evaluate to `NULL` are **NOT** returned (also “filtered out”).
251
+ /// Note: *this treatment of `NULL` is **DIFFERENT** than how `NULL` is treated
252
+ /// in the rewritten predicate described below.*
253
+ ///
254
+ /// # `PruningPredicate` Implementation
255
+ ///
256
+ /// Armed with the information in the Background section, we can now understand
257
+ /// how the `PruningPredicate` logic works.
258
+ ///
259
+ /// ## Interface
260
+ ///
261
+ /// **Inputs**
262
+ /// 1. An input schema describing what columns exist
263
+ ///
264
+ /// 2. A predicate (expression that evaluates to a boolean)
265
+ ///
266
+ /// 3. [`PruningStatistics`] that provides information about columns in that
267
+ /// schema, for multiple “containers”. For each column in each container, it
268
+ /// provides optional information on contained values, min_values, max_values,
269
+ /// and null_counts counts.
270
+ ///
271
+ /// **Outputs**:
272
+ /// A (non null) boolean value for each container:
273
+ /// * `true`: There MAY be rows that match the predicate
274
+ ///
275
+ /// * `false`: There are no rows that could possibly match the predicate (the
276
+ /// predicate can never possibly be true). The container can be pruned (skipped)
277
+ /// entirely.
278
+ ///
279
+ /// Note that in order to be correct, `PruningPredicate` must return false
280
+ /// **only** if it can determine that for all rows in the container, the
281
+ /// predicate could never evaluate to `true` (always evaluates to either `NULL`
282
+ /// or `false`).
283
+ ///
284
+ /// ## Contains Analysis and Min/Max Rewrite
285
+ ///
286
+ /// `PruningPredicate` works by first analyzing the predicate to see what
287
+ /// [`LiteralGuarantee`] must hold for the predicate to be true.
288
+ ///
289
+ /// Then, the `PruningPredicate` rewrites the original predicate into an
290
+ /// expression that references the min/max values of each column in the original
291
+ /// predicate.
292
+ ///
293
+ /// When the min/max values are actually substituted in to this expression and
294
+ /// evaluated, the result means
295
+ ///
296
+ /// * `true`: there MAY be rows that pass the predicate, **KEEPS** the container
297
+ ///
298
+ /// * `NULL`: there MAY be rows that pass the predicate, **KEEPS** the container
299
+ /// Note that rewritten predicate can evaluate to NULL when some of
300
+ /// the min/max values are not known. *Note that this is different than
301
+ /// the SQL filter semantics where `NULL` means the row is filtered
302
+ /// out.*
303
+ ///
304
+ /// * `false`: there are no rows that could possibly match the predicate,
305
+ /// **PRUNES** the container
306
+ ///
307
+ /// For example, given a column `x`, the `x_min` and `x_max` and `x_null_count`
308
+ /// represent the minimum and maximum values, and the null count of column `x`,
309
+ /// provided by the `PruningStatistics`. Here are some examples of the rewritten
310
+ /// predicates:
311
+ ///
312
+ /// Original Predicate | Rewritten Predicate
313
+ /// ------------------ | --------------------
314
+ /// `x = 5` | `x_min <= 5 AND 5 <= x_max`
315
+ /// `x < 5` | `x_max < 5`
316
+ /// `x = 5 AND y = 10` | `x_min <= 5 AND 5 <= x_max AND y_min <= 10 AND 10 <= y_max`
317
+ /// `x IS NULL` | `x_null_count > 0`
318
+ ///
319
+ /// ## Predicate Evaluation
320
+ /// The PruningPredicate works in two passes
321
+ ///
322
+ /// **First pass**: For each `LiteralGuarantee` calls
323
+ /// [`PruningStatistics::contained`] and rules out containers where the
324
+ /// LiteralGuarantees are not satisfied
325
+ ///
326
+ /// **Second Pass**: Evaluates the rewritten expression using the
327
+ /// min/max/null_counts values for each column for each container. For any
328
+ /// container that this expression evaluates to `false`, it rules out those
329
+ /// containers.
330
+ ///
331
+ /// For example, given the predicate, `x = 5 AND y = 10`, if we know `x` is
332
+ /// between `1 and 100` and we know that `y` is between `4` and `7`, the input
333
+ /// statistics might look like
334
+ ///
335
+ /// Column | Value
336
+ /// -------- | -----
337
+ /// `x_min` | `1`
338
+ /// `x_max` | `100`
339
+ /// `y_min` | `4`
340
+ /// `y_max` | `7`
341
+ ///
342
+ /// The rewritten predicate would look like
343
+ ///
344
+ /// `x_min <= 5 AND 5 <= x_max AND y_min <= 10 AND 10 <= y_max`
345
+ ///
346
+ /// When these values are substituted in to the rewritten predicate and
347
+ /// simplified, the result is `false`:
348
+ ///
349
+ /// * `1 <= 5 AND 5 <= 100 AND 4 <= 10 AND 10 <= 7`
350
+ /// * `true AND true AND true AND false`
351
+ /// * `false`
352
+ ///
353
+ /// Returning `false` means the container can be pruned, which matches the
354
+ /// intuition that `x = 5 AND y = 10` can’t be true for any row if all values of `y`
355
+ /// are `7` or less.
356
+ ///
357
+ /// If, for some other container, we knew `y` was between the values `4` and
358
+ /// `15`, then the rewritten predicate evaluates to `true` (verifying this is
359
+ /// left as an exercise to the reader -- are you still here?), and the container
360
+ /// **could not** be pruned. The intuition is that there may be rows where the
361
+ /// predicate *might* evaluate to `true`, and the only way to find out is to do
362
+ /// more analysis, for example by actually reading the data and evaluating the
363
+ /// predicate row by row.
364
+ ///
365
+ /// # Related Work
366
+ ///
367
+ /// [`PruningPredicate`] implements the type of min/max pruning described in
368
+ /// Section `3.3.3` of the [`Snowflake SIGMOD Paper`]. The technique is
369
+ /// described by various research such as [small materialized aggregates], [zone
370
+ /// maps], and [data skipping].
371
+ ///
372
+ /// [`Snowflake SIGMOD Paper`]: https://dl.acm.org/doi/10.1145/2882903.2903741
373
+ /// [small materialized aggregates]: https://www.vldb.org/conf/1998/p476.pdf
374
+ /// [zone maps]: https://dl.acm.org/doi/10.1007/978-3-642-03730-6_10
375
+ ///[data skipping]: https://dl.acm.org/doi/10.1145/2588555.2610515
193
376
#[ derive( Debug , Clone ) ]
194
377
pub struct PruningPredicate {
195
378
/// The input schema against which the predicate will be evaluated
@@ -227,6 +410,9 @@ impl PruningPredicate {
227
410
/// For example, the filter expression `(column / 2) = 4` becomes
228
411
/// the pruning predicate
229
412
/// `(column_min / 2) <= 4 && 4 <= (column_max / 2))`
413
+ ///
414
+ /// See the struct level documentation on [`PruningPredicate`] for more
415
+ /// details.
230
416
pub fn try_new ( expr : Arc < dyn PhysicalExpr > , schema : SchemaRef ) -> Result < Self > {
231
417
// build predicate expression once
232
418
let mut required_columns = RequiredColumns :: new ( ) ;
0 commit comments