Skip to content

fix: describe escaped quoted identifiers #16082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 21, 2025

Conversation

jfahne
Copy link
Contributor

@jfahne jfahne commented May 18, 2025

Which issue does this PR close?

Rationale for this change

The dataframe describe method serves as a tidier way to produce standard summary statistics about a dataset via the dataframe API. The abstraction hides an implicit normalization of column names performed by calls made to col from within the method's definition. @johnkerl flagged that use of datasets with non-normal column names (i.e. using uppercase or periods in the name) causes unexpected failures. The method is leaking its implementation and the fix provided properly seals it by escaping identifiers passed to col.

Regarding Other Kinds Of Non-Alphanumeric Characters In Identifiers

@findepi asks:

what if f.name() contains " itself?

should we use some format ident function?

The question raises a nuanced edge case. The following is a diagram of the call chain from passing a string to col followed by an explanation of each call.

col_call_chain_diagram

  1. The col method (source here) is passed the string literal ""Colu.MN1"".
  2. The string literal passed is implicitly cast on instantiation of the enum variant Expr::Column via an implementation of the std::convert::Into trait's into() method.
  3. Expr::Column (source here) is an enum variant holding a datafusion_common::Column (source here). datafusion_common::Column is the type to which the string literal is cast.
  4. The type constraint impl Into<Column> requires that the string literal implement a mechanism to cast to a string. std::convert provides generic implementation of the From and Into trait such that only one or the other need be defined (i.e. the From<&str> trait implemented for Column implies the Into<Column> trait on &str) (source here).
  5. There are several string types for which Column implements the From<...> trait thereby implicitly implementing the Into<Column> trait for those types. These include: From<&str>, From<&String>, and From<String> (source here)
  6. Regardless of the particular from method, the string literal is passed to one of them which produces a Column instance.
  7. All of the from methods for string to Column conversion call from_qualified_name (source here) which parses the string to produce a column.
  8. from_qualified_name expects an argument which implements the Into<String> conversion trait. Into<...> and From<...> traits are reflexive so String implements Into<String> by default. The string literal is "converted" to a String which largely amounts to a no-op in this example's call to from_qualified_name.
  9. from_qualified_name calls parse_identifiers_normalized (source here) which parses the string.
  10. parse_identifiers_normalized calls parse_identifiers (source here) which produces a Result holding a vector of sqlparser::ast::Ident structs (source here.
  11. To produce the Ident structs, parse_identifiers defines a sqlparser::parser::Parser (source here) with a sql dialect of sqlparser::dialect::GenericDialect (source here). The Parser calls try_with_sql (source here which uses a tokenizer to produce locations in the string for the identifier tokens. The parser then has the parse_multipart_identifier method called (source here) which is meant to extract the Ident structs from the tokens.
  12. The resulting vector of Ident structs is mapped with alias id for each element to a match statement.
  13. The match statement is checking each Ident struct's "quote style" which per the doc strings are:

/// The starting quote if any. Valid quote characters are the single quote,
/// double quote, backtick, and opening square bracket.

  1. The quote style is matched against via id.quote_style which is getting the quote_style attribute for each Ident in the result from parse_identifiers.
  2. When the identifier is quoted, the quote style matches Some(_) and the string is returned as from the Ident structs value attribute.
  3. When the identifier is not quoted, the quote style is None and the additional condition is checked that parse_identifiers_normalized was called with the ignore_case boolean flag set to true or false. from_qualified_name always calls parse_identifiers_normalized with ignore_case set to false so this branch is never entered.
  4. When the identifier is not quoted and parse_identifiers_normalized was called with ignore_case set to false, the default branch is entered which converts the Ident structs value attribute to lower case only ascii. This branch is not entered in this example because "\"Colu.MN1\"" is quoted.
  5. collect() is called on the mapping to produce a new vector of String values.
  6. from_idents (source here is called on the vector of Strings from parse_identifiers_normalized which produces a column by processing all but the last tokens into a relation on the Column struct, the last token as the column's name, and initializes a new Span instance.
  7. A column struct is returned from col with the quoted name.

I wrote all of that up to document what I understand about col and compare that with ident (source here) which just calls from_name on Column (source here) which is essentially a special constructor that creates a Column with no relation attribute and the string as-is for a name (bypassing using the sqlparser crate entirely). My concern in using ident to potentially more readily handle messier column names is that by doing no parsing it may break describe on something like a Postgres table. That appears to be moot as the identifier just has to be valid for the resulting dataframe/relation and not the source. ident appears to work well in this context. Thanks to @findepi for the suggestion.

What changes are included in this PR?

Calls to col in the describe method of dataframes to calculate summary statistics are passed the escape-quoted identifier via a format! macro.

Are these changes tested?

Yes, a test was added to the file: datafusion/core/dataframe/tests/mod.rs

Are there any user-facing changes?

Yes, the change ensures the dataframe API behaves as expected when using describe with non-normal identifiers.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate labels May 18, 2025
@jfahne jfahne force-pushed the describe-column-name-bug branch from 9b2f381 to c6f8524 Compare May 18, 2025 03:34
@github-actions github-actions bot removed sql SQL Planner logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate labels May 18, 2025
@jfahne jfahne force-pushed the describe-column-name-bug branch 3 times, most recently from ffb2671 to f03b36a Compare May 18, 2025 03:49
@jfahne
Copy link
Contributor Author

jfahne commented May 18, 2025

CC: @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jfahne and @findepi -- this looks really nice now

rm: dev files

fmt: final formatting

sed: s/<comment>//
@jfahne jfahne force-pushed the describe-column-name-bug branch from 44d890c to 9d98231 Compare May 21, 2025 11:35
@alamb alamb merged commit 0589dbb into apache:main May 21, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented May 21, 2025

Thanks again @jfahne and @findepi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

describe does not handle mixed case or dots in column names
3 participants