fix: describe escaped quoted identifiers #16082
Merged
Which issue does this PR close?
`describe` does not handle mixed case or dots in column names #16017

Rationale for this change
The dataframe `describe` method serves as a tidier way to produce standard summary statistics about a dataset via the dataframe API. The abstraction hides an implicit normalization of column names performed by calls made to `col` from within the method's definition. @johnkerl flagged that use of datasets with non-normal column names (i.e. using uppercase or periods in the name) causes unexpected failures. The method is leaking its implementation, and the fix provided properly seals it by escaping identifiers passed to `col`.

Regarding Other Kinds Of Non-Alphanumeric Characters In Identifiers
@findepi asks:
The question raises a nuanced edge case. The following is a diagram of the call chain from passing a string to `col`, followed by an explanation of each call.

- The `col` function (source here) is passed the string literal `"\"Colu.MN1\""` and calls the `.into()` method on it.
- `Expr::Column` (source here) is an enum variant holding a `datafusion_common::Column` (source here).
- `datafusion_common::Column` is the type to which the string literal is converted.
- `impl Into<Column>` requires that the string literal implement a conversion to a `Column`.
- `std::convert` provides a generic implementation of `Into` in terms of `From`, so only `From` needs to be defined (i.e. the `From<&str>` trait implemented for `Column` implies the `Into<Column>` trait on `&str`) (source here).
- `Column` implements the `From<...>` trait for several types, thereby implicitly implementing the `Into<Column>` trait for those types. These include `From<&str>`, `From<&String>`, and `From<String>` (source here).
- Each of these provides a `from` method; the string literal is passed to one of them, which produces a `Column` instance.
- The `from` methods for string to `Column` conversion call `from_qualified_name` (source here), which parses the string to produce a column.
- `from_qualified_name` expects an argument which implements the `Into<String>` conversion trait.
- The `Into<...>` and `From<...>` traits are reflexive, so `String` implements `Into<String>` by default. The string literal is "converted" to a `String`, which largely amounts to a no-op in this example's call to `from_qualified_name`.
- `from_qualified_name` calls `parse_identifiers_normalized` (source here), which parses the string.
- `parse_identifiers_normalized` calls `parse_identifiers` (source here), which produces a `Result` holding a vector of `sqlparser::ast::Ident` structs (source here).
- To produce the `Ident` structs, `parse_identifiers` defines a `sqlparser::parser::Parser` (source here) with a SQL dialect of `sqlparser::dialect::GenericDialect` (source here). The `Parser` calls `try_with_sql` (source here), which uses a tokenizer to produce locations in the string for the identifier tokens. The parser then has the `parse_multipart_identifier` method called (source here), which is meant to extract the `Ident` structs from the tokens.
- The vector of `Ident` structs is mapped, with alias `id` for each element, to a match statement.
- The match is on each `Ident` struct's "quote style", as described in its doc strings.
  - `id.quote_style` gets the `quote_style` attribute for each `Ident` in the result from `parse_identifiers`.
  - `Some(_)`: the identifier is quoted, and the string is returned as-is from the `Ident` struct's `value` attribute.
  - `None`: the additional condition checks whether `parse_identifiers_normalized` was called with the `ignore_case` boolean flag set to `true` or `false`.
    - `from_qualified_name` always calls `parse_identifiers_normalized` with `ignore_case` set to `false`, so this branch is never entered.
  - Because `parse_identifiers_normalized` was called with `ignore_case` set to `false`, the default branch is entered, which converts the ASCII characters of the `Ident` struct's `value` attribute to lowercase. This branch is not entered in this example because `"\"Colu.MN1\""` is quoted.
- `collect()` is called on the mapping to produce a new vector of `String` values.
- `from_idents` (source here) is called on the vector of `String`s from `parse_identifiers_normalized`. It produces a column by processing all but the last token into a relation on the `Column` struct, using the last token as the column's name, and initializing a new `Span` instance.
- The resulting column is returned by `col` with the quoted name.

I wrote all of that up to document what I understand about
`col` and compare that with `ident` (source here), which just calls `from_name` on `Column` (source here). That is essentially a special constructor that creates a `Column` with no `relation` attribute and the string as-is for a `name` (bypassing the `sqlparser` crate entirely).

My concern in using `ident` to potentially more readily handle messier column names was that by doing no parsing it may break `describe` on something like a Postgres table. That appears to be moot, as the identifier just has to be valid for the resulting dataframe/relation and not the source. `ident` appears to work well in this context. Thanks to @findepi for the suggestion.

What changes are included in this PR?
Calls to `col` in the `describe` method of dataframes to calculate summary statistics are passed the escape-quoted identifier via a `format!` macro.

Are these changes tested?
Yes, a test was added to the file: `datafusion/core/dataframe/tests/mod.rs`
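
For illustration only, here is a minimal sketch of what escape-quoting an identifier with `format!` can look like. The helper name `quote_identifier` is hypothetical and is not taken from this PR or from DataFusion. Doubling embedded double quotes is an assumption based on the usual SQL convention, not something this PR is known to do.

```rust
/// Hypothetical helper (not part of DataFusion or this PR): wrap a column
/// name in double quotes so that a SQL-style identifier parser treats it as a
/// single, case-sensitive identifier instead of normalizing it.
///
/// Assumption: embedded double quotes are escaped by doubling them, which is
/// the common SQL convention for quoted identifiers.
fn quote_identifier(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}
```

With this sketch, `quote_identifier("Colu.MN1")` produces the string `"Colu.MN1"` including the surrounding double quotes, which is the form of the literal `"\"Colu.MN1\""` discussed above.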
Are there any user-facing changes?
Yes, the change ensures the dataframe API behaves as expected when using `describe` with non-normal identifiers.
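
To make the difference between `col` and `ident` described in the call-chain walkthrough concrete, the following is a simplified, standard-library-only model. It is a sketch under stated assumptions, not DataFusion's code: the `Column` struct here is a stand-in with only `relation` and `name`, the parser is a hand-written approximation of `parse_identifiers_normalized` with `ignore_case` set to `false` (the real one delegates to `sqlparser`), the relation is kept as a plain dotted string, and escaped quotes inside quoted identifiers are not handled.

```rust
/// Stand-in for `datafusion_common::Column` (assumption: only two fields).
#[derive(Debug, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
}

/// Approximates the normalization step: unquoted parts are lowercased
/// (ASCII only), double-quoted parts are kept exactly as written.
fn finish(part: &str, was_quoted: bool) -> String {
    if was_quoted {
        part.to_string()
    } else {
        part.to_ascii_lowercase()
    }
}

/// Simplified model of `parse_identifiers_normalized(s, false)`:
/// split on dots that are outside double quotes, then normalize each part.
fn parse_identifiers_normalized(s: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut was_quoted = false;
    for ch in s.chars() {
        match ch {
            '"' => {
                in_quotes = !in_quotes;
                was_quoted = true;
            }
            '.' if !in_quotes => {
                parts.push(finish(&current, was_quoted));
                current.clear();
                was_quoted = false;
            }
            _ => current.push(ch),
        }
    }
    parts.push(finish(&current, was_quoted));
    parts
}

impl Column {
    /// Models `from_qualified_name`: parse, then use the last part as the
    /// name and everything before it as the relation.
    fn from_qualified_name(s: &str) -> Self {
        let mut parts = parse_identifiers_normalized(s);
        let name = parts.pop().unwrap_or_default();
        let relation = if parts.is_empty() {
            None
        } else {
            Some(parts.join("."))
        };
        Column { relation, name }
    }

    /// Models `from_name`: no parsing, the string is used as-is.
    fn from_name(s: &str) -> Self {
        Column { relation: None, name: s.to_string() }
    }
}

/// Implementing `From<&str>` gives `&str: Into<Column>` for free.
impl From<&str> for Column {
    fn from(s: &str) -> Self {
        Column::from_qualified_name(s)
    }
}

/// Model of `col`: accepts anything convertible into a `Column`.
fn col(ident: impl Into<Column>) -> Column {
    ident.into()
}

/// Model of `ident`: bypasses parsing entirely.
fn ident(name: &str) -> Column {
    Column::from_name(name)
}
```

In this model, `col("Colu.MN1")` yields a column named `mn1` with relation `colu`, which is the kind of silent normalization that breaks `describe`. Both `col("\"Colu.MN1\"")` and `ident("Colu.MN1")` yield a column named `Colu.MN1` with no relation.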