-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Enable placeholders with extension types #17986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
97a8408
to
04cebe1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timsaucer @alamb I'm happy to add tests for all these components but wanted to make sure this is vaugely headed in the right direction before I do so!
pub data_type: Option<DataType>, | ||
pub field: Option<FieldRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change. We can change this to a less severely breaking option (e.g., just add metadata: FieldMetadata
to the struct)...I started with the most breaking version to identify its use in as many places as possible.
pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> { | ||
pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<FieldRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change would also enable supporting UUIDs and other SQL types that map to extension types
Ok(Expr::Cast(Cast::new( | ||
Box::new(expr), | ||
dt.data_type().clone(), | ||
))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another place where metadata is dropped that I didn't update (casts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made #18060 for this one
#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)] | ||
pub struct FieldMetadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved this from the expr crate so I could us it in ParamValues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @paleolimbot -- I think this is definitely the right direction. I had some small API suggestions, but overall looks good
The largest open question in my mind is what you have highlighted for customizing behavior for different extension types (e.g. comparing two fields for "equality" and printing them, and casting them, etc.)
@findepi brought up the same thing many months ago when discussing adding new types in
One idea is to create a TypeRegistry
similar to a FunctionRegistry
and some sort of ExtensionType
trait that encapsulates these behaviors.
The challenge would then be to thread the registry to all places that need it. Though that is likely largely an API design / plumbing exercise
If you think that is an idea worth exploring
datafusion/common/src/param_value.rs
Outdated
impl ParamValues { | ||
/// Verify parameter list length and type | ||
pub fn verify(&self, expect: &[DataType]) -> Result<()> { | ||
pub fn verify(&self, expect: &[FieldRef]) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thing that would be nice to help people upgrade could be to add a new function and deprecate this one -- perhaps something like suggested in https://datafusion.apache.org/contributor-guide/api-health.html#api-health-policy
#[deprecated]
pub fn verify(&self, expect: &[DataType]) -> Result<()> {
// make dummy Fields
let expect = ...;
self.verify_fields(&expect)
}
// new function that has the new signature
pub fn verify_fields(&self, expect: &[FieldRef]) -> Result<()> {
...
}
datafusion/common/src/param_value.rs
Outdated
} | ||
|
||
if let Some(expected_metadata) = maybe_metadata { | ||
// Probably too strict of a comparison (this is an example of where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree straight up comparing strings is probably not ideal
If we wanted to introduce type equality, I thing the bigger question is how to thread it through (you would have to have some way to register your types / methods to check equality and ensure that somehow ended up here 🤔 )
I've stumbled across this PR and remembered that I implemented such a thing in #15106 for defining custom sort orders for certain types. I now have a bit more time on my hand and I'll try to revive that effort as we would be still interested in that. |
Thank you for the review! I will clean this up today/tomorrow and ensure we can test all the spots.
I'd love that! Perhaps the first step from the non-sorting angle might be to just consolidate all the places we print, cast, or test equality (which are currently scattered around the code base). A vague thought based mostly on this PR, SedonaDB's pub fn resolve_df_data_type(session: Option<SessionSomething>, storage_type: &DataType, metadata: impl MetadataViewOfSomeSort ) -> Box<dyn DFDataType> {
// sanity check to avoid hitting the lock on any registry whatsoever when not needed)
// If the session is specified, use a session registry
// otherwise, use a static or compile-time populated registry until such a time as the session
// is made available everywhere and/or DataType::Extension is a thing
}
pub trait DFDataType: Clone {
fn storage_type(&self) -> DataType;
// e.g., the sanity check on the return field of a UDxF
fn type_identical_to(&self, &dyn DFDataType) -> Result<bool>;
// for when this has to be a storage field (it might already be one)
fn storage_field(&self, name: Option<&str>, nullability: Option<bool>) -> FieldRef;
fn add_to_existing_storage_field(&self
// for when this has to be just metadata (e.g., the Expr::Literal)
fn storage_field_metadata(&self) -> Option<FieldMetadata>;
// to avoid dumping a Field debug string in user-facing errors
fn display_type(&self) -> String;
// for any hope of ever being able to deal with shredded/unshredded variants in UNION ALL
// slash listing tables. This is useful for some geoarrow types too (wkb/wkb_view -> wkb)
fn common_castable_type(&self, &dyn DFDataType) -> Result<Box<dyn DFDataType>> { /* self if types are identical or error */ }
// usually printing the storage is OK but no ideal (e.g., variants should be jsonish)
fn format_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
// Or something more advanced like the dynamic comparator in the linked PR
fn sort_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
// It's a shame to bleed execution in here but casting shows up a lot
fn cast_storage(&self, storage: ArrayRef, to: Box<dyn DFDataType>);
}
// All the structures that are currently communicating that type information
impl DFDataType for FieldMetadata {
// implement strict equality based on all metadata (or just extension name/extension metadata?)
// printing based on metadata
// everything else based on datatype
// ...basically everything datafusion does right now
}
impl DFDataType for FieldRef {
// defer to fieldmetadata
}
impl DFDataType for DataType {
// defer to DataTypes
} |
Btw, if you want another set of eyes you can ping me once its ready for another review. |
Thank you...I'd love that! I just need to finish up a few tests...I didn't quite get there today but I'm confident I can get this ready for eyes tomorrow. |
9b6c182
to
27e8e42
Compare
479fe56
to
70f3663
Compare
/// A [`ScalarValue`] with optional [`FieldMetadata`] | ||
#[derive(Debug, Clone)] | ||
pub struct Literal { | ||
pub value: ScalarValue, | ||
pub metadata: Option<FieldMetadata>, | ||
} | ||
|
||
impl Literal { | ||
/// Create a new Literal from a scalar value with optional [`FieldMetadata`] | ||
pub fn new(value: ScalarValue, metadata: Option<FieldMetadata>) -> Self { | ||
Self { value, metadata } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure Literal
is the best name here (we already have the trait datafusion_expr::Literal
and datafusion_physical_expr::Literal
), but ScalarValueAndMetadata
seemed verbose and using (ScalarValue, Option<FieldMetadata>)
has been used in enough places that I think having its own struct makes sense.
Other places where we pair a ScalarValue
with metadata are Expr::Literal
, and ConstSimplifyResult::Simplified
.
I don't have strong opinions here and am happy to have this be whatever.
pub fn check_metadata_with_storage_equal( | ||
actual: ( | ||
&DataType, | ||
Option<&std::collections::HashMap<String, String>>, | ||
), | ||
expected: ( | ||
&DataType, | ||
Option<&std::collections::HashMap<String, String>>, | ||
), | ||
what: &str, | ||
context: &str, | ||
) -> Result<(), DataFusionError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of ugly but there are two places in this PR that do this check (type equality). I figured at least collecting the checks in one place would make it easier to do something smarter later if it comes up.
pub fn format_type_and_metadata( | ||
data_type: &DataType, | ||
metadata: Option<&std::collections::HashMap<String, String>>, | ||
) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to keep the LogicalType::Statement(Prepared {..})
output in the plan text from dumping the debug output from Field
so that it is mostly unchanged (unless there is, in fact, metadata). Also helps with the error message above.
Expr::Placeholder(Placeholder { field, .. }) => match &field { | ||
// TODO: This seems to be pulling metadata/types from the schema | ||
// based on the Alias destination. name? Is this correct? | ||
None => schema | ||
.field_from_column(&Column::from_name(name)) | ||
.map(|f| f.clone().with_name(&schema_name)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this first branch and why it would pull metadata from a schema based on the destination name?
This is ready for 👀! @tobixdev The CI failure is
I'm happy to take a look if this keeps showing up. |
I believe that is due to this issue (which was fixed yesterday) It should be fixed with a merge up from main |
Which issue does this PR close?
Rationale for this change
Most logical plan expressions now propagate metadata; however, parameters with extension types or other field metadata cannot participate in placeholder/parameter binding.
What changes are included in this PR?
The DataType in the Placeholder struct was replaced with a FieldRef along with anything that stored the "DataType" of a parameter.
Strictly speaking one could bind parameters with an extension type by copy/pasting the placeholder replacer, which I figured out towards the end of this change. I still think this change makes sense and opens up the door for things like handling UUID in SQL with full parameter binding support.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, one new function was added to extract the placeholder fields from a plan.
This is a breaking change for code that specifically interacts with the pub fields of the modified structs (ParamValues, Placeholder, and Prepare are the main ones).