
Conversation

@paleolimbot (Member) commented Oct 8, 2025

Which issue does this PR close?

Closes "Logical plan placeholders can't contain metadata".
Rationale for this change

Most logical plan expressions now propagate metadata; however, parameters with extension types or other field metadata cannot participate in placeholder/parameter binding.

What changes are included in this PR?

The DataType in the Placeholder struct was replaced with a FieldRef, along with everything else that stored the "DataType" of a parameter.

Strictly speaking, one could bind parameters with an extension type by copy/pasting the placeholder replacer, which I figured out towards the end of this change. I still think this change makes sense and opens the door to things like handling UUID in SQL with full parameter binding support. A sketch of the core change follows.
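
A hedged sketch of the shape of the change (the struct is simplified from the diff excerpts below; the UUID metadata key/value follows the Arrow extension-type convention and is illustrative, not something this PR prescribes):

use std::sync::Arc;
use arrow_schema::{DataType, Field, FieldRef};

// Simplified sketch of the modified struct (see the diff excerpt below)
pub struct Placeholder {
    /// The identifier of the parameter, e.g. "$1"
    pub id: String,
    // before this PR: pub data_type: Option<DataType>,
    pub field: Option<FieldRef>,
}

fn main() {
    // A placeholder whose type carries extension metadata, e.g. a UUID
    // stored as FixedSizeBinary(16) with the Arrow extension-name key
    let field: FieldRef = Arc::new(
        Field::new("$1", DataType::FixedSizeBinary(16), true).with_metadata(
            [("ARROW:extension:name".to_string(), "arrow.uuid".to_string())]
                .into_iter()
                .collect(),
        ),
    );
    let _placeholder = Placeholder {
        id: "$1".to_string(),
        field: Some(field),
    };
}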

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, one new function was added to extract the placeholder fields from a plan.

This is a breaking change for code that specifically interacts with the pub fields of the modified structs (ParamValues, Placeholder, and Prepare are the main ones).

@github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), proto (Related to proto crate), physical-expr (Changes to the physical-expr crates), optimizer (Optimizer rules), core (Core DataFusion crate), and common (Related to common crate) labels on Oct 8, 2025
@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Oct 10, 2025
@paleolimbot (Member, Author) left a comment

@timsaucer @alamb I'm happy to add tests for all these components but wanted to make sure this is vaguely headed in the right direction before I do so!

Comment on lines -1373 to +1145
pub data_type: Option<DataType>,
pub field: Option<FieldRef>,
@paleolimbot (Member, Author):

This is the main change. We can change this to a less severely breaking option (e.g., just add metadata: FieldMetadata to the struct)...I started with the most breaking version to surface its use in as many places as possible. A sketch of that alternative follows.
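
For contrast, a minimal sketch of the less-breaking option mentioned above (hypothetical; not what this PR does, and the FieldMetadata import path is assumed):

use arrow_schema::DataType;
use datafusion_common::FieldMetadata; // path assumed; this PR moves FieldMetadata to the common crate

// Keep `data_type` as-is and add the metadata alongside it
pub struct Placeholder {
    pub id: String,
    pub data_type: Option<DataType>,
    pub metadata: Option<FieldMetadata>,
}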

Comment on lines -590 to +592
pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<FieldRef> {
@paleolimbot (Member, Author):

I think this change would also enable supporting UUIDs and other SQL types that map to extension types; a hypothetical sketch follows.
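
A hypothetical sketch of what that could look like (SQLDataType::Uuid exists in sqlparser, but the FixedSizeBinary(16) + arrow.uuid mapping shown here is an assumption, not something this PR implements):

use std::sync::Arc;
use arrow_schema::{DataType, Field, FieldRef};
use sqlparser::ast::DataType as SQLDataType;

// Hypothetical: map SQL UUID to a storage type plus extension metadata,
// which a bare DataType return value could not express
fn convert_uuid_sketch(sql_type: &SQLDataType) -> Option<FieldRef> {
    match sql_type {
        SQLDataType::Uuid => Some(Arc::new(
            Field::new("", DataType::FixedSizeBinary(16), true).with_metadata(
                [("ARROW:extension:name".to_string(), "arrow.uuid".to_string())]
                    .into_iter()
                    .collect(),
            ),
        )),
        // existing mappings would wrap their DataType in a plain Field
        _ => None,
    }
}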

Comment on lines +989 to +994
Ok(Expr::Cast(Cast::new(
Box::new(expr),
dt.data_type().clone(),
)))
@paleolimbot (Member, Author):

Another place where metadata is dropped that I didn't update (casts)

@paleolimbot (Member, Author):

I made #18060 for this one

Comment on lines +54 to +168
#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
pub struct FieldMetadata {
@paleolimbot (Member, Author):

I just moved this from the expr crate so I could use it in ParamValues.

@alamb (Contributor) left a comment

Thank you @paleolimbot -- I think this is definitely the right direction. I had some small API suggestions, but overall looks good

The largest open question in my mind is the one you have highlighted: customizing behavior for different extension types (e.g., comparing two fields for "equality", printing them, casting them, etc.).

@findepi brought up the same thing many months ago when discussing adding new types.

One idea is to create a TypeRegistry similar to a FunctionRegistry and some sort of ExtensionType trait that encapsulates these behaviors.

The challenge would then be to thread the registry to all places that need it. Though that is likely largely an API design / plumbing exercise

If you think that is an idea worth exploring, a rough sketch of the shape follows.
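
A rough, hypothetical sketch of that idea (all names and signatures here are invented for illustration, not an existing DataFusion API):

use std::collections::HashMap;
use std::sync::Arc;
use arrow_schema::{DataType, Field};

// Hypothetical trait encapsulating the per-type behaviors mentioned above
pub trait ExtensionType: Send + Sync {
    /// Extension name used as the registry key, e.g. "arrow.uuid"
    fn name(&self) -> &str;
    /// The underlying storage type
    fn storage_type(&self) -> DataType;
    /// "Equality" of two fields carrying this type
    fn equals(&self, a: &Field, b: &Field) -> bool;
    /// User-facing rendering for plan text and error messages
    fn display(&self, field: &Field) -> String;
}

// Hypothetical registry, analogous to FunctionRegistry
#[derive(Default)]
pub struct TypeRegistry {
    types: HashMap<String, Arc<dyn ExtensionType>>,
}

impl TypeRegistry {
    pub fn register(&mut self, t: Arc<dyn ExtensionType>) {
        self.types.insert(t.name().to_string(), t);
    }

    pub fn lookup(&self, extension_name: &str) -> Option<Arc<dyn ExtensionType>> {
        self.types.get(extension_name).cloned()
    }
}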

impl ParamValues {
/// Verify parameter list length and type
pub fn verify(&self, expect: &[DataType]) -> Result<()> {
pub fn verify(&self, expect: &[FieldRef]) -> Result<()> {
@alamb (Contributor):

one thing that would be nice to help people upgrade could be to add a new function and deprecate this one -- perhaps something like suggested in https://datafusion.apache.org/contributor-guide/api-health.html#api-health-policy

#[deprecated]
pub fn verify(&self, expect: &[DataType]) -> Result<()> {
    // make dummy Fields
    let expect = ...;
    self.verify_fields(&expect)
}

// new function that has the new signature
pub fn verify_fields(&self, expect: &[FieldRef]) -> Result<()> {
    ...
}
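
One way the deprecated shim could build its dummy fields (a sketch, assuming unnamed nullable fields are acceptable stand-ins):

use std::sync::Arc;
use arrow_schema::{DataType, Field, FieldRef};

// Sketch: lift the old &[DataType] argument into FieldRefs so verify()
// can delegate to verify_fields()
fn to_dummy_fields(expect: &[DataType]) -> Vec<FieldRef> {
    expect
        .iter()
        .map(|dt| Arc::new(Field::new("", dt.clone(), true)) as FieldRef)
        .collect()
}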

if let Some(expected_metadata) = maybe_metadata {
// Probably too strict of a comparison (this is an example of where
@alamb (Contributor):

Yeah, I agree straight up comparing strings is probably not ideal

If we wanted to introduce type equality, I think the bigger question is how to thread it through (you would have to have some way to register your types / methods to check equality and ensure that somehow ended up here 🤔)

@tobixdev (Contributor) commented Oct 13, 2025

One idea is to create a TypeRegistry similar to a FunctionRegistry and some sort of ExtensionType trait that encapsulates these behaviors.

I've stumbled across this PR and remembered that I implemented such a thing in #15106 for defining custom sort orders for certain types. I now have a bit more time on my hands and I'll try to revive that effort, as we are still interested in it.

@paleolimbot (Member, Author):

Thank you for the review! I will clean this up today/tomorrow and ensure we can test all the spots.

I now have a bit more time on my hands and I'll try to revive that effort, as we are still interested in it.

I'd love that! Perhaps the first step from the non-sorting angle might be to just consolidate all the places we print, cast, or test equality (which are currently scattered around the code base).

A vague thought based mostly on this PR, SedonaDB's SedonaType, and the one linked (likely needs much refinement):

pub fn resolve_df_data_type(session: Option<SessionSomething>, storage_type: &DataType, metadata: impl MetadataViewOfSomeSort) -> Box<dyn DFDataType> {
  // (sanity check to avoid hitting the lock on any registry whatsoever when not needed)

  // If the session is specified, use a session registry

  // otherwise, use a static or compile-time populated registry until such a time as the session
  // is made available everywhere and/or DataType::Extension is a thing
}

pub trait DFDataType: Clone {
  fn storage_type(&self) -> DataType;
  // e.g., the sanity check on the return field of a UDxF
  fn type_identical_to(&self, other: &dyn DFDataType) -> Result<bool>;
  // for when this has to be a storage field (it might already be one)
  fn storage_field(&self, name: Option<&str>, nullability: Option<bool>) -> FieldRef;
  fn add_to_existing_storage_field(&self, field: &Field) -> FieldRef;
  // for when this has to be just metadata (e.g., the Expr::Literal)
  fn storage_field_metadata(&self) -> Option<FieldMetadata>;
  // to avoid dumping a Field debug string in user-facing errors
  fn display_type(&self) -> String;
  // for any hope of ever being able to deal with shredded/unshredded variants in UNION ALL
  // slash listing tables. This is useful for some geoarrow types too (wkb/wkb_view -> wkb)
  fn common_castable_type(&self, other: &dyn DFDataType) -> Result<Box<dyn DFDataType>> { /* self if types are identical or error */ }
  // usually printing the storage is OK but not ideal (e.g., variants should be jsonish)
  fn format_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
  // Or something more advanced like the dynamic comparator in the linked PR
  fn sort_proxy(&self, storage: ArrayRef) -> Result<ArrayRef> { Ok(storage) }
  // It's a shame to bleed execution in here but casting shows up a lot
  fn cast_storage(&self, storage: ArrayRef, to: Box<dyn DFDataType>) -> Result<ArrayRef>;
}

// All the structures that are currently communicating that type information

impl DFDataType for FieldMetadata {
  // implement strict equality based on all metadata (or just extension name/extension metadata?)
  // printing based on metadata
  // everything else based on datatype
  // ...basically everything datafusion does right now
}

impl DFDataType for FieldRef {
  // defer to fieldmetadata
}

impl DFDataType for DataType {
  // defer to plain DataType behavior
}

@tobixdev (Contributor):

Btw, if you want another set of eyes, you can ping me once it's ready for another review.

@paleolimbot (Member, Author):

Thank you...I'd love that! I just need to finish up a few tests...I didn't quite get there today but I'm confident I can get this ready for eyes tomorrow.

@paleolimbot paleolimbot marked this pull request as ready for review October 16, 2025 23:02
Comment on lines +25 to +36
/// A [`ScalarValue`] with optional [`FieldMetadata`]
#[derive(Debug, Clone)]
pub struct Literal {
pub value: ScalarValue,
pub metadata: Option<FieldMetadata>,
}

impl Literal {
/// Create a new Literal from a scalar value with optional [`FieldMetadata`]
pub fn new(value: ScalarValue, metadata: Option<FieldMetadata>) -> Self {
Self { value, metadata }
}
@paleolimbot (Member, Author):

I am not sure Literal is the best name here (we already have the trait datafusion_expr::Literal and datafusion_physical_expr::Literal), but ScalarValueAndMetadata seemed verbose, and the pair (ScalarValue, Option<FieldMetadata>) shows up in enough places that I think having its own struct makes sense.

Other places where we pair a ScalarValue with metadata are Expr::Literal, and ConstSimplifyResult::Simplified.

I don't have strong opinions here and am happy to have this be whatever.
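
For illustration, a hedged sketch of constructing the struct quoted above (the ScalarValue variant is arbitrary, and FieldMetadata's constructors are not shown in this diff, so the metadata is passed through):

use datafusion_common::ScalarValue;

// Sketch only: `Literal` and `FieldMetadata` as quoted above; metadata
// construction is left to the caller because its API is not shown here
fn make_literal(metadata: Option<FieldMetadata>) -> Literal {
    Literal::new(ScalarValue::Utf8(Some("abc".to_string())), metadata)
}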

Comment on lines +78 to +89
pub fn check_metadata_with_storage_equal(
actual: (
&DataType,
Option<&std::collections::HashMap<String, String>>,
),
expected: (
&DataType,
Option<&std::collections::HashMap<String, String>>,
),
what: &str,
context: &str,
) -> Result<(), DataFusionError> {
@paleolimbot (Member, Author):

This is kind of ugly but there are two places in this PR that do this check (type equality). I figured at least collecting the checks in one place would make it easier to do something smarter later if it comes up.
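
A hedged usage sketch of the helper quoted above (the `what`/`context` strings are illustrative; their exact semantics are this PR's choice):

use std::collections::HashMap;
use arrow_schema::DataType;
use datafusion_common::Result;

// Sketch: compare a bound parameter's (type, metadata) pair against the
// placeholder's expectation using the helper quoted above
fn check_param(
    actual_meta: Option<&HashMap<String, String>>,
    expected_meta: Option<&HashMap<String, String>>,
) -> Result<()> {
    check_metadata_with_storage_equal(
        (&DataType::Utf8, actual_meta),
        (&DataType::Utf8, expected_meta),
        "parameter",  // what is being compared
        "binding $1", // where a mismatch would be reported
    )
}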

Comment on lines +124 to +127
pub fn format_type_and_metadata(
data_type: &DataType,
metadata: Option<&std::collections::HashMap<String, String>>,
) -> String {
@paleolimbot (Member, Author):

This keeps the LogicalType::Statement(Prepared {..}) output in the plan text from dumping Field's debug output, so the output is mostly unchanged (unless there is, in fact, metadata). It also helps with the error message above. A usage sketch follows.
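
A hedged usage sketch (the exact rendering is this PR's choice; the point is that plain types print as before):

use std::collections::HashMap;
use arrow_schema::DataType;

fn format_examples() {
    // No metadata: expected to render like the old plan text, e.g. "Utf8"
    let plain = format_type_and_metadata(&DataType::Utf8, None);

    // With metadata: expected to mention the map in some readable form
    let meta: HashMap<String, String> =
        [("ARROW:extension:name".to_string(), "arrow.uuid".to_string())]
            .into_iter()
            .collect();
    let annotated =
        format_type_and_metadata(&DataType::FixedSizeBinary(16), Some(&meta));

    let _ = (plain, annotated);
}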

Comment on lines +439 to +444
Expr::Placeholder(Placeholder { field, .. }) => match &field {
// TODO: This seems to be pulling metadata/types from the schema
// based on the Alias destination name? Is this correct?
None => schema
.field_from_column(&Column::from_name(name))
.map(|f| f.clone().with_name(&schema_name)),
@paleolimbot (Member, Author):

I don't understand this first branch: why would it pull metadata from a schema based on the destination name?

@paleolimbot (Member, Author):

This is ready for 👀! @tobixdev

The CI failure is

1. query is expected to fail, but actually succeed:
[SQL] SELECT ALL ( + 73 ) + - + 92 * - 97 + COALESCE ( + ( + NULLIF ( + 75, - MIN ( ALL + + 53 ) ) ), + ( - CASE - - 73 WHEN COALESCE ( - + 51, - 26 ) / + 0 THEN COUNT ( * ) * 14 END ), + COUNT ( * ) / + 3, + COUNT ( ALL 73 ) ) + - 86 * ( 8 ) * + 37 AS col1

I'm happy to take a look if this keeps showing up.

@alamb (Contributor) commented Oct 17, 2025

I'm happy to take a look if this keeps showing up.

I believe that is due to this issue (which was fixed yesterday)

It should be fixed with a merge up from main
