Skip to content

Can no longer easily join duplicate schemas as of version 43 #14112

Closed
@westonpace

Description

@westonpace

Describe the bug

This may be expected behavior so feel free to close if this is intended. However, it is (potentially) different from postgres behavior and I figured I would mention it. The reproducer can probably explain the issue better than I can.

I'm able to work around the issue by renaming all fields on one of the inputs with a prefix but I didn't have to do this before and so I figured I'd report it and make sure the change is intentional.

To Reproduce

use arrow::array::{ArrayRef, Int32Array, RecordBatch};

use datafusion::prelude::*;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();

    let id: ArrayRef = Arc::new(Int32Array::from(vec![0, 1, 2]));
    let value: ArrayRef = Arc::new(Int32Array::from(vec![0, 1, 2]));
    let batch = RecordBatch::try_from_iter(vec![("id", id), ("value", value)]).unwrap();

    ctx.register_batch("tes", batch).unwrap();

    let id: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let value: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_from_iter(vec![("id", id), ("value", value)]).unwrap();

    ctx.register_batch("tes2", batch).unwrap();

    let tes = ctx.table("tes").await.unwrap();
    let tes2 = ctx.table("tes2").await.unwrap();

    // This succeeds (the two tables have different names and so the qualified names of the columns differ)
    let joined = tes
        .clone()
        .join(tes2, JoinType::Full, &["id"], &["id"], None)
        .unwrap();

    joined.show().await.unwrap();

    // This fails with the error:
    //
    // SchemaError(DuplicateQualifiedField { qualifier: Bare { table: "tes" }, name: "id" }, Some(""))
    let tes_clone = tes.clone();
    let joined = tes
        .join(tes_clone, JoinType::Full, &["id"], &["id"], None)
        .unwrap();

    joined.show().await.unwrap();
}

Expected behavior

I would expect both joins to succeed (they did in version 42).

Additional context

In postgres the closest I get is:

CREATE TABLE tes (id int, val int);
INSERT INTO tes (id, val) VALUES (0, 0), (1, 1), (2, 2);
CREATE TABLE tes2 (id int, val int);
INSERT INTO tes2 (id, val) VALUES (1, 1), (2, 2), (3, 3);
SELECT * FROM tes FULL OUTER JOIN tes2 ON tes.id = tes2.id;
SELECT * FROM tes as t1 FULL OUTER JOIN tes as t2 ON t1.id = t2.id;

It's not exactly the same as I have to alias tes. I will mention that my full motivation here is to support a join we do in lance during a merge_insert. We do a full outer join between the existing data (target table) and the new data (source table). Since these tables have the same schema and they are created with SessionContext::read_table they have the same name.

An alternative (and maybe simpler) fix would be to introduce a SessionContext::read_table_with_alias function which takes in an optional table name.

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingregressionSomething that used to work no longer does

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions