EPIC: spark connect #3581

universalmind303 · 2024-12-16T20:17:58Z

spark connect

distributed execution

for distributed execution we need a ray runner that we can call from rust

create rust based shim around our existing python ray runner

We might need this?

move DaftContext into rust (this one should be relatively easy)

compatibility/interop

some of the text based methods (printSchema, show, explain) should have spark compatible output.

modify to_comfy_table so it can produce a spark compatible df output.
alternative display implementation for Schema that matches spark's
create a new TreeDisplay implementation that somewhat matches spark's plans
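
For reference, a rough sketch of the text layout that show and printSchema produce in spark, written as plain Python rather than against to_comfy_table or our Schema type. The function names, the `null_text` parameter and the flat (name, type, nullable) schema tuples are made up for illustration; null rendering differs between spark versions, and nested types, wide characters and the "only showing top N rows" footer are not handled here.

```python
from typing import Sequence, Tuple


def spark_show_string(
    columns: Sequence[str],
    rows: Sequence[Sequence[object]],
    truncate: int = 20,
    null_text: str = "NULL",  # assumption: the null token varies by spark version
) -> str:
    """Render rows in the bordered, right-aligned layout used by DataFrame.show()."""

    def cell(value: object) -> str:
        text = null_text if value is None else str(value)
        if truncate > 0 and len(text) > truncate:
            # Long cells are cut and marked with "..." when there is room for it.
            text = text[:truncate] if truncate < 4 else text[: truncate - 3] + "..."
        return text

    table = [list(columns)] + [[cell(v) for v in row] for row in rows]
    # Each column is at least 3 characters wide.
    widths = [max(3, *(len(r[i]) for r in table)) for i in range(len(columns))]
    border = "+" + "+".join("-" * w for w in widths) + "+"

    def line(r: Sequence[str]) -> str:
        # With truncation enabled, cells are left-padded (right-aligned).
        return "|" + "|".join(c.rjust(w) for c, w in zip(r, widths)) + "|"

    out = [border, line(table[0]), border]
    out.extend(line(r) for r in table[1:])
    out.append(border)
    return "\n".join(out)


def spark_schema_string(fields: Sequence[Tuple[str, str, bool]]) -> str:
    """Render a flat schema in the tree layout used by DataFrame.printSchema()."""
    lines = ["root"]
    for name, dtype, nullable in fields:
        lines.append(f" |-- {name}: {dtype} (nullable = {str(nullable).lower()})")
    return "\n".join(lines)


print(spark_show_string(["id", "name"], [[1, "alice"], [2, "bob"]]))
print(spark_schema_string([("id", "long", True), ("name", "string", True)]))
```

Something like this could serve as the expected output in compatibility tests while the real display implementations are being built.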

pyspark.sql.DataFrame

pyspark.sql.Catalog

TODO (I don't think this is stabilized in spark connect yet)

pyspark.sql.functions

see https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html for list of functions

UDFs

spark UDFs should be mappable to our UDFs. They use a very similar pickling approach, and we'll likely just need to use their deserializer to deserialize them back into python. Likely a bit more discovery needed.

UX/DX

Documentation

Quick start guide for spark connect via daft (daft connect)
Distributed computing guide for daft connect
- this should include a guide for how to set up a cluster and connect to it via spark.
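
As a starting point for the quick start guide, the client side could look roughly like the sketch below. It only uses the standard pyspark Spark Connect entry point (`SparkSession.builder.remote`); the helper name and the address are placeholders, and it assumes a daft connect server is already listening at that URL. How that server gets started is not shown, since that API isn't settled in this issue.

```python
def connect_to_daft(url: str = "sc://localhost:15002"):
    """Open a Spark Connect session against a (hypothetical) daft connect endpoint.

    The default URL is a placeholder: 15002 is the usual Spark Connect port,
    not necessarily what a daft connect server would listen on.
    """
    # Imported lazily so the sketch can be read without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()
```

With a server running, `spark = connect_to_daft()` should return a regular pyspark session, and DataFrame calls made through it would be executed by daft to the extent the corresponding operations are supported.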

Issue Tracking

Upstream Spark issues


jaychia · 2024-12-16T22:34:17Z

WRT to the catalogs, @universalmind303 what do you think of starting to unify around the DaftMetaCatalog that I introduced in #3036?

I think we have a few competing standards atm (including the SQLCatalog). It could be good to start having a catalog abstraction that can be shared across our different frontends

universalmind303 · 2024-12-16T23:29:22Z

> WRT to the catalogs, @universalmind303 what do you think of starting to unify around the DaftMetaCatalog that I introduced in #3036?
>
> I think we have a few competing standards atm (including the SQLCatalog). It could be good to start having a catalog abstraction that can be shared across our different frontends

yes, that is something I want to do and have been thinking about. I'll open up an issue to unify daft.catalog and daft.sql.catalog as well

universalmind303 added the enhancement and epic labels Dec 16, 2024

universalmind303 mentioned this issue Dec 16, 2024

sql: unify daft.catalog and daft.sql.catalog #3586

Open
