This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

Question - Can datafusion-python be used without pyarrow? #22

Open
matthewmturner opened this issue Feb 12, 2022 · 2 comments

Comments

@matthewmturner
Contributor

matthewmturner commented Feb 12, 2022

I feel odd even asking this - but is it possible to make enhancements so that datafusion-python can be used without pyarrow? pyarrow is fantastic and I already use it, but it is fairly large, which makes it somewhat painful to deploy for some serverless use cases (such as on AWS Lambda). If I am able to do everything I need in datafusion, is there a need for pyarrow? I confess I'm not very familiar with the interface between rust / datafusion and python / arrow, so hopefully this isn't too stupid of a question.

thx!
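
For anyone wanting to put a number on the deployment cost, here is a minimal sketch that sums the on-disk size of an installed package directory. The helper names `dir_size_bytes` and `installed_package_size` are made up for illustration, and the figure only covers files inside the package's own directories, so treat it as a rough estimate rather than the exact size of a Lambda bundle.

```python
import importlib.util
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def installed_package_size(name):
    """Return the on-disk size of an importable package, or None.

    None is returned when the package is not installed or when the name
    refers to a single-file module rather than a package directory.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return sum(dir_size_bytes(p) for p in spec.submodule_search_locations)


if __name__ == "__main__":
    size = installed_package_size("pyarrow")
    if size is None:
        print("pyarrow is not installed")
    else:
        print(f"pyarrow: {size / 1e6:.1f} MB on disk")
```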

@wjones127

I think it might be possible; a good portion of the module doesn't require PyArrow. The only things that do are UDFs, UDAFs, and the parts of the DataFrame API that return PyArrow data structures (like collect() and schema()). Does a datafusion-python without those features sound appealing?
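
As a rough sketch of what gating those features could look like on the Python side: the import can be deferred until a PyArrow-dependent feature is actually called. This is not how datafusion-python is implemented; `optional_import` and `collect_as_arrow` are hypothetical names used only to illustrate the pattern.

```python
import importlib


def optional_import(module_name, feature):
    """Import module_name on demand, or raise an error that names the feature."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'"
        ) from exc


def collect_as_arrow(columns):
    """Hypothetical stand-in for a method that returns PyArrow structures.

    `columns` is a dict mapping column names to lists of values.
    """
    # pyarrow is only imported here, so code paths that never call this
    # function work without pyarrow being installed.
    pa = optional_import("pyarrow", "collect()")
    return pa.table(columns)
```

With this pattern, everything that does not touch PyArrow keeps working in a slim deployment, and a call that does need it fails with a message saying which dependency is missing.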

@matthewmturner
Contributor Author

Cool - that was what it looked like to me as well from my scan of the code. IMHO in the medium term it would be nice to have pyarrow as an optional feature. I think datafusion should have some improvements on the IO front before enabling this, though (I'm looking into / working on writing capabilities in apache/datafusion#1777). Right now I think pyarrow has more functionality there, which is useful.
