Support reading from PyArrow datasets #10

wjones127 · 2022-01-09T00:35:33Z

Given the success of the Datasets + DuckDB integration, a similar integration might be worthwhile in this module.

The datasets API accepts filters and a column subset, and provides an iterator of Arrow record batches. I think that could be wrapped in a TableProvider, though I'm unclear on how predicate pushdown is implemented in DataFusion.


houqp · 2022-01-09T19:27:30Z

Predicate pushdown is supported through an argument to the scan method. The doc you linked is out of date; you should see that argument in the latest version: https://docs.rs/datafusion/latest/datafusion/datasource/datasource/trait.TableProvider.html#tymethod.scan

wjones127 mentioned this issue Jan 30, 2022

Draft PyArrow Dataset reader impl #21

Closed

This was referenced Jul 19, 2022

Implement PyArrow Dataset TableProvider #59

Open

Implement PyArrow Dataset TableProvider apache/datafusion-python#9

Merged
