Support apache arrow #45
Comments
Whether and when I work on Arrow support will depend on the fate of this currently-deprecated part of the C API. Based on this PR, it looks like they will remain deprecated. I'm not sure what the long-term plan is.
It seems the Arrow support in the C API has a "DEPRECATION NOTICE", which is distinct from "DEPRECATED". Apparently this means the functions are likely to change, but the functionality itself will likely be preserved. So it seems possible I could expose Arrow support using the current functions, with some confidence that I can preserve that support when the C API changes.
I reviewed the Arrow support in the C API, and, while I believe I can expose it, I'm unsure whether it provides the desired functionality. In particular, I'm unsure it provides access to the binary IPC format. For folks interested in Arrow support: What functionality is useful? What parts of the Arrow C API would you use, and what would you like that's missing from that API?
I mostly need IPC, which I could then read with Flechette or Arrow JS. In most cases I'll send the Arrow data straight over the wire somewhere else. Arrow seems like the ideal format for streaming data from a server to a client. I'm not sure what alternative there is, so this seems like almost essential functionality.
While it's admittedly more awkward than direct API support, have you tried using the It's not very well documented, but the tests are informative.
Oh, fun idea. This should be easy to wrap as well. It would be nice to have something built in, though, to make sure we don't fall into some perf trap with a workaround.
Chiming in: the core feature that is keeping us on the old Node client is the ability to have some kind of iterator over Arrow record batches. Basically, we want to be able to stream data through DuckDB while keeping memory usage low and the types as close as possible, like @domoritz, for subsequent passing across the network. E.g. we have essentially the following case running now.

    import { RecordBatchStreamReader } from 'apache-arrow';

    const stream = db.arrowIPCStream("SELECT * FROM 's3://huggingface-datasets/somebigtable-part*.parquet'");
    const reader = await RecordBatchStreamReader.from(stream);
    for await (const batch of reader) {
      await myUploadFunction(batch);
    }

To be fair, we (@RLesser) have found some problems with this workflow in the Node client, so it may not be easy!
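The low-memory pattern described above (pull one batch, finish sending it, then pull the next) can be sketched independently of any database client. This is only an illustration under stated assumptions: `drainBatches` and `fakeBatches` are hypothetical names, not part of DuckDB or Apache Arrow, and the arrays stand in for real record batches.

```typescript
// Hypothetical helper, not part of any DuckDB or Arrow API.
// Pulls one batch at a time and waits for the sink to finish before
// requesting the next, so this loop never holds more than one batch.
async function drainBatches<T>(
  batches: AsyncIterable<T>,
  sink: (batch: T) => Promise<void>,
): Promise<number> {
  let count = 0;
  for await (const batch of batches) {
    await sink(batch);
    count += 1;
  }
  return count;
}

// Stand-in source: yields small arrays in place of real Arrow record batches.
async function* fakeBatches(): AsyncGenerator<number[]> {
  for (let i = 0; i < 3; i++) {
    yield [i, i + 1];
  }
}

async function main(): Promise<void> {
  const uploaded: number[][] = [];
  const total = await drainBatches(fakeBatches(), async (batch) => {
    uploaded.push(batch); // a real sink would send the batch over the network
  });
  console.log(`uploaded ${total} batches`);
}

main();
```

Because the loop awaits the sink before advancing the iterator, a slow upload naturally slows the producer, provided the underlying source is lazy. Whether a given client's stream behaves lazily is exactly the kind of perf question raised in this thread.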
I'd be curious to see how It does seems that Example:
It's on the roadmap, but I wanted to create an issue for it so I can see when it may be supported. I'm very interested in adopting this package but need Arrow support (I just need access to the binary IPC).