Feature: allow reading from stdin (ideally with schema inference) #40
This feature request makes sense to me, but I'm not sure I want to hack around arrow-rs when the functionality could be implemented there. Can you link to the issue you filed?
Here we go: apache/arrow-rs#1059. OK, duckdb just lost because it also requires insane amounts of memory. Your implementation is super lean and on track to be the winner; if we can just get it to read from stdin, everything is perfect. Alternatively: what about reading from a compressed file? If we can read from a zstd-compressed file, I'd be happy even without stdin support. That shouldn't be impossible, should it? After all, a compressed file would support seek, wouldn't it?
Thanks for the issue. I'd be curious what the issues with duckdb were. Is it that csv files can't be read from a stream? I don't know whether supporting compressed data helps, since we would still need to decompress in memory. Maybe it would be best to add a wrapper to the reader so that we support streams but can still seek a bit to infer the schema.
Oh, duckdb just seems to read everything into memory and then write out, rather than stream. It's kind of a hack how I'm using it for ETL, so it's not surprising it doesn't work. Why do you think we'd need to decompress in memory? Here's a seekable zstd decompressor that could be used as a wrapper around the compressed file input to present seekable uncompressed bytes to the arrow csv reader: https://docs.rs/zstd-seekable/latest/zstd_seekable/struct.Seekable.html#impl-1
OK, that seekable thing doesn't work, so back to inferring the schema when reading from stdin. One could peek at the first few lines of the stdin reader, collect them into a Vec, turn that into something readable/seekable, and feed it into ReaderBuilder just for the purpose of inferring the schema. Keep the schema, throw away the builder, and set up a reader without automatic inference using the previously inferred schema (see the sketch below). I think that's a bit above my Rust level to implement for the moment.
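Something along these lines might work as an untested sketch, assuming a recent arrow-rs where `arrow::csv::reader::Format::infer_schema` and `ReaderBuilder::new(schema)` exist (method names differ between versions); the buffered prefix is replayed with `Read::chain`, so nothing ever needs `Seek`:

```rust
use std::io::{self, BufRead, Cursor, Read};
use std::sync::Arc;

use arrow::csv::reader::Format;
use arrow::csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = io::stdin();
    let mut input = stdin.lock();

    // Buffer the first ~100 lines so the schema can be inferred without Seek.
    let mut head = String::new();
    for _ in 0..100 {
        if input.read_line(&mut head)? == 0 {
            break;
        }
    }

    // Infer the schema from the buffered sample only, then discard the inference state.
    let format = Format::default().with_header(true);
    let (schema, _) = format.infer_schema(Cursor::new(head.as_bytes()), None)?;
    let schema = Arc::new(schema);

    // Replay the buffered lines, then continue with the rest of stdin.
    let full_input = Cursor::new(head.into_bytes()).chain(input);

    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .build(full_input)?;

    while let Some(batch) = reader.next().transpose()? {
        // Each RecordBatch would be handed to the parquet writer here.
        eprintln!("read batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```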
In the short term you might want to try Unix split with a helper script that writes each chunk to a file in a tmpfs, then call csv2parquet still on a file, but with no real writes to disk or SSD (/tmp is normally a tmpfs). At some point the parquet compressor will need a block of data that is reasonable for the machine's memory anyway. Then hopefully you get the same schema inference on each file part, but in general inference is hit or miss. If I cannot trust the data, I would force strings by using zero sample rows and then deal with the types as late as possible. https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html
As a workaround (for now) I can just do the parquet conversion on a cluster where I have TBs of disk space. It should be possible to feed the 100GB file into csv2parquet that way. Another way we could read from stdin: allow a third input used just for schema inference (sketched below). That could be, say, the first 100 rows of the data to be converted. Then call the non-ReaderBuilder reader with the inferred schema. That schema-inference input could initially come from a file, but also from stdin: collect it into a Vec and do the inference on that block in memory. I'm quite surprised, really, that there's no CLI that streams to parquet; I would have thought it's an obvious use case. Am I missing something?
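For illustration, a rough sketch of that third-input idea, with a hypothetical `sample.tsv` holding the first rows of the data (e.g. produced with `head -n 100 data.tsv > sample.tsv`), again assuming a recent arrow-rs API:

```rust
use std::fs::File;
use std::io;
use std::sync::Arc;

use arrow::csv::reader::Format;
use arrow::csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical sample file used only for schema inference.
    let sample = File::open("sample.tsv")?;
    let format = Format::default().with_header(true).with_delimiter(b'\t');
    let (schema, _) = format.infer_schema(sample, None)?;
    let schema = Arc::new(schema);

    // Read the full data from stdin using the pre-inferred schema; no Seek needed.
    let stdin = io::stdin();
    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .with_delimiter(b'\t')
        .build(stdin.lock())?;

    while let Some(batch) = reader.next().transpose()? {
        eprintln!("{} rows", batch.num_rows());
    }
    Ok(())
}
```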
We could support explicit schema input. Then you could use csv2parquet with -n to infer the schema on a subset and then pass that schema to a second process that works on the whole file. I think that would be a very useful feature by itself. Alternatively, you could add the reader from apache/arrow-rs#1059 (comment) here or in arrow-rs (preferably the latter).
Yes, explicit schema input could be very useful in itself. That way one could accept input from stdin without problems, because with an explicit schema the reader does not require the Seek trait (see the sketch below).
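Roughly what that could look like, with a hypothetical hand-written schema standing in for whatever the user would pass on the command line (again assuming a recent arrow-rs ReaderBuilder API):

```rust
use std::io;
use std::sync::Arc;

use arrow::csv::ReaderBuilder;
use arrow::datatypes::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical explicit schema; in csv2parquet this would come from the user,
    // e.g. via a schema file or command-line option.
    let schema = Arc::new(Schema::new(vec![
        Field::new("strain", DataType::Utf8, true),
        Field::new("date", DataType::Utf8, true),
        Field::new("length", DataType::Int64, true),
    ]));

    // With an explicit schema there is no inference step, so plain stdin
    // (Read, but not Seek) is enough.
    let stdin = io::stdin();
    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .with_delimiter(b'\t')
        .build(stdin.lock())?;

    while let Some(batch) = reader.next().transpose()? {
        // Each RecordBatch would go to the parquet writer here.
        eprintln!("read batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```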
Let's start with that. Can you send a pull request? I can help review and advise if you have questions. Note that we will want the same functionality for the other three libraries linked from the readme as well, so once we are done here we need to copy the functionality over.
Interested too; when downloading datasets from Eurostat you get these HUGE csv files (500MB), but compressed they are much smaller, and it would be nice to be able to do the conversion without decompressing to disk first.
I would like to read 100GB of SARS-CoV-2 sequence data (6M records) from .tsv into .parquet.
Unfortunately, csv2parquet doesn't allow me to provide a stream through stdin (I've tried to hack this in, but the issue was that the csv reader builder requires the Seek trait, which is not available for stdin).
One would have to work around the csv reader limitation by using a non-builder version, which requires an explicit schema. The schema could either be user-supplied or inferred. Inference would be convenient but not trivial to implement.
I've opened a feature request in arrow-rs to ask for this to be implemented upstream, but it will probably not happen for a while.