Feature: allow reading from stdin (ideally with schema inference) #40
This feature request makes sense to me, but I'm not sure I want to hack around arrow-rs when the functionality could be implemented there. Can you link to the issue you filed?
Here we go: apache/arrow-rs#1059. OK, duckdb just lost because it also requires insane amounts of memory. Your implementation is super lean and on track to be the winner; if we can just get it to read from stdin, everything is perfect. Alternatively: what about reading from a compressed file? If we can read from a zstd-compressed file, I'd be happy even without stdin support. That shouldn't be impossible, should it? After all, a compressed file would support seek, wouldn't it?
Thanks for the issue. I'd be curious what the issues with duckdb were. Is it that csv files can't be read from a stream? I don't know whether supporting compressed data helps, since we would still need to decompress in memory. Maybe it would be best to add a wrapper to the reader so that we support streams but can still seek a bit to infer the schema.
Oh, duckdb just seems to read everything into memory and then write out, rather than stream. It's kind of a hack how I'm using it for ETL, so it's not surprising it doesn't work. Why do you think we'd need to decompress in memory? Here's a seekable zstd decompressor that could be used as a wrapper around the compressed file input to present seekable uncompressed bytes to the arrow csv reader: https://docs.rs/zstd-seekable/latest/zstd_seekable/struct.Seekable.html#impl-1
OK, that seekable thing doesn't work, so back to inferring the schema when reading from stdin. One could peek at the first few lines of the stdin reader, collect them into a Vec, turn that into something readable/seekable, and feed it into ReaderBuilder just for the purpose of inferring the schema. Keep the schema, throw away the builder, and set up a reader without automatic inference using the previously inferred schema (see the sketch below). I think that's a bit above my Rust level to implement for the moment.
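Something along these lines might work as an untested sketch, assuming a recent arrow-rs where `arrow::csv::reader::Format::infer_schema` and `ReaderBuilder::new(schema)` exist (method names differ between versions); the buffered prefix is replayed with `Read::chain`, so nothing ever needs `Seek`:

```rust
use std::io::{self, BufRead, Cursor, Read};
use std::sync::Arc;

use arrow::csv::reader::Format;
use arrow::csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = io::stdin();
    let mut input = stdin.lock();

    // Buffer the first ~100 lines so the schema can be inferred without Seek.
    let mut head = String::new();
    for _ in 0..100 {
        if input.read_line(&mut head)? == 0 {
            break;
        }
    }

    // Infer the schema from the buffered sample only, then discard the inference state.
    let format = Format::default().with_header(true);
    let (schema, _) = format.infer_schema(Cursor::new(head.as_bytes()), None)?;
    let schema = Arc::new(schema);

    // Replay the buffered lines, then continue with the rest of stdin.
    let full_input = Cursor::new(head.into_bytes()).chain(input);

    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .build(full_input)?;

    while let Some(batch) = reader.next().transpose()? {
        // Each RecordBatch would be handed to the parquet writer here.
        eprintln!("read batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```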
In the short term you might want to try Unix split with a helper script that writes each chunk to a file in a tmpfs, then call csv2parquet still on a file, but with no real writes to disk or SSD (/tmp is normally a tmpfs). At some point the parquet compressor will need a block of data that is reasonable for the machine's memory anyway. Then hopefully you get the same schema inference on each file part, but in general inference is hit or miss. If I cannot trust the data, I would force strings by using zero sample rows and then deal with the types as late as possible. https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html
As a workaround (for now) I can just do the parquet conversion on a cluster where I have TBs of disk space. It should be possible to feed the 100GB file into csv2parquet that way. Another way we could read from stdin: allow a third input used just for schema inference (sketched below). That could be, say, the first 100 rows of the data to be converted. Then call the non-ReaderBuilder reader with the inferred schema. That schema-inference input could initially come from a file, but also from stdin: collect it into a Vec and do the inference on that block in memory. I'm quite surprised, really, that there's no CLI that streams to parquet; I would have thought it's an obvious use case. Am I missing something?
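For illustration, a rough sketch of that third-input idea, with a hypothetical `sample.tsv` holding the first rows of the data (e.g. produced with `head -n 100 data.tsv > sample.tsv`), again assuming a recent arrow-rs API:

```rust
use std::fs::File;
use std::io;
use std::sync::Arc;

use arrow::csv::reader::Format;
use arrow::csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical sample file used only for schema inference.
    let sample = File::open("sample.tsv")?;
    let format = Format::default().with_header(true).with_delimiter(b'\t');
    let (schema, _) = format.infer_schema(sample, None)?;
    let schema = Arc::new(schema);

    // Read the full data from stdin using the pre-inferred schema; no Seek needed.
    let stdin = io::stdin();
    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .with_delimiter(b'\t')
        .build(stdin.lock())?;

    while let Some(batch) = reader.next().transpose()? {
        eprintln!("{} rows", batch.num_rows());
    }
    Ok(())
}
```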
We could support explicit schema input. Then you could use csv2parquet with -n to infer the schema on a subset and then pass that schema to a second process that works on the whole file. I think that would be a very useful feature by itself. Alternatively, you could add the reader from apache/arrow-rs#1059 (comment) here or in arrow-rs (preferably the latter).
Yes, explicit schema input could be very useful in itself. That way one could accept input from stdin without problems, because with an explicit schema the reader does not require the Seek trait (see the sketch below).
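Roughly what that could look like, with a hypothetical hand-written schema standing in for whatever the user would pass on the command line (again assuming a recent arrow-rs ReaderBuilder API):

```rust
use std::io;
use std::sync::Arc;

use arrow::csv::ReaderBuilder;
use arrow::datatypes::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical explicit schema; in csv2parquet this would come from the user,
    // e.g. via a schema file or command-line option.
    let schema = Arc::new(Schema::new(vec![
        Field::new("strain", DataType::Utf8, true),
        Field::new("date", DataType::Utf8, true),
        Field::new("length", DataType::Int64, true),
    ]));

    // With an explicit schema there is no inference step, so plain stdin
    // (Read, but not Seek) is enough.
    let stdin = io::stdin();
    let mut reader = ReaderBuilder::new(schema)
        .with_header(true)
        .with_delimiter(b'\t')
        .build(stdin.lock())?;

    while let Some(batch) = reader.next().transpose()? {
        // Each RecordBatch would go to the parquet writer here.
        eprintln!("read batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```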
Let's start with that. Can you send a pull request? I can help review and advise if you have questions. Note that we will want the same functionality for the other three libraries linked from the readme as well, so once we are done here we need to copy the functionality over.
Interested too; when downloading datasets from Eurostat you get these HUGE csv files (500MB), but compressed they are much smaller, and it would be nice to be able to do the conversion without decompressing to disk first.
I would like to read 100GB of SARS-CoV-2 sequence data (6M records) from .tsv into .parquet.
Unfortunately, csv2parquet doesn't allow me to provide a stream through stdin (I've tried to hack this in, but the issue was that the csv reader builder requires the Seek trait, which is not available for stdin).
One would have to work around the csv reader limitation by using a non-builder version, which requires an explicit schema. The schema could either be user-supplied or inferred. Inference would be convenient but not trivial to implement.
I've opened a feature request in arrow-rs to ask for this to be implemented upstream, but it will probably not happen for a while.