allow reading from stdin (ideally with schema inference) #3

domoritz · 2023-02-02T18:26:52Z

See domoritz/csv2parquet#40 by @corneliusroemer

Piped input does not support Seek out of the box, and Seek is required to infer the schema.

To work around this, we buffer the input only if the input file does not support seek. Only the lines actually used to infer the schema are buffered, so files larger than memory can still be read.

This works because the arrow crate only seeks twice:

1. To check whether seek is supported at the start
2. To reset to the start of the file after schema inference

The seekable buffer wrapper is only used when necessary, so there should be no performance penalty for currently supported use cases.

Use cases:

```sh
cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet
```

Resolves domoritz#3
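The buffering idea described above can be sketched roughly as follows. This is not the code from the PR: `BufferedRewind` and its fields are hypothetical names, and the sketch assumes both seeks arrive as `SeekFrom::Start(0)` (how the arrow crate actually issues them is not shown in this thread). Bytes are kept only until the first rewind, after which reads replay the buffer and then pass straight through to the underlying stream.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper that lets a non-seekable reader be rewound once.
struct BufferedRewind<R: Read> {
    inner: R,
    buffer: Vec<u8>, // bytes read before the rewind (the schema-inference prefix)
    pos: usize,      // read position within `buffer`
    buffering: bool, // true until the rewind after inference
}

impl<R: Read> BufferedRewind<R> {
    fn new(inner: R) -> Self {
        Self { inner, buffer: Vec::new(), pos: 0, buffering: true }
    }
}

impl<R: Read> Read for BufferedRewind<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // After a rewind, replay the buffered prefix first.
        if self.pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        // Otherwise read from the underlying stream.
        let n = self.inner.read(out)?;
        if self.buffering {
            // Still in the inference phase: remember what was read.
            self.buffer.extend_from_slice(&out[..n]);
            self.pos += n;
        }
        Ok(n)
    }
}

impl<R: Read> Seek for BufferedRewind<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        match target {
            // Assumption: the probe and the reset are both seeks to offset 0.
            SeekFrom::Start(0) if self.buffering => {
                // A seek before any read is treated as the initial probe (no-op).
                if !self.buffer.is_empty() {
                    self.buffering = false; // stop buffering after the reset
                }
                self.pos = 0;
                Ok(0)
            }
            _ => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "only a single rewind to the start is supported",
            )),
        }
    }
}
```

Because buffering stops at the rewind, memory use is bounded by however much the inference step read, not by the size of the input.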

feat: refactor SeekableReader into arrow-tools lib crate

Also refactor schema matching to make it less verbose by using map_err instead of match; see json2parquet for before/after.
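For readers unfamiliar with the `map_err` refactor mentioned above, here is a generic before/after sketch. It is not the schema-matching code from json2parquet: `ConvertError`, `parse_limit_match`, and `parse_limit_map_err` are hypothetical names used only to show the pattern.

```rust
/// Hypothetical error type standing in for the tool's own error.
#[derive(Debug, PartialEq)]
struct ConvertError(String);

/// Before: an explicit match that only rewrites the error arm.
fn parse_limit_match(raw: &str) -> Result<usize, ConvertError> {
    match raw.parse::<usize>() {
        Ok(n) => Ok(n),
        Err(e) => Err(ConvertError(format!("invalid limit {raw:?}: {e}"))),
    }
}

/// After: map_err converts the error and leaves the Ok value untouched.
fn parse_limit_map_err(raw: &str) -> Result<usize, ConvertError> {
    raw.parse::<usize>()
        .map_err(|e| ConvertError(format!("invalid limit {raw:?}: {e}")))
}
```

Both functions return the same results; the second simply drops the pass-through `Ok` arm.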

corneliusroemer mentioned this issue Mar 4, 2023

feat: Allow reading from stdin with schema inference #10

Merged

domoritz added the enhancement New feature or request label Mar 7, 2023

domoritz closed this as completed in #10 Apr 12, 2023

felix-hh mentioned this issue Oct 10, 2023

Add example to documentation #57

Closed
