allow reading from stdin (ideally with schema inference) #3
Labels: enhancement (New feature or request)
Comments
corneliusroemer added a commit to corneliusroemer/arrow-tools that referenced this issue on Mar 4, 2023:

Piped input does not support Seek out of the box, and Seek is required to infer the schema. To work around this, we buffer the input iff the input file does not support seek. Only the lines actually used to infer the schema are buffered, which allows reading files larger than memory.

This works because the arrow crate only seeks twice:

1. To check whether seek is supported at the start
2. To reset to the start of the file after schema inference

The seekable buffer wrapper is only used when necessary, so there should be no performance penalty for currently supported use cases.

Use cases:

```sh
cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet
```

Resolves domoritz#3
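The buffering approach described in this commit message could look roughly like the sketch below. This is not the code from the referenced commit: the struct layout, the field names, and the helper `peek_then_read_all` are hypothetical, and the sketch assumes the consumer only ever calls `seek(SeekFrom::Start(0))`, once before reading and once after inference. Bytes read before the rewind are recorded; after the rewind they are replayed, and everything past them streams straight from the underlying reader without being buffered.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper that makes a non-seekable reader rewindable once.
pub struct SeekableReader<R: Read> {
    inner: R,
    buffer: Vec<u8>,   // bytes consumed before the rewind (the inference prefix)
    replay_pos: usize, // next byte of `buffer` to hand out after the rewind
    rewound: bool,     // true once the rewind has happened
}

impl<R: Read> SeekableReader<R> {
    pub fn new(inner: R) -> Self {
        SeekableReader { inner, buffer: Vec::new(), replay_pos: 0, rewound: false }
    }
}

impl<R: Read> Read for SeekableReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // After the rewind, serve the recorded prefix first.
        if self.rewound && self.replay_pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.replay_pos);
            out[..n].copy_from_slice(&self.buffer[self.replay_pos..self.replay_pos + n]);
            self.replay_pos += n;
            return Ok(n);
        }
        let n = self.inner.read(out)?;
        // Record bytes only until the rewind, so memory use is bounded
        // by what the inference pass consumed.
        if !self.rewound {
            self.buffer.extend_from_slice(&out[..n]);
        }
        Ok(n)
    }
}

impl<R: Read> Seek for SeekableReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        match pos {
            // Seek 1: support check before anything was read, nothing to do.
            SeekFrom::Start(0) if self.buffer.is_empty() && !self.rewound => Ok(0),
            // Seek 2: rewind after inference, start replaying the prefix.
            SeekFrom::Start(0) if !self.rewound => {
                self.rewound = true;
                self.replay_pos = 0;
                Ok(0)
            }
            // Assumption: no other seek is ever requested.
            _ => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "only a single rewind to the start is supported",
            )),
        }
    }
}

/// Hypothetical helper: reads `peek` bytes (standing in for schema
/// inference), rewinds, then reads the whole stream.
pub fn peek_then_read_all<R: Read>(input: R, peek: usize) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut reader = SeekableReader::new(input);
    reader.seek(SeekFrom::Start(0))?; // support check
    let mut head = vec![0u8; peek];
    reader.read_exact(&mut head)?; // "inference" consumes a prefix
    reader.seek(SeekFrom::Start(0))?; // rewind
    let mut all = Vec::new();
    reader.read_to_end(&mut all)?;
    Ok((head, all))
}
```

Calling `peek_then_read_all(std::io::stdin().lock(), 4)` would exercise the same path with real piped input. Rejecting every other seek is a deliberate simplification: supporting arbitrary seeks would require buffering the whole stream, which defeats the goal of handling files larger than memory.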
corneliusroemer added a commit to corneliusroemer/arrow-tools that referenced this issue on Mar 5, 2023:

feat: refactor SeekableReader into arrow-tools lib crate

Also refactor schema matching to make it less verbose by using map_err instead of match; see json2parquet for before/after.
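For readers unfamiliar with the `map_err` refactor mentioned in this commit message, here is a generic before/after sketch. It is not taken from json2parquet: the `SchemaError` type and both function names are hypothetical, and integer parsing stands in for the real schema handling.

```rust
/// Hypothetical error type used only for this illustration.
#[derive(Debug, PartialEq)]
pub struct SchemaError(pub String);

/// Before: an explicit `match` that only rewraps the error.
pub fn parse_verbose(s: &str) -> Result<usize, SchemaError> {
    match s.trim().parse::<usize>() {
        Ok(n) => Ok(n),
        Err(e) => Err(SchemaError(format!("invalid value {:?}: {}", s, e))),
    }
}

/// After: `map_err` converts the error and passes `Ok` through untouched.
pub fn parse_concise(s: &str) -> Result<usize, SchemaError> {
    s.trim()
        .parse::<usize>()
        .map_err(|e| SchemaError(format!("invalid value {:?}: {}", s, e)))
}
```

Both functions return the same results for the same input; the second simply removes the arm that forwards `Ok` unchanged.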
corneliusroemer added a commit to corneliusroemer/arrow-tools that referenced this issue on Mar 6, 2023, with the same commit message as the preceding commit.
See domoritz/csv2parquet#40 by @corneliusroemer