allow reading from stdin (ideally with schema inference) #3

Closed
domoritz opened this issue Feb 2, 2023 · 0 comments · Fixed by #10
Labels
enhancement New feature or request

Comments

domoritz commented Feb 2, 2023

See domoritz/csv2parquet#40 by @corneliusroemer

corneliusroemer added a commit to corneliusroemer/arrow-tools that referenced this issue Mar 4, 2023
Piped input does not support Seek out of the box, but
Seek is required to infer the schema.
To work around this, we buffer the input if and only if the input file
does not support seeking.
Only the lines actually used to infer the schema
are buffered, so files larger than memory can still be read.
This works because the arrow crate only seeks twice:
1. To check whether seek is supported at the start
2. To reset to the start of the file after schema inference
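
The buffering idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the referenced commit: the name `RewindableReader` is hypothetical, and the sketch assumes the consumer only ever issues a position query (`SeekFrom::Current(0)`) and a single rewind to offset 0.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical wrapper: records every byte read until the first rewind,
/// then replays the recorded prefix and streams the rest without buffering.
struct RewindableReader<R: Read> {
    inner: R,
    buffer: Vec<u8>, // bytes consumed before the rewind
    pos: usize,      // logical position in the stream
    recording: bool, // true until the first rewind
}

impl<R: Read> RewindableReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, buffer: Vec::new(), pos: 0, recording: true }
    }
}

impl<R: Read> Read for RewindableReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // After a rewind, serve the recorded prefix first.
        if self.pos < self.buffer.len() {
            let n = out.len().min(self.buffer.len() - self.pos);
            out[..n].copy_from_slice(&self.buffer[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        // Otherwise read from the underlying stream, recording only
        // while schema inference is still in progress.
        let n = self.inner.read(out)?;
        if self.recording {
            self.buffer.extend_from_slice(&out[..n]);
        }
        self.pos += n;
        Ok(n)
    }
}

impl<R: Read> Seek for RewindableReader<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        match target {
            // Position query: always answerable.
            SeekFrom::Current(0) => Ok(self.pos as u64),
            // One rewind to the start is allowed; it ends recording.
            SeekFrom::Start(0) if self.recording => {
                self.pos = 0;
                self.recording = false;
                Ok(0)
            }
            // Anything else would need data that was never buffered.
            _ => Err(io::Error::new(
                io::ErrorKind::Unsupported,
                "only a single rewind to the start is supported",
            )),
        }
    }
}
```

Because recording stops at the rewind, memory use in this sketch is bounded by the bytes read during inference, not by the total input size.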

The seekable buffer wrapper is only used when necessary

There should be no performance penalty for currently supported
use cases
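
One way to decide when the wrapper is necessary is to probe the opened file for seek support and only fall back to buffering when the probe fails. The sketch below is an assumption about how such a check could look, not the commit's actual logic; `InputKind` and `classify` are hypothetical names. On Unix-like systems, querying the position of a pipe is expected to fail, while a regular file succeeds.

```rust
use std::fs::File;
use std::io::Seek;

/// Hypothetical classification of an opened input.
#[derive(Debug, PartialEq)]
enum InputKind {
    /// Regular file: can be handed to the reader directly.
    Seekable,
    /// Pipe or similar stream: needs the buffering wrapper.
    NeedsBuffering,
}

/// Probe seek support by asking for the current stream position.
/// The probe does not move the file cursor.
fn classify(file: &mut File) -> InputKind {
    if file.stream_position().is_ok() {
        InputKind::Seekable
    } else {
        InputKind::NeedsBuffering
    }
}
```

With a check like this, regular files take the same path as before, which is why no performance penalty is expected for them.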

Use cases:
```sh
cat test.csv | csv2parquet /dev/stdin test.parquet
zstdcat test.csv.zst | csv2parquet /dev/stdin test.parquet
```

Resolves domoritz#3
corneliusroemer added a commit to corneliusroemer/arrow-tools that referenced this issue Mar 5, 2023
feat: refactor SeekableReader into arrow-tools lib crate

Also refactor schema matching to make it less verbose by using map_err
instead of match; see json2parquet for before/after
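
For readers unfamiliar with the pattern, the match-to-map_err refactor looks roughly like this. The example is a generic illustration, not the json2parquet code: `SchemaError` and both function names are hypothetical.

```rust
/// Hypothetical error type standing in for the tool's own error.
#[derive(Debug, PartialEq)]
struct SchemaError(String);

/// Before: an explicit match that only rewraps the error.
fn parse_count_match(s: &str) -> Result<usize, SchemaError> {
    match s.trim().parse::<usize>() {
        Ok(n) => Ok(n),
        Err(e) => Err(SchemaError(format!("invalid count {:?}: {}", s, e))),
    }
}

/// After: map_err converts the error and leaves the Ok value untouched.
fn parse_count_map_err(s: &str) -> Result<usize, SchemaError> {
    s.trim()
        .parse::<usize>()
        .map_err(|e| SchemaError(format!("invalid count {:?}: {}", s, e)))
}
```

Both functions return the same result for any input; the second simply drops the redundant `Ok` arm.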
corneliusroemer added a commit to corneliusroemer/arrow-tools that referenced this issue Mar 6, 2023
domoritz added the enhancement (New feature or request) label Mar 7, 2023