|
1 |
| -# datafusion-orc |
2 |
| -Implementation of ORC file format read/write with Arrow in-memory format |
3 |
| - |
4 | 1 | [](https://github.com/datafusion-contrib/datafusion-orc/actions/workflows/ci.yml)
|
5 | 2 | [](https://codecov.io/gh/WenyXu/orc-rs)
|
6 | 3 | [](https://crates.io/crates/orc-rust)
|
7 | 4 | [](https://crates.io/crates/orc-rust)
|
8 | 5 |
|
9 |
| -Read [Apache ORC](https://orc.apache.org/) in Rust. |
10 |
| - |
11 |
| -* Read ORC files |
12 |
| -* Read stripes (the conversion from proto metadata to memory regions) |
13 |
| -* Decode stripes (the math of decode stripes into e.g. booleans, runs of RLE, etc.) |
14 |
| -* Decode ORC data to [Arrow Datatypes](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html) (Async/Sync) |
15 |
| - |
16 |
| - |
17 |
| -## Current Support |
18 |
| - |
19 |
| -| Column Encoding | Read | Write | Arrow DataType | |
20 |
| -| ------------------------- | ---- | ----- | -------------------------- | |
21 |
| -| SmallInt, Int, BigInt | ✓ | | Int16, Int32, Int64 | |
22 |
| -| Float, Double | ✓ | | Float32, Float64 | |
23 |
| -| String, Char, and VarChar | ✓ | | Utf8 | |
24 |
| -| Boolean | ✓ | | Boolean | |
25 |
| -| TinyInt | ✓ | | Int8 | |
26 |
| -| Binary | ✓ | | Binary | |
27 |
| -| Decimal | ✓ | | Decimal128 | |
28 |
| -| Date | ✓ | | Date32 | |
29 |
| -| Timestamp | ✓ | | Timestamp(Nanosecond,_) | |
30 |
| -| Timestamp instant | ✓ | | Timestamp(Nanosecond, UTC) | |
31 |
| -| Struct | ✓ | | Struct | |
32 |
| -| List | ✓ | | List | |
33 |
| -| Map | ✓ | | Map | |
34 |
| -| Union | ✓ | | Union(_, Sparse) | |
35 |
| - |
36 |
| -## Compression Support |
37 |
| - |
38 |
| -| Compression | Read | Write | |
39 |
| -| ----------- | ---- | ----- | |
40 |
| -| None | ✓ | ✗ | |
41 |
| -| ZLIB | ✓ | ✗ | |
42 |
| -| SNAPPY | ✓ | ✗ | |
43 |
| -| LZO | ✓ | ✗ | |
44 |
| -| LZ4 | ✓ | ✗ | |
45 |
| -| ZSTD | ✓ | ✗ | |
46 |
| - |
47 |
| -## Benchmark |
48 |
| - |
49 |
| -Run `cargo bench` for simple benchmarks. |
| 6 | +# orc-rust |
| 7 | + |
| 8 | +A native Rust implementation of the [Apache ORC](https://orc.apache.org) file format, |
| 9 | +providing APIs to read data into [Apache Arrow](https://arrow.apache.org) in-memory arrays. |
| 10 | + |
| 11 | +See the [documentation](https://docs.rs/orc-rust/latest/orc_rust/) for examples of how to use this crate. |
| 12 | + |
| 13 | +## Supported features |
| 14 | + |
| 15 | +This crate currently only supports reading ORC files into Arrow arrays. Write support is planned |
| 16 | +(see [Roadmap](#roadmap)). The features listed below relate only to reading ORC files. |
| 17 | +At this time, we aim to support the [ORCv1](https://orc.apache.org/specification/ORCv1/) specification only. |
| 18 | + |
| 19 | +- Read synchronously & asynchronously (using Tokio) |
| 20 | +- All compression types (Zlib, Snappy, Lzo, Lz4, Zstd) |
| 21 | +- All ORC data types |
| 22 | +- All encodings |
| 23 | +- Rudimentary support for retrieving statistics |
| 24 | +- Retrieving user metadata into Arrow schema metadata |
| 25 | + |
| 26 | +## Roadmap |
| 27 | + |
| 28 | +The long-term vision for this crate is to be feature-complete enough to be donated to the |
| 29 | +[arrow-rs](https://github.com/apache/arrow-rs) project. |
| 30 | + |
| 31 | +The following is a rough roadmap of features to be implemented, from highest to lowest priority. |
| 32 | + |
| 33 | +- Performance enhancements |
| 34 | +- DataFusion integration |
| 35 | +- Predicate pushdown |
| 36 | +- Row indices |
| 37 | +- Bloom filters |
| 38 | +- Write from Arrow arrays |
| 39 | +- Encryption |
| 40 | + |
| 41 | +A non-Arrow API is not planned at the moment. Feel free to raise an issue if there is such |
| 42 | +a use case. |
| 43 | + |
| 44 | +## Version compatibility |
| 45 | + |
| 46 | +No guarantees are provided about stability across versions. We will endeavour to keep the top-level APIs |
| 47 | +(`ArrowReader` and `ArrowStreamReader`) as stable as we can, but other APIs may change as we |
| 48 | +explore the interface we want the library to expose. |
| 49 | + |
| 50 | +Versions will be released on an ad-hoc basis (with no fixed schedule). |
| 51 | + |
| 52 | +## Mapping ORC types to Arrow types |
| 53 | + |
| 54 | +The following table lists how ORC data types are read into Arrow data types: |
| 55 | + |
| 56 | +| ORC Data Type     | Arrow Data Type             | |
| 57 | +| ----------------- | --------------------------- | |
| 58 | +| Boolean           | Boolean                     | |
| 59 | +| TinyInt           | Int8                        | |
| 60 | +| SmallInt          | Int16                       | |
| 61 | +| Int               | Int32                       | |
| 62 | +| BigInt            | Int64                       | |
| 63 | +| Float             | Float32                     | |
| 64 | +| Double            | Float64                     | |
| 65 | +| String            | Utf8                        | |
| 66 | +| Char              | Utf8                        | |
| 67 | +| VarChar           | Utf8                        | |
| 68 | +| Binary            | Binary                      | |
| 69 | +| Decimal           | Decimal128                  | |
| 70 | +| Date              | Date32                      | |
| 71 | +| Timestamp         | Timestamp(Nanosecond, None) | |
| 72 | +| Timestamp instant | Timestamp(Nanosecond, UTC)  | |
| 73 | +| Struct            | Struct                      | |
| 74 | +| List              | List                        | |
| 75 | +| Map               | Map                         | |
| 76 | +| Union             | Union(_, Sparse)            | |
| 77 | + |
| 78 | +## Contributing |
| 79 | + |
| 80 | +All contributions are welcome! Feel free to raise an issue if you have a feature request, bug report, |
| 81 | +or a question. You can also open a Pull Request without raising an issue first, as long as the Pull |
| 82 | +Request is descriptive enough. |
| 83 | + |
| 84 | +In addition to the standard `cargo` tooling, we use the following tools, which need to be installed separately: |
| 85 | + |
| 86 | +- [taplo](https://taplo.tamasfe.dev/) |
| 87 | +- [typos](https://crates.io/crates/typos) |
| 88 | + |
| 89 | +```shell |
| 90 | +cargo install typos-cli |
| 91 | +cargo install taplo-cli |
| 92 | +``` |
| 93 | + |
| 94 | +```shell |
| 95 | +# Building the crate |
| 96 | +cargo build |
| 97 | + |
| 98 | +# Running the test suite |
| 99 | +cargo test |
| 100 | + |
| 101 | +# Simple benchmarks |
| 102 | +cargo bench |
| 103 | + |
| 104 | +# Formatting TOML files |
| 105 | +taplo format |
| 106 | + |
| 107 | +# Detect any typos in the codebase |
| 108 | +typos |
| 109 | +``` |
| 110 | + |
| 111 | +To regenerate/update the [proto.rs](src/proto.rs) file, execute the [regen.sh](regen.sh) script. |
| 112 | + |
| 113 | +```shell |
| 114 | +./regen.sh |
| 115 | +``` |
50 | 116 |
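The "Mapping ORC types to Arrow types" table added in this diff can be read as a simple lookup. The following is only an illustrative sketch of that table in plain Rust: `OrcType` and `arrow_type_name` are hypothetical names invented here, not part of the orc-rust API, and the Arrow types are represented as display strings rather than real `arrow` `DataType` values.

```rust
/// Hypothetical enum listing the ORC data types from the README table.
/// This is NOT the crate's own type; it exists only for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OrcType {
    Boolean,
    TinyInt,
    SmallInt,
    Int,
    BigInt,
    Float,
    Double,
    String,
    Char,
    VarChar,
    Binary,
    Decimal,
    Date,
    Timestamp,
    TimestampInstant,
    Struct,
    List,
    Map,
    Union,
}

/// Returns the Arrow data type name that the README table pairs with
/// each ORC data type (hypothetical helper, strings copied from the table).
#[allow(dead_code)]
fn arrow_type_name(orc: OrcType) -> &'static str {
    match orc {
        OrcType::Boolean => "Boolean",
        OrcType::TinyInt => "Int8",
        OrcType::SmallInt => "Int16",
        OrcType::Int => "Int32",
        OrcType::BigInt => "Int64",
        OrcType::Float => "Float32",
        OrcType::Double => "Float64",
        // All three ORC character types are read into the same Arrow type.
        OrcType::String | OrcType::Char | OrcType::VarChar => "Utf8",
        OrcType::Binary => "Binary",
        OrcType::Decimal => "Decimal128",
        OrcType::Date => "Date32",
        // Plain timestamps carry no time zone; instants are tagged as UTC.
        OrcType::Timestamp => "Timestamp(Nanosecond, None)",
        OrcType::TimestampInstant => "Timestamp(Nanosecond, UTC)",
        OrcType::Struct => "Struct",
        OrcType::List => "List",
        OrcType::Map => "Map",
        OrcType::Union => "Union(_, Sparse)",
    }
}
```

For instance, `arrow_type_name(OrcType::VarChar)` yields `"Utf8"`. Using an exhaustive `match` means the compiler flags any ORC type added to the enum without a corresponding mapping.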
|