Commit 1ff6b32

Update documentation and cleanup root level files
1 parent 95998fa commit 1ff6b32

8 files changed: 132 additions, 57 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ jobs:
 - name: Install taplo
 run: cargo install taplo-cli --version ^0.8 --locked
 - name: Run taplo
-run: taplo format --check --option "indent_string= "
+run: taplo format --check

 fmt:
 name: Rustfmt

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-/target
+**/target
 Cargo.lock

 venv

Cargo.toml

Lines changed: 4 additions & 9 deletions
@@ -1,21 +1,16 @@
-[workspace]
-members = ["gen"]
-
 [package]
-name = "datafusion-orc"
-version = "0.2.43"
+name = "orc-rust"
+version = "0.3.0"
 edition = "2021"
 homepage = "https://github.com/datafusion-contrib/datafusion-orc"
 repository = "https://github.com/datafusion-contrib/datafusion-orc"
-authors = ["Weny <[email protected]>"]
+authors = ["Weny <[email protected]>", "Jeffrey Vo <[email protected]>"]
 license = "Apache-2.0"
-description = "Implementation of ORC file format"
+description = "Implementation of Apache ORC file format using Apache Arrow in-memory format"
 keywords = ["arrow", "orc", "arrow-rs", "datafusion"]
 include = ["src/**/*.rs", "Cargo.toml"]
 rust-version = "1.70"

-# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
-
 [dependencies]
 arrow = { version = "50", features = ["prettyprint"] }
 bytes = "1.4"

README.md

Lines changed: 110 additions & 44 deletions
@@ -1,50 +1,116 @@
-# datafusion-orc
-Implementation of ORC file format read/write with Arrow in-memory format
-
 [![test](https://github.com/datafusion-contrib/datafusion-orc/actions/workflows/ci.yml/badge.svg)](https://github.com/datafusion-contrib/datafusion-orc/actions/workflows/ci.yml)
 [![codecov](https://codecov.io/gh/WenyXu/orc-rs/branch/main/graph/badge.svg?token=2CSHZX02XM)](https://codecov.io/gh/WenyXu/orc-rs)
 [![Crates.io](https://img.shields.io/crates/v/orc-rust)](https://crates.io/crates/orc-rust)
 [![Crates.io](https://img.shields.io/crates/d/orc-rust)](https://crates.io/crates/orc-rust)

-Read [Apache ORC](https://orc.apache.org/) in Rust.
-
-* Read ORC files
-* Read stripes (the conversion from proto metadata to memory regions)
-* Decode stripes (the math of decode stripes into e.g. booleans, runs of RLE, etc.)
-* Decode ORC data to [Arrow Datatypes](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html) (Async/Sync)
-
-
-## Current Support
-
-| Column Encoding | Read | Write | Arrow DataType |
-| ------------------------- | ---- | ----- | -------------------------- |
-| SmallInt, Int, BigInt || | Int16, Int32, Int64 |
-| Float, Double || | Float32, Float64 |
-| String, Char, and VarChar || | Utf8 |
-| Boolean || | Boolean |
-| TinyInt || | Int8 |
-| Binary || | Binary |
-| Decimal || | Decimal128 |
-| Date || | Date32 |
-| Timestamp || | Timestamp(Nanosecond,_) |
-| Timestamp instant || | Timestamp(Nanosecond, UTC) |
-| Struct || | Struct |
-| List || | List |
-| Map || | Map |
-| Union || | Union(_, Sparse) |
-
-## Compression Support
-
-| Compression | Read | Write |
-| ----------- | ---- | ----- |
-| None |||
-| ZLIB |||
-| SNAPPY |||
-| LZO |||
-| LZ4 |||
-| ZSTD |||
-
-## Benchmark
-
-Run `cargo bench` for simple benchmarks.
+# orc-rust
+
+A native Rust implementation of the [Apache ORC](https://orc.apache.org) file format,
+providing API's to read data into [Apache Arrow](https://arrow.apache.org) in-memory arrays.
+
+See the [documentation](https://docs.rs/orc-rust/latest/orc_rust/) for examples on how to use this crate.
+
+## Supported features
+
+This crate currently only supports reading ORC files into Arrow arrays. Write support is planned
+(see [Roadmap](#roadmap)). The below features listed relate only to reading ORC files.
+At this time, we aim to support the [ORCv1](https://orc.apache.org/specification/ORCv1/) specification only.
+
+- Read synchronously & asynchronously (using Tokio)
+- All compression types (Zlib, Snappy, Lzo, Lz4, Zstd)
+- All ORC data types
+- All encodings
+- Rudimentary support for retrieving statistics
+- Retrieving user metadata into Arrow schema metadata
+
+## Roadmap
+
+The long term vision for this crate is to be feature complete enough to be donated to the
+[arrow-rs](https://github.com/apache/arrow-rs) project.
+
+The following lists the rough roadmap for features to be implemented, from highest to lowest priority.
+
+- Performance enhancements
+- DataFusion integration
+- Predicate pushdown
+- Row indices
+- Bloom filters
+- Write from Arrow arrays
+- Encryption
+
+A non-Arrow API interface is not planned at the moment. Feel free to raise an issue if there is such
+a use case.
+
+## Version compatibility
+
+No guarantees are provided about stability across versions. We will endeavour to keep the top level API's
+(`ArrowReader` and `ArrowStreamReader`) as stable as we can, but other API's provided may change as we
+explore the interface we want the library to expose.
+
+Versions will be released on an ad-hoc basis (with no fixed schedule).
+
+## Mapping ORC types to Arrow types
+
+The following table lists how ORC data types are read into Arrow data types:
+
+| ORC Data Type | Arrow Data Type |
+| ----------------- | -------------------------- |
+| Boolean | Boolean |
+| TinyInt | Int8 |
+| SmallInt | Int16 |
+| Int | Int32 |
+| BigInt | Int64 |
+| Float | Float32 |
+| Double | Float64 |
+| String | Utf8 |
+| Char | Utf8 |
+| VarChar | Utf8 |
+| Binary | Binary |
+| Decimal | Decimal128 |
+| Date | Date32 |
+| Timestamp | Timestamp(Nanosecond, None) |
+| Timestamp instant | Timestamp(Nanosecond, UTC) |
+| Struct | Struct |
+| List | List |
+| Map | Map |
+| Union | Union(_, Sparse) |
+
+## Contributing
+
+All contributions are welcome! Feel free to raise an issue if you have a feature request, bug report,
+or a question. Feel free to raise a Pull Request without raising an issue first, as long as the Pull
+Request is descriptive enough.
+
+Some tools we use in addition to the standard `cargo` that require installation are:
+
+- [taplo](https://taplo.tamasfe.dev/)
+- [typos](https://crates.io/crates/typos)
+
+```shell
+cargo install typos-cli
+cargo install taplo-cli
+```
+
+```shell
+# Building the crate
+cargo build
+
+# Running the test suite
+cargo test
+
+# Simple benchmarks
+cargo bench
+
+# Formatting TOML files
+taplo format
+
+# Detect any typos in the codebase
+typos
+```
+
+To regenerate/update the [proto.rs](src/proto.rs) file, execute the [regen.sh](regen.sh) script.
+
+```shell
+./regen.sh
+```

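
Review note: the ORC-to-Arrow table added to the README in this commit can be read as a simple lookup. The sketch below restates that table as a plain Rust `match`, using only the standard library. The `OrcType` enum and the `arrow_type_name` function are hypothetical names made up for illustration; they are not part of the crate's API, and the returned strings are just the labels from the table, not real Arrow `DataType` values.

```rust
/// ORC data types as listed in the README mapping table.
/// Hypothetical enum for illustration only, not the crate's own type.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OrcType {
    Boolean,
    TinyInt,
    SmallInt,
    Int,
    BigInt,
    Float,
    Double,
    String,
    Char,
    VarChar,
    Binary,
    Decimal,
    Date,
    Timestamp,
    TimestampInstant,
    Struct,
    List,
    Map,
    Union,
}

/// Returns the Arrow data type label that the README table gives for an ORC type.
fn arrow_type_name(orc: OrcType) -> &'static str {
    match orc {
        OrcType::Boolean => "Boolean",
        OrcType::TinyInt => "Int8",
        OrcType::SmallInt => "Int16",
        OrcType::Int => "Int32",
        OrcType::BigInt => "Int64",
        OrcType::Float => "Float32",
        OrcType::Double => "Float64",
        // All three ORC string-like types map to the same Arrow type.
        OrcType::String | OrcType::Char | OrcType::VarChar => "Utf8",
        OrcType::Binary => "Binary",
        OrcType::Decimal => "Decimal128",
        OrcType::Date => "Date32",
        // The two timestamp flavours differ only in the timezone component.
        OrcType::Timestamp => "Timestamp(Nanosecond, None)",
        OrcType::TimestampInstant => "Timestamp(Nanosecond, UTC)",
        OrcType::Struct => "Struct",
        OrcType::List => "List",
        OrcType::Map => "Map",
        OrcType::Union => "Union(_, Sparse)",
    }
}

fn main() {
    for orc in [OrcType::Char, OrcType::Timestamp, OrcType::TimestampInstant] {
        println!("{:?} -> {}", orc, arrow_type_name(orc));
    }
}
```

Because the `match` is exhaustive, adding a new variant to the enum would fail to compile until a mapping is supplied, which is one reason a lookup like this is often written as a `match` rather than a map built at runtime.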

regen.sh

Lines changed: 1 addition & 0 deletions
@@ -19,3 +19,4 @@

 SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
 cd $SCRIPT_DIR && cargo run --manifest-path gen/Cargo.toml
+rustfmt src/proto.rs

rustfmt.toml

Lines changed: 0 additions & 1 deletion
This file was deleted.

src/lib.rs

Lines changed: 13 additions & 0 deletions
@@ -1,3 +1,16 @@
+//! A native Rust implementation of the [Apache ORC](https://orc.apache.org) file format,
+//! providing API's to read data into [Apache Arrow](https://arrow.apache.org) in-memory arrays.
+//!
+//! # Example usage
+//!
+//! ```no_run
+//! # use std::fs::File;
+//! # use datafusion_orc::arrow_reader::{ArrowReader, ArrowReaderBuilder};
+//! let file = File::open("/path/to/file.orc").unwrap();
+//! let reader = ArrowReaderBuilder::try_new(file).unwrap().build();
+//! let record_batches = reader.collect::<Result<Vec<_>, _>>().unwrap();
+//! ```
+
 pub mod arrow_reader;
 #[cfg(feature = "async")]
 pub mod async_arrow_reader;
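
Review note: the doc example added above ends with `reader.collect::<Result<Vec<_>, _>>()`, which relies on the standard library's ability to collect an iterator of `Result` items into a single `Result` holding a `Vec`. The sketch below shows that idiom in isolation, assuming a stand-in `Batch` type and `String` errors. Both are hypothetical placeholders so the example runs without the crate; the real reader yields Arrow record batches and its own error type.

```rust
/// Hypothetical stand-in for a decoded batch of rows.
#[derive(Debug, PartialEq)]
struct Batch(usize);

/// Collects an iterator of fallible items into one `Result`.
/// Collection stops at the first `Err`, which becomes the overall result.
fn collect_batches<I>(reader: I) -> Result<Vec<Batch>, String>
where
    I: Iterator<Item = Result<Batch, String>>,
{
    reader.collect()
}

fn main() {
    // Every item is Ok, so the result is Ok with all batches in order.
    let ok: Vec<Result<Batch, String>> = vec![Ok(Batch(1)), Ok(Batch(2))];
    println!("{:?}", collect_batches(ok.into_iter()));

    // The second item is an error, so the overall result is that error
    // and the item after it is never included.
    let failing: Vec<Result<Batch, String>> = vec![
        Ok(Batch(1)),
        Err("corrupt stripe".to_string()),
        Ok(Batch(3)),
    ];
    println!("{:?}", collect_batches(failing.into_iter()));
}
```

In the doc example the trailing `.unwrap()` then turns the first read error into a panic, which is acceptable in a short example but is usually replaced by `?` in application code.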

typos.toml

Lines changed: 2 additions & 1 deletion
@@ -1,9 +1,10 @@
 [default.extend-words]
 ue = "ue"
 datas = "datas"
+
 [files]
 extend-exclude = [
-"corrupted",
+"tests/**/data/**",
 "format/orc_proto.proto",
 "src/proto.rs"
 ]
