Commit b756d05

alamb, andygrove, jackwener, and viirya authored
Update main DataFusion README (#4903)
* Update main DataFusion README
* todos
* add kamu
* Apply suggestions from code review

Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: jakevin <[email protected]>

* Add note about databend
* Wordsmithing
* Update README.md

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: jakevin <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
1 parent aa8f139 commit b756d05

2 files changed, +94 -28 lines changed

README.md

Lines changed: 89 additions & 23 deletions
@@ -21,45 +21,91 @@

 <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256" alt="logo"/>

-DataFusion is an extensible query planning, optimization, and execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
+[Rust](http://rustlang.org), using the [Apache Arrow](https://arrow.apache.org)
 in-memory format.

+DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
+
 [![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master)

 ## Features

-- SQL query planner with support for multiple SQL dialects
-- DataFrame API
-- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
-file formats can be supported by implementing a `TableProvider` trait.
-- Supports popular object stores, including AWS S3, Azure Blob
-Storage, and Google Cloud Storage. There are extension points for implementing
-custom object stores.
+- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
+- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
+- Native support for Parquet, CSV, JSON, and Avro file formats. Support
+for custom file formats and non-file data sources via the `TableProvider` trait.
+- Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL,
+other query languages, custom plan and execution nodes, optimizer passes, and more.
+- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
+Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
+`ObjectStore` trait.
+- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
+[welcoming community](https://arrow.apache.org/datafusion/community/communication.html).
+- A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations,
+automatic join reordering, expression coercion, and more.
+- Permissive Apache 2.0 License, Apache Software Foundation governance
+- Written in [Rust](https://www.rust-lang.org/), a modern systems language with development
+productivity similar to Java or Golang, the performance of C++, and
+[loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).

 ## Use Cases

-DataFusion is modular in design with many extension points and can be
-used without modification as an embedded query engine and can also provide
-a foundation for building new systems. Here are some example use cases:
+DataFusion can be used without modification as an embedded SQL
+engine or can be customized and used as a foundation for
+building new systems. Here are some examples of systems built using DataFusion:
+
+- Specialized analytical database systems such as [CeresDB] and more general Spark-like systems such as [Ballista].
+- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
+- Research platform for new Database Systems, such as [Flock]
+- SQL support to another library, such as [dask sql]
+- Streaming data platforms such as [Synnada]
+- Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as [qv]
+- A faster Spark runtime replacement (blaze-rs)

-- DataFusion can be used as a SQL query planner and query optimizer, providing
-optimized logical plans that can then be mapped to other execution engines.
-- DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+By using DataFusion, these projects are free to focus on their specific
+features and avoid reimplementing general (but still necessary)
+features such as an expression representation, standard optimizations,
+execution plans, file format support, etc.

 ## Why DataFusion?

-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

+## Comparisons with other projects
+
+Here is a comparison with similar projects that may help you understand
+when DataFusion might be suitable or unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in-process analytic database.
+Like DataFusion, it supports very fast execution, both from its custom file format
+and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it
+is primarily used directly by users as a serverless database and query system rather
+than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+libraries at the time of writing. Like DataFusion, it is also
+written in Rust and uses the Apache Arrow memory model, but unlike
+DataFusion it does not provide SQL or as many extension points.
+
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
+is an execution engine. Like DataFusion, Velox aims to
+provide a reusable foundation for building database-like systems. Unlike DataFusion,
+it is written in C/C++ and does not include a SQL frontend or planning/optimization
+framework.
+
+- [DataBend](https://github.com/datafuselabs/databend) is a complete
+database system. Like DataFusion, it is also written in Rust and
+utilizes the Apache Arrow memory model, but unlike DataFusion it
+targets end-users rather than developers of other database systems.
+
 ## DataFusion Community Extensions

-There are a number of community projects that extend DataFusion or provide integrations with other systems.
+There are a number of community projects that extend DataFusion or
+provide integrations with other systems.

 ### Language Bindings

@@ -99,9 +145,29 @@ Here are some of the projects known to use DataFusion:
 - [Tensorbase](https://github.com/tensorbase/tensorbase)
 - [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar

-(if you know of another project, please submit a PR to add a link!)
-
-## Example Usage
+[ballista]: https://github.com/apache/arrow-ballista
+[blaze]: https://github.com/blaze-init/blaze
+[ceresdb]: https://github.com/CeresDB/ceresdb
+[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
+[cnosdb]: https://github.com/cnosdb/cnosdb
+[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
+[dask sql]: https://github.com/dask-contrib/dask-sql
+[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
+[delta-rs]: https://github.com/delta-io/delta-rs
+[flock]: https://github.com/flock-lab/flock
+[kamu]: https://github.com/kamu-data/kamu-cli
+[greptime db]: https://github.com/GreptimeTeam/greptimedb
+[influxdb iox]: https://github.com/influxdata/influxdb_iox
+[parseable]: https://github.com/parseablehq/parseable
+[prql-query]: https://github.com/prql/prql-query
+[qv]: https://github.com/timvw/qv
+[roapi]: https://github.com/roapi/roapi
+[seafowl]: https://github.com/splitgraph/seafowl
+[synnada]: https://synnada.ai/
+[tensorbase]: https://github.com/tensorbase/tensorbase
+[vegafusion]: https://vegafusion.io/ "if you know of another project, please submit a PR to add a link!"
+
+## Examples

 Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion.

docs/source/user-guide/introduction.md

Lines changed: 5 additions & 5 deletions
@@ -23,10 +23,10 @@ DataFusion is an extensible query execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.

-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+DataFusion supports SQL and a DataFrame API for building logical query
+plans, an extensive query optimizer, and a multi-threaded parallel
+execution engine for processing partitioned data sources
+such as CSV and Parquet files extremely quickly.

 ## Use Cases

@@ -37,7 +37,7 @@ the convenience of an SQL interface or a DataFrame API.

 ## Why DataFusion?

-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
