|
21 | 21 |
|
22 | 22 | <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256" alt="logo"/>
|
23 | 23 |
|
24 |
| -DataFusion is an extensible query planning, optimization, and execution framework, written in |
25 |
| -Rust, that uses [Apache Arrow](https://arrow.apache.org) as its |
| 24 | +DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in |
| 25 | +[Rust](https://www.rust-lang.org/), using the [Apache Arrow](https://arrow.apache.org) |
26 | 26 | in-memory format.
|
27 | 27 |
|
| 28 | +DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. |
| 29 | + |
28 | 30 | [](https://codecov.io/gh/apache/arrow-datafusion?branch=master)
|
29 | 31 |
|
30 | 32 | ## Features
|
31 | 33 |
|
32 |
| -- SQL query planner with support for multiple SQL dialects |
33 |
| -- DataFrame API |
34 |
| -- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom |
35 |
| - file formats can be supported by implementing a `TableProvider` trait. |
36 |
| -- Supports popular object stores, including AWS S3, Azure Blob |
37 |
| - Storage, and Google Cloud Storage. There are extension points for implementing |
38 |
| - custom object stores. |
| 34 | +- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html) |
| 35 | +- Blazingly fast, vectorized, multi-threaded, streaming execution engine. |
| 36 | +- Native support for Parquet, CSV, JSON, and Avro file formats. Support |
| 37 | + for custom file formats and non-file data sources via the `TableProvider` trait. |
| 38 | +- Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL, |
| 39 | + other query languages, custom plan and execution nodes, optimizer passes, and more. |
| 40 | +- Streaming, asynchronous IO directly from popular object stores, including AWS S3, |
| 41 | + Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the |
| 42 | + `ObjectStore` trait. |
| 43 | +- [Excellent Documentation](https://docs.rs/datafusion/latest) and a |
| 44 | + [welcoming community](https://arrow.apache.org/datafusion/community/communication.html). |
| 45 | +- A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations, |
| 46 | + automatic join reordering, expression coercion, and more. |
| 47 | +- Permissive Apache 2.0 License, Apache Software Foundation governance |
| 48 | +- Written in [Rust](https://www.rust-lang.org/), a modern systems language with development |
| 49 | + productivity similar to Java or Golang, the performance of C++, and |
| 50 | + [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted). |
39 | 51 |
|
40 | 52 | ## Use Cases
|
41 | 53 |
|
42 |
| -DataFusion is modular in design with many extension points and can be |
43 |
| -used without modification as an embedded query engine and can also provide |
44 |
| -a foundation for building new systems. Here are some example use cases: |
| 54 | +DataFusion can be used without modification as an embedded SQL |
| 55 | +engine or can be customized and used as a foundation for |
| 56 | +building new systems. Here are some examples of systems built using DataFusion: |
| 57 | + |
| 58 | +- Specialized analytical database systems such as [CeresDB] and more general Spark-like systems such as [Ballista]. |
| 59 | +- New query language engines such as [prql-query] and accelerators such as [VegaFusion] |
| 60 | +- Research platforms for new database systems, such as [Flock] |
| 61 | +- Adding SQL support to another library, such as [dask sql] |
| 62 | +- Streaming data platforms such as [Synnada] |
| 63 | +- Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as [qv] |
| 64 | +- A faster Spark runtime replacement, such as [blaze-rs][blaze] |
45 | 65 |
|
46 |
| -- DataFusion can be used as a SQL query planner and query optimizer, providing |
47 |
| - optimized logical plans that can then be mapped to other execution engines. |
48 |
| -- DataFusion is used to create modern, fast and efficient data |
49 |
| - pipelines, ETL processes, and database systems, which need the |
50 |
| - performance of Rust and Apache Arrow and want to provide their users |
51 |
| - the convenience of an SQL interface or a DataFrame API. |
| 66 | +By using DataFusion, these projects are free to focus on their specific |
| 67 | +features and avoid reimplementing general (but still necessary) |
| 68 | +features such as an expression representation, standard optimizations, |
| 69 | +execution plans, file format support, etc. |
52 | 70 |
|
53 | 71 | ## Why DataFusion?
|
54 | 72 |
|
55 |
| -- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance |
| 73 | +- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast. |
56 | 74 | - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
|
57 | 75 | - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
|
58 | 76 | - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
|
59 | 77 |
|
| 78 | +## Comparisons with other projects |
| 79 | + |
| 80 | +Here is a comparison with similar projects that may help you understand |
| 81 | +when DataFusion might or might not be suitable for your needs: |
| 82 | + |
| 83 | +- [DuckDB](http://www.duckdb.org) is an open source, in-process analytic database. |
| 84 | + Like DataFusion, it supports very fast execution, both from its custom file format |
| 85 | + and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it |
| 86 | + is primarily used directly by users as a serverless database and query system rather |
| 87 | + than as a library for building such database systems. |
| 88 | + |
| 89 | +- [Polars](http://pola.rs): Polars is one of the fastest DataFrame |
| 90 | + libraries at the time of writing. Like DataFusion, it is also |
| 91 | + written in Rust and uses the Apache Arrow memory model, but unlike |
| 92 | + DataFusion it provides neither SQL nor as many extension points. |
| 93 | + |
| 94 | +- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) |
| 95 | + is an execution engine. Like DataFusion, Velox aims to |
| 96 | + provide a reusable foundation for building database-like systems. Unlike DataFusion, |
| 97 | + it is written in C/C++ and does not include a SQL frontend or planning/optimization |
| 98 | + framework. |
| 99 | + |
| 100 | +- [DataBend](https://github.com/datafuselabs/databend) is a complete |
| 101 | + database system. Like DataFusion, it is also written in Rust and |
| 102 | + utilizes the Apache Arrow memory model, but unlike DataFusion it |
| 103 | + targets end-users rather than developers of other database systems. |
| 104 | + |
60 | 105 | ## DataFusion Community Extensions
|
61 | 106 |
|
62 |
| -There are a number of community projects that extend DataFusion or provide integrations with other systems. |
| 107 | +There are a number of community projects that extend DataFusion or |
| 108 | +provide integrations with other systems. |
63 | 109 |
|
64 | 110 | ### Language Bindings
|
65 | 111 |
|
@@ -99,9 +145,29 @@ Here are some of the projects known to use DataFusion:
|
99 | 145 | - [Tensorbase](https://github.com/tensorbase/tensorbase)
|
100 | 146 | - [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
|
101 | 147 |
|
102 |
| -(if you know of another project, please submit a PR to add a link!) |
103 |
| - |
104 |
| -## Example Usage |
| 148 | +[ballista]: https://github.com/apache/arrow-ballista |
| 149 | +[blaze]: https://github.com/blaze-init/blaze |
| 150 | +[ceresdb]: https://github.com/CeresDB/ceresdb |
| 151 | +[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust |
| 152 | +[cnosdb]: https://github.com/cnosdb/cnosdb |
| 153 | +[cube store]: https://github.com/cube-js/cube.js/tree/master/rust |
| 154 | +[dask sql]: https://github.com/dask-contrib/dask-sql |
| 155 | +[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui |
| 156 | +[delta-rs]: https://github.com/delta-io/delta-rs |
| 157 | +[flock]: https://github.com/flock-lab/flock |
| 158 | +[kamu]: https://github.com/kamu-data/kamu-cli |
| 159 | +[greptime db]: https://github.com/GreptimeTeam/greptimedb |
| 160 | +[influxdb iox]: https://github.com/influxdata/influxdb_iox |
| 161 | +[parseable]: https://github.com/parseablehq/parseable |
| 162 | +[prql-query]: https://github.com/prql/prql-query |
| 163 | +[qv]: https://github.com/timvw/qv |
| 164 | +[roapi]: https://github.com/roapi/roapi |
| 165 | +[seafowl]: https://github.com/splitgraph/seafowl |
| 166 | +[synnada]: https://synnada.ai/ |
| 167 | +[tensorbase]: https://github.com/tensorbase/tensorbase |
| 168 | +[vegafusion]: https://vegafusion.io/ "if you know of another project, please submit a PR to add a link!" |
| 169 | + |
| 170 | +## Examples |
105 | 171 |
|
106 | 172 | Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion.
|
107 | 173 |
|
|