diff --git a/README.md b/README.md index ceceb4296f771..237c21aa41bc4 100644 --- a/README.md +++ b/README.md @@ -21,45 +21,91 @@ logo -DataFusion is an extensible query planning, optimization, and execution framework, written in -Rust, that uses [Apache Arrow](https://arrow.apache.org) as its +DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in +[Rust](http://rustlang.org), using the [Apache Arrow](https://arrow.apache.org) in-memory format. +DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. + [![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master) ## Features -- SQL query planner with support for multiple SQL dialects -- DataFrame API -- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom - file formats can be supported by implementing a `TableProvider` trait. -- Supports popular object stores, including AWS S3, Azure Blob - Storage, and Google Cloud Storage. There are extension points for implementing - custom object stores. +- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html) +- Blazingly fast, vectorized, multi-threaded, streaming execution engine. +- Native support for Parquet, CSV, JSON, and Avro file formats. Support + for custom file formats and non-file data sources via the `TableProvider` trait. +- Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL, + other query languages, custom plan and execution nodes, optimizer passes, and more. +- Streaming, asynchronous IO directly from popular object stores, including AWS S3, + Azure Blob Storage, and Google Cloud Storage.
Other storage systems are supported via the + `ObjectStore` trait. +- [Excellent Documentation](https://docs.rs/datafusion/latest) and a + [welcoming community](https://arrow.apache.org/datafusion/community/communication.html). +- A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations, + automatic join reordering, expression coercion, and more. +- Permissive Apache 2.0 License, Apache Software Foundation governance +- Written in [Rust](https://www.rust-lang.org/), a modern system language with development + productivity similar to Java or Golang, the performance of C++, and + [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted). ## Use Cases -DataFusion is modular in design with many extension points and can be -used without modification as an embedded query engine and can also provide -a foundation for building new systems. Here are some example use cases: +DataFusion can be used without modification as an embedded SQL +engine or can be customized and used as a foundation for +building new systems. Here are some examples of systems built using DataFusion: + +- Specialized Analytical Database systems such as [CeresDB] and more general Spark-like systems such as [Ballista]. +- New query language engines such as [prql-query] and accelerators such as [VegaFusion] +- Research platform for new Database Systems, such as [Flock] +- SQL support to another library, such as [dask sql] +- Streaming data platforms such as [Synnada] +- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv] +- A faster Spark runtime replacement (blaze-rs) -- DataFusion can be used as a SQL query planner and query optimizer, providing - optimized logical plans that can then be mapped to other execution engines.
-- DataFusion is used to create modern, fast and efficient data - pipelines, ETL processes, and database systems, which need the - performance of Rust and Apache Arrow and want to provide their users - the convenience of an SQL interface or a DataFrame API. +By using DataFusion, the projects are freed to focus on their specific +features, and avoid reimplementing general (but still necessary) +features such as an expression representation, standard optimizations, +execution plans, file format support, etc. ## Why DataFusion? -- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance +- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast. - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems. +## Comparisons with other projects + +Here is a comparison with similar projects that may help understand +when DataFusion might be suitable and unsuitable for your needs: + +- [DuckDB](http://www.duckdb.org) is an open source, in-process analytic database. + Like DataFusion, it supports very fast execution, both from its custom file format + and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it + is primarily used directly by users as a serverless database and query system rather + than as a library for building such database systems. + +- [Polars](http://pola.rs): Polars is one of the fastest DataFrame + libraries at the time of writing.
Like DataFusion, it is also + written in Rust and uses the Apache Arrow memory model, but unlike + DataFusion it does not provide SQL nor as many extension points. + +- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) + is an execution engine. Like DataFusion, Velox aims to + provide a reusable foundation for building database-like systems. Unlike DataFusion, + it is written in C/C++ and does not include a SQL frontend or planning/optimization + framework. + +- [DataBend](https://github.com/datafuselabs/databend) is a complete + database system. Like DataFusion, it is also written in Rust and + utilizes the Apache Arrow memory model, but unlike DataFusion it + targets end-users rather than developers of other database systems. + ## DataFusion Community Extensions -There are a number of community projects that extend DataFusion or provide integrations with other systems. +There are a number of community projects that extend DataFusion or +provide integrations with other systems. ### Language Bindings @@ -99,9 +145,29 @@ Here are some of the projects known to use DataFusion: - [Tensorbase](https://github.com/tensorbase/tensorbase) - [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar -(if you know of another project, please submit a PR to add a link!)
- -## Example Usage +[ballista]: https://github.com/apache/arrow-ballista +[blaze]: https://github.com/blaze-init/blaze +[ceresdb]: https://github.com/CeresDB/ceresdb +[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust +[cnosdb]: https://github.com/cnosdb/cnosdb +[cube store]: https://github.com/cube-js/cube.js/tree/master/rust +[dask sql]: https://github.com/dask-contrib/dask-sql +[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui +[delta-rs]: https://github.com/delta-io/delta-rs +[flock]: https://github.com/flock-lab/flock +[kamu]: https://github.com/kamu-data/kamu-cli +[greptime db]: https://github.com/GreptimeTeam/greptimedb +[influxdb iox]: https://github.com/influxdata/influxdb_iox +[parseable]: https://github.com/parseablehq/parseable +[prql-query]: https://github.com/prql/prql-query +[qv]: https://github.com/timvw/qv +[roapi]: https://github.com/roapi/roapi +[seafowl]: https://github.com/splitgraph/seafowl +[synnada]: https://synnada.ai/ +[tensorbase]: https://github.com/tensorbase/tensorbase +[vegafusion]: https://vegafusion.io/ "if you know of another project, please submit a PR to add a link!" + +## Examples Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion. diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md index e16504091571c..64b6be9d28128 100644 --- a/docs/source/user-guide/introduction.md +++ b/docs/source/user-guide/introduction.md @@ -23,10 +23,10 @@ DataFusion is an extensible query execution framework, written in Rust, that uses [Apache Arrow](https://arrow.apache.org) as its in-memory format. 
-DataFusion supports both an SQL and a DataFrame API for building -logical query plans as well as a query optimizer and execution engine -capable of parallel execution against partitioned data sources (CSV -and Parquet) using threads. +DataFusion supports SQL and a DataFrame API for building logical query +plans, an extensive query optimizer, and a multi-threaded parallel +execution engine for processing partitioned data sources +such as CSV and Parquet files extremely quickly. ## Use Cases @@ -37,7 +37,7 @@ the convenience of an SQL interface or a DataFrame API. ## Why DataFusion? -- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance +- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast. - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.