From 2474e9bfaf8c655f0cf77309e7b058736f7b2234 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Fri, 13 Jan 2023 15:04:47 +0100
Subject: [PATCH 1/7] Update main DataFusion README

---
 README.md                              | 140 ++++++++++++++++++-------
 docs/source/user-guide/introduction.md |   2 +-
 2 files changed, 102 insertions(+), 40 deletions(-)

diff --git a/README.md b/README.md
index ec88e342583e..bdc2065ee878 100644
--- a/README.md
+++ b/README.md
@@ -21,34 +21,52 @@
 
 logo
 
-DataFusion is an extensible query planning, optimization, and execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+DataFusion is very fast, extensible query engine, for building high quality data centric systems in
+[Rust](http://rustlang.org), using the [Apache Arrow](https://arrow.apache.org)
 in-memory format.
 
+DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built in support for CSV, Parquet Json, and Avro, extensive customization, and a great community.
+
 [![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master)
 
 ## Features
 
-- SQL query planner with support for multiple SQL dialects
-- DataFrame API
-- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
-  file formats can be supported by implementing a `TableProvider` trait.
-- Supports popular object stores, including AWS S3, Azure Blob
-  Storage, and Google Cloud Storage. There are extension points for implementing
-  custom object stores.
+- Feature rich [SQL support](TODO LINK) and [DataFrame API](TODO LINK)
+- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
+- Native support for Parquet, CSV, JSON, and Avro file formats. Support
+  for custom file formats and non file datasources via the `TableProvider` trait.
+- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
+  other query languages, custom plan and execution nodes, optimizer passes, and more.
+- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
+  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
+  `ObjectStore` trait.
+- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
+  [welcoming community](https://arrow.apache.org/datafusion/community/communication.html).
+- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
+  automatic join reordering, expression coercion, and more.
+- Permissive Apache 2.0 License, Apache Software Foundation governance
+- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
+  producticity similar to Java or golang, the performance of C++, and
+  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
 
 ## Use Cases
 
-DataFusion is modular in design with many extension points and can be
-used without modification as an embedded query engine and can also provide
-a foundation for building new systems. Here are some example use cases:
+DataFusion can be used without modification as an embedded SQL
+engine or can be customized and used as a foundation for
+building new systems. Here are some examples of systems built using DataFusion:
+
+- Specialized Analytical Database systems such as [CeresDB] and more general spark like system such a [Ballista].
+- New query language engines e.g. [PRQL](TODO GET LINK ) and accelerators such as [VegaFusion]
+- Research platform Database Systems, e.g. [Flock] for testing new ideas
+- SQL support to another library (e.g. [dask sql]);
+- Streaming data platforms such as [Synnada]
+- Toos for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
+- A faster Spark runtime replacement (blaze-rs)
 
-- DataFusion can be used as a SQL query planner and query optimizer, providing
-  optimized logical plans that can then be mapped to other execution engines.
-- DataFusion is used to create modern, fast and efficient data
-  pipelines, ETL processes, and database systems, which need the
-  performance of Rust and Apache Arrow and want to provide their users
-  the convenience of an SQL interface or a DataFrame API.
+By using DataFusion, the projects are freed to focus on their specific
+features, and avoid reimplementing general (but still necessary)
+features such as an expression representation, standard optimizations,
+execution plans, file format support, etc.
 
 ## Why DataFusion?
 
@@ -57,9 +75,30 @@ a foundation for building new systems. Here are some example use cases:
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
 
+## Comparisons with other projects
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [pola.rs](http://pola.rs): Polars is one of the fastest DataFrame libraries at the time
+  of writing. Like DataFusion, it is also written in Rust but unlike DataFusion
+  it does not provide SQL nor many extension points.
+
+- [Facebook Velox](TODO LINK) is an execution engine. Like DataFusion, Velox aims to
+  provide a reusable foundation for building database-like systems. Unlike DataFusion,
+  it is written in C/C++ and does not include a SQL frontend or planning /optimization
+  framework.
+
 ## DataFusion Community Extensions
 
-There are a number of community projects that extend DataFusion or provide integrations with other systems.
+There are a number of community projects that extend DataFusion or
+provide integrations with other systems.
 
 ### Language Bindings
 
@@ -78,29 +117,52 @@ There are a number of community projects that extend DataFusion or provide integ
 
 Here are some of the projects known to use DataFusion:
 
-- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine
-- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core
-- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database
-- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
-- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database
-- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python
-- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion
-- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake
-- [Flock](https://github.com/flock-lab/flock)
-- [Greptime DB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database
-- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
-- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform
-- [qv](https://github.com/timvw/qv) Quickly view your data
-- [ROAPI](https://github.com/roapi/roapi)
-- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database
-- [Synnada](https://synnada.ai/) Streaming-first framework for data products
-- [Tensorbase](https://github.com/tensorbase/tensorbase)
-- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
+- [Ballista] Distributed SQL Query Engine
+- [Blaze] Spark accelerator with DataFusion at its core
+- [CeresDB] Distributed Time-Series Database
+- [Cloudfuse Buzz]
+- [CnosDB] Open Source Distributed Time Series Database
+- [Cube Store]
+- [Dask SQL] Distributed SQL query engine in Python
+- [datafusion-tui] Text UI for DataFusion
+- [delta-rs] Native Rust implementation of Delta Lake
+- [Flock] Cloud database research system
+- [Greptime DB] Open Source & Cloud Native Distributed Time Series Database
+- [InfluxDB IOx] Time Series Database
+- [Parseable] Log storage and observability platform
+- [qv] Quickly view your data
+- [PRQL] TODOO
+- [ROAPI]
+- [Seafowl] CDN-friendly analytical database
+- [Synnada] Streaming-first framework for data products
+- [Tensorbase]
+- [VegaFusion] Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
+
+[ballista]: https://github.com/apache/arrow-ballista
+[blaze]: https://github.com/blaze-init/blaze
+[ceresdb]: https://github.com/CeresDB/ceresdb
+[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
+[cnosdb]: https://github.com/cnosdb/cnosdb
+[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
+[dask sql]: https://github.com/dask-contrib/dask-sql
+[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
+[delta-rs]: https://github.com/delta-io/delta-rs
+[flock]: https://github.com/flock-lab/flock
+[greptime db]: https://github.com/GreptimeTeam/greptimedb
+[influxdb iox]: https://github.com/influxdata/influxdb_iox
+[parseable]: https://github.com/parseablehq/parseable
+[qv]: https://github.com/timvw/qv
+[roapi]: https://github.com/roapi/roapi
+[seafowl]: https://github.com/splitgraph/seafowl
+[synnada]: https://synnada.ai/
+[tensorbase]: https://github.com/tensorbase/tensorbase
+[vegafusion]: https://vegafusion.io/
+
+[PRQL]: TODO GET LINK
 
 (if you know of another project, please submit a PR to add a link!)
 
-## Example Usage
+## Examples
 
 Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion.
 
diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md
index e16504091571..2de37d2ffc20 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -26,7 +26,7 @@ in-memory format.
 DataFusion supports both an SQL and a DataFrame API for building
 logical query plans as well as a query optimizer and execution engine
 capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+and Parquet) using
 
 ## Use Cases
 

From af1e2236f8e22b5168e7a1b1de128001370bfe16 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Fri, 13 Jan 2023 16:26:03 -0500
Subject: [PATCH 2/7] todos

---
 README.md | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index bdc2065ee878..1a396c0cc1f0 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,7 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
 
 ## Features
 
-- Feature rich [SQL support](TODO LINK) and [DataFrame API](TODO LINK)
+- Feature rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
 - Blazingly fast, vectorized, multi-threaded, streaming execution engine.
 - Native support for Parquet, CSV, JSON, and Avro file formats. Support
   for custom file formats and non file datasources via the `TableProvider` trait.
@@ -56,11 +56,11 @@ engine or can be customized and used as a foundation for
 building new systems. Here are some examples of systems built using DataFusion:
 
 - Specialized Analytical Database systems such as [CeresDB] and more general spark like system such a [Ballista].
-- New query language engines e.g. [PRQL](TODO GET LINK ) and accelerators such as [VegaFusion]
-- Research platform Database Systems, e.g. [Flock] for testing new ideas
-- SQL support to another library (e.g. [dask sql]);
+- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
+- Research platform for new Database Systems, such as [Flock]
+- SQL support to another library, such as [dask sql]
 - Streaming data platforms such as [Synnada]
-- Toos for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
+- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
 - A faster Spark runtime replacement (blaze-rs)
 
 By using DataFusion, the projects are freed to focus on their specific
@@ -90,7 +90,8 @@ when DataFusion might be be suitable and unsuitable for your needs:
   of writing. Like DataFusion, it is also written in Rust but unlike DataFusion
   it does not provide SQL nor many extension points.
 
-- [Facebook Velox](TODO LINK) is an execution engine. Like DataFusion, Velox aims to
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
+  is an execution engine. Like DataFusion, Velox aims to
   provide a reusable foundation for building database-like systems. Unlike DataFusion,
   it is written in C/C++ and does not include a SQL frontend or planning /optimization
   framework.
@@ -131,8 +132,8 @@ Here are some of the projects known to use DataFusion:
 - [InfluxDB IOx] Time Series Database
 - [Parseable] Log storage and observability platform
 - [qv] Quickly view your data
-- [PRQL] TODOO
-- [ROAPI]
+- [prql-query]: Query and transform data with PRQL
+- [ROAPI]: Automatic read-only APIs for static datasets
 - [Seafowl] CDN-friendly analytical database
 - [Synnada] Streaming-first framework for data products
 - [Tensorbase]
@@ -151,16 +152,13 @@ Here are some of the projects known to use DataFusion:
 [greptime db]: https://github.com/GreptimeTeam/greptimedb
 [influxdb iox]: https://github.com/influxdata/influxdb_iox
 [parseable]: https://github.com/parseablehq/parseable
+[prql-query]: https://github.com/prql/prql-query
 [qv]: https://github.com/timvw/qv
 [roapi]: https://github.com/roapi/roapi
 [seafowl]: https://github.com/splitgraph/seafowl
 [synnada]: https://synnada.ai/
 [tensorbase]: https://github.com/tensorbase/tensorbase
-[vegafusion]: https://vegafusion.io/
-
-[PRQL]: TODO GET LINK
-
-(if you know of another project, please submit a PR to add a link!)
+[vegafusion]: https://vegafusion.io/ "if you know of another project, please submit a PR to add a link!"
 
 ## Examples
 

From 515ff86aad18a03af851865633b99c990717f31c Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Fri, 13 Jan 2023 16:27:47 -0500
Subject: [PATCH 3/7] add kamu

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 1a396c0cc1f0..e64b2562f609 100644
--- a/README.md
+++ b/README.md
@@ -128,6 +128,7 @@ Here are some of the projects known to use DataFusion:
 - [datafusion-tui] Text UI for DataFusion
 - [delta-rs] Native Rust implementation of Delta Lake
 - [Flock] Cloud database research system
+- [Kamu] Planet-scale streaming data pipeline
 - [Greptime DB] Open Source & Cloud Native Distributed Time Series Database
 - [InfluxDB IOx] Time Series Database
 - [Parseable] Log storage and observability platform
@@ -149,6 +150,7 @@ Here are some of the projects known to use DataFusion:
 [datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
 [delta-rs]: https://github.com/delta-io/delta-rs
 [flock]: https://github.com/flock-lab/flock
+[kamu]: https://github.com/kamu-data/kamu-cli
 [greptime db]: https://github.com/GreptimeTeam/greptimedb
 [influxdb iox]: https://github.com/influxdata/influxdb_iox
 [parseable]: https://github.com/parseablehq/parseable

From 07fd7838d63e554a6d8f7ded00089692a643b2bd Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Sun, 15 Jan 2023 06:08:03 -0500
Subject: [PATCH 4/7] Apply suggestions from code review

Co-authored-by: Andy Grove
Co-authored-by: jakevin
---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index e64b2562f609..ee781d84cb92 100644
--- a/README.md
+++ b/README.md
@@ -21,17 +21,17 @@
 
 logo
 
-DataFusion is very fast, extensible query engine, for building high quality data centric systems in
+DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
 [Rust](http://rustlang.org), using the [Apache Arrow](https://arrow.apache.org)
 in-memory format.
 
-DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built in support for CSV, Parquet Json, and Avro, extensive customization, and a great community.
+DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet JSON, and Avro, extensive customization, and a great community.
 
 [![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master)
 
 ## Features
 
-- Feature rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
+- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
 - Blazingly fast, vectorized, multi-threaded, streaming execution engine.
 - Native support for Parquet, CSV, JSON, and Avro file formats. Support
   for custom file formats and non file datasources via the `TableProvider` trait.
@@ -46,7 +46,7 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
   automatic join reordering, expression coercion, and more.
 - Permissive Apache 2.0 License, Apache Software Foundation governance
 - Written in [Rust](https://www.rust-lang.org/), a modern system language with development
-  producticity similar to Java or golang, the performance of C++, and
+  productivity similar to Java or Golang, the performance of C++, and
   [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
 
 ## Use Cases

From 020fe58a0214430acbdd300adc4dea10b2e423a8 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Sun, 15 Jan 2023 06:12:26 -0500
Subject: [PATCH 5/7] Add note about databend

---
 README.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index af1fbf542a4b..a1630f7c64af 100644
--- a/README.md
+++ b/README.md
@@ -86,9 +86,10 @@ when DataFusion might be be suitable and unsuitable for your needs:
   is primarily used directly by users as a serverless database and query system rather
   than as a library for building such database systems.
 
-- [pola.rs](http://pola.rs): Polars is one of the fastest DataFrame libraries at the time
-  of writing. Like DataFusion, it is also written in Rust but unlike DataFusion
-  it does not provide SQL nor many extension points.
+- [pola.rs](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL nor as many extension points.
 
 - [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
   is an execution engine. Like DataFusion, Velox aims to
@@ -96,6 +97,11 @@ when DataFusion might be be suitable and unsuitable for your needs:
   it is written in C/C++ and does not include a SQL frontend or planning /optimization
   framework.
 
+- [DataBend](https://github.com/datafuselabs/databend) is a complete,
+  database system. Like DataFusion it is also written in Rust and
+  utilizes the Apache Arrow memory model, but unlike DataFusion it
+  targets end-users rather than developers of other database systems.
+
 ## DataFusion Community Extensions
 
 There are a number of community projects that extend DataFusion or

From ead2c7b9f6d3a90921980b248f76ddfd6d071068 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Sun, 15 Jan 2023 06:16:12 -0500
Subject: [PATCH 6/7] Wordsmithing

---
 README.md                              |  2 +-
 docs/source/user-guide/introduction.md | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index a1630f7c64af..e981b7b14db8 100644
--- a/README.md
+++ b/README.md
@@ -70,7 +70,7 @@ execution plans, file format support, etc.
 
 ## Why DataFusion?
 
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md
index 2de37d2ffc20..64b6be9d2812 100644
--- a/docs/source/user-guide/introduction.md
+++ b/docs/source/user-guide/introduction.md
@@ -23,10 +23,10 @@ DataFusion is an extensible query execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using
+DataFusion supports SQL and a DataFrame API for building logical query
+plans, an extensive query optimizer, and a multi-threaded parallel
+execution execution engine for processing partitioned data sources
+such as CSV and Parquet files extremely quickly.
 
 ## Use Cases
 
@@ -37,7 +37,7 @@ the convenience of an SQL interface or a DataFrame API.
 
 ## Why DataFusion?
 
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

From 754810aaa182226a4ae6c1ecf16ac6d272f63652 Mon Sep 17 00:00:00 2001
From: Andrew Lamb
Date: Mon, 16 Jan 2023 20:06:30 -0500
Subject: [PATCH 7/7] Update README.md

Co-authored-by: Liang-Chi Hsieh
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e981b7b14db8..237c21aa41bc 100644
--- a/README.md
+++ b/README.md
@@ -86,7 +86,7 @@ when DataFusion might be be suitable and unsuitable for your needs:
   is primarily used directly by users as a serverless database and query system rather
   than as a library for building such database systems.
 
-- [pola.rs](http://pola.rs): Polars is one of the fastest DataFrame
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
   libraries at the time of writing. Like DataFusion, it is also
   written in Rust and uses the Apache Arrow memory model, but unlike
   DataFusion it does not provide SQL nor as many extension points.