Update main DataFusion README (#4903)

alamb · andygrove · jackwener · web-flow · commit b756d053e11e · 2023-01-17T16:03:52.000-05:00
* Update main DataFusion README

* todos

* add kamu

* Apply suggestions from code review

Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: jakevin <jakevingoo@gmail.com>

* Add note about databend

* Wordsmithing

* Update README.md

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: jakevin <jakevingoo@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
diff --git a/README.md b/README.md
@@ -21,45 +21,91 @@
 
 <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256" alt="logo"/>
 
-DataFusion is an extensible query planning, optimization, and execution framework, written in
-Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
+DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
+[Rust](https://www.rust-lang.org), using the [Apache Arrow](https://arrow.apache.org)
 in-memory format.
 
+DataFusion offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
+
 [![Coverage Status](https://codecov.io/gh/apache/arrow-datafusion/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-datafusion?branch=master)
 
 ## Features
 
-- SQL query planner with support for multiple SQL dialects
-- DataFrame API
-- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
-  file formats can be supported by implementing a `TableProvider` trait.
-- Supports popular object stores, including AWS S3, Azure Blob
-  Storage, and Google Cloud Storage. There are extension points for implementing
-  custom object stores.
+- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
+- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
+- Native support for Parquet, CSV, JSON, and Avro file formats. Support
+  for custom file formats and non-file data sources via the `TableProvider` trait.
+- Many extension points: user-defined scalar/aggregate/window functions, DataSources, SQL,
+  other query languages, custom plan and execution nodes, optimizer passes, and more.
+- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
+  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
+  `ObjectStore` trait.
+- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
+  [welcoming community](https://arrow.apache.org/datafusion/community/communication.html).
+- A state-of-the-art query optimizer with projection and filter pushdown, sort-aware optimizations,
+  automatic join reordering, expression coercion, and more.
+- Permissive Apache 2.0 License, Apache Software Foundation governance
+- Written in [Rust](https://www.rust-lang.org/), a modern systems language with development
+  productivity similar to Java or Golang, the performance of C++, and
+  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
 
 ## Use Cases
 
-DataFusion is modular in design with many extension points and can be
-used without modification as an embedded query engine and can also provide
-a foundation for building new systems. Here are some example use cases:
+DataFusion can be used without modification as an embedded SQL
+engine or can be customized and used as a foundation for
+building new systems. Here are some examples of systems built using DataFusion:
+
+- Specialized analytical database systems such as [CeresDB] and more general Spark-like systems such as [Ballista]
+- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
+- Research platform for new Database Systems, such as [Flock]
+- SQL support for another library, such as [dask sql]
+- Streaming data platforms such as [Synnada]
+- Tools for reading / sorting / transcoding Parquet, CSV, Avro, and JSON files such as [qv]
+- A faster Spark runtime replacement such as [Blaze]
 
-- DataFusion can be used as a SQL query planner and query optimizer, providing
-  optimized logical plans that can then be mapped to other execution engines.
-- DataFusion is used to create modern, fast and efficient data
-  pipelines, ETL processes, and database systems, which need the
-  performance of Rust and Apache Arrow and want to provide their users
-  the convenience of an SQL interface or a DataFrame API.
+By using DataFusion, these projects are free to focus on their specific
+features and avoid reimplementing general (but still necessary)
+features such as an expression representation, standard optimizations,
+execution plans, file format support, etc.
 
 ## Why DataFusion?
 
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
 
+## Comparisons with other projects
+
+Here is a comparison with similar projects that may help you understand
+when DataFusion might or might not be suitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in-process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL or as many extension points.
+
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
+  is an execution engine. Like DataFusion, Velox aims to
+  provide a reusable foundation for building database-like systems. Unlike DataFusion,
+  it is written in C/C++ and does not include a SQL frontend or planning/optimization
+  framework.
+
+- [DataBend](https://github.com/datafuselabs/databend) is a complete
+  database system. Like DataFusion, it is also written in Rust and
+  utilizes the Apache Arrow memory model, but unlike DataFusion it
+  targets end-users rather than developers of other database systems.
+
 ## DataFusion Community Extensions
 
-There are a number of community projects that extend DataFusion or provide integrations with other systems.
+There are a number of community projects that extend DataFusion or
+provide integrations with other systems.
 
 ### Language Bindings
 
@@ -99,9 +145,29 @@ Here are some of the projects known to use DataFusion:
 - [Tensorbase](https://github.com/tensorbase/tensorbase)
 - [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
 
-(if you know of another project, please submit a PR to add a link!)
-
-## Example Usage
+[ballista]: https://github.com/apache/arrow-ballista
+[blaze]: https://github.com/blaze-init/blaze
+[ceresdb]: https://github.com/CeresDB/ceresdb
+[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
+[cnosdb]: https://github.com/cnosdb/cnosdb
+[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
+[dask sql]: https://github.com/dask-contrib/dask-sql
+[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
+[delta-rs]: https://github.com/delta-io/delta-rs
+[flock]: https://github.com/flock-lab/flock
+[kamu]: https://github.com/kamu-data/kamu-cli
+[greptime db]: https://github.com/GreptimeTeam/greptimedb
+[influxdb iox]: https://github.com/influxdata/influxdb_iox
+[parseable]: https://github.com/parseablehq/parseable
+[prql-query]: https://github.com/prql/prql-query
+[qv]: https://github.com/timvw/qv
+[roapi]: https://github.com/roapi/roapi
+[seafowl]: https://github.com/splitgraph/seafowl
+[synnada]: https://synnada.ai/
+[tensorbase]: https://github.com/tensorbase/tensorbase
+[vegafusion]: https://vegafusion.io/ "if you know of another project, please submit a PR to add a link!"
+
+## Examples
 
 Please see the [example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html) in the user guide and the [datafusion-examples](https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples) crate for more information on how to use DataFusion.
 
diff --git a/docs/source/user-guide/introduction.md b/docs/source/user-guide/introduction.md
@@ -23,10 +23,10 @@ DataFusion is an extensible query execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+DataFusion supports SQL and a DataFrame API for building logical query
+plans, an extensive query optimizer, and a multi-threaded parallel
+execution engine for processing partitioned data sources
+such as CSV and Parquet files extremely quickly.
 
 ## Use Cases
 
@@ -37,7 +37,7 @@ the convenience of an SQL interface or a DataFrame API.
 
 ## Why DataFusion?
 
-- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
 - _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.