From 65a689f7d22ff4e449bfab0b367f666a776ee9c5 Mon Sep 17 00:00:00 2001
From: Matthew Turner
Date: Fri, 1 Apr 2022 12:01:49 -0400
Subject: [PATCH 1/4] Update roadmap

---
 .../source/specification/quarterly_roadmap.md | 74 ++++++++++---------
 1 file changed, 40 insertions(+), 34 deletions(-)

diff --git a/docs/source/specification/quarterly_roadmap.md b/docs/source/specification/quarterly_roadmap.md
index d193952767af..3c900dead814 100644
--- a/docs/source/specification/quarterly_roadmap.md
+++ b/docs/source/specification/quarterly_roadmap.md
@@ -21,52 +21,58 @@
 
 A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the projects contributors. This roadmap is not binding.
 
-## 2022 Q1
+## 2022 Q2
 
 ### DataFusion Core
 
-- Publish official Arrow2 branch
-- Implementation of memory manager (i.e. to enable spilling to disk as needed)
-
-### Benchmarking
-
-- Inclusion in Db-Benchmark with all quries covered
-- All TPCH queries covered
-
-### Performance Improvements
-
-- Predicate evaluation
-- Improve multi-column comparisons (that can't be vectorized at the moment)
-- Null constant support
-
-### New Features
-
-- Read JSON as table
-- Simplify DDL with DataFusion-Cli
-- Add Decimal128 data type and the attendant features such as Arrow Kernel and UDF support
-- Add new experimental e-graph based optimizer
+- IO Improvements
+  - Reading, registering, and writing more file formats from both DataFrame API and SQL
+- Memory Management
+  - Add more operators for memory limited execution
+- Performance
+  - Incorporate row-format into operators such as aggregate
+  - Add row-format benchmarks
+  - Explore LLVM for JIT, with inline Rust functions as the primary goal
+- Documentation
+  - General improvements to DataFusion website
+  - Publish design documents
 
 ### Ballista
 
-- Begin work on design documents and plan / priorities for development
+- Make production ready
+  - Shuffle file cleanup
+  - Fill functional gaps between DataFusion and Ballista
+  - Improve task scheduling and data exchange efficiency
+  - Better error handling
+    - Task failure
+    - Executor lost
+    - Schedule restart
+  - Improve monitoring and logging
+  - Auto scaling support
+- Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support sharding for scalability and more efficient caching.
+- Executor deployment grouping based on resource allocation
 
 ### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib]))
 
-- Stable S3 support
-- Begin design discussions and prototyping of a stream provider
+#### [DataFusion-Python](https://github.com/datafusion-contrib/datafusion-python)
 
-## Beyond 2022 Q1
+- Add missing functionality to DataFrame and SessionContext
+- Improve documentation
 
-There is no clear timeline for the below, but community members have expressed interest in working on these topics.
+#### [DataFusion-S3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
 
-### DataFusion Core
+- Create Python bindings to use with datafusion-python
 
-- Custom SQL support
-- Split DataFusion into multiple crates
-- Push based query execution and code generation
+#### [DataFusion-Tui](https://github.com/datafusion-contrib/datafusion-tui)
 
-### Ballista
+- Create multiple SQL editors
+- Expose more Context and query metadata
+- Support new data sources
+  - BigTable, HDFS, HTTP APIs
+
+#### [DataFusion-BigTable](https://github.com/datafusion-contrib/datafusion-bigtable)
 
-- Evolve architecture so that it can be deployed in a multi-tenant cloud native environment
-- Ensure Ballista is scalable, elastic, and stable for production usage
-- Develop distributed ML capabilities
+- Python binding to use with datafusion-python
+- Timestamp range predicate pushdown
+- Multi-threaded partition aware execution
+- Production ready Rust SDK

From ee645f2cf400e47a67b2346d278335c924f4b951 Mon Sep 17 00:00:00 2001
From: Matthew Turner
Date: Fri, 1 Apr 2022 12:19:19 -0400
Subject: [PATCH 2/4] IO options comment

---
 docs/source/specification/quarterly_roadmap.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/specification/quarterly_roadmap.md b/docs/source/specification/quarterly_roadmap.md
index 3c900dead814..da23e8d3e808 100644
--- a/docs/source/specification/quarterly_roadmap.md
+++ b/docs/source/specification/quarterly_roadmap.md
@@ -27,6 +27,7 @@ A quarterly roadmap will be published to give the DataFusion community visibilit
 
 - IO Improvements
   - Reading, registering, and writing more file formats from both DataFrame API and SQL
+  - Additional options for IO including partitioning and metadata support
 - Memory Management
   - Add more operators for memory limited execution
 - Performance

From fe2d2b4a3fdc7e4eff6c8fcf54ac023fc5569a0f Mon Sep 17 00:00:00 2001
From: Matthew Turner
Date: Fri, 1 Apr 2022 15:49:39 -0400
Subject: [PATCH 3/4] Add streams

---
 docs/source/specification/quarterly_roadmap.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/source/specification/quarterly_roadmap.md b/docs/source/specification/quarterly_roadmap.md
index da23e8d3e808..20eaa2e9137c 100644
--- a/docs/source/specification/quarterly_roadmap.md
+++ b/docs/source/specification/quarterly_roadmap.md
@@ -37,6 +37,8 @@ A quarterly roadmap will be published to give the DataFusion community visibilit
 - Documentation
   - General improvements to DataFusion website
   - Publish design documents
+- Streaming
+  - Create `StreamProvider` trait
 
 ### Ballista
 
@@ -77,3 +79,7 @@ A quarterly roadmap will be published to give the DataFusion community visibilit
 - Timestamp range predicate pushdown
 - Multi-threaded partition aware execution
 - Production ready Rust SDK
+
+#### [DataFusion-Streams](https://github.com/datafusion-contrib/datafusion-streams)
+
+- Create experimental implementation of `StreamProvider` trait

From 489b3da9caa651aaff1f248f954a2680448a31ed Mon Sep 17 00:00:00 2001
From: Matthew Turner
Date: Sat, 2 Apr 2022 12:17:46 -0400
Subject: [PATCH 4/4] Update with feedback

---
 docs/source/specification/quarterly_roadmap.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/source/specification/quarterly_roadmap.md b/docs/source/specification/quarterly_roadmap.md
index 20eaa2e9137c..94c7dd9e2c18 100644
--- a/docs/source/specification/quarterly_roadmap.md
+++ b/docs/source/specification/quarterly_roadmap.md
@@ -28,12 +28,17 @@ A quarterly roadmap will be published to give the DataFusion community visibilit
 - IO Improvements
   - Reading, registering, and writing more file formats from both DataFrame API and SQL
   - Additional options for IO including partitioning and metadata support
+- Work Scheduling
+  - Improve predictability, observability and performance of IO and CPU-bound work
+  - Develop a more explicit story for managing parallelism during plan execution
 - Memory Management
   - Add more operators for memory limited execution
 - Performance
   - Incorporate row-format into operators such as aggregate
   - Add row-format benchmarks
+  - Explore JIT-compiling complex expressions
   - Explore LLVM for JIT, with inline Rust functions as the primary goal
+  - Improve performance of Sort and Merge using Row Format / JIT expressions
 - Documentation
   - General improvements to DataFusion website
   - Publish design documents