[arrow2] Merge arrow2 and datafusion latest #1697

Merged 75 commits on Jan 29, 2022
Commits
b42ebe7
Clarify docs about `Accumulator::update` and `Accumulator::update_bat…
alamb Jan 11, 2022
b05feda
Mark ARRAY_AGG(DISTINCT ...) not implemented (#1534)
james727 Jan 11, 2022
06d147a
Add batch operations to stddev (#1547)
realno Jan 11, 2022
e1e7b86
Address clippy warnings (#1553)
sergey-melnychuk Jan 12, 2022
14176ff
Update to arrow-7.0.0 (#1523)
alamb Jan 12, 2022
794b92b
Remove unused `update` and `merge` implementations from Aggregates an…
alamb Jan 13, 2022
cf76969
Make call SchedulerServer::new once in ballista-scheduler process (#1…
Ted-Jiang Jan 13, 2022
b4c77e5
Add covar operators (#1551)
realno Jan 13, 2022
d7e465a
Initial MemoryManager and DiskManager APIs for query execution + Ext…
yjshen Jan 13, 2022
811bb51
Update to rust 1.58 (#1557)
xudong963 Jan 14, 2022
0bddfb7
support cast/try_cast for decimal: signed numeric to decimal (#1442)
liukun4515 Jan 14, 2022
1c39f5c
support comparison for decimal data type and refactor the binary coer…
liukun4515 Jan 14, 2022
b743610
add correlation function (#1561)
realno Jan 16, 2022
1dae7e2
Rename sql integration tests from `mod` to `sql_integration` (#1575)
alamb Jan 16, 2022
bbfc2c0
update reference to python and update readme (#1581)
jimexist Jan 16, 2022
278e859
minor: improve the benchmark readme (#1567)
xudong963 Jan 16, 2022
438b417
Tests for support try_cast/cast decimal to numeric (#1465)
liukun4515 Jan 16, 2022
6f7b2d2
implement Hash for various types and replace PartialOrd (#1580)
jimexist Jan 16, 2022
f027e5f
add from_slice trait to ease arrow2 migration (#1588)
jimexist Jan 17, 2022
92a3e45
Consolidate `batch_size` configuration in `ExecutionConfig`, `Runtime…
yjshen Jan 17, 2022
30df911
support from_slice for binary, string, and boolean array types (#1589)
jimexist Jan 17, 2022
059e52b
update nightly version (#1597)
jimexist Jan 17, 2022
82e8003
remove update and merge (#1582)
jimexist Jan 18, 2022
c549d51
support mathematics operation for decimal data type (#1554)
liukun4515 Jan 18, 2022
fefbfc8
add test for decimal to decimal (#1603)
liukun4515 Jan 18, 2022
8ebc94c
fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail (#1…
xudong963 Jan 18, 2022
444c153
Add support show tables and show columns for ballista (#1593)
EricJoy2048 Jan 18, 2022
ad392fd
Fix comparison of dictionary arrays (#1606)
alamb Jan 19, 2022
345f727
Replace Datafusion Error with Generic Error for Object store (#1541)
matthewmturner Jan 19, 2022
eb51fae
consolidate binary_expr coercion rule code into `binary_rule.rs` modu…
alamb Jan 20, 2022
a96bb5e
Implement ARRAY_AGG(DISTINCT ...) (#1579)
james727 Jan 20, 2022
d93cf79
Add roadmap to readme (#1616)
matthewmturner Jan 20, 2022
2f702e4
fix: sql planner creates cross join instead of inner join from select…
xudong963 Jan 21, 2022
e92225d
feat: Support complex interval via IntervalMonthDayNano (#1615)
ovr Jan 21, 2022
03075d5
Fix null comparison for Parquet pruning predicate (#1595)
viirya Jan 21, 2022
3c5a679
fix dependabot (#1625)
xudong963 Jan 21, 2022
7d819d1
Consolidate sort and external_sort (#1596)
yjshen Jan 21, 2022
62edddb
Optimize `SortPreservingMergeStream` to avoid `SortKeyCursor` sharing…
yjshen Jan 22, 2022
cc8f325
Update pyo3 requirement from 0.14 to 0.15 (#1627)
dependabot[bot] Jan 22, 2022
67a598c
Update etcd-client requirement from 0.7 to 0.8 (#1626)
dependabot[bot] Jan 22, 2022
0762bf0
Update hashbrown requirement from 0.11 to 0.12 (#1631)
dependabot[bot] Jan 22, 2022
af8786e
support hash decimal array and group by (#1640)
liukun4515 Jan 22, 2022
1c63759
Add spill_count and spilled_bytes to baseline metrics, test sort with…
yjshen Jan 22, 2022
15af24a
Add `DataFusionError` -> `ArrowError` conversion (#1643)
alamb Jan 22, 2022
9c5ccae
update md-5, sha2, blake2 (#1647)
xudong963 Jan 23, 2022
71757bb
Introduce push-based task scheduling for Ballista (#1560)
yahoNanJing Jan 23, 2022
4a2453a
fix a cte block with same name for many times (#1639)
xudong963 Jan 23, 2022
deaa8ac
Handle merging of evolved schemas in ParquetExec (#1622)
thinkharderdev Jan 23, 2022
01b5244
refine match pattern related code (#1650)
xudong963 Jan 23, 2022
6ec18bb
Consolidate Schema and RecordBatch projection (#1638)
alamb Jan 23, 2022
741df36
Remove DataFusionError::into_arrow_external_error (#1645)
alamb Jan 24, 2022
c63cfd4
Move AggregatedMetricsSet to metrics for further reuse (#1663)
yjshen Jan 24, 2022
97f95b3
Make `MemoryManager` and `MemoryStream` public (#1664)
yjshen Jan 24, 2022
618c1e8
feat: Support Substring(str [from int] [for int]) (#1621)
ovr Jan 24, 2022
2a9df64
[Ballista] Fix scheduler state mod bug (#1655)
EricJoy2048 Jan 24, 2022
992624a
Fix predicate pushdown for outer joins (#1618)
james727 Jan 24, 2022
271b6ba
feat: Support quarter granularity in date_trunct fn (#1667)
ovr Jan 25, 2022
6c8d642
Update to arrow 8.0.0 (#1673)
alamb Jan 25, 2022
bf68073
[Ballista] Add Decimal128, Date64, TimestampSecond, TimestampMillisec…
EricJoy2048 Jan 25, 2022
ee91c68
upgrade clap to version 3 (#1672)
jimexist Jan 25, 2022
7153fac
Improve configuration and resource use of `MemoryManager` and `DiskMa…
alamb Jan 25, 2022
bffa5e4
Use NamedTempFile rather than `String` in DiskManager (#1680)
alamb Jan 26, 2022
48ad975
Add VegaFusion as project that uses DataFusion (#1683)
jonmmease Jan 26, 2022
54da006
Stop merging avro schemas as it doesn't support list of lists
Igosuki Jan 26, 2022
d297540
Add a new metric type: `Gauge` + `CurrentMemoryUsage` to metrics (#1682)
yjshen Jan 26, 2022
fdbd608
enhance arithmetic operation for array with scalar (#1552)
liukun4515 Jan 26, 2022
bf71577
refactor array_agg to not to have `update` and `merge` (#1681)
jimexist Jan 27, 2022
2266474
Fix bug while merging `RecordBatch`, add `SortPreservingMerge` fuzz t…
alamb Jan 27, 2022
63d24bf
Make `SortPreservingMergeStream` stable on input stream order (#1687)
alamb Jan 27, 2022
97415ca
use the latest arrow2 with Chunk
Igosuki Jan 27, 2022
18918fa
Merge branch 'master' into i_arrow2
Igosuki Jan 27, 2022
a7ec38e
resolve up to last datafusion issue with SortColumn using a reference…
Igosuki Jan 28, 2022
eea061f
Fix other crates and address check warnings
Igosuki Jan 28, 2022
ab48bb2
clippy
Igosuki Jan 28, 2022
98f98d1
test fix #1, errors and debug strings
Igosuki Jan 28, 2022
2 changes: 1 addition & 1 deletion .env
@@ -47,7 +47,7 @@ FEDORA=33
 PYTHON=3.6
 LLVM=11
 CLANG_TOOLS=8
-RUST=nightly-2021-10-23
+RUST=nightly-2022-01-17
 GO=1.15
 NODE=14
 MAVEN=3.5.4
6 changes: 2 additions & 4 deletions .github/dependabot.yml
@@ -3,9 +3,7 @@ updates:
 - package-ecosystem: cargo
 directory: "/"
 schedule:
-interval: weekly
-day: sunday
-time: "7:00"
+interval: daily
 open-pull-requests-limit: 10
 target-branch: master
-labels: [auto-dependencies]
+labels: [auto-dependencies]
2 changes: 1 addition & 1 deletion .github/workflows/rust.yml
@@ -293,7 +293,7 @@ jobs:
 strategy:
 matrix:
 arch: [amd64]
-rust: [nightly-2021-10-23]
+rust: [nightly-2022-01-17]
 steps:
 - uses: actions/checkout@v2
 with:
4 changes: 1 addition & 3 deletions Cargo.toml
@@ -28,13 +28,11 @@ members = [
 "ballista-examples",
 ]

-exclude = ["python"]
-
 [profile.release]
 lto = true
 codegen-units = 1

 [patch.crates-io]
-arrow2 = { git = "https://github.com/jorgecarleitao/arrow2.git", rev = "ef7937dfe56033c2cc491482c67587b52cd91554" }
+arrow2 = { git = "https://github.com/jorgecarleitao/arrow2.git", branch = "main" }
 #arrow2 = { git = "https://github.com/blaze-init/arrow2.git", branch = "shuffle_ipc" }
 #parquet2 = { git = "https://github.com/blaze-init/parquet2.git", branch = "meta_new" }
68 changes: 64 additions & 4 deletions README.md
@@ -49,19 +49,25 @@ the convenience of an SQL interface or a DataFrame API.

 ## Known Uses

+Projects that adapt to or serve as plugins to DataFusion:
+
+- [datafusion-python](https://github.com/datafusion-contrib/datafusion-python)
+- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
+- [datafusion-ruby](https://github.com/j-a-m-l/datafusion-ruby)
+- [datafusion-objectstore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
+- [datafusion-hdfs-native](https://github.com/datafusion-contrib/datafusion-hdfs-native)
+
 Here are some of the projects known to use DataFusion:

 - [Ballista](ballista) Distributed Compute Platform
 - [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
 - [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [datafusion-python](https://pypi.org/project/datafusion)
-- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
-- [datafusion-ruby](https://github.com/j-a-m-l/datafusion-ruby)
 - [delta-rs](https://github.com/delta-io/delta-rs)
 - [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
 - [ROAPI](https://github.com/roapi/roapi)
 - [Tensorbase](https://github.com/tensorbase/tensorbase)
 - [Squirtle](https://github.com/DSLAM-UMD/Squirtle)
+- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar

 (if you know of another project, please submit a PR to add a link!)

@@ -134,6 +140,60 @@ datafusion = "6.0.0"

 DataFusion also includes a simple command-line interactive SQL utility. See the [CLI reference](https://arrow.apache.org/datafusion/cli/index.html) for more information.

+# Roadmap
+
+A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the project's contributors. This roadmap is not binding.
+
+## 2022 Q1
+
+### DataFusion Core
+
+- Publish official Arrow2 branch
+- Implementation of memory manager (i.e. to enable spilling to disk as needed)
+
+### Benchmarking
+
+- Inclusion in Db-Benchmark with all queries covered
+- All TPCH queries covered
+
+### Performance Improvements
+
+- Predicate evaluation
+- Improve multi-column comparisons (that can't be vectorized at the moment)
+- Null constant support
+
+### New Features
+
+- Read JSON as table
+- Simplify DDL with Datafusion-Cli
+- Add Decimal128 data type and the attendant features such as Arrow Kernel and UDF support
+- Add new experimental e-graph based optimizer
+
+### Ballista
+
+- Begin work on design documents and plan / priorities for development
+
+### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib))
+
+- Stable S3 support
+- Begin design discussions and prototyping of a stream provider
+
+## Beyond 2022 Q1
+
+There is no clear timeline for the below, but community members have expressed interest in working on these topics.
+
+### DataFusion Core
+
+- Custom SQL support
+- Split DataFusion into multiple crates
+- Push based query execution and code generation
+
+### Ballista
+
+- Evolve architecture so that it can be deployed in a multi-tenant cloud native environment
+- Ensure Ballista is scalable, elastic, and stable for production usage
+- Develop distributed ML capabilities
+
 # Status

 ## General
@@ -266,7 +326,7 @@ This library currently supports many SQL constructs, including
 - `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
 - Many mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
 - `WHERE` to filter
-- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`, `VAR`, `STDDEV` (sample and population)
+- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`, `CORR`, `VAR`, `COVAR`, `STDDEV` (sample and population)
 - `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`

 ## Supported Functions
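Review note: the README hunk above adds `CORR` and `COVAR` to the list of supported aggregations. For readers unfamiliar with them, the following is a sketch of the textbook sample covariance and Pearson correlation that aggregates with these names usually compute. It is plain Rust with hypothetical helper names (`mean`, `covar_samp`, `corr`), not DataFusion's accumulator code, and it makes no claim about how DataFusion treats nulls or edge cases.

```rust
/// Arithmetic mean; the caller guarantees a non-empty slice.
/// (Hypothetical helper for illustration only.)
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1).
/// Returns `None` when the inputs differ in length or have fewer than 2 rows.
fn covar_samp(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let (mx, my) = (mean(xs), mean(ys));
    let sum: f64 = xs.iter().zip(ys).map(|(x, y)| (x - mx) * (y - my)).sum();
    Some(sum / (xs.len() - 1) as f64)
}

/// Pearson correlation: cov(x, y) / (stddev(x) * stddev(y)).
/// Returns `None` when either column has zero variance.
fn corr(xs: &[f64], ys: &[f64]) -> Option<f64> {
    let cov = covar_samp(xs, ys)?;
    // The sample variance of a column is its covariance with itself.
    let sx = covar_samp(xs, xs)?.sqrt();
    let sy = covar_samp(ys, ys)?.sqrt();
    if sx == 0.0 || sy == 0.0 {
        None
    } else {
        Some(cov / (sx * sy))
    }
}
```

For two perfectly linearly related columns such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]`, `corr` comes out at 1 up to floating-point rounding.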
2 changes: 1 addition & 1 deletion ballista-examples/Cargo.toml
@@ -26,7 +26,7 @@ license = "Apache-2.0"
 keywords = [ "arrow", "distributed", "query", "sql" ]
 edition = "2021"
 publish = false
-rust-version = "1.57"
+rust-version = "1.58"

 [dependencies]
 datafusion = { path = "../datafusion" }
4 changes: 3 additions & 1 deletion ballista/rust/client/Cargo.toml
@@ -24,7 +24,7 @@ homepage = "https://github.com/apache/arrow-datafusion"
 repository = "https://github.com/apache/arrow-datafusion"
 authors = ["Apache Arrow <[email protected]>"]
 edition = "2021"
-rust-version = "1.57"
+rust-version = "1.58"

 [dependencies]
 ballista-core = { path = "../core", version = "0.6.0" }
@@ -33,6 +33,8 @@ ballista-scheduler = { path = "../scheduler", version = "0.6.0", optional = true
 futures = "0.3"
 log = "0.4"
 tokio = "1.0"
+tempfile = "3"
+sqlparser = "0.13"

 datafusion = { path = "../../../datafusion", version = "6.0.0" }

7 changes: 4 additions & 3 deletions ballista/rust/client/src/columnar_batch.rs
@@ -23,8 +23,9 @@ use datafusion::arrow::{
 array::ArrayRef,
 compute::aggregate::estimated_bytes_size,
 datatypes::{DataType, Schema},
-record_batch::RecordBatch,
 };
+use datafusion::field_util::{FieldExt, SchemaExt};
+use datafusion::record_batch::RecordBatch;
 use datafusion::scalar::ScalarValue;

 pub type MaybeColumnarBatch = Result<Option<ColumnarBatch>>;
@@ -44,7 +45,7 @@ impl ColumnarBatch {
 .enumerate()
 .map(|(i, array)| {
 (
-batch.schema().field(i).name().clone(),
+batch.schema().field(i).name().to_string(),
 ColumnarValue::Columnar(array.clone()),
 )
 })
@@ -61,7 +62,7 @@ impl ColumnarBatch {
 .fields()
 .iter()
 .enumerate()
-.map(|(i, f)| (f.name().clone(), values[i].clone()))
+.map(|(i, f)| (f.name().to_string(), values[i].clone()))
 .collect();

 Self {
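Review note: the two changes from `.name().clone()` to `.name().to_string()` in `columnar_batch.rs` are consistent with the field-name accessor now returning a borrowed `&str` instead of `&String`. That is an inference from the diff; the `FieldExt` definition is not shown on this page. A minimal sketch of why the call has to change, using a hypothetical `Field` stand-in rather than the real arrow2 type:

```rust
/// Hypothetical stand-in for a schema field whose accessor returns `&str`.
/// This is an assumption for illustration, not the arrow2 `Field` type.
struct Field {
    name: String,
}

impl Field {
    /// Borrowed view of the field name.
    fn name(&self) -> &str {
        &self.name
    }
}

/// Collects owned column names, as the `(String, ColumnarValue)` tuples
/// in the diff need an owned `String` for each column.
fn column_names(fields: &[Field]) -> Vec<String> {
    fields
        .iter()
        // `f.name().clone()` would only copy the `&str` reference here,
        // so the result would not type-check as `String`.
        .map(|f| f.name().to_string())
        .collect()
}
```

With an accessor that returned `&String`, `.clone()` would yield an owned `String`; with `&str` it yields another `&str`, so `.to_string()` (or `.to_owned()`) is the way to get the owned copy.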