feat: enable block stream write #18285


Open: wants to merge 7 commits into base: main

@zhyass (Member) commented Jul 1, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces three main changes:

  1. Stream Write Functionality: Enables stream write and makes it the default behavior for writes. Stream writes allow more efficient and flexible handling of data insertion, which is especially useful in high-throughput scenarios.

  2. Block-Level Statistics: Adds support for block-level statistics, including HyperLogLog (HLL) sketches for distinct count estimation. Statistics collected per block can later be aggregated into higher-level statistics (such as segment-level or table-level).

  3. New Table Option - approx_distinct_columns: Adds a table option, approx_distinct_columns, that lets users specify which columns get HLL statistics for approximate distinct count calculation. By default, all eligible columns are tracked; setting the option to an empty string disables HLL statistics for the table, as the t2 example below shows (block_stats_size is 0).
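The aggregation described in item 2 relies on a general property of HyperLogLog: two sketches with the same precision can be merged by taking the register-wise maximum, and the result equals the sketch that would have been built from the union of the inputs. The following is a minimal Python sketch of that idea only, not Databend's implementation; the class name BlockHLL, the precision of 12, and the use of SHA-256 as the hash function are all assumptions made for illustration.

```python
import hashlib
import math


class BlockHLL:
    """Toy HyperLogLog sketch for one column of one block (hypothetical name)."""

    def __init__(self, precision: int = 12):
        self.p = precision              # assumed precision, not taken from the PR
        self.m = 1 << precision         # number of registers
        self.registers = [0] * self.m

    def add(self, value) -> None:
        # Derive a 64-bit hash from the value (SHA-256 is an illustrative choice).
        digest = hashlib.sha256(repr(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "BlockHLL") -> "BlockHLL":
        # Register-wise max: this is what makes block-level sketches
        # combinable into a segment-level or table-level sketch.
        if self.p != other.p:
            raise ValueError("sketches must share the same precision")
        merged = BlockHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)          # linear counting for small counts
        return raw


# Two "blocks" whose value ranges overlap; the union has 1500 distinct values.
block_1, block_2 = BlockHLL(), BlockHLL()
for v in range(0, 1000):
    block_1.add(v)
for v in range(500, 1500):
    block_2.add(v)

segment = block_1.merge(block_2)
print(round(segment.estimate()))  # an approximation of 1500, not an exact count
```

Because merging only needs the registers, a higher level can combine per-block sketches without rereading the block data, which is presumably why the statistics are stored per block.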

root@localhost:8000/default/default> create table t1(a int, b string);

root@localhost:8000/default/default> insert into t1 values(1,'a'),(2,'b');

╭─────────────────────────╮
│ number of rows inserted │
│          UInt64         │
├─────────────────────────┤
│                       2 │
╰─────────────────────────╯
2 rows written in 0.075 sec. Processed 2 rows, 36 B (26.67 rows/s, 480 B/s)

root@localhost:8000/default/default> select * from fuse_block('default', 't1');


╭─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ column                │ type             │ value                                                        │
├───────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────┤
│ snapshot_id           │ String           │ '01982733b8b370089f0bf16a62926b13'                           │
│ timestamp             │ Timestamp        │ 2025-07-20 09:39:17.811000                                   │
│ block_location        │ String           │ '1/3399/_b/h01982c5a14b37d2fb24ebaf7bbbdd88f_v2.parquet'     │
│ block_size            │ UInt64           │ 36                                                           │
│ file_size             │ UInt64           │ 527                                                          │
│ row_count             │ UInt64           │ 2                                                            │
│ bloom_filter_location │ Nullable(String) │ '1/3399/_i_b_v2/01982c5a14b37d2fb24ebaf7bbbdd88f_v4.parquet' │
│ bloom_filter_size     │ UInt64           │ 646                                                          │
│ inverted_index_size   │ Nullable(UInt64) │ NULL                                                         │
│ ngram_index_size      │ Nullable(UInt64) │ NULL                                                         │
│ vector_index_size     │ Nullable(UInt64) │ NULL                                                         │
│ virtual_column_size   │ Nullable(UInt64) │ NULL                                                         │
│ block_stats_size      │ UInt64           │ 92                                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
1 row read in 0.038 sec. Processed 1 row, 253 B (26.32 rows/s, 6.50 KiB/s)

root@localhost:8000/default/default> create table t2(a int, b string) approx_distinct_columns = '';

root@localhost:8000/default/default> insert into t2 values(1,'a'),(2,'b');

╭─────────────────────────╮
│ number of rows inserted │
│          UInt64         │
├─────────────────────────┤
│                       2 │
╰─────────────────────────╯
2 rows written in 0.076 sec. Processed 2 rows, 36 B (26.32 rows/s, 473 B/s)

root@localhost:8000/default/default> select * from fuse_block('default', 't2');

╭─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ column                │ type             │ value                                                        │
├───────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────┤
│ snapshot_id           │ String           │ '019827349c3a789d913e29320ea41411'                           │
│ timestamp             │ Timestamp        │ 2025-07-20 09:40:16.058000                                   │
│ block_location        │ String           │ '1/3413/_b/h01982c5af83a7b45a15dcbbc6ceac701_v2.parquet'     │
│ block_size            │ UInt64           │ 36                                                           │
│ file_size             │ UInt64           │ 527                                                          │
│ row_count             │ UInt64           │ 2                                                            │
│ bloom_filter_location │ Nullable(String) │ '1/3413/_i_b_v2/01982c5af83a7b45a15dcbbc6ceac701_v4.parquet' │
│ bloom_filter_size     │ UInt64           │ 646                                                          │
│ inverted_index_size   │ Nullable(UInt64) │ NULL                                                         │
│ ngram_index_size      │ Nullable(UInt64) │ NULL                                                         │
│ vector_index_size     │ Nullable(UInt64) │ NULL                                                         │
│ virtual_column_size   │ Nullable(UInt64) │ NULL                                                         │
│ block_stats_size      │ UInt64           │ 0                                                            │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
1 row read in 0.039 sec. Processed 1 row, 253 B (25.64 rows/s, 6.33 KiB/s)

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@zhyass zhyass marked this pull request as draft July 1, 2025 10:50
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jul 1, 2025
@zhyass zhyass force-pushed the feat_stream branch 2 times, most recently from f79fe6f to 6fc4e0c Compare July 4, 2025 19:45
@zhyass zhyass added the ci-benchmark Benchmark: run all test label Jul 6, 2025
@zhyass zhyass force-pushed the feat_stream branch 2 times, most recently from 553a5ce to 26e8a74 Compare July 9, 2025 05:53
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Jul 9, 2025
@zhyass zhyass force-pushed the feat_stream branch 2 times, most recently from a9dfc76 to db64b98 Compare July 13, 2025 17:27
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Jul 13, 2025
@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-benchmark Benchmark: run all test labels Jul 14, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jul 14, 2025
@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jul 15, 2025

Docker Image for PR

  • tag: pr-18285-52e5e77-1752696188

note: this image tag is only available for internal use.

@databendlabs databendlabs deleted a comment from github-actions bot Jul 17, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jul 17, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jul 17, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-cloud Build docker image for cloud test labels Jul 17, 2025

Docker Image for PR

  • tag: pr-18285-83f3ff5-1752737265

note: this image tag is only available for internal use.

@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-benchmark Benchmark: run all test labels Jul 18, 2025
@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jul 18, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jul 18, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jul 18, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Jul 18, 2025
@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jul 18, 2025

Docker Image for PR

  • tag: pr-18285-520c95e-1752832956

note: this image tag is only available for internal use.

@zhyass zhyass marked this pull request as ready for review July 20, 2025 09:34
@dantengsky dantengsky added the ci-benchmark Benchmark: run all test label Jul 21, 2025

Docker Image for PR

  • tag: pr-18285-b7fbc19-1753081489

note: this image tag is only available for internal use.

Labels: ci-benchmark (Benchmark: run all test), ci-cloud (Build docker image for cloud test), pr-feature (this PR introduces a new feature to the codebase)

2 participants