From fba115ef1899d6a64adc157ada4f50fd427e117f Mon Sep 17 00:00:00 2001 From: discord9 Date: Wed, 18 Jun 2025 13:09:40 +0800 Subject: [PATCH 01/13] docs: hll&udd Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 173 ++++++++++++++++++++ docs/reference/sql/functions/overview.md | 4 + 2 files changed, 177 insertions(+) create mode 100644 docs/reference/sql/functions/approximate.md diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md new file mode 100644 index 000000000..6fd54d091 --- /dev/null +++ b/docs/reference/sql/functions/approximate.md @@ -0,0 +1,173 @@ +--- +keywords: [Approximate functions, approximate count distinct, approximate quantile, SQL functions] +description: Lists and describes approximate functions available in GreptimeDB, including their usage and examples. +--- + +# Approximate Functions + +This page lists two approximate functions in GreptimeDB, `hll` and `uddsketch`, which are used for approximate data analysis. + +:::warning +The following approximate functions is currently experimental and may change in future releases. +::: + +## Approximate Count Distinct (HLL) + +The `hll` function is used to calculate the approximate count distinct of a set of values. It uses [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) algorithm for efficient memory usage and speed. Three functions are provided for this purpose: +- `hll(value)` to create a HyperLogLog state in binary from a given column. +- `hll_merge(hll_state)` to merge multiple HyperLogLog states into one. +- `hll_count(hll_state)` to get the approximate count distinct from a HyperLogLog state. + +Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct, and the larger the dataset, the more accurate the results will be. 
The relative error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative error of about 0.008125(or 0.8125%). + +### Usage Example +This example demonstrates how to use the `hll` functions to calculate the approximate count distinct of user visits. + +```sql +CREATE TABLE access_log ( + `url` STRING, + user_id BIGINT, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (`url`, `user_id`) +); + +CREATE TABLE access_log_10s ( + `url` STRING, + time_window timestamp time INDEX, + state BINARY, + PRIMARY KEY (`url`) +); + +-- Insert some sample data into access_log +INSERT INTO access_log VALUES + ("/dashboard", 1, "2025-03-04 00:00:00"), + ("/dashboard", 1, "2025-03-04 00:00:01"), + ("/dashboard", 2, "2025-03-04 00:00:05"), + ("/not_found", 3, "2025-03-04 00:00:11"), + ("/dashboard", 4, "2025-03-04 00:00:15"); + +-- Use a 10-second windowed query to calculate the HyperLogLog states +INSERT INTO + access_log_10s +SELECT + `url`, + date_bin("10s" :: INTERVAL, ts) AS time_window, + hll(`user_id`) AS state +FROM + access_log +GROUP BY + `url`, + time_window; + +-- use hll_count to query approximate data in access_log_10s +SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; + +-- results as follows: +-- +------------+---------------------+ +-- | url | time_window | +-- +------------+---------------------+ +-- | /dashboard | 2025-03-04 00:00:00 | +-- | /dashboard | 2025-03-04 00:00:10 | +-- | /not_found | 2025-03-04 00:00:10 | +-- +------------+---------------------+ + +-- in addition, we can aggregate the 10-second data to a 1-minute level by merging the HyperLogLog states using `hll_merge`. 
+SELECT + `url`, + date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, + hll_count(hll_merge(state)) as uv_per_min +FROM + access_log_10s +GROUP BY + `url`, + date_bin('1 minute' :: INTERVAL, `time_window`); + +-- results as follows: +-- +------------+---------------------+------------+ +-- | url | time_window_1m | uv_per_min | +-- +------------+---------------------+------------+ +-- | /dashboard | 2025-03-04 00:00:00 | 3 | +-- | /not_found | 2025-03-04 00:00:00 | 1 | +-- +------------+---------------------+------------+ +``` + +## Approximate Quantile (UDDSketch) + +Three functions are provided for approximate quantile calculation using the [UDDSketch](https://arxiv.org/abs/2004.08604) algorithm: +- `uddsketch(bucket_num, error_rate, value)` to create a UDDSketch state in binary from a given column, the `bucket_num` is the number of buckets to use for the sketch, and `error_rate` is the desired error rate for the quantile calculation. +- `uddsketch_merge(bucket_num, error_rate, uddsketch_state)` to merge multiple UDDSketch states into one, where `bucket_num` and `error_rate` must match the original sketch where the state was created. +- `uddsketch_calc(quantile, uddsketch_state)` to get the approximate quantile from a UDDSketch state. The `quantile` is a value between 0 and 1, representing the desired quantile to calculate. + +Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. 
+ +### Usage Example +This example demonstrates how to use the `uddsketch` functions to calculate the approximate quantile of a set of values. + +```sql +CREATE TABLE percentile_base ( + `id` INT PRIMARY KEY, + `value` DOUBLE, + `ts` timestamp(0) time index +); + +CREATE TABLE percentile_5s ( + `percentile_state` BINARY, + `time_window` timestamp(0) time index +); + +-- Insert some sample data into percentile_base +INSERT INTO percentile_base (`id`, `value`, `ts`) VALUES + (1, 10.0, 1), + (2, 20.0, 2), + (3, 30.0, 3), + (4, 40.0, 4), + (5, 50.0, 5), + (6, 60.0, 6), + (7, 70.0, 7), + (8, 80.0, 8), + (9, 90.0, 9), + (10, 100.0, 10); + +-- Use a 5-second windowed query to calculate the UDDSketch states +INSERT INTO + percentile_5s +SELECT + uddsketch_state(128, 0.01, `value`) AS percentile_state, + date_bin('5 seconds' :: INTERVAL, `ts`) AS time_window +FROM + percentile_base +GROUP BY + time_window; + +-- query percentile_5s to get the approximate 99th percentile +SELECT + time_window, + uddsketch_calc(0.99, `percentile_state`) AS p99 +FROM + percentile_5s; + +-- results as follows: +-- +---------------------+--------------------+ +-- | time_window | p99 | +-- +---------------------+--------------------+ +-- | 1970-01-01 00:00:00 | 40.04777053326359 | +-- | 1970-01-01 00:00:05 | 89.13032933635911 | +-- | 1970-01-01 00:00:10 | 100.49456770856492 | +-- +---------------------+--------------------+ + +-- in addition, we can aggregate the 5-second data to a 1-minute level by merging the UDDSketch states using `uddsketch_merge`. 
+SELECT + date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, + uddsketch_calc(0.99, uddsketch_merge(128, 0.01, `percentile_state`)) AS p99 +FROM + percentile_5s +GROUP BY + time_window_1m; + +-- results as follows: +-- +---------------------+--------------------+ +-- | time_window_1m | p99 | +-- +---------------------+--------------------+ +-- | 1970-01-01 00:00:00 | 100.49456770856492 | +-- +---------------------+--------------------+ +``` \ No newline at end of file diff --git a/docs/reference/sql/functions/overview.md b/docs/reference/sql/functions/overview.md index 5d7b595ad..46d1300ec 100644 --- a/docs/reference/sql/functions/overview.md +++ b/docs/reference/sql/functions/overview.md @@ -261,3 +261,7 @@ about these functions](./geo.md) ## Vector Functions GreptimeDB supports vector functions for vector operations, such as distance calculation, similarity measurement, etc. [Learn more about these functions](./vector.md) + +## Approximate Functions + +GreptimeDB supports some approximate functions for data analysis, such as approximate count distinct(hll), approximate quantile(uddsketch), etc. [Learn more about these functions](./approximate.md) From 3fb42715a97944e528854a6fcd1ca244c85676af Mon Sep 17 00:00:00 2001 From: discord9 Date: Wed, 18 Jun 2025 15:30:48 +0800 Subject: [PATCH 02/13] refactor: split chapters Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 49 +++++++++++++++++---- 1 file changed, 41 insertions(+), 8 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 6fd54d091..1ee719c01 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -13,10 +13,21 @@ The following approximate functions is currently experimental and may change in ## Approximate Count Distinct (HLL) -The `hll` function is used to calculate the approximate count distinct of a set of values. 
It uses [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) algorithm for efficient memory usage and speed. Three functions are provided for this purpose: -- `hll(value)` to create a HyperLogLog state in binary from a given column. -- `hll_merge(hll_state)` to merge multiple HyperLogLog states into one. -- `hll_count(hll_state)` to get the approximate count distinct from a HyperLogLog state. +The `hll` function is used to calculate the approximate count distinct of a set of values. It uses [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) algorithm for efficient memory usage and speed. Three functions are provided for this purpose, described in following chapters: + +### `hll` + +`hll(value)` creates a HyperLogLog state in binary from a given column. The `value` can be any column that you want to calculate the approximate count distinct for. It returns a binary representation of the HLL state, which can be stored in a table or used in further calculations. + +### `hll_merge` + +`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by `hll`. The merged state can then be used to calculate the approximate count distinct across all the merged states. + +### `hll_count` + +`hll_count(hll_state)` retrieves the approximate count distinct from a HyperLogLog state. This function takes the HLL state created by `hll` or merged by `hll_merge` and returns the approximate count of distinct values. + +### Caveats Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct, and the larger the dataset, the more accurate the results will be. 
The relative error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative error of about 0.008125(or 0.8125%). @@ -93,10 +104,32 @@ GROUP BY ## Approximate Quantile (UDDSketch) -Three functions are provided for approximate quantile calculation using the [UDDSketch](https://arxiv.org/abs/2004.08604) algorithm: -- `uddsketch(bucket_num, error_rate, value)` to create a UDDSketch state in binary from a given column, the `bucket_num` is the number of buckets to use for the sketch, and `error_rate` is the desired error rate for the quantile calculation. -- `uddsketch_merge(bucket_num, error_rate, uddsketch_state)` to merge multiple UDDSketch states into one, where `bucket_num` and `error_rate` must match the original sketch where the state was created. -- `uddsketch_calc(quantile, uddsketch_state)` to get the approximate quantile from a UDDSketch state. The `quantile` is a value between 0 and 1, representing the desired quantile to calculate. +Three functions are provided for approximate quantile calculation using the [UDDSketch](https://arxiv.org/abs/2004.08604) algorithm. + +### `uddsketch_state` + +The `uddsketch_state` function is used to create a UDDSketch state in binary from a given column. It takes three parameters: +- `bucket_num`, which is the number of buckets to use for the sketch, +- `error_rate`, which is the desired error rate for the quantile calculation. +- `value` parameter is the column from which the sketch will be created. + +### `uddsketch_merge` + +The `uddsketch_merge` function is used to merge multiple UDDSketch states into one. It takes three parameters: +- `bucket_num`, which is the number of buckets to use for the sketch, +- `error_rate`, which is the desired error rate for the quantile calculation. +- `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state`. 
+ +This is useful when you want to combine results from different time windows or sources. Notice that the `bucket_num` and `error_rate` must match the original sketch where the state was created, or else the merge will fail. + + +### `uddsketch_calc` + +The `uddsketch_calc` function is used to calculate the approximate quantile from a UDDSketch state. It takes two parameters: +- `quantile`, which is a value between 0 and 1 representing the desired quantile to calculate, i.e., 0.99 for the 99th percentile. +- `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state` or merged by `uddsketch_merge`. + +### Caveats Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. From c9e1d59c25ee9a11ccd0ec334769dcdd4b6dc8ba Mon Sep 17 00:00:00 2001 From: discord9 Date: Wed, 18 Jun 2025 16:14:30 +0800 Subject: [PATCH 03/13] more examples Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 192 +++++++++++++++++++- 1 file changed, 182 insertions(+), 10 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 1ee719c01..a3a4011a6 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -19,14 +19,86 @@ The `hll` function is used to calculate the approximate count distinct of a set `hll(value)` creates a HyperLogLog state in binary from a given column. 
The `value` can be any column that you want to calculate the approximate count distinct for. It returns a binary representation of the HLL state, which can be stored in a table or used in further calculations. +For example, for a simple table `access_log` shown below, we can create a `hll` state for the `user_id` column. The output will be a binary representation of the HLL state, which contains the necessary information to calculate approximate count distinct later. + +```sql +CREATE TABLE access_log ( + `url` STRING, + user_id BIGINT, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (`url`, `user_id`) +); + + +-- Insert some sample data into access_log +INSERT INTO access_log VALUES + ("/dashboard", 1, "2025-03-04 00:00:00"), + ("/dashboard", 1, "2025-03-04 00:00:01"), + ("/dashboard", 2, "2025-03-04 00:00:05"), + ("/not_found", 3, "2025-03-04 00:00:11"), + ("/dashboard", 4, "2025-03-04 00:00:15"); + +-- Use a 10-second windowed query to calculate the HyperLogLog states +-- The state column is a unreadable binary format, which can be stored in a table or used in further calculations. +SELECT + `url`, + date_bin("10s" :: INTERVAL, ts) AS time_window, + hll(`user_id`) AS state +FROM + access_log +GROUP BY + `url`, + time_window; +``` + ### `hll_merge` `hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by `hll`. The merged state can then be used to calculate the approximate count distinct across all the merged states. +For example, if you have multiple HLL states from different time windows, you can merge them into a single state to calculate the count distinct across all the data. 
+ +```sql +CREATE TABLE access_log_10s ( + `url` STRING, + time_window timestamp time INDEX, + state BINARY, + PRIMARY KEY (`url`) +); + +-- Use a 10-second windowed query to calculate the HyperLogLog states +INSERT INTO + access_log_10s +SELECT + `url`, + date_bin("10s" :: INTERVAL, ts) AS time_window, + hll(`user_id`) AS state +FROM + access_log +GROUP BY + `url`, + time_window; + +-- merge the HyperLogLog states from the `access_log_10s` table, then calculate the approximate count distinct of user visits in the `access_log_10s` table. +SELECT + `url`, + hll_count(hll_merge(state)) as all_uv +FROM + access_log_10s +GROUP BY + `url`; +``` + ### `hll_count` `hll_count(hll_state)` retrieves the approximate count distinct from a HyperLogLog state. This function takes the HLL state created by `hll` or merged by `hll_merge` and returns the approximate count of distinct values. +For example, you can use `hll_count` to get the approximate count distinct of user visits in the `access_log_10s` table: + +```sql +-- use hll_count to query approximate data in access_log_10s +SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; +``` + ### Caveats Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct, and the larger the dataset, the more accurate the results will be. The relative error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative error of about 0.008125(or 0.8125%). @@ -70,17 +142,17 @@ GROUP BY `url`, time_window; --- use hll_count to query approximate data in access_log_10s +-- use hll_count to query approximate data in access_log_10s, notice for small datasets, the results may not be very accurate. 
+SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; -- results as follows: --- +------------+---------------------+ --- | url | time_window | --- +------------+---------------------+ --- | /dashboard | 2025-03-04 00:00:00 | --- | /dashboard | 2025-03-04 00:00:10 | --- | /not_found | 2025-03-04 00:00:10 | --- +------------+---------------------+ +-- +------------+---------------------+---------------------------------+ +-- | url | time_window | hll_count(access_log_10s.state) | +-- +------------+---------------------+---------------------------------+ +-- | /dashboard | 2025-03-04 00:00:00 | 2 | +-- | /dashboard | 2025-03-04 00:00:10 | 1 | +-- | /not_found | 2025-03-04 00:00:10 | 1 | +-- +------------+---------------------+---------------------------------+ -- in addition, we can aggregate the 10-second data to a 1-minute level by merging the HyperLogLog states using `hll_merge`. SELECT @@ -113,6 +185,24 @@ The `uddsketch_state` function is used to create a UDDSketch state in binary fro - `error_rate`, which is the desired error rate for the quantile calculation. - `value` parameter is the column from which the sketch will be created. +For example, for a simple table `percentile_base` shown below, we can create a `uddsketch_state` for the `value` column with a bucket number of 128 and an error rate of 0.01 (1%). The output will be a binary representation of the UDDSketch state, which contains the necessary information to calculate approximate quantiles later. + +```sql +CREATE TABLE percentile_base ( + `id` INT PRIMARY KEY, + `value` DOUBLE, + `ts` timestamp(0) time index +); + +-- notice the output state is an unreadable binary format, which can be stored in a table or used in further calculations. 
+SELECT + uddsketch_state(128, 0.01, `value`) AS percentile_state +FROM + percentile_base; +``` + +This output binary state can be thought of as a histogram of the values in the `value` column, which can then be merged using `uddsketch_merge` or used to calculate quantiles using `uddsketch_calc` as shown later. + ### `uddsketch_merge` The `uddsketch_merge` function is used to merge multiple UDDSketch states into one. It takes three parameters: @@ -122,6 +212,52 @@ The `uddsketch_merge` function is used to merge multiple UDDSketch states into o This is useful when you want to combine results from different time windows or sources. Notice that the `bucket_num` and `error_rate` must match the original sketch where the state was created, or else the merge will fail. +For example, if you have multiple UDDSketch states from different time windows, you can merge them into a single state to calculate the overall quantile across all the data. + +```sql +CREATE TABLE percentile_base ( + `id` INT PRIMARY KEY, + `value` DOUBLE, + `ts` timestamp(0) time index +); + +CREATE TABLE percentile_5s ( + `percentile_state` BINARY, + `time_window` timestamp(0) time index +); + +-- Insert some sample data into percentile_base +INSERT INTO percentile_base (`id`, `value`, `ts`) VALUES + (1, 10.0, 1), + (2, 20.0, 2), + (3, 30.0, 3), + (4, 40.0, 4), + (5, 50.0, 5), + (6, 60.0, 6), + (7, 70.0, 7), + (8, 80.0, 8), + (9, 90.0, 9), + (10, 100.0, 10); + +INSERT INTO + percentile_5s +SELECT + uddsketch_state(128, 0.01, `value`) AS percentile_state, + date_bin('5 seconds' :: INTERVAL, `ts`) AS time_window +FROM + percentile_base +GROUP BY + time_window; + +-- This query creates a new UDDSketch state by merging the states from the `percentile_5s` table. +SELECT + uddsketch_merge(128, 0.01, `percentile_state`) +FROM + percentile_5s; +``` + +This output binary state can then be used to calculate quantiles using `uddsketch_calc`. 
+ ### `uddsketch_calc` @@ -129,12 +265,48 @@ The `uddsketch_calc` function is used to calculate the approximate quantile from - `quantile`, which is a value between 0 and 1 representing the desired quantile to calculate, i.e., 0.99 for the 99th percentile. - `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state` or merged by `uddsketch_merge`. +For example, if you have a UDDSketch state from the previous steps, you can calculate the approximate 99th percentile as follows: + +```sql +-- calculate the approximate 99th percentile from the UDDSketch state in a 5-second windowed query +SELECT + time_window, + uddsketch_calc(0.99, `percentile_state`) AS p99 +FROM + percentile_5s; + +-- results as follows: +-- +---------------------+--------------------+ +-- | time_window | p99 | +-- +---------------------+--------------------+ +-- | 1970-01-01 00:00:00 | 40.04777053326359 | +-- | 1970-01-01 00:00:05 | 89.13032933635911 | +-- | 1970-01-01 00:00:10 | 100.49456770856492 | +-- +---------------------+--------------------+ + +-- calculate the approximate 99th percentile from the merged UDDSketch state in a 1-minute windowed query +SELECT + date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, + uddsketch_calc(0.99, uddsketch_merge(128, 0.01, `percentile_state`)) AS p99 +FROM + percentile_5s +GROUP BY + time_window_1m; + +-- results as follows: +-- +---------------------+--------------------+ +-- | time_window_1m | p99 | +-- +---------------------+--------------------+ +-- | 1970-01-01 00:00:00 | 100.49456770856492 | +-- +---------------------+--------------------+ +``` + ### Caveats Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. 
The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. -### Usage Example -This example demonstrates how to use the `uddsketch` functions to calculate the approximate quantile of a set of values. +### Full Usage Example +This example demonstrates how to use three `uddsketch` functions describe above to calculate the approximate quantile of a set of values. ```sql CREATE TABLE percentile_base ( From 7c66b8cc1417f20fa02e48418dc896ba1eda11e9 Mon Sep 17 00:00:00 2001 From: discord9 Date: Wed, 18 Jun 2025 16:26:45 +0800 Subject: [PATCH 04/13] chore: per review Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index a3a4011a6..25c9b74e5 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -103,7 +103,7 @@ SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct, and the larger the dataset, the more accurate the results will be. The relative error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative error of about 0.008125(or 0.8125%). -### Usage Example +### Full Usage Example This example demonstrates how to use the `hll` functions to calculate the approximate count distinct of user visits. 
```sql @@ -303,7 +303,7 @@ GROUP BY ### Caveats -Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. +Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. A recommended value for `bucket_num` is 128, which provides a good balance between accuracy and memory usage for most use cases. The error rate can be set to a small value, such as 0.01 (1%), to achieve high accuracy in the quantile calculations. ### Full Usage Example This example demonstrates how to use three `uddsketch` functions describe above to calculate the approximate quantile of a set of values. 
From b480f6aa01ecedf46352dc963d0368c57985408a Mon Sep 17 00:00:00 2001 From: discord9 <55937128+discord9@users.noreply.github.com> Date: Wed, 18 Jun 2025 17:32:25 +0800 Subject: [PATCH 05/13] Apply suggestions from code review Co-authored-by: Yiran --- docs/reference/sql/functions/approximate.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 25c9b74e5..829b82716 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -5,7 +5,7 @@ description: Lists and describes approximate functions available in GreptimeDB, # Approximate Functions -This page lists two approximate functions in GreptimeDB, `hll` and `uddsketch`, which are used for approximate data analysis. +This page lists approximate functions in GreptimeDB, which are used for approximate data analysis. :::warning The following approximate functions is currently experimental and may change in future releases. @@ -53,7 +53,7 @@ GROUP BY ### `hll_merge` -`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by `hll`. The merged state can then be used to calculate the approximate count distinct across all the merged states. +`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by [`hll`](#hll). The merged state can then be used to calculate the approximate count distinct across all the merged states. 
For example, if you have multiple HLL states from different time windows, you can merge them into a single state to calculate the count distinct across all the data. @@ -301,7 +301,7 @@ GROUP BY -- +---------------------+--------------------+ ``` -### Caveats +### How to determine `bucket_num` value Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. A recommended value for `bucket_num` is 128, which provides a good balance between accuracy and memory usage for most use cases. The error rate can be set to a small value, such as 0.01 (1%), to achieve high accuracy in the quantile calculations. From ee308469e129e142ee10ea46db4bb9fd6f7dbd6d Mon Sep 17 00:00:00 2001 From: discord9 Date: Wed, 18 Jun 2025 17:33:01 +0800 Subject: [PATCH 06/13] more per review Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 829b82716..225d93746 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -181,7 +181,7 @@ Three functions are provided for approximate quantile calculation using the [UDD ### `uddsketch_state` The `uddsketch_state` function is used to create a UDDSketch state in binary from a given column. 
It takes three parameters: -- `bucket_num`, which is the number of buckets to use for the sketch, +- `bucket_num`, which is the number of buckets to use for the sketch; see [How to determine `bucket_num` value](#how-to-determine-bucket_num-value) for guidance on choosing it. - `error_rate`, which is the desired error rate for the quantile calculation. - `value` parameter is the column from which the sketch will be created. From 02fdad4dbea2e230f9625fa4dfef2ded187d053d Mon Sep 17 00:00:00 2001 From: discord9 Date: Thu, 19 Jun 2025 14:14:09 +0800 Subject: [PATCH 07/13] refactor: cleanup Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 241 +++++--------------- 1 file changed, 59 insertions(+), 182 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 225d93746..6fed7cd3e 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -15,97 +15,27 @@ The following approximate functions is currently experimental and may change in The `hll` function is used to calculate the approximate count distinct of a set of values. It uses [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) algorithm for efficient memory usage and speed. Three functions are provided for this purpose, described in following chapters: -### `hll` - -`hll(value)` creates a HyperLogLog state in binary from a given column. The `value` can be any column that you want to calculate the approximate count distinct for. It returns a binary representation of the HLL state, which can be stored in a table or used in further calculations. - -For example, for a simple table `access_log` shown below, we can create a `hll` state for the `user_id` column. The output will be a binary representation of the HLL state, which contains the necessary information to calculate approximate count distinct later. 
- -```sql -CREATE TABLE access_log ( - `url` STRING, - user_id BIGINT, - ts TIMESTAMP TIME INDEX, - PRIMARY KEY (`url`, `user_id`) -); +:::warning +Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct. The relative standard error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative standard error of about 0.008125(or 0.8125%). +::: +### `hll` --- Insert some sample data into access_log -INSERT INTO access_log VALUES - ("/dashboard", 1, "2025-03-04 00:00:00"), - ("/dashboard", 1, "2025-03-04 00:00:01"), - ("/dashboard", 2, "2025-03-04 00:00:05"), - ("/not_found", 3, "2025-03-04 00:00:11"), - ("/dashboard", 4, "2025-03-04 00:00:15"); - --- Use a 10-second windowed query to calculate the HyperLogLog states --- The state column is a unreadable binary format, which can be stored in a table or used in further calculations. -SELECT - `url`, - date_bin("10s" :: INTERVAL, ts) AS time_window, - hll(`user_id`) AS state -FROM - access_log -GROUP BY - `url`, - time_window; -``` +`hll(value)` creates a HyperLogLog state in binary from a given column. The `value` can be any column that you want to calculate the approximate count distinct for. It returns a binary representation of the HLL state, which can be stored in a table or used in further calculations. See [Full Usage Example](#Full_Usage_Example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. ### `hll_merge` -`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by [`hll`](#hll). 
The merged state can then be used to calculate the approximate count distinct across all the merged states. - -For example, if you have multiple HLL states from different time windows, you can merge them into a single state to calculate the count distinct across all the data. - -```sql -CREATE TABLE access_log_10s ( - `url` STRING, - time_window timestamp time INDEX, - state BINARY, - PRIMARY KEY (`url`) -); - --- Use a 10-second windowed query to calculate the HyperLogLog states -INSERT INTO - access_log_10s -SELECT - `url`, - date_bin("10s" :: INTERVAL, ts) AS time_window, - hll(`user_id`) AS state -FROM - access_log -GROUP BY - `url`, - time_window; +`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by [`hll`](#hll). The merged state can then be used to calculate the approximate count distinct across all the merged states. See [Full Usage Example](#Full_Usage_Example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. --- merge the HyperLogLog states from the `access_log_10s` table, then calculate the approximate count distinct of user visits in the `access_log_10s` table. -SELECT - `url`, - hll_count(hll_merge(state)) as all_uv -FROM - access_log_10s -GROUP BY - `url`; -``` ### `hll_count` -`hll_count(hll_state)` retrieves the approximate count distinct from a HyperLogLog state. This function takes the HLL state created by `hll` or merged by `hll_merge` and returns the approximate count of distinct values. 
- -For example, you can use `hll_count` to get the approximate count distinct of user visits in the `access_log_10s` table: - -```sql --- use hll_count to query approximate data in access_log_10s -SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; -``` - -### Caveats - -Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct, and the larger the dataset, the more accurate the results will be. The relative error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative error of about 0.008125(or 0.8125%). +`hll_count(hll_state)` retrieves the approximate count distinct from a HyperLogLog state. This function takes the HLL state created by `hll` or merged by `hll_merge` and returns the approximate count of distinct values. See [Full Usage Example](#Full_Usage_Example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. ### Full Usage Example -This example demonstrates how to use the `hll` functions to calculate the approximate count distinct of user visits. +This example demonstrates how to use these functions in combination to calculate the approximate count distinct user id. +First create the base table `access_log` for storing user access logs, and the `access_log_10s` table for storing the HyperLogLog states within a 10-second time window. Notice the `state` column is of type `BINARY`, which will store the HyperLogLog state in binary format. 
```sql CREATE TABLE access_log ( `url` STRING, @@ -120,15 +50,20 @@ CREATE TABLE access_log_10s ( state BINARY, PRIMARY KEY (`url`) ); +``` --- Insert some sample data into access_log +Insert some sample data into access_log: +```sql INSERT INTO access_log VALUES ("/dashboard", 1, "2025-03-04 00:00:00"), ("/dashboard", 1, "2025-03-04 00:00:01"), ("/dashboard", 2, "2025-03-04 00:00:05"), ("/not_found", 3, "2025-03-04 00:00:11"), ("/dashboard", 4, "2025-03-04 00:00:15"); +``` +Now we can use the `hll` function to create a HyperLogLog state for the `user_id` column with a 10-second time window. The output will be a binary representation of the HLL state, which contains the necessary information to calculate approximate count distinct later. The `date_bin` function is used to group the data into 10-second time windows. Hence this `INSERT INTO` statement will create a HyperLogLog state for each 10-second time window in the `access_log` table, and insert it into the `access_log_10s` table: +```sql -- Use a 10-second windowed query to calculate the HyperLogLog states INSERT INTO access_log_10s @@ -141,7 +76,12 @@ FROM GROUP BY `url`, time_window; +-- results will be similar to this: +-- Query OK, 3 rows affected (0.05 sec) +``` +Then we can use the `hll_count` function to retrieve the approximate count distinct from the HyperLogLog state(which is the `state` column). For example, to get the approximate count distinct of user visits for each 10-second time window, we can run the following query: +```sql -- use hll_count to query approximate data in access_log_10s, notice for small datasets, the results may not be very accurate. 
SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; @@ -153,8 +93,11 @@ SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; -- | /dashboard | 2025-03-04 00:00:10 | 1 | -- | /not_found | 2025-03-04 00:00:10 | 1 | -- +------------+---------------------+---------------------------------+ +``` --- in addition, we can aggregate the 10-second data to a 1-minute level by merging the HyperLogLog states using `hll_merge`. +In addition, we can aggregate the 10-second data to a 1-minute level by merging the HyperLogLog states using `hll_merge`. This allows us to calculate the approximate count distinct for a larger time window, which can be useful for analyzing trends over time. The following query demonstrates how to do this: +```sql +-- aggregate the 10-second data to a 1-minute level by merging the HyperLogLog states using `hll_merge`. SELECT `url`, date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, @@ -174,89 +117,37 @@ GROUP BY -- +------------+---------------------+------------+ ``` +Note how the `hll_merge` function is used to merge the HyperLogLog states from the `access_log_10s` table, and then the `hll_count` function is used to calculate the approximate count distinct for each 1-minute time window. If only use `hll_merge` without `hll_count`, the result will just be a unreadable binary representation of the merged HyperLogLog state, which is not very useful for analysis. Hence we use `hll_count` to retrieve the approximate count distinct from the merged state. + ## Approximate Quantile (UDDSketch) Three functions are provided for approximate quantile calculation using the [UDDSketch](https://arxiv.org/abs/2004.08604) algorithm. +:::warning +Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The results may not be exact but are usually very close to the actual quantiles. 
+::: + ### `uddsketch_state` The `uddsketch_state` function is used to create a UDDSketch state in binary from a given column. It takes three parameters: -- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number](#How_to_determine_`bucket_num`_value) for how to decide the value of `bucket_num`. +- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number and error rate](#How_to_determine_`bucket_num`_value_and_`error_rate`) for how to decide the value. - `error_rate`, which is the desired error rate for the quantile calculation. - `value` parameter is the column from which the sketch will be created. for example, for a simple table `percentile_base` shown below, we can create a `uddsketch_state` for the `value` column with a bucket number of 128 and an error rate of 0.01 (1%). The output will be a binary representation of the UDDSketch state, which contains the necessary information to calculate approximate quantiles later. -```sql -CREATE TABLE percentile_base ( - `id` INT PRIMARY KEY, - `value` DOUBLE, - `ts` timestamp(0) time index -); - --- notice the output state is a unreadable binary format, which can be stored in a table or used in further calculations. -SELECT - uddsketch_state(128, 0.01, `value`) AS percentile_state, -FROM - percentile_base; -``` - -This output binary state can be think of as a histogram of the values in the `value` column, which can then be merged using `uddsketch_merge` or used to calculate quantiles using `uddsketch_calc` as shown later. +This output binary state can be think of as a histogram of the values in the `value` column, which can then be merged using `uddsketch_merge` or used to calculate quantiles using `uddsketch_calc` as shown later. See [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full example of how to use these functions in combination to calculate approximate quantiles. 
### `uddsketch_merge` The `uddsketch_merge` function is used to merge multiple UDDSketch states into one. It takes three parameters: -- `bucket_num`, which is the number of buckets to use for the sketch, +- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number and error rate](#How_to_determine_`bucket_num`_value_and_`error_rate`) for how to decide the value. - `error_rate`, which is the desired error rate for the quantile calculation. - `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state`. This is useful when you want to combine results from different time windows or sources. Notice that the `bucket_num` and `error_rate` must match the original sketch where the state was created, or else the merge will fail. -For example, if you have multiple UDDSketch states from different time windows, you can merge them into a single state to calculate the overall quantile across all the data. - -```sql -CREATE TABLE percentile_base ( - `id` INT PRIMARY KEY, - `value` DOUBLE, - `ts` timestamp(0) time index -); - -CREATE TABLE percentile_5s ( - `percentile_state` BINARY, - `time_window` timestamp(0) time index -); - --- Insert some sample data into percentile_base -INSERT INTO percentile_base (`id`, `value`, `ts`) VALUES - (1, 10.0, 1), - (2, 20.0, 2), - (3, 30.0, 3), - (4, 40.0, 4), - (5, 50.0, 5), - (6, 60.0, 6), - (7, 70.0, 7), - (8, 80.0, 8), - (9, 90.0, 9), - (10, 100.0, 10); - -INSERT INTO - percentile_5s -SELECT - uddsketch_state(128, 0.01, `value`) AS percentile_state, - date_bin('5 seconds' :: INTERVAL, `ts`) AS time_window -FROM - percentile_base -GROUP BY - time_window; - --- This query creates a new UDDSketch state by merging the states from the `percentile_5s` table. -SELECT - uddsketch_merge(128, 0.01, `percentile_state`) -FROM - percentile_5s; -``` - -This output binary state can then be used to calculate quantiles using `uddsketch_calc`. 
+For example, if you have multiple UDDSketch states from different time windows, you can merge them into a single state to calculate the overall quantile across all the data.This output binary state can then be used to calculate quantiles using `uddsketch_calc`. See [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full example of how to use these functions in combination to calculate approximate quantiles. ### `uddsketch_calc` @@ -265,49 +156,20 @@ The `uddsketch_calc` function is used to calculate the approximate quantile from - `quantile`, which is a value between 0 and 1 representing the desired quantile to calculate, i.e., 0.99 for the 99th percentile. - `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state` or merged by `uddsketch_merge`. -For example, if you have a UDDSketch state from the previous steps, you can calculate the approximate 99th percentile as follows: +see [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full example of how to use these functions in combination to calculate approximate quantiles. -```sql --- calculate the approximate 99th percentile from the UDDSketch state in a 5-second windowed query -SELECT - time_window, - uddsketch_calc(0.99, `percentile_state`) AS p99 -FROM - percentile_5s; +### How to determine `bucket_num` and `error_rate` --- results as follows: --- +---------------------+--------------------+ --- | time_window | p99 | --- +---------------------+--------------------+ --- | 1970-01-01 00:00:00 | 40.04777053326359 | --- | 1970-01-01 00:00:05 | 89.13032933635911 | --- | 1970-01-01 00:00:10 | 100.49456770856492 | --- +---------------------+--------------------+ +The `bucket_num` parameter sets the maximum number of internal containers the sketch can use, directly controlling its memory footprint. Think of it as the physical storage capacity for tracking different value ranges. 
A larger `bucket_num` allows the sketch to accurately represent a wider dynamic range of data (i.e., a larger ratio between the maximum and minimum values). If this limit is too small for your data, the sketch will be forced to merge very high or low values, which degrades its accuracy. A recommended value for `bucket_num` is 128, which provides a good balance between accuracy and memory usage for most use cases. --- calculate the approximate 99th percentile from the merged UDDSketch state in a 1-minute windowed query -SELECT - date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, - uddsketch_calc(0.99, uddsketch_merge(128, 0.01, `percentile_state`)) AS p99 -FROM - percentile_5s -GROUP BY - time_window_1m; +The `error_rate` defines the desired precision for your quantile calculations. It guarantees that any computed quantile (e.g., p99) is within a certain *relative* percentage of the true value. For example, an `error_rate` of `0.01` ensures the result is within 1% of the actual value. A smaller `error_rate` provides higher accuracy, as it forces the sketch to use more granular buckets to distinguish between closer numbers. --- results as follows: --- +---------------------+--------------------+ --- | time_window_1m | p99 | --- +---------------------+--------------------+ --- | 1970-01-01 00:00:00 | 100.49456770856492 | --- +---------------------+--------------------+ -``` - -### How to determine `bucket_num` value +These two parameters create a direct trade-off. To achieve the high precision promised by a small `error_rate`, the sketch needs a sufficient `bucket_num`, especially for data that spans a wide range. `bucket_num` acts as the physical limit on accuracy. If your `bucket_num` is restricted by memory constraints, setting the `error_rate` to an extremely small value will not improve precision beyond the limit imposed by the available buckets. 
-Notice that the UDDSketch algorithm is designed to provide approximate quantiles with a tunable error rate, which allows for efficient memory usage and fast calculations. The error rate is the maximum relative error allowed in the quantile calculation, and it can be adjusted based on the requirements of the application. The `bucket_num` parameter determines the number of buckets used in the sketch, which also affects the accuracy and memory usage of the algorithm. The larger the `bucket_num`, the more accurate the results will be, but it will also consume more memory. A recommended value for `bucket_num` is 128, which provides a good balance between accuracy and memory usage for most use cases. The error rate can be set to a small value, such as 0.01 (1%), to achieve high accuracy in the quantile calculations. - -### Full Usage Example +### UDDSketch Full Usage Example This example demonstrates how to use three `uddsketch` functions describe above to calculate the approximate quantile of a set of values. +First create the base table `percentile_base` for store the raw data, and the `percentile_5s` table for storing the UDDSketch states within a 5-second time window. notice the `percentile_state` column is of type `BINARY`, which will store the UDDSketch state in binary format. 
```sql CREATE TABLE percentile_base ( `id` INT PRIMARY KEY, @@ -319,8 +181,10 @@ CREATE TABLE percentile_5s ( `percentile_state` BINARY, `time_window` timestamp(0) time index ); +``` --- Insert some sample data into percentile_base +Insert some sample data into `percentile_base` : +```sql INSERT INTO percentile_base (`id`, `value`, `ts`) VALUES (1, 10.0, 1), (2, 20.0, 2), @@ -332,8 +196,11 @@ INSERT INTO percentile_base (`id`, `value`, `ts`) VALUES (8, 80.0, 8), (9, 90.0, 9), (10, 100.0, 10); +``` --- Use a 5-second windowed query to calculate the UDDSketch states +Now we can use the `uddsketch_state` function to create a UDDSketch state for the `value` column with a bucket number of 128 and an error rate of 0.01 (1%). The output will be a binary representation of the UDDSketch state, which contains the necessary information to calculate approximate quantiles later, the `date_bin` function is used to group the data into 5-second time windows. Hence this `INSERT INTO` statement will create a UDDSketch state for each 5-second time window in the `percentile_base` table, and insert it into the `percentile_5s` table: + +```sql INSERT INTO percentile_5s SELECT @@ -343,7 +210,12 @@ FROM percentile_base GROUP BY time_window; +-- results will be similar to this: +-- Query OK, 3 rows affected (0.05 sec) +``` +Now we can use the `uddsketch_calc` function to calculate the approximate quantile from the UDDSketch state. For example, to get the approximate 99th percentile (p99) for each 5-second time window, we can run the following query: +```sql -- query percentile_5s to get the approximate 99th percentile SELECT time_window, @@ -359,7 +231,11 @@ FROM -- | 1970-01-01 00:00:05 | 89.13032933635911 | -- | 1970-01-01 00:00:10 | 100.49456770856492 | -- +---------------------+--------------------+ +``` +Notice in above query the `percentile_state` column is the UDDSketch state created by `uddsketch_state`. 
+In addition, we can aggregate the 5-second data to a 1-minute level by merging the UDDSketch states using `uddsketch_merge`. This allows us to calculate the approximate quantile for a larger time window, which can be useful for analyzing trends over time. The following query demonstrates how to do this: +```sql -- in addition, we can aggregate the 5-second data to a 1-minute level by merging the UDDSketch states using `uddsketch_merge`. SELECT date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, @@ -375,4 +251,5 @@ GROUP BY -- +---------------------+--------------------+ -- | 1970-01-01 00:00:00 | 100.49456770856492 | -- +---------------------+--------------------+ -``` \ No newline at end of file +``` +Notice how the `uddsketch_merge` function is used to merge the UDDSketch states from the `percentile_5s` table, and then the `uddsketch_calc` function is used to calculate the approximate 99th percentile (p99) for each 1-minute time window. \ No newline at end of file From 2f60b4497bd61b838ed13297ee068a97cbfa335b Mon Sep 17 00:00:00 2001 From: discord9 Date: Thu, 19 Jun 2025 16:13:52 +0800 Subject: [PATCH 08/13] refactor: rephrase&update hll sample data to more representive Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 6fed7cd3e..40a814d1d 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -13,7 +13,7 @@ The following approximate functions is currently experimental and may change in ## Approximate Count Distinct (HLL) -The `hll` function is used to calculate the approximate count distinct of a set of values. It uses [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) algorithm for efficient memory usage and speed. 
Three functions are provided for this purpose, described in following chapters: +The [HyperLogLog]((https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)) (HLL) algorithm is used to calculate the approximate count distinct of a set of values. It provides efficient memory usage and speed for this purpose. Three functions are provided to work with the HLL algorithm, described in following chapters: :::warning Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct. The relative standard error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative standard error of about 0.008125(or 0.8125%). @@ -58,8 +58,12 @@ INSERT INTO access_log VALUES ("/dashboard", 1, "2025-03-04 00:00:00"), ("/dashboard", 1, "2025-03-04 00:00:01"), ("/dashboard", 2, "2025-03-04 00:00:05"), + ("/dashboard", 2, "2025-03-04 00:00:10"), + ("/dashboard", 2, "2025-03-04 00:00:13"), + ("/dashboard", 4, "2025-03-04 00:00:15"), + ("/not_found", 1, "2025-03-04 00:00:10"), ("/not_found", 3, "2025-03-04 00:00:11"), - ("/dashboard", 4, "2025-03-04 00:00:15"); + ("/not_found", 4, "2025-03-04 00:00:12"); ``` Now we can use the `hll` function to create a HyperLogLog state for the `user_id` column with a 10-second time window. The output will be a binary representation of the HLL state, which contains the necessary information to calculate approximate count distinct later. The `date_bin` function is used to group the data into 10-second time windows. 
Hence this `INSERT INTO` statement will create a HyperLogLog state for each 10-second time window in the `access_log` table, and insert it into the `access_log_10s` table: @@ -90,8 +94,8 @@ SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; -- | url | time_window | hll_count(access_log_10s.state) | -- +------------+---------------------+---------------------------------+ -- | /dashboard | 2025-03-04 00:00:00 | 2 | --- | /dashboard | 2025-03-04 00:00:10 | 1 | --- | /not_found | 2025-03-04 00:00:10 | 1 | +-- | /dashboard | 2025-03-04 00:00:10 | 2 | +-- | /not_found | 2025-03-04 00:00:10 | 3 | -- +------------+---------------------+---------------------------------+ ``` @@ -113,7 +117,7 @@ GROUP BY -- | url | time_window_1m | uv_per_min | -- +------------+---------------------+------------+ -- | /dashboard | 2025-03-04 00:00:00 | 3 | --- | /not_found | 2025-03-04 00:00:00 | 1 | +-- | /not_found | 2025-03-04 00:00:00 | 3 | -- +------------+---------------------+------------+ ``` From d5c8f3334e5db124bd07b326915732f431c63238 Mon Sep 17 00:00:00 2001 From: discord9 Date: Thu, 19 Jun 2025 16:59:27 +0800 Subject: [PATCH 09/13] docs: two flowchart to explain more clearly Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 8 +- static/hll.svg | 102 ++++++++++++++++++++ static/udd.svg | 102 ++++++++++++++++++++ 3 files changed, 211 insertions(+), 1 deletion(-) create mode 100644 static/hll.svg create mode 100644 static/udd.svg diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 40a814d1d..406bdbc44 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -123,6 +123,9 @@ GROUP BY Note how the `hll_merge` function is used to merge the HyperLogLog states from the `access_log_10s` table, and then the `hll_count` function is used to calculate the approximate count distinct for each 1-minute time window. 
If only use `hll_merge` without `hll_count`, the result will just be a unreadable binary representation of the merged HyperLogLog state, which is not very useful for analysis. Hence we use `hll_count` to retrieve the approximate count distinct from the merged state. +This following flowchart illustrates above usage of the HyperLogLog functions. First raw event data is first group by time window and url, then the `hll` function is used to create a HyperLogLog state for each time window and url, then the `hll_count` function is used to retrieve the approximate count distinct for each time window and url. Finally, the `hll_merge` function is used to merge the HyperLogLog states for each url, and then the `hll_count` function is used again to retrieve the approximate count distinct for the 1-minute time window. +![HLL Usage Flowchart](/hll.svg) + ## Approximate Quantile (UDDSketch) Three functions are provided for approximate quantile calculation using the [UDDSketch](https://arxiv.org/abs/2004.08604) algorithm. @@ -256,4 +259,7 @@ GROUP BY -- | 1970-01-01 00:00:00 | 100.49456770856492 | -- +---------------------+--------------------+ ``` -Notice how the `uddsketch_merge` function is used to merge the UDDSketch states from the `percentile_5s` table, and then the `uddsketch_calc` function is used to calculate the approximate 99th percentile (p99) for each 1-minute time window. \ No newline at end of file +Notice how the `uddsketch_merge` function is used to merge the UDDSketch states from the `percentile_5s` table, and then the `uddsketch_calc` function is used to calculate the approximate 99th percentile (p99) for each 1-minute time window. + +This following flowchart illustrates above usage of the UDDSketch functions. First raw event data is first group by time window, then the `uddsketch_state` function is used to create a UDDSketch state for each time window, then the `uddsketch_calc` function is used to retrieve the approximate 99th quantile for each time window. 
Finally, the `uddsketch_merge` function is used to merge the UDDSketch states for each time window, and then the `uddsketch_calc` function is used again to retrieve the approximate 99th quantile for the 1-minute time window. +![UDDSketch Usage Flowchart](/udd.svg) diff --git a/static/hll.svg b/static/hll.svg new file mode 100644 index 000000000..d30864345 --- /dev/null +++ b/static/hll.svg @@ -0,0 +1,102 @@ +

[Figure static/hll.svg: HLL usage flowchart. Three 10-second windows (/dashboard 00:00:00-00:00:10, /dashboard 00:00:10-00:00:20, /not_found 00:00:10-00:00:20) each yield a binary HLL state; hll_count on each state gives ~2, ~2, ~3. hll_merge() combines the states into a merged binary HLL, and hll_count() grouped per 1 minute gives /dashboard: ~3, /not_found: ~3 for 00:00:00-00:01:00 (approx).]

\ No newline at end of file diff --git a/static/udd.svg b/static/udd.svg new file mode 100644 index 000000000..fbb29e546 --- /dev/null +++ b/static/udd.svg @@ -0,0 +1,102 @@ +

[Figure static/udd.svg: UDDSketch usage flowchart. Three 5-second windows (00:00:00-00:00:05, values 10.0 to 40.0, 4 records; 00:00:05-00:00:10, values 50.0 to 90.0, 5 records; 00:00:10-00:00:15, value 100.0, 1 record) each yield a binary uddsketch_state; the step labeled udd_calc(0.99) gives ~40.0, ~90.0, ~100.0. uddsketch_merge() combines the states into a merged binary UDDSketch, and udd_calc(0.99) grouped by 1 minute gives ~100.0 for 00:00:00-00:01:00.]

\ No newline at end of file From 81aaa7160bcb3ac6a8d742cfae800c589c4227f2 Mon Sep 17 00:00:00 2001 From: discord9 Date: Thu, 19 Jun 2025 18:49:29 +0800 Subject: [PATCH 10/13] chore: typos Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index 406bdbc44..f07153479 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -13,7 +13,7 @@ The following approximate functions is currently experimental and may change in ## Approximate Count Distinct (HLL) -The [HyperLogLog]((https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)) (HLL) algorithm is used to calculate the approximate count distinct of a set of values. It provides efficient memory usage and speed for this purpose. Three functions are provided to work with the HLL algorithm, described in following chapters: +The [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) algorithm is used to calculate the approximate count distinct of a set of values. It provides efficient memory usage and speed for this purpose. Three functions are provided to work with the HLL algorithm, described in following chapters: :::warning Notice that due to the approximate nature of the algorithm, the results may not be exact but are usually very close to the actual count distinct. The relative standard error of the HyperLogLog algorithm is about 1.04/sqrt(m), where m is the number of registers used in the algorithm. GreptimeDB uses 16384 registers by default, which gives a relative standard error of about 0.008125(or 0.8125%). 
@@ -167,11 +167,11 @@ see [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full exa ### How to determine `bucket_num` and `error_rate` -The `bucket_num` parameter sets the maximum number of internal containers the sketch can use, directly controlling its memory footprint. Think of it as the physical storage capacity for tracking different value ranges. A larger `bucket_num` allows the sketch to accurately represent a wider dynamic range of data (i.e., a larger ratio between the maximum and minimum values). If this limit is too small for your data, the sketch will be forced to merge very high or low values, which degrades its accuracy. A recommended value for `bucket_num` is 128, which provides a good balance between accuracy and memory usage for most use cases. +The `bucket_num` parameter sets the maximum number of internal containers the sketch can use, directly controlling its memory footprint. Think of it as the physical storage capacity for tracking different value ranges. A larger `bucket_num` allows the sketch to accurately represent a wider dynamic range of data (i.e. a larger ratio between the maximum and minimum values). If this limit is too small for your data, the sketch will be forced to merge very high or low values, which degrades its accuracy. A recommended value for `bucket_num` is 128, which provides a good balance between accuracy and memory usage for most use cases. The `error_rate` defines the desired precision for your quantile calculations. It guarantees that any computed quantile (e.g., p99) is within a certain *relative* percentage of the true value. For example, an `error_rate` of `0.01` ensures the result is within 1% of the actual value. A smaller `error_rate` provides higher accuracy, as it forces the sketch to use more granular buckets to distinguish between closer numbers. -These two parameters create a direct trade-off. 
To achieve the high precision promised by a small `error_rate`, the sketch needs a sufficient `bucket_num`, especially for data that spans a wide range. `bucket_num` acts as the physical limit on accuracy. If your `bucket_num` is restricted by memory constraints, setting the `error_rate` to an extremely small value will not improve precision beyond the limit imposed by the available buckets. +These two parameters create a direct trade-off. To achieve the high precision promised by a small `error_rate`, the sketch needs a sufficient `bucket_num`, especially for data that spans a wide range. `bucket_num` acts as the physical limit on accuracy. If your `bucket_num` is restricted by memory constraints, setting the `error_rate` to an extremely small value will not improve precision because of the limit imposed by the available buckets. ### UDDSketch Full Usage Example This example demonstrates how to use three `uddsketch` functions describe above to calculate the approximate quantile of a set of values. From 7bdb23d5db541274398d94e06d998713d6ee33e1 Mon Sep 17 00:00:00 2001 From: discord9 Date: Thu, 19 Jun 2025 18:53:23 +0800 Subject: [PATCH 11/13] fix inner link Signed-off-by: discord9 --- docs/reference/sql/functions/approximate.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/reference/sql/functions/approximate.md b/docs/reference/sql/functions/approximate.md index f07153479..e24dff6a2 100644 --- a/docs/reference/sql/functions/approximate.md +++ b/docs/reference/sql/functions/approximate.md @@ -21,16 +21,16 @@ Notice that due to the approximate nature of the algorithm, the results may not ### `hll` -`hll(value)` creates a HyperLogLog state in binary from a given column. The `value` can be any column that you want to calculate the approximate count distinct for. It returns a binary representation of the HLL state, which can be stored in a table or used in further calculations. 
See [Full Usage Example](#Full_Usage_Example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. +`hll(value)` creates a HyperLogLog state in binary from a given column. The `value` can be any column that you want to calculate the approximate count distinct for. It returns a binary representation of the HLL state, which can be stored in a table or used in further calculations. See [Full Usage Example](#full-usage-example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. ### `hll_merge` -`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by [`hll`](#hll). The merged state can then be used to calculate the approximate count distinct across all the merged states. See [Full Usage Example](#Full_Usage_Example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. +`hll_merge(hll_state)` merges multiple HyperLogLog states into one. This is useful when you want to combine the results of multiple HLL calculations, such as when aggregating data from different time windows or sources. The `hll_state` parameter is the binary representation of the HLL state created by [`hll`](#hll). The merged state can then be used to calculate the approximate count distinct across all the merged states. See [Full Usage Example](#full-usage-example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. ### `hll_count` -`hll_count(hll_state)` retrieves the approximate count distinct from a HyperLogLog state. 
This function takes the HLL state created by `hll` or merged by `hll_merge` and returns the approximate count of distinct values. See [Full Usage Example](#Full_Usage_Example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. +`hll_count(hll_state)` retrieves the approximate count distinct from a HyperLogLog state. This function takes the HLL state created by `hll` or merged by `hll_merge` and returns the approximate count of distinct values. See [Full Usage Example](#full-usage-example) for a full example of how to use this function in combination with other functions to calculate approximate count distinct. ### Full Usage Example This example demonstrates how to use these functions in combination to calculate the approximate count distinct user id. @@ -137,24 +137,24 @@ Notice that the UDDSketch algorithm is designed to provide approximate quantiles ### `uddsketch_state` The `uddsketch_state` function is used to create a UDDSketch state in binary from a given column. It takes three parameters: -- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number and error rate](#How_to_determine_`bucket_num`_value_and_`error_rate`) for how to decide the value. +- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number and error rate](#how-to-determine-bucket_num-and-error_rate) for how to decide the value. - `error_rate`, which is the desired error rate for the quantile calculation. - `value` parameter is the column from which the sketch will be created. for example, for a simple table `percentile_base` shown below, we can create a `uddsketch_state` for the `value` column with a bucket number of 128 and an error rate of 0.01 (1%). The output will be a binary representation of the UDDSketch state, which contains the necessary information to calculate approximate quantiles later. 
-This output binary state can be think of as a histogram of the values in the `value` column, which can then be merged using `uddsketch_merge` or used to calculate quantiles using `uddsketch_calc` as shown later. See [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full example of how to use these functions in combination to calculate approximate quantiles. +This output binary state can be thought of as a histogram of the values in the `value` column, which can then be merged using `uddsketch_merge` or used to calculate quantiles using `uddsketch_calc` as shown later. See [UDDSketch Full Usage Example](#uddsketch-full-usage-example) for a full example of how to use these functions in combination to calculate approximate quantiles. ### `uddsketch_merge` The `uddsketch_merge` function is used to merge multiple UDDSketch states into one. It takes three parameters: -- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number and error rate](#How_to_determine_`bucket_num`_value_and_`error_rate`) for how to decide the value. +- `bucket_num`, which is the number of buckets to use for the sketch, see [How to determine bucket number and error rate](#how-to-determine-bucket_num-and-error_rate) for how to decide the value. - `error_rate`, which is the desired error rate for the quantile calculation. - `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state`. This is useful when you want to combine results from different time windows or sources. Notice that the `bucket_num` and `error_rate` must match the original sketch where the state was created, or else the merge will fail. -For example, if you have multiple UDDSketch states from different time windows, you can merge them into a single state to calculate the overall quantile across all the data.This output binary state can then be used to calculate quantiles using `uddsketch_calc`. 
See [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full example of how to use these functions in combination to calculate approximate quantiles. +For example, if you have multiple UDDSketch states from different time windows, you can merge them into a single state to calculate the overall quantile across all the data. This output binary state can then be used to calculate quantiles using `uddsketch_calc`. See [UDDSketch Full Usage Example](#uddsketch-full-usage-example) for a full example of how to use these functions in combination to calculate approximate quantiles. ### `uddsketch_calc` @@ -163,7 +163,7 @@ The `uddsketch_calc` function is used to calculate the approximate quantile from - `quantile`, which is a value between 0 and 1 representing the desired quantile to calculate, i.e., 0.99 for the 99th percentile. - `udd_state`, which is the binary representation of the UDDSketch state created by `uddsketch_state` or merged by `uddsketch_merge`. -see [UDDSketch Full Usage Example](#UDDSketch_Full_Usage_Example) for a full example of how to use these functions in combination to calculate approximate quantiles. +See [UDDSketch Full Usage Example](#uddsketch-full-usage-example) for a full example of how to use these functions in combination to calculate approximate quantiles. 
### How to determine `bucket_num` and `error_rate` From 28d2150cfbc3b5761fb6b94ac6d1ba8712e6fe88 Mon Sep 17 00:00:00 2001 From: discord9 Date: Thu, 19 Jun 2025 19:30:48 +0800 Subject: [PATCH 12/13] fix: add to sidebar Signed-off-by: discord9 --- sidebars.ts | 1 + 1 file changed, 1 insertion(+) diff --git a/sidebars.ts b/sidebars.ts index 7c9209dee..7134e8aa9 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -627,6 +627,7 @@ const sidebars: SidebarsConfig = { 'reference/sql/functions/ip', 'reference/sql/functions/json', 'reference/sql/functions/vector', + 'reference/sql/functions/approximate', ] }, 'reference/sql/admin', From f98ec16bdde76ee9c0aa584f7fe2836197c9b538 Mon Sep 17 00:00:00 2001 From: discord9 Date: Fri, 20 Jun 2025 15:17:52 +0800 Subject: [PATCH 13/13] docs: update zh Signed-off-by: discord9 --- docs/reference/sql/functions/overview.md | 6 +- .../reference/sql/functions/approximate.md | 265 ++++++++++++++++++ .../reference/sql/functions/overview.md | 5 + 3 files changed, 273 insertions(+), 3 deletions(-) create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/approximate.md diff --git a/docs/reference/sql/functions/overview.md b/docs/reference/sql/functions/overview.md index 46d1300ec..4000a8d0b 100644 --- a/docs/reference/sql/functions/overview.md +++ b/docs/reference/sql/functions/overview.md @@ -253,15 +253,15 @@ GreptimeDB provides `ADMIN` statement to run the administration functions, pleas GreptimeDB provide functions for jsons. [Learn more about these functions](./json.md) -## Geospatial Functions +### Geospatial Functions GreptimeDB provide functions for geo-index, trajectory analytics. [Learn more about these functions](./geo.md) -## Vector Functions +### Vector Functions GreptimeDB supports vector functions for vector operations, such as distance calculation, similarity measurement, etc. 
[Learn more about these functions](./vector.md) -## Approximate Functions +### Approximate Functions GreptimeDB supports some approximate functions for data analysis, such as approximate count distinct(hll), approximate quantile(uddsketch), etc. [Learn more about these functions](./approximate.md) diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/approximate.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/approximate.md new file mode 100644 index 000000000..2e782bab6 --- /dev/null +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/approximate.md @@ -0,0 +1,265 @@ +--- +keywords: [近似函数, 近似去重计数, 近似分位线, SQL 函数] +description: 列出和描述 GreptimeDB 中可用的近似函数,包括它们的用法和示例。 +--- + +# 近似函数 + +本页面列出了 GreptimeDB 中的近似函数,这些函数用于近似数据分析。 + +:::warning +下述的近似函数目前仍处于实验阶段,可能会在未来的版本中发生变化。 +::: + +## 近似去重计数 (HLL) + +这里使用了 [HyperLogLog](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) (HLL) 算法来计算一组值的近似去重计数。它在内存使用和速度方面提供了高效的性能。GreptimeDB 提供了三个函数来处理 HLL 算法,具体描述如下: + +:::warning +由于算法的近似性质,结果可能不完全精确,但通常非常接近实际的去重计数。HyperLogLog 算法的相对标准误差约为 1.04/√m,其中 m 是算法中使用的寄存器数量。GreptimeDB 默认使用 16384 个寄存器,这使得相对标准误差约为 0.008125(即 0.8125%)。 +::: + +### `hll` + +`hll(value)` 从给定列创建二进制的 HyperLogLog 状态。`value` 可以是你希望计算近似去重计数的任何列。它返回 HLL 状态的二进制表示,可以存储在表中或用于进一步计算。有关如何结合其他函数使用此函数计算近似去重计数的完整示例,请参阅 [完整使用示例](#完整使用示例)。 + +### `hll_merge` + +`hll_merge(hll_state)` 将多个 HyperLogLog 状态合并为一个。当你需要合并多个 HLL 计算结果时,例如聚合来自不同时间窗口或来源的数据时,这非常有用。`hll_state` 参数是由 [`hll`](#hll) 创建的 HLL 状态的二进制表示。合并后的状态可用于计算所有合并状态的近似去重计数。有关如何结合其他函数使用此函数计算近似去重计数的完整示例,请参阅 [完整使用示例](#完整使用示例)。 + + +### `hll_count` + +`hll_count(hll_state)` 从 HyperLogLog 状态中计算得到近似去重计数的结果。此函数接受由 `hll` 创建或由 `hll_merge` 合并的 HLL 状态,并返回近似的去重值计数。有关如何结合其他函数使用此函数计算近似去重计数的完整示例,请参阅 [完整使用示例](#完整使用示例)。 + +### 完整使用示例 + +此示例演示了如何组合使用这些函数来计算近似的去重的用户 ID 的数量。 + +首先创建用于存储用户访问日志的基础表 `access_log`,以及用于在 10 秒时间窗口内存储 HyperLogLog 状态的 `access_log_10s` 表。请注意,`state` 列的类型为 
`BINARY`,它将以二进制格式存储 HyperLogLog 状态。 +```sql +CREATE TABLE access_log ( + `url` STRING, + user_id BIGINT, + ts TIMESTAMP TIME INDEX, + PRIMARY KEY (`url`, `user_id`) +); + +CREATE TABLE access_log_10s ( + `url` STRING, + time_window timestamp time INDEX, + state BINARY, + PRIMARY KEY (`url`) +); +``` + +将一些示例数据插入到 `access_log` 中: +```sql +INSERT INTO access_log VALUES + ("/dashboard", 1, "2025-03-04 00:00:00"), + ("/dashboard", 1, "2025-03-04 00:00:01"), + ("/dashboard", 2, "2025-03-04 00:00:05"), + ("/dashboard", 2, "2025-03-04 00:00:10"), + ("/dashboard", 2, "2025-03-04 00:00:13"), + ("/dashboard", 4, "2025-03-04 00:00:15"), + ("/not_found", 1, "2025-03-04 00:00:10"), + ("/not_found", 3, "2025-03-04 00:00:11"), + ("/not_found", 4, "2025-03-04 00:00:12"); +``` + +现在我们可以使用 `hll` 函数为 `user_id` 列创建 10 秒时间窗口的 HyperLogLog 状态。输出将是 HLL 状态的二进制表示,其中包含计算后续近似去重计数所需的信息。`date_bin` 函数用于将数据分组到 10 秒的时间窗口中。因此,此 `INSERT INTO` 语句将为 `access_log` 表中每个 10 秒时间窗口创建 HyperLogLog 状态,并将其插入到 `access_log_10s` 表中: +```sql +-- 使用 10 秒窗口查询来计算 HyperLogLog 状态: +INSERT INTO + access_log_10s +SELECT + `url`, + date_bin("10s" :: INTERVAL, ts) AS time_window, + hll(`user_id`) AS state +FROM + access_log +GROUP BY + `url`, + time_window; +-- 结果类似: +-- Query OK, 3 rows affected (0.05 sec) +``` +然后我们可以使用 `hll_count` 函数从 HyperLogLog 状态(即 `state` 列)中检索近似去重计数。例如,要获取每个 10 秒时间窗口的用户访问近似去重计数,我们可以运行以下查询: +```sql +-- 使用 `hll_count` 查询 `access_log_10s` 中的近似数据,请注意对于小型数据集,结果可能不是很准确。 +SELECT `url`, `time_window`, hll_count(state) FROM access_log_10s; + +-- 结果如下: +-- +------------+---------------------+---------------------------------+ +-- | url | time_window | hll_count(access_log_10s.state) | +-- +------------+---------------------+---------------------------------+ +-- | /dashboard | 2025-03-04 00:00:00 | 2 | +-- | /dashboard | 2025-03-04 00:00:10 | 2 | +-- | /not_found | 2025-03-04 00:00:10 | 3 | +-- +------------+---------------------+---------------------------------+ +``` + +此外,我们可以通过使用 `hll_merge` 合并 
HyperLogLog 状态,将 10 秒的数据聚合到 1 分钟级别。这使我们能够计算更大时间窗口的近似去重计数,这对于分析随时间变化的趋势非常有用。以下查询演示了如何实现: +```sql +-- 使用 `hll_merge` 合并 HyperLogLog 状态,将 10 秒的数据聚合到 1 分钟级别。 +SELECT + `url`, + date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, + hll_count(hll_merge(state)) as uv_per_min +FROM + access_log_10s +GROUP BY + `url`, + date_bin('1 minute' :: INTERVAL, `time_window`); + +-- 结果如下: +-- +------------+---------------------+------------+ +-- | url | time_window_1m | uv_per_min | +-- +------------+---------------------+------------+ +-- | /dashboard | 2025-03-04 00:00:00 | 3 | +-- | /not_found | 2025-03-04 00:00:00 | 3 | +-- +------------+---------------------+------------+ +``` + +请注意 `hll_merge` 函数如何用于合并 `access_log_10s` 表中的 HyperLogLog 状态,然后 `hll_count` 函数用于计算每个 1 分钟时间窗口的近似去重计数。如果只使用 `hll_merge` 而不使用 `hll_count`,结果将只是合并后的 HyperLogLog 状态的不可读二进制表示,这对于分析没有太大用处。因此,我们使用 `hll_count` 从合并后的状态中计算得到近似去重计数。 + +以下流程图说明了 HyperLogLog 函数的上述用法。首先,原始事件数据按时间窗口和 URL 分组,然后使用 `hll` 函数为每个时间窗口和 URL 创建一个 HyperLogLog 状态,接着使用 `hll_count` 函数检索每个时间窗口和 URL 的近似去重计数。最后,使用 `hll_merge` 函数合并每个 URL 的 HyperLogLog 状态,然后再次使用 `hll_count` 函数检索 1 分钟时间窗口的近似去重计数。 +![HLL 用例流程图](/hll.svg) + +## 近似分位线(UDDSketch) + +使用 [UDDSketch](https://arxiv.org/abs/2004.08604) 算法提供了三个函数用于近似分位数计算。 + +:::warning +值得注意的是,UDDSketch 算法旨在提供具有可调误差率的近似分位数,这有助于实现高效的内存使用和快速计算。结果可能并非完全精确,但通常非常接近实际分位数。 +::: + +### `uddsketch_state` + +`uddsketch_state` 函数用于从给定列创建二进制格式的 UDDSketch 状态。它接受三个参数: +- `bucket_num`:用于记录分位线信息的桶数量。关于如何确定该值,请参阅[如何确定桶数量和误差率](#如何确定桶数量和误差率)。 +- `error_rate`:分位数计算所需的误差率。 +- `value`:用于计算分位线的列。 + +例如,对于下述表 `percentile_base`,我们可以为 `value` 列创建一个 `uddsketch_state`,其中桶数量为 128,误差率为 0.01 (1%)。输出将是 UDDSketch 状态的二进制表示,其中包含后续计算近似分位数所需的信息。 + +该输出的二进制状态可被视为 `value` 列中值的直方图,后续可使用 `uddsketch_merge` 进行合并,或使用 `uddsketch_calc` 计算分位数。有关如何结合使用这些函数来计算近似分位数的完整示例,请参阅[UDDSketch 完整使用示例](#uddsketch-完整使用示例)。 + +### `uddsketch_merge` + +`uddsketch_merge` 函数用于将多个 UDDSketch 状态合并为一个。它接受三个参数: +- 
`bucket_num`:用于记录分位线信息的桶数量。关于如何确定该值,请参阅[如何确定桶数量和误差率](#如何确定桶数量和误差率)。 +- `error_rate`:分位数计算所需的误差率。 +- `udd_state`:由 `uddsketch_state` 创建的 UDDSketch 状态的二进制表示。 + +当您需要合并来自不同时间窗口或来源的结果时,此函数非常有用。请注意,`bucket_num` 和 `error_rate` 必须与创建原始状态时使用的参数匹配,否则合并将失败。 + +例如,如果您有来自不同时间窗口的多个 UDDSketch 状态,您可以将它们合并为一个状态,以计算所有数据的整体分位数。该输出的二进制状态随后可用于使用 `uddsketch_calc` 计算分位数。有关如何结合使用这些函数来计算近似分位数的完整示例,请参阅[UDDSketch 完整使用示例](#uddsketch-完整使用示例)。 + + +### `uddsketch_calc` + +`uddsketch_calc` 函数用于从 UDDSketch 状态计算近似分位数。它接受两个参数: +- `quantile`:一个介于 0 和 1 之间的值,表示要计算的目标分位数,例如,0.99 代表第 99 百分位数。 +- `udd_state`:由 `uddsketch_state` 创建或由 `uddsketch_merge` 合并的 UDDSketch 状态的二进制表示。 + +有关如何结合使用这些函数来计算近似分位数的完整示例,请参阅[UDDSketch 完整使用示例](#uddsketch-完整使用示例)。 + +### 如何确定桶数量和误差率 + +`bucket_num` 参数设置了 UDDSketch 算法可使用的内部容器的最大数量,直接控制其内存占用。可以将其视为跟踪不同值范围的物理存储容量。更大的 `bucket_num` 允许草图更准确地表示更宽的数据动态范围(即最大值和最小值之间更大的比率)。如果此限制对于您的数据而言过小,草图将被迫合并非常高或非常低的值,从而降低其准确性。对于大多数用例,`bucket_num` 的推荐值为 128,这在准确性和内存使用之间提供了良好的平衡。 + +`error_rate` 定义了分位数计算所需的精度。它保证任何计算出的分位数(例如 p99)都在真实值的某个*相对*百分比范围内。例如,`error_rate` 为 `0.01` 确保结果在实际值的 1% 以内。较小的 `error_rate` 提供更高的准确性,因为它强制 UDDSketch 算法使用更细粒度的桶来区分更接近的数字。 + +这两个参数之间存在直接的权衡关系。为了达到小 `error_rate` 所承诺的高精度,UDDSketch 算法需要足够的 `bucket_num`,特别是对于跨度较大的数据。`bucket_num` 充当了精度的物理限制。如果您的 `bucket_num` 受到内存限制,那么将 `error_rate` 设置为极小值并不会提高精度,因为受到可用桶数量的限制。 + +### UDDSketch 完整使用示例 +本示例演示了如何使用上述三个 `uddsketch` 函数来计算一组值的近似分位数。 + +首先创建用于存储原始数据的基表 `percentile_base`,以及用于存储 5 秒时间窗口内 UDDSketch 状态的 `percentile_5s` 表。请注意,`percentile_state` 列的类型为 `BINARY`,它将以二进制格式存储 UDDSketch 状态。 +```sql +CREATE TABLE percentile_base ( + `id` INT PRIMARY KEY, + `value` DOUBLE, + `ts` timestamp(0) time index +); + +CREATE TABLE percentile_5s ( + `percentile_state` BINARY, + `time_window` timestamp(0) time index +); +``` + +向 `percentile_base` 插入一些示例数据: +```sql +INSERT INTO percentile_base (`id`, `value`, `ts`) VALUES + (1, 10.0, 1), + (2, 20.0, 2), + (3, 30.0, 3), + (4, 40.0, 4), + (5, 50.0, 5), + (6, 60.0, 6), + (7, 70.0, 7), + (8, 80.0, 
8), + (9, 90.0, 9), + (10, 100.0, 10); +``` + +现在我们可以使用 `uddsketch_state` 函数为 `value` 列创建一个 UDDSketch 状态,其中桶数量为 128,误差率为 0.01 (1%)。输出将是 UDDSketch 状态的二进制表示,其中包含后续计算近似分位数所需的信息。`date_bin` 函数用于将数据分到 5 秒的时间窗口中。因此,此 `INSERT INTO` 语句将为 `percentile_base` 表中每个 5 秒时间窗口创建 UDDSketch 状态,并将其插入到 `percentile_5s` 表中: + +```sql +INSERT INTO + percentile_5s +SELECT + uddsketch_state(128, 0.01, `value`) AS percentile_state, + date_bin('5 seconds' :: INTERVAL, `ts`) AS time_window +FROM + percentile_base +GROUP BY + time_window; +-- 结果类似: +-- Query OK, 3 rows affected (0.05 sec) +``` + +现在我们可以使用 `uddsketch_calc` 函数从 UDDSketch 状态中计算近似分位数。例如,要获取每个 5 秒时间窗口的近似第 99 百分位数 (p99),我们可以运行以下查询: +```sql +-- 查询 percentile_5s 以获取近似第 99 百分位数 +SELECT + time_window, + uddsketch_calc(0.99, `percentile_state`) AS p99 +FROM + percentile_5s; + +-- 结果如下: +-- +---------------------+--------------------+ +-- | time_window | p99 | +-- +---------------------+--------------------+ +-- | 1970-01-01 00:00:00 | 40.04777053326359 | +-- | 1970-01-01 00:00:05 | 89.13032933635911 | +-- | 1970-01-01 00:00:10 | 100.49456770856492 | +-- +---------------------+--------------------+ +``` +请注意,在上述查询中,`percentile_state` 列是由 `uddsketch_state` 创建的 UDDSketch 状态。 + +此外,我们可以通过使用 `uddsketch_merge` 合并 UDDSketch 状态,将 5 秒的数据聚合到 1 分钟级别。这使我们能够计算更大时间窗口的近似分位数,这对于分析随时间变化的趋势非常有用。以下查询演示了如何实现: +```sql +-- 此外,我们可以通过使用 `uddsketch_merge` 合并 UDDSketch 状态,将 5 秒的数据聚合到 1 分钟级别。 +SELECT + date_bin('1 minute' :: INTERVAL, `time_window`) AS time_window_1m, + uddsketch_calc(0.99, uddsketch_merge(128, 0.01, `percentile_state`)) AS p99 +FROM + percentile_5s +GROUP BY + time_window_1m; + +-- 结果如下: +-- +---------------------+--------------------+ +-- | time_window_1m | p99 | +-- +---------------------+--------------------+ +-- | 1970-01-01 00:00:00 | 100.49456770856492 | +-- +---------------------+--------------------+ +``` +请注意 `uddsketch_merge` 函数是如何用于合并 `percentile_5s` 表中的 UDDSketch 状态,然后 `uddsketch_calc` 函数用于计算每个 1 分钟时间窗口的近似第 99 百分位数 (p99)。 + +以下流程图说明了 
UDDSketch 函数的上述用法。首先,原始事件数据按时间窗口分组,然后使用 `uddsketch_state` 函数为每个时间窗口创建一个 UDDSketch 状态,接着使用 `uddsketch_calc` 函数检索每个时间窗口的近似第 99 分位数。最后,使用 `uddsketch_merge` 函数合并每个时间窗口的 UDDSketch 状态,然后再次使用 `uddsketch_calc` 函数检索 1 分钟时间窗口的近似第 99 分位数。 +![UDDSketch 用例流程图](/udd.svg) diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md index df9f422a6..277ef61a3 100644 --- a/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md +++ b/i18n/zh/docusaurus-plugin-content-docs/current/reference/sql/functions/overview.md @@ -255,3 +255,8 @@ GreptimeDB 提供了 `ADMIN` 语句来执行管理函数,请阅读 [ADMIN](/re ### 向量函数 [了解 GreptimeDB 中向量相关的函数](./vector.md)。 + +### 近似函数 + +GreptimeDB 支持一些近似函数用于数据分析,例如近似去重计数(hll)、近似分位数(uddsketch)等。[了解 GreptimeDB 中近似函数](./approximate.md)。 +