[Doc] Remove tables (StarRocks#41010)
DanRoscigno authored Feb 9, 2024
1 parent 31061cf commit 6c944db
Showing 2 changed files with 42 additions and 21 deletions.
@@ -69,28 +69,24 @@ Materialized views, Data Cache, and native tables in StarRocks are all effective

Compared to directly querying lake data or loading data into native tables, materialized views offer several unique advantages:

- **Local storage acceleration**: Materialized views can leverage StarRocks' acceleration advantages with local storage, such as indexes, partitioning, bucketing, and collocate groups, resulting in better query performance compared to querying data from the data lake directly.
- **Zero maintenance for loading tasks**: Materialized views update data transparently via automatic refresh tasks. There's no need to maintain loading tasks to perform scheduled data updates. Additionally, Hive, Iceberg, and Paimon catalog-based materialized views can detect data changes and perform incremental refreshes at the partition level.
- **Intelligent query rewrite**: Queries can be transparently rewritten to use materialized views. You can benefit from acceleration instantly without the need to modify the query statements your application uses.

Therefore, we recommend using materialized views in the following scenarios:

- Even when Data Cache is enabled, query performance does not meet your requirements for query latency and concurrency.
- Queries involve reusable components, such as fixed aggregation functions or join patterns.
- Data is organized in partitions, while queries involve aggregation on a relatively high level (e.g., aggregating by day).

In the following scenarios, we recommend prioritizing acceleration through Data Cache:

- Queries do not have many reusable components and may scan any data from the data lake.
- Remote storage has significant fluctuations or instability, which could potentially impact access.

## Create external catalog-based materialized views

Creating a materialized view on tables in external catalogs is similar to creating a materialized view on StarRocks native tables. You only need to set a suitable refresh strategy in accordance with the data source you are using, and manually enable query rewrite for external catalog-based materialized views.

### Choose a suitable refresh strategy

@@ -102,23 +98,36 @@ For Hive Catalog, Iceberg Catalog (starting from v3.1.4), JDBC catalog (starting

- Ensure data consistency to some extent during query rewrite. If there are data changes in the base table in the data lake, the query will not be rewritten to use the materialized view.

:::tip

You can still choose to tolerate a certain level of data inconsistency by setting the property `mv_rewrite_staleness_second` when creating the materialized view. For more information, see [CREATE MATERIALIZED VIEW](../sql-reference/sql-statements/data-definition/CREATE_MATERIALIZED_VIEW.md).

:::

Note that if you need to refresh by partition, the partitioning keys of the materialized view must be included in those of the base table.
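
As a minimal sketch of the two points above (a staleness tolerance and partition-level refresh), the statement below creates a partitioned materialized view on a Hive table. The catalog `hive_catalog`, the table `sales_db.orders`, and its columns `dt` (assumed to be the base table's partition column), `customer_id`, and `amount` are hypothetical names used only for illustration. The refresh interval and the 300-second staleness value are arbitrary choices, not recommendations.

```sql
-- Hypothetical names throughout: hive_catalog, sales_db.orders, dt, customer_id, amount.
-- dt is assumed to be a partition column of the Hive base table, so the
-- materialized view's partitioning key is included in the base table's keys.
CREATE MATERIALIZED VIEW orders_daily_mv
PARTITION BY dt
DISTRIBUTED BY HASH(customer_id)
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
PROPERTIES (
    -- Tolerate up to 300 seconds of staleness during query rewrite (arbitrary value).
    "mv_rewrite_staleness_second" = "300"
)
AS
SELECT
    dt,
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_amount
FROM hive_catalog.sales_db.orders
GROUP BY dt, customer_id;
```

Check the exact clause syntax and supported properties for your StarRocks version against the CREATE MATERIALIZED VIEW reference linked above.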

For Hive catalogs, you can enable the Hive metadata cache refresh feature to allow StarRocks to detect data changes at the partition level. When this feature is enabled, StarRocks periodically accesses the Hive Metastore Service (HMS) or AWS Glue to check the metadata information of recently queried hot data.

To enable the Hive metadata cache refresh feature, you can set the following FE dynamic configuration item using [ADMIN SET FRONTEND CONFIG](../sql-reference/sql-statements/Administration/ADMIN_SET_CONFIG.md):

### Configuration items

#### enable_background_refresh_connector_metadata

**Default**: `true` in v3.0; `false` in v2.5<br/>
**Description**: Whether to enable the periodic Hive metadata cache refresh. When it is enabled, StarRocks polls the metastore (Hive Metastore or AWS Glue) of your Hive cluster and refreshes the cached metadata of frequently accessed Hive catalogs to detect data changes. `true` enables the Hive metadata cache refresh, and `false` disables it.<br/>

#### background_refresh_metadata_interval_millis

**Default**: 600000 (10 minutes)<br/>
**Description**: The interval between two consecutive Hive metadata cache refreshes. Unit: millisecond.<br/>

#### background_refresh_metadata_time_secs_since_last_access_secs

**Default**: 86400 (24 hours)<br/>
**Description**: The expiration time of a Hive metadata cache refresh task. If a Hive catalog that has been accessed is not accessed again within the specified time, StarRocks stops refreshing its cached metadata. StarRocks does not refresh the cached metadata of Hive catalogs that have never been accessed. Unit: second.
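
For illustration, the three items above could be set with ADMIN SET FRONTEND CONFIG as sketched below. The values shown are simply the defaults listed above, not tuning advice.

```sql
-- Enable the periodic Hive metadata cache refresh.
ADMIN SET FRONTEND CONFIG ("enable_background_refresh_connector_metadata" = "true");

-- Refresh every 600000 ms (10 minutes).
ADMIN SET FRONTEND CONFIG ("background_refresh_metadata_interval_millis" = "600000");

-- Stop refreshing a Hive catalog that has not been accessed for 86400 s (24 hours).
ADMIN SET FRONTEND CONFIG ("background_refresh_metadata_time_secs_since_last_access_secs" = "86400");
```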

From v3.1.4, StarRocks supports detecting data changes for Iceberg Catalog at the partition level. Currently, only Iceberg V1 tables are supported.

### Enable query rewrite for external catalog-based materialized views

@@ -109,11 +109,23 @@ StarRocks supports, based on External Catalogs such as Hive Catalog, Iceberg Catalog, H

To enable the Hive metadata cache refresh feature, you can set the following FE dynamic configuration items using [ADMIN SET FRONTEND CONFIG](../sql-reference/sql-statements/Administration/ADMIN_SET_CONFIG.md):

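### Configuration items

#### enable_background_refresh_connector_metadata

**Default**: `true` in v3.0; `false` in v2.5 <br/>
**Description**: Whether to enable the periodic Hive metadata cache refresh. When it is enabled, StarRocks polls the metadata service (HMS or AWS Glue) of your Hive cluster and refreshes the cached metadata of frequently accessed Hive external catalogs to detect data updates. `true` enables the refresh, and `false` disables it. <br/>

#### background_refresh_metadata_interval_millis

**Default**: 600000 (10 minutes) <br/>
**Description**: The interval between two consecutive Hive metadata cache refreshes. Unit: millisecond. <br/>

#### background_refresh_metadata_time_secs_since_last_access_secs

**Default**: 86400 (24 hours) <br/>
**Description**: The expiration time of a Hive metadata cache refresh task. If a Hive catalog that has been accessed is not accessed again within this time, StarRocks stops refreshing its cached metadata. StarRocks does not refresh the cached metadata of Hive catalogs that have never been accessed. Unit: second. <br/>

For Iceberg Catalog, StarRocks supports detecting data changes at the partition level from v3.1.4 onwards. Currently, only Iceberg V1 tables are supported.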
