## docs/connector-read.md (1 addition, 1 deletion)

````diff
@@ -10,7 +10,7 @@ You can also map the StarRocks table to a Spark DataFrame or a Spark RDD, and th
 
 > **NOTICE**
 >
-> Reading data from StarRocks tables with the Spark connector requires the SELECT privilege. If you do not have the privilege, follow the instructions provided in [GRANT](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/account-management/GRANT) to grant the privilege to the user that you use to connect to your StarRocks cluster.
+> Reading data from StarRocks tables with the Spark connector requires the SELECT privilege. If you do not have the privilege, follow the instructions provided in [GRANT](https://docs.starrocks.io/docs/sql-reference/sql-statements/account-management/GRANT) to grant the privilege to the user that you use to connect to your StarRocks cluster.
````
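Granting the privilege happens on the StarRocks side before the Spark job runs. A minimal sketch, assuming StarRocks 3.x `GRANT` syntax; the user `spark_user` and table `mydb.mytable` are hypothetical placeholders:

```sql
-- Run against the StarRocks FE, for example through the MySQL client.
-- Grant read access on one table to the user the Spark connector logs in as.
GRANT SELECT ON TABLE mydb.mytable TO USER 'spark_user'@'%';

-- Or grant read access on every table in a database at once.
GRANT SELECT ON ALL TABLES IN DATABASE mydb TO USER 'spark_user'@'%';
```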
## docs/connector-write.md (11 additions, 11 deletions)
````diff
@@ -1,10 +1,10 @@
 # Load data using Spark connector
 
-StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you load data into a StarRocks table by using Spark. The basic principle is to accumulate the data and then load it into StarRocks all at once through [STREAM LOAD](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-manipulation/STREAM%20LOAD). The Spark connector is implemented based on Spark DataSource V2. A DataSource can be created by using Spark DataFrames or Spark SQL, and both batch and structured streaming modes are supported.
+StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you load data into a StarRocks table by using Spark. The basic principle is to accumulate the data and then load it into StarRocks all at once through [STREAM LOAD](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-manipulation/STREAM_LOAD). The Spark connector is implemented based on Spark DataSource V2. A DataSource can be created by using Spark DataFrames or Spark SQL, and both batch and structured streaming modes are supported.
 
 > **NOTICE**
 >
-> Loading data into StarRocks tables with the Spark connector requires the SELECT and INSERT privileges. If you do not have these privileges, follow the instructions provided in [GRANT](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/account-management/GRANT) to grant these privileges to the user that you use to connect to your StarRocks cluster.
+> Loading data into StarRocks tables with the Spark connector requires the SELECT and INSERT privileges. If you do not have these privileges, follow the instructions provided in [GRANT](https://docs.starrocks.io/docs/sql-reference/sql-statements/account-management/GRANT) to grant these privileges to the user that you use to connect to your StarRocks cluster.
 
 ## Version requirements
 
````
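For orientation before the option-level changes below, here is a minimal sketch of the Spark SQL path this page describes. It assumes the connector's `CREATE TABLE ... USING starrocks` DataSource syntax; the FE endpoints, the table `mydb.score_board`, and the user are hypothetical:

```sql
-- Map a StarRocks table into Spark SQL through the connector.
CREATE TABLE spark_dest
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "127.0.0.1:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
  "starrocks.table.identifier" = "mydb.score_board",
  "starrocks.user" = "spark_user",
  "starrocks.password" = ""
);

-- Rows are buffered in memory and flushed to StarRocks via STREAM LOAD.
INSERT INTO spark_dest VALUES (1, 'starrocks', 100);
```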
````diff
@@ -92,15 +92,15 @@ Directly download the corresponding version of the Spark connector JAR from the
 | starrocks.user | YES | None | The username of your StarRocks cluster account. |
 | starrocks.password | YES | None | The password of your StarRocks cluster account. |
 | starrocks.write.label.prefix | NO | spark- | The label prefix used by Stream Load. |
-| starrocks.write.enable.transaction-stream-load | NO | TRUE | Whether to use the [Stream Load transaction interface](https://docs.starrocks.io/en-us/latest/loading/Stream_Load_transaction_interface) to load data. It requires StarRocks v2.5 or later. This feature can load more data in a transaction with less memory usage, and improve performance. <br/> **NOTICE:** Since 1.1.1, this parameter takes effect only when the value of `starrocks.write.max.retries` is non-positive, because the Stream Load transaction interface does not support retry. |
+| starrocks.write.enable.transaction-stream-load | NO | TRUE | Whether to use the [Stream Load transaction interface](https://docs.starrocks.io/docs/loading/Stream_Load_transaction_interface) to load data. It requires StarRocks v2.5 or later. This feature can load more data in a transaction with less memory usage, and improve performance. <br/> **NOTICE:** Since 1.1.1, this parameter takes effect only when the value of `starrocks.write.max.retries` is non-positive, because the Stream Load transaction interface does not support retry. |
 | starrocks.write.buffer.size | NO | 104857600 | The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. Setting this parameter to a larger value can improve loading performance but may increase loading latency. |
 | starrocks.write.buffer.rows | NO | Integer.MAX_VALUE | Supported since version 1.1.1. The maximum number of rows that can be accumulated in memory before being sent to StarRocks at a time. |
 | starrocks.write.flush.interval.ms | NO | 300000 | The interval at which data is sent to StarRocks. This parameter is used to control the loading latency. |
 | starrocks.write.max.retries | NO | 3 | Supported since version 1.1.1. The number of times that the connector retries the Stream Load for the same batch of data if the load fails. <br/> **NOTICE:** Because the Stream Load transaction interface does not support retry, if this parameter is positive, the connector always uses the Stream Load interface and ignores the value of `starrocks.write.enable.transaction-stream-load`. |
 | starrocks.write.retry.interval.ms | NO | 10000 | Supported since version 1.1.1. The interval at which the connector retries the Stream Load for the same batch of data if the load fails. |
 | starrocks.columns | NO | None | The StarRocks table columns into which you want to load data. You can specify multiple columns, which must be separated by commas (,), for example, `"col0,col1,col2"`. |
 | starrocks.column.types | NO | None | Supported since version 1.1.1. Customize the column data types for Spark instead of using the defaults inferred from the StarRocks table and the [default mapping](#data-type-mapping-between-spark-and-starrocks). The parameter value is a schema in DDL format, the same as the output of Spark [StructType#toDDL](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/StructType.scala#L449), such as `col0 INT, col1 STRING, col2 BIGINT`. Note that you only need to specify columns that need customization. One use case is to load data into columns of [BITMAP](#load-data-into-columns-of-bitmap-type) or [HLL](#load-data-into-columns-of-HLL-type) type. |
-| starrocks.write.properties.* | NO | None | The parameters that are used to control Stream Load behavior. For example, the parameter `starrocks.write.properties.format` specifies the format of the data to be loaded, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-manipulation/STREAM%20LOAD). |
+| starrocks.write.properties.* | NO | None | The parameters that are used to control Stream Load behavior. For example, the parameter `starrocks.write.properties.format` specifies the format of the data to be loaded, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-manipulation/STREAM_LOAD). |
 | starrocks.write.properties.format | NO | CSV | The file format based on which the Spark connector transforms each batch of data before the data is sent to StarRocks. Valid values: CSV and JSON. |
 | starrocks.write.properties.row_delimiter | NO | \n | The row delimiter for CSV-formatted data. |
 | starrocks.write.properties.column_separator | NO | \t | The column separator for CSV-formatted data. |
````
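The buffering and retry options in this table compose as follows; a hedged sketch extending the hypothetical `spark_dest` mapping from the earlier example, with illustrative thresholds rather than recommendations:

```sql
-- Flush once 64 MB or 500,000 rows accumulate, or every 10 seconds,
-- sending each batch as JSON; retry a failed batch 5 times, 15 s apart.
-- NOTE: per the table above, a positive starrocks.write.max.retries makes
-- the connector ignore starrocks.write.enable.transaction-stream-load.
CREATE TABLE spark_dest_tuned
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "127.0.0.1:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
  "starrocks.table.identifier" = "mydb.score_board",
  "starrocks.user" = "spark_user",
  "starrocks.password" = "",
  "starrocks.write.buffer.size" = "67108864",
  "starrocks.write.buffer.rows" = "500000",
  "starrocks.write.flush.interval.ms" = "10000",
  "starrocks.write.properties.format" = "JSON",
  "starrocks.write.max.retries" = "5",
  "starrocks.write.retry.interval.ms" = "15000"
);
```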
````diff
@@ -385,7 +385,7 @@ The following example explains how to load data with Spark SQL by using the `INS
 ### Load data to primary key table
 
 This section shows how to load data into a StarRocks Primary Key table to achieve partial updates and conditional updates.
-See [Change data through loading](https://docs.starrocks.io/en-us/latest/loading/Load_to_Primary_Key_tables) for an introduction to these features.
+See [Change data through loading](https://docs.starrocks.io/docs/loading/Load_to_Primary_Key_tables) for an introduction to these features.
 These examples use Spark SQL.
 
 #### Preparations
````
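As a sketch of what the partial-update path looks like, assuming the Stream Load `partial_update` property is passed through `starrocks.write.properties.*` as described in the options table; the `mydb.score_board` table, its `id`/`name` columns, and the endpoints are hypothetical:

```sql
-- Update only the `name` column of existing rows in a Primary Key table;
-- columns not listed in starrocks.columns keep their current values.
CREATE TABLE spark_pk_partial
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "127.0.0.1:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
  "starrocks.table.identifier" = "mydb.score_board",
  "starrocks.user" = "spark_user",
  "starrocks.password" = "",
  -- The loaded columns must include the primary key.
  "starrocks.columns" = "id,name",
  -- Stream Load property enabling partial update, passed through verbatim.
  "starrocks.write.properties.partial_update" = "true"
);

INSERT INTO spark_pk_partial VALUES (1, 'new-name');
```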
````diff
@@ -517,7 +517,7 @@ takes effect only when the new value for `score` is greater than or equal to th
 
 ### Load data into columns of BITMAP type
 
-[`BITMAP`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-types/BITMAP) is often used to accelerate count distinct, such as counting UV; see [Use Bitmap for exact Count Distinct](https://docs.starrocks.io/en-us/latest/using_starrocks/Using_bitmap).
+[`BITMAP`](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-types/BITMAP) is often used to accelerate count distinct, such as counting UV; see [Use Bitmap for exact Count Distinct](https://docs.starrocks.io/docs/using_starrocks/Using_bitmap).
 
 Here we take the counting of UV as an example to show how to load data into columns of the `BITMAP` type.
 
 1. Create a StarRocks Aggregate table
````

````diff
@@ -536,7 +536,7 @@ Here we take the counting of UV as an example to show how to load data into colu
 
 3. Create a Spark table
 
-The schema of the Spark table is inferred from the StarRocks table, and Spark does not support the `BITMAP` type. So you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`to_bitmap`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/bitmap-functions/to_bitmap) function to convert the data of `BIGINT` type into `BITMAP` type.
+The schema of the Spark table is inferred from the StarRocks table, and Spark does not support the `BITMAP` type. So you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`to_bitmap`](https://docs.starrocks.io/docs/sql-reference/sql-functions/bitmap-functions/to_bitmap) function to convert the data of `BIGINT` type into `BITMAP` type.
 
 Run the following DDL in `spark-sql`:
 
````
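A hedged sketch of the Spark-side DDL this step describes, assuming a hypothetical StarRocks Aggregate table `mydb.page_uv` whose `visit_users` column is `BITMAP` with `BITMAP_UNION` aggregation:

```sql
-- Spark cannot represent BITMAP, so expose visit_users as BIGINT;
-- the connector applies to_bitmap() when it streams the batch in.
CREATE TABLE spark_uv
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "127.0.0.1:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
  "starrocks.table.identifier" = "mydb.page_uv",
  "starrocks.user" = "spark_user",
  "starrocks.password" = "",
  "starrocks.column.types" = "visit_users BIGINT"
);

-- Each BIGINT user ID becomes a one-element bitmap and is merged by
-- the Aggregate table's BITMAP_UNION.
INSERT INTO spark_uv VALUES (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 13);
```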
````diff
@@ -580,13 +580,13 @@ Here we take the counting of UV as an example to show how to load data into colu
 ```
 > **NOTICE:**
 >
-> The connector uses the [`to_bitmap`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/bitmap-functions/to_bitmap)
+> The connector uses the [`to_bitmap`](https://docs.starrocks.io/docs/sql-reference/sql-functions/bitmap-functions/to_bitmap)
 > function to convert data of the `TINYINT`, `SMALLINT`, `INTEGER`, and `BIGINT` types in Spark to the `BITMAP` type in StarRocks, and uses
 > the [`bitmap_hash`](https://docs.starrocks.io/zh-cn/latest/sql-reference/sql-functions/bitmap-functions/bitmap_hash) function for other Spark data types.
 
 ### Load data into columns of HLL type
 
-[`HLL`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-types/HLL) can be used for approximate count distinct; see [Use HLL for approximate count distinct](https://docs.starrocks.io/en-us/latest/using_starrocks/Using_HLL).
+[`HLL`](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-types/HLL) can be used for approximate count distinct; see [Use HLL for approximate count distinct](https://docs.starrocks.io/docs/using_starrocks/Using_HLL).
 
 Here we take the counting of UV as an example to show how to load data into columns of the `HLL` type. **`HLL` is supported since version 1.1.1**.
 
````
````diff
@@ -606,7 +606,7 @@ DISTRIBUTED BY HASH(`page_id`);
 
 2. Create a Spark table
 
-The schema of the Spark table is inferred from the StarRocks table, and Spark does not support the `HLL` type. So you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`hll_hash`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/aggregate-functions/hll_hash) function to convert the data of `BIGINT` type into `HLL` type.
+The schema of the Spark table is inferred from the StarRocks table, and Spark does not support the `HLL` type. So you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`hll_hash`](https://docs.starrocks.io/docs/sql-reference/sql-functions/aggregate-functions/hll_hash) function to convert the data of `BIGINT` type into `HLL` type.
 
 Run the following DDL in `spark-sql`:
 
````
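The `HLL` path mirrors the `BITMAP` one; a sketch under the same assumptions, with a hypothetical Aggregate table `mydb.hll_uv` whose `visit_users` column is `HLL` with `HLL_UNION` aggregation:

```sql
-- Expose the HLL column to Spark as BIGINT; the connector applies
-- hll_hash() during Stream Load to produce the HLL value.
CREATE TABLE spark_hll_uv
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "127.0.0.1:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
  "starrocks.table.identifier" = "mydb.hll_uv",
  "starrocks.user" = "spark_user",
  "starrocks.password" = "",
  "starrocks.column.types" = "visit_users BIGINT"
);

INSERT INTO spark_hll_uv VALUES (3, CAST('2023-07-16 22:00:00' AS TIMESTAMP), 78);
```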
````diff
@@ -651,7 +651,7 @@ DISTRIBUTED BY HASH(`page_id`);
 
 
 
-The following example explains how to load data into columns of the [`ARRAY`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-types/Array) type.
+The following example explains how to load data into columns of the [`ARRAY`](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-types/Array) type.
````
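The referenced `ARRAY` example sits outside this hunk; as a hedged sketch of the pattern, assuming the array columns' Spark-side types are spelled out with `starrocks.column.types` the way the BITMAP and HLL sections do, and a hypothetical table `mydb.array_tbl(id INT, a0 ARRAY<STRING>, a1 ARRAY<ARRAY<INT>>)`:

```sql
CREATE TABLE spark_array_tbl
USING starrocks
OPTIONS (
  "starrocks.fe.http.url" = "127.0.0.1:8030",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
  "starrocks.table.identifier" = "mydb.array_tbl",
  "starrocks.user" = "spark_user",
  "starrocks.password" = "",
  -- Declare nested array types explicitly rather than relying on inference.
  "starrocks.column.types" = "a0 ARRAY<STRING>, a1 ARRAY<ARRAY<INT>>"
);

INSERT INTO spark_array_tbl
VALUES (1, array('hello', 'starrocks'), array(array(1, 2), array(3, 4)));
```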