[MINOR] Fix broken urls
Signed-off-by: Zhen Wang <[email protected]>
wForget committed May 13, 2024
1 parent 846f1e4 commit 02dddd5
Showing 3 changed files with 14 additions and 14 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -5,8 +5,8 @@ The connector supports reading from and writing to StarRocks through Apache Spark
## Documentation

For the user manual of the released version of the Spark connector, please visit the StarRocks official documentation.
- * [Read data from StarRocks using Spark connector](https://docs.starrocks.io/en-us/latest/loading/Spark-connector-starrocks)
- * [Load data using Spark connector](https://docs.starrocks.io/en-us/latest/unloading/Spark_connector)
+ * [Read data from StarRocks using Spark connector](https://docs.starrocks.io/docs/loading/Spark-connector-starrocks)
+ * [Load data using Spark connector](https://docs.starrocks.io/docs/unloading/Spark_connector)

For the new features in the snapshot version of the Spark connector, please see the docs in this repo.
* [Read from StarRocks](docs/connector-read.md)
2 changes: 1 addition & 1 deletion docs/connector-read.md
@@ -10,7 +10,7 @@ You can also map the StarRocks table to a Spark DataFrame or a Spark RDD, and th

> **NOTICE**
>
- > Reading data from StarRocks tables with the Spark connector requires the SELECT privilege. If you do not have the privilege, follow the instructions provided in [GRANT](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/account-management/GRANT) to grant the privilege to the user that you use to connect to your StarRocks cluster.
+ > Reading data from StarRocks tables with the Spark connector requires the SELECT privilege. If you do not have the privilege, follow the instructions provided in [GRANT](https://docs.starrocks.io/docs/sql-reference/sql-statements/account-management/GRANT) to grant the privilege to the user that you use to connect to your StarRocks cluster.
## Usage notes

22 changes: 11 additions & 11 deletions docs/connector-write.md
@@ -1,10 +1,10 @@
# Load data using Spark connector

- StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you load data into a StarRocks table by using Spark. The basic principle is to accumulate the data and then load it all at once into StarRocks through [STREAM LOAD](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-manipulation/STREAM%20LOAD). The Spark connector is implemented based on Spark DataSource V2. A DataSource can be created using Spark DataFrames or Spark SQL, and both batch and structured streaming modes are supported.
+ StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you load data into a StarRocks table by using Spark. The basic principle is to accumulate the data and then load it all at once into StarRocks through [STREAM LOAD](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-manipulation/STREAM_LOAD). The Spark connector is implemented based on Spark DataSource V2. A DataSource can be created using Spark DataFrames or Spark SQL, and both batch and structured streaming modes are supported.
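
To make the loading flow concrete, here is a minimal, hedged sketch of a batch write with the DataFrame API. The connection options (`starrocks.fe.http.url`, `starrocks.fe.jdbc.url`, `starrocks.table.identifier`) come from the connector's full option table, which this hunk truncates; all addresses and the `test.score_board` table are placeholders, not part of this commit.

```scala
import org.apache.spark.sql.SparkSession

object StarRocksBatchWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("starrocks-batch-write").getOrCreate()
    import spark.implicits._

    // Two placeholder rows to load.
    val df = Seq((1, "starrocks"), (2, "spark")).toDF("id", "name")

    df.write
      .format("starrocks")                                            // DataSource V2 short name
      .option("starrocks.fe.http.url", "127.0.0.1:8030")              // placeholder FE HTTP address
      .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") // placeholder FE query address
      .option("starrocks.table.identifier", "test.score_board")       // placeholder database.table
      .option("starrocks.user", "root")
      .option("starrocks.password", "")
      .mode("append")
      .save()

    spark.stop()
  }
}
```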

> **NOTICE**
>
- > Loading data into StarRocks tables with the Spark connector requires the SELECT and INSERT privileges. If you do not have these privileges, follow the instructions provided in [GRANT](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/account-management/GRANT) to grant these privileges to the user that you use to connect to your StarRocks cluster.
+ > Loading data into StarRocks tables with the Spark connector requires the SELECT and INSERT privileges. If you do not have these privileges, follow the instructions provided in [GRANT](https://docs.starrocks.io/docs/sql-reference/sql-statements/account-management/GRANT) to grant these privileges to the user that you use to connect to your StarRocks cluster.
## Version requirements

@@ -92,15 +92,15 @@ Directly download the corresponding version of the Spark connector JAR from the
| starrocks.user | YES | None | The username of your StarRocks cluster account. |
| starrocks.password | YES | None | The password of your StarRocks cluster account. |
| starrocks.write.label.prefix | NO | spark- | The label prefix used by Stream Load. |
- | starrocks.write.enable.transaction-stream-load | NO | TRUE | Whether to use the [Stream Load transaction interface](https://docs.starrocks.io/en-us/latest/loading/Stream_Load_transaction_interface) to load data. It requires StarRocks v2.5 or later. This feature can load more data in a transaction with less memory usage, and improve performance. <br/> **NOTICE:** Since 1.1.1, this parameter takes effect only when the value of `starrocks.write.max.retries` is non-positive because the Stream Load transaction interface does not support retry. |
+ | starrocks.write.enable.transaction-stream-load | NO | TRUE | Whether to use the [Stream Load transaction interface](https://docs.starrocks.io/docs/loading/Stream_Load_transaction_interface) to load data. It requires StarRocks v2.5 or later. This feature can load more data in a transaction with less memory usage, and improve performance. <br/> **NOTICE:** Since 1.1.1, this parameter takes effect only when the value of `starrocks.write.max.retries` is non-positive because the Stream Load transaction interface does not support retry. |
| starrocks.write.buffer.size | NO | 104857600 | The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. Setting this parameter to a larger value can improve loading performance but may increase loading latency. |
| starrocks.write.buffer.rows | NO | Integer.MAX_VALUE | Supported since version 1.1.1. The maximum number of rows that can be accumulated in memory before being sent to StarRocks at a time. |
| starrocks.write.flush.interval.ms | NO | 300000 | The interval at which data is sent to StarRocks. This parameter is used to control the loading latency. |
| starrocks.write.max.retries | NO | 3 | Supported since version 1.1.1. The number of times that the connector retries the Stream Load for the same batch of data if the load fails. <br/> **NOTICE:** Because the Stream Load transaction interface does not support retry, if this parameter is positive, the connector always uses the Stream Load interface and ignores the value of `starrocks.write.enable.transaction-stream-load`. |
| starrocks.write.retry.interval.ms | NO | 10000 | Supported since version 1.1.1. The interval to retry the Stream Load for the same batch of data if the load fails. |
| starrocks.columns | NO | None | The StarRocks table columns into which you want to load data. You can specify multiple columns, which must be separated by commas (,), for example, `"col0,col1,col2"`. |
| starrocks.column.types | NO | None | Supported since version 1.1.1. Customize the column data types for Spark instead of using the defaults inferred from the StarRocks table and the [default mapping](#data-type-mapping-between-spark-and-starrocks). The parameter value is a schema in DDL format, the same as the output of Spark [StructType#toDDL](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/StructType.scala#L449), such as `col0 INT, col1 STRING, col2 BIGINT`. Note that you only need to specify the columns that need customization. One use case is to load data into columns of [BITMAP](#load-data-into-columns-of-bitmap-type) or [HLL](#load-data-into-columns-of-HLL-type) type. |
- | starrocks.write.properties.* | NO | None | The parameters that are used to control Stream Load behavior. For example, the parameter `starrocks.write.properties.format` specifies the format of the data to be loaded, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-manipulation/STREAM%20LOAD). |
+ | starrocks.write.properties.* | NO | None | The parameters that are used to control Stream Load behavior. For example, the parameter `starrocks.write.properties.format` specifies the format of the data to be loaded, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-manipulation/STREAM_LOAD). |
| starrocks.write.properties.format | NO | CSV | The file format based on which the Spark connector transforms each batch of data before the data is sent to StarRocks. Valid values: CSV and JSON. |
| starrocks.write.properties.row_delimiter | NO | \n | The row delimiter for CSV-formatted data. |
| starrocks.write.properties.column_separator | NO | \t | The column separator for CSV-formatted data. |
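
As a hedged illustration of how the buffering, retry, and `starrocks.write.properties.*` options above combine, here is a sketch that continues the `df` and connection options from the earlier sketch; all values are arbitrary.

```scala
// Sketch only: arbitrary tuning values; see the option table above for defaults.
df.write
  .format("starrocks")
  // ...connection options as in the earlier sketch...
  .option("starrocks.write.buffer.size", "67108864")    // flush after ~64 MB instead of the 100 MB default
  .option("starrocks.write.flush.interval.ms", "60000") // or at least once per minute
  .option("starrocks.write.max.retries", "3")           // positive retries disable transaction stream load
  .option("starrocks.write.properties.format", "json")  // pass format=json through to Stream Load
  .mode("append")
  .save()
```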
@@ -385,7 +385,7 @@ The following example explains how to load data with Spark SQL by using the `INS
### Load data to primary key table

This section shows how to load data into a StarRocks Primary Key table to achieve partial updates and conditional updates.
- See [Change data through loading](https://docs.starrocks.io/en-us/latest/loading/Load_to_Primary_Key_tables) for an introduction to these features.
+ See [Change data through loading](https://docs.starrocks.io/docs/loading/Load_to_Primary_Key_tables) for an introduction to these features.
These examples use Spark SQL.
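
Before the full Spark SQL walkthrough, here is a hedged DataFrame-API sketch of both features. `partial_update` and `merge_condition` are Stream Load properties passed through `starrocks.write.properties.*`; the table and column names are placeholders, and `df` continues from the earlier sketch.

```scala
// Partial update sketch: write only "id" and "name"; other columns keep their values.
// Assumes "test.score_board" (a placeholder) is a Primary Key table.
df.select("id", "name").write
  .format("starrocks")
  // ...connection options as in the earlier sketch...
  .option("starrocks.columns", "id,name")                      // the columns being written
  .option("starrocks.write.properties.partial_update", "true") // Stream Load partial-update mode
  .mode("append")
  .save()

// Conditional update sketch: a row takes effect only if its new "score"
// is greater than or equal to the existing value.
df.write
  .format("starrocks")
  // ...connection options as in the earlier sketch...
  .option("starrocks.write.properties.merge_condition", "score")
  .mode("append")
  .save()
```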

#### Preparations
@@ -517,7 +517,7 @@ takes effect only when the new value for `score` is greater than or equal to th

### Load data into columns of BITMAP type

- [`BITMAP`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-types/BITMAP) is often used to accelerate count distinct, such as counting UV; see [Use Bitmap for exact Count Distinct](https://docs.starrocks.io/en-us/latest/using_starrocks/Using_bitmap).
+ [`BITMAP`](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-types/BITMAP) is often used to accelerate count distinct, such as counting UV; see [Use Bitmap for exact Count Distinct](https://docs.starrocks.io/docs/using_starrocks/Using_bitmap).
Here we take the counting of UV as an example to show how to load data into columns of the `BITMAP` type.

1. Create a StarRocks Aggregate table
@@ -536,7 +536,7 @@ Here we take the counting of UV as an example to show how to load data into colu

3. Create a Spark table

- The schema of the Spark table is inferred from the StarRocks table, but Spark does not support the `BITMAP` type, so you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`to_bitmap`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/bitmap-functions/to_bitmap) function to convert data of the `BIGINT` type into the `BITMAP` type.
+ The schema of the Spark table is inferred from the StarRocks table, but Spark does not support the `BITMAP` type, so you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`to_bitmap`](https://docs.starrocks.io/docs/sql-reference/sql-functions/bitmap-functions/to_bitmap) function to convert data of the `BIGINT` type into the `BITMAP` type.
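
Separate from the spark-sql DDL referenced below, the same type override can be expressed with the DataFrame API. This is a hedged sketch: `test.page_uv` is a placeholder table name, and `df` continues from the earlier sketch.

```scala
// Sketch: expose the BITMAP column as BIGINT on the Spark side; the connector
// applies to_bitmap() during Stream Load to produce BITMAP values.
df.write
  .format("starrocks")
  // ...connection options as in the earlier sketch...
  .option("starrocks.table.identifier", "test.page_uv")   // placeholder Aggregate table
  .option("starrocks.column.types", "visit_users BIGINT") // BITMAP column mapped to BIGINT
  .mode("append")
  .save()
```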

Run the following DDL in `spark-sql`:

@@ -580,13 +580,13 @@ Here we take the counting of UV as an example to show how to load data into colu
```
> **NOTICE:**
>
- > The connector uses [`to_bitmap`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/bitmap-functions/to_bitmap)
+ > The connector uses [`to_bitmap`](https://docs.starrocks.io/docs/sql-reference/sql-functions/bitmap-functions/to_bitmap)
> function to convert data of the `TINYINT`, `SMALLINT`, `INTEGER`, and `BIGINT` types in Spark to the `BITMAP` type in StarRocks, and uses
> [`bitmap_hash`](https://docs.starrocks.io/zh-cn/latest/sql-reference/sql-functions/bitmap-functions/bitmap_hash) function for other Spark data types.

### Load data into columns of HLL type

- [`HLL`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-types/HLL) can be used for approximate count distinct; see [Use HLL for approximate count distinct](https://docs.starrocks.io/en-us/latest/using_starrocks/Using_HLL).
+ [`HLL`](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-types/HLL) can be used for approximate count distinct; see [Use HLL for approximate count distinct](https://docs.starrocks.io/docs/using_starrocks/Using_HLL).

Here we take the counting of UV as an example to show how to load data into columns of the `HLL` type. **`HLL` is supported since version 1.1.1**.

@@ -606,7 +606,7 @@ DISTRIBUTED BY HASH(`page_id`);

2. Create a Spark table

- The schema of the Spark table is inferred from the StarRocks table, but Spark does not support the `HLL` type, so you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`hll_hash`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/aggregate-functions/hll_hash) function to convert data of the `BIGINT` type into the `HLL` type.
+ The schema of the Spark table is inferred from the StarRocks table, but Spark does not support the `HLL` type, so you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`hll_hash`](https://docs.starrocks.io/docs/sql-reference/sql-functions/aggregate-functions/hll_hash) function to convert data of the `BIGINT` type into the `HLL` type.
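
Again separate from the spark-sql DDL referenced below, a hedged DataFrame-API sketch of the same override; `test.hll_uv` is a placeholder table name, and `df` continues from the earlier sketch.

```scala
// Sketch: expose the HLL column as BIGINT on the Spark side; the connector
// applies hll_hash() during Stream Load to produce HLL values.
df.write
  .format("starrocks")
  // ...connection options as in the earlier sketch...
  .option("starrocks.table.identifier", "test.hll_uv")    // placeholder Aggregate table
  .option("starrocks.column.types", "visit_users BIGINT") // HLL column mapped to BIGINT
  .mode("append")
  .save()
```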

Run the following DDL in `spark-sql`:

@@ -651,7 +651,7 @@ DISTRIBUTED BY HASH(`page_id`);



- The following example explains how to load data into columns of the [`ARRAY`](https://docs.starrocks.io/en-us/latest/sql-reference/sql-statements/data-types/Array) type.
+ The following example explains how to load data into columns of the [`ARRAY`](https://docs.starrocks.io/docs/sql-reference/sql-statements/data-types/Array) type.

1. Create a StarRocks table

