Designing StarRocks for High Availability (WIP) #26600
The StarRocks architecture diagram is shown above. Its high availability mechanism can be briefly introduced from the following two aspects:
Metadata High Availability
Overview
FE nodes are divided into three roles: Leader, Follower, and Observer.
StarRocks uses a replicated state machine model to keep metadata highly available. Nodes first elect a leader through a Paxos-based protocol, and all write operations are initiated by the leader (if the node a user connects to is not the leader, the write request is forwarded to the leader internally). When the leader executes a write, it first writes a log (WAL) locally and then synchronizes it to the other nodes through the replication protocol. Once a majority of nodes have confirmed receipt of the log, the leader returns success to the client.
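To see which role each FE node currently has and which one is the leader, a quick check is SHOW FRONTENDS (column names may vary slightly between versions):

```sql
-- Lists all FE nodes with their role (leader / follower / observer) and liveness.
SHOW FRONTENDS;
```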
Leader election
The election process is roughly divided into two stages (the Paxos protocol itself is quite complex; this is only a brief sketch):
1. A candidate node sends a prepare (propose) request to the other nodes and collects their responses.
2. After confirming itself as the prospective leader, it sends an accept request to the nodes that responded successfully in the first stage; if a majority of them respond successfully, the election succeeds.
One point worth noting: the leader can be transferred manually to a specified node with the DbGroupAdmin tool shipped with BDB JE, for example:
java -jar fe/lib/je-7.3.7.jar DbGroupAdmin -helperHosts {ip:edit_log_port} -groupName PALO_JOURNAL_GROUP -transferMaster -force {node_name} 5000
Log replication
Each write generates a log entry, identified by a monotonically increasing integer. The Leader node synchronizes the logs to the Follower nodes in order (Observer nodes pull the logs from the Leader themselves). A write is considered successful only after the log has been synchronized to a majority of the Followers.
Checkpoint
When an FE node restarts, it needs to replay the logs to recover historical metadata. As the system runs, more and more logs accumulate, so restart recovery takes longer and longer. Checkpoints are used to compact multiple logs: the checkpoint is performed by the leader node, and the generated image file is then pushed to the other nodes. Once all nodes have received the image file, the leader deletes the logs that precede it. A checkpoint is triggered after a certain number of logs have been generated; this number is controlled by edit_log_roll_num.
Related parameters
FE
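As an illustration, a sketch of inspecting and adjusting edit_log_roll_num at runtime (the value 50000 is only an example; a change made this way applies to the running FE and is not persisted unless also written to fe.conf):

```sql
-- Check the current checkpoint-trigger threshold on the FE.
ADMIN SHOW FRONTEND CONFIG LIKE "edit_log_roll_num";

-- Adjust it dynamically for the running FE process.
ADMIN SET FRONTEND CONFIG ("edit_log_roll_num" = "50000");
```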
Application Data High Availability
Overview
StarRocks uses multi-replica technology to keep data highly available after it has been written. Specifically, when creating a table, users determine how many replicas the table should have through the replication_num property. When an import transaction is initiated, data is written to multiple replicas and persisted before the import is reported successful to the user. StarRocks also lets users trade availability against performance: they can prioritize performance, meaning success is returned after fewer replicas have been written, or prioritize availability.
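For example, a minimal sketch of setting the replica count at table creation time (the database, table, and columns here are hypothetical):

```sql
-- Each tablet of this table keeps 3 replicas, placed on different BEs.
CREATE TABLE example_db.example_tbl (
    id BIGINT,
    dt DATE,
    v  INT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES (
    "replication_num" = "3"
);
```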
StarRocks ensures that the replicas of a table are distributed across different hosts. When some replicas become abnormal, for example when a machine goes down or a network failure causes a write to fail, StarRocks automatically detects and repairs them by cloning part or all of the data from healthy replicas on other hosts. StarRocks uses a multi-version (MVCC) approach to improve import efficiency, and the clone process can use these multi-version data files for physical replication, which makes version repair efficient.
Multi-replica write
Taking the Stream Load in the above figure as an example, an import is divided into the following stages.
Other import methods follow a similar process: data becomes visible externally only after it has been written to multiple replicas and persisted. Writing to multiple replicas ensures that query services can still be provided in extreme scenarios such as machine downtime or network failures.
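As a rough way to observe this (assuming the hypothetical example_tbl above), something like the following lists every tablet of the table together with the BEs and versions of its replicas after an import:

```sql
-- Shows each tablet of the table and the version/state of each of its replicas.
SHOW TABLET FROM example_db.example_tbl;
```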
Of course, in some scenarios users do not have such strict availability requirements for the data and care more about import performance. StarRocks supports specifying the durability level of an import through the write_quorum table property, that is, how many replicas must succeed before the import is reported successful. This property is supported since version 2.5.
The valid values of write_quorum and their descriptions are as follows:
- MAJORITY (default): the import succeeds once a majority of replicas return success; otherwise it fails.
- ONE: the import succeeds once one replica returns success; otherwise it fails.
- ALL: the import succeeds only when all replicas return success; otherwise it fails.
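A minimal sketch of a table that prioritizes import performance over durability (hypothetical table; with 3 replicas, success is returned once one replica has persisted the data):

```sql
CREATE TABLE example_db.perf_first_tbl (
    id BIGINT,
    v  INT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES (
    "replication_num" = "3",
    -- Return success after a single replica succeeds; lagging replicas are
    -- caught up later (or repaired by clone if they fall too far behind).
    "write_quorum" = "ONE"
);
```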
Automatic replica repair
When some replicas are abnormal, for example when an import failure leaves a replica with incomplete versions, a machine crash causes a replica to be lost, or a machine is taken offline and new replicas need to be supplemented, StarRocks automatically repairs the abnormal replicas.
The FE scans the tablet replicas of all tables every 20 seconds and determines whether each replica is healthy based on the current visible version number and the health of its BE. If a replica's version number is lower than the visible version number, the replica needs to be repaired through an incremental clone. If the heartbeat of the BE hosting the replica is lost, or the BE has been manually marked offline, the replica needs to be repaired through a full clone.
When the FE detects a replica that needs repair, it generates a scheduling task and adds it to the scheduling queue. The scheduler (TabletScheduler) takes tasks from the queue, creates a clone task according to the task type, and dispatches it to the corresponding BE for execution. For an incomplete replica, the FE sends the clone task directly to the BE hosting that replica and tells it to clone the missing data from a BE holding a healthy replica; the subsequent clone process runs on the BE side. For a missing replica, the FE selects a new BE, creates an empty replica on it, and issues a full clone task to the BE hosting the empty replica.
Whether the clone is incremental or full, the BE performs a physical copy: it fetches the required physical files directly from the healthy replica and updates its own metadata. After the clone completes, the BE reports its status to the FE scheduler, and the FE cleans up and updates its metadata. At that point the replica repair is complete; the overall process is as follows.
During replica repair, queries can still be served as long as a healthy replica remains. In addition, when one of three replicas is unhealthy, imports can also proceed normally under the default write_quorum of MAJORITY.
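To check whether any replicas of a table currently need repair (again assuming the hypothetical example_tbl), something like the following can be used:

```sql
-- Lists replicas whose status is not OK, e.g. version gaps or missing replicas.
ADMIN SHOW REPLICA STATUS FROM example_db.example_tbl
WHERE STATUS != "OK";
```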
Related parameters
FE
Backup & restore
StarRocks currently supports database-level and table-level backups. If a daily backup is needed, one option is to partition the table by date and back up a single partition each day.
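A minimal sketch of backing up one date partition (the repository example_repo is assumed to have been created beforehand with CREATE REPOSITORY; the database, table, partition, and snapshot label names are hypothetical):

```sql
-- Back up a single partition of a table to a previously created remote repository.
BACKUP SNAPSHOT example_db.snapshot_20230601
TO example_repo
ON (example_tbl PARTITION (p20230601));
```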
Nodes:
Load Balance all connections:
Data Replication and Tablets:
Recommendation: