add(docs): repair, expand example on calculating max intensity/parallel
Fixes #3576
Michal-Leszczynski committed Sep 26, 2023
1 parent bcd9b3f commit 298144f
Showing 6 changed files with 50 additions and 26 deletions.
30 changes: 28 additions & 2 deletions docs/source/repair/index.rst
@@ -42,7 +42,6 @@ Parallel repairs

Each node can take part in at most one Scylla repair job at any given moment, but Scylla Manager can repair distinct replica sets in a token ring in parallel.
This is beneficial for big clusters.
For example, a 9 node cluster and a keyspace with replication factor 3, can be repaired up to 3 times faster in parallel.
The following diagram presents benchmark results comparing different parallel flag values.
In a benchmark we ran 9 Scylla 2020.1 nodes on AWS i3.2xlarge machines under 50% load; for details, see `this blog post <https://www.scylladb.com/2020/11/12/scylla-manager-2-2-repair-revisited/>`_.

@@ -51,20 +50,47 @@ In a benchmark we ran 9 Scylla 2020.1 nodes on AWS i3.2xlarge machines under 50%

By default, Scylla Manager runs repairs with full parallelism. You can change that using :ref:`sctool repair --parallel flag <sctool-repair>`.

Maximal effective parallelism
=============================

Max parallelism is determined by:

* the constraint that each node can take part in only one ScyllaDB repair job at any given moment.
* the fact that each ScyllaDB repair job targets the full replica set of the repaired token range.

For example, let's assume a cluster with 2 datacenters, 5 nodes each.
When you repair the keyspace ``my_keyspace with replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2, 'dc2': 3}``,
max parallelism is equal to ``1``, because each ScyllaDB repair job targets a full replica set of the repaired token range.
Every replica set consists of 2 nodes from ``dc1`` and 3 nodes from ``dc2``,
so it's impossible to schedule 2 repair jobs to run simultaneously (``dc2`` lacks one more node for it to be possible).

Repair is performed table by table and keyspace by keyspace,
so max effective parallelism might change depending on which keyspace is being repaired.
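
As a rough illustration of the reasoning above, the following Python sketch estimates an upper bound on parallelism for a keyspace using ``NetworkTopologyStrategy``. The function name and its inputs are hypothetical, and this is not Scylla Manager's actual implementation. It only assumes that concurrent repair jobs need disjoint replica sets and that every replica set takes ``RF`` nodes from each datacenter. The value reached in practice also depends on how token ranges map to replicas.

```python
def max_parallelism_upper_bound(nodes_per_dc, replication):
    """Estimate how many repair jobs could run at once (hypothetical helper).

    nodes_per_dc: mapping of datacenter name -> number of nodes in it.
    replication:  mapping of datacenter name -> replication factor (RF).
    """
    limits = []
    for dc, rf in replication.items():
        nodes = nodes_per_dc.get(dc, 0)
        # A replica set cannot contain more nodes than the datacenter has.
        effective_rf = min(rf, nodes)
        if effective_rf == 0:
            # This datacenter holds no replicas, so it imposes no limit.
            continue
        # Each concurrent job occupies effective_rf distinct nodes in this DC.
        limits.append(nodes // effective_rf)
    # The most constrained datacenter bounds the whole cluster.
    return min(limits) if limits else 0


# The example from the text: 2 datacenters with 5 nodes each, RF 2 and 3.
print(max_parallelism_upper_bound({"dc1": 5, "dc2": 5}, {"dc1": 2, "dc2": 3}))  # 1
# A single-datacenter cluster with 9 nodes and RF 3.
print(max_parallelism_upper_bound({"dc1": 9}, {"dc1": 3}))  # 3
```

In the first call, ``dc2`` is the bottleneck: 5 nodes with RF 3 leave room for only one replica set at a time, even though ``dc1`` could host two.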

Repair intensity
================

Intensity specifies how many token ranges can be repaired in a Scylla node at any given time.
The default intensity is one. You can change that using :ref:`sctool repair --intensity flag <sctool-repair>`.

Scylla Manager 2.2 adds support for intensity value zero.
In that case the number of token ranges is calculated based on node memory and adjusted to the Scylla maximal number of ranges that can be repaired in parallel (see ``max_repair_ranges_in_parallel`` in Scylla logs).
In that case, the number of token ranges is calculated based on node memory and adjusted to ScyllaDB's maximum number of ranges that can be repaired in parallel.
If you want to repair faster, try using intensity zero.

Note that the less loaded the cluster is, the more it makes sense to increase intensity.
Increasing intensity on a loaded cluster may not speed up the repair, because the cluster has no spare resources to process more repairs.
In our experiments on a 50% loaded cluster, increasing intensity from 1 to 2 gave about a 10-20% boost, and increasing it further had little impact.

Maximal effective intensity
===========================

Max intensity is calculated based on the ``max_repair_ranges_in_parallel`` value (present in ScyllaDB logs).
This value might be different for each node in the cluster.

As each ScyllaDB repair job targets some subset of all nodes and
ScyllaDB Manager avoids repairing more than ``max_repair_ranges_in_parallel`` token ranges at once on any node,
the max effective intensity for a given repair job is equal to the **minimum** ``max_repair_ranges_in_parallel``
value of nodes taking part in the job.
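
The rule above can be sketched in Python as follows. The function, the node names, and the per-node values are hypothetical and serve only as an illustration. The sketch assumes the behavior described in the flag help text: intensity zero means "use the maximum", and a requested value above the maximum is capped.

```python
def effective_intensity(requested, max_ranges_by_node, job_nodes):
    """Return the intensity actually usable by one repair job (hypothetical helper).

    requested:          the value passed as --intensity (0 means "maximum").
    max_ranges_by_node: mapping of node -> its max_repair_ranges_in_parallel.
    job_nodes:          the nodes (replica set) taking part in the job.
    """
    # The job is limited by its most constrained participant.
    max_effective = min(max_ranges_by_node[node] for node in job_nodes)
    if requested == 0:
        return max_effective
    # Requests above the maximum are capped at that maximum.
    return min(requested, max_effective)


# Illustrative values only; real ones come from the ScyllaDB logs of each node.
limits = {"node1": 7, "node2": 3, "node3": 5}
print(effective_intensity(0, limits, ["node1", "node2", "node3"]))  # 3
print(effective_intensity(6, limits, ["node1", "node3"]))           # 5
print(effective_intensity(2, limits, ["node1", "node2"]))           # 2
```

Because the minimum is taken per job, two jobs in the same repair task can end up with different effective intensities when their replica sets differ.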

Changing repair speed
=====================

6 changes: 2 additions & 4 deletions docs/source/sctool/partials/sctool_repair.yaml
@@ -58,7 +58,7 @@ options:
default_value: "1"
usage: |
How many token ranges to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
See effectively used intensity value in the display of 'sctool progress repair' command.
@@ -107,9 +107,7 @@ options:
usage: |
The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
The formula to calculate is is as follows: number of nodes / RF.
For example, the maximum parallelism for a 6 node cluster with RF=3 is 2.
The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
See effectively used parallel value in the display of 'sctool progress repair' command.
- name: retry-wait
14 changes: 8 additions & 6 deletions docs/source/sctool/partials/sctool_repair_control.yaml
@@ -20,17 +20,19 @@ options:
- name: intensity
default_value: "1"
usage: |
How many token ranges per shard to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
How many token ranges to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
See effectively used intensity value in the display of 'sctool progress repair' command.
- name: parallel
default_value: "0"
usage: |
The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
Each node can take part in at most one repair at any given moment.
By default the maximum possible parallelism is used.
The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
The formula to calculate is is as follows: number of nodes / RF, ex. for 6 node cluster with RF=3 the maximum parallelism is 2.
Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
See effectively used parallel value in the display of 'sctool progress repair' command.
inherited_options:
- name: api-cert-file
usage: |
6 changes: 2 additions & 4 deletions docs/source/sctool/partials/sctool_repair_update.yaml
@@ -59,7 +59,7 @@ options:
default_value: "1"
usage: |
How many token ranges to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
See effectively used intensity value in the display of 'sctool progress repair' command.
@@ -108,9 +108,7 @@ options:
usage: |
The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
The formula to calculate is is as follows: number of nodes / RF.
For example, the maximum parallelism for a 6 node cluster with RF=3 is 2.
The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
See effectively used parallel value in the display of 'sctool progress repair' command.
- name: retry-wait
14 changes: 8 additions & 6 deletions pkg/command/repair/repaircontrol/res.yaml
@@ -11,13 +11,15 @@ long: |
For modifying future repair task runs see 'sctool repair update' command.
intensity: |
How many token ranges per shard to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
How many token ranges to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
See effectively used intensity value in the display of 'sctool progress repair' command.
parallel: |
The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
Each node can take part in at most one repair at any given moment.
By default the maximum possible parallelism is used.
The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
The formula to calculate is is as follows: number of nodes / RF, ex. for 6 node cluster with RF=3 the maximum parallelism is 2.
Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
See effectively used parallel value in the display of 'sctool progress repair' command.
6 changes: 2 additions & 4 deletions pkg/command/repair/res.yaml
@@ -17,17 +17,15 @@ ignore-down-hosts: |
intensity: |
How many token ranges to repair in a single Scylla node at the same time.
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
See effectively used intensity value in the display of 'sctool progress repair' command.
parallel: |
The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
The formula to calculate is is as follows: number of nodes / RF.
For example, the maximum parallelism for a 6 node cluster with RF=3 is 2.
The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
See effectively used parallel value in the display of 'sctool progress repair' command.