From 298144f2642a5da56d8f7ac8f4df8af8399c75e0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Leszczy=C5=84ski?= <2000michal@wp.pl>
Date: Sun, 24 Sep 2023 21:33:31 +0200
Subject: [PATCH] add(docs): repair, expand example on calculating max intensity/parallel

Fixes #3576
---
 docs/source/repair/index.rst                  | 30 +++++++++++++++++--
 .../source/sctool/partials/sctool_repair.yaml |  6 ++--
 .../partials/sctool_repair_control.yaml       | 14 +++++----
 .../sctool/partials/sctool_repair_update.yaml |  6 ++--
 pkg/command/repair/repaircontrol/res.yaml     | 14 +++++----
 pkg/command/repair/res.yaml                   |  6 ++--
 6 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/docs/source/repair/index.rst b/docs/source/repair/index.rst
index eab0f7e498..2bc9c43579 100644
--- a/docs/source/repair/index.rst
+++ b/docs/source/repair/index.rst
@@ -42,7 +42,6 @@ Parallel repairs
 
 Each node can take part in at most one Scylla repair job at any given moment, but Scylla Manager can repair distinct replica sets in a token ring in parallel.
 This is beneficial for big clusters.
-For example, a 9 node cluster and a keyspace with replication factor 3, can be repaired up to 3 times faster in parallel.
 
 The following diagram presents a benchmark results comparing different parallel flag values.
 In a benchmark we ran 9 Scylla 2020.1 nodes on AWS i3.2xlarge machines under 50% load, for details check `this blog post `_
@@ -51,6 +50,22 @@ In a benchmark we ran 9 Scylla 2020.1 nodes on AWS i3.2xlarge machines under 50% load, for details check `this blog post `_
 
 By default Scylla Manager runs repairs with full parallelism, you can change that using :ref:`sctool repair --parallel flag `.
 
+Maximal effective parallelism
+=============================
+
+Max parallelism is determined by:
+ * the constraint that each node can only take part in one ScyllaDB repair job at any given moment.
+ * ScyllaDB repair job targeting the full replica set of the repaired token range.
+
+For example, let's assume a cluster with 2 datacenters, 5 nodes each.
+When you repair the keyspace ``my_keyspace with replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2, 'dc2': 3}``,
+max parallelism is equal to ``1``, because each ScyllaDB repair job targets a full replica set of the repaired token range.
+Every replica set consists of 2 nodes from ``dc1`` and 3 nodes from ``dc2``,
+so it's impossible to schedule 2 repair jobs to run simultaneously (``dc2`` lacks one more node for it to be possible).
+
+Repair is performed table by table and keyspace by keyspace,
+so max effective parallelism might change depending on which keyspace is being repaired.
+
 Repair intensity
 ================
 
@@ -58,13 +73,24 @@ Intensity specifies how many token ranges can be repaired in a Scylla node at ev
 The default intensity is one, you can change that using :ref:`sctool repair --intensity flag `.
 
 Scylla Manager 2.2 adds support for intensity value zero.
-In that case the number of token ranges is calculated based on node memory and adjusted to the Scylla maximal number of ranges that can be repaired in parallel (see ``max_repair_ranges_in_parallel`` in Scylla logs).
+In that case, the number of token ranges is calculated based on node memory and adjusted to ScyllaDB's maximum number of ranges that can be repaired in parallel.
 If you want to repair faster, try using intensity zero.
 
 Note that the less the cluster is loaded the more it makes sense to increase intensity.
 If you increase intensity on a loaded cluster it may not give speed benefits since cluster have no resources to process more repairs.
 In our experiments in a 50% loaded cluster increasing intensity from 1 to 2 gives about 10-20% boost and increasing it further will have little impact.
 
+Maximal effective intensity
+===========================
+
+Max intensity is calculated based on the ``max_repair_ranges_in_parallel`` value (present in ScyllaDB logs).
+This value might be different for each node in the cluster.
+
+As each ScyllaDB repair job targets some subset of all nodes and
+ScyllaDB Manager avoids repairing more than ``max_repair_ranges_in_parallel`` on any node,
+the max effective intensity for a given repair job is equal to the **minimum** ``max_repair_ranges_in_parallel``
+value of nodes taking part in the job.
+
 Changing repair speed
 =====================
 
diff --git a/docs/source/sctool/partials/sctool_repair.yaml b/docs/source/sctool/partials/sctool_repair.yaml
index a232b5b059..e6944f07b1 100644
--- a/docs/source/sctool/partials/sctool_repair.yaml
+++ b/docs/source/sctool/partials/sctool_repair.yaml
@@ -58,7 +58,7 @@ options:
   default_value: "1"
   usage: |
     How many token ranges to repair in a single Scylla node at the same time.
-    Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
+    Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
     Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
     If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
     See effectively used intensity value in the display of 'sctool progress repair' command.
@@ -107,9 +107,7 @@ options:
   usage: |
     The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
     Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
-    The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
-    The formula to calculate is is as follows: number of nodes / RF.
-    For example, the maximum parallelism for a 6 node cluster with RF=3 is 2.
+    The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
     If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
     See effectively used parallel value in the display of 'sctool progress repair' command.
 - name: retry-wait
diff --git a/docs/source/sctool/partials/sctool_repair_control.yaml b/docs/source/sctool/partials/sctool_repair_control.yaml
index b2f8259224..0f16ef2be5 100644
--- a/docs/source/sctool/partials/sctool_repair_control.yaml
+++ b/docs/source/sctool/partials/sctool_repair_control.yaml
@@ -20,17 +20,19 @@ options:
 - name: intensity
   default_value: "1"
   usage: |
-    How many token ranges per shard to repair in a single Scylla node at the same time.
-    Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
+    How many token ranges to repair in a single Scylla node at the same time.
+    Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
     Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
+    If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
+    See effectively used intensity value in the display of 'sctool progress repair' command.
 - name: parallel
   default_value: "0"
   usage: |
     The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
-    Each node can take part in at most one repair at any given moment.
-    By default the maximum possible parallelism is used.
-    The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
-    The formula to calculate is is as follows: number of nodes / RF, ex. for 6 node cluster with RF=3 the maximum parallelism is 2.
+    Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
+    The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
+    If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
+    See effectively used parallel value in the display of 'sctool progress repair' command.
 inherited_options:
 - name: api-cert-file
   usage: |
diff --git a/docs/source/sctool/partials/sctool_repair_update.yaml b/docs/source/sctool/partials/sctool_repair_update.yaml
index 7cf87c027b..f4910991a4 100644
--- a/docs/source/sctool/partials/sctool_repair_update.yaml
+++ b/docs/source/sctool/partials/sctool_repair_update.yaml
@@ -59,7 +59,7 @@ options:
   default_value: "1"
   usage: |
     How many token ranges to repair in a single Scylla node at the same time.
-    Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
+    Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
     Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
     If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
     See effectively used intensity value in the display of 'sctool progress repair' command.
@@ -108,9 +108,7 @@ options:
   usage: |
     The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
     Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
-    The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
-    The formula to calculate is is as follows: number of nodes / RF.
-    For example, the maximum parallelism for a 6 node cluster with RF=3 is 2.
+    The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
     If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
     See effectively used parallel value in the display of 'sctool progress repair' command.
 - name: retry-wait
diff --git a/pkg/command/repair/repaircontrol/res.yaml b/pkg/command/repair/repaircontrol/res.yaml
index f7ee8322c2..b2f6981963 100644
--- a/pkg/command/repair/repaircontrol/res.yaml
+++ b/pkg/command/repair/repaircontrol/res.yaml
@@ -11,13 +11,15 @@ long: |
   For modifying future repair task runs see 'sctool repair update' command.
 
 intensity: |
-  How many token ranges per shard to repair in a single Scylla node at the same time.
-  Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
+  How many token ranges to repair in a single Scylla node at the same time.
+  Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
   Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
+  If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
+  See effectively used intensity value in the display of 'sctool progress repair' command.
 
 parallel: |
   The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
-  Each node can take part in at most one repair at any given moment.
-  By default the maximum possible parallelism is used.
-  The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
-  The formula to calculate is is as follows: number of nodes / RF, ex. for 6 node cluster with RF=3 the maximum parallelism is 2.
+  Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
+  The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
+  If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
+  See effectively used parallel value in the display of 'sctool progress repair' command.
diff --git a/pkg/command/repair/res.yaml b/pkg/command/repair/res.yaml
index 7622610ed6..0408cd227b 100644
--- a/pkg/command/repair/res.yaml
+++ b/pkg/command/repair/res.yaml
@@ -17,7 +17,7 @@ ignore-down-hosts: |
 
 intensity: |
   How many token ranges to repair in a single Scylla node at the same time.
-  Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported by node (see max_repair_ranges_in_parallel in Scylla logs).
+  Zero (0) is a special value, the number of token ranges is adjusted to the maximum supported (see repair docs for more information).
   Changing the intensity impacts repair granularity if you need to resume it, the higher the value the more work on resume.
   If you set intensity to a value greater than the maximum supported by the node, intensity will be capped at that maximum.
   See effectively used intensity value in the display of 'sctool progress repair' command.
@@ -25,9 +25,7 @@ intensity: |
 parallel: |
   The maximum number of Scylla repair jobs that can run at the same time (on different token ranges and replicas).
   Each node can take part in at most one repair at any given moment. By default the maximum possible parallelism is used.
-  The effective parallelism depends on a keyspace replication factor (RF) and the number of nodes.
-  The formula to calculate is is as follows: number of nodes / RF.
-  For example, the maximum parallelism for a 6 node cluster with RF=3 is 2.
+  The maximal effective parallelism depends on keyspace replication strategy and cluster topology (see repair docs for more information).
   If you set parallel to a value greater than the maximum supported by the node, parallel will be capped at that maximum.
   See effectively used parallel value in the display of 'sctool progress repair' command.
 
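
Note on the two rules documented by this patch: the "Maximal effective parallelism" and "Maximal effective intensity" sections can be illustrated with a small Python sketch. It is not Scylla Manager code. The function names and the input dictionaries are hypothetical, and the parallelism function only computes an upper bound, assuming every repair job occupies a full replica set (``rf`` nodes in each datacenter) and that disjoint replica sets can actually be found in the token ring.

```python
def max_effective_parallel(nodes_per_dc, rf_per_dc):
    """Upper bound on concurrent repair jobs for one keyspace.

    Assumption (hypothetical model): each job needs rf_per_dc[dc] distinct
    nodes in every datacenter, and a node serves at most one job at a time.
    """
    limits = [nodes_per_dc[dc] // rf for dc, rf in rf_per_dc.items() if rf > 0]
    return min(limits) if limits else 0


def max_effective_intensity(job_nodes, max_ranges_by_node):
    """Minimum max_repair_ranges_in_parallel among the nodes in a job."""
    return min(max_ranges_by_node[node] for node in job_nodes)


# Example from the docs: 2 datacenters with 5 nodes each, dc1 RF=2, dc2 RF=3.
parallel = max_effective_parallel({"dc1": 5, "dc2": 5}, {"dc1": 2, "dc2": 3})

# Hypothetical per-node max_repair_ranges_in_parallel values.
intensity = max_effective_intensity(
    ["n1", "n2", "n3"], {"n1": 4, "n2": 2, "n3": 7}
)

print(parallel, intensity)
```

For a single datacenter the bound reduces to the removed "number of nodes / RF" formula, which is why the old help text was only correct for simple topologies.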