Fix some links
mdlinville committed Nov 13, 2024
1 parent efc1751 commit bc3f8e4
Showing 28 changed files with 118 additions and 135 deletions.
22 changes: 11 additions & 11 deletions src/current/_includes/common/shutdown/cluster-settings.md
@@ -3,7 +3,7 @@

Alias: `server.shutdown.drain_wait`

`server.shutdown.initial_wait` sets a **fixed** duration for the ["unready phase"](#draining) of node drain. Because a load balancer reroutes connections to non-draining nodes within this duration (`0s` by default), this setting should be coordinated with the load balancer settings.
`server.shutdown.initial_wait` sets a **fixed** duration for the ["unready phase"]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) of node drain. Because a load balancer reroutes connections to non-draining nodes within this duration (`0s` by default), this setting should be coordinated with the load balancer settings.

Increase `server.shutdown.initial_wait` so that your load balancer is able to make adjustments before this phase times out. Because the drain process waits unconditionally for the `server.shutdown.initial_wait` duration, do not set this value too high.
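For example (hypothetical load balancer values, not from this page): if your load balancer performs health checks every 2 seconds and marks a node unhealthy after 3 consecutive failures, it may need up to roughly 6 seconds to stop routing new connections to a draining node, so an unready phase of `8s` — as in the example shown below — leaves a small margin.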

@@ -19,7 +19,7 @@ SET CLUSTER SETTING server.shutdown.initial_wait = '8s';

Alias: `server.shutdown.connection_wait`

`server.shutdown.connections.timeout` sets the **maximum** duration for the ["connection phase"](#draining) of node drain. SQL client connections are allowed to close or time out within this duration (`0s` by default). This setting presents an option to gracefully close the connections before CockroachDB forcibly closes those that remain after the ["SQL drain phase"](#draining).
`server.shutdown.connections.timeout` sets the **maximum** duration for the ["connection phase"]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) of node drain. SQL client connections are allowed to close or time out within this duration (`0s` by default). This setting presents an option to gracefully close the connections before CockroachDB forcibly closes those that remain after the ["SQL drain phase"]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases).

Change this setting **only** if you cannot tolerate connection errors during node drain and cannot configure the maximum lifetime of SQL client connections, which is usually configurable via a [connection pool]({% link {{ page.version.version }}/connection-pooling.md %}#about-connection-pools). Depending on your requirements:

@@ -31,7 +31,7 @@ Change this setting **only** if you cannot tolerate connection errors during nod

Alias: `server.shutdown.query_wait`

`server.shutdown.transactions.timeout` sets the **maximum** duration for the ["SQL drain phase"](#draining) and the **maximum** duration for the ["DistSQL drain phase"](#draining) of node drain. Active local and distributed queries must complete, in turn, within this duration (`10s` by default).
`server.shutdown.transactions.timeout` sets the **maximum** duration for the ["SQL drain phase"]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) and the **maximum** duration for the ["DistSQL drain phase"]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) of node drain. Active local and distributed queries must complete, in turn, within this duration (`10s` by default).
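As an illustrative sketch (the `30s` value is hypothetical, not a recommendation from this page), a deployment whose longest expected transactions run for roughly 20 seconds might allow extra headroom:

~~~ sql
SET CLUSTER SETTING server.shutdown.transactions.timeout = '30s';
~~~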

Ensure that `server.shutdown.transactions.timeout` is greater than:

@@ -52,14 +52,14 @@ If there are still open transactions on the draining node when the server closes

Alias: `server.shutdown.lease_transfer_wait`

In the ["lease transfer phase"](#draining) of node drain, the server attempts to transfer all [range leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases) and [Raft leaderships]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) from the draining node. `server.shutdown.lease_transfer_iteration.timeout` sets the maximum duration of each iteration of this attempt (`5s` by default). Because this phase does not exit until all transfers are completed, changing this value affects only the frequency at which drain progress messages are printed.
In the ["lease transfer phase"]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) of node drain, the server attempts to transfer all [range leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases) and [Raft leaderships]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) from the draining node. `server.shutdown.lease_transfer_iteration.timeout` sets the maximum duration of each iteration of this attempt (`5s` by default). Because this phase does not exit until all transfers are completed, changing this value affects only the frequency at which drain progress messages are printed.

{% if page.path contains "drain" %}
In most cases, the default value is suitable. Do **not** set `server.shutdown.lease_transfer_iteration.timeout` to a value lower than `5s`; if you do, leases can fail to transfer and node drain will not be able to complete.
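To confirm the current value before a drain, you can run the following (an illustrative check from any SQL session on the cluster):

~~~ sql
SHOW CLUSTER SETTING server.shutdown.lease_transfer_iteration.timeout;
~~~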

#### `server.time_until_store_dead`

`server.time_until_store_dead` sets the duration after which a node is considered "dead" and its data is rebalanced to other nodes (`5m0s` by default). In the node shutdown sequence, this follows [process termination](#node-shutdown-sequence).
`server.time_until_store_dead` sets the duration after which a node is considered "dead" and its data is rebalanced to other nodes (`5m0s` by default). In the node shutdown sequence, this follows [process termination]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases).

Before temporarily stopping nodes for planned maintenance (e.g., upgrading system software), if you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily moving data off the nodes by increasing `server.time_until_store_dead` to match the estimated maintenance window:
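For example, assuming an estimated 30-minute maintenance window (a hypothetical value), you might set:

~~~ sql
SET CLUSTER SETTING server.time_until_store_dead = '30m0s';
~~~

After the maintenance window, the setting can be reset, as in the `RESET CLUSTER SETTING` statement shown below.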

@@ -80,7 +80,7 @@ RESET CLUSTER SETTING server.time_until_store_dead;
~~~

{% elsif page.path contains "decommission" %}
Since [decommissioning](#decommissioning) a node rebalances all of its range replicas onto other nodes, no replicas will remain on the node by the time draining begins. Therefore, no iterations occur during this phase. This setting can be left alone.
Since [decommissioning]({% link {{ page.version.version }}/decommission-a-node.md %}) a node rebalances all of its range replicas onto other nodes, no replicas will remain on the node by the time draining begins. Therefore, no iterations occur during this phase. This setting can be left alone.
{% endif %}

{{site.data.alerts.callout_info}}
@@ -95,27 +95,27 @@ Possible values are `good` (the default) and `best`. When set to `good`, a rando

### Drain timeout

When [draining manually](#drain-a-node-manually) with `cockroach node drain`, all [drain phases](#draining) must be completed within the duration of `--drain-wait` (`10m` by default) or the drain will stop. This can be observed with an `ERROR: drain timeout` message in the terminal output. To continue the drain, re-initiate the command.
When [draining manually](#drain-a-node-manually) with `cockroach node drain`, all [draining phases]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) must be completed within the duration of `--drain-wait` (`10m` by default) or the drain will stop. This can be observed with an `ERROR: drain timeout` message in the terminal output. To continue the drain, re-initiate the command.

A very long drain may indicate an anomaly; in this case, manually inspect the server to determine what is blocking the drain.

CockroachDB automatically increases the verbosity of logging when it detects a stall in the range lease transfer stage of `node drain`. Messages logged during such a stall include the time an attempt occurred, the total duration stalled waiting for the transfer attempt to complete, and the lease that is being transferred.

`--drain-wait` sets the timeout for [all draining phases](#draining) and is **not** related to the `server.shutdown.initial_wait` cluster setting, which configures the "unready phase" of draining. The value of `--drain-wait` should be greater than the sum of [`server.shutdown.initial_wait`](#server-shutdown-connections-timeout), [`server.shutdown.connections.timeout`](#server-shutdown-connections-timeout), [`server.shutdown.transactions.timeout`](#server-shutdown-transactions-timeout) times two, and [`server.shutdown.lease_transfer_iteration.timeout`](#server-shutdown-lease_transfer_iteration-timeout).
`--drain-wait` sets the timeout for [all draining phases]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) and is **not** related to the `server.shutdown.initial_wait` cluster setting, which configures the "unready phase" of draining. The value of `--drain-wait` should be greater than the sum of [`server.shutdown.initial_wait`](#server-shutdown-connections-timeout), [`server.shutdown.connections.timeout`](#server-shutdown-connections-timeout), [`server.shutdown.transactions.timeout`](#server-shutdown-transactions-timeout) times two, and [`server.shutdown.lease_transfer_iteration.timeout`](#server-shutdown-lease_transfer_iteration-timeout).
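As a worked example using the default values cited above: 0s (`server.shutdown.initial_wait`) + 0s (`server.shutdown.connections.timeout`) + 2 × 10s (`server.shutdown.transactions.timeout`) + 5s (`server.shutdown.lease_transfer_iteration.timeout`) = 25s, well below the default `--drain-wait` of `10m`. In practice, allow additional headroom, because the lease transfer phase repeats its `5s` iterations until all transfers complete.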

### Termination grace period

On production deployments, a process manager or orchestration system can disrupt graceful node shutdown if its termination grace period is too short.

{{site.data.alerts.callout_danger}}
{{ drain_early_termination_warning }}
Do not terminate the `cockroach` process before all of the phases of draining are complete. Otherwise, you may experience latency spikes until the [leases]({% link {{ page.version.version }}/architecture/glossary.md %}#leaseholder) that were on that node have transitioned to other nodes. It is safe to terminate the `cockroach` process only after a node has completed the drain process. This is especially important in a containerized system, to allow all TCP connections to terminate gracefully.
{{site.data.alerts.end}}

If the `cockroach` process has not terminated at the end of the grace period, a `SIGKILL` signal is sent to perform a "hard" shutdown that bypasses CockroachDB's [node shutdown logic](#node-shutdown-sequence) and forcibly terminates the process.
If the `cockroach` process has not terminated at the end of the grace period, a `SIGKILL` signal is sent to perform a "hard" shutdown that bypasses CockroachDB's [node shutdown logic]({% link {{ page.version.version }}/node-shutdown.md %}#draining-phases) and forcibly terminates the process.

- When using [`systemd`](https://www.freedesktop.org/wiki/Software/systemd/) to run CockroachDB as a service, set the termination grace period with the [`TimeoutStopSec`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStopSec=) setting in the service file.

- When using [Kubernetes]({% link {{ page.version.version }}/kubernetes-overview.md %}) to orchestrate CockroachDB, refer to [Decommissioning and draining on Kubernetes](#decommissioning-and-draining-on-kubernetes).
- When using [Kubernetes]({% link {{ page.version.version }}/kubernetes-overview.md %}) to orchestrate CockroachDB, refer to [Draining on Kubernetes]({% link {{ page.version.version }}/drain-a-node.md %}#draining-on-kubernetes).

To determine an appropriate termination grace period:

4 changes: 2 additions & 2 deletions src/current/_includes/common/shutdown/kubernetes.md
@@ -8,11 +8,11 @@ Most of the guidance in this page is most relevant to manual deployments that do

Refer to [Cluster Scaling]({% link {{ page.version.version }}/scale-cockroachdb-kubernetes.md %}).

- There is generally no need to interactively drain a node that is not being decommissioned, regardless of how you deployed the cluster in Kubernetes. When you upgrade, downgrade, or change the configuration of a CockroachDB deployment on Kubernetes, you apply the changes using a [rolling update](https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/), which applies the change to one node at a time. On a given node, Kubernetes sends a `SIGTERM` signal to the `cockroach` process. When the `cockroach` process receives this signal, it starts draining itself. After draining is complete or the [termination grace period](#termination-grace-period-on-kubernetes) expires (whichever happens first), Kubernetes terminates the `cockroach` process and then removes the node from the Kubernetes cluster. Kubernetes then applies the updated deployment to the cluster node, restarts the `cockroach` process, and re-joins the cluster. Refer to [Cluster Upgrades]({% link {{ page.version.version }}/upgrade-cockroachdb-kubernetes.md %}).
- There is generally no need to interactively drain a node that is not being decommissioned, regardless of how you deployed the cluster in Kubernetes. When you upgrade, downgrade, or change the configuration of a CockroachDB deployment on Kubernetes, you apply the changes using a [rolling update](https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/), which applies the change to one node at a time. On a given node, Kubernetes sends a `SIGTERM` signal to the `cockroach` process. When the `cockroach` process receives this signal, it starts draining itself. After draining is complete or the [termination grace period]({% link {{ page.version.version }}/drain-a-node.md %}#termination-grace-period-on-kubernetes) expires (whichever happens first), Kubernetes terminates the `cockroach` process and then removes the node from the Kubernetes cluster. Kubernetes then applies the updated deployment to the cluster node, restarts the `cockroach` process, and re-joins the cluster. Refer to [Cluster Upgrades]({% link {{ page.version.version }}/upgrade-cockroachdb-kubernetes.md %}).

- Although the `kubectl drain` command is used for manual maintenance of Kubernetes clusters, it has little direct relevance to the concept of draining a node in a CockroachDB cluster. The `kubectl drain` command gracefully terminates each pod running on a Kubernetes node so that the node can be shut down (in the case of physical hardware) or deleted (in the case of a virtual machine). For details on this command, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/).

Refer to [Termination grace period on Kubernetes](#termination-grace-period-on-kubernetes). For more details about managing CockroachDB on Kubernetes, refer to [Cluster upgrades]({% link {{ page.version.version }}/upgrade-cockroachdb-kubernetes.md %}) and [Cluster scaling]({% link {{ page.version.version }}/scale-cockroachdb-kubernetes.md %}).
Refer to [Termination grace period on Kubernetes]({% link {{ page.version.version }}/drain-a-node.md %}#termination-grace-period-on-kubernetes). For more details about managing CockroachDB on Kubernetes, refer to [Cluster upgrades]({% link {{ page.version.version }}/upgrade-cockroachdb-kubernetes.md %}) and [Cluster scaling]({% link {{ page.version.version }}/scale-cockroachdb-kubernetes.md %}).

### Termination grace period on Kubernetes

2 changes: 1 addition & 1 deletion src/current/_includes/v24.3/essential-metrics.md
@@ -98,7 +98,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
| <a id="ranges-underreplicated"></a>ranges.underreplicated | ranges.underreplicated | Number of ranges with fewer live replicas than the replication target | This metric is an indicator of [replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). It shows whether the cluster has data that is not conforming to resilience goals. The next step is to determine the corresponding database object, such as the table or index, of these under-replicated ranges and whether the under-replication is temporarily expected. Use the statement `SELECT table_name, index_name FROM [SHOW RANGES WITH INDEXES] WHERE range_id = {id of under-replicated range};`|
| <a id="ranges-unavailable"></a>ranges.unavailable | ranges.unavailable | Number of ranges with fewer live replicas than needed for quorum | This metric is an indicator of [replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). It shows whether the cluster is unhealthy and can impact workload. If an entire range is unavailable, then it will be unable to process queries. |
| queue.replicate.replacedecommissioningreplica.error | {% if include.deployment == 'self-hosted' %}queue.replicate.replacedecommissioningreplica.error.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of failed decommissioning replica replacements processed by the replicate queue | Refer to [Decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommission-the-node). |
| queue.replicate.replacedecommissioningreplica.error | {% if include.deployment == 'self-hosted' %}queue.replicate.replacedecommissioningreplica.error.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of failed decommissioning replica replacements processed by the replicate queue | Refer to [Decommission the node]({% link {{ page.version.version }}/decommission-a-node.md %}). |
| range.splits | {% if include.deployment == 'self-hosted' %}range.splits.total |{% elsif include.deployment == 'advanced' %}range.splits |{% endif %} Number of range splits | This metric indicates how fast a workload is scaling up. Spikes can indicate resource hot spots since the [split heuristic is based on QPS]({% link {{ page.version.version }}/load-based-splitting.md %}#control-load-based-splitting-threshold). To understand whether hot spots are an issue and with which tables and indexes they are occurring, correlate this metric with other metrics such as CPU usage, such as `sys.cpu.combined.percent-normalized`, or use the [**Hot Ranges** page]({% link {{ page.version.version }}/ui-hot-ranges-page.md %}). |
| range.merges | {% if include.deployment == 'self-hosted' %}range.merges.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of range merges | This metric indicates how fast a workload is scaling down. Merges are Cockroach's [optimization for performance](architecture/distribution-layer.html#range-merges). This metric indicates that there have been deletes in the workload. |
