From 5c7971b7633bcb2fa2d4908d1efc09d49d88cbf5 Mon Sep 17 00:00:00 2001
From: Ti Chi Robot
Date: Fri, 6 Dec 2024 10:04:17 +0800
Subject: [PATCH] ticdc: fix alert doc (#19596) (#19597)

---
 ticdc/ticdc-alert-rules.md | 56 ++++++++++----------------------------
 1 file changed, 14 insertions(+), 42 deletions(-)

diff --git a/ticdc/ticdc-alert-rules.md b/ticdc/ticdc-alert-rules.md
index 855819fcc8ae9..5526eee3ddd69 100644
--- a/ticdc/ticdc-alert-rules.md
+++ b/ticdc/ticdc-alert-rules.md
@@ -54,20 +54,6 @@ For critical alerts, you need to pay close attention to abnormal monitoring metr
 
     This alert is similar to replication interruption. See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
-### `ticdc_processor_exit_with_error_count`
-
-- Alert rule:
-
-    `changes(ticdc_processor_exit_with_error_count[1m]) > 0`
-
-- Description:
-
-    A replication task reports an error and exits.
-
-- Solution:
-
-    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
-
 ## Warning alerts
 
 Warning alerts are a reminder for an issue or error.
@@ -86,61 +72,47 @@ Warning alerts are a reminder for an issue or error.
 
     Collect TiCDC logs to locate the root cause.
 
-### `cdc_sink_flush_duration_time_more_than_10s`
+### `cdc_no_owner`
 
 - Alert rule:
 
-    `histogram_quantile(0.9, rate(ticdc_sink_txn_worker_flush_duration[1m])) > 10`
+    `sum(rate(ticdc_owner_ownership_counter[240s])) < 0.5`
 
 - Description:
 
-    It takes a replication task more than 10 seconds to write data to the downstream database.
+    There is no owner in the TiCDC cluster for more than 10 minutes.
 
 - Solution:
 
-    Check whether there are problems in the downstream database.
+    Collect TiCDC logs to identify the root cause.
 
-### `cdc_processor_checkpoint_tso_no_change_for_1m`
+### `ticdc_changefeed_meet_error`
 
 - Alert rule:
 
-    `changes(ticdc_processor_checkpoint_ts[1m]) < 1`
+    `(max_over_time(ticdc_owner_status[1m]) == 1 or max_over_time(ticdc_owner_status[1m]) == 6) > 0`
 
 - Description:
 
-    A replication task has not advanced for more than 1 minute.
+    A replication task encounters an error.
 
 - Solution:
 
     See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
-### `ticdc_puller_entry_sorter_sort_bucket`
-
-- Alert rule:
-
-    `histogram_quantile(0.9, rate(ticdc_puller_entry_sorter_sort_bucket{}[1m])) > 1`
-
-- Description:
-
-    The delay of TiCDC puller entry sorter is too high.
-
-- Solution:
-
-    Collect TiCDC logs to locate the root cause.
-
-### `ticdc_puller_entry_sorter_merge_bucket`
+### `ticdc_processor_exit_with_error_count`
 
 - Alert rule:
 
-    `histogram_quantile(0.9, rate(ticdc_puller_entry_sorter_merge_bucket{}[1m])) > 1`
+    `changes(ticdc_processor_exit_with_error_count[1m]) > 0`
 
 - Description:
 
-    The delay of TiCDC puller entry sorter merge is too high.
+    A replication task reports an error and exits.
 
 - Solution:
 
-    Collect TiCDC logs to locate the root cause.
+    See [TiCDC Handles Replication Interruption](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-replication-interruptions).
 
 ### `tikv_cdc_min_resolved_ts_no_change_for_1m`
 
@@ -170,15 +142,15 @@ Warning alerts are a reminder for an issue or error.
 
     Collect TiCDC monitoring metrics and TiKV logs to locate the root cause.
 
-### `ticdc_sink_mysql_execution_error`
+### `ticdc_sink_execution_error`
 
 - Alert rule:
 
-    `changes(ticdc_sink_mysql_execution_error[1m]) > 0`
+    `changes(ticdc_sink_execution_error[1m]) > 0`
 
 - Description:
 
-    An error occurs when a replication task writes data to the downstream MySQL.
+    An error occurs when a replication task writes data to the downstream.
 
 - Solution:
 