
add docs for TiFlash MinTso scheduler #19222

Merged
merged 14 commits into from
Nov 4, 2024
1 change: 1 addition & 0 deletions TOC.md
@@ -677,6 +677,7 @@
- [TiFlash Late Materialization](/tiflash/tiflash-late-materialization.md)
- [Spill to Disk](/tiflash/tiflash-spill-disk.md)
- [Data Validation](/tiflash/tiflash-data-validation.md)
- [MinTSO Scheduler](/tiflash/tiflash-mintso-scheduler.md)
- [Compatibility](/tiflash/tiflash-compatibility.md)
- [Pipeline Execution Model](/tiflash/tiflash-pipeline-model.md)
- TiDB Distributed eXecution Framework (DXF)
Binary file added media/tiflash/tiflash_mintso_v1.png
Binary file added media/tiflash/tiflash_mintso_v2.png
62 changes: 62 additions & 0 deletions tiflash/tiflash-mintso-scheduler.md
@@ -0,0 +1,62 @@
---
title: TiFlash MinTSO Scheduler
summary: Understand the implementation principles of the TiFlash MinTSO Scheduler.
---

# TiFlash MinTSO Scheduler

The TiFlash MinTSO Scheduler is a distributed scheduler for [MPP](/glossary.md#mpp) Tasks in TiFlash. This document introduces the implementation principles of the TiFlash MinTSO Scheduler.

## Background

When processing MPP queries, TiDB splits the query into one or more MPP Tasks and sends these MPP Tasks to the corresponding TiFlash nodes for compilation and execution. Before TiFlash used the [pipeline execution model](/tiflash/tiflash-pipeline-model.md), TiFlash needed to use several threads to execute each MPP Task, with the specific number of threads depending on the complexity of the MPP Task and the concurrency parameters set in TiFlash.

In high concurrency scenarios, TiFlash nodes receive multiple MPP Tasks simultaneously. If the execution of MPP Tasks is not controlled, the number of threads TiFlash needs to request from the system grows linearly with the number of MPP Tasks. An excessive number of threads reduces TiFlash's execution efficiency, and because the operating system supports only a limited number of threads, TiFlash encounters errors when it requests more threads than the operating system can provide.


To improve TiFlash's processing capability in high concurrency scenarios, an MPP Task scheduler needs to be introduced in TiFlash.

## Implementation principles

As mentioned in the background, the initial purpose of introducing the TiFlash Task Scheduler is to control the number of threads used during MPP query execution. A simple scheduling strategy is to specify the maximum number of threads TiFlash can request. For each MPP Task, the scheduler decides whether the MPP Task can be scheduled based on the current number of threads used by the system and the expected number of threads the MPP Task will use:

![TiFlash MinTSO Scheduler v1](/media/tiflash/tiflash_mintso_v1.png)
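The simple strategy above can be sketched as follows. This is an illustrative sketch only, not TiFlash's implementation; the class and parameter names (`SimpleThreadLimitScheduler`, `max_threads`, `estimated_threads`) are hypothetical.

```python
# Illustrative sketch of the simple strategy: admit an MPP Task only if the
# expected total thread usage stays within a single global limit.
# Not TiFlash source code; all names are hypothetical.

class SimpleThreadLimitScheduler:
    def __init__(self, max_threads):
        self.max_threads = max_threads  # maximum threads TiFlash may request
        self.used_threads = 0           # threads currently in use

    def try_schedule(self, estimated_threads):
        """Reserve threads for an MPP Task if it fits; otherwise it must wait."""
        if self.used_threads + estimated_threads <= self.max_threads:
            self.used_threads += estimated_threads
            return True   # scheduled
        return False      # waits until other tasks release threads

    def release(self, estimated_threads):
        """Called when an MPP Task finishes and returns its threads."""
        self.used_threads -= estimated_threads
```

For example, with a limit of 10 threads, a task estimated at 6 threads is admitted, a second such task waits, and it becomes schedulable once the first task releases its threads.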

Although the above scheduling strategy can effectively control the number of system threads, MPP Tasks are not the smallest independent execution units, and there are dependencies between different MPP Tasks:

```sql
EXPLAIN SELECT count(*) FROM t0 a JOIN t0 b ON a.id = b.id;
```

```
+--------------------------------------------+----------+--------------+---------------+----------------------------------------------------------+
| id | estRows | task | access object | operator info |
+--------------------------------------------+----------+--------------+---------------+----------------------------------------------------------+
| HashAgg_44 | 1.00 | root | | funcs:count(Column#8)->Column#7 |
| └─TableReader_46 | 1.00 | root | | MppVersion: 2, data:ExchangeSender_45 |
| └─ExchangeSender_45 | 1.00 | mpp[tiflash] | | ExchangeType: PassThrough |
| └─HashAgg_13 | 1.00 | mpp[tiflash] | | funcs:count(1)->Column#8 |
| └─Projection_43 | 12487.50 | mpp[tiflash] | | test.t0.id |
| └─HashJoin_42 | 12487.50 | mpp[tiflash] | | inner join, equal:[eq(test.t0.id, test.t0.id)] |
| ├─ExchangeReceiver_22(Build) | 9990.00 | mpp[tiflash] | | |
| │ └─ExchangeSender_21 | 9990.00 | mpp[tiflash] | | ExchangeType: Broadcast, Compression: FAST |
| │ └─Selection_20 | 9990.00 | mpp[tiflash] | | not(isnull(test.t0.id)) |
| │ └─TableFullScan_19 | 10000.00 | mpp[tiflash] | table:a | pushed down filter:empty, keep order:false, stats:pseudo |
| └─Selection_24(Probe) | 9990.00 | mpp[tiflash] | | not(isnull(test.t0.id)) |
| └─TableFullScan_23 | 10000.00 | mpp[tiflash] | table:b | pushed down filter:empty, keep order:false, stats:pseudo |
+--------------------------------------------+----------+--------------+---------------+----------------------------------------------------------+
```

For example, the above query generates two MPP Tasks on each TiFlash node, where the MPP Task containing `ExchangeSender_45` depends on the MPP Task containing `ExchangeSender_21`. In high concurrency scenarios, if the scheduler schedules the MPP Task containing `ExchangeSender_45` for every query, the threads they occupy leave no capacity to schedule the MPP Tasks containing `ExchangeSender_21` that they depend on, so no query can make progress and the system enters a deadlock state.

To avoid deadlock, TiFlash introduces the following two levels of thread limits:

* `thread_soft_limit`: mainly used to limit the number of threads used by the system. For specific MPP Tasks, this limit can be exceeded to avoid deadlock.
* `thread_hard_limit`: mainly used to protect the system. Once the number of threads used by the system exceeds the hard limit, TiFlash reports an error to avoid deadlock.

The principle of avoiding deadlock with the soft limit and hard limit is as follows: the soft limit restricts the total number of threads used by all queries, which makes full use of resources while avoiding thread resource exhaustion; the hard limit guarantees that, in any situation, at least one query in the system can break the soft limit and continue to acquire thread resources and run. As long as the number of threads does not exceed the hard limit, there is always one query in the system whose MPP Tasks can all be executed normally, so deadlock is avoided.

The goal of the MinTSO Scheduler is to control the number of system threads while ensuring that there is always one and only one special query in the system, where all its MPP Tasks can be scheduled. The MinTSO Scheduler is a fully distributed scheduler, with each TiFlash node scheduling MPP Tasks based only on its own information. Therefore, all MinTSO Schedulers on TiFlash nodes need to identify the same "special" query. In TiDB, each query carries a read timestamp (`start_ts`), and the MinTSO Scheduler defines the "special" query as the query with the smallest `start_ts` on the current TiFlash node. Based on the principle that the global minimum is also the local minimum, the "special" query selected by all TiFlash nodes must be the same, called the MinTSO query. The scheduling process of the MinTSO Scheduler is as follows:

![TiFlash MinTSO Scheduler v2](/media/tiflash/tiflash_mintso_v2.png)
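One node's scheduling decision can be illustrated with the following sketch. It assumes the node tracks the `start_ts` of every query that has MPP Tasks registered locally, plus its current thread usage; the class and names are hypothetical, not TiFlash's API.

```python
# Illustrative sketch, not TiFlash source code: one node's MinTSO scheduling
# decision. Tasks of the MinTSO query (smallest start_ts registered locally)
# may exceed the soft limit; crossing the hard limit is an error.

class MinTsoScheduler:
    def __init__(self, thread_soft_limit, thread_hard_limit):
        self.soft_limit = thread_soft_limit
        self.hard_limit = thread_hard_limit
        self.used_threads = 0
        self.active_queries = set()  # start_ts of queries with tasks on this node

    def try_schedule(self, start_ts, needed_threads):
        """Return True if the MPP Task is scheduled, False if it must wait."""
        self.active_queries.add(start_ts)
        is_min_tso = start_ts == min(self.active_queries)
        limit = self.hard_limit if is_min_tso else self.soft_limit
        if self.used_threads + needed_threads <= limit:
            self.used_threads += needed_threads
            return True
        if is_min_tso:
            # Even the MinTSO query cannot exceed the hard limit: report an
            # error instead of exhausting system threads.
            raise RuntimeError("thread hard limit exceeded")
        return False  # non-MinTSO tasks wait in the scheduling queue
```

Because each node computes the minimum over its local `start_ts` set, and the global minimum is also a local minimum wherever that query has tasks, every node independently reaches the same MinTSO query without any coordination.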

By introducing the soft limit and hard limit, the MinTSO Scheduler effectively avoids system deadlock while controlling the number of system threads. However, in high concurrency scenarios, it is possible that most queries have only part of their MPP Tasks scheduled. Queries with only part of their MPP Tasks scheduled cannot execute normally, leading to low system execution efficiency. To avoid this situation, TiFlash introduces a query-level limit for the MinTSO Scheduler, called `active_set_soft_limit`. This limit allows the MPP Tasks of at most `active_set_soft_limit` queries to participate in scheduling; the MPP Tasks of other queries do not participate in scheduling, and only after the current queries finish can new queries participate in scheduling. This limit is only a soft limit, because for the MinTSO query, all its MPP Tasks can be scheduled directly as long as the number of system threads does not exceed the hard limit.
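The query-level limit can be sketched as follows, under the assumption (illustrative, not necessarily TiFlash's actual policy) that queries enter the active set in arrival order; the function name is hypothetical.

```python
# Illustrative sketch of the active set soft limit: the MPP Tasks of at most
# `active_set_soft_limit` queries participate in scheduling, but the MinTSO
# query (smallest start_ts) always participates, which is why this is only
# a soft limit. Not TiFlash source code.

def schedulable_queries(arrival_order, active_set_soft_limit):
    """Return the start_ts values of queries allowed to participate in scheduling."""
    active = set(arrival_order[:active_set_soft_limit])  # first-come, first-admitted
    if arrival_order:
        active.add(min(arrival_order))  # the MinTSO query may break the limit
    return active
```

For example, with a limit of 2 and queries arriving with `start_ts` 30, 20, 40, 10, the first two queries (30 and 20) are admitted, and the MinTSO query (10) also participates despite the limit.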