
add docs for TiFlash MinTso scheduler #19222

Merged
merged 14 commits into from
Nov 4, 2024
1 change: 1 addition & 0 deletions TOC.md
@@ -677,6 +677,7 @@
- [TiFlash Late Materialization](/tiflash/tiflash-late-materialization.md)
- [Spill to Disk](/tiflash/tiflash-spill-disk.md)
- [Data Validation](/tiflash/tiflash-data-validation.md)
- [MinTSO Scheduler](/tiflash/tiflash-mintso-scheduler.md)
- [Compatibility](/tiflash/tiflash-compatibility.md)
- [Pipeline Execution Model](/tiflash/tiflash-pipeline-model.md)
- TiDB Distributed eXecution Framework (DXF)
Binary file added media/tiflash/tiflash_mintso_v1.png
Binary file added media/tiflash/tiflash_mintso_v2.png
62 changes: 62 additions & 0 deletions tiflash/tiflash-mintso-scheduler.md
@@ -0,0 +1,62 @@
---
title: TiFlash MinTSO Scheduler
summary: Understand the implementation principles of the TiFlash MinTSO Scheduler.
---

# TiFlash MinTSO Scheduler

The TiFlash MinTSO Scheduler is a distributed scheduler for [MPP](/glossary.md#mpp) Tasks in TiFlash. This document introduces the implementation principles of the TiFlash MinTSO Scheduler.

## Background

When processing MPP queries, TiDB splits the query into one or more MPP Tasks and sends these MPP Tasks to the corresponding TiFlash nodes for compilation and execution. Before TiFlash used the [pipeline execution model](/tiflash/tiflash-pipeline-model.md), TiFlash needed to use several threads to execute each MPP Task, with the specific number of threads depending on the complexity of the MPP Task and the concurrency parameters set in TiFlash.

In high concurrency scenarios, TiFlash nodes receive multiple MPP Tasks simultaneously. If the execution of MPP Tasks is not controlled, the number of threads TiFlash needs to request from the system grows linearly with the number of MPP Tasks. An excessive number of threads reduces TiFlash's execution efficiency, and because the operating system supports only a limited number of threads, TiFlash encounters errors when it requests more threads than the operating system can provide.


To improve TiFlash's processing capability in high concurrency scenarios, an MPP Task scheduler needs to be introduced in TiFlash.

## Implementation principles

As mentioned in the background, the initial purpose of introducing the TiFlash Task Scheduler is to control the number of threads used during MPP query execution. A simple scheduling strategy is to specify the maximum number of threads TiFlash can request. For each MPP Task, the scheduler decides whether the MPP Task can be scheduled based on the current number of threads used by the system and the expected number of threads the MPP Task will use:

![TiFlash MinTSO Scheduler v1](/media/tiflash/tiflash_mintso_v1.png)
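The simple strategy above can be sketched as follows. This is an illustrative sketch only, not TiFlash's implementation; the class and parameter names (`SimpleThreadLimitScheduler`, `max_threads`, `estimated_threads`) are hypothetical.

```python
# Illustrative sketch of the simple strategy: admit an MPP Task only if the
# expected total thread usage stays within a single global limit.
# Not TiFlash source code; all names are hypothetical.

class SimpleThreadLimitScheduler:
    def __init__(self, max_threads):
        self.max_threads = max_threads  # maximum threads TiFlash may request
        self.used_threads = 0           # threads currently in use

    def try_schedule(self, estimated_threads):
        """Reserve threads for an MPP Task if it fits; otherwise it must wait."""
        if self.used_threads + estimated_threads <= self.max_threads:
            self.used_threads += estimated_threads
            return True   # scheduled
        return False      # waits until other tasks release threads

    def release(self, estimated_threads):
        """Called when an MPP Task finishes and returns its threads."""
        self.used_threads -= estimated_threads
```

For example, with a limit of 10 threads, a task estimated at 6 threads is admitted, a second such task waits, and it becomes schedulable once the first task releases its threads.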

Although the above scheduling strategy can effectively control the number of system threads, MPP Tasks are not the smallest independent execution units, and there are dependencies between different MPP Tasks:

```sql
EXPLAIN SELECT count(*) FROM t0 a JOIN t0 b ON a.id = b.id;
```

```
+--------------------------------------------+----------+--------------+---------------+----------------------------------------------------------+
| id | estRows | task | access object | operator info |
+--------------------------------------------+----------+--------------+---------------+----------------------------------------------------------+
| HashAgg_44 | 1.00 | root | | funcs:count(Column#8)->Column#7 |
| └─TableReader_46 | 1.00 | root | | MppVersion: 2, data:ExchangeSender_45 |
| └─ExchangeSender_45 | 1.00 | mpp[tiflash] | | ExchangeType: PassThrough |
| └─HashAgg_13 | 1.00 | mpp[tiflash] | | funcs:count(1)->Column#8 |
| └─Projection_43 | 12487.50 | mpp[tiflash] | | test.t0.id |
| └─HashJoin_42 | 12487.50 | mpp[tiflash] | | inner join, equal:[eq(test.t0.id, test.t0.id)] |
| ├─ExchangeReceiver_22(Build) | 9990.00 | mpp[tiflash] | | |
| │ └─ExchangeSender_21 | 9990.00 | mpp[tiflash] | | ExchangeType: Broadcast, Compression: FAST |
| │ └─Selection_20 | 9990.00 | mpp[tiflash] | | not(isnull(test.t0.id)) |
| │ └─TableFullScan_19 | 10000.00 | mpp[tiflash] | table:a | pushed down filter:empty, keep order:false, stats:pseudo |
| └─Selection_24(Probe) | 9990.00 | mpp[tiflash] | | not(isnull(test.t0.id)) |
| └─TableFullScan_23 | 10000.00 | mpp[tiflash] | table:b | pushed down filter:empty, keep order:false, stats:pseudo |
+--------------------------------------------+----------+--------------+---------------+----------------------------------------------------------+
```

For example, the above query generates two MPP Tasks on each TiFlash node, where the MPP Task containing `ExchangeSender_45` depends on the MPP Task containing `ExchangeSender_21`. In high concurrency scenarios, if the scheduler schedules the MPP Task containing `ExchangeSender_45` for every query, the threads they occupy leave no capacity to schedule the MPP Tasks containing `ExchangeSender_21` that they depend on, so no query can make progress and the system enters a deadlock state.

To avoid deadlock, TiFlash introduces the following two levels of thread limits:

* `thread_soft_limit`: mainly used to limit the number of threads used by the system. For specific MPP Tasks, this limit can be exceeded to avoid deadlock.
* `thread_hard_limit`: mainly used to protect the system. Once the number of threads used by the system exceeds the hard limit, TiFlash reports an error to avoid deadlock.

The principle of avoiding deadlock with the soft limit and hard limit is as follows: the soft limit restricts the total number of threads used by all queries, which makes full use of resources while avoiding thread resource exhaustion; the hard limit guarantees that, in any situation, at least one query in the system can break the soft limit and continue to acquire thread resources and run. As long as the number of threads does not exceed the hard limit, there is always one query in the system whose MPP Tasks can all be executed normally, so deadlock is avoided.

The goal of the MinTSO Scheduler is to control the number of system threads while ensuring that there is always one and only one special query in the system, where all its MPP Tasks can be scheduled. The MinTSO Scheduler is a fully distributed scheduler, with each TiFlash node scheduling MPP Tasks based only on its own information. Therefore, all MinTSO Schedulers on TiFlash nodes need to identify the same "special" query. In TiDB, each query carries a read timestamp (`start_ts`), and the MinTSO Scheduler defines the "special" query as the query with the smallest `start_ts` on the current TiFlash node. Based on the principle that the global minimum is also the local minimum, the "special" query selected by all TiFlash nodes must be the same, called the MinTSO query. The scheduling process of the MinTSO Scheduler is as follows:

![TiFlash MinTSO Scheduler v2](/media/tiflash/tiflash_mintso_v2.png)
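One node's scheduling decision can be illustrated with the following sketch. It assumes the node tracks the `start_ts` of every query that has MPP Tasks registered locally, plus its current thread usage; the class and names are hypothetical, not TiFlash's API.

```python
# Illustrative sketch, not TiFlash source code: one node's MinTSO scheduling
# decision. Tasks of the MinTSO query (smallest start_ts registered locally)
# may exceed the soft limit; crossing the hard limit is an error.

class MinTsoScheduler:
    def __init__(self, thread_soft_limit, thread_hard_limit):
        self.soft_limit = thread_soft_limit
        self.hard_limit = thread_hard_limit
        self.used_threads = 0
        self.active_queries = set()  # start_ts of queries with tasks on this node

    def try_schedule(self, start_ts, needed_threads):
        """Return True if the MPP Task is scheduled, False if it must wait."""
        self.active_queries.add(start_ts)
        is_min_tso = start_ts == min(self.active_queries)
        limit = self.hard_limit if is_min_tso else self.soft_limit
        if self.used_threads + needed_threads <= limit:
            self.used_threads += needed_threads
            return True
        if is_min_tso:
            # Even the MinTSO query cannot exceed the hard limit: report an
            # error instead of exhausting system threads.
            raise RuntimeError("thread hard limit exceeded")
        return False  # non-MinTSO tasks wait in the scheduling queue
```

Because each node computes the minimum over its local `start_ts` set, and the global minimum is also a local minimum wherever that query has tasks, every node independently reaches the same MinTSO query without any coordination.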

By introducing the soft limit and hard limit, the MinTSO Scheduler effectively avoids system deadlock while controlling the number of system threads. However, in high concurrency scenarios, it is possible that most queries have only part of their MPP Tasks scheduled. Queries with only part of their MPP Tasks scheduled cannot execute normally, leading to low system execution efficiency. To avoid this situation, TiFlash introduces a query-level limit for the MinTSO Scheduler, called `active_set_soft_limit`. This limit allows the MPP Tasks of at most `active_set_soft_limit` queries to participate in scheduling; the MPP Tasks of other queries do not participate in scheduling, and only after the current queries finish can new queries participate in scheduling. This limit is only a soft limit, because for the MinTSO query, all its MPP Tasks can be scheduled directly as long as the number of system threads does not exceed the hard limit.
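The query-level limit can be sketched as follows, under the assumption (illustrative, not necessarily TiFlash's actual policy) that queries enter the active set in arrival order; the function name is hypothetical.

```python
# Illustrative sketch of the active set soft limit: the MPP Tasks of at most
# `active_set_soft_limit` queries participate in scheduling, but the MinTSO
# query (smallest start_ts) always participates, which is why this is only
# a soft limit. Not TiFlash source code.

def schedulable_queries(arrival_order, active_set_soft_limit):
    """Return the start_ts values of queries allowed to participate in scheduling."""
    active = set(arrival_order[:active_set_soft_limit])  # first-come, first-admitted
    if arrival_order:
        active.add(min(arrival_order))  # the MinTSO query may break the limit
    return active
```

For example, with a limit of 2 and queries arriving with `start_ts` 30, 20, 40, 10, the first two queries (30 and 20) are admitted, and the MinTSO query (10) also participates despite the limit.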