From 97291aa30b2d7038a5bdaf2c6e81bfe72bb1d846 Mon Sep 17 00:00:00 2001 From: Li Haoyi Date: Tue, 24 Dec 2024 23:30:57 +0900 Subject: [PATCH] Blog post on selective test execution (#4175) --- blog/antora.yml | 2 +- blog/modules/ROOT/nav.adoc | 1 + blog/modules/ROOT/pages/1-java-compile.adoc | 5 +- .../ROOT/pages/2-monorepo-build-tool.adoc | 6 +- .../ROOT/pages/3-selective-testing.adoc | 537 ++++++++++++++++++ blog/modules/ROOT/pages/index.adoc | 6 +- .../ROOT/pages/large/selective-execution.adoc | 2 +- docs/package.mill | 7 +- .../9-selective-execution/build.mill | 46 ++ 9 files changed, 604 insertions(+), 8 deletions(-) create mode 100644 blog/modules/ROOT/pages/3-selective-testing.adoc diff --git a/blog/antora.yml b/blog/antora.yml index fa746cc85ef..00be394c59e 100644 --- a/blog/antora.yml +++ b/blog/antora.yml @@ -1,5 +1,5 @@ name: blog -title: Mill Blog +title: The Mill Build Engineering Blog version: ~ nav: - modules/ROOT/nav.adoc diff --git a/blog/modules/ROOT/nav.adoc b/blog/modules/ROOT/nav.adoc index c090b81378c..5983e0697bc 100644 --- a/blog/modules/ROOT/nav.adoc +++ b/blog/modules/ROOT/nav.adoc @@ -1,3 +1,4 @@ +* xref:3-selective-testing.adoc[] * xref:2-monorepo-build-tool.adoc[] * xref:1-java-compile.adoc[] diff --git a/blog/modules/ROOT/pages/1-java-compile.adoc b/blog/modules/ROOT/pages/1-java-compile.adoc index 578fbdb7f81..d73c2bedd63 100644 --- a/blog/modules/ROOT/pages/1-java-compile.adoc +++ b/blog/modules/ROOT/pages/1-java-compile.adoc @@ -15,6 +15,9 @@ not match today's reality. Nowadays the Java compiler can compile "typical" Java should take more than 10s to compile in a single-threaded fashion, and should be even faster in the presence of parallelism +// end::header[] + + Doing some ad-hoc benchmarks, we find that although the compiler is blazing fast, all build tools add significant overhead over compiling Java directly: @@ -31,8 +34,6 @@ all build tools fall short of how fast compiling Java _should_ be. This post exp these numbers were arrived at, and what that means in un-tapped potential for Java build tooling to become truly great. -// end::header[] - ## Mockito Core To begin to understand the problem, lets consider the codebase of the popular Mockito project: diff --git a/blog/modules/ROOT/pages/2-monorepo-build-tool.adoc b/blog/modules/ROOT/pages/2-monorepo-build-tool.adoc index 899c41e8ed4..f21a1f01b09 100644 --- a/blog/modules/ROOT/pages/2-monorepo-build-tool.adoc +++ b/blog/modules/ROOT/pages/2-monorepo-build-tool.adoc @@ -22,7 +22,10 @@ Software build tools mostly fall into two categories: One question that comes up constantly is why do people use Monorepo build tools? Tools like Bazel are orders of magnitude more complicated and hard to use than tools like Poetry or Cargo, so why do people use them at all? -https://knowyourmeme.com/memes/is-he-stupid-is-she-smart-are-they-stupid[Are they stupid?] + +// end::header[] + + It turns out that Monorepo build tools like Bazel or Mill do a lot of non-obvious things that other build tools don't, that become important in larger codebases (100-10,000 active developers). @@ -32,7 +35,6 @@ features become critical. We'll explore some of the core features of "Monorepo B below, from the perspective of Bazel (which I am familiar with) and Mill (which this technical blog is about). 
-// end::header[]
 
 ## Support for Multiple Languages
 
diff --git a/blog/modules/ROOT/pages/3-selective-testing.adoc b/blog/modules/ROOT/pages/3-selective-testing.adoc
new file mode 100644
index 00000000000..8bd561b8599
--- /dev/null
+++ b/blog/modules/ROOT/pages/3-selective-testing.adoc
@@ -0,0 +1,537 @@
+// tag::header[]
+
+# Understanding Selective Testing
+
+
+:author: Li Haoyi
+:revdate: 24 December 2024
+_{author}, {revdate}_
+
+include::mill:ROOT:partial$gtag-config.adoc[]
+
+Selective testing is a key technique necessary for working with any large codebase
+or monorepo: picking which tests to run to validate a change or pull-request, because
+running every test every time is costly and slow. This blog post will explore what
+selective testing is all about and the different approaches you can take, based on my
+experience working on developer tooling for the last decade at Dropbox and Databricks.
+Lastly, we will discuss how selective testing is supported by the Mill build tool.
+
+// end::header[]
+
+## Large Codebases
+
+Although a codebase can be large, you are typically only working on a fraction of it
+at any point in time. As an example that we will use throughout this
+article, consider a large codebase or monorepo that contains the code for:
+
+1. 3 backend services: `backend_service_1`, `backend_service_2`, `backend_service_3`
+2. 2 frontend web codebases: `frontend_web_1`, `frontend_web_2`
+3. 1 `backend_service_utils` module, and 1 `frontend_web_utils` module
+4. 3 deployments: `deployment_1`, `deployment_2`, `deployment_3`, that make use of the
+   backend services and frontend web codebases.
+5. The three deployments may be combined into a `staging_environment`, which is then
+   used in three sets of ``end_to_end_test``s, one for each deployment
+
+These modules may depend on each other as shown in the diagram below,
+with the various `backend_service` and `frontend_web` codebases
+grouped into `deployments`, and the `_utils` files shared between them:
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]

    frontend_web_utils -> frontend_web_2 -> deployment_2
    frontend_web_utils -> frontend_web_1 -> deployment_1

    backend_service_utils -> backend_service_3 -> deployment_3
    backend_service_utils -> backend_service_2 -> deployment_2
    backend_service_utils -> backend_service_1 -> deployment_1
    deployment_1 -> staging_environment
    deployment_2 -> staging_environment
    deployment_3 -> staging_environment
    staging_environment -> end_to_end_test_1
    staging_environment -> end_to_end_test_2
    staging_environment -> end_to_end_test_3
}
```
+
+These various modules would typically each have their own test suite, which
+for simplicity I left out of the diagrams.
+
+Although the codebase described above is just an example, it reflects the kind of
+codebase structure that exists in many real-world codebases. Using this example,
+let us now consider a few ways in which selective testing can be used.
+
+## No Selective Testing
+
+Most software projects start off naively running every test on every pull-request
+to validate the changes. For small projects this is fine. Every project starts
+small and most projects stay small, so this approach is not a problem at all.
+
+However, for the projects that do continue growing, this strategy quickly
+becomes infeasible:
+
+* The size of the codebase (in number of modules, lines of code, or number of tests) grows
+  linearly `O(n)` over time
+* The number of tests necessary to validate any pull-request also grows linearly `O(n)`
+
+This means that although the test runs may start off running quickly, they naturally slow
+down over time:
+
+* A week-old project may have a test suite that runs in seconds
+* By a few months, the test suite may start taking minutes to run
+* After a few years, the tests may be taking over an hour
+* And there is no upper bound: in a large growing project test runs can easily
+  take several hours or even days to run
+
+Although at a small scale waiting seconds or minutes for tests to run is not a problem,
+the fact that it grows unbounded means it will _eventually_ become a problem on any growing
+codebase. Large codebases with test suites taking 4-8 hours to run are not uncommon at all, and
+this can become a real bottleneck in how fast developers can implement features or otherwise
+improve the functionality of the codebase.
+
+In general, "no selective testing" works fine for small projects with 1-10 developers, but
+beyond that the inefficiency of running tests not relevant to your changes starts noticeably
+slowing down day-to-day development. At that point, the first thing people reach for is
+some kind of folder-based selective testing.
+
+
+## Folder-based Selective Testing
+
+Typically, in any large codebase, most work happens within a single part of it: e.g. a
+developer may be working on `backend_service_1`, another may be working on `frontend_web_2`.
+The obvious thing to do would be to make sure each module is in its own folder, e.g.
+
```
my_repo/
    backend_service_1/
    backend_service_2/
    backend_service_3/
    frontend_web_1/
    frontend_web_2/
    backend_service_utils/
    frontend_web_utils/
    deployment_1/
    deployment_2/
    deployment_3/
    ...
```
+
+Simple folder-based selective testing then generally involves the following procedure
+(sketched in code at the end of this section):
+
+1. Run a `git diff` on the code change you want to validate
+2. For any folders which contain changed files, run their tests
+
+For example,
+
+- A PR changing `backend_service_1/` will need to run the `backend_service_1/` tests
+- A PR changing `frontend_web_2/` will need to run the `frontend_web_2/` tests
+
+This does limit the number of tests any individual needs to execute: someone working on
+`backend_service_1` does not need to run the tests for `backend_service_2`, `backend_service_3`,
+etc. This helps keep test times from growing unbounded as the monorepo grows.
+
+However, folder-based selective testing is not enough: it is possible that changing
+a module would not break its own tests, but it could cause problems with downstream modules that
+depend on it:
+
+1. Changing `backend_service_1` may require corresponding changes to `frontend_web_1`, and
+   if those changes are not coordinated the integration tests in `deployment_1` would fail
+
+2. Changing `frontend_web_utils` could potentially break both `frontend_web_1` and `frontend_web_2`
+   that depend on it, and thus we need to run the tests for both those modules to validate our change.
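+
+To make the folder-based procedure above concrete, here is what a minimal selection script
+could look like. This is only an illustrative sketch: the folder names come from the example
+codebase above, and the test commands are made-up placeholders rather than any real project's
+tooling.
+
```python
import subprocess

# Hypothetical mapping from top-level folder to the command that runs its tests
TEST_COMMANDS = {
    "backend_service_1": ["./run_backend_tests.sh", "backend_service_1"],
    "frontend_web_2": ["npm", "test", "--prefix", "frontend_web_2"],
    # ... one entry per folder in the repository
}

def changed_folders(base_branch="main"):
    # Ask git which files this pull-request touches relative to the target branch
    changed_files = subprocess.run(
        ["git", "diff", "--name-only", base_branch],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # The "owning" module is simply the top-level folder of each changed file
    return {path.split("/")[0] for path in changed_files if "/" in path}

def run_selected_tests():
    for folder in sorted(changed_folders()):
        if folder in TEST_COMMANDS:
            subprocess.run(TEST_COMMANDS[folder], check=True)
```
+
+Note that this only ever runs the changed folders' own tests: nothing in the script knows
+that, say, `frontend_web_1` depends on `frontend_web_utils`, which is exactly the gap
+discussed next.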
+
+In these cases, the failure mode of folder-based selective testing is:
+
+* You change code in a folder
+* _That folder's_ tests may pass
+* You merge your change into the main repository
+* Only after merging do you notice _other folders' tests_ failing, which you did not
+  notice up front because you didn't run their tests before merging. But because you
+  merged the breaking change, you have inconvenienced other people working in other
+  parts of the codebase
+* You now have to tediously revert your breaking change, or rush
+  a fix-forward to un-break the folders whose tests you broke, and unblock the developers
+  working in those folders
+
+Folder-based selective testing works fine for codebases with 10-100 developers: there are
+occasional cases where a breakage might slip through, but generally it's infrequent enough
+that it's tolerable. But as the development organization grows beyond 100, these breakages
+affect more and more people and become more and more painful. To resolve this, we need
+something more sophisticated.
+
+## Dependency-based Selective Testing
+
+To solve the problem of code changes potentially breaking downstream modules, we need to make
+sure that for every code change, we run both the tests for that module and every downstream
+test. For example, if we make a change to `backend_service_1`, we need to run the unit tests for
+`backend_service_1` as well as the integration tests for `deployment_1`:
+
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]

    frontend_web_utils -> frontend_web_2 -> deployment_2
    frontend_web_utils -> frontend_web_1 -> deployment_1

    backend_service_utils -> backend_service_3 -> deployment_3
    backend_service_utils -> backend_service_2 -> deployment_2

    backend_service_utils -> backend_service_1
    backend_service_1 [color=red, penwidth=2]
    deployment_1 [color=red, penwidth=2]
    backend_service_1 -> deployment_1 [color=red, penwidth=2]
}
```
+
+On the other hand, if we make a change to `frontend_web_utils`, we need to run the unit tests
+for `frontend_web_1` and `frontend_web_2`, as well as the integration tests for `deployment_1`
+and `deployment_2`, but _not_ `deployment_3` since (in this example) it doesn't depend on any frontend codebase:
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]
    frontend_web_utils [color=red, penwidth=2]
    frontend_web_2 [color=red, penwidth=2]
    deployment_2 [color=red, penwidth=2]
    frontend_web_1 [color=red, penwidth=2]
    deployment_1 [color=red, penwidth=2]

    frontend_web_utils -> frontend_web_1 -> deployment_1 [color=red, penwidth=2]
    frontend_web_utils -> frontend_web_2 -> deployment_2 [color=red, penwidth=2]

    backend_service_utils -> backend_service_3 -> deployment_3
    backend_service_utils -> backend_service_2 -> deployment_2
    backend_service_utils -> backend_service_1 -> deployment_1
}
```
+
+This kind of dependency-based selective test execution is generally straightforward:
+
+1. You need to know which modules own which source files (e.g. based on the folder)
+2. You need to know which modules depend on which other modules
+3. Run a `git diff` on the code change you want to validate
+4. For any modules which contain changed files, run a breadth-first traversal of the module graph
+5. For all the modules discovered during the traversal, run their tests
+
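+As a rough sketch of steps (3) to (5) above, the traversal could look something like the
+following. The module metadata here is hand-written for illustration (a small subset of the
+example graph above); a real setup would get it from the build tool rather than maintaining
+it by hand.
+
```python
import subprocess
from collections import deque

# Illustrative metadata: which folder each module owns, and which modules are
# directly downstream of (i.e. depend on) each module
MODULE_FOLDERS = {
    "backend_service_utils": "backend_service_utils/",
    "backend_service_1": "backend_service_1/",
    "deployment_1": "deployment_1/",
}
DOWNSTREAM = {
    "backend_service_utils": ["backend_service_1"],
    "backend_service_1": ["deployment_1"],
    "deployment_1": [],
}

def changed_modules(changed_files):
    return {
        module for module, folder in MODULE_FOLDERS.items()
        if any(path.startswith(folder) for path in changed_files)
    }

def affected_modules(roots):
    # Breadth-first traversal of the module graph, following downstream edges
    selected, queue = set(roots), deque(roots)
    while queue:
        module = queue.popleft()
        for downstream in DOWNSTREAM[module]:
            if downstream not in selected:
                selected.add(downstream)
                queue.append(downstream)
    return selected

changed_files = subprocess.run(
    ["git", "diff", "--name-only", "main"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
for module in sorted(affected_modules(changed_modules(changed_files))):
    print("would run tests for", module)  # e.g. invoke the build tool here
```
+
+With this metadata, changing a file under `backend_service_utils/` would select
+`backend_service_utils`, `backend_service_1`, and `deployment_1` for testing.
+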
+The algorithm itself (a breadth-first search over the module graph) is pretty trivial;
+the interesting part is generally how you know _"which modules own which source files"_ and
+_"which modules depend on which other modules"_.
+
+* For smaller projects this can be
+  managed manually in a Bash or Python script, e.g.
+  https://github.com/apache/spark/blob/290b4b31bae2e02b648d2c5ef61183f337b18f8f/dev/sparktestsupport/modules.py#L108-L126[this code in Apache Spark]
+  that manually maintains a list of source folders and dependencies per-module,
+  as well as what command in the underlying build tool you need to run
+  in order to test that module (`sbt_test_goals`):
+
```python
tags = Module(
    name="tags",
    dependencies=[],
    source_file_regexes=["common/tags/"],
)

utils = Module(
    name="utils",
    dependencies=[tags],
    source_file_regexes=["common/utils/"],
    sbt_test_goals=["common-utils/test"],
)

kvstore = Module(
    name="kvstore",
    dependencies=[tags],
    source_file_regexes=["common/kvstore/"],
    sbt_test_goals=["kvstore/test"],
)

...
```
+
+* In a larger project maintaining this information by hand is tedious and error prone,
+  so it is better to get the information from your build tool, which already has it
+  (e.g. via xref:mill::large/selective-execution.adoc[Mill Selective Execution]).
+
+### Limitations of Dependency-based Selective Testing
+
+Dependency-based selective test execution can get you pretty far: 100s to 1,000 developers
+working on a shared codebase. But it still has weaknesses, and as the number of
+developers grows beyond 1,000, you begin noticing issues and inefficiencies:
+
+1. *You are limited by the granularity of your module graph*. For example,
+   `backend_service_utils` may be used by all three ``backend_service``s,
+   but not _all_ of `backend_service_utils` is used by all three services. Thus
+   a change to `backend_service_utils` may result in running tests for all three
+   ``backend_service``s, even if the change does not actually affect a particular service
+
+2. *You may over-test things redundantly*. For example, a function in `backend_service_utils`
+   may be exhaustively tested in ``backend_service_utils``'s own test suite. If so, running
+   unit tests for all three ``backend_service``s
+   as well as integration tests for all three ``deployment``s may be unnecessary, as they
+   will just exercise code paths that are already exercised as part of the `backend_service_utils`
+   test suite
+
+
+
+These failure modes are especially problematic for integration or end-to-end tests.
+The nature of end-to-end tests is that they depend on _everything_, and so you find
+_any change_ in your codebase triggering _every end-to-end test_ to be run. These
+are also the slowest tests in your codebase, so running every end-to-end test every time
+you touch any line of code is extremely expensive and wasteful.
+
+For example, touching `backend_service_1` is enough to trigger all the `end_to_end` tests:
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]

    frontend_web_utils -> frontend_web_2 -> deployment_2
    frontend_web_utils -> frontend_web_1 -> deployment_1

    backend_service_utils -> backend_service_1
    backend_service_1 [color=red, penwidth=2]
    deployment_1 [color=red, penwidth=2]
    backend_service_1 -> deployment_1 [color=red, penwidth=2]
    backend_service_utils -> backend_service_2 -> deployment_2
    backend_service_utils -> backend_service_3 -> deployment_3

    deployment_1 -> staging_environment
    deployment_2 -> staging_environment
    deployment_3 -> staging_environment
    staging_environment [color=red, penwidth=2]
    end_to_end_test_1 [color=red, penwidth=2]
    end_to_end_test_2 [color=red, penwidth=2]
    end_to_end_test_3 [color=red, penwidth=2]

    staging_environment -> end_to_end_test_1 [color=red, penwidth=2]
    staging_environment -> end_to_end_test_2 [color=red, penwidth=2]
    staging_environment -> end_to_end_test_3 [color=red, penwidth=2]
}
```
+
+Touching `frontend_web_2` is also enough to trigger all the `end_to_end` tests:
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]

    frontend_web_2 [color=red, penwidth=2]
    frontend_web_2 -> deployment_2 [color=red, penwidth=2]
    frontend_web_utils -> frontend_web_1 -> deployment_1
    frontend_web_utils -> frontend_web_2
    deployment_2 [color=red, penwidth=2]
    backend_service_utils -> backend_service_1
    backend_service_1 -> deployment_1
    backend_service_utils -> backend_service_2 -> deployment_2
    backend_service_utils -> backend_service_3 -> deployment_3

    deployment_1 -> staging_environment
    deployment_2 -> staging_environment
    deployment_3 -> staging_environment
    staging_environment [color=red, penwidth=2]
    end_to_end_test_1 [color=red, penwidth=2]
    end_to_end_test_2 [color=red, penwidth=2]
    end_to_end_test_3 [color=red, penwidth=2]

    staging_environment -> end_to_end_test_1 [color=red, penwidth=2]
    staging_environment -> end_to_end_test_2 [color=red, penwidth=2]
    staging_environment -> end_to_end_test_3 [color=red, penwidth=2]
}
```
+
+The two examples above demonstrate both failure modes:
+
+1. `staging_environment` is very coarse-grained, causing all ``end_to_end_test``s to be run
+   even if they don't actually test the code in question
+
+2. Every `end_to_end_test` likely exercises the same setup/teardown/plumbing code,
+   in addition to the core logic under test, resulting in the same code being exercised
+   redundantly by many different tests
+
+This results in the selective testing system wasting both time and compute resources,
+running tests that aren't relevant or repeatedly testing the same code paths over and over.
+While there are ways to improve the granularity of the module graph to mitigate them, these
+issues are fundamental to dependency-based selective testing:
+
+* Dependency-based selective testing means _a problematic code change causes all affected tests to break_
+* For an effective CI system, all we need is that _every problematic code change breaks at least one test_
+
+Fundamentally, CI just needs to ensure that every problematic code change breaks _at least one thing_,
+because that is usually enough for the pull-request author to act on it and resolve the problem.
+Running more tests to display more breakages is usually a waste of time and resources.
Thus, although dependency-based selective testing helps, it still falls short of the ideal
+of how a CI system should behave.
+
+## Heuristic-based Selective Testing
+
+The next stage of selective testing that most teams encounter is using heuristics: these are
+ad-hoc rules that you put in place to decide what tests to run based on a code change.
+Common heuristics include:
+
+### Limiting Dependency Depth
+
+One common heuristic is to only run tests up to some depth `N` downstream of the changed
+modules. The chances are that a breakage in module X will be caught by X's own test suite, or
+the test suites of X's direct downstream modules, so we don't need to run every single
+transitive downstream module's test suite. For example, if we set `N = 1`, then a change to
+`backend_service_1` shown below will only run tests for `backend_service_1` and `deployment_1`,
+but not the ``end_to_end_test``s downstream of the `staging_environment`:
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]

    frontend_web_utils -> frontend_web_2 -> deployment_2
    frontend_web_utils -> frontend_web_1 -> deployment_1

    backend_service_utils -> backend_service_1
    backend_service_1 [color=red, penwidth=2]
    deployment_1 [color=red, penwidth=2]
    backend_service_1 -> deployment_1 [color=red, penwidth=2]
    backend_service_utils -> backend_service_2 -> deployment_2
    backend_service_utils -> backend_service_3 -> deployment_3

    deployment_1 -> staging_environment
    deployment_2 -> staging_environment
    deployment_3 -> staging_environment
    staging_environment -> end_to_end_test_1
    staging_environment -> end_to_end_test_2
    staging_environment -> end_to_end_test_3
}
```
+
+At my last job, we picked `N = 8` somewhat arbitrarily, but as a heuristic there is
+no "right" answer, and the exact value can be tuned to trade off thoroughness
+against test latency. The principle here is that "most" code is tested
+by its own tests and those of its direct dependents, so running tests for downstream
+modules which are too far removed in the dependency graph is "generally" not useful.
+
+### Hard-coding Dependency Relationships
+
+Another heuristic is to hard-code certain dependency relationships for the purposes of test
+selection. For example, if I change `deployment_1`, I can choose to ignore the `staging_environment`
+when finding downstream tests, and only run `end_to_end_test_1` since it contains the end-to-end
+tests for the `deployment_1` module.
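+
+To show how the two heuristics above might fit together, here is a small sketch extending the
+earlier traversal with a depth cutoff and a couple of hand-maintained overrides. The cutoff
+value and the override entries are arbitrary illustrations, not recommendations.
+
```python
from collections import deque

DOWNSTREAM = {
    "backend_service_1": ["deployment_1"],
    "deployment_1": ["staging_environment"],
    "staging_environment": ["end_to_end_test_1", "end_to_end_test_2", "end_to_end_test_3"],
}

# Heuristic 1: only walk N levels downstream of the changed modules
MAX_DEPTH = 2

# Heuristic 2: hand-maintained overrides. Don't fan out through modules we consider
# too coarse-grained, and instead jump straight to the tests we believe are relevant.
SKIP_FAN_OUT = {"staging_environment"}
EXTRA_DOWNSTREAM = {"deployment_1": ["end_to_end_test_1"]}

def select(changed, max_depth=MAX_DEPTH):
    selected, queue = set(changed), deque((module, 0) for module in changed)
    while queue:
        module, depth = queue.popleft()
        if depth >= max_depth:
            continue
        downstream = EXTRA_DOWNSTREAM.get(module, []) + [
            d for d in DOWNSTREAM.get(module, []) if d not in SKIP_FAN_OUT
        ]
        for dependent in downstream:
            if dependent not in selected:
                selected.add(dependent)
                queue.append((dependent, depth + 1))
    return selected

print(sorted(select({"deployment_1"})))
# ['deployment_1', 'end_to_end_test_1']
```
+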
We represent this hard-coded relationship by rendering `staging_environment`
+in dashed lines below, and adding additional arrows for the hard-coded dependencies, such as
+the one from `deployment_1` to `end_to_end_test_1`:
+
```graphviz
digraph G {
    rankdir=LR
    node [shape=box width=0 height=0 style=filled fillcolor=white]

    frontend_web_utils -> frontend_web_2 -> deployment_2
    frontend_web_utils -> frontend_web_1 -> deployment_1

    backend_service_utils -> backend_service_3 -> deployment_3
    backend_service_utils -> backend_service_2 -> deployment_2
    backend_service_utils -> backend_service_1

    backend_service_1 [color=red, penwidth=2]
    deployment_1 [color=red, penwidth=2]
    backend_service_1 -> deployment_1 [color=red, penwidth=2]


    deployment_3 -> staging_environment -> end_to_end_test_1 [style=dashed]

    deployment_2 -> staging_environment -> end_to_end_test_2 [style=dashed]
    deployment_1 -> staging_environment -> end_to_end_test_3 [style=dashed]
    staging_environment [style=dashed]


    deployment_1 -> end_to_end_test_1 [color=red, penwidth=2]
    end_to_end_test_1 [color=red, penwidth=2]
    end_to_end_test_2
    deployment_3 -> end_to_end_test_3
}
```
+
+This approach does work, as developers usually have _some_ idea of what tests
+should be run to exercise their application code. But maintaining these hard-coded
+dependency relationships is difficult in a large and evolving codebase:
+
+* Generally only the most senior engineers with the most experience working in the
+  codebase are able to make such judgements
+
+* The nature of heuristics is that there is no right or wrong answer, and it is difficult
+  to determine whether a selection of hard-coded dependencies is good or not
+
+Fundamentally, tweaking magic configuration values to try and optimize an unclear
+result is not a good use of developer time, but in some cases doing so may be necessary
+in large codebases to keep pull-request validation time and cost under control.
+
+### Machine-Learning-based Selective Testing
+
+The last option I've seen in the wild is machine-learning-based test selection. This
+has two steps:
+
+1. Train a machine learning model on the last 10,000 commits in the codebase, using each
+   commit's `git diff` and the tests that it broke
+
+2. For any new pull-request, feed the `git diff` into the model trained in (1) above
+   to get a list of likely-affected tests, and only run those
+
+This approach basically automates the manual tweaking described in
+<<Hard-coding Dependency Relationships,hard-coding dependency relationships>>, and instead of
+some senior engineer trying to use their intuition to guess what tests to run, the ML model
+does the same thing. You can then easily tune the model to optimize for different tradeoffs
+between latency and thoroughness, depending on what is important for your development team at
+any point in time.
+
+One downside of machine-learning-based selective testing is that the ML models are a black box,
+and you have very little idea of why they do what they do. However, all of the heuristics
+people use in <<Heuristic-based Selective Testing,heuristic-based selective testing>> are
+effectively black boxes anyway, since you'll be hard-pressed to come up with an answer for why
+someone <<Limiting Dependency Depth,limiting dependency depth>> decided to set `N = 8` rather
+than `7` or `9`, or why someone <<Hard-coding Dependency Relationships,hard-coding dependency
+relationships>> decided on the exact hard-coded config they ended up choosing.
+
+
+### Limitations of Heuristics
+
+The nature of heuristics is that they are _approximate_. That means that it is possible
+both that we run too many tests and waste time, and also that we run too few tests and allow
+a breakage to slip through. Typically this sort of heuristic is only used early in the testing
+pipeline, e.g.:
+
+1. When validating pull-requests, use heuristics to trim down the set of tests to run before merging
+2. After merging a pull-request, run a more thorough set of tests without heuristics
+   to catch any bugs that slipped through and prevent them from being shipped to customers
+3. If a bug is noticed during post-merge testing, bisect it and revert/fix the offending commit
+
+This may seem hacky and complicated, and bisecting/reverting commits post-merge can indeed
+waste a lot of time. But such a workflow is necessary in any large codebase and organization.
+The heuristics also do not need to be 100% precise, and as long as they are precise _enough_ that
+the time saved skipping tests outweighs the time spent dealing with post-merge breakages, it
+still ends up being worth it.
+
+## Selective Testing in Mill
+
+The Mill build tool's xref:mill::large/selective-execution.adoc[Selective Test Execution]
+supports <<Dependency-based Selective Testing,dependency-based selective testing>> out of the box.
+This makes it easy to set up CI for your Mill projects so that it only runs the tests that are
+downstream of the code you changed in a pull-request. Selective Test Execution in Mill is
+implemented at the _task level_, so even custom tasks and overrides can benefit from it. Mill's
+own pull-request validation jobs benefit greatly from selective testing.
+
+However, although Mill provides better support for selective testing than most build tools
+(which provide none), the <<Limitations of Dependency-based Selective Testing,limitations of
+dependency-based selective testing>> do cause issues.
+Even in Mill's own pull-request validation jobs, the fact that the most expensive integration
+or end-to-end tests are selected every time causes slowness. What kind of
+<<Heuristic-based Selective Testing,heuristics>> can help improve things remains an open question.
+
+If you are interested in build tools, especially where they apply to selective testing on
+large codebases and monorepos, you should definitely give the
+https://mill-build.org/[Mill Build Tool] a try! Mill's support for selective testing can
+definitely help you keep pull-request validation times reasonable in a large codebase or
+monorepo, which is key to keeping developers productive and development velocity fast
+even as the size and scope of a project grows.
diff --git a/blog/modules/ROOT/pages/index.adoc b/blog/modules/ROOT/pages/index.adoc
index 42571d4fd2f..4081e94046c 100644
--- a/blog/modules/ROOT/pages/index.adoc
+++ b/blog/modules/ROOT/pages/index.adoc
@@ -1,4 +1,4 @@
-# Mill Build Tool Engineering Blog
+# The Mill Build Engineering Blog
 
 include::mill:ROOT:partial$gtag-config.adoc[]
 
@@ -7,6 +7,10 @@ Welcome to the Mill development blog! This page contains posts and articles on
 technical topics related to the development of the Mill build tool. These
 discuss topics related to JVM languages and monorepo build tooling.
 
+include::3-selective-testing.adoc[tag=header,leveloffset=1]
+
+xref:3-selective-testing.adoc[Read More...]
+
 include::2-monorepo-build-tool.adoc[tag=header,leveloffset=1]
 
 xref:2-monorepo-build-tool.adoc[Read More...]
diff --git a/docs/modules/ROOT/pages/large/selective-execution.adoc b/docs/modules/ROOT/pages/large/selective-execution.adoc
index 665e2a323ba..f6da91ea78d 100644
--- a/docs/modules/ROOT/pages/large/selective-execution.adoc
+++ b/docs/modules/ROOT/pages/large/selective-execution.adoc
@@ -1,4 +1,4 @@
-= Selective Execution
+= Selective Test Execution
 
 include::partial$gtag-config.adoc[]
 
diff --git a/docs/package.mill b/docs/package.mill
index 7c97684c7ef..13e706d0ef6 100644
--- a/docs/package.mill
+++ b/docs/package.mill
@@ -200,7 +200,12 @@ object `package` extends RootModule {
     os.write.over(dest / "antora.yml", (lines ++ newLines).mkString("\n"))
   }
 
-  def blogFolder = Task.Source(build.millSourcePath / "blog")
+  def blogFolder0 = Task.Source(build.millSourcePath / "blog")
+  def blogFolder = Task{
+    os.copy(blogFolder0().path, Task.dest, mergeFolders = true)
+    expandDiagramsInDirectoryAdocFile(Task.dest, mill.main.VisualizeModule.classpath().map(_.path))
+    PathRef(Task.dest)
+  }
   def githubPagesPlaybookText(authorMode: Boolean) = T.task { extraSources: Seq[os.Path] =>
     val taggedSources = for (path <- extraSources) yield {
       s"""    - url: ${build.baseDir}
diff --git a/example/large/selective/9-selective-execution/build.mill b/example/large/selective/9-selective-execution/build.mill
index 095ee7675a3..4501a4cb22f 100644
--- a/example/large/selective/9-selective-execution/build.mill
+++ b/example/large/selective/9-selective-execution/build.mill
@@ -47,9 +47,35 @@ object bar extends MyModule
 
 object qux extends MyModule
 
+// ```graphviz
+// digraph G {
+//   rankdir=LR
+//   node [shape=box width=0 height=0 style=filled fillcolor=white]
+//   qux -> "qux.test"
+//   bar -> "bar.test"
+//   bar -> foo
+//   foo -> "foo.test"
+// }
+// ```
+//
+//
 // In this example, `qux.test` starts off failing with an error, while `foo.test` and
 // `bar.test` pass successfully. Normally, running `__.test` will run all three test
 // suites and show both successes and the one failure:
+//
+// ```graphviz
+// digraph G {
+//   rankdir=LR
+//   node [shape=box width=0 height=0 style=filled fillcolor=white]
+//   "qux.test" [color=red penwidth=2]
+//   qux -> "qux.test"
+//   bar -> "bar.test"
+//   bar -> foo
+//   foo -> "foo.test"
+//   "bar.test" [color=red penwidth=2]
+//   "foo.test" [color=red penwidth=2]
+// }
+// ```
 
 /** Usage
 
@@ -90,6 +116,24 @@ Test run bar.BarTests finished: 0 failed, 0 ignored, 1 total, ...
 
 */
 
+// As we only touched ``bar``'s source files, we only need to run the tests for ``bar`` and
+// its downstream module ``foo``:
+//
+// ```graphviz
+// digraph G {
+//   rankdir=LR
+//   node [shape=box width=0 height=0 style=filled fillcolor=white]
+//   qux -> "qux.test"
+//   bar [color=red penwidth=2]
+//   bar -> "bar.test" [color=red penwidth=2]
+//   foo [color=red penwidth=2]
+//   bar -> foo [color=red penwidth=2]
+//   foo -> "foo.test" [color=red penwidth=2]
+//   "bar.test" [color=red penwidth=2]
+//   "foo.test" [color=red penwidth=2]
+// }
+// ```
+
 // Similarly, if we make a change `qux/`, using selective execution will only run tests
 // in `qux.test`, and skip those in `foo.test` and `bar.test`.
 // These examples all use `__.test` to selectively run tasks named `.test`, but you can
@@ -106,3 +150,5 @@ Test run bar.BarTests finished: 0 failed, 0 ignored, 1 total, ...
 
 */
 
// execution (e.g.
perhaps you want to perform selective execution on pre-merge on pull // requests but not post-merge on the main branch) // +// Although selective execution is most commonly used for testing, it is a general-purpose +// tool that can be used to selectively run any Mill tasks based on the code that changed.