August 2024
This page summarizes the experiments conducted throughout the project.
We expect all our experiments to be reproducible in a qualitative sense. This means that if you run them with the same parameters, you should obtain the same results; running times may however vary between machines.
The above statement comes with a caveat: When using randomized symbolic
execution (apalache simulate
), there is still a chance that you might find an
invariant violation later, unless we support the invariant with an inductive
proof in another experiment.
Hardware. We conducted the experiments on a dedicated AMD Ryzen 9 5950X 16-Core Processor with 128GB of RAM. This powerful machine allowed us to run multiple experiments in parallel. Apalache is powered by the SMT solver Z3, which is known to have significant variations in running times. Therefore, instead of taking the time figures on this page literally, consider them as estimates such as "minutes", "hours", or "days".
Software. We are using the symbolic model checker Apalache for TLA+. Check the Apalache installation instructions in the manual. For reproducibility, we recommend using the latest release in the 0.45.x series. See Apalache releases. The experiments were run on Ubuntu Server 24.04.
In this document, we provide commands for running each experiment in a single thread. We also have an experimental setup that runs individual experiments on multiple CPU cores. If you are curious about this setup, please contact us.
We have produced the following specifications in the project:
-
Spec 1. This is the specification spec1-2/ffg_recursive.tla. It is the result of a manual mechanical translation of the original executable specification in Python, which can be found in ffg.py. This specification is using mutually recursive operators, which are not supported by Apalache. As a result, we are not checking this specification. This specification is the result of our work in Milestone 1.
-
Spec 2. This is the specification spec1-2/ffg.tla. It is a manual adaptation of
Spec 1
. The main difference betweenSpec 2
andSpec 1
is thatSpec 2
uses "folds" (also known as "reduce") instead of recursion. We expectSpec 2
to be equivalent toSpec 1
, although we do not have a formal proof of this equivalence. This specification has high model-checking complexity, roughly speaking, as it contains multiple nested folds. In Milestone 2, we have introduced an optimization that avoids repetitive recursive computations by memorization in the initial states, see PR #38. This specification also contains a preliminary construction that could help us in provingAccountableSafety
. Even considering all the optimizations, the model checker is still consuming a lot of resources (>40G) and computing for over 40h. This specification is the result of our work in Milestone 1 and Milestone 2. -
Spec 3. This is the specification spec3/ffg.tla. It is a manual abstraction of
Spec 2
that is further optimized for constraint solving, especially with Apalache.Spec 3
describes a state machine that adds blocks and votes in every step, and thus enables bounded model-checking.Spec 3
contains a preliminary inductive construction in the initial-state predicate. This is ongoing work in Milestone 4. -
Spec 4. Our experiments in Milestone 2 and Milestone 4 have shown that all specs to this point encode a graph problem of intractable complexity. We experimentally support this claim by providing specifications in Alloy and SMT, which are both are geared at proving relational properties over (small) finite sets and relations.
Spec 4
is a manual reformulation of the problem that avoids expensive graph constructions and is highly optimized for model checking. We expect that this specification allows us to us to proveAccountableSafety
inductively. This is ongoing work in Milestone 4.
The specifications Spec 2
and Spec 3
come with model checking instances
spec1-2/MC_ffg.tla and spec3/MC_ffg.tla, respectively. These instances
fix the specification parameters to small values for model checking purposes,
e.g.:
- four validators,
- from 4 to 10 slots,
- up to 12 votes.
By using small values for the parameters, we are able to produce counterexamples in a reasonable time when such examples exist. By increasing the parameter values, we can increase our confidence in the specification and its properties. While this restriction to small parameter values is dictated by the practicality of our approach, it is informally supported by the "small scope hypothesis," which was popularized by tools such as Alloy.
The following table summarizes the experimental figures in one place:
Experiment | Specification | Property | Time |
---|---|---|---|
4.1 | Spec 2 (w/o PR #38) |
AccountableSafety |
out of memory (>20GB)1 |
4.2.1 | Spec 2 (with PR #38) |
AccountableSafety |
timeout (>40h) |
4.2.2 | Spec 2 (with PR #38) |
Conflicting_Example |
timeout (>40h) |
4.2.2 | Spec 2 (with PR #38) |
Finalized_And_Conflicting_Blocks_Example |
timeout (>40h) |
5.1 | Spec 3 |
AccountableSafety |
TBD (Milestone 4) |
5.2 | Spec 3 |
AccountableSafety |
TBD (Milestone 4) |
6.1 | Spec 4 |
AccountableSafety |
TBD (Milestone 4) |
6.2 | Spec 4 |
AccountableSafety |
TBD (Milestone 4) |
Incomplete results on fixed block graphs (§ 4.3):
Experiment | Specification | Block graph | Property | Time |
---|---|---|---|---|
4.3.1 | Spec 2 (with PR #38) |
SingleChain |
Conflicting_Example |
1 min 3 sec |
4.3.1 | Spec 2 (with PR #38) |
ShortFork |
Conflicting_Example |
52 sec |
4.3.1 | Spec 2 (with PR #38) |
Forest |
Conflicting_Example |
2 min 21 sec |
4.3.2 | Spec 2 (with PR #38) |
SingleChain |
Finalized_And_Conflicting_Blocks_Example |
1 min 5 sec |
4.3.2 | Spec 2 (with PR #38) |
ShortFork |
Finalized_And_Conflicting_Blocks_Example |
10 hours 49 min 47 sec |
4.3.2 | Spec 2 (with PR #38) |
Forest |
Finalized_And_Conflicting_Blocks_Example |
timeout (>40h) |
4.3.3 | Spec 2 (with PR #38) |
SingleChain |
AccountableSafety |
1 min 13 sec |
4.3.3 | Spec 2 (with PR #38) |
ShortFork |
AccountableSafety |
timeout (>40h) |
4.3.3 | Spec 2 (with PR #38) |
Forest |
AccountableSafety |
timeout (>40h) |
We describe model checking experiments with Spec 2
, that is spec1-2/ffg.tla.
Spec 2
is a manual adaptation of Spec 1
that introduces (equivalent) fold
operations instead of recursion (which is not supported by Apalache).
Initial experiments showed that this naive translation quickly1 goes out of memory even with the JVM heap size increased to 20GB (from Apalche's default of 4GB), due to the number of constraints emitted for nested folds (originating from nested recursion in the Python specification). In our experience, it does not help to increase the JVM heap size even further, since for that many constraints, solving time would be prohibitive.
Therefore, we introduced a further manual optimization that flattens nested fold
operations, and allows Apalache to run within 20GB of JVM heap memory. To
flatten nested folds, we introduce TLA+ state variables that memorize the
fold-based operations is_ancestor_descendant_relationship
and
is_justified_checkpoint
(is_complete_chain
is expressed in terms of ancestry
of genesis).
We evaluated 3 startegies for expressing these precomputed state variables and continued with the most efficient one. For details on these experiments, see the description of PR #38.
In this experiment, we are checking whether Accountable Safety holds true for
the flattened version of Spec 2
.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --inv=AccountableSafety MC_ffg.tla
Tip
Since Spec 2
presents a snapshot of a single validator and does not make any
steps, we only have to check properties on the initial states (--length=0
).
Also, note that we have to extend the default JVM heap size from 4G to 20G (JVM_ARGS=-Xmx20G
).
Resulting from the high complexity of Spec 2
, this experiment takes a long time to complete.
We declare it timed out after 40 hours of runtime.
Given the result above, we evaluate Spec 2
on finding reachable protocol states.
To this end, we introduce falsy invariants describing reachable protocol states
in spec1-2/MC_ffg_examples.tla. Apalache should report these example states as
counterexamples to the supplied property.
This serves two purposes:
- it ascertains the correctness of the specification (protocol states that we know to be reachable should be found),
- it evaluates
Spec 2
on a simpler problem (the underlying SMT query is SAT, not UNSAT).
For example, to find a protocol state with two conflicting blocks, we can run:
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --inv=Conflicting_Example MC_ffg_examples.tla
To find a protocol state with two conflicting finalized blocks, we can run:
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --inv=Finalized_And_Conflicting_Blocks_Example MC_ffg_examples.tla
Due to the high complexity of Spec 2
, even these experiments result in a time out after 40 hours.
The results above are not surprising – the solver has to consider both reachability properties for all possible block graphs, and all possible FFG voting scenarios on top of these graphs.
To further evaluate Spec 2
, we fix the block graph – this way the solver
only has to reason about voting. Example fixed block graphs are encoded in
spec1-2/MC_ffg_examples.tla.
We consider
- a single, linear chain (
SingleChain
) - a short chain with a fork (
ShortFork
) - a forest of disconnected chains (
Forest
)
Tip
We supply these alternative initial states to Apalache via --init
.
To find a protocol state with two conflicting blocks, run:
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_SingleChain --inv=Conflicting_Example MC_ffg_examples.tla
Note
The model checker (correctly) does not report a counterexample with
Init_SingleChain
, as there cannot be any conflicting blocks on a single
linear chain.
This experiment took 1 min 3 sec.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_ShortFork --inv=Conflicting_Example MC_ffg_examples.tla
This experiment took 52 sec.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_Forest --inv=Conflicting_Example MC_ffg_examples.tla
This experiment took 2 min 21 sec.
To find a protocol state with two conflicting blocks, run:
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_SingleChain --inv=Finalized_And_Conflicting_Blocks_Example MC_ffg_examples.tla
Note
The model chcker (correctly) does not report a counterexample with
Init_SingleChain
, as there cannot be any conflicting blocks on a single
linear chain.
This experiment took 1 min 5 sec.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_ShortFork --inv=Finalized_And_Conflicting_Blocks_Example MC_ffg_examples.tla
This experiment took 10 hours 49 min 47 sec.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_Forest --inv=Finalized_And_Conflicting_Blocks_Example MC_ffg_examples.tla
This experiment timed out after 40 hours.
To check accountable safety on these fixed block graphs, run:
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_SingleChain --inv=AccountableSafety MC_ffg_examples.tla
Note
Accountable safety trivially holds on this example, as there are no conflicting blocks on a single linear chain.
This experiment took 1 min 13 sec.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_ShortFork --inv=AccountableSafety MC_ffg_examples.tla
This experiment timed out after 40 hours.
$ cd ./spec
$ JVM_ARGS=-Xmx20G apalache-mc check --length=0 --init=Init_Forest --inv=AccountableSafety MC_ffg_examples.tla
This experiment timed out after 40 hours.
In this experiment, we aim at finding an example of at least one justified checkpoint that is different from genesis:
$ apalache-mc check --length=20 --inv=ExistsJustifiedNonGenesisInv MC_ffg.tla
It takes 5 seconds to find an example.
$ apalache-mc check --length=20 --inv=ExistTwoFinalizedConflictingBlocks MC_ffg.tla
It takes 1 hour 27 minutes to find an example.
In this experiment, we are checking whether Accountable Safety holds true for up
to 10 steps of Spec 3
. Since Spec 3
specifies the abstract state machine of
a single validator (alternatively, the network view), we have to check
executions up to some bound.
$ cd ./spec3
$ JVM_ARGS=-Xmx20G apalache-mc check --length=6 --inv=AccountableSafety MC_ffg.tla
This experiment takes about TODO.
To obtain results faster, albeit at the cost of completeness, we analyze 100 random symbolic runs, each consisting of up to 10 steps:
$ cd ./spec3
$ JVM_ARGS=-Xmx20G apalache-mc simulate \
--max-run=100 --length=6 \
--inv=AccountableSafety MC_ffg.tla
This command typically finds counterexamples much quicker than check
.
Therefore, when simulate
does not find counterexamples, it gives us relatively
high confidence that the property holds true.
This experiment takes about 37 minutes on a single node. The interesting thing is that we can parallelize the enumeration of 100 runs across multiple cores. For instance, we use 20 CPU cores to check 5 symbolic runs each.
See the detailed experiments in README.md.
See the detailed experiments in README.md.
In this experiment, we show that IndInv
is an inductive invariant for Spec 4
:
$ cd ./spec4
$ apalache-mc check --length=0 --init=Init --inv=IndInv \
--next=Next MC_ffg_b3_ffg5_v12.tla
This experiment takes about 2 seconds.
$ cd ./spec4
$ apalache-mc check --length=1 --inv=IndInv \
--init=Init --next=Next MC_ffg_b3_ffg5_v12.tla
This experiment takes about 2 seconds.
$ cd ./spec4
$ JVM_ARGS=-Xmx20G apalache-mc check \
--length=0 --init=IndInv --inv=AccountableSafety \
MC_ffg_b3_ffg5_v12.tla
This experiment did not finish after 6 days.
We ran the experiments using the following scripts:
-
check-inductive.sh to check inductiveness of our invariants.
-
check-accountable-safety.sh to check accountable safety against the inductive invariant.
The table below summarizes the experiments with inductiveness checking:
Instance | Init | Invariant | Memory | Time |
---|---|---|---|---|
MC_ffg_b1_ffg5_v12 | Init | IndInv | 580 MB | 7s |
MC_ffg_b3_ffg5_v12 | Init | IndInv | 700 MB | 7s |
MC_ffg_b1_ffg5_v12 | Init_C1 | IndInv | 1.4 GB | 2min 8s |
MC_ffg_b3_ffg5_v12 | Init_C1 | IndInv | 1.8 GB | 19min 10s |
MC_ffg_b3_ffg5_v12 | Init_C2 | IndInv | 1.6 GB | 13min 16s |
MC_ffg_b3_ffg5_v12 | Init_C3 | IndInv | 1.6 GB | 17min 39s |
MC_ffg_b3_ffg5_v12 | Init_C4 | IndInv | 1.6 GB | 16min 23s |
The table below summarizes the experiments with accountable safety:
Instance | Init | Invariant | Memory | Time |
---|---|---|---|---|
MC_ffg-b1_ffg5_v12 | Init_C1 | AccountableSafety | 1.2 GB | 11h 31min |
MC_ffg-b3_ffg5_v12 | Init_C4 | AccountableSafety | 1.4 GB | 6d 16h |
MC_ffg-b3_ffg5_v12 | Init_C2 | AccountableSafety | 1.3 GB | 1day 6h |
MC_ffg-b3_ffg5_v12 | Init_C3 | AccountableSafety | 1.2 GB | 1h 53min |
MC_ffg-b3_ffg5_v12 | Init_C1 | AccountableSafety | 1.5 GB | TO (> 6d) |