docs(frontend): add performance tips section

zama-ai · Aug 2, 2024 · 2968e80 · 2968e80
1 parent dfaf987
commit 2968e80
Showing 16 changed files with 687 additions and 0 deletions.
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
@@ -35,6 +35,7 @@
 
 * [Simulation](execution-analysis/simulation.md)
 * [Debugging and artifact](execution-analysis/debug.md)
+* [Performance](optimization/summary.md)
 * [GPU acceleration](execution-analysis/gpu_acceleration.md)
 * Other
   * [Statistics](compilation/statistics.md)
@@ -46,6 +47,19 @@
 * [Configure](guides/configure.md)
 * [Manage keys](guides/manage_keys.md)
 * [Deploy](guides/deploy.md)
+* [Optimization](optimization/self.md)
+  * [Improve parallelism](optimization/improve-parallelism/self.md)
+    * [Dataflow parallelism](optimization/improve-parallelism/dataflow.md)
+    * [Tensorizing operations](optimization/improve-parallelism/tensorization.md)
+  * [Optimize table lookups](optimization/optimize-table-lookups/self.md)
+    * [Reducing TLU](optimization/optimize-table-lookups/reducing-amount.md)
+    * [Implementation strategies](optimization/optimize-table-lookups/strategies.md)
+    * [Round/truncating](optimization/optimize-table-lookups/round-truncate.md)
+    * [Approximate mode](optimization/optimize-table-lookups/approximate.md)
+    * [Bit extraction](optimization/optimize-table-lookups/bit-extraction.md)
+  * [Optimize cryptographic parameters](optimization/optimize-cryptographic-parameters/self.md)
+    * [Error probability](optimization/optimize-cryptographic-parameters/p-error.md)
+    * [Composition](optimization/optimize-cryptographic-parameters/composition.md)
 
 ## Tutorials
 

diff --git a/docs/_static/compilation/performance_tips/complexity_and_timing_per_bit_width.png b/docs/_static/compilation/performance_tips/complexity_and_timing_per_bit_width.png
diff --git a/docs/optimization/improve-parallelism/dataflow.md b/docs/optimization/improve-parallelism/dataflow.md
@@ -0,0 +1,102 @@
+### Enabling dataflow parallelism
+
+This guide teaches what dataflow parallelism is and how it can improve the execution time of Concrete circuits.
+
+Dataflow parallelism is particularly useful when the circuit performs many scalar operations.
+
+Without dataflow parallelism, the circuit is executed operation by operation, like a program in an imperative language. If the operations themselves are not tensorized, loop parallelism is not utilized and the entire execution happens in a single thread. Dataflow parallelism changes this by analyzing the operations and their dependencies within the circuit to determine what can be done in parallel and what cannot. It then distributes the tasks that can be done in parallel to different threads.
+
+For example:
+
+```python
+import time
+
+import numpy as np
+from concrete import fhe
+
+def f(x, y, z):
+    # normally, you'd use fhe.array to construct a concrete tensor
+    # but for this example, we just create a simple numpy array
+    # so the matrix multiplication can happen on a cellular level
+    a = np.array([[x, y], [z, 2]])
+    b = np.array([[1, x], [z, y]])
+    return fhe.array(a @ b)
+
+inputset = fhe.inputset(fhe.uint3, fhe.uint3, fhe.uint3)
+
+for dataflow_parallelize in [False, True]:
+    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
+    circuit = compiler.compile(inputset, dataflow_parallelize=dataflow_parallelize)
+
+    circuit.keygen()
+    for sample in inputset[:3]:  # warmup
+        circuit.encrypt_run_decrypt(*sample)
+
+    timings = []
+    for sample in inputset[3:13]:
+        start = time.time()
+        result = circuit.encrypt_run_decrypt(*sample)
+        end = time.time()
+
+        assert np.array_equal(result, f(*sample))
+        timings.append(end - start)
+
+    if not dataflow_parallelize:
+        print(f"without dataflow parallelize -> {np.mean(timings):.03f}s")
+    else:
+        print(f"   with dataflow parallelize -> {np.mean(timings):.03f}s")
+```
+
+prints:
+
+```
+without dataflow parallelize -> 0.609s
+   with dataflow parallelize -> 0.414s
+```
+
+and the reason for that is:
+
+```
+// this is the generated MLIR for the circuit
+// without dataflow, every single line would be executed one after the other
+
+module {
+  func.func @main(%arg0: !FHE.eint<7>, %arg1: !FHE.eint<7>, %arg2: !FHE.eint<7>) -> tensor<2x2x!FHE.eint<7>> {
+  
+    // but if you look closely, you can see that this multiplication
+    %c1_i2 = arith.constant 1 : i2
+    %0 = "FHE.mul_eint_int"(%arg0, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
+    
+    // is completely independent of this one, so dataflow makes them run in parallel
+    %1 = "FHE.mul_eint"(%arg1, %arg2) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    
+    // however, this addition needs the first two operations
+    // so dataflow waits until both are done before performing this one
+    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    
+    // lastly, this multiplication is completely independent from the first three operations
+    // so its execution starts in parallel when execution starts with dataflow
+    %3 = "FHE.mul_eint"(%arg0, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    
+    // similar logic can be applied to the remaining operations...
+    %4 = "FHE.mul_eint"(%arg1, %arg1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    %5 = "FHE.add_eint"(%3, %4) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    %6 = "FHE.mul_eint_int"(%arg2, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
+    %c2_i3 = arith.constant 2 : i3
+    %7 = "FHE.mul_eint_int"(%arg2, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
+    %8 = "FHE.add_eint"(%6, %7) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    %9 = "FHE.mul_eint"(%arg2, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    %10 = "FHE.mul_eint_int"(%arg1, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
+    %11 = "FHE.add_eint"(%9, %10) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
+    %from_elements = tensor.from_elements %2, %5, %8, %11 : tensor<2x2x!FHE.eint<7>>
+    return %from_elements : tensor<2x2x!FHE.eint<7>>
+    
+  }
+}
+```
+
+To summarize, dataflow analyzes the circuit to determine which parts of the circuit can be run at the same time, and tries to run as many operations as possible in parallel.
+
+{% hint style="warning" %}
+When the circuit is tensorized, dataflow might slow execution down: tensor operations already use multiple threads, and adding dataflow on top makes HPX (the dataflow parallelism runtime) and OpenMP (the loop parallelism runtime) compete for the same CPU cores. Try both before deciding whether to use dataflow.
+{% endhint %}
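The dependency analysis described in this guide can be mimicked in plain Python. The sketch below is only a model, not how HPX actually schedules work: it transcribes the data dependencies from the MLIR listing and groups operations into "waves" whose members do not depend on each other. The names `DEPENDENCIES` and `parallel_waves` are illustrative, and constants and function arguments are left out of the graph.

```python
from graphlib import TopologicalSorter

# Operation -> operations whose results it consumes, transcribed from the
# MLIR listing in the guide (constants and arguments are omitted).
DEPENDENCIES = {
    "%0": [], "%1": [], "%2": ["%0", "%1"],
    "%3": [], "%4": [], "%5": ["%3", "%4"],
    "%6": [], "%7": [], "%8": ["%6", "%7"],
    "%9": [], "%10": [], "%11": ["%9", "%10"],
    "%from_elements": ["%2", "%5", "%8", "%11"],
}


def parallel_waves(dependencies):
    """Group operations into waves whose members do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # unblock the operations that were waiting
    return waves


waves = parallel_waves(DEPENDENCIES)
for index, wave in enumerate(waves):
    print(f"wave {index}: {', '.join(wave)}")
```

In this model the eight multiplications form the first wave, the four additions the second, and `tensor.from_elements` the third, which matches the comments in the MLIR listing. A real dataflow runtime starts a task as soon as its own inputs are ready rather than waiting for a whole wave.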
diff --git a/docs/optimization/improve-parallelism/self.md b/docs/optimization/improve-parallelism/self.md
@@ -0,0 +1,11 @@
+## Improve parallelism
+
+This guide teaches the different options for parallelism in Concrete and how to utilize them to improve the execution time of Concrete circuits.
+
+Modern CPUs have multiple cores to perform computation and utilizing multiple cores is a great way to boost performance.
+
+There are two kinds of parallelism in Concrete:
+- Loop parallelism to make tensor operations parallel, achieved by using [OpenMP](https://www.openmp.org/)
+- Dataflow parallelism to make independent operations parallel, achieved by using [HPX](https://hpx.stellar-group.org/)
+
+Loop parallelism is enabled by default, as it's supported on all platforms. Dataflow parallelism, however, is only supported on Linux, so it is not enabled by default.
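Since dataflow parallelism is Linux-only, one way to keep a script portable is to decide on the option from the platform. This is a minimal sketch under assumptions: `default_compile_options` is a hypothetical helper, not part of Concrete, and `dataflow_parallelize` is the compile option used in the dataflow guide of this change.

```python
import sys


def default_compile_options(platform=sys.platform):
    """Return compile keyword arguments, enabling dataflow only on Linux."""
    return {"dataflow_parallelize": platform.startswith("linux")}


print(default_compile_options())
```

The result could then be passed along as `compiler.compile(inputset, **default_compile_options())`. Whether dataflow actually helps still depends on the circuit, so it is worth measuring both settings.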
diff --git a/docs/optimization/improve-parallelism/tensorization.md b/docs/optimization/improve-parallelism/tensorization.md
@@ -0,0 +1,55 @@
+### Tensorizing operations
+
+This guide teaches what tensorization is and how it can improve the execution time of Concrete circuits.
+
+Tensors should be used instead of scalars when possible to maximize loop parallelism.
+
+For example:
+
+```python
+import time
+
+import numpy as np
+from concrete import fhe
+
+inputset = fhe.inputset(fhe.uint6, fhe.uint6, fhe.uint6)
+for tensorize in [False, True]:
+    def f(x, y, z):
+        return (
+            np.sum(fhe.array([x, y, z]) ** 2)
+            if tensorize
+            else (x ** 2) + (y ** 2) + (z ** 2)
+        )
+
+    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
+    circuit = compiler.compile(inputset)
+
+    circuit.keygen()
+    for sample in inputset[:3]:  # warmup
+        circuit.encrypt_run_decrypt(*sample)
+
+    timings = []
+    for sample in inputset[3:13]:
+        start = time.time()
+        result = circuit.encrypt_run_decrypt(*sample)
+        end = time.time()
+
+        assert np.array_equal(result, f(*sample))
+        timings.append(end - start)
+
+    if not tensorize:
+        print(f"without tensorization -> {np.mean(timings):.03f}s")
+    else:
+        print(f"   with tensorization -> {np.mean(timings):.03f}s")
+```
+
+prints:
+
+```
+without tensorization -> 0.214s
+   with tensorization -> 0.118s
+```
+
+{% hint style="info" %}
+Enabling dataflow parallelism is, in effect, letting the runtime find this parallelism for you. It would also help in this specific case.
+{% endhint %}
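The warmup-then-measure loop used in the example can be factored into a small helper. The sketch below is an illustration, not part of Concrete: `mean_runtime` is a hypothetical name, it uses `time.perf_counter` rather than the `time.time` of the example, and a plain Python function stands in for `circuit.encrypt_run_decrypt` so the block runs without Concrete installed.

```python
import time
from statistics import mean


def mean_runtime(run, samples, warmup=3):
    """Average wall-clock seconds of run(*sample), skipping warmup samples."""
    for sample in samples[:warmup]:
        run(*sample)  # warmup calls are executed but not timed

    timings = []
    for sample in samples[warmup:]:
        start = time.perf_counter()
        run(*sample)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Stand-in workload; with Concrete this would be circuit.encrypt_run_decrypt.
samples = [(i, i + 1, i + 2) for i in range(13)]
elapsed = mean_runtime(lambda x, y, z: x**2 + y**2 + z**2, samples)
print(f"{elapsed:.6f}s")
```

With a compiled circuit, the call would look like `mean_runtime(circuit.encrypt_run_decrypt, inputset[:13])`, which keeps the scalar and tensorized measurements on the same footing.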
diff --git a/docs/optimization/optimize-cryptographic-parameters/composition.md b/docs/optimization/optimize-cryptographic-parameters/composition.md
@@ -0,0 +1,65 @@
+### Specifying composition when using modules
+
+This guide explains how to optimize cryptographic parameters by specifying composition when using [modules](../../compilation/composing_functions_with_modules.md).
+
+When using [modules](../../compilation/composing_functions_with_modules.md), make sure to specify [composition](../../compilation/composing_functions_with_modules.md#optimizing-runtimes-with-composition-policies) so that the compiler can select better parameters based on how the functions in the module will be used.
+
+For example:
+
+```python
+import numpy as np
+from concrete import fhe
+
+
+@fhe.module()
+class PowerWithoutComposition:
+    @fhe.function({"x": "encrypted"})
+    def square(x):
+        return x ** 2
+
+    @fhe.function({"x": "encrypted"})
+    def cube(x):
+        return x ** 3
+
+without_composition = PowerWithoutComposition.compile(
+    {
+        "square": fhe.inputset(fhe.uint2),
+        "cube": fhe.inputset(fhe.uint4),
+    }
+)
+print(f"without composition -> {int(without_composition.complexity):>10_} complexity")
+
+
+@fhe.module()
+class PowerWithComposition:
+    @fhe.function({"x": "encrypted"})
+    def square(x):
+        return x ** 2
+
+    @fhe.function({"x": "encrypted"})
+    def cube(x):
+        return x ** 3
+
+    composition = fhe.Wired(
+        [
+            fhe.Wire(fhe.Output(square, 0), fhe.Input(cube, 0))
+        ]
+    )
+
+with_composition = PowerWithComposition.compile(
+    {
+        "square": fhe.inputset(fhe.uint2),
+        "cube": fhe.inputset(fhe.uint4),
+    }
+)
+print(f"   with composition -> {int(with_composition.complexity):>10_} complexity")
+```
+
+prints:
+
+```
+without composition -> 185_863_835 complexity
+   with composition -> 135_871_612 complexity
+```
+
+which means specifying composition reduced the complexity of computing `cube(square(x))` by about 27%.
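The relative change between the two printed figures can be checked with a few lines of arithmetic. The helper name `relative_reduction` is illustrative; the two numbers are the complexities printed by the example.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


without_composition = 185_863_835
with_composition = 135_871_612

reduction = relative_reduction(without_composition, with_composition)
ratio = without_composition / with_composition

print(f"reduction in complexity: {reduction:.1%}")
print(f"ratio without/with:      {ratio:.2f}x")
```

Reporting both forms avoids ambiguity: a reduction of about 27% corresponds to the module without composition being about 1.37x as complex.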
diff --git a/docs/optimization/optimize-cryptographic-parameters/p-error.md b/docs/optimization/optimize-cryptographic-parameters/p-error.md
@@ -0,0 +1,31 @@
+### Adjusting table lookup error probability
+
+This guide teaches how setting the `p_error` configuration option can affect the performance of Concrete circuits.
+
+Adjusting table lookup error probability is discussed extensively in the [Table lookup exactness](../../core-features/table_lookups_advanced.md#table-lookup-exactness) section. The idea is to sacrifice exactness to gain performance.
+
+For example:
+
+```python
+import numpy as np
+from concrete import fhe
+
+def f(x, y):
+    return (x // 2) * (y // 3)
+
+inputset = fhe.inputset(fhe.uint4, fhe.uint4)
+for p_error in [(1 / 1_000_000), (1 / 100_000), (1 / 10_000), (1 / 1_000), (1 / 100)]:
+    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
+    circuit = compiler.compile(inputset, p_error=p_error)
+    print(f"p_error of {p_error:.6f} -> {int(circuit.complexity):_} complexity")
+```
+
+prints:
+
+```
+p_error of 0.000001 -> 294_773_524 complexity
+p_error of 0.000010 -> 286_577_520 complexity
+p_error of 0.000100 -> 275_887_080 complexity
+p_error of 0.001000 -> 265_196_640 complexity
+p_error of 0.010000 -> 184_144_972 complexity
+```
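To get a feel for what is being traded away, the per-lookup error probability can be turned into a rough per-run figure. This is a back-of-the-envelope sketch under a loudly stated assumption: every table lookup fails independently with probability `p_error`. The lookup count used here is hypothetical and is not read from the compiled circuit, and `any_lookup_fails` is an illustrative name.

```python
def any_lookup_fails(p_error, lookups):
    """Probability that at least one of `lookups` independent lookups is wrong."""
    return 1 - (1 - p_error) ** lookups


LOOKUPS = 4  # hypothetical count, not taken from the compiled circuit
for p_error in [1 / 1_000_000, 1 / 100_000, 1 / 10_000, 1 / 1_000, 1 / 100]:
    per_run = any_lookup_fails(p_error, LOOKUPS)
    print(f"p_error of {p_error:.6f} -> {per_run:.6f} chance of an inexact run")
```

For small `p_error` the result is close to `lookups * p_error`, so circuits with many table lookups feel a relaxed `p_error` proportionally more.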
diff --git a/docs/optimization/optimize-cryptographic-parameters/self.md b/docs/optimization/optimize-cryptographic-parameters/self.md
@@ -0,0 +1,5 @@
+## Optimize cryptographic parameters
+
+This guide teaches how to help the Concrete Optimizer select more performant parameters to improve the execution time of Concrete circuits.
+
+The idea is to obtain more optimal cryptographic parameters (especially for table lookups) without changing the operations within the circuit.
diff --git a/docs/optimization/optimize-table-lookups/approximate.md b/docs/optimization/optimize-table-lookups/approximate.md
@@ -0,0 +1,36 @@
+### Activating approximate mode for rounding
+
+This guide teaches how to improve the execution time of Concrete circuits by using approximate mode for rounding.
+
+You can enable [approximate mode](../../core-features/rounding.md#exactness) to gain even more performance when using rounding by sacrificing some more exactness:
+
+```python
+import numpy as np
+from concrete import fhe
+
+inputset = fhe.inputset(fhe.uint10)
+for lsbs_to_remove in range(0, 10):
+    def f(x):
+        return fhe.round_bit_pattern(x, lsbs_to_remove, exactness=fhe.Exactness.APPROXIMATE) // 2
+
+    compiler = fhe.Compiler(f, {"x": "encrypted"})
+    circuit = compiler.compile(inputset)
+
+    print(f"{lsbs_to_remove=} -> {int(circuit.complexity):>13_} complexity")
+
+```
+
+prints:
+
+```
+lsbs_to_remove=0 -> 9_134_406_574 complexity
+lsbs_to_remove=1 -> 5_548_275_712 complexity
+lsbs_to_remove=2 -> 2_430_793_927 complexity
+lsbs_to_remove=3 -> 1_058_638_119 complexity
+lsbs_to_remove=4 ->   409_952_712 complexity
+lsbs_to_remove=5 ->   172_138_947 complexity
+lsbs_to_remove=6 ->    99_198_195 complexity
+lsbs_to_remove=7 ->    71_644_380 complexity
+lsbs_to_remove=8 ->    55_860_516 complexity
+lsbs_to_remove=9 ->    50_978_148 complexity
+```
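A cleartext model helps show why removing more least significant bits makes the following lookup cheaper. The sketch below is a plain-integer model of exact rounding to a multiple of `2**lsbs_to_remove`, with ties rounded up; it is an assumption-laden illustration, not Concrete's implementation, and it does not model the relaxed rounding threshold of approximate mode. `round_to_multiple` is a hypothetical name.

```python
def round_to_multiple(value, lsbs_to_remove):
    """Round a non-negative integer to the nearest multiple of 2**lsbs_to_remove."""
    if lsbs_to_remove == 0:
        return value
    half = 1 << (lsbs_to_remove - 1)  # add half a step, then drop the low bits
    return ((value + half) >> lsbs_to_remove) << lsbs_to_remove


# Count how many distinct values a 10-bit input can take after rounding.
for lsbs_to_remove in range(0, 10):
    distinct = {round_to_multiple(v, lsbs_to_remove) for v in range(2**10)}
    print(f"{lsbs_to_remove=} -> {len(distinct):>5} distinct values")
```

Each removed bit roughly halves the number of distinct values that reach the `// 2` lookup, which is consistent with the falling complexity figures in the printed output.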
diff --git a/docs/optimization/optimize-table-lookups/bit-extraction.md b/docs/optimization/optimize-table-lookups/bit-extraction.md
@@ -0,0 +1,38 @@
+### Utilizing bit extraction
+
+This guide teaches how to improve the execution time of Concrete circuits by using bit extraction.
+
+[Bit extraction](../../core-features/bit_extraction.md) is a cheap way to extract certain bits of encrypted values. It can be very useful for improving the performance of circuits.
+
+For example:
+
+```python
+import numpy as np
+from concrete import fhe
+
+inputset = fhe.inputset(fhe.uint6)
+for bit_extraction in [False, True]:
+    def is_even(x):
+        return (
+            x % 2 == 0
+            if not bit_extraction
+            else 1 - fhe.bits(x)[0]
+        )
+
+    compiler = fhe.Compiler(is_even, {"x": "encrypted"})
+    circuit = compiler.compile(inputset)
+
+    if not bit_extraction:
+        print(f"without bit extraction -> {int(circuit.complexity):>11_} complexity")
+    else:
+        print(f"   with bit extraction -> {int(circuit.complexity):>11_} complexity")
+```
+
+prints:
+
+```
+without bit extraction -> 230_210_706 complexity
+   with bit extraction ->  29_506_014 complexity
+```
+
+That's almost an 8x improvement in circuit complexity!
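The two formulations of `is_even` compute the same thing, which can be checked on cleartext integers. In this sketch `x & 1` stands in for `fhe.bits(x)[0]`, on the assumption that index 0 is the least significant bit; the function names are illustrative.

```python
def is_even_with_modulo(x):
    """Evenness via modulo and comparison, as in the slower variant."""
    return int(x % 2 == 0)


def is_even_with_lsb(x):
    """Evenness via the least significant bit, as in the bit extraction variant."""
    return 1 - (x & 1)


# Exhaustively compare both variants over the uint6 range used by the inputset.
mismatches = [x for x in range(2**6) if is_even_with_modulo(x) != is_even_with_lsb(x)]
print(f"mismatches: {mismatches}")
```

Because the results agree for every 6-bit input, swapping the modulo-and-compare expression for bit extraction changes only the cost of the circuit, not its output.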