Commit
docs(frontend): add performance tips section
1 parent dfaf987, commit 2968e80
Showing 16 changed files with 687 additions and 0 deletions.
Binary file added (BIN, +29.3 KB): docs/_static/compilation/performance_tips/complexity_and_timing_per_bit_width.png
@@ -0,0 +1,102 @@
### Enabling dataflow parallelism

This guide teaches what dataflow parallelism is and how it can improve the execution time of Concrete circuits.

Dataflow parallelism is particularly useful when the circuit performs many scalar operations.

Without dataflow parallelism, the circuit is executed operation by operation, like an imperative program. If the operations themselves are not tensorized, loop parallelism is not used and the entire execution happens in a single thread. Dataflow parallelism changes this by analyzing the operations and their dependencies within the circuit to determine what can be done in parallel and what cannot. It then distributes the tasks that can be done in parallel to different threads.

For example:

```python
import time

import numpy as np
from concrete import fhe

def f(x, y, z):
    # normally, you'd use fhe.array to construct a concrete tensor
    # but for this example, we just create a simple numpy array
    # so the matrix multiplication happens element by element on scalars
    a = np.array([[x, y], [z, 2]])
    b = np.array([[1, x], [z, y]])
    return fhe.array(a @ b)

inputset = fhe.inputset(fhe.uint3, fhe.uint3, fhe.uint3)

for dataflow_parallelize in [False, True]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
    circuit = compiler.compile(inputset, dataflow_parallelize=dataflow_parallelize)

    circuit.keygen()
    for sample in inputset[:3]:  # warmup
        circuit.encrypt_run_decrypt(*sample)

    timings = []
    for sample in inputset[3:13]:
        start = time.time()
        result = circuit.encrypt_run_decrypt(*sample)
        end = time.time()

        assert np.array_equal(result, f(*sample))
        timings.append(end - start)

    if not dataflow_parallelize:
        print(f"without dataflow parallelize -> {np.mean(timings):.03f}s")
    else:
        print(f"   with dataflow parallelize -> {np.mean(timings):.03f}s")
```

prints:

```
without dataflow parallelize -> 0.609s
   with dataflow parallelize -> 0.414s
```

and the reason for that is:

```
// this is the generated MLIR for the circuit
// without dataflow, every single line would be executed one after the other
module {
  func.func @main(%arg0: !FHE.eint<7>, %arg1: !FHE.eint<7>, %arg2: !FHE.eint<7>) -> tensor<2x2x!FHE.eint<7>> {
    // but if you look closely, you can see that this multiplication
    %c1_i2 = arith.constant 1 : i2
    %0 = "FHE.mul_eint_int"(%arg0, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    // is completely independent of this one, so dataflow makes them run in parallel
    %1 = "FHE.mul_eint"(%arg1, %arg2) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    // however, this addition needs the first two operations
    // so dataflow waits until both are done before performing this one
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    // lastly, this multiplication is completely independent from the first three operations
    // so its execution starts in parallel when execution starts with dataflow
    %3 = "FHE.mul_eint"(%arg0, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    // similar logic can be applied to the remaining operations...
    %4 = "FHE.mul_eint"(%arg1, %arg1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %5 = "FHE.add_eint"(%3, %4) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %6 = "FHE.mul_eint_int"(%arg2, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    %c2_i3 = arith.constant 2 : i3
    %7 = "FHE.mul_eint_int"(%arg2, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %8 = "FHE.add_eint"(%6, %7) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %9 = "FHE.mul_eint"(%arg2, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %10 = "FHE.mul_eint_int"(%arg1, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %11 = "FHE.add_eint"(%9, %10) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %from_elements = tensor.from_elements %2, %5, %8, %11 : tensor<2x2x!FHE.eint<7>>
    return %from_elements : tensor<2x2x!FHE.eint<7>>
  }
}
```

To summarize, dataflow analyzes the circuit to determine which parts of the circuit can be run at the same time, and tries to run as many operations as possible in parallel.

{% hint style="warning" %}
When the circuit is tensorized, dataflow might slow execution down: the tensor operations already use multiple threads, and adding dataflow on top creates contention for the CPU between HPX (the dataflow parallelism runtime) and OpenMP (the loop parallelism runtime). Try both before deciding whether to use dataflow.
{% endhint %}
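
The scheduling idea described in this guide can be sketched in plain Python. This is only an illustration on clear integers, not how Concrete or HPX implement it: the `OPS` table and the `run_dataflow` helper are hypothetical names, and the scheduler runs operations wave by wave, which is simpler than a real dataflow runtime.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mini "circuit": operation name -> (function, names it depends on).
# It mirrors the first four MLIR operations, but on plain integers.
OPS = {
    "m0": (lambda env: env["x"] * 1, []),
    "m1": (lambda env: env["y"] * env["z"], []),
    "a2": (lambda env: env["m0"] + env["m1"], ["m0", "m1"]),
    "m3": (lambda env: env["x"] * env["x"], []),
}


def run_dataflow(ops, inputs, max_workers=4):
    """Run operations in waves; each wave holds every operation whose inputs are ready."""
    env = dict(inputs)
    pending = dict(ops)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = [
                name
                for name, (_, deps) in pending.items()
                if all(dep in env for dep in deps)
            ]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            snapshot = dict(env)  # workers only read values computed in earlier waves
            futures = {name: pool.submit(pending[name][0], snapshot) for name in ready}
            for name, future in futures.items():
                env[name] = future.result()
                del pending[name]
    return env


result = run_dataflow(OPS, {"x": 3, "y": 2, "z": 5})
print(result["a2"], result["m3"])
```

Here `m0`, `m1` and `m3` have no dependencies and are submitted together, while `a2` only runs once `m0` and `m1` have finished.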
@@ -0,0 +1,11 @@
## Improve parallelism

This guide teaches the different options for parallelism in Concrete and how to utilize them to improve the execution time of Concrete circuits.

Modern CPUs have multiple cores to perform computation, and utilizing multiple cores is a great way to boost performance.

There are two kinds of parallelism in Concrete:
- Loop parallelism to make tensor operations parallel, achieved by using [OpenMP](https://www.openmp.org/)
- Dataflow parallelism to make independent operations parallel, achieved by using [HPX](https://hpx.stellar-group.org/)

Loop parallelism is enabled by default, as it's supported on all platforms. Dataflow parallelism, however, is only supported on Linux and is therefore not enabled by default.
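
Since dataflow parallelism is only supported on Linux, one way to keep a script portable is to derive the option from the platform. The sketch below is an assumption about how one might do this, not part of Concrete: `parallelism_options` is a hypothetical helper, and only the `dataflow_parallelize` option name is taken from the guides in this commit.

```python
import sys


def parallelism_options(platform=sys.platform):
    """Return compile options that request dataflow parallelism only on Linux."""
    return {"dataflow_parallelize": platform.startswith("linux")}


# The result could be forwarded as compiler.compile(inputset, **parallelism_options())
print(parallelism_options())
```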
@@ -0,0 +1,55 @@
### Tensorizing operations

This guide teaches what tensorization is and how it can improve the execution time of Concrete circuits.

Tensors should be used instead of scalars when possible to maximize loop parallelism.

For example:

```python
import time

import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint6, fhe.uint6, fhe.uint6)
for tensorize in [False, True]:
    def f(x, y, z):
        return (
            np.sum(fhe.array([x, y, z]) ** 2)
            if tensorize
            else (x ** 2) + (y ** 2) + (z ** 2)
        )

    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
    circuit = compiler.compile(inputset)

    circuit.keygen()
    for sample in inputset[:3]:  # warmup
        circuit.encrypt_run_decrypt(*sample)

    timings = []
    for sample in inputset[3:13]:
        start = time.time()
        result = circuit.encrypt_run_decrypt(*sample)
        end = time.time()

        assert np.array_equal(result, f(*sample))
        timings.append(end - start)

    if not tensorize:
        print(f"without tensorization -> {np.mean(timings):.03f}s")
    else:
        print(f"   with tensorization -> {np.mean(timings):.03f}s")
```

prints:

```
without tensorization -> 0.214s
   with tensorization -> 0.118s
```

{% hint style="info" %}
Enabling dataflow parallelism is similar to letting the runtime do this for you. It would also help in this specific case.
{% endhint %}
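
The warmup-then-measure pattern used in these timing examples can be factored into a small helper. This is a minimal sketch using only the standard library; `benchmark` is a hypothetical name, and the lambda is a stand-in workload (with Concrete, `run` would be `circuit.encrypt_run_decrypt`).

```python
import time
from statistics import mean


def benchmark(run, samples, warmup=3):
    """Return the mean wall-clock time of run(*sample), skipping the warmup samples."""
    for sample in samples[:warmup]:
        run(*sample)  # warm up caches and lazy initialization

    timings = []
    for sample in samples[warmup:]:
        start = time.perf_counter()
        run(*sample)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Stand-in workload on clear integers.
samples = [(i, i + 1, i + 2) for i in range(13)]
elapsed = benchmark(lambda x, y, z: x ** 2 + y ** 2 + z ** 2, samples)
print(f"mean time -> {elapsed:.6f}s")
```

`time.perf_counter` is used here instead of `time.time` because it is a monotonic, high-resolution clock intended for measuring short durations. The helper needs more samples than `warmup`, otherwise there is nothing to average.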
65 changes: 65 additions & 0 deletions
docs/optimization/optimize-cryptographic-parameters/composition.md
@@ -0,0 +1,65 @@
### Specifying composition when using modules

This guide explains how to optimize cryptographic parameters by specifying composition when using [modules](../../compilation/composing_functions_with_modules.md).

When using [modules](../../compilation/composing_functions_with_modules.md), make sure to specify [composition](../../compilation/composing_functions_with_modules.md#optimizing-runtimes-with-composition-policies) so that the compiler can select more optimal parameters based on how the functions in the module would be used.

For example:

```python
import numpy as np
from concrete import fhe


@fhe.module()
class PowerWithoutComposition:
    @fhe.function({"x": "encrypted"})
    def square(x):
        return x ** 2

    @fhe.function({"x": "encrypted"})
    def cube(x):
        return x ** 3

without_composition = PowerWithoutComposition.compile(
    {
        "square": fhe.inputset(fhe.uint2),
        "cube": fhe.inputset(fhe.uint4),
    }
)
print(f"without composition -> {int(without_composition.complexity):>10_} complexity")


@fhe.module()
class PowerWithComposition:
    @fhe.function({"x": "encrypted"})
    def square(x):
        return x ** 2

    @fhe.function({"x": "encrypted"})
    def cube(x):
        return x ** 3

    composition = fhe.Wired(
        [
            fhe.Wire(fhe.Output(square, 0), fhe.Input(cube, 0))
        ]
    )

with_composition = PowerWithComposition.compile(
    {
        "square": fhe.inputset(fhe.uint2),
        "cube": fhe.inputset(fhe.uint4),
    }
)
print(f"   with composition -> {int(with_composition.complexity):>10_} complexity")
```

prints:

```
without composition -> 185_863_835 complexity
   with composition -> 135_871_612 complexity
```

which means specifying composition reduced complexity by ~27% (the version without composition is about 1.37x as complex) for computing `cube(square(x))`.
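
The relative change can be checked directly from the two complexity values printed above. The helper name `relative_change` is hypothetical; the numbers are the ones from the output.

```python
def relative_change(before, after):
    """Return (fractional reduction, ratio) between two complexity values."""
    return 1 - after / before, before / after


reduction, ratio = relative_change(185_863_835, 135_871_612)
print(f"reduction -> {reduction:.1%}, ratio -> {ratio:.2f}x")
```

Reporting both figures avoids ambiguity, since "percent improvement" can be read either as a reduction relative to the baseline or as a ratio between the two values.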
31 changes: 31 additions & 0 deletions
docs/optimization/optimize-cryptographic-parameters/p-error.md
@@ -0,0 +1,31 @@
### Adjusting table lookup error probability

This guide teaches how setting the `p_error` configuration option can affect the performance of Concrete circuits.

Adjusting table lookup error probability is discussed extensively in the [Table lookup exactness](../../core-features/table_lookups_advanced.md#table-lookup-exactness) section. The idea is to sacrifice exactness to gain performance.

For example:

```python
import numpy as np
from concrete import fhe

def f(x, y):
    return (x // 2) * (y // 3)

inputset = fhe.inputset(fhe.uint4, fhe.uint4)
for p_error in [(1 / 1_000_000), (1 / 100_000), (1 / 10_000), (1 / 1_000), (1 / 100)]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
    circuit = compiler.compile(inputset, p_error=p_error)
    print(f"p_error of {p_error:.6f} -> {int(circuit.complexity):_} complexity")
```

prints:

```
p_error of 0.000001 -> 294_773_524 complexity
p_error of 0.000010 -> 286_577_520 complexity
p_error of 0.000100 -> 275_887_080 complexity
p_error of 0.001000 -> 265_196_640 complexity
p_error of 0.010000 -> 184_144_972 complexity
```
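
To get a feel for the exactness being traded away, it can help to estimate how a per-lookup error probability compounds over a circuit. The sketch below assumes that `p_error` applies to each table lookup and that errors are independent, which is a simplification; `probability_of_any_error` is a hypothetical helper, not a Concrete API.

```python
def probability_of_any_error(p_error, table_lookups):
    """Chance that at least one of `table_lookups` independent lookups is wrong."""
    return 1 - (1 - p_error) ** table_lookups


for p_error in [1 / 1_000_000, 1 / 10_000, 1 / 100]:
    for table_lookups in [1, 100]:
        chance = probability_of_any_error(p_error, table_lookups)
        print(f"p_error={p_error:.6f}, lookups={table_lookups:>3} -> {chance:.4%}")
```

Under these assumptions, a `p_error` that looks small for a single lookup can still add up in circuits with many table lookups, so the acceptable value depends on the circuit size as well as the application.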
@@ -0,0 +1,5 @@
## Optimize cryptographic parameters

This guide teaches how to help the Concrete Optimizer select more performant parameters to improve the execution time of Concrete circuits.

The idea is to obtain more optimal cryptographic parameters (especially for table lookups) without changing the operations within the circuit.
@@ -0,0 +1,36 @@
### Activating approximate mode for rounding

This guide teaches how to improve the execution time of Concrete circuits by using approximate mode for rounding.

You can enable [approximate mode](../../core-features/rounding.md#exactness) to gain even more performance when using rounding by sacrificing some more exactness:

```python
import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint10)
for lsbs_to_remove in range(0, 10):
    def f(x):
        return fhe.round_bit_pattern(x, lsbs_to_remove, exactness=fhe.Exactness.APPROXIMATE) // 2

    compiler = fhe.Compiler(f, {"x": "encrypted"})
    circuit = compiler.compile(inputset)

    print(f"{lsbs_to_remove=} -> {int(circuit.complexity):>13_} complexity")
```

prints:

```
lsbs_to_remove=0 -> 9_134_406_574 complexity
lsbs_to_remove=1 -> 5_548_275_712 complexity
lsbs_to_remove=2 -> 2_430_793_927 complexity
lsbs_to_remove=3 -> 1_058_638_119 complexity
lsbs_to_remove=4 ->   409_952_712 complexity
lsbs_to_remove=5 ->   172_138_947 complexity
lsbs_to_remove=6 ->    99_198_195 complexity
lsbs_to_remove=7 ->    71_644_380 complexity
lsbs_to_remove=8 ->    55_860_516 complexity
lsbs_to_remove=9 ->    50_978_148 complexity
```
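
To see what is lost as `lsbs_to_remove` grows, here is a plain-integer model of rounding away the least significant bits. It is a sketch under assumptions, not the library's implementation: `round_bit_pattern_clear` is a hypothetical helper, ties are assumed to round up, and the extra error that approximate mode may introduce is not modeled.

```python
def round_bit_pattern_clear(x, lsbs_to_remove):
    """Round x to the nearest multiple of 2**lsbs_to_remove (ties round up)."""
    if lsbs_to_remove == 0:
        return x
    half = 1 << (lsbs_to_remove - 1)
    return ((x + half) >> lsbs_to_remove) << lsbs_to_remove


for lsbs_to_remove in [0, 2, 4]:
    values = [round_bit_pattern_clear(x, lsbs_to_remove) for x in (5, 6, 100)]
    print(f"{lsbs_to_remove=} -> {values}")
```

The more bits are removed, the coarser the values that reach the table lookup, which is why complexity drops so sharply in the output above and why the choice depends on how much precision the application can give up.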
38 changes: 38 additions & 0 deletions
docs/optimization/optimize-table-lookups/bit-extraction.md
@@ -0,0 +1,38 @@
### Utilizing bit extraction

This guide teaches how to improve the execution time of Concrete circuits by using bit extraction.

[Bit extraction](../../core-features/bit_extraction.md) is a cheap way to extract certain bits of encrypted values. It can be very useful for improving the performance of circuits.

For example:

```python
import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint6)
for bit_extraction in [False, True]:
    def is_even(x):
        return (
            x % 2 == 0
            if not bit_extraction
            else 1 - fhe.bits(x)[0]
        )

    compiler = fhe.Compiler(is_even, {"x": "encrypted"})
    circuit = compiler.compile(inputset)

    if not bit_extraction:
        print(f"without bit extraction -> {int(circuit.complexity):>11_} complexity")
    else:
        print(f"   with bit extraction -> {int(circuit.complexity):>11_} complexity")
```

prints:

```
without bit extraction -> 230_210_706 complexity
   with bit extraction ->  29_506_014 complexity
```

That's almost an 8x improvement in circuit complexity!
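
The two formulations of `is_even` can be checked against each other on clear integers. This is a small sanity check under the assumption that `fhe.bits(x)[0]` is the least significant bit, as the example uses it; `is_even_modulo` and `is_even_bit` are hypothetical names.

```python
def is_even_modulo(x):
    return int(x % 2 == 0)


def is_even_bit(x):
    # The least significant bit is 1 for odd values, so its complement marks even values.
    return 1 - (x & 1)


# Both versions agree on every 6-bit unsigned value.
assert all(is_even_modulo(x) == is_even_bit(x) for x in range(2 ** 6))
print(f"complexity ratio -> {230_210_706 / 29_506_014:.1f}x")
```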