docs(frontend): add performance tips section
umut-sahin committed Aug 2, 2024
1 parent dfaf987 commit 2968e80
Showing 16 changed files with 687 additions and 0 deletions.
14 changes: 14 additions & 0 deletions docs/SUMMARY.md
@@ -35,6 +35,7 @@

* [Simulation](execution-analysis/simulation.md)
* [Debugging and artifact](execution-analysis/debug.md)
* [Performance](optimization/summary.md)
* [GPU acceleration](execution-analysis/gpu_acceleration.md)
* Other
* [Statistics](compilation/statistics.md)
@@ -46,6 +47,19 @@
* [Configure](guides/configure.md)
* [Manage keys](guides/manage_keys.md)
* [Deploy](guides/deploy.md)
* [Optimization](optimization/self.md)
  * [Improve parallelism](optimization/improve-parallelism/self.md)
    * [Dataflow parallelism](optimization/improve-parallelism/dataflow.md)
    * [Tensorizing operations](optimization/improve-parallelism/tensorization.md)
  * [Optimize table lookups](optimization/optimize-table-lookups/self.md)
    * [Reducing TLU](optimization/optimize-table-lookups/reducing-amount.md)
    * [Implementation strategies](optimization/optimize-table-lookups/strategies.md)
    * [Round/truncating](optimization/optimize-table-lookups/round-truncate.md)
    * [Approximate mode](optimization/optimize-table-lookups/approximate.md)
    * [Bit extraction](optimization/optimize-table-lookups/bit-extraction.md)
  * [Optimize cryptographic parameters](optimization/optimize-cryptographic-parameters/self.md)
    * [Error probability](optimization/optimize-cryptographic-parameters/p-error.md)
    * [Composition](optimization/optimize-cryptographic-parameters/composition.md)

## Tutorials

102 changes: 102 additions & 0 deletions docs/optimization/improve-parallelism/dataflow.md
@@ -0,0 +1,102 @@
### Enabling dataflow parallelism

This guide teaches what dataflow parallelism is and how it can improve the execution time of Concrete circuits.

Dataflow parallelism is particularly useful when the circuit performs many scalar operations.

Without dataflow parallelism, the circuit is executed operation by operation, like a program in an imperative language. If the operations themselves are not tensorized, loop parallelism is not utilized and the entire execution happens in a single thread. Dataflow parallelism changes this by analyzing the operations and their dependencies within the circuit to determine what can run in parallel and what cannot. It then distributes the tasks that can run in parallel across different threads.

For example:

```python
import time

import numpy as np
from concrete import fhe

def f(x, y, z):
    # normally, you'd use fhe.array to construct a concrete tensor
    # but for this example, we just create a simple numpy array
    # so the matrix multiplication can happen on a cellular level
    a = np.array([[x, y], [z, 2]])
    b = np.array([[1, x], [z, y]])
    return fhe.array(a @ b)

inputset = fhe.inputset(fhe.uint3, fhe.uint3, fhe.uint3)

for dataflow_parallelize in [False, True]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
    circuit = compiler.compile(inputset, dataflow_parallelize=dataflow_parallelize)

    circuit.keygen()
    for sample in inputset[:3]:  # warmup
        circuit.encrypt_run_decrypt(*sample)

    timings = []
    for sample in inputset[3:13]:
        start = time.time()
        result = circuit.encrypt_run_decrypt(*sample)
        end = time.time()

        assert np.array_equal(result, f(*sample))
        timings.append(end - start)

    if not dataflow_parallelize:
        print(f"without dataflow parallelize -> {np.mean(timings):.03f}s")
    else:
        print(f" with dataflow parallelize -> {np.mean(timings):.03f}s")
```

prints:

```
without dataflow parallelize -> 0.609s
with dataflow parallelize -> 0.414s
```

The speedup comes from independent operations in the generated MLIR, which dataflow can run in parallel:

```
// this is the generated MLIR for the circuit
// without dataflow, every single line would be executed one after the other
module {
  func.func @main(%arg0: !FHE.eint<7>, %arg1: !FHE.eint<7>, %arg2: !FHE.eint<7>) -> tensor<2x2x!FHE.eint<7>> {
    // but if you look closely, you can see that this multiplication
    %c1_i2 = arith.constant 1 : i2
    %0 = "FHE.mul_eint_int"(%arg0, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    // is completely independent of this one, so dataflow makes them run in parallel
    %1 = "FHE.mul_eint"(%arg1, %arg2) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    // however, this addition needs the first two operations
    // so dataflow waits until both are done before performing this one
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    // lastly, this multiplication is completely independent from the first three operations
    // so its execution starts in parallel when execution starts with dataflow
    %3 = "FHE.mul_eint"(%arg0, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    // similar logic can be applied to the remaining operations...
    %4 = "FHE.mul_eint"(%arg1, %arg1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %5 = "FHE.add_eint"(%3, %4) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %6 = "FHE.mul_eint_int"(%arg2, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    %c2_i3 = arith.constant 2 : i3
    %7 = "FHE.mul_eint_int"(%arg2, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %8 = "FHE.add_eint"(%6, %7) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %9 = "FHE.mul_eint"(%arg2, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %10 = "FHE.mul_eint_int"(%arg1, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %11 = "FHE.add_eint"(%9, %10) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %from_elements = tensor.from_elements %2, %5, %8, %11 : tensor<2x2x!FHE.eint<7>>
    return %from_elements : tensor<2x2x!FHE.eint<7>>
  }
}
```

To summarize, dataflow analyzes the circuit to determine which parts of the circuit can be run at the same time, and tries to run as many operations as possible in parallel.
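
The dependency analysis can be pictured with a small, self-contained Python sketch. It is not how the Concrete runtime is implemented; it only groups the twelve FHE operations from the MLIR above (numbered `%0` to `%11`) into "waves" whose members have all their operands ready. The `DEPENDENCIES` table and the `parallel_waves` helper are hypothetical names introduced for illustration.

```python
# Hypothetical model of the MLIR above: each operation maps to the
# operations whose results it consumes (circuit arguments are omitted).
DEPENDENCIES = {
    0: [], 1: [], 2: [0, 1],
    3: [], 4: [], 5: [3, 4],
    6: [], 7: [], 8: [6, 7],
    9: [], 10: [], 11: [9, 10],
}


def parallel_waves(dependencies):
    """Group operations into waves that could run concurrently."""
    done = set()
    waves = []
    while len(done) < len(dependencies):
        # An operation is ready once every operand it consumes is done.
        ready = sorted(
            op
            for op, operands in dependencies.items()
            if op not in done and all(operand in done for operand in operands)
        )
        if not ready:
            raise ValueError("cyclic dependency")
        waves.append(ready)
        done.update(ready)
    return waves


waves = parallel_waves(DEPENDENCIES)
for index, wave in enumerate(waves):
    print(f"wave {index}: {wave}")
```

Under this model, the eight multiplications form the first wave and the four additions form the second, so with enough threads the circuit needs two rounds of work instead of twelve sequential steps. A real scheduler would likely start each operation as soon as its operands are ready rather than waiting for a whole wave.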

{% hint style="warning" %}
When the circuit is tensorized, dataflow might slow execution down: tensor operations already use multiple threads, and adding dataflow on top makes HPX (the dataflow parallelism runtime) and OpenMP (the loop parallelism runtime) compete for the same CPU cores. Try both before deciding whether to use dataflow.
{% endhint %}
11 changes: 11 additions & 0 deletions docs/optimization/improve-parallelism/self.md
@@ -0,0 +1,11 @@
## Improve parallelism

This guide teaches the different options for parallelism in Concrete and how to utilize them to improve the execution time of Concrete circuits.

Modern CPUs have multiple cores to perform computation and utilizing multiple cores is a great way to boost performance.

There are two kinds of parallelism in Concrete:
- Loop parallelism to make tensor operations parallel, achieved by using [OpenMP](https://www.openmp.org/)
- Dataflow parallelism to make independent operations parallel, achieved by using [HPX](https://hpx.stellar-group.org/)

Loop parallelism is enabled by default, as it's supported on all platforms. Dataflow parallelism, however, is only supported on Linux, so it is not enabled by default.
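
Since dataflow parallelism is Linux-only, one option is to request it conditionally. The sketch below is only an illustration under stated assumptions: `compile_options` is a hypothetical helper, and the only Concrete option it produces is `dataflow_parallelize`, the keyword passed to `compile` in the dataflow parallelism guide.

```python
import platform


def compile_options(system=None):
    """Return keyword arguments for `compiler.compile` (hypothetical helper)."""
    # platform.system() returns names such as "Linux", "Darwin" or "Windows".
    system = platform.system() if system is None else system
    return {"dataflow_parallelize": system == "Linux"}


print(compile_options())
```

The result could then be passed along as `compiler.compile(inputset, **compile_options())`.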
55 changes: 55 additions & 0 deletions docs/optimization/improve-parallelism/tensorization.md
@@ -0,0 +1,55 @@
### Tensorizing operations

This guide teaches what tensorization is and how it can improve the execution time of Concrete circuits.

Tensors should be used instead of scalars when possible to maximize loop parallelism.

For example:

```python
import time

import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint6, fhe.uint6, fhe.uint6)
for tensorize in [False, True]:
    def f(x, y, z):
        return (
            np.sum(fhe.array([x, y, z]) ** 2)
            if tensorize
            else (x ** 2) + (y ** 2) + (z ** 2)
        )

    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
    circuit = compiler.compile(inputset)

    circuit.keygen()
    for sample in inputset[:3]:  # warmup
        circuit.encrypt_run_decrypt(*sample)

    timings = []
    for sample in inputset[3:13]:
        start = time.time()
        result = circuit.encrypt_run_decrypt(*sample)
        end = time.time()

        assert np.array_equal(result, f(*sample))
        timings.append(end - start)

    if not tensorize:
        print(f"without tensorization -> {np.mean(timings):.03f}s")
    else:
        print(f" with tensorization -> {np.mean(timings):.03f}s")
```

prints:

```
without tensorization -> 0.214s
with tensorization -> 0.118s
```

{% hint style="info" %}
Enabling dataflow parallelism effectively lets the runtime find this parallelism for you, so it would also help in this specific case.
{% endhint %}
65 changes: 65 additions & 0 deletions docs/optimization/optimize-cryptographic-parameters/composition.md
@@ -0,0 +1,65 @@
### Specifying composition when using modules

This guide explains how to optimize cryptographic parameters by specifying composition when using [modules](../../compilation/composing_functions_with_modules.md).

When using [modules](../../compilation/composing_functions_with_modules.md), make sure to specify [composition](../../compilation/composing_functions_with_modules.md#optimizing-runtimes-with-composition-policies) so that the compiler can select better parameters based on how the functions in the module will be used.

For example:

```python
import numpy as np
from concrete import fhe


@fhe.module()
class PowerWithoutComposition:
    @fhe.function({"x": "encrypted"})
    def square(x):
        return x ** 2

    @fhe.function({"x": "encrypted"})
    def cube(x):
        return x ** 3

without_composition = PowerWithoutComposition.compile(
    {
        "square": fhe.inputset(fhe.uint2),
        "cube": fhe.inputset(fhe.uint4),
    }
)
print(f"without composition -> {int(without_composition.complexity):>10_} complexity")


@fhe.module()
class PowerWithComposition:
    @fhe.function({"x": "encrypted"})
    def square(x):
        return x ** 2

    @fhe.function({"x": "encrypted"})
    def cube(x):
        return x ** 3

    # wire the output of square into the input of cube
    composition = fhe.Wired(
        [
            fhe.Wire(fhe.Output(square, 0), fhe.Input(cube, 0))
        ]
    )

with_composition = PowerWithComposition.compile(
    {
        "square": fhe.inputset(fhe.uint2),
        "cube": fhe.inputset(fhe.uint4),
    }
)
print(f" with composition -> {int(with_composition.complexity):>10_} complexity")
```

prints:

```
without composition -> 185_863_835 complexity
with composition -> 135_871_612 complexity
```

which means specifying composition reduced complexity by about 27% for computing `cube(square(x))` (equivalently, the version without composition is about 37% more expensive).
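
The wiring is consistent with the inputsets used above, which a plain-Python check can illustrate. This sketch does not use Concrete; it only assumes that `fhe.uint2` and `fhe.uint4` cover the ranges 0..3 and 0..15, and verifies that every output of `square` on a 2-bit input is a valid 4-bit input for `cube`.

```python
def square(x):
    return x ** 2


def cube(x):
    return x ** 3


uint2_values = range(2 ** 2)  # assumed range of fhe.uint2: 0..3
uint4_values = range(2 ** 4)  # assumed range of fhe.uint4: 0..15

# Every output of square must be an acceptable input of cube.
square_outputs = [square(x) for x in uint2_values]
assert all(value in uint4_values for value in square_outputs)

composed = [cube(square(x)) for x in uint2_values]
print(square_outputs)
print(composed)
```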
31 changes: 31 additions & 0 deletions docs/optimization/optimize-cryptographic-parameters/p-error.md
@@ -0,0 +1,31 @@
### Adjusting table lookup error probability

This guide teaches how setting the `p_error` configuration option can affect the performance of Concrete circuits.

Adjusting table lookup error probability is discussed extensively in the [Table lookup exactness](../../core-features/table_lookups_advanced.md#table-lookup-exactness) section. The idea is to sacrifice exactness to gain performance.

For example:

```python
import numpy as np
from concrete import fhe

def f(x, y):
    return (x // 2) * (y // 3)

inputset = fhe.inputset(fhe.uint4, fhe.uint4)
for p_error in [(1 / 1_000_000), (1 / 100_000), (1 / 10_000), (1 / 1_000), (1 / 100)]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
    circuit = compiler.compile(inputset, p_error=p_error)
    print(f"p_error of {p_error:.6f} -> {int(circuit.complexity):_} complexity")
```

prints:

```
p_error of 0.000001 -> 294_773_524 complexity
p_error of 0.000010 -> 286_577_520 complexity
p_error of 0.000100 -> 275_887_080 complexity
p_error of 0.001000 -> 265_196_640 complexity
p_error of 0.010000 -> 184_144_972 complexity
```
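
To get a feel for the exactness being traded away, recall that `p_error` applies to each table lookup individually. Assuming lookups fail independently (a simplifying assumption), the chance that a circuit with `n` table lookups contains at least one inexact lookup is `1 - (1 - p_error) ** n`. The helper name and the lookup count of 4 below are hypothetical, chosen only for illustration.

```python
def at_least_one_error(p_error, table_lookups):
    """Probability that one or more lookups are inexact, assuming independence."""
    return 1 - (1 - p_error) ** table_lookups


for p_error in [1 / 1_000_000, 1 / 100_000, 1 / 10_000, 1 / 1_000, 1 / 100]:
    # 4 is an arbitrary lookup count, not a measured property of the circuit above
    print(f"p_error of {p_error:.6f} -> {at_least_one_error(p_error, 4):.6f}")
```

Larger circuits amplify the per-lookup probability, so a `p_error` that looks negligible for a handful of lookups may not be for thousands of them.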
5 changes: 5 additions & 0 deletions docs/optimization/optimize-cryptographic-parameters/self.md
@@ -0,0 +1,5 @@
## Optimize cryptographic parameters

This guide teaches how to help the Concrete Optimizer select more performant parameters to improve the execution time of Concrete circuits.

The idea is to obtain more optimal cryptographic parameters (especially for table lookups) without changing the operations within the circuit.
36 changes: 36 additions & 0 deletions docs/optimization/optimize-table-lookups/approximate.md
@@ -0,0 +1,36 @@
### Activating approximate mode for rounding

This guide teaches how to improve the execution time of Concrete circuits by using approximate mode for rounding.

You can enable [approximate mode](../../core-features/rounding.md#exactness) to gain even more performance when using rounding by sacrificing some more exactness:

```python
import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint10)
for lsbs_to_remove in range(0, 10):
    def f(x):
        return fhe.round_bit_pattern(x, lsbs_to_remove, exactness=fhe.Exactness.APPROXIMATE) // 2

    compiler = fhe.Compiler(f, {"x": "encrypted"})
    circuit = compiler.compile(inputset)

    print(f"{lsbs_to_remove=} -> {int(circuit.complexity):>13_} complexity")
```

prints:

```
lsbs_to_remove=0 -> 9_134_406_574 complexity
lsbs_to_remove=1 -> 5_548_275_712 complexity
lsbs_to_remove=2 -> 2_430_793_927 complexity
lsbs_to_remove=3 -> 1_058_638_119 complexity
lsbs_to_remove=4 -> 409_952_712 complexity
lsbs_to_remove=5 -> 172_138_947 complexity
lsbs_to_remove=6 -> 99_198_195 complexity
lsbs_to_remove=7 -> 71_644_380 complexity
lsbs_to_remove=8 -> 55_860_516 complexity
lsbs_to_remove=9 -> 50_978_148 complexity
```
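
The complexity drops because rounding leaves fewer significant bits for the table lookup that follows. As a plain-integer model (not Concrete's implementation, and assuming exact round-half-up to the nearest multiple of `2 ** lsbs_to_remove`), the effect on cleartext values can be sketched as below. `round_bits` is a hypothetical helper; approximate mode further relaxes the exactness of this rounding, which the model does not capture.

```python
def round_bits(value, lsbs_to_remove):
    """Round to the nearest multiple of 2 ** lsbs_to_remove (ties round up)."""
    if lsbs_to_remove == 0:
        return value
    unit = 1 << lsbs_to_remove
    # Add half a unit, then clear the low bits by shifting right and back left.
    return ((value + unit // 2) >> lsbs_to_remove) << lsbs_to_remove


for lsbs_to_remove in [0, 2, 4]:
    # Count how many distinct values a 10-bit input can take after rounding.
    distinct = len({round_bits(x, lsbs_to_remove) for x in range(2 ** 10)})
    print(f"{lsbs_to_remove=} -> {distinct} distinct rounded values")
```

The fewer distinct values remain after rounding, the smaller the lookup table that needs to be evaluated.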
38 changes: 38 additions & 0 deletions docs/optimization/optimize-table-lookups/bit-extraction.md
@@ -0,0 +1,38 @@
### Utilizing bit extraction

This guide teaches how to improve the execution time of Concrete circuits by using bit extraction.

[Bit extraction](../../core-features/bit_extraction.md) is a cheap way to extract certain bits of encrypted values. It can be very useful for improving the performance of circuits.

For example:

```python
import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint6)
for bit_extraction in [False, True]:
    def is_even(x):
        return (
            x % 2 == 0
            if not bit_extraction
            else 1 - fhe.bits(x)[0]
        )

    compiler = fhe.Compiler(is_even, {"x": "encrypted"})
    circuit = compiler.compile(inputset)

    if not bit_extraction:
        print(f"without bit extraction -> {int(circuit.complexity):>11_} complexity")
    else:
        print(f" with bit extraction -> {int(circuit.complexity):>11_} complexity")
```

prints:

```
without bit extraction -> 230_210_706 complexity
with bit extraction -> 29_506_014 complexity
```

That's an almost 8x improvement in circuit complexity!
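
The two formulations compute the same function, which can be checked on cleartext integers. The sketch below models `fhe.bits(x)[0]` as the least significant bit `x & 1`; this is an assumption about its semantics that is consistent with the `is_even` example above, and `lsb` is a hypothetical stand-in rather than a Concrete API.

```python
def lsb(x):
    """Cleartext stand-in for extracting bit 0 of x."""
    return x & 1


def is_even_with_modulo(x):
    return int(x % 2 == 0)


def is_even_with_bit_extraction(x):
    return 1 - lsb(x)


uint6_values = range(2 ** 6)  # same range as the fhe.uint6 inputset
mismatches = [
    x
    for x in uint6_values
    if is_even_with_modulo(x) != is_even_with_bit_extraction(x)
]
print(mismatches)
```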