Updating gate fusion #461
Codecov Report

```diff
@@           Coverage Diff           @@
##            master     #461   +/-  ##
==========================================
  Coverage   100.00%  100.00%
==========================================
  Files           85       84     -1
  Lines        11876    11770   -106
==========================================
- Hits         11876    11770   -106
```
I made a few updates to the gate fusion algorithm and the performance now matches Qiskit's. The improvements came mostly from playing with our current Python code and comparing the output fused circuits with the ones from qiskit; I have not looked into the fusion code of other libraries. Here are some benchmarks for several different circuits from the qibojit-benchmarks repository:

[Benchmark plots: creation_time, dry_run_time and simulation_times_mean for the variational, supremacy, bv, qv and qft circuits.]
Here I am comparing with qiskit using the … Apart from this, the only thing remaining for this PR is to implement the …
@stavros11 following the discussion today, before merging this PR we should verify:
Thanks for the summary and the list; I would perhaps add an additional point comparing performance with libraries other than qiskit, such as Qulacs and TensorFlow Quantum. I fixed the …
Regarding this point, I did some benchmarks on the above circuits using qiskit and changing the max_qubit option:

[Benchmark plots: simulation_times_mean for the variational, supremacy, bv, qv and qft circuits at different max_qubit values.]
I do not see anything special about their default choice max_qubit=5. In most cases performance keeps increasing for larger values, but there is a saturation around 7. I guess at some point it becomes a trade-off between performance and memory, since the fused gate matrices have shape (2^max_qubit, 2^max_qubit). I have not checked the gates of the fused circuits explicitly, so it is not certain that the set max_qubit value is always reached; e.g. it might be that gates are never fused to more than 7 qubits even if we set max_qubit higher, which would explain the saturation. Based on these results, I would say it is worth implementing kernels for more than two qubits and extending the fusion algorithm. I am just not sure whether we should preset a maximum possible value for max_qubit, and whether we can define kernels with dynamic size or need a separate kernel for each qubit number. Kernels for more than two qubits would not be relevant for experiments, but they may help in simulation, including applications other than fusion.
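For context on the memory side of this trade-off, here is a quick back-of-the-envelope sketch (assuming complex128 entries, i.e. 16 bytes each; the function name is just for illustration) of the size of a single fused gate matrix:

```python
# Memory footprint of one fused gate matrix of shape
# (2**max_qubit, 2**max_qubit), assuming complex128 entries (16 bytes each).
def fused_matrix_bytes(max_qubit):
    dim = 2 ** max_qubit
    return dim * dim * 16

for mq in (2, 5, 7):
    print(f"max_qubit={mq}: {fused_matrix_bytes(mq)} bytes")
# max_qubit=2 -> 256 B, max_qubit=5 -> 16 KiB, max_qubit=7 -> 256 KiB
```

Even at max_qubit=7 a single fused matrix is only 256 KiB, so the per-gate matrices themselves are cheap; memory pressure would come from elsewhere (e.g. the state vector).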
@stavros11 thanks for this first point. I believe it would be great to have custom operators with a larger number of qubits, but we can do this in a later PR. Concerning this point, do you have an idea of how much memory our fusion uses compared to the no-fusion circuit?
I have experimented a bit with both the qibo and qiskit fusion and I do not observe any significant change in memory usage between fusion and no fusion. In particular for qiskit, I played with the max_qubit option and even used circuits where this number equals the total number of qubits (e.g. max_qubit=15 in a circuit with 15 qubits), but still did not observe any change compared to the no-fusion simulation. I guess the fusion algorithm is smart and never creates very large gates, as this would most likely reduce performance; this is probably also why we observe a saturation as max_qubit is increased. In Qibo all fused gates are 4x4 matrices, so it is unlikely to hit memory issues unless the circuit is exponentially deep, and in that case there is not much we can do anyway.
Regarding this point, I did some benchmarks using both qibojit and qibotf for the variational circuit example only:

[Benchmark plots: creation_time, dry_run_time and simulation_times_mean for the variational circuit on both backends.]
As expected, the GPU helps in the circuit execution. For qibotf the GPU simulation time has the usual issue, so the dry run is a more accurate measurement. Other than that, I believe the numbers are reasonable for both backends. Regarding performing the gate fusion matrix multiplications on GPU, I tried this for the above example and got the following times:

[Benchmark plots: creation_time, dry_run_time and simulation_times_mean with the fusion matrices computed on GPU.]
So we notice that the GPU dry run times become significantly slower for both backends. The difference is due to the fused matrix calculations, which are only done during the dry run since these matrices are cached once calculated. To be more concrete, in terms of code, doing the fusion operation on GPU means that the backend performs these small matrix multiplications on GPU arrays instead of numpy. Based on the above results, I believe it is preferable, at least for the two-qubit fusion, to keep these operations on numpy/CPU. Compare doing the operation on GPU:

```python
import time

import cupy as cp

matrices = [cp.random.random((4, 4)) for _ in range(100)]

start_time = time.time()
t = cp.eye(4)
for m in matrices:
    t = m @ t
print(time.time() - start_time)
```

vs doing the operation on CPU and casting back on GPU:

```python
import time

import cupy as cp
import numpy as np

matrices = [cp.random.random((4, 4)) for _ in range(100)]

start_time = time.time()
t = np.eye(4)
for m in matrices:
    t = m.get() @ t  # m must be transferred from GPU to CPU
t = cp.array(t)  # result transferred back to GPU
print(time.time() - start_time)
```

On my local machine the latter is more than 1000x faster for these dimensions, despite the transfers. The GPU becomes faster only when I increase the matrix size, e.g. to 1024x1024. With that in mind, I believe this PR is ready as it is. We should consider extending the kernels to more qubits in the future.
@stavros11 thank you very much for this. I also believe the CPU projection approach is acceptable and fully justified.
@mlazzarin could you please perform a first review of this PR, e.g. checking the code, example and docs?
Ok, I could do that tomorrow.
Looks good to me, I made a few minor remarks regarding comments/docstrings.
src/qibo/tests/test_core_circuit.py (Outdated)

```diff
@@ -31,8 +31,7 @@ def test_circuit_add_layer(backend, nqubits, accelerators):
     for gate in c.queue:
         assert isinstance(gate, gates.Unitary)

-# TODO: Test `_fuse_copy`
-# TODO: Test `fuse`
+# :meth:`qibo.core.circuit.Circuit` is tested in `test_core_fusion.py`
```
Suggested change:

```diff
-# :meth:`qibo.core.circuit.Circuit` is tested in `test_core_fusion.py`
+# :meth:`qibo.core.circuit.Circuit.fuse()` is tested in `test_core_fusion.py`
```
src/qibo/core/gates.py (Outdated)
```python
class FusedGate(MatrixGate, abstract_gates.FusedGate):

    def __init__(self, *q):
        BackendGate.__init__(self)
        abstract_gates.FusedGate.__init__(self, *q)
        if self.gate_op:
            if len(self.target_qubits) == 1:
                self.gate_op = K.op.apply_gate
            elif len(self.target_qubits) == 2:
                self.gate_op = K.op.apply_two_qubit_gate
            else:
                raise_error(NotImplementedError,
                            "Fused gates can target up to two qubits.")

    def _construct_unitary(self):
        matrix = K.qnp.eye(2 ** len(self.target_qubits))
        for gate in self.gates:
            gmatrix = K.to_numpy(gate.matrix)
            if len(gate.qubits) < len(self.target_qubits):
                if gate.qubits[0] == self.target_qubits[0]:
                    gmatrix = K.qnp.kron(gmatrix, K.qnp.eye(2))
                else:
                    gmatrix = K.qnp.kron(K.qnp.eye(2), gmatrix)
            elif gate.qubits != self.target_qubits:
                gmatrix = K.qnp.reshape(gmatrix, 4 * (2,))
                gmatrix = K.qnp.transpose(gmatrix, [1, 0, 3, 2])
                gmatrix = K.qnp.reshape(gmatrix, (4, 4))
            matrix = gmatrix @ matrix
        return K.cast(matrix)
```
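To see why the reshape/transpose in `_construct_unitary` handles gates whose qubits appear in reversed order relative to the target qubits, note that exchanging the two input axes and the two output axes of a 4x4 matrix is equivalent to conjugating it by SWAP. A quick standalone numpy check (not Qibo code, just the same index trick):

```python
import numpy as np

# CNOT written with its first qubit as control.
cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
swap = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

# Same reshape/transpose trick as in _construct_unitary: swap the two
# row (output) axes and the two column (input) axes of the 4x4 matrix.
g = np.reshape(cnot, 4 * (2,))
g = np.transpose(g, [1, 0, 3, 2])
g = np.reshape(g, (4, 4))

# Equivalent to conjugation by SWAP.
print(np.allclose(g, swap @ cnot @ swap))  # True
```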
Maybe it would be useful to add a docstring that explains where this class is used, and some comments highlighting where the two-qubit limit is hard-coded (it may be useful in the future).
Thank you for reviewing this. I added docstrings and comments in the suggested places; please let me know if they are easy to follow and make the code easier to read through.
Thank you very much, they are useful indeed.
This attempts to improve gate fusion to match the performance of other libraries (see #451). I tried to check what other libraries do in this case: qiskit, qsim and qulacs all seem to implement the fusion in C++ (see for example here for the qiskit implementation).
Here I implemented a single algorithm which, in a first pass, fuses one-qubit gates and, in a second pass, fuses two-qubit gates with the neighboring one-qubit gates. There is no specific selection of which gates will be fused; we just loop through the gates and perform all possible fusions in order.
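As a toy illustration of the first pass (not the actual Qibo implementation — gates here are plain `(qubits, matrix)` pairs and the helper name is hypothetical), consecutive one-qubit gates on the same qubit can be collapsed into a single 2x2 matrix:

```python
import numpy as np

def fuse_one_qubit(queue):
    """Greedy first pass: merge runs of consecutive one-qubit gates acting
    on the same qubit; gates are (qubits, matrix) pairs."""
    fused = []
    for qubits, matrix in queue:
        if (fused and len(qubits) == 1 and len(fused[-1][0]) == 1
                and fused[-1][0] == qubits):
            # Later gate acts after the earlier one, so it multiplies from the left.
            fused[-1] = (qubits, matrix @ fused[-1][1])
        else:
            fused.append((qubits, matrix))
    return fused

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
queue = [((0,), H), ((0,), X), ((1,), H), ((0, 1), np.eye(4))]
print(len(fuse_one_qubit(queue)))  # H and X on qubit 0 merge -> 3 gates
```

The second pass would then absorb the surviving one-qubit matrices into neighboring two-qubit gates via kron products, as done in `FusedGate._construct_unitary`.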
Here is some performance comparison with qiskit using the variational circuit benchmark, with and without fusion:

[Benchmark tables: no fusion; fusion with `fusion_max_qubit=2`; default qiskit fusion.]
The third table uses the default qiskit fusion, which goes up to 5-qubit gates, while the second table limits qiskit to two-qubit fused gates, as the qibo implementation does. Note that in Qibo we cannot go beyond two qubits with the custom backends, since we only have kernels for up to two-qubit gates.
Qiskit performance is still better. The difference could be due to the different algorithm used, to the C++ vs Python overhead, or, more likely, both.