Updating gate fusion #461

Merged
merged 34 commits into from
Sep 22, 2021

stavros11
Member

This attempts to improve gate fusion to match the performance of other libraries (see #451). I tried to check what other libraries do in this case and qiskit, qsim and qulacs seem to implement the fusion in C++. See for example here for the qiskit implementation.

Here I implemented a single algorithm that fuses one-qubit gates in a first pass and, in a second pass, fuses two-qubit gates with their neighboring one-qubit gates. There is no specific selection of which gates to fuse: we just loop through the gates and perform all possible fusions in order.
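A minimal sketch of the two-pass idea described above, assuming a toy gate representation `(qubits, matrix)` and plain NumPy. The helper names (`expand`, `fuse_one_qubit`, `fuse_two_qubit`, `circuit_unitary`) are hypothetical and this is not the code added in this PR; it only illustrates one way the two passes could work and checks that the fused circuit has the same unitary.

```python
import numpy as np

# Toy model: a gate is (qubits, matrix), with qubits[0] the most significant
# index of the matrix. Hypothetical representation, not Qibo's gate classes.

def expand(matrix, qubit, pair):
    """Embed a one-qubit matrix into the two-qubit space spanned by `pair`."""
    eye = np.eye(2)
    return np.kron(matrix, eye) if qubit == pair[0] else np.kron(eye, matrix)


def fuse_one_qubit(gates):
    """Pass 1: merge one-qubit gates that act back to back on the same qubit."""
    fused, open_slot = [], {}  # open_slot: qubit -> index of a mergeable gate
    for qubits, matrix in gates:
        if len(qubits) == 1 and qubits[0] in open_slot:
            i = open_slot[qubits[0]]
            fused[i] = (qubits, matrix @ fused[i][1])  # later gate acts last
            continue
        for q in qubits:
            open_slot.pop(q, None)  # a multi-qubit gate blocks further merging
        if len(qubits) == 1:
            open_slot[qubits[0]] = len(fused)
        fused.append((qubits, matrix))
    return fused


def fuse_two_qubit(gates):
    """Pass 2: absorb neighbouring one-qubit gates into two-qubit gates."""
    fused, last = [], {}  # last: qubit -> index of the latest gate on it
    for qubits, matrix in gates:
        if len(qubits) == 1:
            i = last.get(qubits[0])
            if i is not None and len(fused[i][0]) == 2:
                pair, m = fused[i]  # merge into the preceding two-qubit gate
                fused[i] = (pair, expand(matrix, qubits[0], pair) @ m)
                continue
        elif len(qubits) == 2:
            for q in qubits:
                i = last.get(q)
                if i is not None and len(fused[i][0]) == 1:
                    matrix = matrix @ expand(fused[i][1], q, qubits)
                    fused[i] = None  # absorbed into this two-qubit gate
        for q in qubits:
            last[q] = len(fused)
        fused.append((qubits, matrix))
    return [g for g in fused if g is not None]


def circuit_unitary(gates, nqubits):
    """Dense unitary of the whole circuit, used only to verify the fusion."""
    dim = 2 ** nqubits
    u = np.eye(dim, dtype=complex).reshape((2,) * nqubits + (dim,))
    for qubits, matrix in gates:
        k = len(qubits)
        m = matrix.reshape((2,) * (2 * k))
        u = np.tensordot(m, u, axes=(list(range(k, 2 * k)), list(qubits)))
        u = np.moveaxis(u, list(range(k)), list(qubits))  # restore axis order
    return u.reshape(dim, dim)


H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
X = np.array([[0, 1], [1, 0]])
CZ = np.diag([1, 1, 1, -1])

circuit = [((0,), H), ((0,), T), ((1,), H), ((0, 1), CZ),
           ((1,), X), ((1, 2), CZ), ((2,), H), ((0,), H)]
fused = fuse_two_qubit(fuse_one_qubit(circuit))
print(len(circuit), "->", len(fused), "gates")
print(np.allclose(circuit_unitary(circuit, 3), circuit_unitary(fused, 3)))
```

A one-qubit gate can be merged into an earlier gate only because no gate in between acts on the same qubit, so the reordering does not change the circuit.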

Here is a performance comparison with qiskit using the variational circuit benchmark, with and without fusion.

No fusion
nqubits creation_time_qiskit creation_time_qibo dry_run_time_qiskit dry_run_time_qibo simulation_time_qiskit simulation_time_qibo
22 0.04288 0.00261 0.05275 0.07731 0.06422 0.02616
23 0.04281 0.00265 0.09580 0.11051 0.08741 0.05339
24 0.04331 0.00279 0.15906 0.17514 0.17004 0.10230
25 0.04297 0.00282 0.54398 0.53285 0.54866 0.50177
26 0.04433 0.00287 1.12366 1.15239 1.14877 1.06958
27 0.04379 0.00302 2.31479 2.26544 2.38414 2.18739
28 0.04433 0.00300 4.74315 4.69717 4.90792 4.57531
29 0.04490 0.00304 9.77379 9.47510 9.99879 9.29732
30 0.04698 0.00360 20.06423 19.77659 20.57615 19.58396
Fusion (using `fusion_max_qubit=2`)
nqubits creation_time_qiskit creation_time_qibo dry_run_time_qiskit dry_run_time_qibo simulation_time_qiskit simulation_time_qibo
22 0.04253 0.00384 0.02823 0.06814 0.03171 0.02035
23 0.04389 0.00386 0.04512 0.09582 0.05023 0.04237
24 0.04320 0.00402 0.06710 0.17278 0.08034 0.09608
25 0.04434 0.00400 0.18203 0.31944 0.19189 0.24066
26 0.04434 0.00412 0.36120 0.57665 0.39272 0.48691
27 0.04514 0.00469 0.77073 1.12824 0.81376 0.99514
28 0.04493 0.00484 1.49052 2.33499 1.62031 2.03122
29 0.04490 0.00494 3.06657 4.57194 3.32253 4.39394
30 0.04474 0.00512 6.11339 9.23624 6.62070 9.05363
Fusion (qiskit default, fused gates of up to five qubits)
nqubits creation_time_qiskit creation_time_qibo dry_run_time_qiskit dry_run_time_qibo simulation_time_qiskit simulation_time_qibo
22 0.04236 0.00384 0.01942 0.06814 0.02362 0.02035
23 0.04302 0.00386 0.03487 0.09582 0.04113 0.04237
24 0.04326 0.00402 0.05549 0.17278 0.06585 0.09608
25 0.04310 0.00400 0.14205 0.31944 0.15691 0.24066
26 0.04376 0.00412 0.28565 0.57665 0.31373 0.48691
27 0.04398 0.00469 0.59204 1.12824 0.66236 0.99514
28 0.04496 0.00484 1.15985 2.33499 1.29715 2.03122
29 0.04487 0.00494 2.45058 4.57194 2.70454 4.39394
30 0.04594 0.00512 4.90200 9.23624 5.36761 9.05363

The third table uses the default qiskit fusion, which goes up to five-qubit gates, while the second table limits qiskit to fused gates of at most two qubits, as the qibo implementation does. Note that in Qibo we cannot go beyond two qubits with the custom backends, since we only have kernels for up to two-qubit gates.

Qiskit performance is still better. The difference could be due to the different algorithm used, to the C++ vs Python overhead, or more likely both.
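To put the tables above in relative terms, the following sketch computes the speedup from fusion at nqubits=30. The numbers are copied from the simulation time columns of the three tables; the dictionary layout and the `speedup` helper are just illustrative names.

```python
# Simulation times at nqubits=30, copied from the three tables above.
sim_times = {
    "qiskit": {
        "no fusion": 20.57615,
        "two-qubit fusion": 6.62070,
        "default fusion": 5.36761,
    },
    "qibo": {
        "no fusion": 19.58396,
        "two-qubit fusion": 9.05363,
    },
}


def speedup(times, mode):
    """Ratio of the no-fusion time to the time measured with `mode`."""
    return times["no fusion"] / times[mode]


for library, times in sim_times.items():
    for mode in times:
        if mode != "no fusion":
            print(f"{library:6s} {mode:17s} {speedup(times, mode):.2f}x")
```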

@codecov

codecov bot commented Sep 15, 2021

Codecov Report

Merging #461 (c3e7997) into master (f577a91) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master      #461    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           85        84     -1     
  Lines        11876     11770   -106     
==========================================
- Hits         11876     11770   -106     
Flag Coverage Δ
unittests 100.00% <100.00%> (ø)

Impacted Files Coverage Δ
src/qibo/tests/test_abstract_circuit.py 100.00% <ø> (ø)
src/qibo/tests/test_core_circuit.py 100.00% <ø> (ø)
src/qibo/abstractions/abstract_gates.py 100.00% <100.00%> (ø)
src/qibo/abstractions/circuit.py 100.00% <100.00%> (ø)
src/qibo/abstractions/gates.py 100.00% <100.00%> (ø)
src/qibo/core/circuit.py 100.00% <100.00%> (ø)
src/qibo/core/distcircuit.py 100.00% <100.00%> (ø)
src/qibo/core/gates.py 100.00% <100.00%> (ø)
src/qibo/tests/test_abstract_gates.py 100.00% <100.00%> (ø)
src/qibo/tests/test_core_circuit_features.py 100.00% <100.00%> (ø)
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@stavros11
Member Author

I made a few updates to the gate fusion algorithm and the performance now matches Qiskit. The improvements came mostly from playing with our current Python code and comparing the output fused circuits with the ones from qiskit; I have not looked into the fusion code of other libraries. Here are some benchmarks for several different circuits from the qibojit-benchmarks repository:

creation_time - variational
nqubits qiskit qiskit (two qubit) qibo
18 0.04065 0.04146 0.00298
19 0.04141 0.04129 0.00303
20 0.04184 0.04184 0.00311
21 0.04230 0.04263 0.00326
22 0.04275 0.04314 0.00328
23 0.04356 0.04288 0.00338
24 0.04391 0.04337 0.00352
25 0.04380 0.04397 0.00358
26 0.04400 0.04412 0.00439
27 0.04458 0.04424 0.00415
28 0.04433 0.04484 0.00434
29 0.04500 0.04454 0.00436
30 0.04515 0.04539 0.00467
creation_time - supremacy
nqubits qiskit qiskit (two qubit) qibo
18 0.03981 0.04025 0.00488
19 0.04090 0.04122 0.00529
20 0.04104 0.04130 0.00541
21 0.04174 0.04187 0.00552
22 0.04141 0.04189 0.00576
23 0.04250 0.04177 0.00591
24 0.04239 0.04267 0.00605
25 0.04279 0.04303 0.00586
26 0.04273 0.04345 0.00640
27 0.04383 0.04283 0.00659
28 0.04348 0.04363 0.00670
29 0.04425 0.04426 0.00695
30 0.04479 0.04450 0.00700
creation_time - bv
nqubits qiskit qiskit (two qubit) qibo
18 0.03965 0.03980 0.00258
19 0.04049 0.04021 0.00265
20 0.04052 0.04131 0.00266
21 0.04087 0.04071 0.00291
22 0.04138 0.04069 0.00287
23 0.04158 0.04153 0.00307
24 0.04133 0.04157 0.00296
25 0.04186 0.04221 0.00308
26 0.04202 0.04235 0.00355
27 0.04257 0.04286 0.00368
28 0.04266 0.04307 0.00379
29 0.04585 0.04325 0.00393
30 0.04290 0.04385 0.00412
creation_time - qv
nqubits qiskit qiskit (two qubit) qibo
18 0.28577 0.27792 0.01087
19 0.28534 0.29191 0.01091
20 0.27896 0.28904 0.01124
21 0.29239 0.28102 0.01137
22 0.28758 0.29007 0.01222
23 0.28676 0.28332 0.01210
24 0.29299 0.28341 0.01297
25 0.29028 0.28740 0.01264
26 0.29127 0.29472 0.01334
27 0.28485 0.28685 0.01293
28 0.28586 0.29436 0.01388
29 0.29285 0.29762 0.01394
30 0.28904 0.29681 0.01443
creation_time - qft
nqubits qiskit qiskit (two qubit) qibo
18 0.06117 0.06059 0.00932
19 0.06315 0.06315 0.01042
20 0.06579 0.06515 0.01115
21 0.06869 0.06825 0.01186
22 0.07184 0.07263 0.01312
23 0.07540 0.07620 0.01405
24 0.08013 0.07941 0.01547
25 0.08299 0.08198 0.01652
26 0.08595 0.08740 0.01741
27 0.08899 0.09135 0.01900
28 0.09578 0.09446 0.02028
29 0.09808 0.09828 0.02105
30 0.10188 0.10211 0.02213
dry_run_time - variational
nqubits qiskit qiskit (two qubit) qibo
18 0.00807 0.01376 0.05248
19 0.01259 0.01208 0.05357
20 0.01092 0.01427 0.05899
21 0.01438 0.01910 0.06198
22 0.02122 0.02807 0.07461
23 0.03211 0.03801 0.08826
24 0.06010 0.07681 0.14901
25 0.14386 0.18610 0.22061
26 0.28454 0.36022 0.42207
27 0.58965 0.74806 0.71549
28 1.18559 1.49037 1.56468
29 2.46222 3.08915 2.95549
30 4.92414 6.13211 6.20692
dry_run_time - supremacy
nqubits qiskit qiskit (two qubit) qibo
18 0.00930 0.01138 0.49117
19 0.01167 0.01289 0.19414
20 0.01160 0.01363 0.19067
21 0.01682 0.01789 0.19895
22 0.02217 0.02467 0.20838
23 0.03376 0.03940 0.22358
24 0.05296 0.07066 0.27306
25 0.12937 0.19672 0.38054
26 0.24693 0.40252 0.54773
27 0.47316 0.82409 0.98895
28 1.00354 1.68372 1.73485
29 1.98238 3.34738 3.58885
30 3.89945 6.93802 6.76083
dry_run_time - bv
nqubits qiskit qiskit (two qubit) qibo
18 0.00899 0.00975 0.04902
19 0.00919 0.01196 0.05109
20 0.01258 0.01251 0.05455
21 0.01414 0.01715 0.06031
22 0.02040 0.02382 0.06584
23 0.03422 0.04285 0.09790
24 0.06083 0.08257 0.10464
25 0.14899 0.25198 0.28223
26 0.29465 0.50953 0.51721
27 0.59213 1.03156 1.06625
28 1.19239 2.10442 2.12164
29 2.41061 4.33946 4.35946
30 5.04592 8.90709 8.49319
dry_run_time - qv
nqubits qiskit qiskit (two qubit) qibo
18 0.01170 0.01025 0.05073
19 0.00926 0.01019 0.05623
20 0.01208 0.01235 0.05852
21 0.01621 0.01432 0.06097
22 0.02189 0.02654 0.06573
23 0.03193 0.03205 0.08846
24 0.05694 0.05291 0.13012
25 0.10503 0.14441 0.18442
26 0.21641 0.29920 0.32602
27 0.39561 0.57425 0.58844
28 0.73579 1.20204 1.22720
29 1.46428 2.36381 2.34498
30 3.09096 4.99359 4.74417
dry_run_time - qft
nqubits qiskit qiskit (two qubit) qibo
18 0.04680 0.15951 0.54917
19 0.05828 0.19554 0.06498
20 0.07511 0.24832 0.07566
21 0.09060 0.34120 0.08253
22 0.10543 0.42893 0.10224
23 0.17519 0.52038 0.16160
24 0.30901 1.11017 0.39842
25 0.91094 2.94783 1.04712
26 1.87574 6.13629 2.53819
27 3.88388 12.96822 5.43247
28 8.26834 27.65820 11.06349
29 17.38104 58.59509 23.13947
30 36.62964 124.18945 48.85041
simulation_times_mean - variational
nqubits qiskit qiskit (two qubit) qibo
18 0.00668 0.01099 0.00164
19 0.00804 0.01223 0.00245
20 0.00996 0.01509 0.00387
21 0.01615 0.02021 0.00948
22 0.02322 0.02992 0.01716
23 0.04162 0.04608 0.03306
24 0.06739 0.08019 0.07918
25 0.15706 0.18933 0.16000
26 0.31615 0.38573 0.33222
27 0.65377 0.81682 0.64535
28 1.29587 1.61279 1.40367
29 2.70374 3.31418 2.77074
30 5.37828 6.63932 6.12979
simulation_times_mean - supremacy
nqubits qiskit qiskit (two qubit) qibo
18 0.00594 0.00796 0.00158
19 0.00781 0.00929 0.00246
20 0.00933 0.01110 0.00364
21 0.01437 0.01701 0.00933
22 0.02118 0.02511 0.01620
23 0.03837 0.04523 0.03145
24 0.06270 0.07579 0.06293
25 0.13863 0.20276 0.17734
26 0.26971 0.42590 0.37606
27 0.53298 0.87264 0.77061
28 1.12530 1.79977 1.51866
29 2.20068 3.57973 3.23281
30 4.37278 7.40979 6.68904
simulation_times_mean - bv
nqubits qiskit qiskit (two qubit) qibo
18 0.00673 0.00895 0.00161
19 0.00763 0.01036 0.00295
20 0.00935 0.01198 0.00473
21 0.01569 0.01887 0.01063
22 0.02398 0.02925 0.01886
23 0.04112 0.04957 0.03844
24 0.07207 0.08864 0.05902
25 0.16021 0.25933 0.22253
26 0.32227 0.53419 0.48578
27 0.65315 1.08912 0.95682
28 1.33140 2.22416 1.95658
29 2.63782 4.57933 3.98087
30 5.52658 9.39029 8.38702
simulation_times_mean - qv
nqubits qiskit qiskit (two qubit) qibo
18 0.00681 0.00655 0.00127
19 0.00729 0.00759 0.00199
20 0.00922 0.00922 0.00315
21 0.01402 0.01474 0.00791
22 0.02263 0.02262 0.01417
23 0.03734 0.03614 0.02619
24 0.06255 0.06302 0.06671
25 0.11140 0.15199 0.11946
26 0.23175 0.31978 0.27303
27 0.44798 0.62832 0.50357
28 0.86967 1.30972 1.07623
29 1.68772 2.60102 2.13045
30 3.55397 5.46923 4.65302
simulation_times_mean - qft
nqubits qiskit qiskit (two qubit) qibo
18 0.04974 0.16390 0.00687
19 0.05759 0.19229 0.00953
20 0.07467 0.24225 0.01507
21 0.09373 0.30339 0.02773
22 0.14522 0.39349 0.04980
23 0.18156 0.55178 0.11046
24 0.37427 0.96208 0.28709
25 0.91660 2.96658 0.84098
26 1.89046 6.20743 2.39814
27 3.97402 13.06850 5.16271
28 8.47476 27.96610 10.89850
29 17.62641 58.98025 23.01198
30 37.03798 124.84739 48.54022

Here I am comparing with qiskit using fusion_max_qubit=2, since the Qibo fusion is limited to two-qubit gates too. As we can see, the default qiskit setting, which goes up to five-qubit gates, provides a further advantage in many cases; however, matching it would require us to implement a five-qubit kernel in all custom backends.

Apart from this, the only thing remaining for this PR is to implement set_parameters properly for the new fusion scheme. It would also be useful to compare performance with other libraries that provide fusion and circuit optimizations, such as Qulacs or Tensorflow Quantum.

@scarrazza
Member

@stavros11 following the discussion today, before merging this PR we should verify:

  • what happens with performance for `fusion_max_qubit > 5`
  • run the qibo fusion on GPU
  • the possibility of testing fusion with more qubits

@stavros11
Member Author

Thanks for the summary and the list, perhaps I would add an additional point for comparing performance with libraries other than qiskit, such as Qulacs and Tensorflow Quantum.

I fixed the set_parameters functionality for fused circuits, so this PR should now be complete in terms of features. On the Qibo side, it remains to check performance on GPU and see whether it is preferable to move the gate matrix products to the GPU instead of pure numpy as they are now.

> • what happens with performance for `fusion_max_qubit > 5`

Regarding this point, I did some benchmarks on the above circuits using qiskit and changing the fusion_max_qubit flag:

simulation_times_mean - variational
nqubits max_qubit=1 max_qubit=2 max_qubit=3 max_qubit=4 max_qubit=5 max_qubit=6 max_qubit=7 max_qubit=8 max_qubit=9 max_qubit=10
18 0.02405 0.01088 0.00660 0.00664 0.00692 0.00698 0.00699 0.00669 0.00699 0.00669
19 0.02792 0.01196 0.00789 0.00762 0.00802 0.00743 0.00806 0.00853 0.00806 0.00853
20 0.03124 0.01514 0.01033 0.01002 0.01024 0.01032 0.01024 0.00999 0.01024 0.00999
21 0.04314 0.02122 0.01634 0.01611 0.01607 0.01624 0.01581 0.01574 0.01581 0.01574
22 0.05518 0.02971 0.02350 0.02331 0.02391 0.02336 0.02345 0.02340 0.02345 0.02340
23 0.08492 0.04751 0.04119 0.04127 0.04213 0.04132 0.04185 0.04231 0.04185 0.04231
24 0.16684 0.07798 0.06542 0.06767 0.06660 0.06691 0.06848 0.06691 0.06848 0.06691
25 0.55057 0.18935 0.15935 0.15835 0.15687 0.15680 0.15776 0.15984 0.15776 0.15984
26 1.15167 0.39033 0.31787 0.32108 0.31716 0.32029 0.31721 0.32007 0.31721 0.32007
27 2.37243 0.81211 0.65918 0.65656 0.66042 0.65499 0.65360 0.65353 0.65360 0.65353
28 4.90384 1.62537 1.29820 1.30289 1.29559 1.30702 1.29724 1.29812 1.29724 1.29812
29 10.01225 3.30861 2.70313 2.70996 2.71704 2.70586 2.70630 2.70700 2.70630 2.70700
30 20.57989 6.63328 5.37503 5.37237 5.38669 5.35828 5.36507 5.38278 5.36507 5.38278
simulation_times_mean - supremacy
nqubits max_qubit=1 max_qubit=2 max_qubit=3 max_qubit=4 max_qubit=5 max_qubit=6 max_qubit=7 max_qubit=8 max_qubit=9 max_qubit=10
18 0.03041 0.00775 0.00667 0.00606 0.00670 0.00641 0.00605 0.00643 0.00605 0.00643
19 0.03549 0.00909 0.00731 0.00721 0.00767 0.00693 0.00755 0.00740 0.00755 0.00740
20 0.04140 0.01186 0.00954 0.00917 0.00900 0.00942 0.00970 0.00946 0.00970 0.00946
21 0.05386 0.01714 0.01446 0.01446 0.01484 0.01490 0.01391 0.01434 0.01391 0.01434
22 0.08105 0.02462 0.02195 0.02096 0.02120 0.02136 0.02094 0.02130 0.02094 0.02130
23 0.10721 0.04289 0.03809 0.03809 0.03828 0.03826 0.03836 0.03818 0.03836 0.03818
24 0.20390 0.07373 0.06204 0.06208 0.06355 0.06452 0.06360 0.06451 0.06360 0.06451
25 0.68874 0.20261 0.13766 0.13738 0.13814 0.13969 0.13914 0.13769 0.13914 0.13769
26 1.43602 0.42405 0.27219 0.27104 0.26810 0.26946 0.26863 0.26922 0.26863 0.26922
27 2.93484 0.87140 0.52922 0.53110 0.53352 0.52862 0.53120 0.53257 0.53120 0.53257
28 6.03868 1.79730 1.12340 1.12688 1.12264 1.12405 1.11745 1.11310 1.11745 1.11310
29 12.46403 3.57908 2.19462 2.20758 2.20211 2.19717 2.20616 2.21261 2.20616 2.21261
30 25.61836 7.42030 4.37386 4.38552 4.37996 4.37190 4.37334 4.39099 4.37334 4.39099
simulation_times_mean - bv
nqubits max_qubit=1 max_qubit=2 max_qubit=3 max_qubit=4 max_qubit=5 max_qubit=6 max_qubit=7 max_qubit=8 max_qubit=9 max_qubit=10
18 0.02403 0.00862 0.00664 0.00632 0.00641 0.00659 0.00629 0.00634 0.00629 0.00634
19 0.02783 0.01033 0.00756 0.00733 0.00762 0.00754 0.00760 0.00782 0.00760 0.00782
20 0.03305 0.01202 0.00990 0.01017 0.01028 0.01033 0.00988 0.01010 0.00988 0.01010
21 0.04805 0.01867 0.01526 0.01609 0.01516 0.01605 0.01561 0.01525 0.01561 0.01525
22 0.06789 0.02872 0.02396 0.02400 0.02389 0.02462 0.02406 0.02362 0.02406 0.02362
23 0.12376 0.04823 0.04130 0.04082 0.03915 0.04119 0.03971 0.04227 0.03971 0.04227
24 0.27293 0.08890 0.07343 0.07248 0.07359 0.07387 0.07146 0.07179 0.07146 0.07179
25 0.62091 0.25956 0.15887 0.15729 0.15892 0.15839 0.15811 0.15709 0.15811 0.15709
26 1.26396 0.53527 0.31921 0.31940 0.32251 0.32188 0.32086 0.31842 0.32086 0.31842
27 2.57758 1.08288 0.64259 0.63976 0.64593 0.63962 0.63936 0.64198 0.63936 0.64198
28 5.29699 2.22619 1.32634 1.33162 1.32600 1.32281 1.33697 1.32669 1.33697 1.32669
29 10.89181 4.57231 2.64353 2.63743 2.63712 2.63570 2.64681 2.63405 2.64681 2.63405
30 22.37617 9.38960 5.51818 5.50756 5.51528 5.51009 5.52545 5.49817 5.52545 5.49817
simulation_times_mean - qv
nqubits max_qubit=1 max_qubit=2 max_qubit=3 max_qubit=4 max_qubit=5 max_qubit=6 max_qubit=7 max_qubit=8 max_qubit=9 max_qubit=10
18 0.04322 0.00680 0.00667 0.00635 0.00652 0.00564 0.00660 0.00670 0.00660 0.00670
19 0.04609 0.00739 0.00808 0.00618 0.00729 0.00751 0.00756 0.00750 0.00756 0.00750
20 0.05918 0.00960 0.00982 0.00915 0.00951 0.00963 0.00954 0.00972 0.00954 0.00972
21 0.07193 0.01479 0.01449 0.01440 0.01366 0.01428 0.01349 0.01420 0.01349 0.01420
22 0.10503 0.02255 0.02286 0.02187 0.02223 0.02224 0.02219 0.02216 0.02219 0.02216
23 0.15921 0.03581 0.03606 0.03626 0.03635 0.03646 0.03659 0.03644 0.03659 0.03644
24 0.34033 0.06303 0.06362 0.06336 0.06206 0.06275 0.06365 0.06532 0.06365 0.06532
25 1.03461 0.14939 0.15163 0.11332 0.11507 0.11236 0.11240 0.11072 0.11240 0.11072
26 2.35084 0.31506 0.31706 0.23791 0.22800 0.23376 0.22758 0.23537 0.22758 0.23537
27 4.59989 0.62251 0.62059 0.44545 0.44243 0.44404 0.44742 0.44000 0.44742 0.44000
28 9.85240 1.30905 1.30669 0.86448 0.86715 0.86223 0.85821 0.85909 0.85821 0.85909
29 19.42483 2.60548 2.60134 1.67026 1.66095 1.66718 1.67798 1.67076 1.67798 1.67076
30 41.32197 5.49336 5.45955 3.54717 3.54791 3.55664 3.57428 3.58050 3.57428 3.58050
simulation_times_mean - qft
nqubits max_qubit=1 max_qubit=2 max_qubit=3 max_qubit=4 max_qubit=5 max_qubit=6 max_qubit=7 max_qubit=8 max_qubit=9 max_qubit=10
18 0.09986 0.16404 0.09149 0.06225 0.04967 0.04686 0.05050 0.07268 0.05050 0.07268
19 0.11716 0.19525 0.10615 0.07433 0.05839 0.05606 0.05966 0.08331 0.05966 0.08331
20 0.14500 0.23848 0.13353 0.09241 0.06979 0.06907 0.07173 0.09961 0.07173 0.09961
21 0.19092 0.30773 0.16762 0.11969 0.09653 0.08923 0.09515 0.12232 0.09515 0.12232
22 0.23396 0.40506 0.22470 0.15205 0.11718 0.11908 0.12315 0.14953 0.12315 0.14953
23 0.40148 0.52892 0.29227 0.20953 0.19608 0.19701 0.18965 0.23660 0.18965 0.23660
24 0.59941 0.99779 0.59794 0.45931 0.34976 0.35910 0.33913 0.38624 0.33913 0.38624
25 1.33685 2.95785 1.60361 1.13129 0.91176 0.82218 0.77790 0.78049 0.77790 0.78049
26 3.04149 6.20017 3.31727 2.34929 1.91131 1.71177 1.57888 1.54756 1.57888 1.54756
27 6.64567 13.03866 7.00405 4.94307 3.97760 3.56223 3.25542 3.09612 3.25542 3.09612
28 13.32589 27.87680 14.91064 10.52617 8.43909 7.52417 6.84090 6.49729 6.84090 6.49729
29 26.63698 59.07422 31.48731 22.16421 17.64281 15.68286 14.25087 13.39163 14.25087 13.39163
30 53.24005 124.78964 66.31264 46.81171 37.05214 32.95734 29.85373 28.06654 29.85373 28.06654

I do not see anything special about their default choice max_qubit=5. In most cases performance keeps improving for larger values, but it saturates around 7. I guess at some point it becomes a trade-off between performance and memory, as the fused gate matrices have shape (2^max_qubit, 2^max_qubit). I have not checked the gates of the fused circuits explicitly, so it is not certain that the set max_qubit value is always reached; e.g. it might be that gates are never fused to more than 7 qubits even if we set max_qubit higher, which would explain the saturation.
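To make the memory side of this trade-off concrete, here is a small sketch of how a dense fused matrix grows with max_qubit. It assumes double-precision complex entries (16 bytes each), which is an assumption of the sketch and not something measured here; the function names are illustrative.

```python
def fused_matrix_bytes(max_qubit, itemsize=16):
    """Bytes of one dense (2**max_qubit, 2**max_qubit) matrix.

    itemsize=16 assumes complex128 entries (an assumption of this sketch).
    """
    dim = 2 ** max_qubit
    return dim * dim * itemsize


def statevector_bytes(nqubits, itemsize=16):
    """Bytes of a dense state vector with 2**nqubits amplitudes."""
    return 2 ** nqubits * itemsize


for k in (2, 5, 7, 10, 15):
    print(f"max_qubit={k:2d}: {fused_matrix_bytes(k) / 2**20:.6g} MiB per fused matrix")
print(f"30-qubit state: {statevector_bytes(30) / 2**20:.6g} MiB")
```

A fused matrix on k qubits has as many entries as a state vector on 2k qubits, so single fused matrices stay small next to the state for the max_qubit values benchmarked above.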

Based on these results, I would say it is worth trying to implement kernels for more than two qubits and expanding the fusion algorithm. I am just not sure whether we should preset a maximum possible value for max_qubit, and whether we can define kernels with dynamic size or need a separate kernel for each number of qubits. Kernels for more than two qubits would not be relevant for experiments but may help in simulation, including applications other than fusion.

@scarrazza
Member

@stavros11 thanks for this first point. I believe it would be great to have custom operators with a larger number of qubits, but we can do this in a later PR. Concerning this point, do you have an idea of how much memory our fusion uses compared to the no-fusion circuit?

@stavros11
Member Author

> @stavros11 thanks for this first point. I believe it would be great to have custom operators with a larger number of qubits, but we can do this in a later PR. Concerning this point, do you have an idea of how much memory our fusion uses compared to the no-fusion circuit?

I have experimented a bit with both the qibo and qiskit fusion and I do not observe any significant change in memory usage between using fusion and not. Particularly for qiskit, I played with the max_qubit option and even used circuits where this number equals the total number of qubits (e.g. max_qubit=15 in a circuit with 15 qubits) but still did not observe any change compared to the no-fusion simulation. I guess the fusion algorithm is smart and never creates very large gates, as this would most likely reduce performance. This is probably why we observe a saturation as max_qubit is increased. In Qibo all fused gates are 4x4 matrices, so memory issues are unlikely unless the circuit is exponentially deep, but in such cases there is not much we can do, I think.

> • run the qibo fusion on GPU

Regarding this point, I did some benchmarks using both qibojit and qibotf for the variational circuit example only:

creation_time
nqubits qibojit GPU qibojit CPU qibotf GPU qibotf CPU
20 0.00317 0.00265 0.00282 0.00280
21 0.00306 0.00270 0.00287 0.00287
22 0.00324 0.00289 0.00299 0.00305
23 0.00323 0.00282 0.00502 0.00301
24 0.00343 0.00294 0.00318 0.00322
25 0.00341 0.00299 0.00369 0.00537
26 0.00356 0.00315 0.00338 0.00337
27 0.00410 0.00319 0.00339 0.00539
28 0.00371 0.00321 0.00418 0.00360
29 0.00373 0.00335 0.00595 0.00634
30 0.00397 0.00341 0.00375 0.00477
dry_run_time
nqubits qibojit GPU qibojit CPU qibotf GPU qibotf CPU
20 0.60094 0.19034 0.01593 0.03277
21 0.60293 0.19012 0.01641 0.03719
22 0.60577 0.19227 0.02350 0.04367
23 0.61031 0.21780 0.03070 0.05935
24 0.62081 0.25514 0.04284 0.10947
25 0.64364 0.35161 0.06437 0.20662
26 0.68604 0.52574 0.11755 0.40043
27 0.86415 0.88200 0.18766 0.78747
28 0.97427 1.70433 0.39359 1.57633
29 1.46281 3.19376 0.72674 3.18582
30 2.26731 6.46221 1.51298 6.53077
simulation_times_mean
nqubits qibojit GPU qibojit CPU qibotf GPU qibotf CPU
20 0.00230 0.00349 0.00039 0.00395
21 0.00350 0.00860 0.00039 0.00891
22 0.00599 0.01453 0.00071 0.01760
23 0.01107 0.03244 0.00059 0.04004
24 0.02150 0.07474 0.00074 0.09513
25 0.04188 0.16199 0.00076 0.19555
26 0.08467 0.33764 0.00092 0.39360
27 0.17538 0.70823 0.00048 0.82863
28 0.35841 1.48992 0.00105 1.68576
29 0.74661 2.98614 0.00117 3.43216
30 1.52286 6.29063 0.00122 7.10333

As expected, the GPU helps in the circuit execution. For qibotf the GPU simulation time has the usual issue, so the dry run is a more accurate measurement. Other than that, I believe the numbers are reasonable for both backends.

Regarding performing the gate fusion matrix multiplications on GPU, I tried this for the above example and got the following times:

creation_time
nqubits qibojit GPU qibojit CPU qibotf GPU qibotf CPU
20 0.00304 0.00254 0.00281 0.00279
21 0.00318 0.00259 0.00292 0.00301
22 0.00317 0.00283 0.00299 0.00306
23 0.00323 0.00282 0.00304 0.00441
24 0.00335 0.00302 0.00318 0.00316
25 0.00347 0.00296 0.00318 0.00324
26 0.00355 0.00308 0.00329 0.00544
27 0.00368 0.00313 0.00334 0.00346
28 0.00380 0.00326 0.00348 0.00347
29 0.00382 0.00331 0.00348 0.00361
30 0.00387 0.00334 0.00370 0.00362
dry_run_time
nqubits qibojit GPU qibojit CPU qibotf GPU qibotf CPU
20 1.13637 0.18461 0.63055 0.04980
21 1.14764 0.18538 0.63303 0.04949
22 1.13734 0.19897 0.64205 0.06177
23 1.25338 0.20849 0.64325 0.08613
24 1.15695 0.26259 0.65710 0.12604
25 1.18059 0.34068 0.67590 0.23626
26 1.22432 0.52480 0.75341 0.42491
27 1.31770 0.88000 0.84305 0.80202
28 1.49975 1.69440 1.03921 1.62485
29 1.89579 3.11498 1.35399 3.14233
30 2.68803 6.38584 2.19434 6.48759
simulation_times_mean
nqubits qibojit GPU qibojit CPU qibotf GPU qibotf CPU
20 0.00229 0.00320 0.00040 0.00392
21 0.00351 0.00864 0.00039 0.00918
22 0.00597 0.01453 0.00071 0.01687
23 0.01108 0.03181 0.00044 0.04127
24 0.02128 0.07351 0.00046 0.09235
25 0.04206 0.15654 0.00046 0.18817
26 0.08415 0.33237 0.00080 0.37578
27 0.17478 0.68841 0.00102 0.79483
28 0.35781 1.40665 0.00110 1.61881
29 0.74544 2.85560 0.00051 3.18566
30 1.52842 6.13281 0.00126 6.74426

So we notice that the GPU dry run times become significantly slower for both backends. The difference is due to the fused matrix calculations, which are only done during the dry run because these matrices are cached once calculated. To be more concrete, in terms of code, doing the fusion operation on GPU means that the backend K is used instead of K.qnp when constructing the matrix of the fused gate.
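To illustrate why only the dry run pays for the fused-matrix products, here is a minimal sketch of the caching pattern. `CachedFusedGate` and its `builds` counter are hypothetical names for illustration, not Qibo classes; the sketch only assumes that the fused matrix is built on first access and then reused.

```python
from functools import cached_property

import numpy as np


class CachedFusedGate:
    """Hypothetical stand-in for a fused gate; not Qibo's FusedGate."""

    def __init__(self, matrices):
        self.matrices = matrices  # matrices of the gates being fused, in order
        self.builds = 0           # counts how often the product is computed

    @cached_property
    def matrix(self):
        # Runs only on first access; the result is then stored on the instance.
        self.builds += 1
        product = np.eye(self.matrices[0].shape[0])
        for m in self.matrices:
            product = m @ product  # later gates multiply from the left
        return product


gate = CachedFusedGate([np.eye(4), 2 * np.eye(4)])
first = gate.matrix   # "dry run": the product is computed here
second = gate.matrix  # later executions reuse the cached matrix
print(gate.builds, first is second)
```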

Based on the above results, I believe it is preferable, at least for the two qubit fusion, to keep these operations on numpy/CPU (that is keep K.qnp as it is now). Another way to see this is to benchmark multiplying small matrices directly on GPU:

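import time
import cupy as cp

matrices = [cp.random.random((4, 4)) for _ in range(100)]

start_time = time.time()
t = cp.eye(4)
for m in matrices:
    t = m @ t  # each product launches a GPU kernel
cp.cuda.Stream.null.synchronize()  # wait for the GPU before stopping the timer
print(time.time() - start_time)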

vs doing the operation on CPU and casting back on GPU:

import time
import cupy as cp
import numpy as np

matrices = [cp.random.random((4, 4)) for _ in range(100)]

start_time = time.time()
t = np.eye(4)
for m in matrices:
    t = m.get() @ t  # m must be transferred from GPU to CPU
t = cp.array(t)  # result transferred back to GPU
print(time.time() - start_time)

On my local machine the latter is more than 1000x faster for these dimensions, despite the transfers. The GPU becomes faster only when I increase the size of the matrices, e.g. to 1024x1024.

With that in mind I believe this PR is ready as it is. We should consider extending the kernels to more qubits in the future.

@stavros11 stavros11 marked this pull request as ready for review September 16, 2021 14:29
@stavros11 stavros11 changed the title from "[WIP] Updating gate fusion" to "Updating gate fusion" Sep 16, 2021
@scarrazza
Member

@stavros11 thank you very much for this. I also believe the CPU projection approach is acceptable and fully justified.

@scarrazza
Member

@mlazzarin could you please perform a first review of this PR, e.g. checking the code, example and docs?

@scarrazza scarrazza requested a review from mlazzarin September 18, 2021 18:49
@mlazzarin
Contributor

> @mlazzarin could you please perform a first review of this PR, e.g. checking the code, example and docs?

Ok, I could do that tomorrow.

Contributor

@mlazzarin mlazzarin left a comment

Looks good to me, I made a few minor remarks regarding comments/docstrings.

@@ -31,8 +31,7 @@ def test_circuit_add_layer(backend, nqubits, accelerators):
     for gate in c.queue:
         assert isinstance(gate, gates.Unitary)

-# TODO: Test `_fuse_copy`
-# TODO: Test `fuse`
+# :meth:`qibo.core.circuit.Circuit` is tested in `test_core_fusion.py`
Contributor

Suggested change
-# :meth:`qibo.core.circuit.Circuit` is tested in `test_core_fusion.py`
+# :meth:`qibo.core.circuit.Circuit.fuse()` is tested in `test_core_fusion.py`

src/qibo/core/circuit.py (resolved)
src/qibo/abstractions/gates.py (resolved)
Comment on lines 1084 to 1111
class FusedGate(MatrixGate, abstract_gates.FusedGate):

    def __init__(self, *q):
        BackendGate.__init__(self)
        abstract_gates.FusedGate.__init__(self, *q)
        if self.gate_op:
            if len(self.target_qubits) == 1:
                self.gate_op = K.op.apply_gate
            elif len(self.target_qubits) == 2:
                self.gate_op = K.op.apply_two_qubit_gate
            else:
                raise_error(NotImplementedError, "Fused gates can target up to two qubits.")

    def _construct_unitary(self):
        matrix = K.qnp.eye(2 ** len(self.target_qubits))
        for gate in self.gates:
            gmatrix = K.to_numpy(gate.matrix)
            if len(gate.qubits) < len(self.target_qubits):
                if gate.qubits[0] == self.target_qubits[0]:
                    gmatrix = K.qnp.kron(gmatrix, K.qnp.eye(2))
                else:
                    gmatrix = K.qnp.kron(K.qnp.eye(2), gmatrix)
            elif gate.qubits != self.target_qubits:
                gmatrix = K.qnp.reshape(gmatrix, 4 * (2,))
                gmatrix = K.qnp.transpose(gmatrix, [1, 0, 3, 2])
                gmatrix = K.qnp.reshape(gmatrix, (4, 4))
            matrix = gmatrix @ matrix
        return K.cast(matrix)
Contributor

Maybe it would be useful to add a docstring that explains where this class is used, and some comments that highlight where the two-qubit limit is hard-coded (it may be useful in the future).
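The qubit handling in the quoted `_construct_unitary` can be checked numerically with plain NumPy (used here in place of `K.qnp`, which is an assumption of this sketch). The reshape/transpose/reshape sequence gives the same result as conjugating the matrix with a SWAP gate, and the two `kron` branches place a one-qubit matrix on the first or second target qubit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary two-qubit matrix; it does not need to be unitary for this check.
gmatrix = rng.random((4, 4)) + 1j * rng.random((4, 4))

# Same three steps as in the quoted code: exchange the roles of the two qubits.
swapped = np.reshape(gmatrix, 4 * (2,))
swapped = np.transpose(swapped, [1, 0, 3, 2])
swapped = np.reshape(swapped, (4, 4))

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]])
same_as_swap = np.allclose(swapped, SWAP @ gmatrix @ SWAP)

# One-qubit matrices embedded on the first and second target qubit.
a = rng.random((2, 2))
b = rng.random((2, 2))
on_first = np.kron(a, np.eye(2))
on_second = np.kron(np.eye(2), b)
same_as_kron = np.allclose(on_first @ on_second, np.kron(a, b))

print(same_as_swap, same_as_kron)
```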

@stavros11
Member Author

> Looks good to me, I made a few minor remarks regarding comments/docstrings.

Thank you for reviewing this. I added some docstrings and comments in the suggested places; please let me know if these are easy to follow and make the code any easier to read.

@mlazzarin
Contributor

Thank you very much, they are useful indeed.

@scarrazza scarrazza merged commit fc1eb31 into master Sep 22, 2021
@scarrazza scarrazza deleted the fusion branch October 5, 2021 18:57