Updating gate fusion #461
Codecov Report

```diff
@@           Coverage Diff           @@
##            master     #461   +/-  ##
==========================================
  Coverage   100.00%  100.00%
==========================================
  Files           85       84     -1
  Lines        11876    11770   -106
==========================================
- Hits         11876    11770   -106
```
I made a few updates to the gate fusion algorithm and the performance now matches Qiskit's. The improvements came mostly from playing with our current Python code and comparing the output fused circuits with the ones from qiskit; I have not looked into the fusion code of other libraries. Here are some benchmarks for several different circuits from the qibojit-benchmarks repository:

[Benchmark plots: creation_time, dry_run_time and simulation_times_mean for the variational, supremacy, bv, qv and qft circuits.]
Here I am comparing with qiskit using the … Apart from this, the only thing remaining for this PR is to implement the …
@stavros11 following the discussion today, before merging this PR we should verify:
Thanks for the summary and the list; I would perhaps add an additional point comparing performance with libraries other than qiskit, such as Qulacs and TensorFlow Quantum. I fixed the …
Regarding this point, I did some benchmarks on the above circuits using qiskit and changing the max_qubit option:

[Benchmark plots: simulation_times_mean for the variational, supremacy, bv, qv and qft circuits at different max_qubit values.]
I do not see anything special about their default choice max_qubit=5. In most cases performance keeps increasing for larger values, but there is a saturation around 7. I guess at some point it becomes a trade-off between performance and memory, since the fused gate matrices have shape (2^max_qubit, 2^max_qubit). I have not checked the gates of the fused circuits explicitly, so it is not certain that the set max_qubit value is always reached; e.g. it might be that gates are never fused to more than 7 qubits even if we set max_qubit higher, which would explain the saturation. Based on these results, I would say it is worth implementing kernels for more than two qubits and extending the fusion algorithm. I am just not sure whether we should preset a maximum possible value for max_qubit, and whether we can define kernels with dynamic size or need a separate kernel for each qubit number. Kernels for more than two qubits would not be relevant for experiments, but they may help in simulation, including applications other than fusion.
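For context on the memory side of this trade-off, here is a quick back-of-the-envelope sketch (assuming complex128 entries, i.e. 16 bytes each; the function name is just for illustration) of the size of a single fused gate matrix:

```python
# Memory footprint of one fused gate matrix of shape
# (2**max_qubit, 2**max_qubit), assuming complex128 entries (16 bytes each).
def fused_matrix_bytes(max_qubit):
    dim = 2 ** max_qubit
    return dim * dim * 16

for mq in (2, 5, 7):
    print(f"max_qubit={mq}: {fused_matrix_bytes(mq)} bytes")
# max_qubit=2 -> 256 B, max_qubit=5 -> 16 KiB, max_qubit=7 -> 256 KiB
```

Even at max_qubit=7 a single fused matrix is only 256 KiB, so the per-gate matrices themselves are cheap; memory pressure would come from elsewhere (e.g. the state vector).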
@stavros11 thanks for this first point. I believe it would be great to have custom operators with a larger number of qubits, but we can do this in a later PR. Concerning this point, do you have an idea of how much memory our fusion uses compared to the no-fusion circuit?
I have experimented a bit with both the qibo and qiskit fusion and I do not observe any significant change in memory usage between fusion and no fusion. In particular for qiskit, I played with the max_qubit option and even used circuits where this number equals the total number of qubits (e.g. max_qubit=15 in a circuit with 15 qubits), but still did not observe any change compared to the no-fusion simulation. I guess the fusion algorithm is smart and never creates very large gates, as this would most likely reduce performance; this is probably also why we observe a saturation as max_qubit is increased. In Qibo all fused gates are 4x4 matrices, so it is unlikely to hit memory issues unless the circuit is exponentially deep, and in that case there is not much we can do anyway.
Regarding this point, I did some benchmarks using both qibojit and qibotf for the variational circuit example only:

[Benchmark plots: creation_time, dry_run_time and simulation_times_mean for the variational circuit on both backends.]
As expected, the GPU helps in the circuit execution. For qibotf the GPU simulation time has the usual issue, so the dry run is a more accurate measurement. Other than that, I believe the numbers are reasonable for both backends. Regarding performing the gate fusion matrix multiplications on GPU, I tried this for the above example and got the following times:

[Benchmark plots: creation_time, dry_run_time and simulation_times_mean with the fusion matrices computed on GPU.]
So we notice that the GPU dry run times become significantly slower for both backends. The difference is due to the fused matrix calculations, which are only done during the dry run since these matrices are cached once calculated. To be more concrete, in terms of code, doing the fusion operation on GPU means that the backend performs these small matrix multiplications on GPU arrays instead of numpy. Based on the above results, I believe it is preferable, at least for the two-qubit fusion, to keep these operations on numpy/CPU. Compare doing the operation on GPU:

```python
import time

import cupy as cp

matrices = [cp.random.random((4, 4)) for _ in range(100)]

start_time = time.time()
t = cp.eye(4)
for m in matrices:
    t = m @ t
print(time.time() - start_time)
```

vs doing the operation on CPU and casting back on GPU:

```python
import time

import cupy as cp
import numpy as np

matrices = [cp.random.random((4, 4)) for _ in range(100)]

start_time = time.time()
t = np.eye(4)
for m in matrices:
    t = m.get() @ t  # m must be transferred from GPU to CPU
t = cp.array(t)  # result transferred back to GPU
print(time.time() - start_time)
```

On my local machine the latter is more than 1000x faster for these dimensions, despite the transfers. The GPU becomes faster only when I increase the matrix size, e.g. to 1024x1024. With that in mind, I believe this PR is ready as it is. We should consider extending the kernels to more qubits in the future.
@stavros11 thank you very much for this. I also believe the CPU projection approach is acceptable and fully justified.
@mlazzarin could you please perform a first review of this PR, e.g. checking the code, example and docs?
Ok, I could do that tomorrow.
Looks good to me, I made a few minor remarks regarding comments/docstrings.
src/qibo/tests/test_core_circuit.py (Outdated)

```diff
@@ -31,8 +31,7 @@ def test_circuit_add_layer(backend, nqubits, accelerators):
     for gate in c.queue:
         assert isinstance(gate, gates.Unitary)

-# TODO: Test `_fuse_copy`
-# TODO: Test `fuse`
+# :meth:`qibo.core.circuit.Circuit` is tested in `test_core_fusion.py`
```
Suggested change:

```diff
-# :meth:`qibo.core.circuit.Circuit` is tested in `test_core_fusion.py`
+# :meth:`qibo.core.circuit.Circuit.fuse()` is tested in `test_core_fusion.py`
```
src/qibo/core/gates.py (Outdated)
```python
class FusedGate(MatrixGate, abstract_gates.FusedGate):

    def __init__(self, *q):
        BackendGate.__init__(self)
        abstract_gates.FusedGate.__init__(self, *q)
        if self.gate_op:
            if len(self.target_qubits) == 1:
                self.gate_op = K.op.apply_gate
            elif len(self.target_qubits) == 2:
                self.gate_op = K.op.apply_two_qubit_gate
            else:
                raise_error(NotImplementedError,
                            "Fused gates can target up to two qubits.")

    def _construct_unitary(self):
        matrix = K.qnp.eye(2 ** len(self.target_qubits))
        for gate in self.gates:
            gmatrix = K.to_numpy(gate.matrix)
            if len(gate.qubits) < len(self.target_qubits):
                if gate.qubits[0] == self.target_qubits[0]:
                    gmatrix = K.qnp.kron(gmatrix, K.qnp.eye(2))
                else:
                    gmatrix = K.qnp.kron(K.qnp.eye(2), gmatrix)
            elif gate.qubits != self.target_qubits:
                gmatrix = K.qnp.reshape(gmatrix, 4 * (2,))
                gmatrix = K.qnp.transpose(gmatrix, [1, 0, 3, 2])
                gmatrix = K.qnp.reshape(gmatrix, (4, 4))
            matrix = gmatrix @ matrix
        return K.cast(matrix)
```
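To see why the reshape/transpose in `_construct_unitary` handles gates whose qubits appear in reversed order relative to the target qubits, note that exchanging the two input axes and the two output axes of a 4x4 matrix is equivalent to conjugating it by SWAP. A quick standalone numpy check (not Qibo code, just the same index trick):

```python
import numpy as np

# CNOT written with its first qubit as control.
cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
swap = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=float)

# Same reshape/transpose trick as in _construct_unitary: swap the two
# row (output) axes and the two column (input) axes of the 4x4 matrix.
g = np.reshape(cnot, 4 * (2,))
g = np.transpose(g, [1, 0, 3, 2])
g = np.reshape(g, (4, 4))

# Equivalent to conjugation by SWAP.
print(np.allclose(g, swap @ cnot @ swap))  # True
```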
Maybe it would be useful to add a docstring that explains where this class is used, and some comments highlighting where the two-qubit limit is hard-coded (it may be useful in the future).
Thank you for reviewing this. I added docstrings and comments in the suggested places; please let me know if they are easy to follow and make the code easier to read through.
Thank you very much, they are useful indeed.
This attempts to improve gate fusion to match the performance of other libraries (see #451). I tried to check what other libraries do in this case: qiskit, qsim and qulacs all seem to implement the fusion in C++ (see for example here for the qiskit implementation).
Here I implemented a single algorithm which, in a first pass, fuses one-qubit gates and, in a second pass, fuses two-qubit gates with the neighboring one-qubit gates. There is no specific selection of which gates will be fused; we just loop through the gates and perform all possible fusions in order.
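As a toy illustration of the first pass (not the actual Qibo implementation — gates here are plain `(qubits, matrix)` pairs and the helper name is hypothetical), consecutive one-qubit gates on the same qubit can be collapsed into a single 2x2 matrix:

```python
import numpy as np

def fuse_one_qubit(queue):
    """Greedy first pass: merge runs of consecutive one-qubit gates acting
    on the same qubit; gates are (qubits, matrix) pairs."""
    fused = []
    for qubits, matrix in queue:
        if (fused and len(qubits) == 1 and len(fused[-1][0]) == 1
                and fused[-1][0] == qubits):
            # Later gate acts after the earlier one, so it multiplies from the left.
            fused[-1] = (qubits, matrix @ fused[-1][1])
        else:
            fused.append((qubits, matrix))
    return fused

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
queue = [((0,), H), ((0,), X), ((1,), H), ((0, 1), np.eye(4))]
print(len(fuse_one_qubit(queue)))  # H and X on qubit 0 merge -> 3 gates
```

The second pass would then absorb the surviving one-qubit matrices into neighboring two-qubit gates via kron products, as done in `FusedGate._construct_unitary`.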
Here is some performance comparison with qiskit using the variational circuit benchmark, with and without fusion:

[Benchmark tables: no fusion; fusion with `fusion_max_qubit=2`; default qiskit fusion.]
The third table uses the default qiskit fusion, which goes up to 5-qubit gates, while the second table limits qiskit to two-qubit fused gates, as the qibo implementation does. Note that in Qibo we cannot go beyond two qubits with the custom backends, since we only have kernels for up to two-qubit gates.
Qiskit performance is still better. The difference could be due to the different algorithm used, to the C++ vs Python overhead, or, more likely, both.