% vcl_performance.tex
% chapter included in vclmanual.tex
\documentclass[vcl_manual.tex]{subfiles}
\begin{document}
\chapter{Performance considerations}\label{chap:PerformanceConsiderations}
\section{Comparison of alternative methods for writing SIMD code}\label{AlternativeMethodsForSIMDCode}
\flushleft
The SIMD (Single Instruction Multiple Data) instructions play an important role when software performance has to be optimized. Several different ways of writing SIMD code are discussed below.
\vspacesmall
\textbf{Assembly code}\\
Assembly programming is the ultimate way of optimizing code. Almost anything is possible in assembly code, but it is quite tedious and error-prone. There are thousands of different instructions, and it is quite difficult to remember which instruction belongs to which instruction set extension. Assembly code is difficult to document, difficult to debug, and difficult to maintain.
\vspacesmall
\textbf{Intrinsic functions}\\
Several compilers support intrinsic functions that are direct representations of machine instructions. A big advantage of using intrinsic functions rather than assembly code is that the compiler takes care of register allocation, function calling conventions, and other details that are difficult to keep track of when writing assembly code. Another advantage is that the compiler can optimize the code further by such methods as scheduling, interprocedural optimization, function inlining, constant propagation, common subexpression elimination, loop-invariant code motion, induction variable optimizations, etc. Many of these optimization methods are rarely used in assembly code because they make the code unwieldy and unmanageable. Consequently, the combination of intrinsic functions and a good optimizing compiler can often produce more efficient code than what a decent assembly programmer would write.
\vspacesmall
A disadvantage of intrinsic functions is that these functions have long names that are difficult to remember and they make the code look awkward.
\vspacesmall
\textbf{Intel vector classes}\\
Intel has published a number of vector classes in the form of three C++ header files named fvec.h, dvec.h and ivec.h. These are simpler to use than the intrinsic functions, but unfortunately the Intel vector class files are poorly maintained; they provide only the most basic functionality; and Intel has done very little to promote, support, or develop them. The Intel vector classes have no way of converting data between arrays and vectors. This leaves us with no way of putting data into a vector other than specifying each element separately - which pretty much destroys the advantage of using vectors. The Intel vector classes work only with Intel and MS compilers.
\vspacesmall
\textbf{The VCL vector class library}\\
The present vector class library has several important features, listed on page \pageref{Features}. It provides the same level of optimization as the intrinsic functions, but it is much easier to use. This makes it possible to make optimal use of the SIMD instructions without the need to remember the thousands of different instructions or intrinsic functions. It also takes away the hassle of remembering which instruction belongs to which instruction set extension and making different code versions for different instruction sets.
\vspacesmall
\textbf{Automatic vectorization}\\
A good optimizing compiler is able to automatically transform linear code to vector code in simple cases. Typically, a good compiler will vectorize an algorithm that loops through an array and does some calculations on each array element.
\vspacesmall
Automatic vectorization is the easiest way of generating SIMD code, and I would very much recommend using this method when it works. Automatic vectorization may fail or produce suboptimal code in the following cases:
\begin{itemize}
\item when the algorithm is too complex.
\item when data have to be re-arranged in order to fit into vectors and it is not obvious to the compiler how to do this, or when other parts of the code need to be changed to handle the re-arranged data.
\item when it is not known to the compiler which data sets are bigger or smaller than the vector size.
\item when it is not known to the compiler whether the size of a data set is a multiple of the vector size or not.
\item when the algorithm involves calls to functions that are defined elsewhere or cannot be inlined and are not readily available in vector versions.
\item when the algorithm involves many branches that are not easily vectorized.
\item when the compiler cannot rule out that not-taken branches may generate false exceptions or other side effects.
\item when floating point operations have to be reordered or transformed and it is not known to the compiler whether these transformations are permissible with respect to precision, overflow, etc.
\item when functions are implemented with lookup tables.
\end{itemize}
\vspacesmall
The vector class library is intended as a good alternative when automatic vectorization fails to produce optimal code for any of these reasons.
\vspacesmall
\section{Choice of compiler and function libraries}\label{ChoiceOfCompiler}
It is recommended to compile for 64-bit mode because this gives access to more memory and more registers. The CPU gives you access to only 8 vector registers in 32-bit mode, but 32 vector registers in 64-bit mode if the AVX512 instruction set is enabled. Compiler options for different instruction sets are listed in table \ref{table:CommandLineOptions}.
\vspacesmall
The vector class library has support for the following compilers:
\vspacebig
\textbf{Gnu C++ compiler}\\
This compiler has produced very good optimizations in my tests. The Gnu compiler (g++) is available for all x86 and x86-64 platforms.
\vspacesmall
There are several versions of the Gnu compiler for the Windows platform. The version that comes with msys2 is recommended. The Cygwin64 version is not recommended because it uses a less efficient memory model.
\vspacesmall
Options -O2 or -O3 are recommended for storing vectors in registers and for peeling off the trivial overhead of objects, constructors, etc.
Options -fno-trapping-math and -fno-math-errno facilitate vectorization by disabling old exception handling methods that are not suitable for vectors.
Do not use the option -ffast-math or -ffinite-math-only if you want to rely on INF and NAN because these options may disable the detection of INF and NAN.
\vspacebig
\textbf{Clang C++ compiler}\\
This compiler has produced the best optimized code in my tests. The Clang compiler is available for all x86 and x86-64 platforms.
\vspacesmall
There are different versions of Clang available for the Windows platform.
The msys2 version, and the version that comes as a plugin for Visual Studio are both recommended.
The Cygwin64 version is not recommended because it uses a less efficient memory model.
\vspacesmall
Use the same compiler options as for the Gnu C++ compiler mentioned above.
\vspacebig
\textbf{Microsoft Visual Studio}\\
This is a very popular compiler for Windows because it has a good and user-friendly IDE (Integrated Development Environment) and debugger. Make sure you are compiling for the "unmanaged" version, i.e. \emph{not} using the .NET common language runtime (CLR).
\vspacesmall
The Microsoft compiler optimizes reasonably well, but not as well as the other compilers. Support for the latest instruction sets is incomplete.
\vspacesmall
You may use the plugin for the Clang compiler in Visual Studio to combine a good compiler with a good IDE.
\vspacesmall
Do not use the option /fp:fast on a Microsoft compiler because
this may disable the detection of INF and NAN.
\vspacebig
\textbf{Intel C++ compiler}\\
Version 2021 or later of an Intel C++ compiler is required for compiling VCL version 2.xx.
\vspacesmall
The Intel C++ compiler currently comes in two versions: "Intel C++ Compiler Classic" (icc for Linux and icl for Windows) and "Intel oneAPI LLVM-based C++ Compiler" (icx). The classic version is a continuation of previous versions and is not recommended for new projects. The LLVM-based version is based on a Clang compiler with additional "Intel proprietary optimizations and code generation".
\vspacesmall
Note that the Intel compiler "Classic" and some of the function libraries favour Intel CPUs and produce code that runs slower than necessary on CPUs of any other brand than Intel. Do not use the Intel compiler "Classic" for software that may run on non-Intel microprocessors. Code produced by the Intel LLVM-based Compiler will usually give good performance on non-Intel processors. Avoid command line options beginning with -x or /Qx. Code compiled with these options can only run on Intel processors. See table \ref{table:CommandLineOptions} for an overview of command line options.
\vspacesmall
You may use the Intel LLVM-based Compiler if you need Intel-specific features. Otherwise, you may as well prefer the pure Clang compiler, which is almost identical.
\vspacebig
\textbf{Conclusion}\\
My recommendation is to use the Clang or Gnu compiler for the release version of a program when performance is important. Microsoft Visual Studio may be a convenient aid in the development phase of a Windows project. Switching to Clang is easy because there is a Clang/LLVM plugin to Visual Studio.
\vspacebig
\section{Choosing the optimal vector size and precision} \label{ChoosingOptimalVectorSize}
It takes the same time to make a vector addition with vectors of eight single precision floats (\codei{Vec8f}) as with vectors of four double precision floats (\codei{Vec4d}). Likewise, it takes the same time to add two integer vectors whether the vectors have eight 32-bit integers (\codei{Vec8i}) or sixteen 16-bit integers (\codei{Vec16s}). Therefore, it is advantageous to use the lowest precision or resolution that fits the data. It may even be worthwhile to modify a floating point algorithm to reduce loss of precision if this allows you to use single precision rather than double precision. Half precision can improve the performance even further, if supported.
However, you should also take into account the time it takes to convert data from one precision to another. Therefore, it is not good to mix different precisions.
\vspacesmall
The total vector size is 128, 256, or 512 bits, depending on the instruction set.
The 256-bit floating point vectors are advantageous when the AVX instruction set is available and enabled. The 256-bit integer vectors are advantageous under the AVX2 instruction set.
The 512-bit integer and floating point vectors are available with the AVX512 instruction set.
Table \ref{table:integerVectorClasses} on page \pageref{table:integerVectorClasses} lists the recommended instruction set for each vector class.
You can compile multiple versions of your code for different instruction sets as explained in chapter \ref{CPUDispatching} below. This makes it possible to code for the largest vector size in order to make your code ready for the newest CPUs.
For example, if you are using the vector class Vec16f then you will be using 512-bit vectors when the code is running on a CPU that supports AVX512. The code will use two 256-bit vectors instead of one 512-bit vector when running on a CPU with only AVX2.
\vspacesmall
Current microprocessors can typically do two full size vector operations per clock cycle in small loops (see \href{https://www.agner.org/optimize/#manuals}{Agner's optimization manuals} for details).
\vspacesmall
\section{Putting data into vectors}\label{PuttingDataIntoVectors}
The different ways of putting data into vectors are listed on page \pageref{ConstructingVectors}. If the vector elements are constants known at compile time, then the fastest way is to use a constructor:
\begin{example}
\label{exampleVectConstructor}
\end{example}
\begin{lstlisting}[frame=single]
Vec4i a(1); // a = (1, 1, 1, 1)
Vec4i b(2, 3, 4, 5); // b = (2, 3, 4, 5)
\end{lstlisting}
\vspacesmall
If the vector elements are not constants then the fastest way is to load from an array with the method \codei{load}. However, it is not good to load data from an array immediately after writing the data elements to the array one by one, because this causes a "store forwarding stall" (see \href{https://www.agner.org/optimize/#manual_cpp}{Agner's microarchitecture manual}). This is illustrated in the following examples:
\begin{example}
\label{exampleArrayLoopV}
\end{example}
\begin{lstlisting}[frame=single]
// Make vector using constructor
int MakeMyData(int i); // make whatever data we need
void DoSomething(Vec4i & data); // handle these data
const int datasize = 1000; // total number of data elements
...
for (int i = 0; i < datasize; i += 4) {
Vec4i d(MakeMyData(i), MakeMyData(i+1),
MakeMyData(i+2), MakeMyData(i+3));
DoSomething(d);
}
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleArrayLoopW}
\end{example}
\begin{lstlisting}[frame=single]
// Load from small array
int MakeMyData(int i); // make whatever data we need
void DoSomething(Vec4i & data); // handle these data
const int datasize = 1000; // total number of data elements
...
for (int i = 0; i < datasize; i += 4) {
int data4[4];
for (int j = 0; j < 4; j++) {
data4[j] = MakeMyData(i+j);
}
// store forwarding stall for large read after small writes:
Vec4i d = Vec4i().load(data4);
DoSomething(d);
}
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleArrayLoopX}
\end{example}
\begin{lstlisting}[frame=single]
// Make array a little bigger
int MakeMyData(int i); // make whatever data we need
void DoSomething(Vec4i & data); // handle these data
const int datasize = 1000; // total number of data elements
...
for (int i = 0; i < datasize; i += 8) {
int data8[8];
for (int j = 0; j < 8; j++) {
data8[j] = MakeMyData(i+j);
}
Vec4i d;
for (int k = 0; k < 8; k += 4) {
d.load(data8 + k);
DoSomething(d);
}
}
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleArrayLoopY}
\end{example}
\begin{lstlisting}[frame=single]
// Make array full size
int MakeMyData(int i); // make whatever data we need
void DoSomething(Vec4i & data); // handle these data
const int datasize = 1000; // total number of data elements
...
int data1000[datasize];
int i;
for (i = 0; i < datasize; i++) {
data1000[i] = MakeMyData(i);
}
Vec4i d;
for (i = 0; i < datasize; i += 4) {
d.load(data1000 + i);
DoSomething(d);
}
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleArrayLoopInsert}
\end{example}
\begin{lstlisting}[frame=single]
// Use insert. No array needed
int MakeMyData(int i); // make whatever data we need
void DoSomething(Vec4i & data); // handle these data
const int datasize = 1000; // total number of data elements
...
Vec4i d; // declare vector
for (int i = 0; i < datasize; i += 4) {
for (int j = 0; j < 4; j++) {
d.insert(j, MakeMyData(i+j)); // insert element
}
DoSomething(d);
}
\end{lstlisting}
\vspacesmall
In example \ref{exampleArrayLoopV}, we are combining four data elements into vector d by calling a constructor with four parameters. This may not be the most efficient way because it requires several instructions to combine the four numbers into a single vector.
\vspacesmall
In example \ref{exampleArrayLoopW}, we are putting the four values into an array and then loading the array into a vector. This is causing the so-called \emph{store forwarding stall}. A store forwarding stall occurs in the CPU hardware when doing a large read (here 128 bits) immediately after a smaller write (here 32 bits) to the same address range. This causes a delay of 10 - 20 clock cycles.
\vspacesmall
In example \ref{exampleArrayLoopX}, we are putting eight values into an array and then reading four elements at a time. If we assume that it takes more than 10 - 20 clock cycles to call MakeMyData four times then the first four elements of the array will have sufficient time to make it into the level-1 cache while we are writing the next four elements. This delay is sufficient to avoid the store forwarding stall. A disadvantage of example \ref{exampleArrayLoopX} is that we need an extra loop.
\vspacesmall
In example \ref{exampleArrayLoopY}, we are putting a thousand elements into an array before loading them. This is certain to avoid the store forwarding stall. A disadvantage of example \ref{exampleArrayLoopY} is that the large array takes more cache space.
\vspacesmall
Example \ref{exampleArrayLoopInsert} avoids any intermediate storage in memory by inserting elements directly into the vector. This method is most efficient when the AVX512VL instruction set is enabled. The compiler is likely to keep often-used vectors in registers without saving them to memory.
\vspacesmall
\section{Alignment of arrays and vectors}\label{Alignment}
Reading and writing vectors from or to memory is likely to be slightly faster if the array in memory is aligned to an address divisible by the vector size. The vector size is 16, 32, or 64 bytes for 128, 256, and 512 bits, respectively. The program may not work when compiled for an instruction set lower than AVX if vectors are not aligned by at least 16.
\vspacesmall
Most compilers will align large arrays automatically if they are stored in static memory, but perhaps not if they are stored in local memory or allocated with operator \codei{new}, etc.
\vspacesmall
An array can be aligned with the \codei{alignas} keyword, for example:
\begin{lstlisting}[frame=none]
alignas(64) float mydata[1024];
\end{lstlisting}
\vspacesmall
Older compilers use \codei{\_\_declspec(align(64))} in Windows, or
\codei{ \_\_attribute\_\_((aligned(64))) } in Linux.
\vspacesmall
It is always recommended to align large arrays for performance reasons if the code uses vectors. C++17 supports aligned dynamic allocation with \codei{std::aligned\_alloc}. \vspacesmall
A useful method is to align an array of vectors with operator \codei{new}. The compiler recognizes that a vector class, e.g. \codei{Vec8f}, needs alignment. An array of such vectors will be aligned correctly, even when allocated with \codei{new}. This is illustrated in the following example:
\begin{lstlisting}[frame=none]
// size of dataset, as number of floats:
int datasize = 1024;
// variable size array, properly aligned
// (assuming that datasize is divisible by vector size):
Vec8f *mydata = new Vec8f[datasize / Vec8f::size()];
// access array as single elements:
float * mydataf = (float*)mydata;
int i;
for (i = 0; i < datasize; i++) {
mydataf[i] = (float)i;
}
// access array as vectors:
Vec8f x;
for (i = 0; i < datasize / Vec8f::size(); i++) {
x = mydata[i];
x *= 100.f;
mydata[i] = x;
}
// remember to free the allocated data:
delete[] mydata;
\end{lstlisting}
\vspacesmall
The container class template \codei{ContainerV} is available as an add-on to the vector class library. This is useful for making a properly aligned array of vectors.
\vspacesmall
Finally, it is possible to do the alignment manually as illustrated in this example:
\begin{lstlisting}[frame=none]
// Example of aligning memory
int arraySize = 1024; // Required array size
const int alignBy = 64; // Required alignment (must be a power of 2)
// allocate more than needed
char * unalignedAddress = new char[arraySize*sizeof(float) + alignBy];
// round up the address to nearest multiple of alignBy
char * alignedAddress = (char*)(((size_t)unalignedAddress+alignBy-1) & (-alignBy));
// cast aligned pointer to required type
float * mydataf = (float*)alignedAddress;
// use the aligned array
for (int i = 0; i < arraySize; i++) {
mydataf[i] = (float)i;
}
// remember to free at unalignedAddress, not alignedAddress
delete[] unalignedAddress; // free memory
\end{lstlisting}
\vspacesmall
\section{When the data size is not a multiple of the vector size}\label{NotAMultipleOfVectorSize}
It is obviously easier to vectorize a data set when the number of elements in the data set is a multiple of the vector size. Here, we will discuss different ways of handling the situation when the data do not fit into an integral number of vectors. We will use the simple example of adding 134 integers stored in an array. The following examples illustrate different solutions.
\vspacesmall
\begin{example}
\label{exampleOddLoop1}
\end{example}
\begin{lstlisting}[frame=single]
// Handling the remaining data one by one
const int datasize = 134;
const int vectorsize = 8;
const int regularpart = datasize & (-vectorsize); // = 128
// (AND-ing with -vectorsize will round down to the nearest
// lower multiple of vectorsize. This works only if vectorsize
// is a power of 2)
int mydata[datasize];
... // initialize mydata
Vec8i sum1(0), temp;
int i;
// loop for 8 numbers at a time
for (i = 0; i < regularpart; i += vectorsize) {
temp.load(mydata+i); // load 8 elements
sum1 += temp; // add 8 elements
}
int sum = 0;
// loop for the remaining 6 numbers
for (; i < datasize; i++) {
sum += mydata[i];
}
sum += horizontal_add(sum1); // add the vector sum
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleOddLoop2}
\end{example}
\begin{lstlisting}[frame=single]
// Handling the remaining data with a smaller vector size
const int datasize = 134;
const int vectorsize = 8;
const int regularpart = datasize & (-vectorsize); // = 128
int mydata[datasize];
... // initialize mydata
Vec8i sum1(0), temp;
int sum = 0;
int i;
// loop for 8 numbers at a time
for (i = 0; i < regularpart; i += vectorsize) {
temp.load(mydata+i); // load 8 elements
sum1 += temp; // add 8 elements
}
sum = horizontal_add(sum1); // sum of first 128 numbers
if (datasize - i >= 4) {
// get four more numbers
Vec4i sum2;
sum2.load(mydata+i);
i += 4;
sum += horizontal_add(sum2);
}
// loop for the remaining 2 numbers
for (; i < datasize; i++) {
sum += mydata[i];
}
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleOddLoop3}
\end{example}
\begin{lstlisting}[frame=single]
// Use partial load for the last vector
const int datasize = 134;
const int vectorsize = 8;
const int regularpart = datasize & (-vectorsize); // = 128
int mydata[datasize];
... // initialize mydata
Vec8i sum1(0), temp;
// loop for 8 numbers at a time
for (int i = 0; i < regularpart; i += vectorsize) {
temp.load(mydata+i); // load 8 elements
sum1 += temp; // add 8 elements
}
// load the last 6 elements
temp.load_partial(datasize-regularpart, mydata+regularpart);
sum1 += temp; // add last 6 elements
int sum = horizontal_add(sum1); // vector sum
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleOddLoop4}
\end{example}
\begin{lstlisting}[frame=single]
// Read past the end of the array and ignore excess data
const int datasize = 134;
const int vectorsize = 8;
int mydata[datasize];
... // initialize mydata
Vec8i sum1(0), temp;
// loop for 8 numbers at a time, reading 136 numbers
for (int i = 0; i < datasize; i += vectorsize) {
temp.load(mydata+i); // load 8 elements
if (datasize - i < vectorsize) {
// set excess data to zero
// (this may be faster than load_partial)
temp.cutoff(datasize - i);
}
sum1 += temp; // add 8 elements
}
int sum = horizontal_add(sum1); // vector sum
\end{lstlisting}
\vspacesmall
\begin{example}
\label{exampleOddLoop5}
\end{example}
\begin{lstlisting}[frame=single]
// Make array bigger and set excess data to zero
const int datasize = 134;
const int vectorsize = 8;
// round up datasize to nearest higher multiple of vectorsize
const int arraysize =
(datasize + vectorsize - 1) & (-vectorsize); // = 136
int mydata[arraysize];
int i;
... // initialize mydata
// set excess data to zero
for (i = datasize; i < arraysize; i++) {
mydata[i] = 0;
}
Vec8i sum1(0), temp;
// loop for 8 numbers at a time, reading 136 numbers
for (i = 0; i < arraysize; i += vectorsize) {
temp.load(mydata+i); // load 8 elements
sum1 += temp; // add 8 elements
}
int sum = horizontal_add(sum1); // vector sum
\end{lstlisting}
\vspacesmall
It is clearly advantageous to increase the array size to a multiple of the vector size, as in example \ref{exampleOddLoop5} above. Likewise, if you are storing vector data to an array, then it is an advantage to make the result array bigger to hold the excess data. If this is not possible then use \codei{store\_partial} to write the last partial vector to the array.
\vspacesmall
It is usually possible to read past the end of an array, as in example \ref{exampleOddLoop4} above, without causing problems. However, there is a theoretical possibility that the array is placed at the very end of the readable data section so that the program will crash when attempting to read from an illegal address past the end of the valid data area. To assess this risk, we need to look at each possible method of data storage:
\begin{itemize}
\item An array declared inside a function, and not static, is stored on the stack. The subsequent addresses on the stack will contain the return address and parameters for the function, followed by local data, parameters, and return address of the next higher function all the way up to main. In this case there is plenty of extra data to read from.
\item A static or global array is stored in static data memory. The static data area is often followed by library data, exception handler tables, link tables, etc. These tables can be seen by requesting a map file from the linker.
\item Data allocated with the operator \codei{new} are stored on the heap. I have no information about the size of the end node of a heap.
\item If an array is declared inside a class definition then one of the three cases above applies, depending on how the class instance (object) is created.
\end{itemize}
\vspacesmall
These problems can be avoided either by making the array bigger or by aligning the array to an address divisible by the vector size, as described on page \pageref{Alignment}. The memory page size is at least 4 kbytes, and always a power of 2. If the array is aligned by the vector size then the page boundaries are certain to coincide with vector boundaries. This makes sure that there is no memory page boundary between the end of the array and the next vector-size boundary. Therefore, we can read up to the next vector-size boundary without the risk of crossing a boundary to an invalid memory page.
\vspacesmall
The add-on package named 'containers' includes efficient container class templates for arrays of fixed size and dynamic size, as well as matrices.
These containers will automatically extend arrays to a multiple of the vector size.
Use the container class template \codei{ContainerV} for making arrays that fit the vector classes. See containers\_manual.pdf for details.
\vspacesmall
\section{Using multiple accumulators} \label{UsingMultipleAccumulators}
Consider this function which adds a long list of floating point numbers:
\begin{example}
\label{exampleLoopAccumulator1}
\end{example}
\begin{lstlisting}[frame=single]
double add_long_list(double const * p, int n) {
int n1 = n & (-4); // round down n to multiple of 4
Vec4d sum(0.0);
int i;
for (i = 0; i < n1; i += 4) {
sum += Vec4d().load(p + i); // add 4 numbers
}
// add any remaining numbers
sum += Vec4d().load_partial(n - i, p + i);
return horizontal_add(sum);
}
\end{lstlisting}
\vspacesmall
In this example, we have a loop-carried dependency chain (see
\href{https://www.agner.org/optimize/#manual_cpp}{Agner's C++ manual}).
The vector addition inside the loop has a latency of typically 3 - 5 clock cycles. As each addition has to wait for the result of the previous addition, the loop will take 3 - 5 clock cycles per iteration.
\vspacesmall
However, the throughput of floating point additions is typically one or two vector additions per clock cycle. Therefore, we are far from fully utilizing the capacity of the floating point adder. In this situation, we can double the speed by using two accumulators:
\begin{example}
\label{exampleLoopAccumulator2}
\end{example}
\begin{lstlisting}[frame=single]
double add_long_list(double const * p, int n) {
int n2 = n & (-8); // round down n to multiple of 8
Vec4d sum1(0.0), sum2(0.0);
int i;
for (i = 0; i < n2; i += 8) {
sum1 += Vec4d().load(p + i); // add 4 numbers
sum2 += Vec4d().load(p + i + 4); // 4 more numbers
}
if (n - i >= 4) {
// add 4 more numbers
sum1 += Vec4d().load(p + i);
i += 4;
}
// add any remaining numbers
sum2 += Vec4d().load_partial(n - i, p + i);
return horizontal_add(sum1 + sum2);
}
\end{lstlisting}
\vspacesmall
Here, the addition to sum2 can begin before the addition to sum1 is finished. The loop still takes 3 - 5 clock cycles per iteration, but the number of additions done per loop iteration is doubled. It may even be worthwhile to have three or four accumulators in this case if n is very big.
\vspacesmall
In general, if we want to predict whether it is advantageous to have more than one accumulator, we first have to see if there is a loop-carried dependency chain. If the performance is not limited by a loop-carried dependency chain then there is no need for multiple accumulators. Next, we have to look at the latency and throughput of the instructions inside the loop. Floating point addition, subtraction and multiplication all have latencies of typically 3 - 5 clock cycles and a throughput of one or two vector additions/subtractions/multiplications per clock cycle. Therefore, if the loop-carried dependency chain involves floating point addition, subtraction or multiplication; and the total number of floating point operations per loop iteration is lower than the maximum throughput, then it may be advantageous to have two accumulators, or perhaps more than two.
\vspacesmall
There is rarely any reason to have multiple accumulators in integer code, because an integer vector addition has a latency of just 1 or 2 clock cycles.
\vspacesmall
\section{Using multiple threads}\label{UsingMultipleThreads}
Performance can be improved by dividing the work between multiple threads running in parallel on processors with multiple CPU cores.
It is important to distinguish between coarse-grained parallelism and fine-grained parallelism. Coarse-grained parallelism refers to the situation where a long sequence of operations can be carried out independently of other tasks that are running in parallel. Fine-grained parallelism is the situation where a task is divided into many small subtasks, but it is impossible to work for very long on a particular subtask before coordination with other subtasks is necessary.
\vspacesmall
Vector operations are useful for fine-grained parallelism, while multithreading is useful only for coarse-grained parallelism.
The work should be divided between threads in such a way that communication between the threads is avoided, or at least kept to a minimum.
\vspacesmall
Modern computers have multiple CPU cores. It is often possible to run two threads simultaneously in each CPU core. This is called simultaneous multithreading (SMT) or hyperthreading.
Two threads running in the same CPU core will be competing for the same CPU resources so that each thread is getting only half of the available resources. Therefore, SMT is not advantageous for CPU-intensive code such as heavy mathematical calculations.
\vspacesmall
The optimal number of threads for CPU-intensive code is equal to the number of CPU cores or physical processors. If the code is not CPU-intensive, i.e. if the performance is limited by something else such as RAM, disk access, or network speed, then you will probably get better performance by setting the number of threads equal to the number of logical processors. This is the number of threads that can run simultaneously without task switching. The number of logical processors is double the number of physical processors if each CPU core can run two threads simultaneously.
\vspacesmall
The function \codei{physicalProcessors()} gives information about the number of physical and logical processors (see page \pageref{physicalProcessors}).
\vspacesmall
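The division of work between threads can be sketched with standard C++ threads. The following example is not part of the vector class library, but a plain illustration using \codei{std::thread}: it splits the summation of a long array into contiguous chunks, sums each chunk in its own thread, and combines the partial sums after all threads have joined. The loop inside each thread could use vector classes as shown earlier.

\begin{lstlisting}[frame=single]
#include <thread>
#include <vector>
#include <numeric>

// coarse-grained parallelism: one contiguous chunk per thread
double parallel_sum(double const * p, int n, int num_threads) {
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    int chunk = n / num_threads;
    for (int t = 0; t < num_threads; t++) {
        int begin = t * chunk;
        int end = (t == num_threads - 1) ? n : begin + chunk;
        // each thread writes only its own slot, so no locking is needed
        workers.emplace_back([&partial, p, begin, end, t]() {
            double s = 0.0;
            for (int i = begin; i < end; i++) s += p[i];
            partial[t] = s;
        });
    }
    for (auto & w : workers) w.join();  // wait for all threads
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
\end{lstlisting}
\vspacesmall
Each thread touches only its own slot of the partial array, so the threads need no communication until the final combination step.
\vspacesmall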
It is not safe to access the same data from multiple threads simultaneously. For example, it may be uncertain whether one thread is reading a value before or after it has been modified by another thread. Container classes, in particular, are unsafe to access from multiple threads.
\vspacesmall
The floating point control word (see p. \pageref{FPControlWordManipulationFunctions}) is not shared between threads.
\vspacesmall
\section{Instruction sets and CPU dispatching}\label{CPUDispatching}
\flushleft
Historically, almost every new generation of microprocessors has added a new extension to the instruction set. Most of the new instructions relate to vector operations. We can take advantage of these new instructions to make vector code more efficient. The vector class library requires the SSE2 instruction set as a minimum, but it makes more efficient code when a higher instruction set is used. Table \ref{table:instructionSetHistory} indicates things that are improved for each successive instruction set extension.
\vspacesmall
\begin {table}[h]
\caption{Instruction set history}
\label{table:instructionSetHistory}
\begin{tabular}{|p{24mm}|p{22mm}|p{100mm}|}
\hline
\bfseries Instruction \newline set & \bfseries Year \newline introduced & \bfseries VCL functions improved \\ \hline
SSE2 & 2001 & minimum requirement for vector class library \\ \hline
SSE3 & 2004 & floating point horizontal\_add \\ \hline
SSSE3 & 2006 & permute, blend and lookup functions, integer abs \\ \hline
SSE4.1 & 2007 & select, blend, horizontal\_and, horizontal\_or, integer max/min, integer multiply (32 and 64 bit), integer divide (32 bit), 64-bit integer compare (==, !=), floating point round, truncate, floor, ceil. \\ \hline
SSE4.2 & 2008 & 64-bit integer compare (\textgreater, \textgreater =, \textless, \textless =). 64 bit integer max, min \\ \hline
AVX & 2011 & all operations on 256-bit floating point vectors: Vec8f, Vec4d \\ \hline
XOP \newline AMD only & 2011 & compare, horizontal\_add\_x, rotate\_left, blend, and lookup on 128-bit integer vectors. Obsolete. \\ \hline
FMA4 \newline AMD only & 2011 & floating point code containing multiplication followed by addition. Obsolete. \\ \hline
FMA3 & 2012 & floating point code containing multiplication followed by addition \\ \hline
AVX2 & 2013 & All operations on 256-bit integer vectors: Vec32c, Vec32uc, Vec16s, Vec16us, Vec8i, Vec8ui, Vec4q, Vec4uq. Gather. \\ \hline
F16C & 2013 & Conversion between single precision and half precision floating point numbers. \\ \hline
AVX512F & 2016 & All operations on 512-bit integer and floating point vectors: Vec16i, Vec16ui, Vec8q, Vec8uq, Vec16f, Vec8d. \\ \hline
AVX512BW & 2018 & 512 bit vectors with 8-bit and 16-bit integer elements \\ \hline
AVX512DQ & 2018 & Faster multiplication of vectors of 64-bit integers. \\ \hline
AVX512VL & 2018 & Compact boolean vectors for 128 and 256 bit data.
Improved performance of insert, extract, load\_partial, store\_partial, and several other functions. \\ \hline
AVX512ER & 2016 & Only a few processor models have this. Fast exponential functions. Better precision on approx\_recipr and approx\_rsqrt. \\ \hline
AVX512VBMI & 2018 & Faster permutation functions etc. for Vec32c and Vec64c \\ \hline
AVX512VBMI2 & 2019 & Faster extract from 8-bit and 16-bit integer vectors \\ \hline
AVX512-FP16 & 2023 & Half precision floating point calculations. \\ \hline
\end{tabular}
\end{table}
\vspacesmall
The vector class library makes it possible to compile for different instruction sets from the same source code. Different versions are made simply by recompiling the code with different compiler options. The instruction set to use in VCL can be specified on the compiler command line as listed in table \ref{table:CommandLineOptions}.
\begin {table}[h]
\caption{Command line options}
\label{table:CommandLineOptions}
\begin{tabular}{|p{26mm}|p{28mm}|p{28mm}|p{28mm}|p{28mm}|}
\hline
\bfseries Instruction set & \bfseries Gnu and Clang compiler & \bfseries Intel compiler Linux & \bfseries Intel compiler Windows & \bfseries MS compiler \\ \hline
SSE2 & -msse2 & -msse2 & /arch:sse2 & /arch:sse2 \\ \hline
SSE3 & -msse3 & -msse3 & /arch:sse3 & /arch:sse2 \newline /D\_\_SSE3\_\_ \\ \hline
SSSE3 & -mssse3 & -mssse3 & /arch:ssse3 & /arch:sse2 \newline /D\_\_SSSE3\_\_ \\ \hline
SSE4.1 & -msse4.1 & -msse4.1 & /arch:sse4.1 & /arch:sse2 \newline /D\_\_SSE4\_1\_\_ \\ \hline
SSE4.2 & -msse4.2 & -msse4.2 & /arch:sse4.2 & /arch:sse2 \newline /D\_\_SSE4\_2\_\_ \\ \hline
AVX & -mavx \newline -fabi-version=0 & -mavx & /arch:avx & /arch:avx /DINSTRSET=7 \\ \hline
FMA3 & -mfma & -mfma & -mfma & /DINSTRSET=7 \\ \hline
AVX2 & -mavx2 \newline -fabi-version=0 & -mavx2 & /arch:avx2 & /arch:avx2 \newline /DINSTRSET=8 \\ \hline
F16C & -mf16c & -mf16c & /arch:avx2 & /D\_\_F16C\_\_ \\ \hline
AVX512F & -mavx512f & -mavx512f & /arch:COMMON-AVX512 & not supported without AVX512DQ\\ \hline
AVX512VL/BW/ DQ & -mavx512vl -mavx512bw -mavx512dq & -mavx512vl -mavx512bw -mavx512dq & /arch:CORE-AVX512 & /arch:avx2 /DINSTRSET=10 \\ \hline
AVX512VBMI & -mavx512vbmi & -mavx512vbmi & ?? & /D \_\_AVX512VBMI\_\_ \\ \hline
AVX512VBMI2 & -mavx512vbmi2 & -mavx512vbmi2 & ?? & /D \_\_AVX512VBMI2\_\_ \\ \hline
AVX512ER & -mavx512er & -xMIC-AVX512 & /arch:MIC-AVX512 & /D\_\_AVX512ER\_\_ \\ \hline
AVX512-FP16 & -mavx512fp16 & -mavx512fp16 & /arch:SAPPHIRERAPIDS & /D\_\_AVX512FP16\_\_ \\ \hline
C++ standard & -std=c++17 & -std=c++17 & /Qstd:c++17 & /std:c++17 \\ \hline
\end{tabular}
\end{table}
\vspacesmall
The Microsoft compiler does not have command line options for all the instruction sets, but the remaining instruction sets can be specified as preprocessor defines, which are detected by the preprocessing directives in the vector class library.
\vspacesmall
The FMA3 instruction set is not always handled directly by the code in the vector class library, but by the compiler. The compiler may automatically combine a floating point multiplication and a subsequent addition or subtraction into a single instruction, unless you have specified a strict floating point model.
\vspacesmall
It is recommended to specify compiler options that allow efficient code optimizations. Suitable options on Gnu and Clang compilers are
-O2 -fno-trapping-math -fno-math-errno. The option -O3 is sometimes better than -O2, but in some cases it is worse. You may test whether -O2 or -O3 gives the best performance in your specific case.
Suitable options on Microsoft and Intel compilers are /O2 /fp:except-.
\vspacesmall
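Combining these optimization options with an instruction-set option from table \ref{table:CommandLineOptions}, a build that compiles the same source for two different instruction sets might look like this on a Gnu or Clang compiler (the file names here are hypothetical):

\begin{lstlisting}[frame=single]
# compile the same source file twice for two instruction sets
g++ -O2 -fno-trapping-math -fno-math-errno -std=c++17 \
    -msse2 -c mycode.cpp -o mycode_sse2.o
g++ -O2 -fno-trapping-math -fno-math-errno -std=c++17 \
    -mavx2 -mfma -fabi-version=0 -c mycode.cpp -o mycode_avx2.o
\end{lstlisting}
\vspacesmall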
There is no advantage in using the biggest vector classes unless the corresponding instruction set is specified, but it can be convenient to use these classes anyway if the same source code is compiled for multiple versions with different instruction sets. Each large vector will simply be split up into two or four smaller vectors when compiling for a lower instruction set.
\vspacesmall
It is recommended to make an automatic CPU dispatcher that detects at runtime which instruction sets are supported by the actual CPU, and selects the best version of the code accordingly. For example, you may compile the code three times for three different instruction sets: SSE2, AVX2 and AVX512VL/BW/DQ. The CPU dispatcher will then set a function pointer to point to the appropriate version of the compiled code. You can use the function instrset\_detect (see below, page \pageref{instrsetDetect}) to detect the supported instruction set.
Two examples are provided to show how to do the CPU dispatching:
\vspacesmall
dispatch\_example1.cpp: This example uses different function names for the different versions. This is useful for simple cases with only one or a few functions.
\vspacesmall
dispatch\_example2.cpp: This example uses different namespaces for the different versions. This is the preferred method if the code contains multiple functions, classes, objects, etc.
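\vspacesmall
The function-pointer pattern behind these examples can be sketched as follows. This is a simplified stand-in, not the actual contents of the example files: in real code the two versions would be the same source compiled twice with different instruction-set options, and the dispatcher would call \codei{instrset\_detect()} instead of receiving the level as a parameter.

\begin{lstlisting}[frame=single]
// two versions of the same function; in real use these would be
// compiled from the same source with different compiler options
float sum_sse2(float const * p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += p[i];
    return s;
}
float sum_avx2(float const * p, int n) {  // stand-in for AVX2 build
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += p[i];
    return s;
}

// function pointer, set once at program start
float (*sum_dispatched)(float const *, int) = sum_sse2;

void init_dispatcher(int level) {  // level = instrset_detect()
    if (level >= 8) sum_dispatched = sum_avx2;  // AVX2 supported
    else            sum_dispatched = sum_sse2;  // fallback
}
\end{lstlisting}
\vspacesmall
After the dispatcher has run once, all subsequent calls go through the function pointer, with no overhead beyond an indirect call.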
\vspacesmall
There is an important restriction when combining code compiled for different instruction sets: Do not transfer any data as vector objects between pieces of code that are compiled for different instruction sets, because the vectors may be represented differently under the different instruction sets. Instead, transfer the data as arrays between parts of the program that are compiled for different instruction sets.
\vspacesmall
The functions listed below can be used for detecting at runtime which instruction set is supported, and other useful information about the CPU.
\vspacesmall
The function \codei{instrset\_detect()} gives a value representing the instruction set level.
\vspacesmall
\label{instrsetDetect}
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & int instrset\_detect() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description &
returns one of these values: \newline
0 = 80386 instruction set \newline
1 or above = SSE supported by CPU \newline
2 or above = SSE2 \newline
3 or above = SSE3 \newline
4 or above = Supplementary SSE3 (SSSE3) \newline
5 or above = SSE4.1 \newline
6 or above = SSE4.2 \newline
7 or above = AVX \newline
8 or above = AVX2 \newline
9 or above = AVX512F \newline
10 or above = AVX512VL, AVX512BW, and AVX512DQ \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
Additional instruction set extensions are not necessarily part of a linear sequence. These extensions can be detected with the following functions.
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & bool hasFMA3() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description & returns true if FMA3 is supported \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & bool hasAVX512ER() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description & returns true if AVX512ER is supported \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & bool hasAVX512VBMI() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description & returns true if AVX512VBMI is supported \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & bool hasAVX512VBMI2() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description & returns true if AVX512VBMI2 is supported \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & bool hasF16C() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description & returns true if F16C is supported \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & bool hasAVX512FP16() \\ \hline
\bfseries Source & instrset\_detect.cpp \\ \hline
\bfseries Description & returns true if AVX512-FP16 is supported \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\label{physicalProcessors}
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Function & int physicalProcessors(int * logical\_processors = 0) \\ \hline
\bfseries Source & add-on/physical\_processors.cpp \\ \hline
\bfseries Description & Returns the number of physical processors = the number of
CPU cores. The number of logical processors (returned through
logical\_processors) is double the number of physical processors if the CPU can run two threads simultaneously in each CPU core. \\ \hline
\bfseries Efficiency & poor \\ \hline
\end{tabular}
\vspacebig
\section{Function calling convention}\label{FunctionCallingConvention}
Function calls are most efficient when vectors are transferred in registers rather than in memory. This can be achieved in various ways:
\begin{itemize}
\item Use inline functions. This is useful for small functions and for functions that are only called in one place. An optimizing compiler may inline functions automatically, even if they are not specified as \codei{inline}. You may declare such functions \codei{static} as well to prevent the compiler from making a non-inline copy of the inlined function.
\item Use Linux or MacOS. Vector parameters are transferred in registers by default on these platforms. Vector function returns are transferred in registers in 64-bit mode.
\item Use \codei{\_\_vectorcall} in 64-bit Windows. The Clang and Microsoft compilers can transfer vector parameters and vector returns in registers when \codei{\_\_vectorcall} is used on the function declaration. See the example on page \pageref{examplePolynomialVectorcall}.
\item Use a vector size that fits the instruction set, according to table \ref{table:integerVectorClasses} on page \pageref{table:integerVectorClasses}.
\end{itemize}
\end{document}