% chapter included in vclmanual.tex
\documentclass[vcl_manual.tex]{subfiles}
\begin{document}
\chapter{Introduction}\label{chap:Introduction}
\flushleft
The VCL vector class library is a tool that helps C++ programmers make their code much faster by handling multiple data elements in parallel. Modern CPUs have \textit{Single Instruction Multiple Data} (SIMD) instructions for handling vectors of multiple data elements in parallel. The compiler may be able to use SIMD instructions automatically in simple cases, but a human programmer is often able to do it better by organizing data into vectors that fit the SIMD instructions. The VCL library is a tool that makes it easier for the programmer to write vector code without having to use assembly language or intrinsic functions. Let us explain this with an example:
\vspacesmall
\begin{example}
\label{exampleArrayLoop1}
\end{example} % frame disappears if I put this after end lstlisting
\begin{lstlisting}[frame=single]
// Array loop
float a[8], b[8], c[8]; // declare arrays
... // put values into arrays
for (int i = 0; i < 8; i++) { // loop for 8 elements
c[i] = a[i] + b[i] * 1.5f; // operations on each element
}
\end{lstlisting}
\vspacesmall
The vector class library allows you to rewrite example \ref{exampleArrayLoop1} using vectors:
\vspacesmall
\begin{example}
\label{exampleArrayLoopVect}
\end{example}
\begin{lstlisting}[frame=single]
// Array loop using vectors
#include "vectorclass.h" // use vector class library
float a[8], b[8], c[8]; // declare arrays
... // put values into arrays
Vec8f avec, bvec, cvec; // define vectors of 8 floats each
avec.load(a); // load array a into vector
bvec.load(b); // load array b into vector
cvec = avec + bvec * 1.5f; // do operations on vectors
cvec.store(c); // save result in array c
\end{lstlisting}
\vspacesmall
Example \ref{exampleArrayLoopVect} does the same as example \ref{exampleArrayLoop1}, but more efficiently because it utilizes SIMD instructions that do eight additions and/or eight multiplications in a single instruction. Modern microprocessors have these instructions which may give you a throughput of eight floating point additions and eight multiplications per clock cycle. A good optimizing compiler may actually convert example \ref{exampleArrayLoop1} automatically to use the SIMD instructions, but in more complicated cases you cannot be sure that the compiler is able to vectorize your code in an optimal way.
\vspacesmall
\section{How it works} \label{HowItWorks}
The type \codei{Vec8f} in example \ref{exampleArrayLoopVect} is a class that encapsulates the intrinsic type
\codei{\_\_m256} which represents a 256-bit vector register holding 8 floating point numbers of 32 bits each. The overloaded operators \codei{+} and \codei{*} represent the SIMD instructions for adding and multiplying vectors. These operators are inlined so that no extra code is generated other than the SIMD instructions. All you have to do to get access to these vector operations is to include "vectorclass.h" in your C++ code and specify the desired instruction set (e.g. SSE2, AVX2, or AVX512) in the compiler options.
\vspacesmall
The code in example \ref{exampleArrayLoopVect} can be reduced to just 4 machine instructions if the instruction set AVX or higher is enabled. The SSE2 instruction set will give 8 machine instructions because the maximum vector register size is only half as big for instruction sets prior to AVX. The code in example \ref{exampleArrayLoop1} will generate approximately 44 instructions if the compiler does not automatically vectorize the code.
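\vspacesmall
The idea of wrapping a vector in a class with overloaded operators can be sketched with a toy class. This is not the VCL implementation: the names \codei{ToyVec8f} and \codei{addScaled} are hypothetical, and the class stores a plain array instead of a \codei{\_\_m256} register so that it compiles without any SIMD support. Whether a compiler vectorizes the loops inside the toy class is not guaranteed.
\vspacesmall
\begin{lstlisting}[frame=single]
// Toy stand-in for a vector class. NOT the VCL implementation:
// the real Vec8f wraps the intrinsic type __m256, while this
// sketch wraps a plain array (an assumption made for portability).
#include <cassert>
#include <cstddef>

class ToyVec8f {
    float x[8];                          // eight 32-bit elements
public:
    ToyVec8f() : x{} {}                  // all elements zero
    void load(const float* p) {          // read eight floats from memory
        assert(p != nullptr);
        for (std::size_t i = 0; i < 8; i++) x[i] = p[i];
    }
    void store(float* p) const {         // write eight floats to memory
        assert(p != nullptr);
        for (std::size_t i = 0; i < 8; i++) p[i] = x[i];
    }
    // element-wise addition of two vectors
    friend ToyVec8f operator+(const ToyVec8f& a, const ToyVec8f& b) {
        ToyVec8f r;
        for (std::size_t i = 0; i < 8; i++) r.x[i] = a.x[i] + b.x[i];
        return r;
    }
    // multiply every element by the same scalar
    friend ToyVec8f operator*(const ToyVec8f& a, float s) {
        ToyVec8f r;
        for (std::size_t i = 0; i < 8; i++) r.x[i] = a.x[i] * s;
        return r;
    }
};

// Same computation as the array loop: c[i] = a[i] + b[i] * 1.5f
inline void addScaled(const float* a, const float* b, float* c) {
    ToyVec8f avec, bvec;
    avec.load(a);
    bvec.load(b);
    ToyVec8f cvec = avec + bvec * 1.5f;
    cvec.store(c);
}
\end{lstlisting}
\vspacesmall
The calling code looks the same as with \codei{Vec8f}. The difference is that each overloaded operator in VCL represents a SIMD instruction rather than a loop over eight elements.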
\vspacesmall
\section{Features of VCL} \label{Features}
\begin{itemize}
\item Vectors of 8-, 16-, 32- and 64-bit integers, signed and unsigned
\item Vectors of half precision, single precision, and double precision floating point numbers
\item Total vector size 128, 256, or 512 bits
\item Defines almost all common operators
\item Boolean operations and branches on vector elements
\item Many arithmetic functions
\item Standard mathematical functions
\item Permute, blend, gather, scatter, and table look-up functions
\item Fast integer division
\item Can build code for different instruction set extensions from the same source code
\item CPU dispatching to utilize higher instruction sets when available
\item Uses metaprogramming to find the optimal implementation for the selected instruction set and parameter values of a given operator or function
\item Includes extra add-on packages for special purposes and applications
\end{itemize}
\vspacesmall
\section{Instruction sets supported} \label{InstructionSetsSupported}
Since 1997, every new CPU model has extended the x86 instruction set with more SIMD instructions. The VCL library requires the SSE2 instruction set as a minimum, and supports SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, XOP, FMA3, FMA4, and AVX512F/VL/BW/DQ, as well as the new AVX512VBMI/VBMI2 and AVX512-FP16.
\vspacesmall
\section{Platforms supported} \label{PlatformsSupported}
VCL has support for Windows, Linux, and Mac, 32-bit and 64-bit, with Intel, AMD, or VIA x86 or x86-64 instruction set processors.
\vspacesmall
There are no plans to support ARM or other instruction sets in VCL. If you need other platforms than x86 and x86-64 then you may use the cross-platform function wrapper library named Highway. You can find it at \url{https://github.com/google/highway}.
\vspacesmall
A special version of the vector class library for the (now obsolete) Intel Knights Corner coprocessor has been developed at CERN. It is available from
\url{https://bitbucket.org/veclibknc/vclknc.git} or
\url{https://bitbucket.org/edanor/umesimd/}
\vspacesmall
\section{Compilers supported} \label{CompilersSupported}
The vector class library could not have been made with any other programming language than C++, because only C++ combines all the necessary features: low-level programming such as bit manipulation and intrinsic functions, high-level programming features such as classes and templates, operator overloading, metaprogramming, compiling to machine code without any intermediate byte code, and highly optimizing compilers with support for the many different instruction sets and platforms.
\vspacesmall
The vector class library works with Gnu, Clang, Microsoft, and Intel C++ compilers. It is recommended to use the newest version of the compiler if the newest instruction sets are used. The best optimization is obtained with the Gnu and Clang compilers. You may use any integrated development environment, make utility, or build system.
\vspacesmall
The vector class library version 2.xx requires the C++17 or later standard for the C++ language. The vector class library version 1.xx, using standard C++0x, should only be used if it is not possible to use a compiler with C++17 support.
\vspacesmall
\section{Intended use} \label{IntendedUse}
This vector class library is intended for experienced C++ programmers. It is useful for improving code performance where speed is critical and where the compiler is unable to vectorize the code automatically in an optimal way. By combining explicit vectorization by the programmer with other kinds of optimization done by the compiler, it has the potential to generate highly efficient code. This can be useful for optimizing library functions and critical innermost loops (hotspots) in CPU-intensive programs. There is no reason to use it in less critical parts of a program.
\vspacesmall
\section{How VCL uses metaprogramming} \label{HowVCLUsesMetaprogramming}
The vector class library uses metaprogramming extensively to resolve as much work as possible at compile time rather than at run time. Especially, it uses metaprogramming to find the optimal instructions and algorithms, depending on constants in the code and the selected instruction set.
\vspacesmall
VCL version 1.xx is written for older versions of the C++ language that do not have very good metaprogramming features, but the VCL makes the best use of the available features such as preprocessing directives and templates. Furthermore, it relies extensively on optimizing compilers for doing calculations with constant inputs at compile time and for removing not-taken branches.
\vspacesmall
VCL version 2.xx takes advantage of \codei{constexpr} branches, \codei{constexpr} functions, and other advanced features in
C++14 and C++17 to tell the compiler explicitly which calculations to do at compile time and which not-taken branches to remove. This makes the code clearer and more efficient. It is recommended to use the latest version of VCL, if possible.
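\vspacesmall
As a minimal sketch of a \codei{constexpr} branch, the following C++17 fragment selects between two code paths at compile time. The function name \codei{registersNeeded} and its template parameters are hypothetical and are not VCL identifiers. The sketch only illustrates the language feature, under the assumption that the vector size is a multiple of the register size.
\vspacesmall
\begin{lstlisting}[frame=single]
// Hypothetical illustration of "if constexpr" (C++17).
// Not taken from the VCL source code.

// Number of hardware registers needed to hold one vector of
// vectorBits bits when registers are registerBits bits wide.
template <int vectorBits, int registerBits>
constexpr int registersNeeded() {
    static_assert(vectorBits > 0 && registerBits > 0,
                  "sizes must be positive");
    if constexpr (vectorBits <= registerBits) {
        return 1;                          // fits in one register
    } else {
        return vectorBits / registerBits;  // split into smaller parts
    }
}

// Evaluated by the compiler; the not-taken branch is discarded.
static_assert(registersNeeded<256, 256>() == 1, "one 256-bit register");
static_assert(registersNeeded<256, 128>() == 2, "two 128-bit registers");
static_assert(registersNeeded<512, 128>() == 4, "four 128-bit registers");
\end{lstlisting}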
\vspacesmall
The following cases illustrate the use of metaprogramming in VCL:
\begin{itemize}
\item Compiling for different instruction sets. If you are using a bigger vector size than supported by the instruction set, then the VCL code will split the big vector into multiple smaller vectors. If you compile the same code again for a higher instruction set, then you will get a more efficient program with full-size vector registers.
\item Permute, blend, and gather functions. There are many different machine instructions that move data between different vector elements. Some of these instructions can only do very specific data permutations. The VCL uses quite a lot of metaprogramming to find the instruction or sequence of instructions that best fits the specified permutation pattern. Often, the higher instruction sets give more efficient results.
\item Integer division. Integer division can be done faster by a combination of multiplication and bit-shifting. The VCL can use metaprogramming to find the optimal division method and calculate the multiplication factor and shift count at compile time if the divisor is a known constant.
See page \pageref{HowVCLUsesMetaprogramming} for details.
\item Raising to a power. Calculating $x^8$ can be done faster by squaring $x$ three times rather than by a loop that multiplies seven times. The VCL can determine the optimal way of raising floating point vectors to an integer or rational power in the functions \codei{pow\_const} and \codei{pow\_rational}.
\end{itemize}
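\vspacesmall
The integer division case above can be sketched in plain C++ as follows. This is a well-known multiply-and-shift scheme for unsigned 32-bit division by a constant. It is shown as an assumption about how such a method can work, not as a copy of the VCL code. The names \codei{DivParams}, \codei{makeDivParams} and \codei{divide} are hypothetical. The divisor must be nonzero.
\vspacesmall
\begin{lstlisting}[frame=single]
// Sketch: unsigned 32-bit division by a constant divisor using one
// multiplication and two shifts. Hypothetical helper names.
#include <cstdint>

struct DivParams {
    std::uint32_t multiplier;  // fixed-point reciprocal of the divisor
    int shift1;                // first shift count (0 or 1)
    int shift2;                // second shift count
};

// Calculate multiplier and shift counts. Precondition: d != 0.
constexpr DivParams makeDivParams(std::uint32_t d) {
    int L = 0;                                   // L = ceil(log2(d))
    while ((std::uint64_t(1) << L) < d) L++;
    std::uint64_t excess = (std::uint64_t(1) << L) - d;   // 2^L - d
    std::uint32_t m = std::uint32_t((excess << 32) / d + 1);
    return DivParams{ m, L < 1 ? L : 1, L > 1 ? L - 1 : 0 };
}

// Calculate x / d without a division instruction.
constexpr std::uint32_t divide(std::uint32_t x, DivParams p) {
    // upper 32 bits of the 64-bit product
    std::uint32_t t =
        std::uint32_t((std::uint64_t(x) * p.multiplier) >> 32);
    return (t + ((x - t) >> p.shift1)) >> p.shift2;
}

// With a constant divisor, everything is resolved at compile time.
constexpr DivParams by7 = makeDivParams(7);
static_assert(divide(100u, by7) == 14u, "100 / 7");
static_assert(divide(4294967295u, by7) == 613566756u, "largest input");
\end{lstlisting}
\vspacesmall
When the divisor is known at compile time, the parameters are constants in the generated code. Only the multiplication, the subtraction, the addition and the two shifts remain at run time.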
\vspacesmall
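Raising to a constant integer power by repeated squaring can be sketched as below. The template \codei{powConst} is a hypothetical illustration written for this manual, not the VCL function \codei{pow\_const}, and it works on a scalar \codei{double} rather than a vector. For an exponent of 8 the recursion expands to three multiplications.
\vspacesmall
\begin{lstlisting}[frame=single]
// Sketch: x^n for a compile-time constant n >= 0, using repeated
// squaring. Hypothetical helper, scalar only.
template <unsigned n>
constexpr double powConst(double x) {
    if constexpr (n == 0) {
        return 1.0;
    } else if constexpr (n == 1) {
        return x;
    } else if constexpr (n % 2 == 0) {
        double half = powConst<n / 2>(x);   // x^(n/2)
        return half * half;                 // one squaring
    } else {
        return x * powConst<n - 1>(x);      // odd exponent: one extra multiply
    }
}

// x^8 = ((x^2)^2)^2 : three multiplications instead of seven
static_assert(powConst<8>(2.0) == 256.0, "2^8");
static_assert(powConst<5>(3.0) == 243.0, "3^5");
\end{lstlisting}
\vspacesmall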
\section{Availability} \label{Availability}
The newest version of the vector class library is available from
\href{https://github.com/vectorclass}{github.com/vectorclass}
\vspacesmall
\section{Support} \label{Support}
The vector class library is not a commercial product, but free and open source.
You cannot expect the kind of support you would get with a paid product.
\vspacesmall
A discussion board for software developed by Agner Fog is currently provided at
\href{https://www.agner.org/forum/viewforum.php?f=1}{www.agner.org/forum/viewforum.php?f=1}. This is intended for general discussion and suggestions, but not for programming support.
\vspacesmall
Programming questions should preferably be asked at
\href{https://stackoverflow.com}{Stackoverflow.com} using the tag {\bfseries vector-class-library}{\normalfont .} % why does it not return to normal font without this?
\vspacesmall
\section{License}\label{License}
The Vector class library is licensed under the Apache License, version 2.0.
\vspacesmall
You may not use the files except in compliance with this License.
You may obtain a copy of the license at
\href{https://www.apache.org/licenses/LICENSE-2.0}{www.apache.org/licenses/LICENSE-2.0}
\vspacesmall
I have previously sold commercial licenses to VCL. Now, I have decided on a more permissive license. Instead of selling commercial licenses, I am now suggesting that commercial users make a donation to an open source project of their own choosing or to an organization promoting open source software.
\vspacesmall
\chapter{The basics}\label{chap:TheBasics}
\section{How to compile} \label{HowToCompile}
Copy the latest version of the header files (*.h) to the same folder as your C++ source files. The header files from any add-on package should be included too if needed. Alternatively, you may put the VCL header files in a separate folder and specify an include path to this folder.
\vspacesmall
Include the header file vectorclass.h in your C++ source file.
Several other header files will be included automatically.
\vspacesmall
Set your compiler options to the desired instruction set. It is recommended to compile for 64-bit mode. Use C++ version 17 or higher.
The instruction set must be at least SSE2. See table \ref{table:CommandLineOptions} on page \pageref{table:CommandLineOptions} for a list of compiler options.
\vspacesmall
A command line for the Clang compiler may, for example, look like this:\\
clang++ -m64 -O2 -std=c++17 -mfma -mavx2 -fabi-version=0 myprogram.cpp
\vspacesmall
You may compile multiple versions for different instruction sets as explained in chapter \ref{CPUDispatching}.
\vspacesmall
%If you are using the Gnu compiler version 3.x or 4.x then you must set the ABI version to 4 or more, or 0 for a reasonable default.
The following simple C++ example may help you get started:
\begin{example}
\label{exampleArrayLoop3}
\end{example}
\begin{lstlisting}[frame=single]
// Simple vector class example C++ file
#include <stdio.h>
#include "vectorclass.h"
int main() {
// define and initialize integer vectors a and b
Vec4i a(10,11,12,13);
Vec4i b(20,21,22,23);
// add the two vectors
Vec4i c = a + b;
// Print the results
for (int i = 0; i < c.size(); i++) {
printf(" %5i", c[i]);
}
printf("\n");
return 0;
}
\end{lstlisting}
\vspacesmall
\section{Overview of vector classes} \label{OverviewOfVectorClasses}
The vector class library supports vectors of 8-bit, 16-bit, 32-bit and 64-bit signed and unsigned integers, 32-bit single precision floating point numbers, and 64-bit double precision floating point numbers. See page \pageref{HalfPrecision} for optional support for 16-bit half precision floating point numbers.
\vspacesmall
A vector contains multiple elements of the same type to a total size of 128, 256 or 512 bits. The vector elements are indexed, starting at 0 for the first element.
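\vspacesmall
The number of elements in a vector is the total vector size divided by the element size. The following small sketch checks this arithmetic against some of the vector classes. The function name \codei{elementsPerVector} is hypothetical and not part of VCL, and the sketch assumes the usual x86 type sizes.
\vspacesmall
\begin{lstlisting}[frame=single]
// Sketch: elements per vector = total bits / bits per element.
// Hypothetical helper, not part of VCL.
#include <climits>
#include <cstdint>

template <typename Element>
constexpr int elementsPerVector(int totalBits) {
    return totalBits / int(sizeof(Element) * CHAR_BIT);
}

static_assert(elementsPerVector<std::int8_t>(128) == 16, "Vec16c");
static_assert(elementsPerVector<std::int32_t>(256) == 8, "Vec8i");
static_assert(elementsPerVector<std::int64_t>(512) == 8, "Vec8q");
static_assert(elementsPerVector<float>(256) == 8, "Vec8f");
static_assert(elementsPerVector<double>(128) == 2, "Vec2d");
\end{lstlisting}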
\vspacesmall
The constant MAX\_VECTOR\_SIZE indicates the maximum vector size. The default maximum vector size is 512 in the current version and possibly larger in future versions. You can disable 512-bit vectors by defining
\begin{lstlisting}[frame=none]
#define MAX_VECTOR_SIZE 256
\end{lstlisting}
before including the vector class header files.
\vspacesmall
The vector class library also defines boolean vectors. These are mainly used for conditionally selecting elements from vectors.
\vspacesmall
The following vector classes are defined:
\begin {table}[H]
\caption{Integer vector classes}
\label{table:integerVectorClasses}
\begin{tabular}{|p{18mm}|p{18mm}|p{18mm}|p{18mm}|p{18mm}|p{30mm}|}
\hline
\bfseries Vector class & \bfseries Integer size bits & \bfseries Signed & \bfseries Elements per vector & \bfseries Total bits & \bfseries Minimum
\newline recommended \newline instruction set \\ \hline
Vec16c & \centering 8 & signed & \centering 16 & \centering 128 & SSE2 \\ \hline
Vec16uc & \centering 8 & unsigned & \centering 16 & \centering 128 & SSE2 \\ \hline
Vec8s & \centering 16 & signed & \centering 8 & \centering 128 & SSE2 \\ \hline
Vec8us & \centering 16 & unsigned & \centering 8 & \centering 128 & SSE2 \\ \hline
Vec4i & \centering 32 & signed & \centering 4 & \centering 128 & SSE2 \\ \hline
Vec4ui & \centering 32 & unsigned & \centering 4 & \centering 128 & SSE2 \\ \hline
Vec2q & \centering 64 & signed & \centering 2 & \centering 128 & SSE2 \\ \hline
Vec2uq & \centering 64 & unsigned & \centering 2 & \centering 128 & SSE2 \\ \hline
Vec32c & \centering 8 & signed & \centering 32 & \centering 256 & AVX2 \\ \hline
Vec32uc & \centering 8 & unsigned & \centering 32 & \centering 256 & AVX2 \\ \hline
Vec16s & \centering 16 & signed & \centering 16 & \centering 256 & AVX2 \\ \hline
Vec16us & \centering 16 & unsigned & \centering 16 & \centering 256 & AVX2 \\ \hline
Vec8i & \centering 32 & signed & \centering 8 & \centering 256 & AVX2 \\ \hline
Vec8ui & \centering 32 & unsigned & \centering 8 & \centering 256 & AVX2 \\ \hline
Vec4q & \centering 64 & signed & \centering 4 & \centering 256 & AVX2 \\ \hline
Vec4uq & \centering 64 & unsigned & \centering 4 & \centering 256 & AVX2 \\ \hline
Vec64c & \centering 8 & signed & \centering 64 & \centering 512 & AVX512BW \\ \hline
Vec64uc & \centering 8 & unsigned & \centering 64 & \centering 512 & AVX512BW \\ \hline
Vec32s & \centering 16 & signed & \centering 32 & \centering 512 & AVX512BW \\ \hline
Vec32us & \centering 16 & unsigned & \centering 32 & \centering 512 & AVX512BW \\ \hline
Vec16i & \centering 32 & signed & \centering 16 & \centering 512 & AVX512 \\ \hline
Vec16ui & \centering 32 & unsigned & \centering 16 & \centering 512 & AVX512 \\ \hline
Vec8q & \centering 64 & signed & \centering 8 & \centering 512 & AVX512 \\ \hline
Vec8uq & \centering 64 & unsigned & \centering 8 & \centering 512 & AVX512 \\ \hline
\end{tabular}
\end{table}
\vspacesmall
\begin {table}[H]
\caption{Floating point vector classes}
\label{table:FloatVectorClasses}
\begin{tabular}{|p{18mm}|p{18mm}|p{18mm}|p{18mm}|p{30mm}|}
\hline
\bfseries Vector class & \bfseries Precision & \bfseries Elements per vector & \bfseries Total bits & \bfseries Minimum
\newline recommended \newline instruction set \\ \hline
Vec4f & \centering single & \centering 4 & \centering 128 & SSE2 \\ \hline
Vec2d & \centering double & \centering 2 & \centering 128 & SSE2 \\ \hline
Vec8f & \centering single & \centering 8 & \centering 256 & AVX \\ \hline
Vec4d & \centering double & \centering 4 & \centering 256 & AVX \\ \hline
Vec16f & \centering single & \centering 16 & \centering 512 & AVX512 \\ \hline
Vec8d & \centering double & \centering 8 & \centering 512 & AVX512 \\ \hline
\end{tabular}
\end{table}
\vspacesmall
\begin {table}[H]
\caption{Boolean vector classes}
\label{table:BooleanVectorClasses}
\begin{tabular}{|p{18mm}|p{30mm}|p{18mm}|p{18mm}|p{30mm}|}
\hline
\bfseries Boolean vector class & \bfseries For use with & \bfseries Elements per vector & \bfseries Total size, bits & \bfseries Minimum
\newline recommended \newline instruction set\\ \hline
Vec16cb & \centering Vec16c, Vec16uc & \centering 16 & \centering 16 or 128 & SSE2 \\ \hline
Vec8sb & \centering Vec8s, Vec8us & \centering 8 & \centering 8 or 128 & SSE2 \\ \hline
Vec4ib & \centering Vec4i, Vec4ui & \centering 4 & \centering 8 or 128 & SSE2 \\ \hline
Vec2qb & \centering Vec2q, Vec2uq & \centering 2 & \centering 8 or 128 & SSE2 \\ \hline
Vec32cb & \centering Vec32c, Vec32uc & \centering 32 & \centering 32 or 256 & AVX2 \\ \hline
Vec16sb & \centering Vec16s, Vec16us & \centering 16 & \centering 16 or 256 & AVX2 \\ \hline
Vec8ib & \centering Vec8i, Vec8ui & \centering 8 & \centering 8 or 256 & AVX2 \\ \hline
Vec4qb & \centering Vec4q, Vec4uq & \centering 4 & \centering 8 or 256 & AVX2 \\ \hline
Vec64cb & \centering Vec64c, Vec64uc & \centering 64 & \centering 64 or 512 & AVX512BW \\ \hline
Vec32sb & \centering Vec32s, Vec32us & \centering 32 & \centering 32 or 512 & AVX512BW \\ \hline
Vec16ib & \centering Vec16i, Vec16ui & \centering 16 & \centering 16 or 512 & AVX512 \\ \hline
Vec8qb & \centering Vec8q, Vec8uq & \centering 8 & \centering 8 or 512 & AVX512 \\ \hline
Vec4fb & \centering Vec4f & \centering 4 & \centering 8 or 128 & SSE2 \\ \hline
Vec2db & \centering Vec2d & \centering 2 & \centering 8 or 128 & SSE2 \\ \hline
Vec8fb & \centering Vec8f & \centering 8 & \centering 8 or 256 & AVX \\ \hline
Vec4db & \centering Vec4d & \centering 4 & \centering 8 or 256 & AVX \\ \hline
Vec16fb & \centering Vec16f & \centering 16 & \centering 16 & AVX512 \\ \hline
Vec8db & \centering Vec8d & \centering 8 & \centering 8 & AVX512 \\ \hline
\end{tabular}
\end{table}
The size of the boolean vectors depends on the instruction set (see page \pageref{tableBooleanVectorSizes}).
\vspacebig
\section{Half precision floating point vectors} \label{HalfPrecision}
Half precision floating point numbers are represented by 16 bits (one sign bit, five exponent bits, and ten bits for the mantissa). Half precision is useful for sound, video, and artificial intelligence applications where the low precision is acceptable. A 512-bit vector register can contain 32 half-precision numbers and do 32 arithmetic operations simultaneously with a single instruction.
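\vspacesmall
The bit layout described above can be made concrete with a small decoder that turns a 16-bit pattern into a \codei{float}. This is an illustrative sketch in plain C++, assuming the standard exponent bias of 15. The function name \codei{halfBitsToFloat} is hypothetical and is not a VCL function.
\vspacesmall
\begin{lstlisting}[frame=single]
// Sketch: decode a half precision bit pattern
// (1 sign bit, 5 exponent bits, 10 mantissa bits).
// Hypothetical helper, not part of VCL.
#include <cmath>
#include <cstdint>
#include <limits>

inline float halfBitsToFloat(std::uint16_t h) {
    int sign     = (h >> 15) & 1;
    int exponent = (h >> 10) & 0x1F;
    int mantissa = h & 0x3FF;
    float value;
    if (exponent == 0) {
        // zero or subnormal: no implicit leading one bit
        value = std::ldexp(float(mantissa), -24);
    } else if (exponent == 31) {
        // all exponent bits set: infinity or NaN
        value = (mantissa == 0)
            ? std::numeric_limits<float>::infinity()
            : std::numeric_limits<float>::quiet_NaN();
    } else {
        // normal number: implicit leading one bit, exponent bias 15
        value = std::ldexp(float(mantissa | 0x400), exponent - 25);
    }
    return sign ? -value : value;
}
\end{lstlisting}
\vspacesmall
For example, the bit pattern 0x3C00 decodes to 1.0, and 0x7BFF decodes to 65504, which is the largest finite half precision number.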
\vspacesmall
Microprocessors have various levels of support for half precision. The F16C instruction set extension supports conversion between half precision and single precision, but not arithmetic operations on half precision numbers. F16C has been included in Intel and AMD processors since 2013-2014. The newer AVX512-FP16 instruction set extension implements a full set of arithmetic operations on half precision numbers.
\vspacesmall
The Vector Class Library supports half precision floating point vectors when the following header file is included:
\begin{lstlisting}[frame=single]
#include "vectorfp16.h"
\end{lstlisting}
\vspacesmall
The performance of half-precision vector calculations is highly dependent on the instruction set. Full performance is obtained only when the AVX512-FP16 instruction set is supported by the microprocessor and enabled in the compiler options.
\vspacesmall
The earlier F16C instruction set allows efficient conversion between single precision and half precision, but not arithmetic operations on half precision vectors. With F16C, arithmetic operations are emulated by converting the operands from half precision to single precision and converting each result back to half precision. This will be inefficient because intermediate results are converted back and forth, as illustrated in this example:
\vspacesmall
\begin{lstlisting}[frame=single]
#include "vectorfp16.h"
Vec8h a, b, c, d; // vectors of eight half precision numbers
d = a + b + c; // calculate sums
\end{lstlisting}
\vspacesmall
If this example is compiled with F16C, but not AVX512-FP16, then the code will convert a and b from half precision to single precision, calculate a+b with single precision, convert a+b back to half precision, then convert a+b to single precision again, convert c to single precision, do the next addition with single precision, and convert the final sum a+b+c back to half precision. This is of course not efficient. It is more efficient to do all the intermediate calculations with single precision:
\vspacesmall
\begin{lstlisting}[frame=single]
#include "vectorfp16.h"
Vec8h a, b, c, d; // half precision vectors
Vec8f aa = to_float(a); // convert to single precision
Vec8f bb = to_float(b);
Vec8f cc = to_float(c);
Vec8f dd = aa + bb + cc; // do the calculations
d = to_float16(dd); // convert the result to half precision
\end{lstlisting}
\vspacesmall
The ability to emulate half precision calculations as illustrated in the first example is useful for verifying half-precision code. This allows you to test whether half precision is sufficient for a particular task even when you do not have access to a computer with AVX512-FP16. If the goal is to get maximum performance then you should use half precision only on microprocessors with AVX512-FP16, but use single precision on microprocessors without AVX512-FP16.
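\vspacesmall
The effect of rounding every intermediate result to half precision can be explored without any special hardware. The sketch below is a simplified model in plain C++. It is not VCL code, and the function names are hypothetical. It rounds to 11 significant bits with round-to-nearest-even under the default rounding mode, and it ignores overflow, subnormal numbers, infinity and NaN.
\vspacesmall
\begin{lstlisting}[frame=single]
// Sketch: model the rounding of half precision arithmetic.
// Hypothetical helpers; normal finite values only.
#include <cassert>
#include <cmath>

// Round x to 11 significant bits (1 implicit + 10 stored bits).
inline float roundToHalfPrecision(float x) {
    assert(std::isfinite(x));
    int exponent = 0;
    // x = fraction * 2^exponent, with 0.5 <= |fraction| < 1
    float fraction = std::frexp(x, &exponent);
    // keep 11 bits; ties go to even in the default rounding mode
    float scaled = std::nearbyint(std::ldexp(fraction, 11));
    return std::ldexp(scaled, exponent - 11);
}

// Every operation is rounded to half precision.
inline float sumRoundedEachStep(float a, float b, float c) {
    float ab = roundToHalfPrecision(a + b);
    return roundToHalfPrecision(ab + c);
}

// Single precision intermediates, one final rounding.
inline float sumRoundedOnce(float a, float b, float c) {
    return roundToHalfPrecision(a + b + c);
}
\end{lstlisting}
\vspacesmall
With the inputs 2048, 1 and 1, the first function returns 2048 because half precision numbers between 2048 and 4096 are spaced 2 apart, while the second function returns 2050. A comparison of this kind indicates whether half precision is sufficient for a given calculation.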
\vspacesmall
The half precision code can run even on microprocessors without F16C, but this will be extremely slow because every conversion between single and half precision requires a long sequence of instructions. Therefore, it is important to enable F16C in the compiler when it is present. F16C is supported by some AMD and Intel processors that have AVX and all currently known processors that have AVX2 and later instruction sets. It may be useful to make a version of the code that uses conversion between single and half precision for processors that have both AVX2 and F16C, and a more efficient version that does calculations with half precision for processors that have AVX512-FP16. The functions \codei{hasF16C()} and \codei{hasAVX512FP16()} in instrset\_detect.cpp can be used for detecting microprocessor support for these instruction sets.
\vspacesmall
\subsection{Compiler support}
The AVX512-FP16 instruction set is supported by the following compilers and later versions:\\
g++ version 12.1.0 with binutils 2.38 \\
clang++ version 14.0.0 \\
Intel C++ compiler version 2021.2\\
Microsoft Visual Studio version 17.2 has incomplete support
\vspacesmall
The proper type for half precision scalars is \codei{\_Float16}. This type is supported by g++ and some Intel compilers. It is supported on the Clang compiler only when AVX512-FP16 is enabled. The vector class library will emulate a type named \codei{Float16} if \codei{\_Float16} is not supported by the compiler. This emulation includes only the most basic operations and operators on half precision floating point scalars, such as $+-*/$ and conversion to and from \codei{float}. Other operators and functions on \codei{Float16} are not emulated. \codei{Float16} is defined as \codei{\_Float16} whenever \codei{\_Float16} is supported by the compiler.
\vspacesmall
Do not use the types \codei{\_\_fp16} and \codei{\_\_bf16} that are available on some compilers. \codei{\_\_fp16} is an interchange format, not an arithmetic format. This means that variables of type \codei{\_\_fp16} will be immediately converted to \codei{float} before any operation. \codei{\_\_bf16} is an incompatible format available on some systems. \codei{\_\_bf16} has 8 exponent bits and 7 mantissa bits where \codei{\_Float16} and \codei{\_\_fp16} have 5 exponent bits and 10 mantissa bits.
\vspacesmall
\subsection{Half precision vector classes}
\begin {table}[H]
\caption{Half precision floating point vector classes}
\label{table:HalfVectorClasses}
\begin{tabular}{|p{18mm}|p{18mm}|p{18mm}|p{18mm}|p{30mm}|}
\hline
\bfseries Vector class & \bfseries Precision & \bfseries Elements per vector & \bfseries Total bits & \bfseries Minimum
\newline recommended \newline instruction set \\ \hline
Vec8h & \centering half & \centering 8 & \centering 128 & AVX512-FP16 \\ \hline
Vec16h & \centering half & \centering 16 & \centering 256 & AVX512-FP16 \\ \hline
Vec32h & \centering half & \centering 32 & \centering 512 & AVX512-FP16 \\ \hline
\end{tabular}
\end{table}
\vspacesmall
The corresponding boolean vector classes are Vec8hb, Vec16hb, and Vec32hb.
\vspacesmall
Subnormal numbers are supported for these vector classes regardless of the floating point control word. The floating point control word (see page \pageref{FPControlWordManipulationFunctions}) has no effect on half precision subnormal numbers.
\vspacesmall
\subsection{Functions and operators}
The half precision vectors can be used with the same operators and general functions as single and double precision vectors. Some mathematical functions are supported for half precision, including exponential and trigonometric functions.
\vspacesmall
Complex number algebra and functions with half precision are supported. See complexvec\_manual.pdf at
\href{https://github.com/vectorclass/add-on/tree/master/complex}{github}.
\vspacesmall
The following functions are available for conversion between single precision and half precision:
\begin {table}[H]
\caption{Conversion between single and half precision}
\label{table:HalfConversionFunctions}
\begin{tabular}{|p{30mm}|p{35mm}|p{70mm}|}
\hline
\bfseries Function & \bfseries Conversion & \bfseries Comment \\ \hline
convert8h\_4f & Vec8h -> Vec4f & only lower half is converted \\ \hline
to\_float & Vec8h -> Vec8f & \\ \hline
to\_float & Vec16h -> Vec16f & \\ \hline
convert4f\_8h & Vec4f -> Vec8h & upper half is zero \\ \hline
to\_float16 & Vec8f -> Vec8h & \\ \hline
to\_float16 & Vec16f -> Vec16h & \\ \hline
\end{tabular}
\end{table}
\vspacesmall
Vec32h cannot be converted directly to and from single precision because there is no Vec32f. Conversion to and from Vec32h can be coded as follows:
\vspacesmall
\begin{lstlisting}[frame=single]
#include "vectorfp16.h"
Vec16f a, b; // single precision vectors
Vec32h h; // half precision vector
// conversion from single to half precision:
h = Vec32h(to_float16(a), to_float16(b));
// conversion from half to single precision:
a = to_float(h.get_low()); // lower half
b = to_float(h.get_high()); // upper half
\end{lstlisting}
\vspacebig
\section{Constructing vectors and loading data into vectors} \label{ConstructingVectors}
There are many ways to create vectors and put data into vectors. These methods are listed here.
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & default constructor \\ \hline
\bfseries Defined for & all vector classes \\ \hline
\bfseries Description & the vector is created but not initialized. The value is unpredictable \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\vspacesmall
\begin{lstlisting}[frame=none]
// Example:
Vec4i a; // creates a vector of 4 signed integers
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & constructor with one parameter \\ \hline
\bfseries Defined for & all vector classes \\ \hline
\bfseries Description & all elements get the same value \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(5); // all four elements = 5
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & assignment from a scalar \\ \hline
\bfseries Defined for & all vector classes \\ \hline
\bfseries Description & all elements get the same value \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a = 6; // all four elements = 6
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & constructor with one parameter for each vector element \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & each element gets a specified value. The parameter for element number 0 comes first. \\ \hline
\bfseries Efficiency & good for constants, medium for variables as parameters \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Examples:
Vec4i a(10,11,12,13); // a = (10,11,12,13)
Vec4i b = Vec4i(20,21,22,23); // b = (20,21,22,23)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & constructor with one parameter for each half vector \\ \hline
\bfseries Defined for & all vector classes if a similar class of half the size exists \\ \hline
\bfseries Description & Concatenates two 128-bit vectors into one 256-bit vector. \newline
Concatenates two 256-bit vectors into one 512-bit vector \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(10,11,12,13);
Vec4i b(20,21,22,23);
Vec8i c(a, b); // c = (10,11,12,13,20,21,22,23)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & insert(index, value) \\ \hline
\bfseries Defined for & all vector classes \\ \hline
\bfseries Description & changes the value of element number (index) to (value). The index starts at 0 \\ \hline
\bfseries Efficiency & good if AVX512VL, medium otherwise \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(0);
a.insert(2, 9); // a = (0,0,9,0)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & load(const pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & loads all elements from an array \\ \hline
\bfseries Efficiency & good, except immediately after inserting elements one by one into the array \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
int list[8] = {10,11,12,13,14,15,16,17};
Vec4i a, b;
a.load(list); // a = (10,11,12,13)
b.load(list+4); // b = (14,15,16,17)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & load\_a(const pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & loads all elements from an aligned array \\ \hline
\bfseries Efficiency & good, except immediately after inserting elements one by one into the array \\ \hline
\end{tabular}
This method does the same as the \codei{load} method (see above), but requires that the pointer points to an address divisible by 16 for 128-bit vectors, by 32 for 256-bit vectors, or by 64 for 512-bit vectors. If you are not certain that the array is properly aligned then use \codei{load} instead of \codei{load\_a}. There is hardly any difference in efficiency between \codei{load} and \codei{load\_a} on newer microprocessors.
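\vspacesmall
The alignment requirement can be illustrated with a minimal sketch in standard C++. The helper \codei{is\_aligned} and the array \codei{aligned\_data} are hypothetical names that are not part of the vector class library. \codei{alignas} is the standard C++ specifier for requesting a specific alignment of a static or local array:
\begin{lstlisting}[frame=none]
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of VCL:
// true if the address in p is divisible by alignment
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// 32-byte alignment matches a 256-bit vector such as Vec8f
alignas(32) float aligned_data[16];
\end{lstlisting}
A pointer for which \codei{is\_aligned(p, 32)} is true satisfies the requirement of \codei{load\_a} for a 256-bit vector. Here, \codei{aligned\_data + 8} is also aligned by 32, while \codei{aligned\_data + 4} is aligned only by 16.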
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & load\_partial(int n, const pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & loads n elements from an array into a vector and sets the remaining elements to zero. 0 $\leq$ n $\leq$ (vector size). \\ \hline
\bfseries Efficiency & good if AVX512VL, medium otherwise \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
float list[3] = {1.0f, 1.1f, 1.2f};
Vec4f a;
a.load_partial(2, list); // a = (1.0, 1.1, 0.0, 0.0)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & cutoff(int n) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & leaves the first n elements unchanged and sets the remaining elements to zero. 0 $\leq$ n $\leq$ (vector size). \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(10, 11, 12, 13);
a.cutoff(2); // a = (10, 11, 0, 0)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & gather\textless indexes\textgreater (array) \\ \hline
\bfseries Defined for & floating point vector classes and integer vector classes with 32-bit and 64-bit elements \\ \hline
\bfseries Description & gather non-contiguous data from an array. \\ \hline
\bfseries Efficiency & medium \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
int list[8] = {10,11,12,13,14,15,16,17};
Vec4i a = gather4i<0,2,1,6>(list); // a = (10,12,11,16)
\end{lstlisting}
\vspacesmall
\section{Getting data from vectors} \label{GettingDataFromVectors}
There are many ways to extract elements or parts of a vector. These methods are listed here.
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & store(pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & stores all elements into an array \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(10,11,12,13);
Vec4i b(20,21,22,23);
int list[8];
a.store(list);
b.store(list+4); // list contains (10,11,12,13,20,21,22,23)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & store\_a(pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & stores all elements into an aligned array \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\vspacesmall
This method does the same as the \codei{store} method (see above), but requires that the pointer points to an address divisible by 16 for 128-bit vectors, by 32 for 256-bit vectors, or by 64 for 512-bit vectors. If you are not certain that the array is properly aligned then use \codei{store} instead of \codei{store\_a}.
There is hardly any difference in efficiency between \codei{store} and \codei{store\_a} on newer microprocessors.
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & store\_nt(pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & stores all elements into an aligned array without caching \\ \hline
\bfseries Efficiency & recommended only for very large arrays \\ \hline
\end{tabular}
\vspacesmall
This method does the same as the \codei{store\_a} method (see above), but without using the cache. This is optimal only for very large arrays when it is unlikely that the data will stay cached until they are read again.
As a rule of thumb, use \codei{store\_nt} for memory blocks bigger than half the size of the last-level cache. You will get a runtime error if the pointer is not properly aligned.
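\vspacesmall
The rule of thumb can be written as a small decision helper. This is only a sketch: the function name \codei{prefer\_store\_nt} and its parameters are hypothetical and not part of the vector class library, and the size of the last-level cache is assumed to be known by the caller:
\begin{lstlisting}[frame=none]
#include <cstddef>

// Hypothetical helper, not part of VCL:
// true if a memory block is bigger than half the size of the
// last-level cache, so that store_nt is likely to pay off
inline bool prefer_store_nt(std::size_t block_bytes,
                            std::size_t last_level_cache_bytes) {
    return block_bytes > last_level_cache_bytes / 2;
}
\end{lstlisting}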
\vspacebig
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & store\_partial(int n, pointer) \\ \hline
\bfseries Defined for & all integer and floating point vector classes \\ \hline
\bfseries Description & stores the first n elements into an array. The rest of the array is untouched.
0 $\leq$ n $\leq$ (vector size) \\ \hline
\bfseries Efficiency & good if AVX512VL, medium otherwise \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
float list[4] = {9.0f, 9.0f, 9.0f, 9.0f};
Vec4f a(1.0f, 1.1f, 1.2f, 1.3f);
a.store_partial(2, list); // list contains (1.0, 1.1, 9.0, 9.0)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & extract(index) \\ \hline
\bfseries Defined for & all integer, floating point and boolean vector classes \\ \hline
\bfseries Description & gets a single element from a vector \\ \hline
\bfseries Efficiency & good if AVX512VL, medium otherwise \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(10,11,12,13);
int b = a.extract(2); // b = 12
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & operator [ ] \\ \hline
\bfseries Defined for & all integer, floating point and boolean vector classes \\ \hline
\bfseries Description & gets a single element from a vector \\ \hline
\bfseries Efficiency & good if AVX512VL, medium otherwise \\ \hline
\end{tabular}
The operator [ ] does exactly the same as the \codei{extract} method. Note that you can read a vector element with the [ ] operator, but not write an element. Use the \codei{insert} method to change a single element.
\vspacesmall
\begin{lstlisting}[frame=none]
// Example:
Vec4i a(10,11,12,13);
int b = a[2]; // b = 12
a[3] = 5; // not allowed! Use a.insert(3, 5) instead
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & get\_low() \\ \hline
\bfseries Defined for & all vector classes of 256 bits or more \\ \hline
\bfseries Description & gets the lower half of a 256-bit vector as a 128-bit vector.\newline
gets the lower half of a 512-bit vector as a 256-bit vector.
\\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec8i a(10,11,12,13,14,15,16,17);
Vec4i b = a.get_low(); // b = (10,11,12,13)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & get\_high() \\ \hline
\bfseries Defined for & all vector classes of 256 bits or more \\ \hline
\bfseries Description & gets the upper half of a 256-bit vector as a 128-bit vector.\newline
gets the upper half of a 512-bit vector as a 256-bit vector.
\\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec8i a(10,11,12,13,14,15,16,17);
Vec4i b = a.get_high(); // b = (14,15,16,17)
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & size() \\ \hline
\bfseries Defined for & all vector classes \\ \hline
\bfseries Description & static constant member function indicating the number of elements that the vector can contain \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec8f a;
int s = a.size(); // s = 8
\end{lstlisting}
\vspacesmall
\begin{tabular}{|p{25mm}|p{100mm}|}
\hline
\bfseries Method & elementtype() \\ \hline
\bfseries Defined for & all vector classes \\ \hline
\bfseries Description & static constant member function indicating the type of elements that the vector contains: \newline
1: bits (internal base class) \newline
2: bool (compact) \newline
3: bool (broad) \newline
4: int8\_t \newline
5: uint8\_t \newline
6: int16\_t \newline
7: uint16\_t \newline
8: int32\_t \newline
9: uint32\_t \newline
10: int64\_t \newline
11: uint64\_t \newline
15: half precision floating point \newline
16: float (single precision floating point) \newline
17: double (double precision floating point) \\ \hline
\bfseries Efficiency & good \\ \hline
\end{tabular}
\begin{lstlisting}[frame=none]
// Example:
Vec16s a;
int t = a.elementtype(); // t = 6
\end{lstlisting}
%\indenton % undo \flushleft
\vspacesmall
\section{Arrays and vectors} \label{ArraysOfVectors}
Vectors are very useful for array loops with large data sets.
The add-on package named `containers' provides efficient container class templates for implementing arrays with fixed size and dynamic size, as well as matrices.
See containers\_manual.pdf for details.
\vspacesmall
If you are not using the add-on package `containers', or if you are making your own containers, then you need to consider the following.
\vspacesmall
Data arrays may have fixed size or variable size. A fixed size array is particularly efficient if the size is known when the program is compiled, or a reasonable upper limit can be set. For example:
\begin{lstlisting}[frame=none]
int const datasize = 1024; // size of dataset, constant
float mydata[datasize]; // constant size array
...
Vec8f x;
for (int i = 0; i < datasize; i += 8) {
x.load(mydata+i);
x = x * 0.1f + 2.0f;
x.store(mydata+i);
}
\end{lstlisting}
\vspacesmall
If the size of the array is determined at runtime then the most efficient solution is to allocate the array using the operator \codei{new}:
\begin{lstlisting}[frame=none]
int datasize = 1024; // size of dataset, variable (here a multiple of 8)
float *mydata = new float[datasize]; // allocate variable size array
...
Vec8f x;
for (int i = 0; i < datasize; i += 8) {
x.load(mydata+i);
x = x * 0.1f + 2.0f;
x.store(mydata+i);
}
...
delete[] mydata; // remember to free the allocated data
\end{lstlisting}
\vspacesmall
It is recommended to align an array by the vector size for optimal performance. See page \pageref{Alignment} for details.
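\vspacesmall
One possible way to get an aligned array of variable size is the aligned form of \codei{operator new} in C++17. The following is a minimal sketch under that assumption. The function names are hypothetical and not part of the vector class library:
\begin{lstlisting}[frame=none]
#include <cstddef>
#include <new>

// Hypothetical helpers, not part of VCL. Requires C++17.
// Allocate n floats at an address divisible by 64,
// which is sufficient for vectors of up to 512 bits
inline float* allocate_aligned_floats(std::size_t n) {
    return new (std::align_val_t(64)) float[n];
}

// Free an array allocated by allocate_aligned_floats
inline void free_aligned_floats(float* p) {
    ::operator delete[](p, std::align_val_t(64));
}
\end{lstlisting}
An array allocated in this way must be freed with the matching aligned form of \codei{operator delete[]}, as in \codei{free\_aligned\_floats}, rather than with a plain \codei{delete[]} expression.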
\vspacesmall
See page \pageref{NotAMultipleOfVectorSize} for discussion of the case where the data size is not a multiple of the vector size.
\vspacesmall
A matrix or multidimensional array can be implemented in various ways. If the length of each row is not more than the vector size, then it is convenient to use one VCL vector for each row. Longer rows can be contained in multiple VCL vectors. If the number of columns is variable then it is recommended to store the rows one after another in a linear array. Use padding space at the end of each row, if necessary, to align the next row by the vector size.
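\vspacesmall
The padding of rows in a linear array can be sketched as follows. The two helper functions are hypothetical and not part of the vector class library. They only illustrate the index calculation:
\begin{lstlisting}[frame=none]
#include <cstddef>

// Hypothetical helpers, not part of VCL.
// Round the number of columns up to the nearest multiple of
// the number of elements per vector, for example 8 for Vec8f
inline std::size_t padded_row_length(std::size_t columns,
                                     std::size_t elements_per_vector) {
    return (columns + elements_per_vector - 1)
        / elements_per_vector * elements_per_vector;
}

// Index of element (row, column) in the linear array
inline std::size_t matrix_index(std::size_t row, std::size_t column,
                                std::size_t row_length) {
    return row * row_length + column;
}
\end{lstlisting}
For example, a matrix of \codei{float} with 10 columns that is processed with \codei{Vec8f} gets a padded row length of 16, so that element (2, 3) is at index 35 in the linear array. Every row then begins at an aligned address, provided that the array itself is aligned.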
\vspacesmall
The standard C++ container classes are often inefficient. It is unfortunately common to implement matrices as nested container classes such as \codei{std::vector$<$std::vector$<$data\_type$>>$}. Such constructs allocate each row separately, so that the rows are scattered in memory rather than stored one after another. They are inefficient and should be avoided.
\vspacebig
\section{Using a namespace} \label{UsingANamespace}
In general, there is no need to put the vector class library into a separate namespace. Therefore, the use of a namespace is optional. You can give the vector class library a namespace, if necessary, by defining \codei{VCL\_NAMESPACE}, for example:
\begin{lstlisting}[frame=single]
#define VCL_NAMESPACE vcl
#include "vectorclass.h"
using namespace vcl;
// your vector code here...
\end{lstlisting}
\vspacesmall
\end{document}