\section{What Problems Fit on a GPU?}
\subsection{What are GPUs good for?}
\par
An answer from~\textbf{Stack Exchange}~\cite{stackexchange}:~\emph{From a metaphorical point of view, the GPU can be seen as a person lying on a bed of nails. The person lying on top is the data and in the base of each nail there is a processor, so the nail is actually an arrow pointing from processor to memory. All nails are in a regular pattern, like a grid. If the body is well spread, it feels good (performance is good), if the body only touches some spots of the nail bed, then the pain is bad (bad performance).}
\par
GPU computing is well-suited to problems that involve large amounts of data parallelism.
Specifically, you can expect good performance on GPUs for:
\begin{itemize}
\item \textbf{Large-scale matrix and vector operations}: Common in machine learning, scientific computing, and image processing (a minimal sketch of this pattern follows this list).
\item \textbf{Fourier transforms}: Also common in machine learning, scientific computing, and image processing.
\item \textbf{Monte Carlo simulations}: Used across finance, physics, and other fields to simulate complex systems.
\item \textbf{Molecular dynamics simulations}: Used in chemistry, biochemistry and physics.
\item \textbf{Computational fluid dynamics}: Used in engineering, physics, and other fields.
\item \textbf{Convolutional neural networks and computer vision algorithms}.
\item \textbf{Big data analytics}: Clustering, classification, regression, etc.
\item \textbf{Graphics rendering}: Original use-case for GPUs.
\end{itemize}
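\par
To illustrate the first item above, below is a minimal, hedged sketch of data-parallel array arithmetic in Julia. It assumes an NVIDIA GPU and the CUDA.jl package; on AMD GPUs, the AMDGPU.jl package plays the same role.
\begin{lstlisting}[caption={A minimal sketch of data-parallel vector arithmetic on a GPU (assumes CUDA.jl and an NVIDIA GPU).}, label={lst:03_sketch_data_parallel}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
# Minimal sketch, assuming CUDA.jl and an NVIDIA GPU are available.
using CUDA

n = 2^20
x = CUDA.fill(1.0f0, n)   # allocate and fill directly on the GPU
y = CUDA.fill(2.0f0, n)

# Broadcasting fuses into a single GPU kernel; every element is
# processed independently -- exactly the pattern GPUs excel at.
z = x .* 2.0f0 .+ y

@assert all(Array(z) .== 4.0f0)   # copy back to the host and check
\end{lstlisting}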
% -------------------------------------------------------------------- %
\subsection{What are GPUs not good for?}
\par
Not all programming problems can efficiently leverage the parallelism offered by GPUs.
Some types of problems that do not fit well on a GPU include:
\begin{itemize}
\item \textbf{Sequential tasks}: Problems that require a series of dependent steps, where each step relies on the outcome of the previous step, are not well-suited for parallel processing. Examples include recursive algorithms, certain dynamic programming problems, and some graph traversal algorithms (a small sketch of such a dependency follows this list).
\item \textbf{Fine-grained branching}: GPUs perform best when the code being executed across different threads follows a similar control flow. When there is extensive branching (i.e., many \texttt{if} statements) within a kernel or algorithm, performance may suffer due to the divergence in execution paths among the GPU threads.
\item \textbf{Low arithmetic intensity}: GPUs excel at performing a large number of mathematical operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of arithmetic operations to memory accesses), the GPU may not be able to efficiently utilize its computational power, leading to underperformance.
\item \textbf{Small data sets}: If the problem involves a small data set that does not require significant parallelism, using a GPU may not result in noticeable performance gains. In such cases, the overhead of transferring data between the CPU and GPU, and the time spent initializing the GPU, may outweigh any potential benefits.
\item \textbf{Limited parallelism}: Some algorithms have inherent limitations on the degree of parallelism that can be achieved. In these cases, using a GPU may not lead to significant performance improvements.
\item \textbf{Memory-bound problems}: GPUs generally have less memory available compared to CPUs, and their memory bandwidth can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations, it may not be well-suited for a GPU.
\end{itemize}
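\par
As a hedged illustration of the first point, the sketch below shows a loop in which every iteration depends on the previous one, so the work cannot be spread across GPU threads.
\begin{lstlisting}[caption={A minimal sketch of a sequential dependency that resists GPU parallelism.}, label={lst:03_sketch_sequential}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
# Each iterate depends on the previous one, so the steps must run
# one after another; a GPU cannot execute them concurrently.
function sequential_iterates(x0, n)
    x = x0
    for i in 1:n
        x = 0.5 * (x + 2.0 / x)   # Newton step for sqrt(2); step i needs step i-1
    end
    return x
end

sequential_iterates(1.0, 10^6)   # converges to sqrt(2) ~ 1.4142
\end{lstlisting}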
% -------------------------------------------------------------------- %
\subsection{Examples of GPU acceleration}
\par
To give a flavor of the performance gains achievable by porting a calculation to a GPU, let’s look at a few case examples.
\subsubsection{Matrix multiplication}
\par
Consider the matrix multiplication example in the Julia language shown in Listing~\ref{lst:03_example_julia_matrix_multiplication}, and think about the following questions:
\begin{itemize}
\item How much faster do you think the GPU version is compared to running on a single CPU core?
\item Julia automatically parallelizes matrix multiplication over available CPU cores. Will the GPU version be faster than running on 64 cores?
\item Does the size of the array affect how much the performance improves?
\end{itemize}
\lstinputlisting[caption={An example of matrix multiplication in Julia.}, label={lst:03_example_julia_matrix_multiplication}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/03_example_julia_matrix_multiplication.jl}
\par
The benchmark results for this matrix multiplication example, obtained on LUMI (an AMD MI250X GPU and 64-core AMD Trento CPUs), are listed in Table~\ref{tbl:example_julia_matrix_multiplication}.
\begin{table}[h]
\centering\caption{Benchmark results for the matrix multiplication example shown in Listing~\ref{lst:03_example_julia_matrix_multiplication}. The GPU speedup column gives the speedup relative to 1 CPU core / 64 CPU cores.}\label{tbl:example_julia_matrix_multiplication}
\begin{tabular}{ |c|c|c|c|c| }
\hline
\textbf{Matrix size} & \textbf{1 CPU core} & \textbf{64 CPU cores} & \textbf{1 GPU} & \textbf{GPU speedup} \\
\hline
(512, 512) & 5.472 ms & 517.722 $\mu$s & 115.805 $\mu$s & $\sim$47x/$\sim$5x \\
(1024, 1024) & 43.364 ms & 2.929 ms & 173.316 $\mu$s & $\sim$250x/$\sim$17x \\
(2048, 2048) & 344.364 ms & 30.081 ms & 866.348 $\mu$s & $\sim$400x/$\sim$35x \\
(4096, 4096) & 3.221 s & 159.563 ms & 5.910 ms & $\sim$550x/$\sim$27x \\
\hline
\end{tabular}
\end{table}
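\par
For reference, a comparison along these lines can be timed as in the sketch below. This is a hedged sketch, not the code of Listing~\ref{lst:03_example_julia_matrix_multiplication}: it assumes the CUDA.jl and BenchmarkTools.jl packages; on LUMI's AMD GPUs, AMDGPU.jl and its \texttt{ROCArray} type would take the place of CUDA.jl and \texttt{CuArray}.
\begin{lstlisting}[caption={A sketch of timing matrix multiplication on CPU and GPU (assumes CUDA.jl and BenchmarkTools.jl).}, label={lst:03_sketch_matmul_benchmark}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
using CUDA, BenchmarkTools

n = 4096
A, B = rand(Float32, n, n), rand(Float32, n, n)
dA, dB = CuArray(A), CuArray(B)    # copy the inputs to the GPU

@btime $A * $B;                    # CPU (multi-threaded BLAS)
@btime CUDA.@sync $dA * $dB;       # GPU; @sync waits for the kernel to finish
\end{lstlisting}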
\subsubsection{Electronic structure calculations}
\par
VASP (Vienna Ab initio Simulation Package) is a popular software package used for electronic structure calculations.
Fig.~\ref{fig:vasp_gpu} and Fig.~\ref{fig:vasp_energy} present the speedup observed in a recent benchmark study on the Perlmutter and Cori supercomputers, along with an analysis of total energy usage.
\begin{figure}[!h]
\centering\includegraphics[width=0.8\textwidth]{fig_problem/vasp_gpu.jpg}
\caption{VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number of nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs).}\label{fig:vasp_gpu}
\end{figure}
\begin{figure}[!h]
\centering\includegraphics[width=0.8\textwidth]{fig_problem/vasp_energy.jpg}
\caption{Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars), relative to the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs).}\label{fig:vasp_energy}
\end{figure}
\subsubsection{Computational Chemistry}
\par
A great deal of computational resources is spent on quantum chemical calculations that involve solving the Hartree-Fock eigenvalue problem. This requires the diagonalization of the Fock matrix, whose elements are given by:
\begin{equation}\label{eq:hf_equation}
F_{\alpha\beta} = H_{\alpha\beta}^{\mathrm{core}} + \sum_{\gamma\delta}D_{\gamma\delta}\Big[(\alpha\beta|\gamma\delta)-\frac{1}{2}(\alpha\delta|\gamma\beta)\Big] \,.
\end{equation}
\par
The first term represents the one-electron contributions, and the second term contains the electron repulsion integrals (ERIs), in parentheses, weighted by the density matrix $D_{\gamma\delta}$.
One of the most expensive parts of solving the Hartree-Fock equations is the processing (digestion) of the ERIs; one algorithm for this task is shown in Fig.~\ref{fig:hf_algorithm}.
\begin{figure}[!h]
\centering\includegraphics[width=0.4\textwidth]{fig_problem/hartree_fock_algorithms.png}
\caption{The pseudocode for a na\"ive ERI digestion algorithm~\cite{barca2021faster}.}\label{fig:hf_algorithm}
\end{figure}
\par
This algorithm is suitable for GPUs as it involves many arithmetic operations.
In addition, there are symmetries and properties of the integrals that can be used to rearrange the loops in an efficient manner that fits GPU architectures.
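\par
To make the structure of the computation concrete, below is a hedged sketch of the na\"ive ERI digestion loop of Fig.~\ref{fig:hf_algorithm} in Julia. The function \texttt{eri(a, b, c, d)} stands for the two-electron integral $(\alpha\beta|\gamma\delta)$ and is a hypothetical helper, not a real API.
\begin{lstlisting}[caption={A hedged sketch of a na\"ive ERI digestion loop building the two-electron part of the Fock matrix.}, label={lst:03_sketch_eri_digestion}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
# `eri` is a hypothetical function returning the integral (ab|cd);
# `D` is the density matrix, `F` the Fock matrix being accumulated.
function fock_two_electron!(F, D, eri, nbf)
    for a in 1:nbf, b in 1:nbf, c in 1:nbf, d in 1:nbf
        # Coulomb minus half exchange, cf. the Fock matrix expression above
        F[a, b] += D[c, d] * (eri(a, b, c, d) - 0.5 * eri(a, d, c, b))
    end
    return F
end
\end{lstlisting}
\par
Each $(a, b)$ pair can be processed independently, and every iteration performs a handful of arithmetic operations, which is why this digestion step maps well onto the massive thread parallelism of a GPU.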
\subsubsection{Humanities}
\par
Below is a brief introduction to some of the work being done in the humanities that can benefit from utilizing GPUs.
\paragraph{Language models and NLP (natural language processing).}
With the recent popularity of ChatGPT, the use of language models has come into the mainstream; however, such models have been used in the humanities for many years already.
Much of the work of humanities researchers involves textual data, the volume of which has grown enormously in recent years with the rise of social media.
Analyzing such textual data to gain insights into questions of sociology, linguistics, and various other fields has become increasingly reliant on language models.
With the growing use of language models, access to GPUs has become essential.
\paragraph{Archeology.}
The field of archeology also makes use of GPUs in its 3D modeling and rendering work.
The biggest problem with archeological sites is that once they are excavated, they are destroyed, so researchers who were not present at the site would lose valuable insights into how it looked when it was found.
However, with recent developments in technology and access to high-performance computing, researchers can generate extremely detailed renderings of excavation sites, which preserve the sites digitally so that future researchers can gain critical insights and contribute to the research.
\paragraph{Cognitive Science.}
Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics.
MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces.
This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems.
By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations.
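\par
As a minimal, hedged illustration of the method, the sketch below implements a plain Metropolis-Hastings sampler for a one-dimensional standard normal distribution; it highlights the chain-dependent, iterative structure of MCMC.
\begin{lstlisting}[caption={A minimal Metropolis-Hastings sketch for sampling a standard normal distribution.}, label={lst:03_sketch_mcmc}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
# `logp` is the log-density of the target distribution; each step
# proposes a random move and accepts it with the Metropolis rule.
function metropolis(logp, x0, nsteps; step = 1.0)
    xs = Vector{Float64}(undef, nsteps)
    x  = x0
    for i in 1:nsteps
        cand = x + step * randn()                 # propose a move
        if log(rand()) < logp(cand) - logp(x)     # accept or reject
            x = cand
        end
        xs[i] = x                                 # record the current state
    end
    return xs
end

samples = metropolis(x -> -x^2 / 2, 0.0, 10_000)  # target: standard normal
\end{lstlisting}
\par
A single chain like this is inherently sequential; on a GPU, the speedup typically comes from running many independent chains in parallel or from parallelizing the likelihood evaluation over large datasets.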
\par
By offloading the computational workload to GPUs, researchers can experience substantial speedup in the execution of MCMC algorithms.
This speedup allows for more extensive exploration of parameter spaces and facilitates the analysis of larger datasets, leading to more accurate and detailed insights into human behavior or population dynamics.
Examples of studies done using these methods can be found at the Center for Humanities Computing Aarhus (CHCAA) and Interacting Minds Centre (IMC) at Aarhus University.