A parallel algorithm to solve Poisson Equation using Jacobi with Successive Over-Relaxation(S.O.R) algorithm.
The goal of this project was to develop a parallel program that numerically solves Poisson's equation:
in range [-1,1]x[-1,1].
The program calculates the values of the matrix u
in the range above and also outputs the error between the numerical and the analytical solution which is:
The libraries we used to parallelize the algorithm are:
- MPI
- Hybrid MPI+OpenMp
- CUDA
- To run the MPI program, navigate to the
ParallelMPI
folder, compile usingmake
and run withmpirun jacobi_parallel_mpi.x -c < input
whereinput
is the input file with the parameters(example given in the folder). - To run the Hybrid MPI+OpenMp program, navigate to the
HybridMPI
folder, compile usingmake
and run withmpirun --bind-to none jacobi_hybrid_mpi.x -c < input
whereinput
is the input file with the parameters(example given in the folder). - To run the CUDA program, navigate to the
CUDA
folder, compile with make and run with./jacobi_cuda.x < input
for 1 GPU or./jacobi_cuda_2gpus.x < input
for 2 GPUs.
NOTE: All the programs come with a pbs script, because the benchmarks were made in a cluster provided by our university called argo
. To run them on your own system, the process may be different.
The benchmarks below, were made in a cluster provided by our university called argo
which consists of the following nodes:
- A front-end or master node (Dell OptiPlex 7050) to which we remotely connect to and has the following specs:
- Quad-Core Intel-Core i5-6500 @3.20GHz
- RAM 16GB DDR4 2.4GHz
- 10 compute nodes (Dell PowerEdge 1950) connected to the master node with ethernet 1G which were used to make the MPI and Hybrid MPI+OpenMP benchmarks. Each one of those has the following specs:
- Dual XEON E5340 @2.66GHz quad core (total 8 cores each node)
- RAM 16GB DDR2 667MHz
- Bus interconnection
- Connected together with 10G ethernet
- 1 GPU compute node (Dell Precision 7920 Rack):
- Intel Xeon Silver 4114 @ 2.2 GHz
- 10 Cores with hyperthreading
- RAM 16GB (2X8GB) DDR4 2666MHz
- Dual Nvidia Quatro P4000:
- 1792 CUDA cores
- RAM 8GB
- Memory Bandwidth: 243 GB/s
In the benchmarks below, Speedup and Efficiency are defined as follows:
- The Speedup is calculated as
S/p
whereS
is the sequeltial time andp
is the parallel time - The Efficiency is calculated as
s/t
wheres
is the speedup andt
is the number of processes or threads used.
The default parameters used are:
- alpha - Helmholtz constant = 0.8
- relax - successive over-relaxation parameter = 1.0
- tol - error tolerance for the iterrative solver = 1e-13
- mits - maximum solver iterations = 50
Here are the benchmarks we made in each program using the above cluster:
Grid Size | 840x840 | 1680x1680 | 3360x3360 | 6720x6720 | 13440x13440 | 26880x26880 |
---|---|---|---|---|---|---|
Time(seconds) | 0.511 | 1.905 | 7.48 | 29.77 | 118.953 | 477.191 |
Error | 0.000633904 | 0.000317239 | 0.000158679 | 7.93528e-05 | 3.96795e-05 | 1.98405e-05 |
Timings | |||||||||
---|---|---|---|---|---|---|---|---|---|
Grid Size/ Processes | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 80 |
840x840 | 0.800 | 0.203 | 0.097 | 0.072 | 0.045 | 0.046 | 0.039 | 0.039 | 0.042 |
1680x1680 | 3.185 | 0.816 | 0.377 | 0.215 | 0.147 | 0.105 | 0.095 | 0.076 | 0.076 |
3360x3360 | 12.700 | 3.241 | 1.474 | 0.933 | 0.561 | 0.411 | 0.299 | 0.227 | 0.194 |
6720x6720 | 50.775 | 12.839 | 5.794 | 3.604 | 2.191 | 1.636 | 1.149 | 0.928 | 0.761 |
13440x13440 | 202.996 | 51.198 | 23.022 | 14.184 | 8.555 | 6.385 | 4.408 | 3.610 | 2.903 |
26880x26880 | 815.197 | 204.764 | 91.834 | 56.695 | 33.995 | 25.227 | 17.460 | 14.304 | 11.420 |
Speedup | |||||||||
---|---|---|---|---|---|---|---|---|---|
Grid Size/ Processes | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 80 |
840x840 | 1 | 3.94 | 8.21 | 11.04 | 17.77 | 17.50 | 20.47 | 20.46 | 18.91 |
1680x1680 | 1 | 3.90 | 8.46 | 14.81 | 21.67 | 30.26 | 33.58 | 42.05 | 41.78 |
3360x3360 | 1 | 3.92 | 8.62 | 13.61 | 22.65 | 30.89 | 42.52 | 55.92 | 65.45 |
6720x6720 | 1 | 3.95 | 8.76 | 14.09 | 23.17 | 31.04 | 44.17 | 54.72 | 66.71 |
13440x13440 | 1 | 3.96 | 8.82 | 14.31 | 23.73 | 31.79 | 46.05 | 56.22 | 69.93 |
26880x26880 | 1 | 3.98 | 8.88 | 14.38 | 23.98 | 32.31 | 46.69 | 56.99 | 71.39 |
Efficiency | |||||||||
---|---|---|---|---|---|---|---|---|---|
Grid Size/ Processes | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 80 |
840x840 | 1 | 0.99 | 0.91 | 0.69 | 0.71 | 0.49 | 0.42 | 0.32 | 0.24 |
1680x1680 | 1 | 0.98 | 0.94 | 0.93 | 0.87 | 0.84 | 0.69 | 0.66 | 0.52 |
3360x3360 | 1 | 0.98 | 0.96 | 0.85 | 0.91 | 0.86 | 0.87 | 0.87 | 0.82 |
6720x6720 | 1 | 0.99 | 0.97 | 0.88 | 0.93 | 0.86 | 0.90 | 0.85 | 0.83 |
13440x13440 | 1 | 0.99 | 0.98 | 0.89 | 0.95 | 0.88 | 0.94 | 0.88 | 0.87 |
26880x26880 | 1 | 1.00 | 0.99 | 0.90 | 0.96 | 0.90 | 0.95 | 0.89 | 0.89 |
Timings | |||||||||
---|---|---|---|---|---|---|---|---|---|
Grid Size/ Διεργασίες | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 80 |
840x840 | 0.803 | 0.211 | 0.110 | 0.080 | 0.078 | 0.074 | 0.064 | 0.043 | 0.081 |
1680x1680 | 3.183 | 0.826 | 0.376 | 0.234 | 0.166 | 0.155 | 0.128 | 0.111 | 0.108 |
3360x3360 | 12.700 | 3.244 | 1.479 | 0.936 | 0.579 | 0.457 | 0.333 | 0.250 | 0.270 |
6720x6720 | 50.779 | 12.849 | 5.783 | 3.604 | 2.183 | 1.666 | 1.180 | 0.969 | 0.796 |
13440x13440 | 202.993 | 51.211 | 22.956 | 14.189 | 8.546 | 6.393 | 4.440 | 3.623 | 2.930 |
26880x26880 | 817.426 | 204.736 | 91.542 | 56.489 | 33.825 | 25.254 | 17.367 | 14.220 | 11.436 |
Speedup | |||||||||
---|---|---|---|---|---|---|---|---|---|
Grid Size/ Διεργασίες | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 80 |
840x840 | 1 | 3.80 | 7.28 | 9.98 | 10.35 | 10.91 | 12.52 | 18.50 | 9.93 |
1680x1680 | 1 | 3.85 | 8.47 | 13.62 | 19.19 | 20.53 | 24.78 | 28.76 | 29.44 |
3360x3360 | 1 | 3.92 | 8.59 | 13.57 | 21.92 | 27.79 | 38.12 | 50.80 | 46.98 |
6720x6720 | 1 | 3.95 | 8.78 | 14.09 | 23.26 | 30.48 | 43.02 | 52.38 | 63.79 |
13440x13440 | 1 | 3.96 | 8.84 | 14.31 | 23.75 | 31.75 | 45.72 | 56.02 | 69.28 |
26880x26880 | 1 | 3.99 | 8.93 | 14.47 | 24.17 | 32.37 | 47.07 | 57.48 | 71.48 |
Efficiency | |||||||||
---|---|---|---|---|---|---|---|---|---|
Grid Size/ Διεργασίες | 1 | 4 | 9 | 16 | 25 | 36 | 49 | 64 | 80 |
840x840 | 1 | 0.95 | 0.81 | 0.62 | 0.41 | 0.30 | 0.26 | 0.29 | 0.12 |
1680x1680 | 1 | 0.96 | 0.94 | 0.85 | 0.77 | 0.57 | 0.51 | 0.45 | 0.37 |
3360x3360 | 1 | 0.98 | 0.95 | 0.85 | 0.88 | 0.77 | 0.78 | 0.79 | 0.59 |
6720x6720 | 1 | 0.99 | 0.98 | 0.88 | 0.93 | 0.85 | 0.88 | 0.82 | 0.80 |
13440x13440 | 1 | 0.99 | 0.98 | 0.89 | 0.95 | 0.88 | 0.93 | 0.88 | 0.87 |
26880x26880 | 1 | 1.00 | 0.99 | 0.90 | 0.97 | 0.90 | 0.96 | 0.90 | 0.89 |
Timings | |||||||
---|---|---|---|---|---|---|---|
Grid Size/ Processes - Threads | 1 | 1p-4t | 2p-8t | 4p-16t | 9p-36t | 16p-64t | 20p-80t |
840x840 | 0.800 | 0.205 | 0.106 | 0.071 | 0.037 | 0.041 | 0.068 |
1680x1680 | 3.185 | 0.810 | 0.446 | 0.226 | 0.108 | 0.083 | 0.056 |
3360x3360 | 12.700 | 3.211 | 1.801 | 0.931 | 0.420 | 0.241 | 0.195 |
6720x6720 | 50.775 | 12.820 | 7.091 | 3.563 | 1.655 | 0.948 | 0.776 |
13440x13440 | 202.996 | 51.167 | 28.149 | 14.185 | 6.434 | 3.668 | 2.936 |
26880x26880 | 815.197 | 207.841 | 115.079 | 56.544 | 25.572 | 14.486 | 11.744 |
Speedup | |||||||
---|---|---|---|---|---|---|---|
Grid Size/ Threads | 1 | 4 | 8 | 16 | 36 | 64 | 80 |
840x840 | 1 | 3.90 | 7.54 | 11.26 | 21.84 | 19.39 | 11.69 |
1680x1680 | 1 | 3.93 | 7.14 | 14.08 | 29.52 | 38.53 | 57.07 |
3360x3360 | 1 | 3.96 | 7.05 | 13.65 | 30.25 | 52.67 | 65.26 |
6720x6720 | 1 | 3.96 | 7.16 | 14.25 | 30.68 | 53.56 | 65.44 |
13440x13440 | 1 | 3.97 | 7.21 | 14.31 | 31.55 | 55.34 | 69.15 |
26880x26880 | 1 | 3.92 | 7.08 | 14.42 | 31.88 | 56.27 | 69.42 |
Efficiency | |||||||
---|---|---|---|---|---|---|---|
Grid Size/ Threads | 1 | 4 | 8 | 16 | 36 | 64 | 80 |
840x840 | 1 | 0.98 | 0.94 | 0.70 | 0.61 | 0.30 | 0.15 |
1680x1680 | 1 | 0.98 | 0.89 | 0.88 | 0.82 | 0.60 | 0.71 |
3360x3360 | 1 | 0.99 | 0.88 | 0.85 | 0.84 | 0.82 | 0.82 |
6720x6720 | 1 | 0.99 | 0.90 | 0.89 | 0.85 | 0.84 | 0.82 |
13440x13440 | 1 | 0.99 | 0.90 | 0.89 | 0.88 | 0.86 | 0.86 |
26880x26880 | 1 | 0.98 | 0.89 | 0.90 | 0.89 | 0.88 | 0.87 |
Timings | |||||||
---|---|---|---|---|---|---|---|
Grid Size/ Processes - Threads | 1 | 1p-4t | 2p-8t | 4p-16t | 9p-36t | 16p-64t | 20p-80t |
840x840 | 0.803 | 0.205 | 0.114 | 0.071 | 0.051 | 0.068 | 0.059 |
1680x1680 | 3.183 | 0.811 | 0.456 | 0.237 | 0.127 | 0.093 | 0.074 |
3360x3360 | 12.700 | 3.208 | 1.776 | 0.938 | 0.434 | 0.264 | 0.242 |
6720x6720 | 50.779 | 12.813 | 7.070 | 3.599 | 1.653 | 0.959 | 0.778 |
13440x13440 | 202.993 | 51.122 | 28.082 | 14.242 | 6.427 | 3.674 | 3.108 |
26880x26880 | 817.426 | 207.031 | 115.019 | 56.769 | 25.465 | 14.352 | 11.619 |
Speedup | |||||||
---|---|---|---|---|---|---|---|
Grid Size/ Threads | 1 | 4 | 8 | 16 | 36 | 64 | 80 |
840x840 | 1 | 3.92 | 7.04 | 11.30 | 15.75 | 11.73 | 13.53 |
1680x1680 | 1 | 3.93 | 6.98 | 13.40 | 25.05 | 34.27 | 43.05 |
3360x3360 | 1 | 3.96 | 7.15 | 13.54 | 29.25 | 48.16 | 52.41 |
6720x6720 | 1 | 3.96 | 7.18 | 14.11 | 30.73 | 52.97 | 65.26 |
13440x13440 | 1 | 3.97 | 7.23 | 14.25 | 31.59 | 55.25 | 65.32 |
26880x26880 | 1 | 3.95 | 7.11 | 14.40 | 32.10 | 56.95 | 70.35 |
Efficiency | |||||||
---|---|---|---|---|---|---|---|
Grid Size/ Threads | 1 | 4 | 8 | 16 | 36 | 64 | 80 |
840x840 | 1 | 0.98 | 0.88 | 0.71 | 0.44 | 0.18 | 0.17 |
1680x1680 | 1 | 0.98 | 0.87 | 0.84 | 0.70 | 0.54 | 0.54 |
3360x3360 | 1 | 0.99 | 0.89 | 0.85 | 0.81 | 0.75 | 0.66 |
6720x6720 | 1 | 0.99 | 0.90 | 0.88 | 0.85 | 0.83 | 0.82 |
13440x13440 | 1 | 0.99 | 0.90 | 0.89 | 0.88 | 0.86 | 0.82 |
26880x26880 | 1 | 0.99 | 0.89 | 0.90 | 0.89 | 0.89 | 0.88 |
Timings | ||
---|---|---|
Grid Size | 1 GPU | 2 GPUs |
840x840 | 0.011 | 0.007 |
1680x1680 | 0.040 | 0.022 |
3360x3360 | 0.160 | 0.085 |
6720x6720 | 0.569 | 0.317 |
13440x13440 | 1.923 | 1.040 |
26880x26880 | - | 3.748 |
*The big grid could not be computed with 1 GPU because of insufficient VRAM.
As you can see, the CUDA program has the best performance because GPUs have tons of threads (which means they can execute tons of computations in parallel), and extremely fast vram.