
Core Parallel Loops

Sam Reeve edited this page May 18, 2023 · 2 revisions

Overview

Cabana offers multiple types of parallel execution extending Kokkos on-node parallelism. If you are not already familiar with Kokkos, we highly recommend reviewing the Kokkos documentation on parallel dispatch before using the Cabana parallel for operations.

Cabana parallelism maps both to the particle data layout (the SIMD parallel for) and to particle neighbor iteration (the neighbor parallel for/reduce).

Implementation

Header File: Cabana_Parallel.hpp

Usage: Parallel iteration over Cabana AoSoA or neighbor list objects.

SIMD Parallel For

The Cabana Slice is designed to represent single data members within an AoSoA as if they were a single multidimensional array. In most applications we will want to execute parallel kernels that operate on slices using the given hardware on a compute node (e.g. multicore CPUs, GPUs, or other accelerators). To realize the full benefit of using an AoSoA data structure in an application, different types of parallel loops will need to be deployed for different types of kernels. We identify two primary kernel types in an application:

  1. Kernels that allow for sequential data access and therefore readily vectorize/coalesce using 2D indexing.
  2. Kernels that do not easily vectorize/coalesce and therefore may benefit from the maximum parallelism afforded by 1D indexing.

Cabana handles both of these types of kernels by providing 1D and 2D indexing into the Slice data structure along with a new parallel for loop paradigm specifically for 2D indexing. Our concept builds directly on, and is complementary to, the Kokkos parallel_for concept (see the Kokkos wiki for more details). In the examples below we will demonstrate using the Cabana Slice in parallel for loops with kernels of both types. Note that in the examples below we directly use existing Kokkos infrastructure (e.g. KOKKOS_LAMBDA) as a means of easily implementing the computational kernels.

Vectorized/Coalesced Kernels

If we have a kernel that we know will easily vectorize/coalesce, then we want to use 2D indexing to access Slice data. On a CPU, this 2D indexing can be used to write a vector-length inner loop that will vectorize due to stride-1 access over the slice data, while on a GPU a 2D thread layout is generated to promote coalescing.

Consider the following data structure:

using DataTypes = Cabana::MemberTypes<double,double>;
using DeviceType = Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>;
const int VectorLength = 8;
const int num_tuple = 100;
Cabana::AoSoA<DataTypes,DeviceType,VectorLength> aosoa( num_tuple );
auto slice_0 = Cabana::slice<0>( aosoa );
auto slice_1 = Cabana::slice<1>( aosoa );

Next we will write a simple kernel known to vectorize:

auto vector_kernel =
    KOKKOS_LAMBDA( const int s, const int a )
    { slice_0.access(s,a) = slice_1.access(s,a); };

There are a few salient features of this kernel as compared to a functor that would normally be written for use with Kokkos::parallel_for. First, note that there are two index arguments instead of one. The first index, s, corresponds to the struct index of the data element on which the operations will be performed, and the second index, a, is the array index (i.e. the location of the data element in the vector-length array within the struct). Exposing two indices allows the compiler to easily see that data access will be stride-1 over index a.

Next, we make an execution policy to use with the new Cabana parallel for loop concept. Again, this is a complementary concept and data structure to those already implemented in Kokkos. In this case we make a SimdPolicy:

using ExecutionSpace = Kokkos::OpenMP;
Cabana::SimdPolicy<VectorLength,ExecutionSpace> simd_policy( 0, num_tuple );

The Cabana SIMD execution policy specifies the Kokkos execution space in which the functor will execute, the vector length of the data structure which is used to compose parallel loop bounds, and the range of 1D indices over which to execute the functor. The parallel for loop will automatically create an appropriate 2D index range and pass these indices to the user-defined functor. This includes 1D index ranges that do not start or stop directly on the boundary of a struct (i.e. begin and end are not a multiple of VectorLength).

Finally, we execute the SIMD kernel using the Cabana parallel for concept:

Cabana::simd_parallel_for( simd_policy, vector_kernel, "vector_op" );

Like Kokkos::parallel_for, arguments for this function include the execution policy, the SIMD kernel, and a name for the parallel code region. When certain memory spaces are used (e.g. Kokkos::CudaUVMSpace) a call to Kokkos::fence() may be needed after the kernel is executed to ensure the computation finishes before data is accessed on the host.

If a functor object is used instead of a lambda function, a work tag can also be used to decide which implementation of operator() to dispatch in a given kernel, in identical fashion to other Kokkos execution policies. To define which work tag to use, the tag is appended to the end of the execution policy type definition:

using WorkTag = MyFunctorTag;
Cabana::SimdPolicy<VectorLength,ExecutionSpace,WorkTag> simd_policy( 0, num_tuple );

Kernels that do not Vectorize/Coalesce

It will often be the case that a given kernel will be difficult to implement in a way that allows for vectorization/coalescing due to constraints such as irregular memory access patterns. In this case, we still want to take advantage of the memory locality offered by the AoSoA data structure, but we want to expose more parallelism than the Cabana::simd_parallel_for would create in cases that do not vectorize. If a kernel does not vectorize, then on some architectures (such as traditional CPUs) the inner vector-length for loop generated by the Cabana::simd_parallel_for will be executed in serial rather than in a single vector operation, thus introducing more instructions on fewer threads.

For cases such as this we instead use the Kokkos::parallel_for concept directly over a 1D range of indices. Consider the following kernel with a random data access pattern:

using PoolType = Kokkos::Random_XorShift64_Pool<ExecutionSpace>;
using RandomType = Kokkos::Random_XorShift64<ExecutionSpace>;
PoolType pool( 342343901 );
auto rand_kernel =
    KOKKOS_LAMBDA( const int i )
    {
        auto gen = pool.get_state();
        auto rand_idx = Kokkos::rand<RandomType,int>::draw(gen,0,num_tuple);
        slice_1(i) = slice_0( rand_idx );
        pool.free_state( gen );
    };

This kernel is likely best executed in a 1D parallel loop:

Kokkos::RangePolicy<ExecutionSpace> linear_policy( 0, num_tuple );
Kokkos::parallel_for( "rand_op", linear_policy, rand_kernel );

Here we have directly used a Kokkos::RangePolicy to perform the parallel loop. Note that, unlike Cabana::simd_parallel_for, Kokkos::parallel_for takes the kernel label as its first argument.

Other Parallel Dispatch Modes

It should be noted that the Cabana Slice and AoSoA can also be used directly with the remaining Kokkos parallel concepts, including Kokkos::parallel_scan and Kokkos::parallel_reduce, as well as the available hierarchical parallelism.

Examples

Example: SIMD Parallel For Tutorial

Example: SIMD Parallel For Unit Test

Neighbor Parallel For

In addition to parallelism that maps to the underlying data layouts, Cabana provides a particle-specific extension for looping over both particles and their neighbors. The Cabana::neighbor_parallel_for takes a Kokkos range policy, a Cabana neighbor list, a tag specifying which neighbors the kernel will use (first or second), and a tag specifying the level of threaded parallelism (either serial or threaded over neighbors). The first tag determines the valid functor interface for the neighbor kernel:

auto first_neighbor_kernel =
    KOKKOS_LAMBDA( const int i, const int j )
    {
        // Operations for each neighbor
    };

This can be used as follows:

using MemorySpace = Kokkos::HostSpace;
using ExecutionSpace = Kokkos::OpenMP;
Kokkos::RangePolicy<ExecutionSpace> policy( 0, num_particle );
Cabana::VerletList<MemorySpace, Cabana::FullNeighborTag, Cabana::VerletLayout2D> verlet_list( ... );
Cabana::neighbor_parallel_for( policy, first_neighbor_kernel, verlet_list,
                               Cabana::FirstNeighborsTag(),
                               Cabana::TeamOpTag(), "ex_1st_team" );

Similarly, for a kernel that uses both first and second nearest neighbors:

auto second_neighbor_kernel =
    KOKKOS_LAMBDA( const int i, const int j, const int k )
    {
        // Operations for each pair of neighbors
    };
Cabana::neighbor_parallel_for( policy, second_neighbor_kernel, verlet_list,
                               Cabana::SecondNeighborsTag(),
                               Cabana::TeamOpTag(), "ex_2nd_team" );

In both cases, the kernel is called for each particle, neighbor, and (if applicable) neighbor's neighbors.

To flexibly set and change the parallel threading, the tags are: Cabana::SerialOpTag (threading over central particles only, with serial loops over neighbors), Cabana::TeamOpTag (threading over particles and first neighbors), and Cabana::TeamVectorOpTag (threading over particles and both first and second neighbors, only valid with Cabana::SecondNeighborsTag).

Neighbor Parallel Reduce

Cabana::neighbor_parallel_reduce offers a parallel capability similar to the parallel for, extending Kokkos for reductions. The only difference is that the functor and function interfaces add a reduction variable; the same tags and syntax are used:

auto neighbor_reduce_kernel =
    KOKKOS_LAMBDA( const int i, const int j, double& sum )
    {
        // Accumulate a contribution from each neighbor pair.
        sum += slice_0( j );
    };
double team_sum = 0.0;
Cabana::neighbor_parallel_reduce( policy, neighbor_reduce_kernel, verlet_list,
                                  Cabana::FirstNeighborsTag(),
                                  Cabana::TeamOpTag(), team_sum, "ex_1st_reduce_team" );

Examples

Example: Neighbor Parallel Reduce Unit Test

This is part of the Programming Guide series
