List of Functor-Level Interfaces in Batched Functions

BLAS and LAPACK Functions

The Kokkos batched APIs are designed and implemented for the needs of a specific application. Most of the functions described here are used to create a block tridiagonal line solver in which the block sizes are tiny. This implies that the computational intensity is not high, and we assume that the data fits into cache. Thus, loop unrolling (Algo::Blocked) can be beneficial on some architectures. Conversely, one may consider Algo::Unblocked with a team to increase fine-grained parallelism on CUDA.

AddRadial

/// [template]MemberType: member type derived from a team policy
/// [in]tiny: tiny scalar value to perturb diagonals of A
/// [in/out]A: 2d view
int SerialAddRadial::invoke(const ScalarType tiny, const AViewType &A);
int TeamAddRadial<MemberType>::invoke(const MemberType &member,
                                      const ScalarType tiny, const AViewType &A);

This perturbs the diagonal values of A so that they stay away from zero: A = A + diag(tiny*sign(diag(A))).
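As a plain C++ illustration of the formula above, the following sketch applies the same perturbation to a small dense matrix. It is not the KokkosBatched implementation: the name add_radial_reference and the flat row-major storage are assumptions made for this example, and treating a zero diagonal entry as non-negative is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not a KokkosBatched function).
// A is an m x m matrix stored row-major in a flat vector.
void add_radial_reference(double tiny, std::vector<double>& A, std::size_t m) {
  assert(A.size() == m * m);
  const double abs_tiny = tiny < 0 ? -tiny : tiny;
  for (std::size_t i = 0; i < m; ++i) {
    double& d = A[i * m + i];
    // Push the diagonal entry away from zero in the direction of its sign.
    // Assumption: a zero entry is treated as non-negative.
    d += (d < 0 ? -abs_tiny : abs_tiny);
  }
}
```

Off-diagonal entries are left untouched, so the perturbation only affects quantities that are later used as pivots or divisors.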

Headers

KokkosBatched_AddRadial_Decl.hpp
KokkosBatched_AddRadial_Impl.hpp

Copy

/// [template]MemberType: member type derived from a team policy
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [in]A: 1d or 2d view  
/// [out]B: 1d or 2d view
int SerialCopy<TransType>::invoke(const AViewType &A, const BViewType &B);
int TeamCopy<MemberType,TransType>::invoke(const MemberType &member,
                                           const AViewType &A, const BViewType &B);

Copies the elements of A into B, performing B = op(A).

Headers

KokkosBatched_Copy_Decl.hpp
KokkosBatched_Copy_Impl.hpp

Scale

/// [template]MemberType: member type derived from a team policy
/// [in]alpha: scalar value 
/// [in/out]A: 1d or 2d view
int SerialScale::invoke(const ScalarType alpha, const AViewType &A);
int TeamScale<MemberType>::invoke(const MemberType &member,
                                  const ScalarType alpha, const AViewType &A);

Scales the values of A with a scalar alpha.

Headers

KokkosBatched_Scale_Decl.hpp
KokkosBatched_Scale_Impl.hpp

Set

/// [template]MemberType: member type derived from a team policy
/// [in]alpha: scalar value 
/// [out]A: 1d or 2d view
int SerialSet::invoke(const ScalarType alpha, const AViewType &A);
int TeamSet<MemberType>::invoke(const MemberType &member,
                                const ScalarType alpha, const AViewType &A);

Sets all values of A to a scalar alpha.

Headers

KokkosBatched_Set_Decl.hpp
KokkosBatched_Set_Impl.hpp

Gemv

/// [template]MemberType: member type derived from a team policy
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: scalar value
/// [in]A: 2d view
/// [in]x: 1d view
/// [in]beta: scalar value
/// [in/out]y: 1d view
int SerialGemv<TransType,AlgoType>
    ::invoke(const ScalarType alpha, const AViewType &A, const xViewType &x,
             const ScalarType beta, const yViewType &y); 
int TeamGemv<MemberType,TransType,AlgoType>
    ::invoke(const MemberType &member,
             const ScalarType alpha, const AViewType &A, const xViewType &x,
             const ScalarType beta, const yViewType &y);

Performs a general matrix-vector multiplication: y = beta y + alpha op(A) x.
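To make the role of op(A), alpha, and beta concrete, here is a minimal reference sketch of the same operation in plain C++. It only mirrors the formula above; the name gemv_reference, the bool transpose flag (standing in for TransType), and the row-major flat storage are assumptions for illustration and are not part of the KokkosBatched API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: y = beta*y + alpha*op(A)*x.
// A is m x n, row-major. op(A) = A if !transpose, A^T otherwise.
void gemv_reference(bool transpose, double alpha, const std::vector<double>& A,
                    std::size_t m, std::size_t n, const std::vector<double>& x,
                    double beta, std::vector<double>& y) {
  assert(A.size() == m * n);
  const std::size_t rows = transpose ? n : m;  // rows of op(A) = length of y
  const std::size_t cols = transpose ? m : n;  // cols of op(A) = length of x
  assert(x.size() == cols && y.size() == rows);
  for (std::size_t i = 0; i < rows; ++i) {
    double acc = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
      // Entry (i, j) of op(A): A(i, j) or A(j, i).
      const double a = transpose ? A[j * n + i] : A[i * n + j];
      acc += a * x[j];
    }
    y[i] = beta * y[i] + alpha * acc;
  }
}
```

Note that the lengths of x and y swap when the transpose is selected, which is the usual source of dimension mismatches when calling a Gemv-style routine.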

Headers

KokkosBatched_Gemv_Decl.hpp
KokkosBatched_Gemv_Serial_Impl.hpp
KokkosBatched_Gemv_Team_Impl.hpp

Trsv

/// [template]MemberType: member type derived from a team policy
/// [template]UploType: indicates either upper triangular or lower triangular; Uplo::Upper, Uplo::Lower
/// [template]TransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [template]DiagType: diagonals; Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: scalar value
/// [in]A: 2d view
/// [in/out]b: 1d view
int SerialTrsv<UploType,TransType,DiagType,AlgoType>
    ::invoke(const ScalarType alpha, const AViewType &A, const bViewType &b);
int TeamTrsv<MemberType,UploType,TransType,DiagType,AlgoType>
    ::invoke(const MemberType &member,
             const ScalarType alpha, const AViewType &A, const bViewType &b);

Solves the triangular system op(A) x = alpha b; b is overwritten with the solution x = alpha op(A)^{-1} b. DiagType indicates whether the diagonal entries of A are assumed to be one (Diag::Unit) or are read from A (Diag::NonUnit).
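The in-place behaviour and the meaning of the unit-diagonal option can be seen in a small forward-substitution sketch. This is a reference illustration only, restricted to the lower-triangular, non-transposed case; the name trsv_lower_reference, the bool unit_diag flag (standing in for DiagType), and the assumption that alpha scales the right-hand side are choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: solve A x = alpha*b for lower-triangular A
// (m x m, row-major). b is overwritten with x.
void trsv_lower_reference(bool unit_diag, double alpha,
                          const std::vector<double>& A, std::size_t m,
                          std::vector<double>& b) {
  assert(A.size() == m * m && b.size() == m);
  for (std::size_t i = 0; i < m; ++i) {
    double rhs = alpha * b[i];
    // b[j] for j < i already holds the solution component x[j].
    for (std::size_t j = 0; j < i; ++j) rhs -= A[i * m + j] * b[j];
    // With a unit diagonal, A(i, i) is never read.
    b[i] = unit_diag ? rhs : rhs / A[i * m + i];
  }
}
```

With the unit-diagonal option the stored diagonal is ignored, which is why the L factor of an LU factorization can share storage with U.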

Headers

KokkosBatched_Trsv_Decl.hpp
KokkosBatched_Trsv_Serial_Impl.hpp
KokkosBatched_Trsv_Team_Impl.hpp

Gemm

/// [template]MemberType: member type derived from a team policy
/// [template]ATransType: transpose of A; Trans::NoTranspose, Trans::Transpose
/// [template]BTransType: transpose of B; Trans::NoTranspose, Trans::Transpose
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: scalar value
/// [in]A: 2d view
/// [in]B: 2d view
/// [in]beta: scalar value
/// [in/out]C: 2d view
int SerialGemm<ATransType,BTransType,AlgoType>
    ::invoke(const ScalarType alpha, const AViewType &A, const BViewType &B,
             const ScalarType beta, const CViewType &C);
int TeamGemm<MemberType,ATransType,BTransType,AlgoType>
    ::invoke(const MemberType &member,
             const ScalarType alpha, const AViewType &A, const BViewType &B,
             const ScalarType beta, const CViewType &C);

Performs general matrix-matrix multiplications: C = beta C + alpha op(A) op(B).

Headers

KokkosBatched_Gemm_Decl.hpp
KokkosBatched_Gemm_Serial_Impl.hpp
KokkosBatched_Gemm_Team_Impl.hpp

LU without pivoting

/// [template]MemberType: member type derived from a team policy
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in/out]A: 2d view
/// [in]tiny: a magnitude scalar value compatible to the value type of A
int SerialLU<AlgoType>
    ::invoke(const AViewType &A, 
             const MagnitudeScalarType tiny = 0);
int TeamLU<MemberType,AlgoType>
    ::invoke(const MemberType &member,
             const AViewType &A, 
             const MagnitudeScalarType tiny = 0);

Performs LU factorization without pivoting. Static pivoting is applied with a tiny value to avoid division by zero. This factorization is not suitable for general matrices; however, it is useful for small block matrices. For instance, the routine is used to construct a block Jacobi preconditioner.
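The following sketch shows what an in-place LU factorization without pivoting looks like, and where a tiny perturbation can be applied to a pivot before it is used as a divisor. It is a textbook right-looking variant written for illustration, not the KokkosBatched code; the name lu_nopivot_reference, the row-major storage, and the exact way tiny is added to the pivot (in the direction of its sign) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: in-place LU without pivoting.
// On return, the strict lower part of A holds L (unit diagonal implied)
// and the upper part, including the diagonal, holds U.
void lu_nopivot_reference(std::vector<double>& A, std::size_t m,
                          double tiny = 0.0) {
  assert(A.size() == m * m);
  for (std::size_t k = 0; k < m; ++k) {
    double& pivot = A[k * m + k];
    // Assumed static pivoting: nudge the pivot away from zero.
    if (tiny != 0.0) pivot += (pivot < 0 ? -tiny : tiny);
    for (std::size_t i = k + 1; i < m; ++i) {
      A[i * m + k] /= pivot;  // multiplier l(i, k)
      const double l = A[i * m + k];
      // Update the trailing row with the Schur complement.
      for (std::size_t j = k + 1; j < m; ++j) A[i * m + j] -= l * A[k * m + j];
    }
  }
}
```

With tiny equal to zero this is the exact factorization and fails on a zero pivot; with a nonzero tiny the result is the factorization of a slightly perturbed matrix, which is usually acceptable for a preconditioner.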

Headers

KokkosBatched_LU_Decl.hpp
KokkosBatched_LU_Serial_Impl.hpp
KokkosBatched_LU_Team_Impl.hpp

Trsm

/// [template]MemberType: member type derived from a team policy
/// [template]SideType: Side::Left or Side::Right
/// [template]UploType: Uplo::Upper or Uplo::Lower
/// [template]TransType: Trans::NoTranspose or Trans::Transpose
/// [template]DiagType: Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked, Blocked, CompatMKL
/// [in]alpha: a scalar value
/// [in]A: 2d view
/// [in/out]B: 2d view
int SerialTrsm<SideType,UploType,TransType,DiagType,AlgoType>
    ::invoke(const ScalarType alpha, const AViewType &A, const BViewType &B);
int TeamTrsm<MemberType,SideType,UploType,TransType,DiagType,AlgoType>
    ::invoke(const MemberType &member,
             const ScalarType alpha, const AViewType &A, const BViewType &B);

Performs a triangular solve with multiple right-hand sides. With Side::Left, it solves op(A) X = alpha B and B is overwritten by alpha op(A)^{-1} B. With Side::Right, it solves X op(A) = alpha B and B is overwritten by alpha B op(A)^{-1}.
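The difference between the two sides is easiest to see in code. The sketch below handles only a lower-triangular, non-transposed, non-unit A and overwrites B in place; both function names and the row-major storage are assumptions for this illustration and do not reflect how KokkosBatched implements Trsm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, Side::Left analogue: solve A X = alpha*B.
// A is m x m lower triangular, B is m x n, both row-major.
void trsm_left_lower_reference(double alpha, const std::vector<double>& A,
                               std::vector<double>& B, std::size_t m,
                               std::size_t n) {
  assert(A.size() == m * m && B.size() == m * n);
  for (std::size_t i = 0; i < m; ++i)      // rows of X, top to bottom
    for (std::size_t c = 0; c < n; ++c) {  // each right-hand side column
      double rhs = alpha * B[i * n + c];
      for (std::size_t k = 0; k < i; ++k) rhs -= A[i * m + k] * B[k * n + c];
      B[i * n + c] = rhs / A[i * m + i];
    }
}

// Hypothetical helper, Side::Right analogue: solve X A = alpha*B.
// A is n x n lower triangular, B is m x n, both row-major.
void trsm_right_lower_reference(double alpha, const std::vector<double>& A,
                                std::vector<double>& B, std::size_t m,
                                std::size_t n) {
  assert(A.size() == n * n && B.size() == m * n);
  for (std::size_t r = 0; r < m; ++r)      // each row of X is independent
    for (std::size_t j = n; j-- > 0;) {    // columns of X, last to first
      double rhs = alpha * B[r * n + j];
      for (std::size_t k = j + 1; k < n; ++k) rhs -= B[r * n + k] * A[k * n + j];
      B[r * n + j] = rhs / A[j * n + j];
    }
}
```

On the left, A must match the number of rows of B and the columns of B are solved independently; on the right, A must match the number of columns of B and the rows are solved independently.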

Headers

KokkosBatched_Trsm_Decl.hpp
KokkosBatched_Trsm_Serial_Impl.hpp
KokkosBatched_Trsm_Team_Impl.hpp

Trmm

/// [template]SideType: Side::Left or Side::Right
/// [template]UploType: Uplo::Upper or Uplo::Lower
/// [template]TransType: Trans::NoTranspose or Trans::Transpose
/// [template]DiagType: Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked
/// [in]alpha: a scalar value
/// [in]A: 2d view
/// [in/out]B: 2d view
int SerialTrmm<SideType,UploType,TransType,DiagType,AlgoType>
    ::invoke(const ScalarType alpha, const AViewType &A, const BViewType &B);

Performs a triangular matrix-matrix multiply. With Side::Left, it computes B = alpha * op(A) * B. With Side::Right, it computes B = alpha * B * op(A). op(A) is A, its transpose, or its conjugate transpose, as selected by TransType (the analogue of trans being "n", "t", or "c" in the BLAS interface, respectively).

Headers

KokkosBatched_Trmm_Decl.hpp
KokkosBatched_Trmm_Serial_Impl.hpp

Trtri

/// [template]UploType: Uplo::Upper or Uplo::Lower
/// [template]DiagType: Diag::Unit or Diag::NonUnit
/// [template]AlgoType: Unblocked
/// [in/out]A: 2d view
int SerialTrtri<UploType,DiagType,AlgoType>
    ::invoke(const AViewType &A);

Computes the inverse of the triangular matrix A in place: A = inv(A).

Headers

KokkosBatched_Trtri_Decl.hpp
KokkosBatched_Trtri_Serial_Impl.hpp

SVD (singular value decomposition)

Full decomposition

/// [template] AViewType
/// [template] UViewType
/// [template] VtViewType
/// [template] SViewType
/// [template] WViewType
/// [in] A: 2D view (m x n).
///      Will be overwritten by routine and contents after return are undefined.
/// [out] U: 2D view (m x m).
///       Will contain left singular vectors (in columns).
/// [out] s: 1D view (min(m, n)).
///       Will contain singular values.
/// [out] Vt: 2D view (n x n).
///       Will contain right singular vectors, transposed (in rows).
/// W: 1D work view (length max(m, n) or greater).
///      Must be contiguous. Contents undefined after return.
///      Unlike LAPACK, there's no faster path for when it's larger than the minimum.
int KokkosBatched::SerialSVD::invoke(
  KokkosBatched::SVD_USV_Tag,
  const AViewType &A,
  const UViewType &U,
  const SViewType &s,
  const VtViewType &Vt,
  const WViewType &W);

Computes the full singular value decomposition (SVD) of a general matrix A. On output, U * diag(s) * Vt equals the original A (A itself is overwritten). U and Vt are orthogonal. s contains the singular values, which are nonnegative and in descending order. Note that Vt (V^T) holds the transposed right singular vectors, as in LAPACK's dgesvd interface.

Singular values only

/// [template] AViewType
/// [template] SViewType
/// [template] WViewType
/// [in] A: 2D view (m x n).
///      Will be overwritten by routine and contents after return are undefined.
/// [out] s: 1D view (min(m, n)).
///       Will contain singular values.
/// W: 1D work view (length at least max(m, n)).
///       Must be contiguous. Contents undefined after return.
int KokkosBatched::SerialSVD::invoke(
  KokkosBatched::SVD_S_Tag,
  const AViewType &A,
  const SViewType &s,
  const WViewType &W);

Computes the singular value decomposition (SVD) of a general matrix A, producing just the singular values (not vectors). Otherwise, the same as the full version above.

Example usage

KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_USV_Tag(), A, U, sigma, Vt, work);
KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_S_Tag(), A, sigma, work);

Headers

KokkosBatched_SVD_Decl.hpp

SAND2021-11374 O

SIMD

DefaultVectorLength

/// [template]ValueType: float, double, Kokkos::complex<float> and Kokkos::complex<double>
/// [template]MemorySpace: Kokkos::HostSpace, Kokkos::CudaSpace and Kokkos::CudaUVMSpace
enum DefaultVectorLength<ValueType,MemorySpace>::value;

This enum provides a compile-time SIMD length that depends on the target architecture. For instance, when the compiler defines __AVX512F__ (AVX-512 enabled), the length is set to a) 16 for float, b) 8 for double, c) 8 for complex float, and d) 4 for complex double. With a CUDA memory space, the default length is set to 16 for all value types. As vectorization on a GPU is explicit and flexible, the user must determine the best SIMD length for each GPU.
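The AVX-512 lengths quoted above are consistent with dividing a 64-byte vector register by the size of the value type. The sketch below checks that arithmetic at compile time. It is only an observation about those numbers, not a description of how DefaultVectorLength is defined; lanes_per_register is a hypothetical helper introduced for this example, and the CUDA default of 16 does not follow this rule.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical helper (not part of KokkosBatched): how many values of
// type T fit in one vector register of RegisterBytes bytes.
template <typename T, std::size_t RegisterBytes>
struct lanes_per_register {
  static_assert(RegisterBytes % sizeof(T) == 0,
                "register size must be a multiple of the value size");
  static constexpr std::size_t value = RegisterBytes / sizeof(T);
};

// A 512-bit register holds 64 bytes.
static_assert(lanes_per_register<float, 64>::value == 16, "float");
static_assert(lanes_per_register<double, 64>::value == 8, "double");
static_assert(lanes_per_register<std::complex<float>, 64>::value == 8,
              "complex float");
static_assert(lanes_per_register<std::complex<double>, 64>::value == 4,
              "complex double");
```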

Vector

/// [template]T: value type
/// [template]l: SIMD length
struct Vector<SIMD<T>,l>

This struct holds a static array of values of a given size. The struct is specialized with a vector register type if the architecture supports the corresponding vector instructions. For more detailed information, please see the file KokkosBatched_Vector_SIMD.hpp. As the vector type and instructions depend on the hardware, a user may want to test this type on a new architecture. In that case, one can provide a new set of SIMD types by creating a different tag type in place of SIMD. The SIMD type comes with full arithmetic operator overloading (e.g., +, -, *, /) and basic math functions (e.g., sqrt, cbrt, log, exp, pow). For a full description, see KokkosBatched_Vector_SIMD_Arith.hpp and KokkosBatched_Vector_SIMD_Math.hpp.

One tricky part of writing a generic routine with this SIMD type is handling logical and comparison operators. Because a generic routine cannot use an if statement on a SIMD type (each array element may need to behave differently as a result of the comparison), we provide overloaded operators that return a Vector<SIMD<bool>,l>. The array of bools can then be used as a predicate in a sequence of operations. A simple example illustrates the usage.

/// We want to add a value only to the positive entries of the SIMD array
double alpha = some_value;
Vector<SIMD<double>,4> a = random_value(); // set random values
const auto is_positive = a > 0;            // construct a predicate
a += is_positive*alpha;                    // add alpha to the positive members of a
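To show why a predicate replaces the if statement, the following self-contained sketch reproduces the snippet above with a plain array type. Vec is a hypothetical stand-in for Vector<SIMD<T>,l> with no vector intrinsics, and the three operators are minimal versions written for this example only; they are not the KokkosBatched overloads.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Vector<SIMD<T>,l>: a fixed-size array of lanes.
template <typename T, std::size_t L>
struct Vec {
  T v[L];
};

// Lane-wise comparison against a scalar; the result is the predicate.
template <typename T, std::size_t L>
Vec<bool, L> operator>(const Vec<T, L>& a, T s) {
  Vec<bool, L> r{};
  for (std::size_t i = 0; i < L; ++i) r.v[i] = a.v[i] > s;
  return r;
}

// Predicate times scalar: alpha where the lane is true, zero elsewhere.
template <typename T, std::size_t L>
Vec<T, L> operator*(const Vec<bool, L>& m, T alpha) {
  Vec<T, L> r{};
  for (std::size_t i = 0; i < L; ++i) r.v[i] = m.v[i] ? alpha : T(0);
  return r;
}

// Lane-wise accumulation.
template <typename T, std::size_t L>
Vec<T, L>& operator+=(Vec<T, L>& a, const Vec<T, L>& b) {
  for (std::size_t i = 0; i < L; ++i) a.v[i] += b.v[i];
  return a;
}

// Same three steps as the snippet above, with no branch on a whole vector.
Vec<double, 4> add_to_positive(Vec<double, 4> a, double alpha) {
  const auto is_positive = a > 0.0;  // construct a predicate
  a += is_positive * alpha;          // add alpha only where a is positive
  return a;
}
```

Every lane executes the same sequence of operations; the predicate only changes the value that is added, which is what allows the code to map to vector instructions.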

For more complex use cases, we provide the following conditional assign statements and collective functions.

/// Performs conditional assignment: cond ? if_true_value : if_false_value
/// [template]T: value type
/// [template]l: SIMD length
/// [in]cond: predicate
/// [in]if_true_value: values to be assigned for a true bool in the predicate
/// [in]if_false_value: values to be assigned for a false bool in the predicate
Vector<SIMD<T>,l> conditional_assign(const Vector<SIMD<bool>,l> cond,
                                     const Vector<SIMD<T>,l> if_true_value,
                                     const Vector<SIMD<T>,l> if_false_value);
/// Performs reduction with a given binary operator
/// [template]T: value type
/// [template]BinaryOp: a binary functor providing func(a,b) interface
/// [in]val: input value
/// [in]func: binary functor
/// [return]: r_val = func(...func(func(val[0],val[1]),val[2])...,val[l-1])
T reduce(const Vector<SIMD<T>,l> val, const BinaryOp &func);

/// Returns true if all values of the input SIMD type are true
bool is_all_true(const Vector<SIMD<bool>,l> cond);

/// Returns true if any values of the input SIMD type are true
bool is_any_true(const Vector<SIMD<bool>,l> cond);

/// Returns minimum value of the input SIMD type
T min(const Vector<SIMD<T>,l> val);

/// Returns maximum value of the input SIMD type
T max(const Vector<SIMD<T>,l> val);

See more details in KokkosBatched_Vector_SIMD_Misc.hpp.
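As a rough illustration of how the conditional assignment and the collective functions fit together, the sketch below re-implements them for a plain array type and uses them to sum the non-negative lanes of a vector. Vec is a hypothetical stand-in for Vector<SIMD<T>,l>; the function bodies are written for this example (following the semantics documented above, including the left-to-right order of reduce) and are not the code in KokkosBatched_Vector_SIMD_Misc.hpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Vector<SIMD<T>,l>.
template <typename T, std::size_t L>
struct Vec {
  T v[L];
};

// Lane-wise cond ? if_true : if_false.
template <typename T, std::size_t L>
Vec<T, L> conditional_assign(const Vec<bool, L>& cond, const Vec<T, L>& if_true,
                             const Vec<T, L>& if_false) {
  Vec<T, L> r{};
  for (std::size_t i = 0; i < L; ++i)
    r.v[i] = cond.v[i] ? if_true.v[i] : if_false.v[i];
  return r;
}

// Left fold: func(...func(func(val[0], val[1]), val[2])..., val[L-1]).
template <typename T, std::size_t L, typename BinaryOp>
T reduce(const Vec<T, L>& val, const BinaryOp& func) {
  T r = val.v[0];
  for (std::size_t i = 1; i < L; ++i) r = func(r, val.v[i]);
  return r;
}

template <std::size_t L>
bool is_all_true(const Vec<bool, L>& cond) {
  for (std::size_t i = 0; i < L; ++i)
    if (!cond.v[i]) return false;
  return true;
}

template <std::size_t L>
bool is_any_true(const Vec<bool, L>& cond) {
  for (std::size_t i = 0; i < L; ++i)
    if (cond.v[i]) return true;
  return false;
}

// Example: replace negative lanes by zero, then sum all lanes.
double sum_of_nonnegative(const Vec<double, 4>& a) {
  Vec<bool, 4> is_negative{};
  for (std::size_t i = 0; i < 4; ++i) is_negative.v[i] = a.v[i] < 0.0;
  const Vec<double, 4> zero{};  // all lanes zero
  const Vec<double, 4> clamped = conditional_assign(is_negative, zero, a);
  return reduce(clamped, [](double x, double y) { return x + y; });
}
```

The collective functions return a single scalar or bool, so they are the natural place to leave SIMD code, for example to test a convergence flag across all lanes.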

Headers

KokkosBatched_Vector.hpp		     
KokkosBatched_Vector_SIMD_Arith.hpp   
KokkosBatched_Vector_SIMD.hpp	     
KokkosBatched_Vector_SIMD_Logical.hpp 
KokkosBatched_Vector_SIMD_Math.hpp    
KokkosBatched_Vector_SIMD_Misc.hpp    
KokkosBatched_Vector_SIMD_Relation.hpp

SAND2018-14051 W