Skip to content

Latest commit

 

History

History
645 lines (448 loc) · 29.3 KB

CHANGELOG.md

File metadata and controls

645 lines (448 loc) · 29.3 KB

Changelog for rocBLAS

rocBLAS documentation is available at https://rocm.docs.amd.com/projects/rocBLAS/en/latest/index.html.

(Unreleased) rocBLAS 4.4.0

Added

  • rocTX support in rocBLAS (not available on Windows or in the static library version on Linux)
  • On gfx12, all functions now support full rocblas_int dynamic range for batch_count
  • --ninja build option
  • Support for GPU_TARGETS cmake variable

Changed

  • rocblas-test client removes the stress tests unless YAML-based testing or gtest_filter adds them
  • rocblas clients OpenMP default threading is reduced to be less than the logical core count
  • gemm_ex testing and timing reuses device memory
  • gemm_ex timing initializes matrices on device

Optimized

  • Significantly reduced workspace memory requirements for Level 1 ILP64: iamax and iamin
  • Reduced workspace memory requirements for Level 1 ILP64: dot, asum, nrm2

Resolved issues

  • gfx12: ger, geam, geam_ex, dgmm, trmm, symm, hemm, ILP64 gemm, and larger data support
  • Added a gfortran package dependency for Azure Linux OS
  • Outdated SLES OS package dependencies (cxxtools and joblib) in install.sh -d
  • Code object stripping for RPM packages

Upcoming changes

  • Deprecated the cmake variable AMDGPU_TARGETS. Use GPU_TARGETS instead.

rocBLAS 4.3.0 for ROCm 6.3

Added

  • Level 3 and EX functions have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments

Changed

  • amdclang is used as the default compiler instead of hipcc
  • Internal performance scripts use amd-smi instead of the deprecated rocm-smi

Optimized

  • Improved performance of Level 2 gbmv
  • Improved performance of Level 2 gemv for float and double precisions for problem sizes (TransA == N && m==n && m % 128 == 0) measured on a gfx942 GPU

Resolved issues

  • Fixed stbsv_strided_batched_64 Fortran binding

Upcoming changes

  • rocblas_Xgemm_kernel_name APIs are deprecated

rocBLAS 4.2.1 for ROCm 6.2.1

Removals

  • Removed Device_Memory_Allocation.pdf link in documentation.

Resolved issues

  • Fixed error/warning message during rocblas_set_stream() call.

rocBLAS 4.2.0 for ROCm 6.2

Additions

  • Level 2 functions and level 3 trsm have additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments
  • Cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, axpy
  • Benchmark class for common timing code
  • An environment variable "ROCBLAS_DEFAULT_ATOMICS_MODE" to set default atomics mode during creation of 'rocblas_handle'
  • Extended dot_ex to support single-precision (fp32_r) input and double-precision (fp64_r) output and compute types

Optimizations

  • Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance enhanced by 6 times for bigger problem sizes measured on MI210 GPU

Changes

  • Linux AOCL dependency updated to release 4.2 gcc build
  • Windows vcpkg dependencies updated to release 2024.02.14
  • Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40

Deprecations

  • rocblas_gemm_ex3, gemm_batched_ex3 and gemm_strided_batched_ex3 are deprecated and will be removed in the next major release of rocBLAS. Please refer to hipBLASLt for future 8 bit float usage https://github.com/ROCm/hipBLASLt

rocBLAS 4.1.0 for ROCm 6.1

Additions

  • Level 1 and Level 1 Extension functions have additional ILP64 API for both C and Fortran (_64 name suffix) with int64_t function arguments
  • Cache flush timing for gemm_ex

Changes

  • Some Level 2 function argument names have changed m to n to match legacy BLAS; there is no change in implementation
  • Standardized the use of non-blocking streams for copying results from device to host

Fixes

  • Fixed host-pointer mode reductions for non-blocking streams

rocBLAS 4.0.0 for ROCm 6.0

Additions

  • Beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
  • Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched: rocblas_hshgemv_batched/strided_batched, rocblas_hssgemv_batched/strided_batched, rocblas_tstgemv_batched/strided_batched and rocblas_tssgemv_batched/strided_batched
  • Use of rocblas_status_excluded_from_build when calling functions that require Tensile (when using rocBLAS built without Tensile)
  • System for asynchronous kernel launches that set a rocblas_status failure based on a hipPeekAtLastError discrepancy

Optimizations

  • TRSM performance for small sizes (m < 32 && n < 32)

Deprecations

  • Atomic operations will be disabled by default in a future release of rocBLAS

Removals

  • rocblas_gemm_ext2 API function
  • In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and out-of-place trmm
  • int8x4 support is removed (int8 support is unchanged)
  • #define __STDC_WANT_IEC_60559_TYPES_EXT__ is removed from rocblas-types.h (if you want ISO/IEC TS 18661-3:2015 functionality, you must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h)
  • The default build removes device code for gfx803 architecture from the fat binary

Fixes

  • Made offset calculations for 64-bit rocBLAS functions safe
    • Fixes for very large leading dimension or increment potentially causing overflow:
      • Level2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, and tpsv
  • Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files, based on device architecture
  • Guards against no-op kernel launches that result in a potential hipGetLastError

Changes

  • Reduced the default verbosity of rocblas-test (you can see all tests by setting the GTEST_LISTENER=PASS_LINE_IN_LOG environment variable)

rocBLAS 3.1.0 for ROCm 5.7

Additions

  • YAML lock step argument scans for rocblas-bench and rocblas-test clients
  • rocblas-gemm-tune is used to find the best performing GEMM kernel for each set of GEMM problems

Fixes

  • Made offset calculations for 64-bit rocBLAS functions safe
    • Fixes for very large leading dimensions or increments potentially causing overflow:
      • Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin,and nrm2`
      • Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, and trsv
      • Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, and geam
      • General: set_vector, get_vector, set_matrix, and get_matrix
      • Related fixes: internal scalar loads with > 32-bit offsets
      • In-place functionality for all trtri sizes

Changes

  • Dot when using rocblas_pointer_mode_host is now synchronous in order to match legacy BLAS as it stores results in host memory
  • Enhanced reporting of installation issues caused by runtime libraries (Tensile)
  • Standardized internal rocBLAS C++ interface across most functions

Deprecations

  • __STDC_WANT_IEC_60559_TYPES_EXT__ define will be removed in a future release

Dependencies

  • Optional use of AOCL BLIS 4.0 on Linux for clients
  • Optional build tool-only dependency on Python psutil

rocBLAS 3.0.0 for ROCm 5.6

Optimizations

  • Level 2 rocBLAS GEMV performance on gfx90a GPU for non-transposed problems that have small matrices (m and n <= 32) and large batch counts (batch_count >= 256)
  • rocBLAS syr2k performance for single, double, and double-complex precision
  • rocBLAS her2k performance for double-complex precision
  • Improved performance for general sizes on gfx90a

Additions

  • bf16 inputs and f32 compute support to Level 1 rocBLAS extension functions: axpy_ex, scal_ex, and nrm2_ex

Deprecations

  • In-place trmm has been replaced with trmm that has in-place and out-of-place functionality
  • rocblas_query_int8_layout_flag()
  • rocblas_gemm_flags_pack_int8x4
  • rocblas_set_device_memory_size() will be replaced with rocblas_increase_device_memory_size()
  • rocblas_is_user_managing_device_memory()

Removals

  • is_complex helper: use rocblas_is_complex instead
  • The enum truncate_t: use rocblas_truncate_t instead
  • The value truncate: use rocblas_truncate instead
  • rocblas_set_int8_type_for_hipblas
  • rocblas_get_int8_type_for_hipblas

Dependencies

  • Python joblib build-only dependency (used in Tensile builds)

Fixes

  • Made 64-bit trsm offset calculations safe
  • CMake install fixed on some operating systems when using install.sh -d --cmake_install

Changes

  • Refactored ROTG test code

rocBLAS 2.47.0 for ROCm 5.5.0

Additions

  • rocblas_geam_ex functionality for matrix-matrix minimum operations
  • HIP Graph support (beta feature) for rocBLAS Level 1, Level 2, and Level 3 (pointer mode host) functions
  • Beta features API, exposed using compiler define ROCBLAS_BETA_FEATURES_API
  • Support for vector initialization in the rocBLAS test framework with negative increments
  • Windows build documentation for HIP SDK support
  • Scripts for plotting the performance of multiple functions

Optimizations

  • Performance improvements for Level 2 rocBLAS GEMV for float and double precision (150-200% improvement for certain problem sizes when (m==n) measured on a gfx90a GPU)
  • Performance improvements for Level 2 rocBLAS GER for float, double, and complex float precisions (5-7% improvement for certain problem sizes when measured on a gfx90a GPU)
  • Performance improvements for Level 2 rocBLAS SYMV for float and double precisions (120-150% improvement for certain problem sizes measured on both gfx908 and gfx90a GPUs)

Fixes

  • Executable mode setting on rocblas_gentest.py client script to avoid potential permission errors with clients rocblas-test and rocblas-bench
  • Deprecated API compatibility with Visual Studio compiler
  • Test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory

Changes

  • install.sh internally runs rmake.py (also used on Windows) and rmake.py can be used directly on Linux (use --help)
  • rocBLAS client executables all now begin with the rocblas- prefix

Removals

  • install.sh no longer has the options -o --cov because Tensile will now use the default COV format, which is set by cmake define Tensile_CODE_OBJECT_VERSION=default

rocBLAS 2.46.0 for ROCm 5.4.0

Additions

  • client smoke test dataset added for quick validation using command rocblas-test --yaml rocblas_smoke.yaml
  • Added stream order device memory allocation as a non-default beta option.

Optimized

  • Improved trsm performance for small sizes by using a substitution method technique
  • Improved syr2k and her2k performance significantly by using a block-recursive algorithm

Changes

  • Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behavior.
  • Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm
  • Allow hipBLAS to select int8 datatype
  • Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace

Fixes

  • Fortran interfaces generalized for Fortran compilers other than GFortran
  • fix for trsm_strided_batched rocblas-bench performance gathering
  • Fix for rocm-smi path in commandrunner.py script to match ROCm 5.2 and above

rocBLAS 2.45.0 for ROCm 5.3.0

Additions

  • install.sh option --upgrade_tensile_venv_pip to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is TENSILE_VENV_UPGRADE_PIP
  • install.sh option --relocatable or -r adds rpath and removes ldconf entry on rocBLAS build
  • install.sh option --lazy-library-loading to enable on-demand loading of tensile library files at runtime to speedup rocBLAS initialization
  • Support for RHEL9 and CS9
  • Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value

Optimizations

  • trmm_outofplace performance improvements for all sizes and data types using block-recursive algorithm
  • herkx performance improvements for all sizes and data types using block-recursive algorithm
  • syrk/herk performance improvements by utilising optimised syrkx/herkx code
  • symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm

Changes

  • Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now.
  • Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour
  • Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset
  • Add quick-return for internal trmm and gemm template functions
  • Moved function block sizes to a shared header file
  • Level 1, 2, and 3 functions use rocblas_stride datatype for offset
  • Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions
  • Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure
  • Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions

Fixes

  • Improved logic to #include vs <experimental/filesystem>
  • install.sh -s option to build rocblas as a static library.
  • dot function now sets the device results asynchronously for N <= 0

Deprecations

  • is_complex helper is now deprecated. Use rocblas_is_complex instead
  • The enum truncate_t and the value truncate is now deprecated and will removed from the ROCm release 6.0. It is replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate could be used from this ROCm release for an easy transition

Removals

  • install.sh options --hip-clang, --no-hip-clang, --merge-files, --no-merge-files are removed

rocBLAS 2.44.0 for ROCm 5.2.0

Additions

  • Packages for test and benchmark executables on all supported operating systems using CPack
  • Added denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions
  • Added denormal number detection to the numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions
  • Added NaN initialization tests to the YAML files of Level 2 rocBLAS batched and strided-batched functions for testing purposes
  • Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests

Optimizations

  • Improved performance of non-batched and batched her2 for all sizes and data types
  • Improved performance of non-batched and batched amin for all data types using shuffle reductions
  • Improved performance of non-batched and batched amax for all data types using shuffle reductions
  • Improved performance of trsv for all sizes and data types

Changes

  • Modifying gemm_ex for HBH (high-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16
  • Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions
  • For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset
  • For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset
  • AMD copyright year for all rocBLAS files
  • For gemv (transpose-case), typecasted the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions

Fixes

  • For function her2 avoid overflow in offset calculation
  • For trsm when alpha == 0 and on host, allow A to be nullptr
  • Fixed memory access issue in trsv
  • Fixed git pre-commit script to update only AMD copyright year
  • Fixed dgmm, geam test functions to set correct stride values
  • For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose
  • Fixed compilation error for clients-only build

Removals

  • Remove Navi12 (gfx1011) from fat binary

rocBLAS 2.43.0 for ROCm 5.1.0

Additions

  • Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs)
  • Option to install script to build clients without using any Fortran (--clients_no_fortran)
  • rocblas_client_initialize function, to perform rocBLAS initialize for clients(benchmark/test) and report the execution time.
  • Added tests for output of reduction functions when given bad input
  • Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench

Optimizations

  • Improved performance of trsm with side == left and n == 1
  • Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32

Changes

  • For syrkx and trmm internal API use rocblas_stride datatype for offset
  • For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match
  • Test client dependencies updated to GTest 1.11
  • non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives
  • Help menu messages in install.sh
  • For ger function, typecast the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions
  • Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench

Fixes

  • For function trmv (non-transposed cases) avoid overflow in offset calculation
  • Fixed cppcheck errors/warnings
  • Fixed Doxygen warnings

rocBLAS 2.42.0 for ROCm 5.0.0

Additions

  • Added rocblas_get_version_string_size convenience function
  • Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm
  • Added hpl and trig initialization for gemm_ex to rocblas-bench
  • Added source code gemm. It can be used as an alternative to Tensile for debugging and development
  • Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex

Optimizations

  • Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.
  • Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.

Changes

  • Instantiate templated rocBLAS functions to reduce size of librocblas.so
  • Removed static library dependency on msgpack
  • Removed boost dependencies for clients

Fixes

  • Option to install script to build only rocBLAS clients with a pre-built rocBLAS library
  • Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input
  • Fix for dgmm with side == rocblas_side_left and a negative incx
  • Fixed out-of-bounds read for small trsm
  • Fixed numerical checking for tbmv_strided_batched

rocBLAS 2.41.0 for ROCm 4.5.0

Optimizations

  • Improved performance of non-batched and batched syr for all sizes and data types
  • Improved performance of non-batched and batched hemv for all sizes and data types
  • Improved performance of non-batched and batched symv for all sizes and data types
  • Improved memory utilization in rocblas-bench, rocblas-test gemm functions, increasing possible runtime sizes.
  • Improved performance of non-batched and batched dot, dotc, and dot_ex for small n. e.g. sdot n <= 31000.
  • Improved performance of non-batched and batched trmv for all sizes and matrix types.
  • Improved performance of non-batched and batched gemv transpose case for all sizes and datatypes.
  • Improved performance of sger and dger for all sizes, in particular the larger dger sizes.
  • Improved performance of syrkx for for large size including those in rocBLAS Issue #1184.

Changes

  • Update from C++14 to C++17.
  • Packaging split into a runtime package (called rocblas) and a development package (called rocblas-dev for .deb packages, and rocblas-devel for .rpm packages). The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future rocm release.

Fixes

  • For function geam avoid overflow in offset calculation.
  • For function syr avoid overflow in offset calculation.
  • For function gemv (Transpose-case) avoid overflow in offset calculation.
  • For functions ssyrk and dsyrk, allow conjugate-transpose case to match legacy BLAS. Behavior is the same as the transpose case.

rocBLAS 2.39.0 for ROCm 4.3.0

Optimizations

  • Improved performance of non-batched and batched rocblas_Xgemv for gfx908 when m <= 15000 and n <= 15000
  • Improved performance of non-batched and batched rocblas_sgemv and rocblas_dgemv for gfx906 when m <= 6000 and n <= 6000
  • Improved the overall performance of non-batched and batched rocblas_cgemv for gfx906
  • Improved the overall performance of rocblas_Xtrsv

Changes

  • Internal use only APIs prefixed with rocblas_internal_ and deprecated to discourage use

rocBLAS 2.38.0 for ROCm 4.2.0

Additions

  • Added option to install script to build only rocBLAS clients with a pre-built rocBLAS library
  • Supported gemm ext for unpacked int8 input layout on gfx908 GPUs
    • Added new flags rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4 to specify if using the packed layout
      • Set the rocblas_gemm_flags_pack_int8x4 when using packed int8x4, this should be always set on GPUs before gfx908.
      • For gfx908 GPUs, unpacked int8 is supported so no need to set this flag.
      • Notice the default flags 0 uses unpacked int8, this somehow changes the behaviour of int8 gemm from ROCm 4.1.0
  • Added a query function rocblas_query_int8_layout_flag to get the preferable layout of int8 for gemm by device

Optimizations

  • Improved performance of single precision copy, swap, and scal when incx == 1 and incy == 1
  • Improved performance of single precision axpy when incx == 1, incy == 1 and batch_count =< 8192
  • Improved performance of trmm

Changes

  • Change cmake_minimum_required to VERSION 3.16.8

rocBLAS 2.36.0 for ROCm 4.1.0

Additions

  • Added Numerical checking helper function to detect zero/NaN/Inf in the input and the output vectors of rocBLAS level 1 and 2 functions
  • Added Numerical checking helper function to detect zero/NaN/Inf in the input and the output general matrices of rocBLAS level 2 and 3 functions

Fixes

  • Fixed complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
  • Make functions compliant with Legacy Blas for special values alpha == 0, k == 0, beta == 1, beta == 0

Optimizations

  • Improved performance of single precision axpy_batched and axpy_strided_batched: batch_count >= 8192
  • Improved performance of trmm.

rocBLAS 2.34.0 for ROCm 4.0.0

Additions

  • Add changelog.
  • Improved performance of gemm_batched for small m, n, k and NT, NC, TN, TT, TC, CN, CT, CC
  • Improved performance of gemv, gemv_batched, gemv_strided_batched: small n large m
  • Removed support for legacy hcc compiler
  • Add rot_ex, rot_batched_ex, and rot_strided_batched_ex

Fixes

  • Removed -DUSE_TENSILE_HOST from roc::rocblas CMake usage requirements. This is a rocblas internal variable, and does not need to be defined in user code

rocBLAS 2.32.0 for ROCm 3.10.0

Additions

  • Improved performance of gemm_batched for NN, general m, n, k, small m, n, k

rocBLAS 2.30.0 for ROCm 3.9.0

Additions

  • Slight improvements to FP16 Megatron BERT performance on MI50
  • Improvements to FP16 Transformer performance on MI50
  • Slight improvements to FP32 Transformer performance on MI50
  • Improvements to FP32 DLRM Terabyte performance on gfx908

rocBLAS 2.28.0 for ROCm 3.8.0

Additions

  • added two functions:

    • rocblas_status rocblas_set_atomics_mode(rocblas_atomics_mode mode)
    • rocblas_status rocblas_get_atomics_mode(rocblas_atomics_mode mode)
  • added enum rocblas_atomics_mode. It can have two values

    • rocblas_atomics_allowed
    • rocblas_atomics_not_allowed The default is rocblas_atomics_not_allowed
  • function rocblas_Xdgmm algorithm corrected and incx=0 support added

  • dependencies:

    • rocblas-tensile internal component requires msgpack instead of LLVM
  • Moved the following files from /opt/rocm/include to /opt/rocm/include/internal:

    • rocblas-auxillary.h
    • rocblas-complex-types.h
    • rocblas-functions.h
    • rocblas-types.h
    • rocblas-version.h
    • rocblas_bfloat16.h

    These files should NOT be included directly as this may lead to errors. Instead, /opt/rocm/include/rocblas.h should be included directly. /opt/rocm/include/rocblas_module.f90 can also be directly used

rocBLAS 2.26.0 for ROCm 3.7.0

Additions

  • Improvements to rocblas_Xgemm_batched performance for small m, n, k
  • Improvements to rocblas_Xgemv_batched and rocblas_Xgemv_strided_batched performance for small m (QMCPACK use)
  • Improvements to rocblas_Xdot (batched and non-batched) performance when both incx and incy are 1
  • Improvements to FP32 ONNX BERT performance for MI50
  • Significant improvements to FP32 Resnext, Inception Convolution performance for gfx908
  • Slight improvements to FP32 DLRM Terabyte performance for gfx908
  • Significant improvements to FP32 BDAS performance for gfx908
  • Significant improvements to FP32 BDAS performance for MI50 and MI60
  • Added substitution method for small trsm sizes with m <= 64 && n <= 64. Increases performance drastically for small batched trsm

rocBLAS 2.24.0 for ROCm 3.6.0

Additions

  • Improvements to User Guide and Design Document
  • L1 dot function optimized to utilize shuffle instructions (improvements on bf16, f16, f32 data types)
  • L1 dot function added x dot x optimized kernel
  • Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
  • Adjustments for hipcc (hip-clang) compiler as standard build compiler and Centos8 support
  • Added Fortran interface for all rocBLAS functions

rocBLAS 2.22.0 for ROCm 3.5.0

Additions

  • add geam complex, geam_batched, and geam_strided_batched
  • add dgmm, dgmm_batched, and dgmm_strided_batched
  • Optimized performance
    • ger
      • rocblas_sger, rocblas_dger
      • rocblas_sger_batched, rocblas_dger_batched
      • rocblas_sger_strided_batched, rocblas_dger_strided_batched
    • geru
      • rocblas_cgeru, rocblas_zgeru
      • rocblas_cgeru_batched, rocblas_zgeru_batched
      • rocblas_cgeru_strided_batched, rocblas_zgeru_strided_batched
    • gerc
      • rocblas_cgerc, rocblas_zgerc
      • rocblas_cgerc_batched, rocblas_zgerc_batched
      • rocblas_cgerc_strided_batched, rocblas_zgerc_strided_batched
    • symv
      • rocblas_ssymv, rocblas_dsymv, rocblas_csymv, rocblas_zsymv
      • rocblas_ssymv_batched, rocblas_dsymv_batched, rocblas_csymv_batched, rocblas_zsymv_batched
      • rocblas_ssymv_strided_batched, rocblas_dsymv_strided_batched, rocblas_csymv_strided_batched, rocblas_zsymv_strided_batched
    • sbmv
      • rocblas_ssbmv, rocblas_dsbmv
      • rocblas_ssbmv_batched, rocblas_dsbmv_batched
      • rocblas_ssbmv_strided_batched, rocblas_dsbmv_strided_batched
    • spmv
      • rocblas_sspmv, rocblas_dspmv
      • rocblas_sspmv_batched, rocblas_dspmv_batched
      • rocblas_sspmv_strided_batched, rocblas_dspmv_strided_batched
  • improved documentation.
  • Fix argument checking in functions to match legacy BLAS.
  • Fixed conjugate-transpose version of geam.

Known Issues

  • Compilation for GPU Targets: When using the install.sh script for "all" GPU Targets, which is the default, you must first set an environment variable HCC_AMDGPU_TARGET listing the GPU targets, e.g. HCC_AMDGPU_TARGET=gfx803,gfx900,gfx906,gfx908 If building for a specific architecture(s) using the -a | --architecture flag, you should also set the environment variable HCC_AMDGPU_TARGET to match. Mismatching the environment variable to the -a flag architectures creates builds that may result in SEGFAULTS when running on GPUs which weren't specified.