Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic device and initial support in portBLAS #566

Merged
merged 10 commits into from
Sep 30, 2024

Conversation

s-Nick
Copy link
Contributor

@s-Nick s-Nick commented Sep 5, 2024

Description

Checklist

All Submissions

  • Do all unit tests pass locally? Attach a log.
    • I don't have a non Intel/NVIDIA/AMD device to test with. Tests are successful when forcing Intel devices to go through generic_device path.

Following logs show that all domains still work properly:
arc_mkl_BLAS_log.txt
arc_mkl_DFT_log.txt
arc_mkl_lapack_log.txt
arc_mkl_rng_log.txt

  • Have you formatted the code using clang-format?

New features

  • Have you provided motivation for adding a new feature?
  • Have you added relevant tests?
    The generic device doesn't requires new tests.

hjabird and others added 6 commits August 30, 2024 14:45
* oneMKL Interfaces currently only supports known targets: Intel CPU/GPU, AMD GPU, Nvidia GPU
* This PR:
  * Enables a new generic target
  * Enables the generic target to use the portBLAS backend
  * Adds documentation
possible devices.

This commit remove the option ENABLE_GENERIC_DEVICE and instead add generic_device
to the backends_table. The check for unsupported_device exception is moved to
table_initializer and to keep it as informative as it is, it is required a
change to the function_tables operator[]. This change allows to use
portBLAS (and in a possible future) all "port" libraries with  any
device supported.
This patch enables the possibility to run tests with generic_device for
devices that have an OpenCL backend.
@s-Nick s-Nick requested a review from Rbiessy September 5, 2024 09:00
Moved pragma and simplified if-statement to increase code readability
Copy link
Contributor

@Rbiessy Rbiessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good to me, thanks!

@s-Nick s-Nick marked this pull request as ready for review September 6, 2024 12:35
@s-Nick
Copy link
Contributor Author

s-Nick commented Sep 6, 2024

@oneapi-src/onemkl-maintain Hi all, in this PR I am adding a new generic device to allow portable libraries such as portBLAS and portFFT to be compiled and run on devices that are not known to us right now, but that have an OpenCL backend available.

Changes are few and simple, but since I am making an important one to table_initializer and it touches all domains, I would like to have your approval on it. These changes don't cause any issue to current backends as you can check from logs attached.

@Rbiessy Rbiessy requested a review from a team September 6, 2024 12:55
Copy link
Contributor

@andrewtbarker andrewtbarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a useful change, I have just a couple questions.

"OpenCL") != std::string::npos)
unsigned int vendor_id =
static_cast<unsigned int>(dev.get_info<sycl::info::device::vendor_id>());
/* Do not test for OpenCL backend on Intel GPU */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this logic continue to do what we want on AMD and nVidia GPUs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review @andrewtbarker
As far as I know NVIDIA and AMD GPUs don't use OpenCL backends, so I expect they continue to work as intended. In any case, I am happy to add a condition to check if the vendor is one of the currently known one and skip the OpenCL backend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this logic continue to do what we want on AMD and nVidia GPUs?

What is the behavior that we want on AMD and Nvidia exactly? We've been wondering whether we should remove this check entirely since it's not clear why would we want to skip OpenCL backends. To me users should use ONEAPI_DEVICE_SELECTOR instead.

@@ -59,14 +59,20 @@ class table_initializer {
using dlhandle = std::unique_ptr<LIB_TYPE, handle_deleter>;

public:
function_table_t &operator[](oneapi::mkl::device key) {
auto lib = tables.find(key);
function_table_t &operator[](std::pair<oneapi::mkl::device, sycl::queue> device_queue_pair) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why this change is necessary. Where do we need the sycl::queue in the lookup tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I moved the unsupported_device exception in the table initializer, I need the queue to query the device name. This is the only reason why I need the sycl::queue here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. It seems like a large change for just the exception - I would support reverting the function table and throwing a more generic exception (or maybe implement mkl::unsupported_device with no device argument?). But it is also okay as is.

Previously if some specific backend where enabled the test suite always added a
cpu to the device to run test on, even if another if condition already
added them. This behaviour cause linking time issue if a cpu device is
not available.
This commit removes it and it adds missing pragma to
the other device selection, fixing the linking issue.
@al3x-jp
Copy link

al3x-jp commented Sep 16, 2024

Hi, I got the following error when trying to build this checkout:

[ 51%] Linking CXX executable ../../bin/test_main_blas_ct
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): No device of requested type 'info::device_type::cpu' available. -1 (PI_ERROR_DEVICE_NOT_FOUND)
CMake Error at /usr/share/cmake-3.22/Modules/GoogleTestAddTests.cmake:83 (message):
Error running test executable.

Path: '/<path>/oneMKL/build/bin/test_main_blas_ct'
Result: Subprocess aborted
Output:

Call Stack (most recent call first):
/usr/share/cmake-3.22/Modules/GoogleTestAddTests.cmake:179 (gtest_discover_tests_impl)

gmake[2]: *** [tests/unit_tests/CMakeFiles/test_main_blas_ct.dir/build.make:388: bin/test_main_blas_ct] Error 1

Does the build require a SYCL CPU target being available?
Thanks

@s-Nick
Copy link
Contributor Author

s-Nick commented Sep 16, 2024

Hi @al3x-jp, thank you for highlighting this issue here.
We were able to reproduce the same linking error using the develop branch and an open source DPC++ build.
In any case, commit 29e94fc addresses what we think caused it and on our machine and everything works fine. Could you please test again this PR including this last commit? Your feedback would be very helpful for us.

Copy link
Contributor

@lhuot lhuot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@s-Nick s-Nick merged commit afb9d5c into uxlfoundation:develop Sep 30, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants