Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add a conditional definition of AccThreadSeq for the Omp4 backend #157

Closed
wants to merge 1 commit into from

Conversation

sbastrakov
Copy link
Member

This is now consistent with the Omp2Blocks backend, fixes #156.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 19, 2020

This doesn't harm, but it doesn't help my test case either.

Before:

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 123.52 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 42.24 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 52493 us

After:

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128.16 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 42.3 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 52445.2 us

While using alpaka directly:

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 132.66 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 41.14 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 391.15 us

@sbastrakov
Copy link
Member Author

Thanks for testing @fwyzard .
@psychocoderHPC do you have an idea? I feel the change fixes an actual shortcoming, but perhaps this is just another thing unrelated to the issue.

@sbastrakov
Copy link
Member Author

Ah, the IsThreadSeqAcc trait was also not specialized for Omp4, will add it to the PR.

…_KERNEL_OPTI

This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
@sbastrakov
Copy link
Member Author

Updated the PR, maybe now it works.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 19, 2020

It does:

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 137.34 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 42.27 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 354.39 us

Thank you.

@psychocoderHPC
Copy link
Member

I will check this PR soon. I need to check if swapping is allowed for the OMP4 backend. Swapping block size and elements is more than less a workaround to be allowed to use alpaka backends which restrict the block to one thread. The OMP4 backend do not have this restriction.
Currently the swapping is only providing compatibility and not aims to increase the performance.
In general there should be no argument to use this feature to deliver better performance.

@fwyzard
Copy link
Contributor

fwyzard commented Feb 19, 2020

I see... then I guess there is an underlying problem in the OpenMP 4 backend in alpaka, that leads to very poor performance.

@sbastrakov sbastrakov changed the title Add a conditional definition of AccThreadSeq for the Omp4 backend [WIP] Add a conditional definition of AccThreadSeq for the Omp4 backend Feb 20, 2020
@sbastrakov
Copy link
Member Author

sbastrakov commented Feb 20, 2020

I will close the PR for now as, although it provided a workaround for the issue initiating it, there are other ways to achieve it and the PR does not align well with the behavior for other accelerators. The reasoning is given in the discussion of #156.

@sbastrakov sbastrakov closed this Feb 20, 2020
@sbastrakov sbastrakov removed the bug label Feb 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

threads vs elements when using the OpenMP 4.0 backend
3 participants