improve OpenMP usage #2
Conversation
1) Reduced the number of fork-joins per iteration. 'omp parallel for' does a fork-join, which can get expensive at large thread counts. When this construct is used many times in a function, it should be replaced with a single 'omp parallel' around multiple 'omp for'. The code previously found between parallel regions is assumed to require serialization and uses 'pragma omp single' for protection. 'single' is used instead of 'master' to allow the first encountering thread in the team to do the work, rather than waiting for the master thread. Technically, but never in practice, 'single' requires MPI_THREAD_SERIALIZED instead of MPI_THREAD_FUNNELED; 'master' only requires MPI_THREAD_FUNNELED. It is possible that 'single nowait' is sufficient, in which case a few barriers can be eliminated. (Aside: 'master' does not imply a barrier.)
2) 'pragma omp simd' wherever 'pragma ivdep' is used. The OpenMP standard defines 'pragma omp simd' with semantics identical to the conventional meaning of the non-standard 'pragma ivdep'. The Intel compiler treats 'pragma omp simd' as an assertion rather than a hint, so if SIMD isn't appropriate, this pragma should be conditionalized using the preprocessor (C99/C++11 '_Pragma' being the O(1) solution here).
I don't know what sort of QA is required to be sure the changes are correct. I'll be happy to run whatever you require.
Jeff,
Have you run Intel Thread Inspector on the code to be sure there are no race conditions? Pushing the parallel region up higher makes it more likely that a race condition occurs.
No, I haven't used that tool before. The side-by-side changes were simple enough that I felt I could reason about the race conditions from the OpenMP semantics. I'll figure out Intel Thread Checker and try that. I hope it is not making any assumptions about x86 consistency in its analysis...
Jeff - For QA I typically run the five small test problems (nohsmall, noh, sedovsmall, sedov, leblanc) and verify that the outputs match the gold standard to within roundoff error. If you could do that, that would be great; or I can do it next week (I'm offsite this week and am not set up to run remotely).
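That procedure could be scripted roughly as follows (a sketch only: the binary location, input-file layout, and gold-standard directory are assumptions, and a plain diff would need to be replaced by a numeric comparison to tolerate roundoff-level differences):

```shell
#!/bin/sh
# Assumed build location; override with PENNANT=/path/to/pennant
PENNANT=${PENNANT:-./build/pennant}

for t in nohsmall noh sedovsmall sedov leblanc; do
    if [ -x "$PENNANT" ]; then
        # Input naming follows the ${TEST_NAME}.in convention shown below.
        "$PENNANT" "$t.in" > "$t.out"
        # gold/ is an assumed directory of reference outputs; a numeric
        # diff tool would be needed to ignore roundoff error.
        diff "$t.out" "gold/$t.out" && echo "$t: PASS"
    else
        echo "$t: SKIPPED (no pennant binary found)"
    fi
done
```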
You can run a GUI version or the command-line version of the Intel Thread Inspector. From our test script:
inspxe-cl -collect=ti2 -result-dir ${TEST_NAME} -- ${executable} ${TEST_NAME}.in >& ${TEST_NAME}_openmp.out || true
inspxe-cl -report problems -result-dir ${TEST_NAME}
The GUI is launched with inspxe-gui and is a little more intuitive to use.
You need to use the Intel compiler and the Intel tools (intel-performance-tools/2017.1.024).
It will flag race conditions that might not appear until later.