Draft: Templated MD arrays #39
base: main
Conversation
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
// set_value end-case
template <std::size_t Ind, typename NX>
constexpr void set_value(narr &ndat, NX value) {
  ndat[Ind] = value;
}
```
Is `operator[]` of `std::array` marked `constexpr`? I thought this was something that prevented more widespread usage.
Sorry, I intended to replace all of these with the appropriate `PORTABLE_` macros but missed a few.
Oh, you probably meant something else. We can use a raw array if `std::array` is messing with things.
We should check that it runs on device, but I thought `std::array` worked with `--expt-relaxed-constexpr`. I'm not sure, though.
@mauneyc-LANL you may be able to use `std::get<Ind>(arr) = value;`. But we really need to check and make sure it works on device.
Using `--expt-relaxed-constexpr`, this runs smoothly on the volta-x86 partition with `kokkos+cuda cuda_arch=70 +wrapper` and `cmake .. -DPORTS_OF_CALL_BUILD_TESTING -DPORTABILITY_STRATEGY_KOKKOS=ON`.
ports-of-call/portable_arrays.hpp (Outdated)
```diff
 PORTABLE_FORCEINLINE_FUNCTION int GetSize() const {
-  return nx1_ * nx2_ * nx3_ * nx4_ * nx5_ * nx6_;
+  return std::accumulate(nxs_.cbegin(), nxs_.cend(), 1,
```
Have you tested this on GPUs? I know Brendan had an issue and we had to write a custom accumulate. One could still use binary operators provided by the STL, but the accumulate itself wouldn't run. Perhaps we did something wrong, but I want to make sure this is still GPU callable.
I would also be concerned about `std::accumulate`. Might be better to hardcode this one, or write a recursive thing ourselves.
It "worked" but gave warnings (even with `--expt-relaxed-constexpr`). I rewrote it as an explicit loop.
This is a nice cleanup. Before merging, I'd like to:
a) Get a sense of compile time differences
b) Know for sure it all works on device
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
// maximum number of dimensions
constexpr std::size_t MAXDIM = 6;
// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;
```
I don't love that this is lower-case. I'd find this type easier to interpret if it were, e.g.,

```diff
-using narr = std::array<std::size_t, MAXDIM>;
+using Narr_t = std::array<std::size_t, MAXDIM>;
```
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
// compute_index base case, i.e. fastest moving index
template <std::size_t Ind>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
                                              const size_t index) {
  return index;
}

// compute_index general case, computing slower moving index strides
template <std::size_t Ind, typename... Tail>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
                                              const size_t index,
                                              const Tail... tail) {
  return index * nd[Ind] + compute_index<Ind + 1>(nd, tail...);
}
```
This is nice---assuming the array on device issues work.
Tests complete on volta-x86, though we need to put together some more tests for ports-of-call.
```cpp
// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;

namespace detail {
```
It doesn't matter very much, but why `detail` over `impl`?
Just habit
```cmake
target_compile_options(${POCLIB}
  INTERFACE
  $<${with_cxx}:$<${ps_kokkos}:--expt-relaxed-constexpr>>)
```
In the event of +kokkos~cuda, wouldn't this result in a compile error?
Looks like it, but POC doesn't have a `USE_CUDA` or similar like `singularity-eos` does, so I need to probe which compiler is being used.
This is how I've done it in other projects:

```cmake
get_target_property(kokkos_link_libs Kokkos::kokkoscore
  INTERFACE_LINK_LIBRARIES)
string(REGEX MATCH "CUDA"
  kokkos_has_cuda_libs "${kokkos_link_libs}")
# ...
if(kokkos_has_cuda_libs)
  # do cuda stuff...
endif()
```
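Another option, assuming a Kokkos installation whose package config exports `Kokkos_ENABLE_CUDA` (recent Kokkos versions do), which avoids string-matching on link libraries:

```cmake
find_package(Kokkos REQUIRED)
if(Kokkos_ENABLE_CUDA)
  target_compile_options(${POCLIB}
    INTERFACE
    $<$<COMPILE_LANGUAGE:CXX>:--expt-relaxed-constexpr>)
endif()
```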
I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done; it just saved time to get a working code). One question to ask is how much time should
The options are:
I don't know the answer to this a priori, and I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?
Worth considering in that context: I've included a simple test with the PR, as originally there wasn't very much being tested directly.
Yeah, fair questions:
- No, almost never.
- Also no.

Like you said, 48 bytes is basically nothing, so I lean towards (2).
It seems we previously recalculated at every access. The number of adds will be the same, and the number of multiplies saved will only be noticeable with high-rank arrays. Performance data for ranks 2-6 would be helpful in this case.
The more I think about this, the more I think we should recompute the strides. Even so, I'd like to see performance differences. Integer operations can be noticeable on GPUs, so precomputed strides may be more performant, but a larger object may use more registers. There are a lot of competing factors, so data is the best guide.
Now that the singularity-eos 1.9.0 release has been cut, I think we should move up to C++17 and get this merged in prep for the next release.
I am in favor of moving to C++17 at this time. However, I would prefer to decouple the shift to 17 from this MR. Let's just update the cmake builds to require 17 and then merge other MRs when they are ready.
@jonahm-LANL @dholladay00 I've pushed some benchmarks using … I can take off …
I like this---it's a significant cleanup from the original implementation, which was quite low-level. I want @dholladay00's build-system concerns addressed before merge. Also, just to confirm, this doesn't change the API or functionality at all, right? It's just more general.
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
  return r;
}
PORTABLE_FORCEINLINE_FUNCTION
decltype(auto) vp_prod() {
```
Is the `decltype` needed? Why not just an `auto` return value?
Eh, I'm returning a generic lambda and that's my habit. But it's not necessary here (though I don't think it's harmful either).
I don't think the `decltype` adds anything and it's a bit harder to read, but I won't push back too much on this.
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
PORTABLE_FORCEINLINE_FUNCTION auto make_nxs_array(NX... nxs) {
  std::array<std::size_t, MAXDIM> a;
  std::array<std::size_t, N> t{static_cast<std::size_t>(nxs)...};
  for (auto i = 0; i < N; ++i) {
```
I'm gonna insist this be `int` or `size_t`. IMO this is not a style thing but a self-documenting thing: it's an index, not a double.
If C++17 gets in before this is merged, then I do want to do another draft - it will make things cleaner and clearer.
@mauneyc-LANL let us know when this is ready for re-review.
PR Summary
This is a prototype for "templated MD", that is, the reduction of code of the type … to …
The single-value dimension variables `int nxN_` have been replaced with a statically sized container `std::array<size_t, MAXDIM> nxs_`. Also added a `PortableMDArray::rank_` member to avoid some reevaluations, though I'm not sure it's necessary.

Currently WIP, so documentation is wanting and a few function names/namespaces are silly/inconsistent/gibberish. I know there can be some reluctance about this coding style, so I wanted to get an "at least compiles" version up and gauge interest in iterating on it. (Actually passes a few tests! But I haven't tried `spiner` with this yet.)

It was working fine, why change it?

What are the downsides?
We are limited to C++14 (no `std::apply()`, no `std::tuple`, no fold expressions). C++14 can be "recursion heavy" in this regard, which may degrade performance.

Any suggestions, comments, questions welcome! @jonahm-LANL @chadmeyer @dholladay00 @jhp-lanl @jdolence
PR Checklist