Draft: Templated MD arrays #39
base: main
Conversation
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
// set_value end-case
template <std::size_t Ind, typename NX>
constexpr void set_value(narr &ndat, NX value) {
  ndat[Ind] = value;
}
```
Is `operator[]` of `std::array` marked `constexpr`? I thought this was something that prevented more widespread usage.
Sorry, I intended to replace all of these with the appropriate `PORTABLE_` macros but missed a few.
Oh, you probably meant something else. We can use a raw array if `std::array` is messing with things.
We should check that it runs on device, but I thought `std::array` worked with `--expt-relaxed-constexpr`. I'm not sure, though.
@mauneyc-LANL you may be able to use `std::get<Ind>(arr) = value;`. But we really need to check and make sure it works on device.
Using `--expt-relaxed-constexpr`, this runs smoothly on the volta-x86 partition with `kokkos+cuda cuda_arch=70 +wrapper` and `cmake .. -DPORTS_OF_CALL_BUILD_TESTING -DPORTABILITY_STRATEGY_KOKKOS=ON`.
ports-of-call/portable_arrays.hpp (Outdated)
```diff
 PORTABLE_FORCEINLINE_FUNCTION int GetSize() const {
-  return nx1_ * nx2_ * nx3_ * nx4_ * nx5_ * nx6_;
+  return std::accumulate(nxs_.cbegin(), nxs_.cend(), 1,
```
Have you tested this on GPUs? I know Brendan had an issue and we had to write a custom accumulate. One could still use binary operators provided by the STL, but the accumulate itself wouldn't run. Perhaps we did something wrong, but I want to make sure this is still GPU callable.
I would also be concerned about `std::accumulate`. Might be better to hardcode this one, or write a recursive thing ourselves.
It "worked" but gave warnings (even with `--expt-relaxed-constexpr`). I rewrote it as an explicit loop.
This is a nice cleanup. Before merging, I'd like to:
a) Get a sense of compile time differences
b) Know for sure it all works on device
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
// maximum number of dimensions
constexpr std::size_t MAXDIM = 6;
// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;
```
I don't love that this is lower-case. I'd find this type easier to interpret if it were, e.g.,

```diff
-using narr = std::array<std::size_t, MAXDIM>;
+using Narr_t = std::array<std::size_t, MAXDIM>;
```
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
// compute_index base case, i.e. fastest moving index
template <std::size_t Ind>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
                                              const size_t index) {
  return index;
}

// compute_index general case, computing slower moving index strides
template <std::size_t Ind, typename... Tail>
PORTABLE_INLINE_FUNCTION size_t compute_index(const narr &nd,
                                              const size_t index,
                                              const Tail... tail) {
  return index * nd[Ind] + compute_index<Ind + 1>(nd, tail...);
}
```
This is nice---assuming the array on device issues work.
Tests complete on volta-x86, though we need to put together some more tests for ports-of-call.
```cpp
// array type of dimensions/strides
using narr = std::array<std::size_t, MAXDIM>;

namespace detail {
```
It doesn't matter very much, but why `detail` over `impl`?
Just habit
```cmake
target_compile_options(${POCLIB}
  INTERFACE
  $<${with_cxx}:$<${ps_kokkos}:--expt-relaxed-constexpr>>)
```
In the event of +kokkos~cuda, wouldn't this result in a compile error?
Looks like it, but POC doesn't have a `USE_CUDA` or similar like `singularity-eos` does, so I need to probe which compiler is being used.
This is how I've done it in other projects:

```cmake
get_target_property(kokkos_link_libs Kokkos::kokkoscore
  INTERFACE_LINK_LIBRARIES)
string(REGEX MATCH "CUDA"
  kokkos_has_cuda_libs "${kokkos_link_libs}")
# ...
if(kokkos_has_cuda_libs)
  # do cuda stuff...
endif()
```
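Another option, assuming a Kokkos installation whose package config exports `Kokkos_ENABLE_CUDA` (recent Kokkos versions do), which avoids string-matching on link libraries:

```cmake
find_package(Kokkos REQUIRED)
if(Kokkos_ENABLE_CUDA)
  target_compile_options(${POCLIB}
    INTERFACE
    $<$<COMPILE_LANGUAGE:CXX>:--expt-relaxed-constexpr>)
endif()
```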
I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done; it just saved time to get a working code). One question to ask is how much time should
The options are:
I don't know the answer to this a priori, and I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?
Worth considering in that context: I've included a simple test with the PR, as originally there wasn't very much being tested directly.
Yeah, fair questions:
- No, almost never.
- Also no.

Like you said, 48 bytes is basically nothing, so I lean towards (2).
It seems we previously recalculated at every access. The number of adds will be the same, and the number of multiplies saved will only be noticeable with high-rank arrays. Performance data for ranks 2-6 would be helpful in this case.
The more I think about this, the more I think we should recompute the strides. Even so, I'd like to see performance differences. Integer operations can be noticeable on GPUs, so precomputed strides may be more performant, but a larger object may use more registers. There are a lot of competing factors, so data is the best guide.
Now that the singularity-eos 1.9.0 release has been cut, I think we should move up to C++17 and get this merged in prep for the next release.
I am in favor of moving to C++17 at this time. However, I would prefer to decouple the shift to 17 from this MR. Let's just update the cmake builds to require 17 and then merge other MRs when they are ready.
@jonahm-LANL @dholladay00 I've pushed some benchmarks using … I can take off …
I like this---it's a significant cleanup from the original implementation, which was quite low-level. I want @dholladay00's build-system concerns addressed before merge. Also, just to confirm, this doesn't change the API or functionality at all, right? It's just more general.
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
  return r;
}
PORTABLE_FORCEINLINE_FUNCTION
decltype(auto) vp_prod() {
```
Is the `decltype` needed? Why not just an `auto` return value?
Eh, I'm returning a generic lambda and that's my habit. But it's not necessary here (though I don't think it's harmful either).
I don't think the `decltype` adds anything and it's a bit harder to read, but I won't push back too much on this.
ports-of-call/portable_arrays.hpp (Outdated)
```cpp
PORTABLE_FORCEINLINE_FUNCTION auto make_nxs_array(NX... nxs) {
  std::array<std::size_t, MAXDIM> a;
  std::array<std::size_t, N> t{static_cast<std::size_t>(nxs)...};
  for (auto i = 0; i < N; ++i) {
```
I'm gonna insist this be `int` or `size_t`. IMO this is not a style thing but a self-documenting thing: it's an index, not a double.
If C++17 gets in before this is merged, then I do want to do another draft - it will make things cleaner and clearer.
@mauneyc-LANL let us know when this is ready for re-review.
PR Summary
This is a prototype for "templated MD", that is, the reduction of code of the type … to …
The single-value dimension variables `int nxN_` have been replaced with a statically sized container `std::array<size_t, MAXDIM> nxs_`. Also added a `PortableMDArray::rank_` member to avoid some reevaluations, though I'm not sure it's necessary.

Currently WIP, so documentation is wanting and a few function names/namespaces are silly/inconsistent/gibberish. I know there can be some reluctance about this coding style, so I wanted to get an "at least compiles" version up and gauge interest in iterating on it. (Actually passes a few tests! But I haven't tried `spiner` with this yet.)

It was working fine, why change it?

What are the downsides?
We are limited to C++14 (no `std::apply()`, no `std::tuple`, no fold expressions). C++14 can be "recursion heavy" in this regard, which may degrade performance.

Any suggestions, comments, questions welcome! @jonahm-LANL @chadmeyer @dholladay00 @jhp-lanl @jdolence
PR Checklist