Examples dump: Gitter

Jim Pivarski edited this page Jul 6, 2022 · 2 revisions

This is a dump of Gitter's Awkward Array channel (https://gitter.im/Scikit-HEP/awkward-array) from the last year, 2021-07-06 through 2022-07-06, for the sake of mining for tutorial examples. Gitter channels are all public.

Edited to only include examples.


Date: 2021-07-06 20:28:05 From: Jim Pivarski (@jpivarski)

Using my example from GitHub:

>>> onetwo = ak.Array(ak.partition.IrregularlyPartitionedArray([one.layout, two.layout]))
>>> onetwo
<Array [1.1, 2.2, 3.3, 4.4, ... 3.3, 4.4, 5.5] type='10 * ?float64'>
>>> pickle.dumps(onetwo)
ValueError: the Form of partition 1:

    {
    "class": "UnmaskedArray",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "d",
        "primitive": "float64",
        "form_key": "node1"
    },
    "form_key": "node0"
}

differs from the first Form:

    {
    "class": "BitMaskedArray",
    "mask": "u8",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "d",
        "primitive": "float64",
        "form_key": "node1"
    },
    "valid_when": false,
    "lsb_order": true,
    "form_key": "node0"
}

Date: 2021-07-08 09:05:44 From: Angus Hollands (@agoose77:matrix.org)

@jpivarski: yes, I was interested in establishing axis=None as the default for flatten. However, in this case, I'm actually asking about why the RecordArray layout is erased.

For example, take this array:

arr = ak.Array([{'x': 1, 'y': 'hi'}])

If we flatten with axis=None, then it actually merges the contents and loses the record structure. I think this is the wrong behaviour, not only for this case where we have non-mergeable types, but also in the general case - I see the record as part of the type, not structure

>>> ak.flatten(arr, axis=None)
<Array [1, 104, 105] type='3 * int64'>

Date: 2021-07-08 09:24:11 From: Angus Hollands (@agoose77:matrix.org)

☝️ Edit: @jpivarski: yes, I was interested in establishing axis=None as the default for flatten. However, in this case, I'm actually asking about why the RecordArray layout is lost.

I typically use flatten when I want to convert an array into a flat representation for e.g. histogramming. I expected flat = ak.flatten(arr) to produce a record array that could then be decomposed into the fields for a 2D histogram e.g. hist.fill(flat.x, flat.y).

For example, take this array:

arr = ak.Array([{'x': 1, 'y': 'hi'}])

If we flatten with axis=None, then it actually merges the contents and loses the record structure.

>>> ak.flatten(arr, axis=None)
<Array [1, 104, 105] type='3 * int64'>

I think this is the wrong behaviour, not only for this case where we have non-mergeable types, but also in the general case - I see the record as part of the type, not structure. The current implementation is more like ak.erase which would erase all structure + type information besides the dtype.

After taking another look, completely_flatten is a part of the problem as it doesn't propagate the recordarray to the root, so that information is lost. We also don't handle records in https://github.com/scikit-hep/awkward-1.0/blob/a70a535a818d54c6dcdb60711dff09a283441541/src/awkward/operations/structure.py#L1833-L1840

I wonder what should happen in the case that there are nested records? Intuitively, I'd expect that all RecordArray layouts are pushed to the root, and their contents would be completely flat. Maybe flatten needs to take a new keep_records parameter that can be used to restore the current behaviour.

Date: 2021-07-08 12:42:54 From: Angus Hollands (@agoose77:matrix.org)

Hmm, maybe this is where ravel comes in. To my mind, the existing axis=None behaviour best fits the notion of ravelling, because ravel doesn't take an axis parameter. Given that flatten and unflatten are already well defined concepts in Awkward (and strongly necessitate the idea of an axis), maybe we could deprecate the existing axis=None behaviour and schedule its replacement with axis="records". At the same time, one could add an ak.ravel which only does ak.flatten(axis=None). I.e.

def flatten(array, axis=None):
    if axis is None:
        return ravel(array)
    ...

def ravel(array):
    ...

Date: 2021-07-19 16:08:05 From: Angus Hollands (@agoose77:matrix.org)

sample = ak.Array([
    [
        [1,2,3,4,5],
        [1,2,3,4,5]
    ],
    [
        [1,2,3,4,5],
        [1,2,3,4,5],
    ]
])
sample = ak.to_regular(sample, axis=-1)
data = ak.zip({"sample": sample, "two_sample": sample}, depth_limit=2)

The above data.sample isn't regular despite being regular under the hood

Date: 2021-07-19 16:08:13 From: Jim Pivarski (@jpivarski)

What do you mean "erased"?

Date: 2021-07-19 16:09:08 From: Angus Hollands (@agoose77:matrix.org)

the RegularArray layout becomes ListOffsetArray, and in doing so, the fact that the layout is regular is lost

Date: 2021-07-19 16:10:52 From: Jim Pivarski (@jpivarski)

>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=True)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}).type
2 * var * var * {"a": int64, "b": int64}
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=False)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}).type
2 * var * var * {"a": int64, "b": int64}

I don't see a difference between RegularArray and NumpyArray with ndim > 1.

Date: 2021-07-19 16:13:51 From: Jim Pivarski (@jpivarski)

>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=True)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}, depth_limit=2).type
2 * var * {"a": var * int64, "b": var * int64}
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=False)
>>> sample.type
2 * 2 * 5 * int64
>>> ak.zip({"a": sample, "b": sample}, depth_limit=2).type
2 * var * {"a": var * int64, "b": var * int64}

Same thing happens here.

Date: 2021-07-19 16:24:30 From: Jim Pivarski (@jpivarski)

Regularness is preserved in simple broadcasting without ak.zip, which makes it sound like something in ak.zip is preventing this from taking effect.

>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=True)
>>> ak.broadcast_arrays(sample, sample)
[<Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>, <Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>]
>>> sample = ak.from_numpy(np.array([[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]]), regulararray=False)
>>> ak.broadcast_arrays(sample, sample)
[<Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>, <Array [[[1, 2, 3, 4, 5], ... [1, 2, 3, 4, 5]]] type='2 * 2 * 5 * int64'>]

Date: 2021-07-20 14:48:46 From: Angus Hollands (@agoose77:matrix.org)

This will be the "problem". If you're writing your own eval-based routine, just create your own namespace, e.g.

records = dict(zip(ak.fields(array), ak.unzip(array)))
namespace = {**records, "np": np, ...}

This is assuming that your array behaves like an Awkward record array.

Date: 2021-07-20 15:07:11 From: Jim Pivarski (@jpivarski)

def recurse(node):
    if isinstance(node, ast.AST):
        for field in node._fields:
            recurse(getattr(node, field))
    elif isinstance(node, list):
        for x in node:
            recurse(x)
    else:
        pass

You can then put specialized checks at the beginning of this if-elif chain, such as

    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
        # this may be a field of the NanoEvents or "np", etc.

Or you could change the AST, instrumenting the code, or return something else.

Already-parsed ASTs don't have to be parsed again; you can pass an AST to Python's built-in compile and then pass that to eval or exec. That would allow you to execute modified ASTs (if you have any reason to modify them).
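As a rough illustration of both points, the walker can be specialized to collect the names an expression reads, and the same parsed tree can then be compiled and evaluated without re-parsing. The expression, the variable names, and the helper collect_loaded_names are hypothetical:

```python
import ast

def collect_loaded_names(node, found=None):
    """Walk an AST and collect every identifier that is read (ast.Load)."""
    if found is None:
        found = set()
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
        found.add(node.id)
    if isinstance(node, ast.AST):
        for field in node._fields:
            collect_loaded_names(getattr(node, field), found)
    elif isinstance(node, list):
        for x in node:
            collect_loaded_names(x, found)
    return found

expression = "pt * 2 + offset"
tree = ast.parse(expression, mode="eval")   # parse once
names = collect_loaded_names(tree)          # inspect the tree

# Reuse the already-parsed tree: compile it, then evaluate it.
namespace = {"pt": 10, "offset": 1}
code = compile(tree, filename="<expression>", mode="eval")
result = eval(code, {"__builtins__": {}}, namespace)
print(sorted(names), result)
```

Knowing the names up front makes it possible to load only the fields an expression needs before evaluating it.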

Date: 2021-07-21 14:32:21 From: Jim Pivarski (@jpivarski)

@agoose77:matrix.org I've been investigating https://github.com/scikit-hep/awkward-1.0/issues/1022. Normally, the mixed variable and regular slicer that you made would be declared invalid: all-variable dimensions trigger Awkward advanced indexing and all-regular dimensions trigger NumPy advanced indexing; a mixture would be confusing, so it's not allowed.

However, it sneaks through because length-1 regular arrays in a slice are interpreted as SliceVarNewAxis: https://github.com/scikit-hep/awkward-1.0/pull/694/files#diff-a63810de74c2520ec41382cece2d156993c47ba9eb69772ce6b10a8262536e22

The use of the word "newaxis" is a little misleading; this is not representing a np.newaxis object in a slice, but a length-1 regular axis that was probably made by a np.newaxis. These tests demonstrate its use: https://github.com/scikit-hep/awkward-1.0/pull/694/files#diff-822ebabcc1ec64f9f91037a24a786bc869db7a9ebd334cf14189f5d8c0149988 (the np.newaxis modifies the slicer, which is then used to slice the array). SliceVarNewAxis objects are created when slicing array, not when slicing slicer.

This feature was added in a rush to prepare a tutorial that never happened (https://github.com/jpivarski-talks/2021-02-09-propose-scipy2021-tutorial/blob/main/prep/million-song.ipynb). It was the most minimal way I could see to add features that were necessary to do the analysis in that tutorial. The idea was that this is replicating a NumPy feature (broadcasting length-1 dimensions in slices) in Awkward advanced indexing. But Awkward advanced indexing fits a nested structure in the slicer to the nested structure that you're slicing, whereas NumPy advanced indexing slices each dimension by a different array, and all of those arrays in the tuple are broadcasted. NumPy advanced indexing is truly broadcasting because there are multiple arrays in the slicer; Awkward advanced indexing has only one array in the slicer, so it's not really broadcasting.

Treating a length-1 dimension differently from any other length makes this rule hard to predict. The idea was that you'd get the length-1 dimension from a np.newaxis, but in your case, you got it from a reducer with keepdims=True. I'm thinking this was a bad rule to have introduced: it has unforeseen consequences.

In the test suite, the rule is only triggered in the tests that were added to check it. The rule was never advertised (after all, that tutorial was never presented), and it is unlikely to have made its way into any analyses other than yours, since it's rather easy to trigger the FIXME (which is much older, and may yet be unreachable without the new SliceVarNewAxis rule). Slices that are enabled by the rule can still be performed without it—for instance, this test using the rule:

    array = ak.Array(
        [
            [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]],
            [[15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]],
        ]
    )
    slicer = ak.Array([[3, 4], [0, 1, 2, 3]])
    assert array[slicer[:, np.newaxis]].tolist() == [
        [[3, 4], [8, 9], [13, 14]],
        [[15, 16, 17, 18], [20, 21, 22, 23], [25, 26, 27, 28]],
    ]

can be replaced by the following, without the new rule:

    assert array[[[[3, 4], [3, 4], [3, 4]], [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]]].tolist() == [
        [[3, 4], [8, 9], [13, 14]],
        [[15, 16, 17, 18], [20, 21, 22, 23], [25, 26, 27, 28]],
    ]

Your use-case is also possible, but only if the slicer is all-variable (in keeping with the rule to avoid confusion between Awkward advanced indexing and NumPy advanced indexing):

>>> y = ak.Array([[[1, 2, 3, 4], [5, 6, 7, 8]]])
>>> t = ak.argmax(y, axis=-1, keepdims=True)
>>> y[t]
<Array [[[4], [8]]] type='1 * var * var * ?int64'>

(y is allowed to have regular dimensions, but t can't mix regular with irregular. You could keep y irregular or make t regular.)

So I think I'm going to remove it, which only requires reverting one PR: https://github.com/scikit-hep/awkward-1.0/pull/694.

Date: 2021-07-21 16:59:21 From: Angus Hollands (@agoose77:matrix.org)

Off-hand, do you have a personal rule of thumb about when to move something into its own record array, vs keeping everything top level, e.g.

n * var * {"x": float64, "y": float64, "t": float64}

vs

n * var * {"pos": {"x": float64, "y": float64}, "t": float64}

I can't seem to settle on a "good" solution besides "if it needs its own behaviour, put it in a recordarray"

Date: 2021-07-22 22:58:14 From: Jim Pivarski (@jpivarski)

Oh! I thought of a reason one might choose more nesting: if a lot of fields (e.g. "temp", "pres", "dens") share a lot of subfields (e.g. "nominal", "up1sigma", "dn1sigma"), you can do these slice projections:

dataset[["temp", "pres", "dens"], "nominal"]

https://awkward-array.readthedocs.io/en/latest/_auto/ak.Array.html#nested-projection

That wouldn't be possible if they were named "temp_nominal", "temp_up1sigma", etc.

Date: 2021-07-22 23:00:18 From: Angus Hollands (@agoose77:matrix.org)

Another good point. I tend to think any time you have a prefixed name, e.g. "temp_nominal", "pres_nominal" it's a sign you should have those fields in a record

Date: 2021-07-24 13:38:05 From: Andrew Naylor (@asnaylor)

Trying to use a jagged array as an index for another jagged array and i'm seeing behaviour i don't understand.

This is how i'm using the jagged array indexing, selecting elements from a jagged array with a jagged list.

>>> arr = ak.Array([[0, 1, 2], [77], [3, 4], [5, 6], [7, 8, 9, 10]])
>>> print(arr[ [ [0], [], [0], [1], [3]] ])
[[0], [], [3], [6], [10]]

But when i decided i want to return the number 77 from the 2nd "event" in jagged array, it returns something completely different i don't understand

>>> print(arr[ [ [0], [0], [0], [1], [3]] ])
[[[0, 1, 2]], [[0, 1, 2]], [[0, 1, 2]], [[77]], [[5, 6]]]

But then it returns back what i would expect to see if i ask for more than 1 element from an "event":

>>> print(arr[ [ [0, 1], [0], [0], [1], [3]] ])
[[0, 1], [77], [3], [6], [10]]

How come the 2nd example's behaviour is different and how do you get it to behave like the other examples?

Date: 2021-07-24 14:00:56 From: Andrew Naylor (@asnaylor)

I found you can just put the list in an awkward array and you get the expected behaviour

>>> array = ak.Array([[0, 1, 2], [77], [3, 4], [5, 6], [7, 8, 9, 10]])
>>> slicer = ak.Array([ [0], [0], [0], [1], [3]])
>>> print(array[slicer])
[[0], [77], [3], [6], [10]]
>>> print(array[ak.to_list(slicer)])
[[[0, 1, 2]], [[0, 1, 2]], [[0, 1, 2]], [[77]], [[5, 6]]]

Date: 2021-07-24 14:06:31 From: Andrew Naylor (@asnaylor)

But if the awkward array is a flat 1D array then I can't seem to get it to choose the elements of each "event" when it's the index

>>> slicer = ak.Array([0, 0, 0, 1, 3])
>>> unflatten = ak.unflatten(slicer, 1)
>>> print(unflatten)
[[0], [0], [0], [1], [3]]
>>> print(array[unflatten])
[[[0, 1, 2]], [[0, 1, 2]], [[0, 1, 2]], [[77]], [[5, 6]]]

Date: 2021-07-24 14:09:08 From: Andrew Naylor (@asnaylor)

Only if i do something like unflatten then convert to a list and then put that in awkward array

>>> array[ak.Array(ak.to_list(unflatten))]
<Array [[0], [77], [3], [6], [10]] type='5 * var * int64'>

Date: 2021-07-24 14:56:41 From: Angus Hollands (@agoose77:matrix.org)

@asnaylor: these are good questions, and the short answer is that you should check [the docs](for details on how this works). The long answer is: Awkward has "two" main indexing behaviours - NumPy (regular) and Awkward (jagged). The motivation is that Awkward should be compatible as a drop-in for NumPy when using regular NumPy arrays.

Date: 2021-07-24 14:57:48 From: Angus Hollands (@agoose77:matrix.org)

To index into multiple dimensions with NumPy arrays, you have to use a single array for each dimension:

>>> x = np.array([
... [0, 1, 2],
... [3, 4, 5],
... [6, 7, 8]
... ], dtype=np.int64)
>>> x[[0, 1, 2], [0, 2, 2]]
array([0, 5, 8])

Date: 2021-07-24 15:03:04 From: Angus Hollands (@agoose77:matrix.org)

let's say you wanted only elements which are even, or equal to 3, you'd do

>>> x[(x % 2 == 0) | (x == 3)]
array([0, 2, 3, 4, 6, 8])

which loses the structure of the array
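The two NumPy behaviours can be put side by side in a short, self-contained sketch (plain NumPy only; the variable names are arbitrary):

```python
import numpy as np

x = np.arange(9).reshape(3, 3)

# Integer arrays, one per dimension, are paired up element by element:
# this picks x[0, 0], x[1, 2], and x[2, 2].
picked = x[[0, 1, 2], [0, 2, 2]]

# A boolean mask with the same shape as x selects matching elements,
# but the result is always one-dimensional.
mask = (x % 2 == 0) | (x == 3)
selected = x[mask]

print(picked.tolist(), selected.tolist(), selected.ndim)
```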

Date: 2021-07-24 19:44:25 From: Andrew Naylor (@asnaylor)

Thanks for the response @agoose77:matrix.org. I guess my question is now how do you convert a 1D list e.g. ak.Array([0, 0, 0, 1, 3]) into a jagged 1D list like using unflatten where 'var' is in the type.

If I just use unflatten then i don't see 'var' in the type.

>>> slicer = ak.Array([0, 0, 0, 1, 3])
>>> ak.unflatten(slicer, 1)
<Array [[0], [0], [0], [1], [3]] type='5 * 1 * int64'>

But if i convert the unflattened array to a list and then to an awkward array, it then becomes a jagged array.

>>> ak.Array(ak.to_list(ak.unflatten(slicer, 1)))
<Array [[0], [0], [0], [1], [3]] type='5 * var * int64'>

Is there a simpler way to do this?

Date: 2021-07-24 19:44:57 From: Andrew Naylor (@asnaylor)

The docs link you posted just comes up as http://for%20details%20on%20how%20this%20works/ for me

Date: 2021-07-24 20:03:51 From: Andrew Naylor (@asnaylor)

@agoose77:matrix.org saw on the new awkward docs you can pass a 1d list to unflatten instead of a single int.

>>> slicer = ak.Array([0, 0, 0, 1, 3])
>>> ak.unflatten(slicer, np.ones(len(slicer), dtype=int))
<Array [[0], [0], [0], [1], [3]] type='5 * var * int64'>

So I can just use numpy to create a 1d list of the same size and now it returns it as a jagged array

Date: 2021-07-24 20:35:21 From: Angus Hollands (@agoose77:matrix.org)

You can also do

ak.from_regular(slicer[:, np.newaxis], axis=1)

Date: 2021-07-30 13:19:11 From: Angus Hollands (@agoose77:matrix.org)

Hey @jpivarski , sorry for the visual noise - I've determined that you can't edit messages on Matrix once someone has replied to them (well, you can, but Gitter displays copied messages 🤮).

Should two arrays with different dimensions and record types be broadcastable? E.g.

>>> x.type
4056 * var * {"u": ?int64, "v": ?int64}
>>> y.type
4056 * var * var * {"q": float64, "t": float64}
>>> ak.broadcast_arrays(x, y)
ValueError ...

We currently require the record fields to match if we allow records to broadcast against one another.

This behavior initially surprised me - I expected the record-part to be irrelevant and successfully left-broadcast; the record is part of the type, and we allow different dtypes to broadcast:

>>> ak.broadcast_arrays(x.u, y.t)
[<Array [[[66], [65]], ... 64], [68], [65]]] type='4056 * var * option[var * int64]'>,
 <Array [[[32], [199]], ... [31], [275]]] type='4056 * var * option[var * float64]'>]

However, I'm also not sure about the resulting type here - naively I'd expect the option to wrap the dtype, not the var dimension.

The broadcasting works if I use zip, because zip exits early from the broadcasting once the purelist_depth is 1 (i.e. it doesn't encounter the bare record).

>>> ak.zip((x, y)).type
4056 * var * var * ({"u": ?int64, "v": ?int64}, {"t": float64, "q": float64})

Most likely I am missing something here, but perhaps you could shed some light on this? @jpivarski

Date: 2021-09-01 23:03:57 From: Thomas A Caswell (@tacaswell)

I think I want a shape that looks like

3 * {"a": int64, "b": int64, "c": var * {"x": int64, "y": int64}}

Date: 2021-09-01 23:05:29 From: Thomas A Caswell (@tacaswell)

but have not found a way to get it via random walks. I have gotten

3 * {"a": int64, "b": int64, "c": {"x": var * int64, "y": var * int64}}

and looking at just the nested part I can get

 3 * var * {"x": float32, "y": int32}

Date: 2021-09-01 23:42:21 From: Jim Pivarski (@jpivarski)

@tacaswell If you have JSON data or Python lists and dicts with this form, you can just pass them into ak.from_json or ak.from_iter.

But supposing you are starting with non-record arrays, some of them with a "var" dimension:

>>> a = ak.from_numpy(np.arange(10))
>>> b = ak.from_numpy(np.arange(100, 110))
>>> counts = np.random.poisson(2.5, 10)
>>> cx = ak.unflatten(np.random.normal(0, 1, counts.sum()), counts)
>>> cy = ak.unflatten(np.random.normal(0, 1, counts.sum()), counts)
>>> a
<Array [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] type='10 * int64'>
>>> b
<Array [100, 101, 102, 103, ... 107, 108, 109] type='10 * int64'>
>>> cx
<Array [[1.1, 0.283, -0.596], ... [1.41, 1.18]] type='10 * var * float64'>
>>> cy
<Array [[1.33, -0.512, 0.478, ... 1.79, -1.44]] type='10 * var * float64'>

Date: 2021-09-01 23:45:00 From: Thomas A Caswell (@tacaswell)

Starting from

d = [
    {"a": 1, "b": 2, "c": {"x": [1, 2, 3], "y": [4, 5, 6]}},
    {"a": 1, "b": 2, "c": {"x": [1, 2, 3, 4, 5], "y": [4, 5, 6, 7, 8]}},
    {"a": 1, "b": 2, "c": {"x": [1, 2, 10], "y": [4, 5, 9]}},
]

a = ak.Array(d)
# a.type == 3 * {"a": int64, "b": int64, "c": {"x": var * int64, "y": var * int64}}

Date: 2021-09-01 23:45:32 From: Jim Pivarski (@jpivarski)

This will do it: use a depth_limit on the outer ak.zip to prevent broadcasting the "a" and "b" into the "cx" and "cy" (which would end up with 10 * var * ...), but don't use a depth_limit on the inner one.

>>> array = ak.zip({"a": a, "b": b, "c": ak.zip({"x": cx, "y": cy})}, depth_limit=1)
>>> array.type
10 * {"a": int64, "b": int64, "c": var * {"x": float64, "y": float64}}

I'm looking at your actual example now...

Date: 2021-09-01 23:46:45 From: Jim Pivarski (@jpivarski)

To keep the names separate,

>>> d = [
...     {"a": 1, "b": 2, "c": {"x": [1, 2, 3], "y": [4, 5, 6]}},
...     {"a": 1, "b": 2, "c": {"x": [1, 2, 3, 4, 5], "y": [4, 5, 6, 7, 8]}},
...     {"a": 1, "b": 2, "c": {"x": [1, 2, 10], "y": [4, 5, 9]}},
... ]
>>> d = ak.Array(d)
>>> d
<Array [{a: 1, b: 2, c: {x: [1, ... 4, 5, 9]}}] type='3 * {"a": int64, "b": int6...'>

Date: 2021-09-01 23:47:26 From: Jim Pivarski (@jpivarski)

You get the projected-out "a", "b", "cx", and "cy" like

>>> d["a"]
<Array [1, 1, 1] type='3 * int64'>
>>> d["c", "x"]
<Array [[1, 2, 3], ... 2, 3, 4, 5], [1, 2, 10]] type='3 * var * int64'>

Date: 2021-09-01 23:47:43 From: Jim Pivarski (@jpivarski)

Or d.a, d.c.x.

Date: 2021-09-01 23:48:07 From: Jim Pivarski (@jpivarski)

>>> array = ak.zip({"a": d.a, "b": d.b, "c": ak.zip({"x": d.c.x, "y": d.c.y})}, depth_limit=1)
>>> array.type
3 * {"a": int64, "b": int64, "c": var * {"x": int64, "y": int64}}

Date: 2021-09-01 23:54:56 From: Thomas A Caswell (@tacaswell)

recs = [np.array(list(zip(_["c"]["x"], _["c"]["y"])), dtype=[("x", "f"), ("y", "i")]) for _ in d]

array = ak.zip(
    {"a": d.a, "b": d.b, "c": ak.concatenate([ak.from_numpy(r)[np.newaxis, :] for r in recs])}, depth_limit=1
)

Date: 2021-09-02 00:08:20 From: Jim Pivarski (@jpivarski)

>>> counts = np.random.poisson(2.5, 10)
>>> c = np.random.normal(0, 1, (counts.sum(), 1024, 760))
>>> ak.unflatten(c, counts)
<Array [[[[-0.497, -1.28, ... 0.241, -1.2]]]] type='10 * var * 1024 * 760 * float64'>

Date: 2021-09-02 00:23:53 From: Jim Pivarski (@jpivarski)

The from_datashape function is as-yet undocumented. I noticed recently that it slipped through, and there are some subtleties, like:

>>> array = ak.unflatten(c, counts)
>>> str(array.type)
'10 * var * 1024 * 760 * float64'
>>> type(array.type)
<class 'awkward._ext.ArrayType'>
>>> ak.types.from_datashape(str(array.type))
10 * var * 1024 * 760 * float64
>>> type(ak.types.from_datashape(str(array.type)))
<class 'awkward._ext.RegularType'>

RegularType (regular dimensions nested within an array) and ArrayType (everything at top level is an ArrayType; it gives the length of the array) can't be distinguished from the string alone, unless you know that the string represents a full array, not a partial one, and then replace the RegularType with an ArrayType.

Date: 2021-09-05 18:45:11 From: Thomas A Caswell (@tacaswell)

import numpy as np
import awkward as ak

r = {
    "A": {
        "G0": [
            {
                "tick": [0, 2, 3],
                "C1": [4.5, 4.6, 4.7],
            },
            {
                "tick": [7, 8, 9, 10],
                "C1": [5.5, 6.6, 7.7, 8.8],
            },
        ],
        "G1": [
            {
                "tick": [0, 2, 3],
                "C1": [4.5, 4.6, 4.7],
            },
            {
                "tick": [10, 11],
                "C1": [5.5, 6.6],
            },
            {
                "tick": [15, 16, 17, 18],
                "C1": [5.5, 6.6, 0.7, 0.8],
            },
        ],
    },
    "B": {
        "G0": [
            {
                "tick": [0, 2, 3],
                "C1": [4.5, 4.6, 4.7],
            },
            {
                "tick": [7, 8, 9, 10],
                "C1": [5.5, 6.6, 7.7, 8.8],
            },
            {
                "tick": [10, 11],
                "C1": [5.5, 6.6],
            },
        ],
        "G1": [
            {
                "tick": [0, 2, 3],
                "C1": [4.5, 4.6, 4.7],
            },
            {
                "tick": [15, 16, 17, 18],
                "C1": [5.5, 6.6, 0.7, 0.8],
            },
        ],
    },
}


as_dict = {k: ak.concatenate([ak.Array(c)[np.newaxis, :] for c in v]) for k, v in r['A'].items()}

# this fails
array = ak.zip(as_dict, depth_limit=1)

Date: 2021-09-05 18:45:36 From: Thomas A Caswell (@tacaswell)

where I think the type I want is

want = '{"A": {"G1": var * var * {"tick": int64, "C1": float64}, "G2":var * var * {"tick": int64, "C1": float64}}, "B":  {"G1": var * var * {"tick": int64, "C1": float64}, "G2":var * var * {"tick": int64, "C1": float64}}}'

Date: 2021-09-06 13:06:51 From: Jim Pivarski (@jpivarski)

@tacaswell and @agoose77:matrix.org, sorry to be coming to this late, but here are some suggestions. First, if you're starting from some Python data like r in your example, the first step should probably be to get it into an Awkward Array in whatever form it's already taking. Restructuring columnar arrays into other columnar arrays is either O(1) metadata rearrangement or only O(n) in some compiled code that checks to see if list lengths are equal to allow deep zipping, rather than any iteration in Python. The concatenation idiom might have worked in one case, but it's not a good pattern to generalize, especially as things start to scale.

So the first step should be to take Python data r and turn it into Awkward data R. This structure is a dict at top-level, rather than a list, so the ak.Array constructor will attempt to read that as a dict of columns (like the Pandas constructor) instead of passing the data to ak.from_iter. You could use the ak.Record constructor instead of ak.Array (since you know that it's a dict at top-level), or you could just use ak.from_iter explicitly and then not need to know:

>>> R = ak.from_iter(r)
>>> R
<Record ... 18], C1: [5.5, 6.6, 0.7, 0.8]}]}} type='{"A": {"G0": var * {"tick": ...'>
>>> R.type
{"A": {"G0": var * {"tick": var * int64, "C1": var * float64}, "G1": var * {"tick": var * int64, "C1": var * float64}}, "B": {"G0": var * {"tick": var * int64, "C1": var * float64}, "G1": var * {"tick": var * int64, "C1": var * float64}}}

You wanted to zip the "tick" and "C1" fields together (both of which are var arrays, always with the same lengths). Unfortunately, that's deep inside the structure, the leaves of the tree; the commands we have for restructuring unpack and repack at the root of the tree.

Date: 2021-09-06 13:07:54 From: Jim Pivarski (@jpivarski)

Here is an expression that takes R to the desired type; I'll explain it afterward.

restructured = ak.zip(
    {
        "A": ak.zip(
            {
                "G0": ak.from_regular(
                    ak.zip({"tick": R.A.G0.tick, "C1": R.A.G0.C1})[np.newaxis], axis=1
                ),
                "G1": ak.from_regular(
                    ak.zip({"tick": R.A.G1.tick, "C1": R.A.G1.C1})[np.newaxis], axis=1
                ),
            },
            depth_limit=1,
        ),
        "B": ak.zip(
            {
                "G0": ak.from_regular(
                    ak.zip({"tick": R.B.G0.tick, "C1": R.B.G0.C1})[np.newaxis], axis=1
                ),
                "G1": ak.from_regular(
                    ak.zip({"tick": R.B.G1.tick, "C1": R.B.G1.C1})[np.newaxis], axis=1
                ),
            },
            depth_limit=1,
        ),
    },
    depth_limit=1,
)[0]

And just to check it:

>>> restructured
<Record ... C1: 0.7}, {tick: 18, C1: 0.8}]]}} type='{"A": {"G0": var * var * {"t...'>
>>> restructured.type
{"A": {"G0": var * var * {"tick": int64, "C1": float64}, "G1": var * var * {"tick": int64, "C1": float64}}, "B": {"G0": var * var * {"tick": int64, "C1": float64}, "G1": var * var * {"tick": int64, "C1": float64}}}

Date: 2021-09-06 13:12:46 From: Jim Pivarski (@jpivarski)

It's a big expression because we have to unpack the whole thing and pack it back up. It's best to see it working from the leaves up to the whole expression (which is how I derived it).

At the deepest level, it's not hard to zip "tick" and "C1":

>>> ak.zip({"tick": R.A.G0.tick, "C1": R.A.G0.C1})
<Array [[{tick: 0, C1: 4.5}, ... C1: 8.8}]] type='2 * var * {"tick": int64, "C1"...'>

Expressions like R.A.G0.tick and R.A.G0.C1 take the whole thing apart, and the ak.zip puts them back together again without a depth_limit, so that they share a var. @agoose77:matrix.org said that ak.zip gets stuck if it can't zip all the way down. I should clarify that it doesn't give up and give you a structure with limited depth; it raises an error, specifically the "cannot broadcast nested list" error that @tacaswell saw at the beginning of this.

Date: 2021-09-06 13:34:39 From: Jim Pivarski (@jpivarski)

Another "design choice" to keep in mind is that if "A" and "B" or "G0" and "G1" represent data that can grow, they should not be record fields. That is, if you expect to add "C", "D", "E", etc. or "G2", "G3", "G4", etc. as the experiment collects data, then this needs to be in the form of arrays, not records.

Suppose, for instance, that the "G" series grows with the data. Then you want to arrange it so that there are no fields named "G" and all of the {"tick": int64, "C1": float64} records are in a continuous array like

>>> ak.concatenate([R.A.G0, R.A.G1])
<Array [{tick: [0, 2, 3], ... 6.6, 0.7, 0.8]}] type='5 * {"tick": var * int64, "...'>

or at least remember their "G" structure as an array of lists:

>>> ak.unflatten(ak.concatenate([R.A.G0, R.A.G1]), [len(R.A.G0), len(R.A.G1)])
<Array [[{tick: [0, 2, 3, ... 6.6, 0.7, 0.8]}]] type='2 * var * {"tick": var * i...'>

If the words "G0", "G1", etc. are important, they can be a field of a record. Here are two ways to do it: the first broadcasts the names into every {"tick": int64, "C1": float64} record, the second makes a two-level structure in which the data are in lists and the names are not.

>>> full = ak.zip({"name": ["G0", "G1"], "data": ak.unflatten(ak.concatenate([R.A.G0, R.A.G1]), [len(R.A.G0), len(R.A.G1)])})
>>> full
<Array [[{name: 'G0', data: {tick: [, ... ] type='2 * var * {"name": string, "da...'>
>>> full.type
2 * var * {"name": string, "data": {"tick": var * int64, "C1": var * float64}}
>>> full.tolist()
[[{'name': 'G0', 'data': {'tick': [0, 2, 3], 'C1': [4.5, 4.6, 4.7]}}, {'name': 'G0', 'data': {'tick': [7, 8, 9, 10], 'C1': [5.5, 6.6, 7.7, 8.8]}}], [{'name': 'G1', 'data': {'tick': [0, 2, 3], 'C1': [4.5, 4.6, 4.7]}}, {'name': 'G1', 'data': {'tick': [10, 11], 'C1': [5.5, 6.6]}}, {'name': 'G1', 'data': {'tick': [15, 16, 17, 18], 'C1': [5.5, 6.6, 0.7, 0.8]}}]]

and

>>> full2 = ak.zip({"name": ["G0", "G1"], "data": ak.unflatten(ak.concatenate([R.A.G0, R.A.G1]), [len(R.A.G0), len(R.A.G1)])}, depth_limit=1)
>>> full2
<Array [{name: 'G0', data: [{tick: [, ... ] type='2 * {"name": string, "data": v...'>
>>> full2.type
2 * {"name": string, "data": var * {"tick": var * int64, "C1": var * float64}}
>>> full2.tolist()
[{'name': 'G0', 'data': [{'tick': [0, 2, 3], 'C1': [4.5, 4.6, 4.7]}, {'tick': [7, 8, 9, 10], 'C1': [5.5, 6.6, 7.7, 8.8]}]}, {'name': 'G1', 'data': [{'tick': [0, 2, 3], 'C1': [4.5, 4.6, 4.7]}, {'tick': [10, 11], 'C1': [5.5, 6.6]}, {'tick': [15, 16, 17, 18], 'C1': [5.5, 6.6, 0.7, 0.8]}]}]

Date: 2021-09-16 11:57:06 From: Angus Hollands (@agoose77:matrix.org)

@jpivarski: do you have any idea why builder.append is much slower than unrolling the loop in numba? e.g.

@nb.njit
def concat(arr, builder):
    for x in arr:
        builder.begin_list()
        for y in x:
            builder.integer(y)
        builder.end_list()
    return builder

@nb.njit
def concat2(arr, builder):
    for x in arr:
        builder.append(x)
    return builder

I'm wondering whether it's numba being fast or builder being slow.

arr is a list of variable-length int32 arrays of length 145, with the total number of elements 753202

Date: 2021-09-16 12:45:39 From: Jim Pivarski (@jpivarski)

The "disaster case" that I was worried about is not happening. concat2 is not making as many internal buffers as there are appends. This is the correct output:

>>> import awkward as ak
>>> import numba as nb
>>> array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> @nb.njit
... def concat(arr, builder):
...     for x in arr:
...         builder.begin_list()
...         for y in x:
...             builder.integer(y)
...         builder.end_list()
...     return builder
... 
>>> @nb.njit
... def concat2(arr, builder):
...     for x in arr:
...         builder.append(x)
...     return builder
... 
>>> one = concat(array, ak.ArrayBuilder()).snapshot()
>>> two = concat2(array, ak.ArrayBuilder()).snapshot()
>>> one.layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5 6 10]" offset="0" length="6" at="0x557d7abc1c50"/></offsets>
    <content><NumpyArray format="l" shape="10" data="0 1 2 3 4 5 6 7 8 9" at="0x557d7b54bdc0"/></content>
</ListOffsetArray64>
>>> two.layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5 6 10]" offset="0" length="6" at="0x557d7b455570"/></offsets>
    <content><IndexedArray64>
        <index><Index64 i="[0 1 2 3 4 5 6 7 8 9]" offset="0" length="10" at="0x557d7b46cf70"/></index>
        <content><NumpyArray format="l" shape="10" data="0 1 2 3 4 5 6 7 8 9" at="0x557d7ac79af0"/></content>
    </IndexedArray64></content>
</ListOffsetArray64>

There is an extra level of indirection, but the interior NumpyArray is linked in from the original: see the pointer position for the NumpyArrays contained in each (they're identical).

>>> array.layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5 6 10]" offset="0" length="6" at="0x557d7aca1920"/></offsets>
    <content><NumpyArray format="l" shape="10" data="0 1 2 3 4 5 6 7 8 9" at="0x557d7ac79af0"/></content>
</ListOffsetArray64>

That's what append of a non-leaf value is supposed to do.
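For readers less familiar with the layout printouts, the following plain-Python sketch decodes the same buffers by hand. It assumes nothing about Awkward's internals beyond what the printouts show: offsets delimit sublists of the content, and an index adds one level of lookup. The helper names are hypothetical.

```python
# Buffers copied from the layout printouts above.
offsets = [0, 3, 3, 5, 6, 10]
index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
content = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


def decode_list_offset(offsets, content):
    """Slice content into sublists: list i spans offsets[i]:offsets[i + 1]."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_indexed(index, content):
    """Follow one level of indirection: element i is content[index[i]]."""
    return [content[i] for i in index]


# `one.layout`: offsets directly over the content.
direct = decode_list_offset(offsets, content)

# `two.layout`: offsets over an index that points into the content.
via_index = decode_list_offset(offsets, decode_indexed(index, content))

print(direct)               # [[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]]
print(direct == via_index)  # True
```

Because the index here is the identity, the extra IndexedArray64 layer changes nothing about the values; it only records that the content was linked in rather than copied.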

Date: 2021-09-20 12:29:54 From: Angus Hollands (@agoose77:matrix.org)

@jpivarski: how do you feel about an ak.unmask like function, i.e.

def unmask(array, axis=1, highlevel=True, behavior=None):
    def getfunction(layout, depth, posaxis):
        posaxis = layout.axis_wrap_if_negative(posaxis)
        if depth == posaxis + 1 and isinstance(layout, ak._util.optiontypes):
            layout = layout.simplify()
            return lambda: layout.content
        return posaxis
    layout = ak.operations.convert.to_layout(array)
    out = ak._util.recursively_apply(
        layout, getfunction, pass_depth=True, pass_user=True, user=axis
    )

    return ak._util.maybe_wrap_like(out, array, behavior, highlevel)

Date: 2021-09-28 17:56:48 From: Jim Pivarski (@jpivarski)

@rcrah The problem is that you're trying to assign values in place, which is not allowed in Awkward Array. (They're immutable.) You also don't want to be iterating over the data in a for loop, as that would always be slow in Python.

The way to approach a problem like this is to multiply whole arrays in a single expression. I don't know if your hitr and rates are both jagged, both one-dimensional, or what, so for the sake of an example, let's say hitr is an array of lists of records containing a field named "c1" and let's say rates is a one-dimensional array.

>>> import awkward as ak
>>> hitr = ak.Array([[{"c1": 1.1}, {"c1": 2.2}], [], [{"c1": 3.3}]])
>>> rates = ak.Array([100, 200, 300])

You can multiply them in a single expression,

>>> hitr["c1"] * rates
<Array [[110, 220], [], [990]] type='3 * var * float64'>

This case (the most complex) expanded the 100 to fit [1.1, 2.2] before multiplying, dropped the 200 (it lines up with an empty list), and multiplied the 300 with [3.3]. If hitr["c1"] and rates had the same shapes, so much the better.

As long as lengths match up, it will be possible to "broadcast" a mathematical operation like this. If hitr and rates were both jagged, for instance, all nested lists would have to have the same lengths, list by list. In any case, the outer lengths must always match. (The complex case above has scalars being promoted to variable-length lists, but that works because scalars don't have lengths.)

(Also, your c1 is a variable, not a string, but I'm assuming that it represents a string, possibly an unknown one.)
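The broadcasting rule described here can be spelled out in plain Python to show which value pairs with which. This is only a conceptual sketch with a hypothetical helper name; it is not how Awkward Array computes the result, and it is exactly the kind of Python loop that the array expression avoids.

```python
# Same values as the example above, as plain Python lists.
hitr_c1 = [[1.1, 2.2], [], [3.3]]
rates = [100, 200, 300]


def broadcast_multiply(jagged, flat):
    """Pair each inner list with one scalar and multiply element by element."""
    if len(jagged) != len(flat):
        # The outer lengths must always match.
        raise ValueError("outer lengths must match")
    return [[x * r for x in inner] for inner, r in zip(jagged, flat)]


result = broadcast_multiply(hitr_c1, rates)

# Each scalar is reused for every element of its list;
# the 200 pairs with an empty list and contributes nothing.
print([len(inner) for inner in result])  # [2, 0, 1]
```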

Date: 2021-10-21 15:57:29 From: Johannes Heuel (@jheuel)

Hello everyone, just started looking into awkward-array and I am struggling to find a short syntax to combine two Awkward arrays of records. As a simple example I have a and b:

a = ak.Array([{"x": 1}, {"x": 5}])
b = ak.Array([{"y": 2}, {"y": 6}])

and want to combine them to c of type: 2 * {"x": int64, "y": int64}. Is there some function to stack a and b together?

Date: 2021-10-21 16:31:47 From: Jim Pivarski (@jpivarski)

@jheuel There isn't a special function for that (yet, anyway). For two known fields, "x" and "y", it would be

>>> c = ak.zip({"x": a.x, "y": b.y})
>>> c
<Array [{x: 1, y: 2}, {x: 5, y: 6}] type='2 * {"x": int64, "y": int64}'>

If you didn't know the names of the fields or wanted to keep that general, it could be done using ak.fields, ak.unzip, and properties of Python.

>>> c = ak.zip(dict(zip(ak.fields(a) + ak.fields(b), ak.unzip(a) + ak.unzip(b))))
>>> c
<Array [{x: 1, y: 2}, {x: 5, y: 6}] type='2 * {"x": int64, "y": int64}'>

None of this iterates over the size of the datasets, so it's safe to use on large arrays. Also, ak.zip will prevent you from trying to combine a and b of different lengths. (It might also recurse too deeply for your purposes, in which case, you can set the depth_limit.)

Date: 2021-11-17 17:12:50 From: Raymond Ehlers (@raymondEhlers)

Hi, I'm running into an issue with Numba where the compilation in nopython mode breaks with: 'IndexedArrayType' object has no attribute 'arraytype'. My goal was to make a reproducer to report it + ask for debugging help, but the array is a bit complicated, so I tried to pickle it up for simplicity. However, I can't (yet) reproduce it with the pickled array. Looking at the layouts, I see that the layout before and after pickling are quite different. An excerpt from before:

          <ListArray64>
            <starts><Index64 i="[0 9 17 22 27 35 40 46 54 61 ... 1961 1969 1978 1988 2005 2016 2019 2023 2033 2040]" offset="0" length="257" at="0x55702e511110"/></starts>
            <stops><Index64 i="[9 17 22 27 35 40 46 54 61 65 ... 1969 1978 1988 2000 2016 2019 2023 2033 2040 2058]" offset="0" length="257" at="0x55702db16200"/></stops>
            <content><RecordArray length="2058">
                <parameters>
                    <param key="__record__">"Momentum4D"</param>
                </parameters>
                <field index="0" key="rho">
                    <IndexedArray64>
                        <index><Index64 i="[1 0 3 4 5 7 6 8 9 10 ... 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444]" offset="0" length="2058" at="0x55702cb66500"/></index>
                        <content><NumpyArray format="f" shape="2445" data="0.157209 1.53226 0.389549 0.1832 1.10865 ... 2.52209 0.862776 7.51934 3.71762 9.57922" at="0x55702cce3aa0"/></content>
                    </IndexedArray64>
                </field>
                ....
          </ListArray64>

and after:

          <ListOffsetArray64>
            <offsets><Index64 i="[0 9 15 20 27 37 45 52 60 63 ... 1788 1797 1807 1820 1828 1832 1837 1847 1852 1867]" offset="0" length="258" at="0x55693ec0e670"/></offsets>
            <content><RecordArray length="1867">
                <parameters>
                    <param key="__record__">"Momentum4D"</param>
                </parameters>
                <field index="0" key="charged">
                    <NumpyArray format="f" shape="1867" data="-3 3 -3 3 -3 ... -3 -3 3 3 3" at="0x55693ec395c0"/>
                </field>
                ....
          </ListOffsetArray64>

It seems like the issue may be related to the layout before pickling. Any suggestions for reproducing, debugging, or working around it? Thanks!

edit: Adding code that triggers the issue:

@nb.njit  # type: ignore
def repro(
    generator_like_jet_constituents: ak.Array,
) -> None:
    for i, generator_like_constituents in enumerate(generator_like_jet_constituents):
        s = 0
        for generator_like_constituent in generator_like_constituents:
            s += generator_like_constituent.pt

Date: 2021-12-23 17:13:31 From: extrinsic (@extrinsic:matrix.org)

btw I think I ran into a weird bug when I tried to do:

   ...: first = ak.from_numpy(np.array([1,2,3]))
   ...: deltas = ak.from_numpy(np.array([[1,2],[1,2],[1,2,3]]))
   ...: deltas = ak.pad_none(deltas, 3, axis=-1)
   ...: dxs = ak.numpy.hstack((first, deltas))
   ...: counts_to_remove = ak.sum(ak.is_none(xs, axis=-1), axis=-1)
   ...: xs = ak.numpy.ma.cumsum(xs, axis=-1).filled(0)
   ...: xs = xs[:, counts_to_remove]

Date: 2021-12-23 17:13:53 From: extrinsic (@extrinsic:matrix.org)

<__array_function__ internals> in hstack(*args, **kwargs)

RecursionError: maximum recursion depth exceeded while calling a Python object

Date: 2021-12-24 12:29:44 From: Angus Hollands (@agoose77:matrix.org)

Currently we don't implement a stack/hstack/vstack array function, so these would be expected to fail. What you can do is use np.concatenate after adding a new dimension to the arrays, i.e.

def stack(
    arrays,
    axis: int = 0,
    merge: bool = True,
    mergebool: bool = True,
    highlevel: bool = True,
    behavior: dict = None,
):
    first, *others = arrays
    if any(array.ndim != first.ndim for array in others):
        raise ValueError("All arrays must have the same shape")

    first_layout = ak.to_layout(first, allow_record=False)

    axis = first_layout.axis_wrap_if_negative(axis)

    # Create index that preserves the leading dimensions
    # before inserting a new axis
    new_axis = axis + 1
    index = (slice(None, None),) * new_axis + (np.newaxis,)

    return ak.concatenate(
        [a[index] for a in arrays],
        new_axis,
        merge=merge,
        mergebool=mergebool,
        highlevel=highlevel,
        behavior=behavior,
    )

This is the generic version of this:

stack

np.stack((
    x,
    y
), axis=-1)

concatenate

np.concatenate((
    x[..., np.newaxis],
    y[..., np.newaxis]
), axis=-1)
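The equivalence between the two forms can be checked directly in NumPy. The arrays x and y below are made-up one-dimensional inputs.

```python
import numpy as np

# Hypothetical inputs of equal shape.
x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

# Stack along a new last axis.
stacked = np.stack((x, y), axis=-1)

# Same result: add a length-1 last axis to each, then concatenate along it.
concatenated = np.concatenate((x[..., np.newaxis], y[..., np.newaxis]), axis=-1)

print(stacked.shape)                          # (3, 2)
print(np.array_equal(stacked, concatenated))  # True
```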

Date: 2021-12-27 17:30:01 From: Jim Pivarski (@jpivarski)

@agoose77:matrix.org ak.drop_none, which would remove the option-type as well as the missing values, would be an added value. The idiom

array[~ak.is_none(array)]

can't remove the option-type, just the missing values, because after creating the boolean array with is_none, it no longer remembers that those False values correspond to 100% of the None values from array. However, new development should also have a v2 implementation so that the two versions don't diverge in capabilities. It would be okay to implement it only in v2, but this looks like a fully developed v1 implementation, no reason to not include it.

Date: 2021-12-27 18:38:42 From: Angus Hollands (@agoose77:matrix.org)

I've actually implemented both of these functions now, so we can compare: ak.drop_none ak.simplify_option

Date: 2022-01-11 21:36:58 From: Jim Pivarski (@jpivarski)

I'm working out an example to figure out which one is right:

>>> import awkward as ak
>>> from awkward._v2.tmp_for_testing import v1_to_v2
>>> stuffy = ak.Array([{"x": [[1, 2], []]}, {"x": []}, {"x": [[3]]}], with_name="Stuff")
>>> stuffy
<Array [{x: [[1, 2], []]}, ... {x: [[3]]}] type='3 * Stuff["x": var * var * int64]'>

This stuffy knows that the record type name (part of its parameters) is "Stuff".

Date: 2022-01-11 21:38:00 From: Jim Pivarski (@jpivarski)

>>> ak.flatten(stuffy, axis=2)
<Array [{x: [1, 2]}, {x: []}, {x: [3]}] type='3 * {"x": var * int64}'>

Flattening it loses the name.

Date: 2022-01-11 21:38:55 From: Jim Pivarski (@jpivarski)

Converting to v2:

>>> stuffy_v2 = ak._v2.highlevel.Array(v1_to_v2(stuffy.layout))
>>> stuffy_v2
<Array [{x: [[1, 2], []]}, {...}, {x: [[3]]}] type='3 * Stuff[x: var * var *...'>

and flattening that:

>>> ak._v2.highlevel.Array(stuffy_v2.layout.flatten(axis=2))
<Array [{x: [1, 2]}, {x: []}, {x: [3]}] type='3 * Stuff[x: var * int64]'>

The "Stuff" is not lost. This is the difference.

Date: 2022-01-11 21:43:18 From: Jim Pivarski (@jpivarski)

When the flattening is happening outside the records, that's different code, and it works in both v1 and v2:

>>> another = ak.Array([[{"x": 1}, {"x": 2}], [], [{"x": 3}]], with_name="Stuff")
>>> another
<Array [[{x: 1}, {x: 2}], [], [{x: 3}]] type='3 * var * Stuff["x": int64]'>

>>> ak.flatten(another, axis=1)
<Array [{x: 1}, {x: 2}, {x: 3}] type='3 * Stuff["x": int64]'>

>>> ak._v2.highlevel.Array(v1_to_v2(another.layout).flatten(axis=1))
<Array [{x: 1}, {x: 2}, {x: 3}] type='3 * Stuff[x: int64]'>

Date: 2022-01-28 14:58:25 From: Alexander Held (@alexander-held)

Hi, I have some data in awkward-arrays, which is in this shape only for convenience reasons: it represents yields in bins in different phase space regions. Instead of having a long list of bins across all regions, it has sub-structure to divide by region, e.g. [[1,2], [3]] for two regions, one with two bins and one with a single bin. It could also be represented as [1,2,3].

I want to do a matrix multiplication with that data. Is there a way to do this with awkward? Here is an example:

import awkward as ak
import numpy as np

# works
np.matmul(np.asarray([[1,1,0],[0,1,0],[0,0,1]]), np.asarray([1,2,3]))

# ValueError: one of the right matrices in np.matmul is not rectangular
np.matmul(np.asarray([[1,1,0],[0,1,0],[0,0,1]]), ak.from_iter([[1,2],[3]]))

I understand why this would not work out of the box with the data structure. What is the easiest way to flatten and re-shape the data so that the result has the same shape as what I started with?

Date: 2022-01-28 15:23:09 From: Jim Pivarski (@jpivarski)

@alexander-held Here's what's supposed to happen (where @ is just a synonym for np.matmul).

NumPy matrix multiplication treats the left and right hand sides as individual matrices:

>>> np.zeros((5, 2)) @ np.zeros((2, 7))
array([[0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0.]])
>>> np.zeros((4, 3)) @ np.zeros((3, 5))
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

Awkward matrix multiplication treats the left and right hand sides as arrays of matrices (possibly with different shapes each), to be multiplied in an array-at-a-time way:

>>> both = ak.Array([np.zeros((5, 2)), np.zeros((4, 3))]) @ ak.Array([np.zeros((2, 7)), np.zeros((3, 5))])
>>> both.type
2 * var * var * float64
>>> both.tolist()
[[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]]

As far as I'm aware, it's not a widely used feature, and it's something to think about because Lukas Heinrich has also asked about it—he wants to autodiff through it. It's currently implemented in a "back door" kind of way, with Numba (it's the only one for which that's true). It will need to be converted to a standard, kernel-based implementation if we're going to autodiff through it.

Date: 2022-01-28 19:01:03 From: Alexander Held (@alexander-held)

Unless I now confused myself, here's the loop version:

v = ak.Array([[[5],[3, 4]], [[2],[1, 0]], [[4],[2,1]]])
A = ak.Array([[1,0,0], [0,1,0], [0,0,1]])

result = ak.zeros_like(v[0])
for i in range(3):
    for j in range(3):
        result = result + A[i][j]*v[i]*v[j]
print(result)

Date: 2022-01-31 10:37:46 From: Caglar Demir (@_CaglarDemir_twitter)

Dear all,

Thank you for the Awkward Array library. I reckon this was one of the missing pieces in the Python ecosystem. While I was playing with the library, I found that string operations cannot be broadcast in Awkward Array, e.g.,

import awkward as ak
ar=ak.Array([['a','b','c'],['b','c'],['x','y','xyz']])
ar + 'a' # [['aa','ba','ca'],['ba','ca'],['xa','ya','xyza']]
ar - 'a' # [['','b','c'],['b','c'],['x','y','xyz']]
ar - 'x' # [['a','b','c'],['b','c'],['','y','yz']]

I can't help but wonder whether there is a reason for not having such computation available.

Cheers!
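The semantics being asked for can be written out in plain Python, which also makes the intended results explicit. This is a sketch under the assumption that "+" means appending a suffix and "-" means removing every occurrence of a substring. The helper names are hypothetical and none of this is an Awkward Array API.

```python
ar = [["a", "b", "c"], ["b", "c"], ["x", "y", "xyz"]]


def append_suffix(nested, suffix):
    """Apply `s + suffix` to every string in a list of lists."""
    return [[s + suffix for s in inner] for inner in nested]


def remove_substring(nested, sub):
    """Remove every occurrence of `sub` from every string."""
    return [[s.replace(sub, "") for s in inner] for inner in nested]


print(append_suffix(ar, "a"))     # [['aa', 'ba', 'ca'], ['ba', 'ca'], ['xa', 'ya', 'xyza']]
print(remove_substring(ar, "a"))  # [['', 'b', 'c'], ['b', 'c'], ['x', 'y', 'xyz']]
print(remove_substring(ar, "x"))  # [['a', 'b', 'c'], ['b', 'c'], ['', 'y', 'yz']]
```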

Date: 2022-01-31 16:35:20 From: Jim Pivarski (@jpivarski)

The test in the gist-notebook is iterating over arrays. Iteration calls some_array[i] for each i, and some_array[i] is a slow operation (despite having a specialized path—it's just the __getitem__ indirection that's slow). We're not even attempting to make Python-like iterative access of Awkward Arrays fast, unless it's in a Numba-compiled function.

Here's another recent discussion about this, in the last section of https://github.com/scikit-hep/uproot4/discussions/550#discussioncomment-2046764 It was in the context of tolist, which is iterative in v1 and uses a different technique in v2, gaining a factor of 130× in speed:

In [1]: import awkward as ak, numpy as np

In [2]: array_v1 = ak.Array(np.random.normal(0, 1, (500000, 2)))

In [3]: %timeit as_list = array_v1.tolist()
15 s ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: array_v2 = ak._v2.Array(np.random.normal(0, 1, (500000, 2)))

In [5]: %timeit as_list = array_v2.tolist()
113 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Date: 2022-02-19 00:00:21 From: Angus Hollands (@agoose77:matrix.org)

@ianna: hi! Do you know if null is expected to work in layout builder at the moment? This form doesn't seem to like builder.null(), wondering if I'm misreading our docs?

builder = ak.layout.LayoutBuilder64("""
{
    "class": "ByteMaskedArray",
    "mask": "i8",
    "content": {
        "class": "RecordArray",
        "contents": {
            "params": {
                "class": "ListOffsetArray64",
                "offsets": "i64",
                "content": {
                    "class": "ListOffsetArray64",
                    "offsets": "i64",
                    "content": "float64"
                }
            },
            "cost": "float64",
            "optimality": "float64",
            "status": "int64"
        }
    },
    "valid_when": true
}
""")

Date: 2022-02-19 00:28:03 From: Angus Hollands (@agoose77:matrix.org)

Oh, looks like it's the ByteMaskedArray that we don't have implemented yet

Date: 2022-02-19 14:13:06 From: Ianna Osborne (@ianna)

@agoose77 - yes, here is an example:

def test_indexed_option_form():
    form = """
{
    "class": "IndexedOptionArray64",
    "index": "i64",
    "content": {
        "class": "NumpyArray",
        "itemsize": 8,
        "format": "l",
        "primitive": "int64",
        "form_key": "node1"
    },
    "form_key": "node0"
}
"""

    builder = ak.layout.LayoutBuilder32(form)

    builder.null()
    builder.int64(11)
    builder.int64(22)
    builder.null()
    builder.int64(33)
    builder.int64(44)
    builder.null()
    builder.int64(55)
    builder.int64(66)
    builder.int64(77)

    assert ak.to_list(builder.snapshot()) == [
        None,
        11,
        22,
        None,
        33,
        44,
        None,
        55,
        66,
        77,
    ]

Date: 2022-04-14 17:30:06 From: Angus Hollands (@agoose77:matrix.org)

For me, this works as expected:

import awkward as ak
import numpy as np


class ThisArray(ak.Array):
    def that(self):
        print("this.that()!")


behavior = {("*", "this"): ThisArray}

array = ak.Array({"x": [[1, 2, 3], [4], [5, 6]]}, with_name="this", behavior=behavior)
next_array = ak.unflatten(array, [2,1,1,1,1], axis=-1)

assert isinstance(next_array, ThisArray)

Date: 2022-06-22 16:01:51 From: Jim Pivarski (@jpivarski)

There isn't a way to convert ISO 8601 strings to np.datetime64 on the fly, but it's not a bad idea because the JSON-reader also interprets user-specified strings as floating point nan, inf, -inf, and user-specified records as complex numbers (both features are lacking in JSON). Maybe we ought to have a boolean for "interpret ISO 8601 strings as dates", though that would mean every string would be checked against a date regex. Specifying it for just one field introduces a problem of how users are going to express that field when it might be deeply nested (and the user might not know which it would come up as).

You can do it not-on-the-fly:

>>> # get data
>>> data = ak.from_json("""[
...     {"x": "2022-06-22T10:50:01", "y": 1},
...     {"x": "2022-06-22T10:50:02", "y": 2},
...     {"x": "2022-06-22T10:50:03", "y": 3}
... ]""")
>>> data
<Array [{x: '2022-06-22T10:50:01', ... y: 3}] type='3 * {"x": string, "y": int64}'>

>>> # look at x, convert it to NumPy, and from NumPy strings into NumPy dates
>>> data["x"]
<Array ['2022-06-22T10:50:01', ... ] type='3 * string'>
>>> np.asarray(data["x"])
array(['2022-06-22T10:50:01', '2022-06-22T10:50:02',
       '2022-06-22T10:50:03'], dtype='<U19')
>>> np.asarray(data["x"]).astype("datetime64[ns]")
array(['2022-06-22T10:50:01.000000000', '2022-06-22T10:50:02.000000000',
       '2022-06-22T10:50:03.000000000'], dtype='datetime64[ns]')

>>> # one way to put the NumPy dates into the original data is with attr-assignment
>>> # making a new array with ak.zip would also work
>>> data["x"] = np.asarray(data["x"]).astype("datetime64[ns]")
>>> data
<Array [{y: 1, ... ] type='3 * {"y": int64, "x": datetime64}'>
>>> data.tolist()
[
    {'y': 1, 'x': numpy.datetime64('2022-06-22T10:50:01.000000000')},
    {'y': 2, 'x': numpy.datetime64('2022-06-22T10:50:02.000000000')},
    {'y': 3, 'x': numpy.datetime64('2022-06-22T10:50:03.000000000')},
]

(Date-handling is more cleanly implemented in v2 than v1, so if you have any problems with dates, consider using awkward._v2 instead of awkward. The date type was added late in v1, and it doesn't pass through Python's buffer protocol to C++, so a work-around was needed, but the Python-C++ interface is irrelevant to v2.)
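The NumPy half of this recipe works on its own, which may help when checking the conversion before assigning back into an Awkward Array. The sketch below uses the same three timestamps. The choice of second resolution (datetime64[s]) is an assumption made here for brevity; the example above uses nanoseconds.

```python
import numpy as np

# The same ISO 8601 strings as in the JSON above.
stamps = np.array(
    ["2022-06-22T10:50:01", "2022-06-22T10:50:02", "2022-06-22T10:50:03"]
)

# NumPy parses ISO 8601 strings when casting to a datetime64 dtype.
dates = stamps.astype("datetime64[s]")

# Once they are dates, arithmetic works: consecutive differences are timedeltas.
steps = np.diff(dates)

print(dates.dtype)  # datetime64[s]
print(steps)        # three timestamps, so two one-second steps
```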

Date: 2022-06-24 11:11:02 From: Angus Hollands (@agoose77:matrix.org)

Let's say that you want to invoke a method on each field, then re-wrap it. This would look something like

def transform(array, func, with_name=None):
    contents = ak.unzip(array)
    new_contents = [func(x) for x in contents]
    return ak.zip(dict(zip(ak.fields(array), new_contents)), with_name=with_name, depth_limit=array.ndim)

Date: 2022-06-28 14:08:35 From: Jim Pivarski (@jpivarski)

This sounds like something that isn't allowed in NumPy. Do you mean something like this?

>>> one = np.array([[1, 2, 3], [4, 5, 6]])
>>> two = np.array([[10], [20]])
>>> one.shape
(2, 3)
>>> two.shape
(2, 1)
>>> one + two
array([[11, 12, 13],
       [24, 25, 26]])
>>> (one + two).shape
(2, 3)
>>> two[:, :0].shape
(2, 0)
>>> one + two[:, :0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (2,3) (2,0)

Date: 2022-06-28 14:29:45 From: Jim Pivarski (@jpivarski)

>>> array = np.array([[1, 2, 3], [4, 5, 6]])
>>> array
array([[1, 2, 3],
       [4, 5, 6]])
>>> array.tolist()
[[1, 2, 3], [4, 5, 6]]
>>> array[:, :0]
array([], shape=(2, 0), dtype=int64)
>>> array[:, :0].tolist()
[[], []]

Date: 2022-06-28 14:31:53 From: Angus Hollands (@agoose77:matrix.org)

Here's some non-production-quality code I wrote a while back

    is_valid_event = ak.prod(mm.is_valid, axis=-2, keepdims=True)
    is_not_saturated_waveform = ak.all(mm.is_valid, axis=-1)

    # Convert this to jagged so that we can broadcast when we have 0
    j = ak.from_regular(ak.local_index(is_valid_event), axis=-1)

    # Find bounds of saturated region
    # Convert these to regular so that the `keepdims` dimension is 1
    i = ak.to_regular(ak.argmin(is_valid_event, axis=-1, keepdims=True), axis=1)
    is_after_i = j >= i
    k = ak.to_regular(ak.argmax(is_valid_event.mask[is_after_i], axis=-1, keepdims=True), axis=1)

    # Find peak
    w = ak.argmax(mm.sample, axis=-1, keepdims=True)
    w_in_saturated = (w >= (i - 2)) & (w < (k + 2))
    
    # Make this regular so that we can broadcast with dim-1
    q = ak.to_regular(mm.sample[make_jagged(w)], axis=-1)

    # Convert sample to jagged so that 0-broadcasting works
    sigma = estimate_fwhm(ak.from_regular(mm.sample, axis=-1), q)
    charge, sigma = np.broadcast_arrays(
        q, sigma.mask[(w_in_saturated & is_not_saturated_waveform)]
    )

I had to write each line of this in the REPL to ensure that I had the right "kind" of array at each step. Actually, I wrote it without the REPL and then fixed each bug as my job failed, but don't tell anyone else that.

Date: 2022-06-28 15:02:50 From: Angus Hollands (@agoose77:matrix.org)

Another quirk I noticed today: is the following behaviour significant here?

>>> array = ak.from_numpy(np.random.random(size=(10, 3, 5)))
>>> array
<Array [[[0.457, 0.919, ... 0.192, 0.925]]] type='10 * 3 * 5 * float64'>
>>> ak.argmax(array, axis=-1, keepdims=True)
<Array [[[1], [2], [0]], ... [[3], [1], [4]]] type='10 * 3 * 1 * ?int64'>
>>> masked = array.mask[array > 0]
>>> masked
<Array [[[0.457, 0.919, ... 0.192, 0.925]]] type='10 * 3 * 5 * ?float64'>
>>> ak.argmax(masked, axis=-1, keepdims=True)
<Array [[[1], [2], [0]], ... [[3], [1], [4]]] type='10 * 3 * var * ?int64'>

The length-1 dimension becomes var, presumably because of the option type.

Date: 2022-06-29 16:55:43 From: Jim Pivarski (@jpivarski)

@marromlam Merge how? (The Pandas page on "merging" is quite long, because there are a lot of things one might mean by that.) Looking at this example, I'm guessing that you want to end up with a length-2 array of records with 4 fields: "time", "id", "foo", and "bar".

>>> import awkward as ak
>>> a = ak.from_iter([{'time': 4, 'id': -1}, {'time': 3, 'id': -1}])
>>> b = ak.from_iter( [{'foo': 1.2, 'bar': 2.4}, {'foo': 2.3, 'bar': -3.5}])
>>> merged = ak.zip({
...     "time": a["time"],
...     "id": a["id"],
...     "foo": b["foo"],
...     "bar": b["bar"],
... })
>>> merged
<Array [{time: 4, id: -1, ... bar: -3.5}] type='2 * {"time": int64, "id": int64,...'>
>>> merged.tolist()
[{'time': 4, 'id': -1, 'foo': 1.2, 'bar': 2.4}, {'time': 3, 'id': -1, 'foo': 2.3, 'bar': -3.5}]

Date: 2022-06-29 16:55:47 From: Angus Hollands (@agoose77:matrix.org)

@marromlam: it looks like you want to merge the arrays by fields, rather than by length. If that's the case, and both arrays have compatible lengths, you can use ak.zip. The easiest way to do this for a reasonable number of fields is to convert both arrays to dictionaries of arrays, and then merge them:

def array_as_dict(array):
    return dict(zip(ak.fields(array), ak.unzip(array)))

merged = ak.zip({**array_as_dict(a), **array_as_dict(b)})

Date: 2022-06-29 16:59:35 From: Angus Hollands (@agoose77:matrix.org)

Or, if you're using Python 3.9 or later, with the dict union operator:

def array_as_dict(array):
    return dict(zip(ak.fields(array), ak.unzip(array)))

merged = ak.zip(array_as_dict(a) | array_as_dict(b))
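One thing to watch with either spelling is what happens when both arrays share a field name. The sketch below uses plain dicts of lists as stand-ins for the dicts of arrays (an assumption for illustration; the field values are made up): the right-hand dict silently wins, so prefixing one side's keys is one way to keep both.

```python
# Plain dicts of lists stand in for the dicts of arrays built from each side
a_fields = {"time": [4, 3], "id": [-1, -1]}
b_fields = {"foo": [1.2, 2.3], "id": [7, 8]}  # "id" collides on purpose

# With ** unpacking, the right-hand dict wins on a shared key
merged_fields = {**a_fields, **b_fields}

# Prefixing the keys of one side keeps both columns
safe_fields = {**a_fields, **{"b_" + k: v for k, v in b_fields.items()}}

print(merged_fields)
print(safe_fields)
```

The `|` operator follows the same right-hand-wins rule as `**` unpacking.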

Date: 2022-06-29 20:21:42 From: Angus Hollands (@agoose77:matrix.org)

import awkward as ak
import numpy as np

def reshape_last(array, m):
    n_outer = np.asarray(ak.num(array, axis=-1))
    n_new = n_outer // m
    n_inner_new = np.full_like(n_outer, m)
    n_outer_new = np.repeat(n_new, n_inner_new)
    n_missing = n_outer - n_new*m
    n_missing_new = np.repeat(n_missing, n_inner_new)
    has_missing = np.repeat(n_missing > 0, n_inner_new)

    ix_fill_start = np.zeros(n_outer.size)
    ix_fill_start[1:] = np.cumsum(n_outer)[:-1]

    ix_fill_stop = ix_fill_start + n_missing
    jx = np.asarray(ak.local_index(n_outer_new))
    is_missing = np.sum((jx >= ix_fill_start[...,np.newaxis]) & (jx < ix_fill_stop[...,np.newaxis]), axis=0)
    np.add.at(n_outer_new, jx, is_missing)
    return ak.unflatten(array, n_outer_new, axis=-1)

arr = ak.Array([[1,2,3,4],[1,2,3,4,5,6,7,8]])
reshaped = reshape_last(arr, 3)
print(reshaped.tolist())
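For comparison, here is a NumPy-only sketch of one way to compute a flat counts array of the kind that reshape_last passes to ak.unflatten. It is not the function above: `split_counts` is a hypothetical helper, and it assumes the goal is to cut each list into `m` consecutive pieces with any remainder going to the earliest pieces (the convention of `np.array_split`).

```python
import numpy as np


def split_counts(lengths, m):
    """Flat piece sizes for cutting each list into m consecutive pieces."""
    lengths = np.asarray(lengths)
    base = lengths // m   # minimum size of every piece
    extra = lengths % m   # number of pieces that get one more item
    # Piece j of a list gets one extra item when j < that list's remainder
    counts = base[:, np.newaxis] + (np.arange(m) < extra[:, np.newaxis])
    return counts.ravel()


# Lists of length 4 and 8, each cut into 3 pieces
counts = split_counts([4, 8], 3)
print(counts.tolist())
```

Each group of `m` counts sums to the length of the corresponding original list, so no items are dropped or duplicated.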