XRTRunner Utility Class & Programming Examples Cleanup (#673)
* Introduces the XRTRunner class, which is used to run programs on the npu while using common numpy operations to check the input/output
* Remove spaghetti python path logic in examples
* Allows the data_transfer_transpose to run for both float and int 32-bit datatypes
* Makes it easier to change the sizes of things in the matrix_scalar_add example
* Percolate changes from #672 to other programming examples as appropriate
* Remove relative paths in makefiles
* Make matrix_scalar_add/multi_launch_channel code correct
* Adds/updates READMEs
hunhoffe authored Jul 31, 2024
1 parent 67a8b7d commit 2c3fcf4
Showing 78 changed files with 1,876 additions and 2,218 deletions.
16 changes: 10 additions & 6 deletions programming_examples/README.md
Expand Up @@ -4,26 +4,30 @@ These programming examples are provided so that application programmers can lear

## [2-Dimensional Shim DMA Passthrough](shim_dma_2d)

This example demonstrates how data may be moved using shim DMA operations. It also includes extra infrastructure that illustrates different ways to compile, build, run, and test programs written using the mlir-air python bindings.
This example demonstrates how data may be moved using shim DMA operations. It also includes extra infrastructure that illustrates different ways to compile, build, run, and test programs written using the mlir-air python bindings on an NPU.

## [Passthrough Examples](passthrough)

Three examples that copy data from the input to the output (a data passthrough). The data movement is done through either DMA or Channels, and there is a simple example of calling an external function which performs a vectorized memcopy.
This directory contains three examples that each copy data from the input to the output (a data passthrough). The data movement is done through either DMA or Channels, and there is a simple example of calling an external function which performs a vectorized memcopy.

## [Channel Examples](channel_examples)

This is a collection of simple examples that illustrate how to use channels.
This is a collection of simple examples that illustrate how to use *channels*. At a high level, channels are the abstraction for data movement in mlir-air. Some of the examples are experimental works-in-progress.

## [Matrix Scalar Addition](matrix_scalar_add)

This example provides logic to divide in input 2D matrix into *tiles* of data, and add a value to every element in every tile. It includes some description of the fundamental concepts of mlir-air, including *launches*, *herds*, and *channels*.
This example provides logic to divide an input 2D matrix into *tiles* of data, and add a value to every element in every tile. It includes some description of the fundamental concepts of mlir-air, including *launches*, *herds*, and *channels*. There are five different implementations of this example, some of which are experimental (and are currently works-in-progress).
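
The tiling idea described above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not mlir-air code: the function name `tiled_scalar_add` and the sizes used here are assumptions made for the sketch, and nothing in it runs on an NPU.

```python
import numpy as np


def tiled_scalar_add(matrix, tile_shape, value):
    """Add `value` to every element, visiting the matrix one tile at a time."""
    height, width = matrix.shape
    tile_h, tile_w = tile_shape
    # The tiles must cover the matrix exactly
    assert height % tile_h == 0 and width % tile_w == 0

    out = np.empty_like(matrix)
    for h in range(height // tile_h):
        for w in range(width // tile_w):
            # Row and column ranges covered by tile (h, w)
            rows = slice(h * tile_h, (h + 1) * tile_h)
            cols = slice(w * tile_w, (w + 1) * tile_w)
            out[rows, cols] = matrix[rows, cols] + value
    return out


image = np.arange(16 * 32, dtype=np.int32).reshape(16, 32)
result = tiled_scalar_add(image, (8, 16), 3)
assert np.array_equal(result, image + 3)
```

In the actual examples each tile would be handled by a worker in a herd; the nested loop here only stands in for that distribution of work.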

## [Data Transfer Transpose](data_transfer_transpose)

Transposes a matrix using either Channels or `dma_memcpy_nd`.
Transposes a matrix using either air channels or `dma_memcpy_nd`.

## [Segment Alloc](segment_alloc)

While a *worker* (a compute unit managed as part of a *herd*) is able to allocate L1 memory, it is not able to allocate L2 memory; that must be done in the *segment*. This example shows how a segment can allocate L2 memory which is then accessed within the herd.

## [WIP: Multi-Segment Examples](multi_segment)

This is a collection of simple examples that illustrate how to use multiple segments.
This is a collection of simple examples that illustrate how to use multiple segments.

Warning: These examples are works-in-progress.
58 changes: 26 additions & 32 deletions programming_examples/channel_examples/README.md
@@ -1,67 +1,61 @@
# Channel Examples

This example focuses on one of the key abstractions of air: *channels*. This is a collection of examples that use channels in various ways. The patterns shown here may be used to create more complex examples.
This collection of examples focuses on one of the key abstractions of air: *channels*. The patterns shown here may be used to create more complex examples.

## Running and Testing

#### ```herd-to-herd```: Using a channel to pass data between herds.

There are two parts to this example: two herds within one segment (single segment), and one herd per segment for two segments (multi-segment)
There are two parts to this example: two herds within one segment (single segment), and one herd per segment for two segments (multi-segment).

The single segment example ([herd_to_herd/single_segment/herd_to_herd.py](herd_to_herd/single_segment/herd_to_herd.py)) defines two `herd`s within the same `launch` + `segment`. There is a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer herd*, which reads data from the `Herd2Herd` channel.

```bash
cd herd_to_herd/single_segment
make clean && make
```
The single segment example ([herd_to_herd/single_segment/herd_to_herd.py](herd_to_herd/single_segment/herd_to_herd.py)) defines two *herds* within the same *launch* and *segment*. There is a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer herd*, which reads data from the `Herd2Herd` channel.

The multi-segment example ([herd_to_herd/multi_segment/herd_to_herd.py](herd_to_herd/multi_segment/herd_to_herd.py)) defines two `segment`s, each with one `herd`, within the same `launch`. There is a *producer_segment* with a *producer herd*, which writes data to a `Herd2Herd` channel, and a *consumer_segment* with a *consumer herd*, which reads data from the `Herd2Herd` channel.

Warning: The multi-segment example is a work in progress!

```bash
cd herd_to_herd/multi_segment
make clean && make
```

#### ```channel-size```: Use the channel size argument

This example ([channel_size/channel_size.py](channel_size/channel_size.py)) is a data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only instead of using a separately defined channel for each tile/core, a bundle of channels is created (using the `ChannelOp` `size` parameter) and indexed into (the `ChannelGet` and `ChannelPut` `indices` parameter).
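
As a rough mental model of a channel bundle, the sketch below uses one Python FIFO per `(h, w)` index. It is not the air API: `make_bundle` and `passthrough` are hypothetical names, and the `deque`s only imitate the put/get ordering of a channel created with a `size` and addressed through `indices`. The image and tile sizes match the constants in `channel_size.py`.

```python
from collections import deque

import numpy as np

IMAGE_SIZE = (16, 48)  # [IMAGE_HEIGHT, IMAGE_WIDTH]
TILE_SIZE = (8, 16)  # [TILE_HEIGHT, TILE_WIDTH]
GRID = (IMAGE_SIZE[0] // TILE_SIZE[0], IMAGE_SIZE[1] // TILE_SIZE[1])


def make_bundle(grid):
    """One FIFO per (h, w) index: a stand-in for a channel with a `size`."""
    return {(h, w): deque() for h in range(grid[0]) for w in range(grid[1])}


def passthrough(image):
    chan_in, chan_out = make_bundle(GRID), make_bundle(GRID)
    th, tw = TILE_SIZE

    # Launch side: put one tile into each indexed channel
    for (h, w), fifo in chan_in.items():
        fifo.append(image[h * th : (h + 1) * th, w * tw : (w + 1) * tw].copy())

    # Worker side: each worker gets from, and puts to, its own index
    for index in chan_in:
        chan_out[index].append(chan_in[index].popleft())

    # Launch side: get each tile back and write it to the output
    out = np.empty_like(image)
    for (h, w), fifo in chan_out.items():
        out[h * th : (h + 1) * th, w * tw : (w + 1) * tw] = fifo.popleft()
    return out


image = np.arange(16 * 48, dtype=np.int32).reshape(IMAGE_SIZE)
assert np.array_equal(passthrough(image), image)
```

The point of the bundle is that a single name plus an index replaces a separately declared channel per tile.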

```bash
cd channel_size
make clean && make
```

#### ```hierarchical```: Use channels for sending data from Launch to Segment to Herd and back again

This example ([hierarchical/hierarchical.py](hierarchical/hierarchical.py)) is a data passthrough example that uses a channel to send data from Launch to Segment (L3->L2 memory) and then from Segment to Herd (L2->L1 memory). The data is then sent back on an analogous path.
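
The hop-by-hop path can be pictured with one FIFO per hop. This is a plain-Python sketch under assumptions: the four queue names are hypothetical, and the `deque`s stand in for channels rather than modelling the memory hierarchy itself.

```python
from collections import deque

import numpy as np


def hierarchical_passthrough(data):
    """Send data Launch -> Segment -> Herd and back, one FIFO per hop."""
    launch_to_segment = deque()  # L3 -> L2
    segment_to_herd = deque()  # L2 -> L1
    herd_to_segment = deque()  # L1 -> L2
    segment_to_launch = deque()  # L2 -> L3

    # Launch puts the input; the segment gets it and forwards it to the herd
    launch_to_segment.append(data.copy())
    segment_to_herd.append(launch_to_segment.popleft())

    # Herd gets the data and sends it straight back (a passthrough)
    herd_to_segment.append(segment_to_herd.popleft())

    # Segment forwards the result; the launch gets the final output
    segment_to_launch.append(herd_to_segment.popleft())
    return segment_to_launch.popleft()


data = np.arange(32, dtype=np.int32)
assert np.array_equal(hierarchical_passthrough(data), data)
```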

```bash
cd hierarchical
make clean && make
```

#### WIP: ```worker-to-self```:

This example ([worker_to_self/worker_to_self.py](worker_to_self/worker_to_self.py)) is a work-in-progress data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, only the sole worker in the herd does some extra shuffling between input and output by putting the current data tile into a channel and then getting it from the same channel.

WARNING: This example currently fails because it is assumed channel gets/puts are not from the same memory region, and this example breaks this assumption.

```bash
cd worker_to_self
make clean && make
```
WARNING: This example currently fails for unknown reasons.

#### WIP: ```worker-to-worker```:

This example ([worker_to_worker/worker_to_worker.py](worker_to_worker/worker_to_worker.py)) is a work-in-progress data passthrough example using the same tiling structure as the [matrix_scalar_add/multi_core_channel](../matrix_scalar_add/multi_core_channel.py) examples, except that each worker trades a tile of input data with another worker in the herd by sending it via a channel.

WARNING: This example currently fails for unknown reasons.

#### Usage (For All Examples)

To generate AIR MLIR from Python:
```bash
cd worker_to_worker
cd <example_dir>
make clean && make print
```

To run:
```bash
cd <example_dir>
make clean && make
``
```

#### WIP: more examples!
To run with verbose output:
```bash
cd <example_dir>
python <example_file>.py -v
```

You may be able to configure examples (data types, sizes); to get additional usage information, run:
```bash
cd <example_dir>
python <example_file>.py -h
```
13 changes: 9 additions & 4 deletions programming_examples/channel_examples/channel_size/Makefile
@@ -1,12 +1,17 @@
# Copyright (C) 2022, Advanced Micro Devices, Inc.
# (c) Copyright 2024 Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

targetname := $(shell basename ${srcdir})

all: run

print:
	${powershell} python3 ${srcdir}/channel_size.py -p

run:
	mkdir -p build
	cd build && ${powershell} python3 ${srcdir}/run.py
	mkdir -p ${srcdir}/build
	cd ${srcdir}/build && ${powershell} python3 ${srcdir}/channel_size.py

clean:
	rm -rf build __pycache__
	rm -rf ${srcdir}/build ${srcdir}/__pycache__
85 changes: 62 additions & 23 deletions programming_examples/channel_examples/channel_size/channel_size.py
@@ -1,33 +1,39 @@
# Copyright (C) 2024, Advanced Micro Devices, Inc.
# SPDX-License-Identifier: MIT
import argparse
import numpy as np

from air.ir import *
from air.dialects.air import *
from air.dialects.memref import AllocOp, DeallocOp, load, store
from air.dialects.func import FuncOp
from air.dialects.scf import for_, yield_
from air.backend.xrt_runner import XRTRunner, type_mapper

range_ = for_

IMAGE_WIDTH = 32
IMAGE_WIDTH = 48
IMAGE_HEIGHT = 16
IMAGE_SIZE = [IMAGE_WIDTH, IMAGE_HEIGHT]
IMAGE_SIZE = [IMAGE_HEIGHT, IMAGE_WIDTH]

TILE_WIDTH = 16
TILE_HEIGHT = 8
TILE_SIZE = [TILE_WIDTH, TILE_HEIGHT]
TILE_SIZE = [TILE_HEIGHT, TILE_WIDTH]

assert IMAGE_WIDTH % TILE_WIDTH == 0
assert IMAGE_HEIGHT % TILE_HEIGHT == 0
assert IMAGE_WIDTH % TILE_WIDTH == 0

INOUT_DATATYPE = np.int32


@module_builder
def build_module():
    memrefTyInOut = MemRefType.get(IMAGE_SIZE, T.i32())
    xrt_dtype = type_mapper(INOUT_DATATYPE)
    memrefTyInOut = MemRefType.get(IMAGE_SIZE, xrt_dtype)

    # Create an input/output channel pair per worker
    ChannelOp("ChanIn", size=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT])
    ChannelOp("ChanOut", size=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT])
    ChannelOp("ChanIn", size=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH])
    ChannelOp("ChanOut", size=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH])

    # We will send an image worth of data in and out
    @FuncOp.from_py_func(memrefTyInOut, memrefTyInOut)
Expand All @@ -40,32 +46,32 @@ def launch_body(a, b):
            # Transfer one tile of data per worker
            for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
                for w in range(IMAGE_WIDTH // TILE_WIDTH):
                    offset0 = IMAGE_HEIGHT * h
                    offset1 = IMAGE_HEIGHT * w
                    offset0 = TILE_HEIGHT * h
                    offset1 = TILE_WIDTH * w

                    # Put data into the channel tile by tile
                    ChannelPut(
                        "ChanIn",
                        a,
                        indices=[w, h],
                        indices=[h, w],
                        offsets=[offset0, offset1],
                        sizes=[TILE_HEIGHT, TILE_WIDTH],
                        sizes=TILE_SIZE,
                        strides=[IMAGE_WIDTH, 1],
                    )

            # Transfer one tile of data per worker
            for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
                for w in range(IMAGE_WIDTH // TILE_WIDTH):
                    offset0 = IMAGE_HEIGHT * h
                    offset1 = IMAGE_HEIGHT * w
                    offset0 = TILE_HEIGHT * h
                    offset1 = TILE_WIDTH * w

                    # Write data back out to the channel tile by tile
                    ChannelGet(
                        "ChanOut",
                        b,
                        indices=[w, h],
                        indices=[h, w],
                        offsets=[offset0, offset1],
                        sizes=[TILE_HEIGHT, TILE_WIDTH],
                        sizes=TILE_SIZE,
                        strides=[IMAGE_WIDTH, 1],
                    )

Expand All @@ -75,7 +81,7 @@ def segment_body():

                @herd(
                    name="xaddherd",
                    sizes=[IMAGE_WIDTH // TILE_WIDTH, IMAGE_HEIGHT // TILE_HEIGHT],
                    sizes=[IMAGE_HEIGHT // TILE_HEIGHT, IMAGE_WIDTH // TILE_WIDTH],
                )
                def herd_body(th, tw, _sx, _sy):

Expand All @@ -85,7 +91,7 @@ def herd_body(th, tw, _sx, _sy):
                    # This is the type definition of the tile
                    tile_type = MemRefType.get(
                        shape=TILE_SIZE,
                        element_type=T.i32(),
                        element_type=xrt_dtype,
                        memory_space=mem_space,
                    )

Expand All @@ -94,11 +100,11 @@ def herd_body(th, tw, _sx, _sy):
                    tile_out = AllocOp(tile_type, [], [])

                    # Copy a tile from the input image (a) into the L1 memory region (tile_in)
                    ChannelGet("ChanIn", tile_in, indices=[tw, th])
                    ChannelGet("ChanIn", tile_in, indices=[th, tw])

                    # Access every value in the tile
                    for j in range_(TILE_HEIGHT):
                        for i in range_(TILE_WIDTH):
                    for i in range_(TILE_HEIGHT):
                        for j in range_(TILE_WIDTH):
                            # Load the input value from tile_in
                            val = load(tile_in, [i, j])

Expand All @@ -108,13 +114,46 @@ def herd_body(th, tw, _sx, _sy):
                        yield_([])

                    # Copy the output tile into the output
                    ChannelPut("ChanOut", tile_out, indices=[tw, th])
                    ChannelPut("ChanOut", tile_out, indices=[th, tw])

                    # Deallocate our L1 buffers
                    DeallocOp(tile_in)
                    DeallocOp(tile_out)


if __name__ == "__main__":
    module = build_module()
    print(module)
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Builds, runs, and tests the channel_size example",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
    )
    parser.add_argument(
        "-p",
        "--print-module-only",
        action="store_true",
    )
    args = parser.parse_args()

    mlir_module = build_module()
    if args.print_module_only:
        print(mlir_module)
        exit(0)

    input_matrix = np.random.randint(
        low=np.iinfo(INOUT_DATATYPE).min,
        high=np.iinfo(INOUT_DATATYPE).max,
        size=IMAGE_SIZE,
        dtype=INOUT_DATATYPE,
    )
    output_matrix = input_matrix.copy()

    runner = XRTRunner(verbose=args.verbose, experimental_passes=True)
    exit(
        runner.run_test(
            mlir_module, inputs=[input_matrix], expected_outputs=[output_matrix]
        )
    )
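
The offset change in this file (`TILE_HEIGHT * h` and `TILE_WIDTH * w` in place of `IMAGE_HEIGHT * h` and `IMAGE_HEIGHT * w`) can be checked with a small NumPy model of the `offsets`/`sizes`/`strides` triple. This is a sketch under an assumption: it treats element `(i, j)` of a tile as living at `(offsets[0] + i) * strides[0] + (offsets[1] + j) * strides[1]` in a row-major flat buffer, which is one reading of the arguments above and not a statement of how air defines them. `gather_tile` is a hypothetical helper.

```python
import numpy as np

IMAGE_HEIGHT, IMAGE_WIDTH = 16, 48
TILE_HEIGHT, TILE_WIDTH = 8, 16


def gather_tile(flat, offsets, sizes, strides):
    """Read a 2D tile from a flat buffer (assumed addressing, see lead-in)."""
    rows = (offsets[0] + np.arange(sizes[0])) * strides[0]
    cols = (offsets[1] + np.arange(sizes[1])) * strides[1]
    # Broadcasting builds the full grid of flat indices for the tile
    return flat[rows[:, None] + cols[None, :]]


image = np.arange(IMAGE_HEIGHT * IMAGE_WIDTH, dtype=np.int32).reshape(
    IMAGE_HEIGHT, IMAGE_WIDTH
)
flat = image.reshape(-1)

for h in range(IMAGE_HEIGHT // TILE_HEIGHT):
    for w in range(IMAGE_WIDTH // TILE_WIDTH):
        tile = gather_tile(
            flat,
            offsets=[TILE_HEIGHT * h, TILE_WIDTH * w],
            sizes=[TILE_HEIGHT, TILE_WIDTH],
            strides=[IMAGE_WIDTH, 1],
        )
        expected = image[
            TILE_HEIGHT * h : TILE_HEIGHT * (h + 1),
            TILE_WIDTH * w : TILE_WIDTH * (w + 1),
        ]
        assert np.array_equal(tile, expected)
```

Under the same model, an offset of `IMAGE_HEIGHT * h` would place the second row of tiles at row 16, which is outside a 16-row image.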