Add recipe: an introduction to custom ops #7

Merged
merged 3 commits into from
Mar 4, 2025
8 changes: 8 additions & 0 deletions custom-ops-introduction/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# pixi environments
.pixi
*.egg-info
# magic environments
.magic
.env
# build products
operations.mojopkg
115 changes: 115 additions & 0 deletions custom-ops-introduction/README.md
@@ -0,0 +1,115 @@
# Custom Operations: An Introduction to Programming GPUs and CPUs with Mojo

In this recipe, we will cover:

* How to extend a MAX Graph using custom operations.
* Using Mojo to write high-performance calculations that run on GPUs and CPUs.
* The basics of GPU programming in MAX.

We'll walk through running three examples that show:

* adding one to every number in an input tensor
* performing hardware-specific addition of two vectors
* calculating the Mandelbrot set on CPU and GPU

Let's get started.

## Requirements

Please make sure your system meets our
[system requirements](https://docs.modular.com/max/get-started).

To proceed, ensure you have the `magic` CLI installed:

```bash
curl -ssL https://magic.modular.com/ | bash
```

or update it via:

```bash
magic self-update
```

### GPU requirements

These examples can all be run on either a CPU or GPU. To run them on a GPU,
ensure your system meets
[these GPU requirements](https://docs.modular.com/max/faq/#gpu-requirements):

* Officially supported GPUs: NVIDIA Ampere A-series (A100/A10), or Ada
L4-series (L4/L40) data center GPUs. Unofficially, RTX 30XX and 40XX series
GPUs have been reported to work well with MAX.
* NVIDIA GPU driver version 555 or higher. [Installation guide here](https://www.nvidia.com/download/index.aspx).

## Quick start

1. Download the code for this recipe using git:

```bash
git clone https://github.com/modular/max-recipes.git
cd max-recipes/custom-ops-introduction
```

2. Run each of the examples:

```bash
magic run add_one
magic run vector_addition
magic run mandelbrot
```

3. Browse through the commented source code to see how they work.

## Custom operation examples

Graphs in MAX can be extended to use custom operations written in Mojo. This
recipe includes the following examples:

* **add_one**: Adding 1 to every element of an input tensor.
* **vector_addition**: Performing vector addition using a manual GPU function.
* **mandelbrot**: Calculating the Mandelbrot set.

Custom operations have been written in Mojo to carry out these calculations. For
each example, a simple graph containing a single operation is constructed
in Python. This graph is compiled and dispatched onto a supported GPU if one is
available, or the CPU if not. Input tensors, if there are any, are moved from
the host to the device on which the graph is running. The graph then runs and
the results are copied back to the host for display.
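
As a point of comparison for the results copied back to the host, the
arithmetic behind the first two examples can be written in plain Python. This
is only a reference sketch: `add_one_reference` and `vector_add_reference` are
hypothetical names, and the code stands in for the Mojo kernels rather than
reproducing them.

```python
def add_one_reference(matrix):
    """Return a new matrix with 1 added to every element."""
    return [[value + 1 for value in row] for row in matrix]


def vector_add_reference(lhs, rhs):
    """Return the elementwise sum of two equal-length vectors."""
    if len(lhs) != len(rhs):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(lhs, rhs)]


print(add_one_reference([[0.0, 0.5], [1.0, 1.5]]))
print(vector_add_reference([1.0, 2.0], [3.0, 4.0]))
```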

Note that the same Mojo code runs on both CPU and GPU. When the graph is
constructed, it is placed on a supported accelerator if one is available and
falls back to the CPU if not; neither path requires code changes. The
`vector_addition` example shows how this works under the hood for common MAX
abstractions, where compile-time specialization lets MAX choose the optimal
code path for a given hardware architecture.

The `operations/` directory contains the custom kernel implementations, and the
graph construction occurs in the Python files in the base directory. These
examples are designed to stand on their own, so that they can be used as
templates for experimentation.

The execution has two phases: first an `operations.mojopkg` is compiled from the
custom Mojo kernel, and then the graph is constructed and run in Python. The
inference session is pointed to the `operations.mojopkg` in order to load the
custom operations.
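
Because the second phase depends on the output of the first, a script can
check for the compiled package before creating the inference session. The
following is a minimal sketch, assuming the package sits next to the script as
it does in `add_one.py`; `require_package` is a hypothetical helper and not
part of MAX.

```python
from pathlib import Path


def require_package(directory: Path, name: str = "operations.mojopkg") -> Path:
    """Return the path to the compiled package, or fail with a clear message."""
    package = directory / name
    if not package.is_file():
        raise FileNotFoundError(
            f"{package} not found; compile the Mojo kernels before running the graph"
        )
    return package
```

In one of the example scripts this could be called as
`require_package(Path(__file__).parent)` and the result passed to
`custom_extensions`.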

## Conclusion

In this recipe, we've introduced the basics of how to write custom MAX Graph
operations using Mojo, place them in a one-operation graph in Python, and run
them on an available CPU or GPU.

## Next Steps

* Follow [our tutorial for building a custom operation from scratch](https://docs.modular.com/max/tutorials/build-custom-ops).

* Explore MAX's [documentation](https://docs.modular.com/max/) for additional
features. The [`gpu`](https://docs.modular.com/mojo/stdlib/gpu/) module has
detail on Mojo's GPU programming functions and types, and the documentation
on [`@compiler.register`](https://docs.modular.com/max/api/mojo-decorators/compiler-register/)
shows how to register custom graph operations.

* Join our [Modular Forum](https://forum.modular.com/) and [Discord community](https://discord.gg/modular) to share your experiences and get support.

We're excited to see what you'll build with MAX! Share your projects and experiences with us using `#ModularAI` on social media.
74 changes: 74 additions & 0 deletions custom-ops-introduction/add_one.py
@@ -0,0 +1,74 @@
# ===----------------------------------------------------------------------=== #
# Copyright (c) 2025, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #

from pathlib import Path

import numpy as np
from max.driver import CPU, Accelerator, Tensor, accelerator_count
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import Graph, TensorType, ops

if __name__ == "__main__":
    path = Path(__file__).parent / "operations.mojopkg"

    rows = 5
    columns = 10
    dtype = DType.float32

    # Configure our simple one-operation graph.
    graph = Graph(
        "addition",
        # The custom Mojo operation is referenced by its string name, and we
        # need to provide inputs as a list as well as expected output types.
        forward=lambda x: ops.custom(
            name="add_one",
            values=[x],
            out_types=[TensorType(dtype=x.dtype, shape=x.tensor.shape)],
        )[0].tensor,
        input_types=[
            TensorType(dtype, shape=[rows, columns]),
        ],
    )

    # Place the graph on a GPU, if available. Fall back to CPU if not.
    device = CPU() if accelerator_count() == 0 else Accelerator()

    # Set up an inference session for running the graph.
    session = InferenceSession(
        devices=[device],
        custom_extensions=path,
    )

    # Compile the graph.
    model = session.load(graph)

    # Fill an input matrix with random values.
    x_values = np.random.uniform(size=(rows, columns)).astype(np.float32)

    # Create a driver tensor from this, and move it to the accelerator.
    x = Tensor.from_numpy(x_values).to(device)

    # Perform the calculation on the target device.
    result = model.execute(x)[0]

    # Copy values back to the CPU to be read.
    assert isinstance(result, Tensor)
    result = result.to(CPU())

    print("Graph result:")
    print(result.to_numpy())
    print()

    print("Expected result:")
    print(x_values + 1)
94 changes: 94 additions & 0 deletions custom-ops-introduction/mandelbrot.py
@@ -0,0 +1,94 @@
# ===----------------------------------------------------------------------=== #
# Copyright (c) 2025, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #

from pathlib import Path

from max.driver import CPU, Accelerator, Tensor, accelerator_count
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import Graph, TensorType, ops


def create_mandelbrot_graph(
    width: int,
    height: int,
    min_x: float,
    min_y: float,
    scale_x: float,
    scale_y: float,
    max_iterations: int,
) -> Graph:
    """Configure a graph to run a Mandelbrot kernel."""
    output_dtype = DType.int32
    with Graph(
        "mandelbrot",
    ) as graph:
        # The custom Mojo operation is referenced by its string name, and we
        # need to provide inputs as a list as well as expected output types.
        result = ops.custom(
            name="mandelbrot",
            values=[
                ops.constant(min_x, dtype=DType.float32),
                ops.constant(min_y, dtype=DType.float32),
                ops.constant(scale_x, dtype=DType.float32),
                ops.constant(scale_y, dtype=DType.float32),
                ops.constant(max_iterations, dtype=DType.int32),
            ],
            out_types=[TensorType(dtype=output_dtype, shape=[height, width])],
        )[0].tensor

        # Return the result of the custom operation as the output of the graph.
        graph.output(result)
        return graph


if __name__ == "__main__":
    path = Path(__file__).parent / "operations.mojopkg"

    # Establish Mandelbrot set ranges.
    WIDTH = 15
    HEIGHT = 15
    MAX_ITERATIONS = 100
    MIN_X = -1.5
    MAX_X = 0.7
    MIN_Y = -1.12
    MAX_Y = 1.12

    # Configure our simple graph.
    scale_x = (MAX_X - MIN_X) / WIDTH
    scale_y = (MAX_Y - MIN_Y) / HEIGHT
    graph = create_mandelbrot_graph(
        WIDTH, HEIGHT, MIN_X, MIN_Y, scale_x, scale_y, MAX_ITERATIONS
    )

    # Place the graph on a GPU, if available. Fall back to CPU if not.
    device = CPU() if accelerator_count() == 0 else Accelerator()


In all of the examples, it would be nice to output some status text, including output that states which device the code is actually being run on, and any other information that might be interesting (like tick-tock timing when compared to NumPy calculations, if they actually result in better performance).

Being a bit more verbose will make the magic run commands feel a bit more interactive.
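
One way to act on this suggestion is a small timing helper that each example could wrap around its calculation. This is a sketch only: `describe_device` and `timed` are hypothetical helpers, and the device count is passed in as a plain integer so the snippet does not depend on MAX.

```python
import time


def describe_device(accelerator_count: int) -> str:
    """Return a status line naming the device class the graph would run on."""
    return "Running on GPU" if accelerator_count > 0 else "Running on CPU"


def timed(label, func, *args):
    """Run func(*args), print the elapsed wall-clock time, and return the result."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{label}: {elapsed_ms:.3f} ms")
    return result


print(describe_device(0))
total = timed("sum of 1..1000", sum, range(1, 1001))
```

In the examples, the integer would come from `accelerator_count()` and the timed call would be `model.execute`.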


    # Set up an inference session that runs the graph on a GPU, if available.
    session = InferenceSession(
        devices=[device],
        custom_extensions=path,
    )
    # Compile the graph.
    model = session.load(graph)

    # Perform the calculation on the target device.
    result = model.execute()[0]

    # Copy values back to the CPU to be read.
    assert isinstance(result, Tensor)
    result = result.to(CPU())

    print("Iterations to escape:")
    print(result.to_numpy())


Can we use the numpy isclose method to demonstrate equality in this, and other examples?
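
For reference, NumPy documents `isclose` as testing `|a - b| <= atol + rtol * |b|` elementwise, with defaults `rtol=1e-05` and `atol=1e-08`. The sketch below restates that check in plain Python to make the tolerance explicit; `all_close` is a hypothetical helper, and in the examples themselves a call such as `np.allclose(result.to_numpy(), x_values + 1)` would likely be the more natural choice.

```python
def all_close(actual, expected, rtol=1e-05, atol=1e-08):
    """Return True if every pair satisfies |a - b| <= atol + rtol * |b|."""
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - b) <= atol + rtol * abs(b) for a, b in zip(actual, expected)
    )


expected = [1.1, 2.2, 3.3]
# Simulate small rounding differences such as those from float32 arithmetic.
actual = [value * (1 + 1e-7) for value in expected]

print(all_close(actual, expected))
print(all_close([1.1, 2.2, 3.4], expected))
```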

Collaborator Author


For the Mandelbrot set, I've replaced the tensor output with an ASCII art display that should allow better scaling of output, and I've added a call-to-action in the text for people to experiment with different resolutions of the grid and different parts of the complex number space. The new output looks something like:

...................................,,,,c@8cc,,,.............
...............................,,,,,,cc8M @Mjc,,,,..........
............................,,,,,,,ccccM@aQaM8c,,,,,........
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
.......................,,,,,,,,c8888M@j,    ,wMM8cccc,,.....
.....................,,,,,,cccMQOPjjPrgg,   OrwrwMMMjjc,....
..................,,,,cccccc88MaP  @            ,pGa.g8c,...
...............,,cccccccc888MjQp.                   o@8cc,..
..........,,,,c8jjMMMMMMMMM@@w.                      aj8c,,.
.....,,,,,,ccc88@QEJwr.wPjjjwG                        w8c,,.
..,,,,,,,cccccMMjwQ       EpQ                         .8c,,,
.,,,,,,cc888MrajwJ                                   MMcc,,,
.cc88jMMM@@jaG.                                     oM8cc,,,
.cc88jMMM@@jaG.                                     oM8cc,,,
.,,,,,,cc888MrajwJ                                   MMcc,,,
..,,,,,,,cccccMMjwQ       EpQ                         .8c,,,
.....,,,,,,ccc88@QEJwr.wPjjjwG                        w8c,,.
..........,,,,c8jjMMMMMMMMM@@w.                      aj8c,,.
...............,,cccccccc888MjQp.                   o@8cc,..
..................,,,,cccccc88MaP  @            ,pGa.g8c,...
.....................,,,,,,cccMQOEjjPrgg,   OrwrwMMMjjc,....
.......................,,,,,,,,c8888M@j,    ,wMM8cccc,,.....
..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
............................,,,,,,,ccccM@aQaM8c,,,,,........
...............................,,,,,,cc8M @Mjc,,,,..........

I'll work on better results for the others.
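
A CPU-only reference for this kind of output can be sketched in plain Python. It assumes the usual escape-time rule (iterate `z = z*z + c` until `|z| > 2` or the iteration limit is reached) and reuses the grid constants from `mandelbrot.py`; the function names and the character palette are hypothetical and will not reproduce the exact art above.

```python
def escape_iterations(cx: float, cy: float, max_iterations: int) -> int:
    """Count iterations of z = z*z + c before |z| exceeds 2."""
    zx = zy = 0.0
    for i in range(max_iterations):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iterations


def mandelbrot_grid(width, height, min_x, min_y, scale_x, scale_y, max_iterations):
    """Return a height x width grid of escape counts, row by row."""
    return [
        [
            escape_iterations(
                min_x + col * scale_x, min_y + row * scale_y, max_iterations
            )
            for col in range(width)
        ]
        for row in range(height)
    ]


def render_ascii(grid, max_iterations, palette=".,c8M@ "):
    """Map each escape count to a character; points that never escape are blank."""
    last = len(palette) - 1
    return "\n".join(
        "".join(palette[min(count * last // max_iterations, last)] for count in row)
        for row in grid
    )


WIDTH, HEIGHT, MAX_ITERATIONS = 15, 15, 100
MIN_X, MAX_X, MIN_Y, MAX_Y = -1.5, 0.7, -1.12, 1.12
grid = mandelbrot_grid(
    WIDTH, HEIGHT, MIN_X, MIN_Y,
    (MAX_X - MIN_X) / WIDTH, (MAX_Y - MIN_Y) / HEIGHT, MAX_ITERATIONS,
)
print(render_ascii(grid, MAX_ITERATIONS))
```

Raising `WIDTH` and `HEIGHT`, or narrowing the x and y ranges, is a quick way to explore other resolutions and regions of the complex plane.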

    print()
17 changes: 17 additions & 0 deletions custom-ops-introduction/metadata.yaml
@@ -0,0 +1,17 @@
version: 1.0
long_title: "Custom Operations: An Introduction to Programming GPUs and CPUs with Mojo"
short_title: "Custom Operations: An Introduction"
author: "Brad Larson"
author_image: "author/bradlarson.jpg"
author_url: "https://www.linkedin.com/in/brad-larson-3549a5291/"
github_repo: "https://github.com/modular/max-recipes/tree/main/custom-ops-introduction"
date: "23-02-2025"
difficulty: "beginner"
tags:
- max-graph
- gpu-programming

tasks:
- magic run add_one
- magic run vector_addition
- magic run mandelbrot
12 changes: 12 additions & 0 deletions custom-ops-introduction/operations/__init__.mojo
@@ -0,0 +1,12 @@
# ===----------------------------------------------------------------------=== #
# Copyright (c) 2025, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #
42 changes: 42 additions & 0 deletions custom-ops-introduction/operations/add_one.mojo
@@ -0,0 +1,42 @@
# ===----------------------------------------------------------------------=== #
# Copyright (c) 2025, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #

import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import DeviceContextPtr

from utils.index import IndexList


@compiler.register("add_one", num_dps_outputs=1)
struct AddOne:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu".
        target: StringLiteral,
    ](
        # Because num_dps_outputs=1, the first argument is the output.
        out: ManagedTensorSlice,
        # The list of inputs starts here.
        x: ManagedTensorSlice[type = out.type, rank = out.rank],
        # The context is needed for some GPU calls.
        ctx: DeviceContextPtr,
    ):
        @parameter
        @always_inline
        fn elementwise_add_one[
            width: Int
        ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) + 1

        foreach[elementwise_add_one, target=target](out, ctx)
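
As a rough mental model of the kernel above, `foreach` can be pictured as calling the elementwise function once per chunk of the output and storing the values it returns. The plain-Python sketch below emulates that for a 1-D list. The fixed chunk width of 4, the tail handling, and the Python `foreach` stand-in are all assumptions for illustration and do not describe how MAX actually schedules or vectorizes the work.

```python
def foreach(func, out, width=4):
    """Call func once per chunk of up to `width` elements and store the result."""
    start = 0
    while start < len(out):
        # The final chunk may be narrower than `width`.
        chunk = min(width, len(out) - start)
        out[start:start + chunk] = func(start, chunk)
        start += chunk


x = [0.5, 1.5, 2.5, 3.5, 4.5]
out = [0.0] * len(x)


def elementwise_add_one(idx, width):
    """Load `width` values starting at `idx` and add one to each."""
    return [value + 1 for value in x[idx:idx + width]]


foreach(elementwise_add_one, out)
print(out)
```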