Initial custom ops introductory recipe.

modular · Feb 23, 2025 · 8afd653
1 parent 18d4767
commit 8afd653

Showing 11 changed files with 650 additions and 0 deletions.
diff --git a/custom-ops-introduction/.gitignore b/custom-ops-introduction/.gitignore
@@ -0,0 +1,8 @@
+# pixi environments
+.pixi
+*.egg-info
+# magic environments
+.magic
+.env
+# build products
+operations.mojopkg
diff --git a/custom-ops-introduction/README.md b/custom-ops-introduction/README.md
@@ -0,0 +1,115 @@
+# Custom Operations: An Introduction to Programming GPUs and CPUs with Mojo
+
+In this recipe, we will cover:
+
+* How to extend a MAX Graph using custom operations.
+* Using Mojo to write high-performance calculations that run on GPUs and CPUs.
+* The basics of GPU programming in MAX.
+
+We'll walk through three examples that show how to:
+
+* add one to every number in an input tensor,
+* perform hardware-specific addition of two vectors, and
+* calculate the Mandelbrot set on CPU and GPU.
+
+Let's get started.
+
+## Requirements
+
+Please make sure your system meets our
+[system requirements](https://docs.modular.com/max/get-started).
+
+To proceed, ensure you have the `magic` CLI installed:
+
+```bash
+curl -ssL https://magic.modular.com/ | bash
+```
+
+or update it via:
+
+```bash
+magic self-update
+```
+
+### GPU requirements
+
+These examples can all be run on either a CPU or GPU. To run them on a GPU,
+ensure your system meets
+[these GPU requirements](https://docs.modular.com/max/faq/#gpu-requirements):
+
+* Officially supported GPUs: NVIDIA Ampere A-series (A100/A10), or Ada
+  L4-series (L4/L40) data center GPUs. Unofficially, RTX 30XX and 40XX series
+  GPUs have been reported to work well with MAX.
+* NVIDIA GPU driver version 555 or higher. [Installation guide here](https://www.nvidia.com/download/index.aspx).
+
+## Quick start
+
+1. Download the code for this recipe using git:
+
+```bash
+git clone https://github.com/modular/max-recipes.git
+cd max-recipes/custom-ops-introduction
+```
+
+2. Run each of the examples:
+
+```bash
+magic run add_one
+magic run vector_addition
+magic run mandelbrot
+```
+
+3. Browse through the commented source code to see how they work.
+
+## Custom operation examples
+
+Graphs in MAX can be extended to use custom operations written in Mojo. The
+following examples are shown here:
+
+* **add_one**: Adding 1 to every element of an input tensor.
+* **vector_addition**: Performing vector addition using a manual GPU function.
+* **mandelbrot**: Calculating the Mandelbrot set.
+
+Custom operations have been written in Mojo to carry out these calculations. For
+each example, a simple graph containing a single operation is constructed
+in Python. This graph is compiled and dispatched onto a supported GPU if one is
+available, or the CPU if not. Input tensors, if there are any, are moved from
+the host to the device on which the graph is running. The graph then runs and
+the results are copied back to the host for display.
+
+Note that the same Mojo code runs on both CPU and GPU. When the graph is
+constructed, it is placed on a supported accelerator if one is available and
+falls back to the CPU if not. No code changes are needed for either path.
+The `vector_addition` example shows how this works under the hood for common
+MAX abstractions, where compile-time specialization lets MAX choose the optimal
+code path for a given hardware architecture.
+
+The `operations/` directory contains the custom kernel implementations, and the
+graph construction occurs in the Python files in the base directory. These
+examples are designed to stand on their own, so that they can be used as
+templates for experimentation.
+
+The execution has two phases: first an `operations.mojopkg` is compiled from the
+custom Mojo kernel, and then the graph is constructed and run in Python. The
+inference session is pointed to the `operations.mojopkg` in order to load the
+custom operations.
+
+## Conclusion
+
+In this recipe, we've introduced the basics of how to write custom MAX Graph
+operations using Mojo, place them in a one-operation graph in Python, and run
+them on an available CPU or GPU.
+
+## Next Steps
+
+* Follow [our tutorial for building a custom operation from scratch](https://docs.modular.com/max/tutorials/build-custom-ops).
+
+* Explore MAX's [documentation](https://docs.modular.com/max/) for additional
+  features. The [`gpu`](https://docs.modular.com/mojo/stdlib/gpu/) module has
+  detail on Mojo's GPU programming functions and types, and the documentation
+  on [`@compiler.register`](https://docs.modular.com/max/api/mojo-decorators/compiler-register/)
+  shows how to register custom graph operations.
+
+* Join our [Modular Forum](https://forum.modular.com/) and [Discord community](https://discord.gg/modular) to share your experiences and get support.
+
+We're excited to see what you'll build with MAX! Share your projects and experiences with us using `#ModularAI` on social media.
diff --git a/custom-ops-introduction/add_one.py b/custom-ops-introduction/add_one.py
@@ -0,0 +1,74 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+from pathlib import Path
+
+import numpy as np
+from max.driver import CPU, Accelerator, Tensor, accelerator_count
+from max.dtype import DType
+from max.engine import InferenceSession
+from max.graph import Graph, TensorType, ops
+
+if __name__ == "__main__":
+    path = Path(__file__).parent / "operations.mojopkg"
+
+    rows = 5
+    columns = 10
+    dtype = DType.float32
+
+    # Configure our simple one-operation graph.
+    graph = Graph(
+        "addition",
+        # The custom Mojo operation is referenced by its string name, and we
+        # need to provide inputs as a list as well as expected output types.
+        forward=lambda x: ops.custom(
+            name="add_one",
+            values=[x],
+            out_types=[TensorType(dtype=x.dtype, shape=x.tensor.shape)],
+        )[0].tensor,
+        input_types=[
+            TensorType(dtype, shape=[rows, columns]),
+        ],
+    )
+
+    # Place the graph on a GPU, if available. Fall back to CPU if not.
+    device = CPU() if accelerator_count() == 0 else Accelerator()
+
+    # Set up an inference session for running the graph.
+    session = InferenceSession(
+        devices=[device],
+        custom_extensions=path,
+    )
+
+    # Compile the graph.
+    model = session.load(graph)
+
+    # Fill an input matrix with random values.
+    x_values = np.random.uniform(size=(rows, columns)).astype(np.float32)
+
+    # Create a driver tensor from this, and move it to the target device.
+    x = Tensor.from_numpy(x_values).to(device)
+
+    # Perform the calculation on the target device.
+    result = model.execute(x)[0]
+
+    # Copy values back to the CPU to be read.
+    assert isinstance(result, Tensor)
+    result = result.to(CPU())
+
+    print("Graph result:")
+    print(result.to_numpy())
+    print()
+
+    print("Expected result:")
+    print(x_values + 1)
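The script above prints the graph result and the expected result side by side and leaves the comparison to the reader. As a minimal sketch of checking the two programmatically, the following uses only the Python standard library. The helper names `add_one_reference` and `matrices_close` are hypothetical and not part of the recipe, and the "graph result" here is a stand-in computed on the host rather than real output from MAX.

```python
import math
import random


def add_one_reference(matrix):
    """Return a new matrix with 1 added to every element."""
    return [[value + 1 for value in row] for row in matrix]


def matrices_close(actual, expected, rel_tol=1e-6, abs_tol=1e-6):
    """Compare two nested lists element by element within a tolerance."""
    if len(actual) != len(expected):
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        for a, e in zip(row_a, row_e):
            if not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


rows, columns = 5, 10
x_values = [[random.random() for _ in range(columns)] for _ in range(rows)]

# Stand-in for the graph output (assumption): in add_one.py this would be
# result.to_numpy().tolist() after the result is copied back to the CPU.
graph_result = add_one_reference(x_values)

expected = [[value + 1 for value in row] for row in x_values]
print("Results match:", matrices_close(graph_result, expected))
```

A tolerance-based comparison is used instead of exact equality because float32 results computed on a device can differ from host results in the last few bits.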
diff --git a/custom-ops-introduction/mandelbrot.py b/custom-ops-introduction/mandelbrot.py
@@ -0,0 +1,94 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+from pathlib import Path
+
+from max.driver import CPU, Accelerator, Tensor, accelerator_count
+from max.dtype import DType
+from max.engine import InferenceSession
+from max.graph import Graph, TensorType, ops
+
+
+def create_mandelbrot_graph(
+    width: int,
+    height: int,
+    min_x: float,
+    min_y: float,
+    scale_x: float,
+    scale_y: float,
+    max_iterations: int,
+) -> Graph:
+    """Configure a graph to run a Mandelbrot kernel."""
+    output_dtype = DType.int32
+    with Graph(
+        "mandelbrot",
+    ) as graph:
+        # The custom Mojo operation is referenced by its string name, and we
+        # need to provide inputs as a list as well as expected output types.
+        result = ops.custom(
+            name="mandelbrot",
+            values=[
+                ops.constant(min_x, dtype=DType.float32),
+                ops.constant(min_y, dtype=DType.float32),
+                ops.constant(scale_x, dtype=DType.float32),
+                ops.constant(scale_y, dtype=DType.float32),
+                ops.constant(max_iterations, dtype=DType.int32),
+            ],
+            out_types=[TensorType(dtype=output_dtype, shape=[height, width])],
+        )[0].tensor
+
+        # Return the result of the custom operation as the output of the graph.
+        graph.output(result)
+        return graph
+
+
+if __name__ == "__main__":
+    path = Path(__file__).parent / "operations.mojopkg"
+
+    # Establish Mandelbrot set ranges.
+    WIDTH = 15
+    HEIGHT = 15
+    MAX_ITERATIONS = 100
+    MIN_X = -1.5
+    MAX_X = 0.7
+    MIN_Y = -1.12
+    MAX_Y = 1.12
+
+    # Configure our simple graph.
+    scale_x = (MAX_X - MIN_X) / WIDTH
+    scale_y = (MAX_Y - MIN_Y) / HEIGHT
+    graph = create_mandelbrot_graph(
+        WIDTH, HEIGHT, MIN_X, MIN_Y, scale_x, scale_y, MAX_ITERATIONS
+    )
+
+    # Place the graph on a GPU, if available. Fall back to CPU if not.
+    device = CPU() if accelerator_count() == 0 else Accelerator()
+
+    # Set up an inference session for running the graph on the chosen device.
+    session = InferenceSession(
+        devices=[device],
+        custom_extensions=path,
+    )
+    # Compile the graph.
+    model = session.load(graph)
+
+    # Perform the calculation on the target device.
+    result = model.execute()[0]
+
+    # Copy values back to the CPU to be read.
+    assert isinstance(result, Tensor)
+    result = result.to(CPU())
+
+    print("Iterations to escape:")
+    print(result.to_numpy())
+    print()
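To sanity-check the "Iterations to escape" grid printed by the script above, a host-side reference can help. The sketch below is plain Python and is not the Mojo kernel from this commit. It assumes the standard escape-time algorithm, an escape test of |z| > 2, and a pixel mapping of `min + index * scale`. The real kernel may differ in any of these details, so small differences in the counts are possible. The function name `mandelbrot_reference` is hypothetical.

```python
def mandelbrot_reference(
    width, height, min_x, min_y, scale_x, scale_y, max_iterations
):
    """Return a height x width grid of escape-time iteration counts."""
    grid = []
    for row in range(height):
        line = []
        for col in range(width):
            # Assumed pixel-to-plane mapping: one scale step per pixel.
            c = complex(min_x + col * scale_x, min_y + row * scale_y)
            z = 0j
            iterations = 0
            # Assumed escape test: stop once |z| exceeds 2.
            while iterations < max_iterations and abs(z) <= 2.0:
                z = z * z + c
                iterations += 1
            line.append(iterations)
        grid.append(line)
    return grid


# Same ranges as in mandelbrot.py.
WIDTH, HEIGHT, MAX_ITERATIONS = 15, 15, 100
MIN_X, MAX_X = -1.5, 0.7
MIN_Y, MAX_Y = -1.12, 1.12

reference = mandelbrot_reference(
    WIDTH,
    HEIGHT,
    MIN_X,
    MIN_Y,
    (MAX_X - MIN_X) / WIDTH,
    (MAX_Y - MIN_Y) / HEIGHT,
    MAX_ITERATIONS,
)
for line in reference:
    print(" ".join(f"{count:3d}" for count in line))
```

Points that never escape report `max_iterations`, which is why the interior of the set shows up as a block of identical values.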
diff --git a/custom-ops-introduction/metadata.yaml b/custom-ops-introduction/metadata.yaml
@@ -0,0 +1,17 @@
+version: 1.0
+long_title: "Custom Operations: An Introduction to Programming GPUs and CPUs with Mojo"
+short_title: "Custom Operations: An Introduction"
+author: "Brad Larson"
+author_image: "author/bradlarson.jpg"
+author_url: "https://www.linkedin.com/in/brad-larson-3549a5291/"
+github_repo: "https://github.com/modular/max-recipes/tree/main/custom-ops-introduction"
+date: "23-02-2025"
+difficulty: "beginner"
+tags:
+  - max-graph
+  - gpu-programming
+
+tasks:
+  - magic run add_one
+  - magic run vector_addition
+  - magic run mandelbrot
diff --git a/custom-ops-introduction/operations/__init__.mojo b/custom-ops-introduction/operations/__init__.mojo
@@ -0,0 +1,12 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
diff --git a/custom-ops-introduction/operations/add_one.mojo b/custom-ops-introduction/operations/add_one.mojo
@@ -0,0 +1,42 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+import compiler
+from max.tensor import ManagedTensorSlice, foreach
+from runtime.asyncrt import DeviceContextPtr
+
+from utils.index import IndexList
+
+
+@compiler.register("add_one", num_dps_outputs=1)
+struct AddOne:
+    @staticmethod
+    fn execute[
+        # The kind of device this will be run on: "cpu" or "gpu"
+        target: StringLiteral,
+    ](
+        # Because num_dps_outputs=1, the first argument is the output.
+        out: ManagedTensorSlice,
+        # The inputs start here.
+        x: ManagedTensorSlice[type = out.type, rank = out.rank],
+        # The context is needed for some GPU calls.
+        ctx: DeviceContextPtr,
+    ):
+        @parameter
+        @always_inline
+        fn elementwise_add_one[
+            width: Int
+        ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
+            return x.load[width](idx) + 1
+
+        foreach[elementwise_add_one, target=target](out, ctx)
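The kernel above passes `foreach` a function that receives an index and a SIMD `width` and returns `width` values. As a rough mental model only, the plain-Python sketch below mimics that calling pattern for a rank-2 tensor. It is not how MAX implements `foreach`: the function name `foreach_model`, the fixed `simd_width`, and the tail handling are all assumptions made for illustration.

```python
def foreach_model(func, rows, columns, simd_width=4):
    """Fill a rows x columns grid by calling func on chunks of each row."""
    out = [[0.0] * columns for _ in range(rows)]
    for row in range(rows):
        col = 0
        while col < columns:
            # Use a full-width chunk while one fits, then a narrower chunk
            # for the remainder (a simplification of real tail handling).
            width = min(simd_width, columns - col)
            out[row][col:col + width] = func((row, col), width)
            col += width
    return out


# Input data playing the role of the `x` tensor slice.
x = [[float(r * 10 + c) for c in range(10)] for r in range(5)]


def elementwise_add_one(idx, width):
    """Python counterpart of `x.load[width](idx) + 1` in the Mojo kernel."""
    row, col = idx
    return [value + 1 for value in x[row][col:col + width]]


result = foreach_model(elementwise_add_one, rows=5, columns=10)
print(result[0])
```

The point of the pattern is that the elementwise function describes the work for a single chunk, while the surrounding machinery decides how indices are traversed on the target device.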