diff --git a/gpu-functions-mojo/.gitignore b/gpu-functions-mojo/.gitignore
new file mode 100644
index 0000000..ba386de
--- /dev/null
+++ b/gpu-functions-mojo/.gitignore
@@ -0,0 +1,8 @@
+# pixi environments
+.pixi
+*.egg-info
+# magic environments
+.magic
+.env
+# build products
+operations.mojopkg
diff --git a/gpu-functions-mojo/README.md b/gpu-functions-mojo/README.md
new file mode 100644
index 0000000..55804a9
--- /dev/null
+++ b/gpu-functions-mojo/README.md
@@ -0,0 +1,433 @@
+# GPU Functions in Mojo: A CUDA-style Programming Interface
+
+In this recipe, we will cover:
+
+* Writing thread-parallel GPU functions in Mojo.
+* Compiling and dispatching these functions to a GPU using the MAX Driver API.
+* Translating common CUDA C programming patterns to MAX and Mojo.
+
+We'll walk through four GPU programming examples that
+
+* perform basic vector addition,
+* convert a color image to grayscale,
+* do a naive matrix multiplication,
+* and calculate the Mandelbrot set fractal.
+
+Let's get started.
+
+## Requirements
+
+Please make sure your system meets our
+[system requirements](https://docs.modular.com/max/get-started).
+
+To proceed, ensure you have the `magic` CLI installed:
+
+```bash
+curl -ssL https://magic.modular.com/ | bash
+```
+
+or update it via:
+
+```bash
+magic self-update
+```
+
+### GPU requirements
+
+These examples require a MAX-compatible GPU, satisfying
+[these requirements](https://docs.modular.com/max/faq/#gpu-requirements):
+
+* Officially supported GPUs: NVIDIA Ampere A-series (A100/A10), or Ada
+  L4-series (L4/L40) data center GPUs. Unofficially, RTX 30XX and 40XX series
+  GPUs have been reported to work well with MAX.
+* NVIDIA GPU driver version 555 or higher. [Installation guide here](https://www.nvidia.com/download/index.aspx).
+
+## Quick start
+
+1. Download the code for this recipe using git:
+
+```bash
+git clone https://github.com/modular/max-recipes.git
+cd max-recipes/gpu-functions-mojo
+```
+
+2. Run the examples:
+
+```bash
+magic run vector_addition
+magic run grayscale
+magic run naive_matrix_multiplication
+magic run mandelbrot
+```
+
+## Compiling and running GPU functions using the Mojo Driver API
+
+MAX is a flexible, hardware-independent framework for programming GPUs and
+CPUs. It lets you get the most out of accelators without requiring
+architecture-specific frameworks like CUDA. MAX has many levels. from highly
+optimized AI models, to the computational graphs that define those models, to
+direct GPU programming in the Mojo language. You can use whatever level of
+abstraction best suits your needs in MAX.
+
+[Mojo](https://docs.modular.com/mojo/manual/) is a Python-family language that
+has been built for high-performance computing. It allows you to write custom
+algorithms for GPUs without the use of CUDA or other vendor-specific libraries.
+All of the operations that power AI models within MAX are written in Mojo.
+
+We'll demonstrate an entry point into GPU programming with MAX where individual
+thread-based functions can be defined, compiled, and dispatched onto a GPU.
+This is powered by the MAX Driver API, which handles all the hardware-specific
+details of allocating and transferring memory between host and accelerator, as
+well as compilation and execution of accelerator-targeted functions.
+
+The first three examples in this recipe show common starting points for
+thread-based GPU programming. They follow the first three examples in the
+popular GPU programming textbook
+["Programming Massively Parallel Processors"](https://www.sciencedirect.com/book/9780323912310/programming-massively-parallel-processors): 
+
+* Parallel addition of two vectors
+* Conversion of a red-green-blue image to grayscale
+* A naive matrix multiplication, with no hardware-specific optimization
+
+### Basic vector addition
+
+The common "hello world" example used for data-parallel programming is the
+addition of each element in two vectors. Let's take a look at how that can be
+defined in MAX.
+
+The function itself is very simple, running once per thread, adding each
+element in the two input vectors that correspond to that thread ID, and storing
+the result in the output vector at the matching location. If a block of threads
+is dispatched that is wider than the vector length, no work is performed by
+threads past the end of the vector.
+
+```mojo
+fn vector_addition(
+    length: Int,
+    lhs: TensorType,
+    rhs: TensorType,
+    out: TensorType,
+):
+    tid = block_dim.x * block_idx.x + thread_idx.x
+    if tid < length:
+        var result = lhs[tid] + rhs[tid]
+        out[tid] = result
+```
+
+To run this function on the GPU, input and output vectors need to be
+allocated, the function compiled and dispatched, and the results returned.
+
+As a first step, a reference to the host (CPU) and accelerator (GPU) devices
+needs to be obtained:
+
+```mojo
+gpu_device = accelerator()
+host_device = cpu()
+```
+
+If no MAX-compatible accelerator has been found, this will error out and
+indicate as such.
+
+Next, the left-hand-side and right-hand-side vectors need to be allocated on
+the CPU, initialized with values, and then moved to the GPU for use by our
+function. This is done using the MAX Driver API's `Tensor` type
+
+```mojo
+alias float_dtype = DType.float32
+alias tensor_rank = 1
+alias VECTOR_WIDTH = 10
+
+lhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device)
+rhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device)
+
+for i in range(VECTOR_WIDTH):
+    lhs_tensor[i] = 1.25
+    rhs_tensor[i] = 2.5
+
+lhs_tensor = lhs_tensor.move_to(gpu_device)
+rhs_tensor = rhs_tensor.move_to(gpu_device)
+```
+
+A tensor that hosts the result of the calculation is allocated on the GPU:
+
+```mojo
+out_tensor = Tensor[float_dtype, tensor_rank](
+    (VECTOR_WIDTH), gpu_device
+)
+```
+
+The actual `vector_addition` function we want to run on the GPU is compiled and
+dispatched across a grid, divided into blocks of threads. All arguments to this
+GPU function are provided here, in an order that corresponds to their location
+in the function signature. Note that in MAX, the GPU function is compiled for
+the GPU at the time of compilation of the Mojo file containing it.
+
+```mojo
+gpu_function = Accelerator.compile[vector_addition](gpu_device)
+
+alias BLOCK_SIZE = 16
+var num_blocks = ceildiv(VECTOR_WIDTH, BLOCK_SIZE)
+
+gpu_function(
+    gpu_device,
+    VECTOR_WIDTH,
+    lhs_tensor.unsafe_slice(),
+    rhs_tensor.unsafe_slice(),
+    out_tensor.unsafe_slice(),
+    grid_dim=Dim(num_blocks),
+    block_dim=Dim(BLOCK_SIZE),
+)
+```
+
+Finally, the results of the calculation are moved from the GPU back to the host
+to be examined:
+
+```mojo
+out_tensor = out_tensor.move_to(host_device)
+
+print("Resulting vector:", out_tensor)
+```
+
+To try out this example yourself, run it using the following command:
+
+```sh
+magic run vector_addition
+```
+
+You should observe a vector of all 3.75. Experiment with changing the vector
+length, the block size, and other parameters and see how the calculation
+scales.
+
+### Conversion of a color image to grayscale
+
+As a slightly more complex example, the next step in this recipe shows how to
+convert a red-green-blue (RGB) color image into grayscale. This uses a rank-3
+tensor to host the 2-D image and the color channels at each pixel. The inputs
+start with three color channels, and the output only has a single grayscale channel.
+
+The calculation performed is a common reduction to luminance using weighted
+values for the three channels:
+
+```mojo
+gray = 0.21 * red + 0.71 * green + 0.07 * blue
+```
+
+And this is the per-thread function that expresses this to be run on the GPU:
+
+```mojo
+fn color_to_grayscale_conversion(
+    width: Int,
+    height: Int,
+    image: TensorType,
+    out: TensorType,
+):
+    row = block_dim.y * block_idx.y + thread_idx.y
+    col = block_dim.x * block_idx.x + thread_idx.x
+
+    if col < width and row < height:
+        red = image[row, col, 0].cast[internal_float_dtype]()
+        green = image[row, col, 1].cast[internal_float_dtype]()
+        blue = image[row, col, 2].cast[internal_float_dtype]()
+        gray = 0.21 * red + 0.71 * green + 0.07 * blue
+
+        out[row, col, 0] = gray.cast[channel_dtype]()
+```
+
+The setup, compilation, and execution of this function is much the same as in
+the previous example, but in this case we're using rank-3 instead of rank-1
+`Tensor`s to hold the values. Also, the function is dispatched over a 2-D grid
+of block, which looks like the following:
+
+```mojo
+alias BLOCK_SIZE = 16
+num_col_blocks = ceildiv(IMAGE_WIDTH, BLOCK_SIZE)
+num_row_blocks = ceildiv(IMAGE_HEIGHT, BLOCK_SIZE)
+
+gpu_function(
+    gpu_device,
+    IMAGE_WIDTH,
+    IMAGE_HEIGHT,
+    rgb_tensor.unsafe_slice(),
+    gray_tensor.unsafe_slice(),
+    grid_dim=Dim(num_col_blocks, num_row_blocks),
+    block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE),
+)
+```
+
+To run this example, use the command:
+
+```sh
+magic run grayscale
+```
+
+This will show a grid of numbers representing the grayscale values for a single
+color broadcast across a simple input image. Try changing the image and block
+sizes to see how this scales on the GPU.
+
+### Naive matrix multiplication
+
+The next example performs a very basic matrix multiplication, with no
+optimizations to take advantage of hardware resources. The GPU function for
+this looks like the following:
+
+```mojo
+fn naive_matrix_multiplication(
+    i: Int,
+    j: Int,
+    k: Int,
+    m: TensorType,
+    n: TensorType,
+    p: TensorType,
+):
+    row = block_dim.y * block_idx.y + thread_idx.y
+    col = block_dim.x * block_idx.x + thread_idx.x
+
+    if row < i and col < k:
+        for j_index in range(j):
+            p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col]
+```
+
+The overall setup and execution of this function are extremely similar to the
+previous example, with the primary change being the function that is run on the
+GPU.
+
+To try out this example, run the command:
+
+```sh
+magic run naive_matrix_multiplication
+```
+
+You will see the two input matrices printed to the console, as well as the
+result of their multiplication. As with the previous examples, try out changing
+the sizes of the matrices and how they are dispatched on the GPU.
+
+### Calculating the Mandelbrot set fractal
+
+The final example in this recipe shows a slightly more complex calculation
+(pun intended):
+[the Mandelbrot set fractal](https://en.wikipedia.org/wiki/Mandelbrot_set).
+This custom operation takes no input tensors, only a set of scalar arguments,
+and returns a 2-D matrix of integer values representing the number of
+iterations it took to escape at that location in complex number space.
+
+The per-thread GPU function for this is as follows:
+
+```mojo
+fn mandelbrot(
+    min_x: Scalar[float_dtype],
+    min_y: Scalar[float_dtype],
+    scale_x: Scalar[float_dtype],
+    scale_y: Scalar[float_dtype],
+    max_iterations: Scalar[int_dtype],
+    out: TensorType,
+):
+    var row = block_dim.y * block_idx.y + thread_idx.y
+    var col = block_dim.x * block_idx.x + thread_idx.x
+
+    var cx = min_x + col * scale_x
+    var cy = min_y + row * scale_y
+    var c = ComplexSIMD[float_dtype, 1](cx, cy)
+
+    var z = ComplexSIMD[float_dtype, 1](0, 0)
+    var iters = Scalar[int_dtype](0)
+
+    var in_set_mask: Scalar[DType.bool] = True
+    for _ in range(max_iterations):
+        if not any(in_set_mask):
+            break
+        in_set_mask = z.squared_norm() <= 4
+        iters = in_set_mask.select(iters + 1, iters)
+        z = z.squared_add(c)
+
+    out[row, col] = iters
+```
+
+This begins by calculating the complex number which represents a given location
+in the output grid (C). Then, starting from `Z=0`, the calculation `Z=Z^2 + C`
+is iteratively calculated until Z exceeds 4, the threshold we're using for when
+Z will escape the set. This occurs up until a maximum number of iterations,
+and the number of iterations to escape (or not, if the maximum is hit) is then
+returned for each location in the grid.
+
+The area to examine in complex space, the resolution of the grid, and the
+maximum number of iterations, are all provided as constants:
+
+```mojo
+alias MIN_X: Scalar[float_dtype] = -2.0
+alias MAX_X: Scalar[float_dtype] = 0.7
+alias MIN_Y: Scalar[float_dtype] = -1.12
+alias MAX_Y: Scalar[float_dtype] = 1.12
+alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
+alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
+alias MAX_ITERATIONS = 100
+```
+
+The Mandelbrot set can be calculated on GPU using this command:
+
+```sh
+magic run mandelbrot
+```
+
+The result should be an ASCII art depiction of the region covered by the
+calculation:
+
+```
+...................................,,,,c@8cc,,,.............
+...............................,,,,,,cc8M @Mjc,,,,..........
+............................,,,,,,,ccccM@aQaM8c,,,,,........
+..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
+.......................,,,,,,,,c8888M@j,    ,wMM8cccc,,.....
+.....................,,,,,,cccMQOPjjPrgg,   OrwrwMMMjjc,....
+..................,,,,cccccc88MaP  @            ,pGa.g8c,...
+...............,,cccccccc888MjQp.                   o@8cc,..
+..........,,,,c8jjMMMMMMMMM@@w.                      aj8c,,.
+.....,,,,,,ccc88@QEJwr.wPjjjwG                        w8c,,.
+..,,,,,,,cccccMMjwQ       EpQ                         .8c,,,
+.,,,,,,cc888MrajwJ                                   MMcc,,,
+.cc88jMMM@@jaG.                                     oM8cc,,,
+.cc88jMMM@@jaG.                                     oM8cc,,,
+.,,,,,,cc888MrajwJ                                   MMcc,,,
+..,,,,,,,cccccMMjwQ       EpQ                         .8c,,,
+.....,,,,,,ccc88@QEJwr.wPjjjwG                        w8c,,.
+..........,,,,c8jjMMMMMMMMM@@w.                      aj8c,,.
+...............,,cccccccc888MjQp.                   o@8cc,..
+..................,,,,cccccc88MaP  @            ,pGa.g8c,...
+.....................,,,,,,cccMQOEjjPrgg,   OrwrwMMMjjc,....
+.......................,,,,,,,,c8888M@j,    ,wMM8cccc,,.....
+..........................,,,,,,,ccc88g.o. Owg8ccc,,,,......
+............................,,,,,,,ccccM@aQaM8c,,,,,........
+...............................,,,,,,cc8M @Mjc,,,,..........
+```
+
+Try changing the various parameters above to produce different resolution
+grids, or look into different areas in the complex number space.
+
+## Conclusion
+
+In this recipe, we've demonstrated how to perform the very basics of
+thread-parallel GPU programming in MAX using Mojo and the MAX Driver API. This
+is a programming model that is very familiar to those used to CUDA C, and we
+have used a series of common examples to show how to map concepts from CUDA to
+MAX.
+
+MAX has far more power available for fully utilizing GPUs through 
+[building computational graphs](https://docs.modular.com/max/tutorials/get-started-with-max-graph-in-python)
+and then [defining custom operations](https://docs.modular.com/max/tutorials/build-custom-ops)
+in Mojo using hardware-optimized abstractions. MAX is an extremely flexible
+framework for programming GPUs, with interfaces at many levels. In this recipe,
+you've been introduced to the very basics of getting started with GPU
+programming in MAX and Mojo, but there's much more to explore!
+
+## Next Steps
+
+* Try applying GPU programming in MAX for more complex workloads via tutorials
+  on the [MAX Graph API](https://docs.modular.com/max/tutorials/get-started-with-max-graph-in-python)
+  and [defining custom GPU graph operations in Mojo](https://docs.modular.com/max/tutorials/build-custom-ops).
+
+* Explore MAX's [documentation](https://docs.modular.com/max/) for additional
+  features. The [`gpu`](https://docs.modular.com/mojo/stdlib/gpu/) module has
+  detail on Mojo's GPU programming functions and types.
+
+* Join our [Modular Forum](https://forum.modular.com/) and [Discord community](https://discord.gg/modular) to share your experiences and get support.
+
+We're excited to see what you'll build with MAX! Share your projects and experiences with us using `#ModularAI` on social media.
diff --git a/gpu-functions-mojo/grayscale.mojo b/gpu-functions-mojo/grayscale.mojo
new file mode 100755
index 0000000..1681321
--- /dev/null
+++ b/gpu-functions-mojo/grayscale.mojo
@@ -0,0 +1,134 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+
+from gpu.host import Dim
+from gpu.id import block_dim, block_idx, thread_idx
+from math import ceildiv
+from max.driver import (
+    Accelerator,
+    Device,
+    DynamicTensor,
+    Tensor,
+    accelerator,
+    cpu,
+)
+from sys import has_nvidia_gpu_accelerator
+
+alias channel_dtype = DType.uint8
+alias internal_float_dtype = DType.float32
+alias tensor_rank = 3
+alias TensorType = DynamicTensor[type=channel_dtype, rank=tensor_rank].Type
+
+
+def print_image[h: Int, w: Int](t: Tensor[channel_dtype, 3]):
+    """A helper function to print out the grayscale channel intensities."""
+    out = t.unsafe_slice()
+    for row in range(h):
+        for col in range(w):
+            var v = out[row, col, 0]
+            if v < 100:
+                print(" ", end="")
+                if v < 10:
+                    print(" ", end="")
+            print(v, " ", end="")
+        print("")
+
+
+fn color_to_grayscale_conversion(
+    width: Int,
+    height: Int,
+    image: TensorType,
+    out: TensorType,
+):
+    """Converting each RGB pixel to grayscale, parallelized across the output tensor on the GPU.
+    """
+    row = block_dim.y * block_idx.y + thread_idx.y
+    col = block_dim.x * block_idx.x + thread_idx.x
+
+    if col < width and row < height:
+        red = image[row, col, 0].cast[internal_float_dtype]()
+        green = image[row, col, 1].cast[internal_float_dtype]()
+        blue = image[row, col, 2].cast[internal_float_dtype]()
+        gray = 0.21 * red + 0.71 * green + 0.07 * blue
+
+        out[row, col, 0] = gray.cast[channel_dtype]()
+
+
+def main():
+    @parameter
+    if has_nvidia_gpu_accelerator():
+        # Attempt to connect to a compatible GPU. If one is not found, this will
+        # error out and exit.
+        gpu_device = accelerator()
+        host_device = cpu()
+
+        alias IMAGE_WIDTH = 5
+        alias IMAGE_HEIGHT = 10
+        alias NUM_CHANNELS = 3
+
+        # Allocate the input image tensor on the host.
+        rgb_tensor = Tensor[channel_dtype, tensor_rank](
+            (IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS), host_device
+        )
+
+        # Fill the image with initial colors.
+        for row in range(IMAGE_HEIGHT):
+            for col in range(IMAGE_WIDTH):
+                rgb_tensor[row, col, 0] = row + col
+                rgb_tensor[row, col, 1] = row + col + 20
+                rgb_tensor[row, col, 2] = row + col + 40
+
+        # Move the image tensor to the accelerator.
+        rgb_tensor = rgb_tensor.move_to(gpu_device)
+
+        # Allocate a tensor on the accelerator to host the grayscale image.
+        gray_tensor = Tensor[channel_dtype, tensor_rank](
+            (IMAGE_HEIGHT, IMAGE_WIDTH, 1), gpu_device
+        )
+
+        # Compile the function to run across a grid on the GPU.
+        gpu_function = Accelerator.compile[color_to_grayscale_conversion](
+            gpu_device
+        )
+
+        # The grid is divided up into blocks, making sure there's an extra
+        # full block for any remainder. This hasn't been tuned for any specific
+        # GPU.
+        alias BLOCK_SIZE = 16
+        num_col_blocks = ceildiv(IMAGE_WIDTH, BLOCK_SIZE)
+        num_row_blocks = ceildiv(IMAGE_HEIGHT, BLOCK_SIZE)
+
+        # Launch the compiled function on the GPU. The target device is specified
+        # first, followed by all function arguments. The last two named parameters
+        # are the dimensions of the grid in blocks, and the block dimensions.
+        gpu_function(
+            gpu_device,
+            IMAGE_WIDTH,
+            IMAGE_HEIGHT,
+            rgb_tensor.unsafe_slice(),
+            gray_tensor.unsafe_slice(),
+            grid_dim=Dim(num_col_blocks, num_row_blocks),
+            block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE),
+        )
+
+        # Move the output tensor back onto the CPU so that we can read the results.
+        gray_tensor = gray_tensor.move_to(host_device)
+
+        print("Resulting grayscale image:")
+        print_image[IMAGE_HEIGHT, IMAGE_WIDTH](gray_tensor)
+    else:
+        print(
+            "These examples require a MAX-compatible NVIDIA GPU and none was"
+            " detected."
+        )
diff --git a/gpu-functions-mojo/mandelbrot.mojo b/gpu-functions-mojo/mandelbrot.mojo
new file mode 100755
index 0000000..b70c949
--- /dev/null
+++ b/gpu-functions-mojo/mandelbrot.mojo
@@ -0,0 +1,136 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+from collections.string import StringSlice
+from complex import ComplexSIMD
+from gpu.host import Dim
+from gpu.id import thread_idx, block_dim, block_idx
+from math import ceildiv
+from max.driver import Accelerator, DynamicTensor, Tensor, accelerator, cpu
+from sys import has_nvidia_gpu_accelerator
+
+alias float_dtype = DType.float32
+alias int_dtype = DType.int32
+alias TensorType = DynamicTensor[type=int_dtype, rank=2].Type
+
+
+def draw_mandelbrot[h: Int, w: Int](t: Tensor[int_dtype, 2], max: Int):
+    """A helper function to visualize the Mandelbrot set in ASCII art."""
+    alias sr = StringSlice("....,c8M@jawrpogOQEPGJ")
+    out = t.unsafe_slice()
+    for row in range(h):
+        for col in range(w):
+            var v = out[row, col]
+            if v < max:
+                var idx = Int(v % len(sr))
+                var p = sr[idx]
+                print(p, end="")
+            else:
+                print(" ", end="")
+        print("")
+
+
+fn mandelbrot(
+    min_x: Scalar[float_dtype],
+    min_y: Scalar[float_dtype],
+    scale_x: Scalar[float_dtype],
+    scale_y: Scalar[float_dtype],
+    max_iterations: Scalar[int_dtype],
+    out: TensorType,
+):
+    """The per-element calculation of iterations to escape in the Mandelbrot set.
+    """
+    # Obtain the position in the grid from the X, Y thread locations.
+    var row = block_dim.y * block_idx.y + thread_idx.y
+    var col = block_dim.x * block_idx.x + thread_idx.x
+
+    # Calculate the complex C corresponding to that grid location.
+    var cx = min_x + col * scale_x
+    var cy = min_y + row * scale_y
+    var c = ComplexSIMD[float_dtype, 1](cx, cy)
+
+    # Perform the Mandelbrot iteration loop calculation.
+    var z = ComplexSIMD[float_dtype, 1](0, 0)
+    var iters = Scalar[int_dtype](0)
+
+    var in_set_mask: Scalar[DType.bool] = True
+    for _ in range(max_iterations):
+        if not any(in_set_mask):
+            break
+        in_set_mask = z.squared_norm() <= 4
+        iters = in_set_mask.select(iters + 1, iters)
+        z = z.squared_add(c)
+
+    # Write out the resulting iterations to escape.
+    out[row, col] = iters
+
+
+def main():
+    @parameter
+    if has_nvidia_gpu_accelerator():
+        # Attempt to connect to a compatible GPU. If one is not found, this will
+        # error out and exit.
+        gpu_device = accelerator()
+        host_device = cpu()
+
+        # Compile the function to run across a grid on the GPU.
+        gpu_function = Accelerator.compile[mandelbrot](gpu_device)
+
+        # Set the resolution of the Mandelbrot set grid that will be calculated.
+        alias GRID_WIDTH = 60
+        alias GRID_HEIGHT = 25
+
+        # The grid is divided up into blocks, making sure there's an extra
+        # full block for any remainder. This hasn't been tuned for any specific
+        # GPU.
+        alias BLOCK_SIZE = 16
+        num_col_blocks = ceildiv(GRID_WIDTH, BLOCK_SIZE)
+        num_row_blocks = ceildiv(GRID_HEIGHT, BLOCK_SIZE)
+
+        # Set the parameters for the area of the Mandelbrot set we'll be examining.
+        alias MIN_X: Scalar[float_dtype] = -2.0
+        alias MAX_X: Scalar[float_dtype] = 0.7
+        alias MIN_Y: Scalar[float_dtype] = -1.12
+        alias MAX_Y: Scalar[float_dtype] = 1.12
+        alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH
+        alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT
+        alias MAX_ITERATIONS = 100
+
+        # Allocate a tensor on the target device to hold the resulting set.
+        out_tensor = Tensor[int_dtype, 2]((GRID_HEIGHT, GRID_WIDTH), gpu_device)
+
+        # Launch the compiled function on the GPU. The target device is specified
+        # first, followed by all function arguments. The last two named parameters
+        # are the dimensions of the grid in blocks, and the block dimensions.
+        gpu_function(
+            gpu_device,
+            MIN_X,
+            MIN_Y,
+            SCALE_X,
+            SCALE_Y,
+            MAX_ITERATIONS,
+            out_tensor.unsafe_slice(),
+            grid_dim=Dim(num_col_blocks, num_row_blocks),
+            block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE),
+        )
+
+        # Move the output tensor back onto the CPU so that we can read the results.
+        out_tensor = out_tensor.move_to(host_device)
+
+        # Draw the final Mandelbrot set.
+        draw_mandelbrot[GRID_HEIGHT, GRID_WIDTH](out_tensor, max=MAX_ITERATIONS)
+    else:
+        print(
+            "These examples require a MAX-compatible NVIDIA GPU and none was"
+            " detected."
+        )
diff --git a/gpu-functions-mojo/metadata.yaml b/gpu-functions-mojo/metadata.yaml
new file mode 100644
index 0000000..a28a12b
--- /dev/null
+++ b/gpu-functions-mojo/metadata.yaml
@@ -0,0 +1,19 @@
+version: 1.0
+long_title: "GPU Functions in Mojo: A CUDA-style Programming Interface"
+short_title: "GPU Functions in Mojo"
+author: "Brad Larson"
+author_image: "author/bradlarson.jpg"
+author_url: "https://www.linkedin.com/in/brad-larson-3549a5291/"
+github_repo: "https://github.com/modular/max-recipes/tree/main/gpu-functions-mojo"
+date: "02-03-2025"
+difficulty: "beginner"
+tags:
+  - max-driver
+  - gpu-programming
+  - mojo
+
+tasks:
+  - magic run vector_addition
+  - magic run grayscale
+  - magic run naive_matrix_multiplication
+  - magic run mandelbrot
diff --git a/gpu-functions-mojo/naive_matrix_multiplication.mojo b/gpu-functions-mojo/naive_matrix_multiplication.mojo
new file mode 100755
index 0000000..9bf8df4
--- /dev/null
+++ b/gpu-functions-mojo/naive_matrix_multiplication.mojo
@@ -0,0 +1,120 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+
+from gpu.host import Dim
+from gpu.id import block_dim, block_idx, thread_idx
+from math import ceildiv
+from max.driver import (
+    Accelerator,
+    Device,
+    DynamicTensor,
+    Tensor,
+    accelerator,
+    cpu,
+)
+from sys import has_nvidia_gpu_accelerator
+
+alias float_dtype = DType.float32
+alias tensor_rank = 2
+alias TensorType = DynamicTensor[type=float_dtype, rank=tensor_rank].Type
+
+
+fn naive_matrix_multiplication(
+    i: Int,
+    j: Int,
+    k: Int,
+    m: TensorType,
+    n: TensorType,
+    p: TensorType,
+):
+    """Naive matrix multiplication of M_ij x N_jk = P_ik."""
+    row = block_dim.y * block_idx.y + thread_idx.y
+    col = block_dim.x * block_idx.x + thread_idx.x
+
+    if row < i and col < k:
+        for j_index in range(j):
+            p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col]
+
+
+def main():
+    @parameter
+    if has_nvidia_gpu_accelerator():
+        # Attempt to connect to a compatible GPU. If one is not found, this will
+        # error out and exit.
+        gpu_device = accelerator()
+        host_device = cpu()
+
+        alias I = 5
+        alias J = 4
+        alias K = 6
+
+        # Allocate the two input matrices on the host.
+        m_tensor = Tensor[float_dtype, tensor_rank]((I, J), host_device)
+        n_tensor = Tensor[float_dtype, tensor_rank]((J, K), host_device)
+
+        # Fill them with initial values.
+        for m_row in range(I):
+            for m_col in range(J):
+                m_tensor[m_row, m_col] = m_row - m_col
+
+        for n_row in range(J):
+            for n_col in range(K):
+                n_tensor[n_row, n_col] = n_row + n_col
+
+        print("M matrix:", m_tensor)
+        print("N matrix:", n_tensor)
+
+        # Move the input matrices to the accelerator.
+        m_tensor = m_tensor.move_to(gpu_device)
+        n_tensor = n_tensor.move_to(gpu_device)
+
+        # Allocate a tensor on the accelerator to host the calculation results.
+        p_tensor = Tensor[float_dtype, tensor_rank]((I, K), gpu_device)
+
+        # Compile the function to run across a grid on the GPU.
+        gpu_function = Accelerator.compile[naive_matrix_multiplication](
+            gpu_device
+        )
+
+        # The grid is divided up into blocks, making sure there's an extra
+        # full block for any remainder. This hasn't been tuned for any specific
+        # GPU.
+        alias BLOCK_SIZE = 16
+        num_col_blocks = ceildiv(I, BLOCK_SIZE)
+        num_row_blocks = ceildiv(J, BLOCK_SIZE)
+
+        # Launch the compiled function on the GPU. The target device is specified
+        # first, followed by all function arguments. The last two named parameters
+        # are the dimensions of the grid in blocks, and the block dimensions.
+        gpu_function(
+            gpu_device,
+            I,
+            J,
+            K,
+            m_tensor.unsafe_slice(),
+            n_tensor.unsafe_slice(),
+            p_tensor.unsafe_slice(),
+            grid_dim=Dim(num_col_blocks, num_row_blocks),
+            block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE),
+        )
+
+        # Move the output tensor back onto the CPU so that we can read the results.
+        p_tensor = p_tensor.move_to(host_device)
+
+        print("Resulting matrix:", p_tensor)
+    else:
+        print(
+            "These examples require a MAX-compatible NVIDIA GPU and none was"
+            " detected."
+        )
diff --git a/gpu-functions-mojo/pyproject.toml b/gpu-functions-mojo/pyproject.toml
new file mode 100644
index 0000000..6187d93
--- /dev/null
+++ b/gpu-functions-mojo/pyproject.toml
@@ -0,0 +1,31 @@
+[project]
+authors = [{ name = "Modular Inc", email = "hello@modular.com" }]
+description = "GPU Functions in Mojo: A CUDA-style Programming Interface"
+name = "gpu-functions-mojo"
+requires-python = ">= 3.9,<3.13"
+version = "0.1.0"
+
+[build-system]
+build-backend = "hatchling.build"
+requires = ["hatchling"]
+
+[tool.hatch.build.targets.wheel]
+packages = ["."]
+
+[tool.pixi.project]
+channels = [
+    "conda-forge",
+    "https://conda.modular.com/max-nightly",
+    "https://conda.modular.com/max",
+    "https://repo.prefix.dev/modular-community",
+]
+platforms = ["linux-64", "linux-aarch64"]
+
+[tool.pixi.tasks]
+vector_addition = { cmd = "mojo vector_addition.mojo" }
+grayscale = { cmd = "mojo benchmarks.mojo" }
+naive_matrix_multiplication = { cmd = "mojo naive_matrix_multiplication.mojo" }
+mandelbrot = { cmd = "mojo mandelbrot.mojo" }
+
+[tool.pixi.dependencies]
+max = ">=25.2.0.dev2025022205"
diff --git a/gpu-functions-mojo/vector_addition.mojo b/gpu-functions-mojo/vector_addition.mojo
new file mode 100755
index 0000000..900c845
--- /dev/null
+++ b/gpu-functions-mojo/vector_addition.mojo
@@ -0,0 +1,103 @@
+# ===----------------------------------------------------------------------=== #
+# Copyright (c) 2025, Modular Inc. All rights reserved.
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions:
+# https://llvm.org/LICENSE.txt
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ===----------------------------------------------------------------------=== #
+
+from gpu.host import Dim
+from gpu.id import block_dim, block_idx, thread_idx
+from math import ceildiv
+from max.driver import (
+    Accelerator,
+    Device,
+    DynamicTensor,
+    Tensor,
+    accelerator,
+    cpu,
+)
+from sys import has_nvidia_gpu_accelerator
+
+alias float_dtype = DType.float32
+alias tensor_rank = 1
+alias TensorType = DynamicTensor[type=float_dtype, rank=tensor_rank].Type
+
+
+fn vector_addition(
+    length: Int,
+    lhs: TensorType,
+    rhs: TensorType,
+    out: TensorType,
+):
+    """The calculation to perform across the vector on the GPU."""
+    tid = block_dim.x * block_idx.x + thread_idx.x
+    if tid < length:
+        var result = lhs[tid] + rhs[tid]
+        out[tid] = result
+
+
+def main():
+    @parameter
+    if has_nvidia_gpu_accelerator():
+        # Attempt to connect to a compatible GPU. If one is not found, this will
+        # error out and exit.
+        gpu_device = accelerator()
+        host_device = cpu()
+
+        alias VECTOR_WIDTH = 10
+
+        # Allocate the two input tensors on the host.
+        lhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device)
+        rhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device)
+
+        # Fill them with initial values.
+        for i in range(VECTOR_WIDTH):
+            lhs_tensor[i] = 1.25
+            rhs_tensor[i] = 2.5
+
+        # Move the input tensors to the accelerator.
+        lhs_tensor = lhs_tensor.move_to(gpu_device)
+        rhs_tensor = rhs_tensor.move_to(gpu_device)
+
+        # Allocate a tensor on the accelerator to host the calculation results.
+        out_tensor = Tensor[float_dtype, tensor_rank](
+            (VECTOR_WIDTH), gpu_device
+        )
+
+        # Compile the function to run across a grid on the GPU.
+        gpu_function = Accelerator.compile[vector_addition](gpu_device)
+
+        # The grid is divided up into blocks, making sure there's an extra
+        # full block for any remainder. This hasn't been tuned for any specific
+        # GPU.
+        alias BLOCK_SIZE = 16
+        var num_blocks = ceildiv(VECTOR_WIDTH, BLOCK_SIZE)
+
+        # Launch the compiled function on the GPU. The target device is specified
+        # first, followed by all function arguments. The last two named parameters
+        # are the dimensions of the grid in blocks, and the block dimensions.
+        gpu_function(
+            gpu_device,
+            VECTOR_WIDTH,
+            lhs_tensor.unsafe_slice(),
+            rhs_tensor.unsafe_slice(),
+            out_tensor.unsafe_slice(),
+            grid_dim=Dim(num_blocks),
+            block_dim=Dim(BLOCK_SIZE),
+        )
+
+        # Move the output tensor back onto the CPU so that we can read the results.
+        out_tensor = out_tensor.move_to(host_device)
+
+        print("Resulting vector:", out_tensor)
+    else:
+        print(
+            "These examples require a MAX-compatible NVIDIA GPU and none was"
+            " detected."
+        )