diff --git a/gpu-functions-mojo/.gitignore b/gpu-functions-mojo/.gitignore new file mode 100644 index 0000000..ba386de --- /dev/null +++ b/gpu-functions-mojo/.gitignore @@ -0,0 +1,8 @@ +# pixi environments +.pixi +*.egg-info +# magic environments +.magic +.env +# build products +operations.mojopkg diff --git a/gpu-functions-mojo/README.md b/gpu-functions-mojo/README.md new file mode 100644 index 0000000..55804a9 --- /dev/null +++ b/gpu-functions-mojo/README.md @@ -0,0 +1,433 @@ +# GPU Functions in Mojo: A CUDA-style Programming Interface + +In this recipe, we will cover: + +* Writing thread-parallel GPU functions in Mojo. +* Compiling and dispatching these functions to a GPU using the MAX Driver API. +* Translating common CUDA C programming patterns to MAX and Mojo. + +We'll walk through four GPU programming examples that + +* perform basic vector addition, +* convert a color image to grayscale, +* do a naive matrix multiplication, +* and calculate the Mandelbrot set fractal. + +Let's get started. + +## Requirements + +Please make sure your system meets our +[system requirements](https://docs.modular.com/max/get-started). + +To proceed, ensure you have the `magic` CLI installed: + +```bash +curl -ssL https://magic.modular.com/ | bash +``` + +or update it via: + +```bash +magic self-update +``` + +### GPU requirements + +These examples require a MAX-compatible GPU, satisfying +[these requirements](https://docs.modular.com/max/faq/#gpu-requirements): + +* Officially supported GPUs: NVIDIA Ampere A-series (A100/A10), or Ada + L4-series (L4/L40) data center GPUs. Unofficially, RTX 30XX and 40XX series + GPUs have been reported to work well with MAX. +* NVIDIA GPU driver version 555 or higher. [Installation guide here](https://www.nvidia.com/download/index.aspx). + +## Quick start + +1. Download the code for this recipe using git: + +```bash +git clone https://github.com/modular/max-recipes.git +cd max-recipes/gpu-functions-mojo +``` + +2. Run the examples: + +```bash +magic run vector_addition +magic run grayscale +magic run naive_matrix_multiplication +magic run mandelbrot +``` + +## Compiling and running GPU functions using the Mojo Driver API + +MAX is a flexible, hardware-independent framework for programming GPUs and +CPUs. It lets you get the most out of accelators without requiring +architecture-specific frameworks like CUDA. MAX has many levels. from highly +optimized AI models, to the computational graphs that define those models, to +direct GPU programming in the Mojo language. You can use whatever level of +abstraction best suits your needs in MAX. + +[Mojo](https://docs.modular.com/mojo/manual/) is a Python-family language that +has been built for high-performance computing. It allows you to write custom +algorithms for GPUs without the use of CUDA or other vendor-specific libraries. +All of the operations that power AI models within MAX are written in Mojo. + +We'll demonstrate an entry point into GPU programming with MAX where individual +thread-based functions can be defined, compiled, and dispatched onto a GPU. +This is powered by the MAX Driver API, which handles all the hardware-specific +details of allocating and transferring memory between host and accelerator, as +well as compilation and execution of accelerator-targeted functions. + +The first three examples in this recipe show common starting points for +thread-based GPU programming. They follow the first three examples in the +popular GPU programming textbook +["Programming Massively Parallel Processors"](https://www.sciencedirect.com/book/9780323912310/programming-massively-parallel-processors): + +* Parallel addition of two vectors +* Conversion of a red-green-blue image to grayscale +* A naive matrix multiplication, with no hardware-specific optimization + +### Basic vector addition + +The common "hello world" example used for data-parallel programming is the +addition of each element in two vectors. Let's take a look at how that can be +defined in MAX. + +The function itself is very simple, running once per thread, adding each +element in the two input vectors that correspond to that thread ID, and storing +the result in the output vector at the matching location. If a block of threads +is dispatched that is wider than the vector length, no work is performed by +threads past the end of the vector. + +```mojo +fn vector_addition( + length: Int, + lhs: TensorType, + rhs: TensorType, + out: TensorType, +): + tid = block_dim.x * block_idx.x + thread_idx.x + if tid < length: + var result = lhs[tid] + rhs[tid] + out[tid] = result +``` + +To run this function on the GPU, input and output vectors need to be +allocated, the function compiled and dispatched, and the results returned. + +As a first step, a reference to the host (CPU) and accelerator (GPU) devices +needs to be obtained: + +```mojo +gpu_device = accelerator() +host_device = cpu() +``` + +If no MAX-compatible accelerator has been found, this will error out and +indicate as such. + +Next, the left-hand-side and right-hand-side vectors need to be allocated on +the CPU, initialized with values, and then moved to the GPU for use by our +function. This is done using the MAX Driver API's `Tensor` type + +```mojo +alias float_dtype = DType.float32 +alias tensor_rank = 1 +alias VECTOR_WIDTH = 10 + +lhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device) +rhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device) + +for i in range(VECTOR_WIDTH): + lhs_tensor[i] = 1.25 + rhs_tensor[i] = 2.5 + +lhs_tensor = lhs_tensor.move_to(gpu_device) +rhs_tensor = rhs_tensor.move_to(gpu_device) +``` + +A tensor that hosts the result of the calculation is allocated on the GPU: + +```mojo +out_tensor = Tensor[float_dtype, tensor_rank]( + (VECTOR_WIDTH), gpu_device +) +``` + +The actual `vector_addition` function we want to run on the GPU is compiled and +dispatched across a grid, divided into blocks of threads. All arguments to this +GPU function are provided here, in an order that corresponds to their location +in the function signature. Note that in MAX, the GPU function is compiled for +the GPU at the time of compilation of the Mojo file containing it. + +```mojo +gpu_function = Accelerator.compile[vector_addition](gpu_device) + +alias BLOCK_SIZE = 16 +var num_blocks = ceildiv(VECTOR_WIDTH, BLOCK_SIZE) + +gpu_function( + gpu_device, + VECTOR_WIDTH, + lhs_tensor.unsafe_slice(), + rhs_tensor.unsafe_slice(), + out_tensor.unsafe_slice(), + grid_dim=Dim(num_blocks), + block_dim=Dim(BLOCK_SIZE), +) +``` + +Finally, the results of the calculation are moved from the GPU back to the host +to be examined: + +```mojo +out_tensor = out_tensor.move_to(host_device) + +print("Resulting vector:", out_tensor) +``` + +To try out this example yourself, run it using the following command: + +```sh +magic run vector_addition +``` + +You should observe a vector of all 3.75. Experiment with changing the vector +length, the block size, and other parameters and see how the calculation +scales. + +### Conversion of a color image to grayscale + +As a slightly more complex example, the next step in this recipe shows how to +convert a red-green-blue (RGB) color image into grayscale. This uses a rank-3 +tensor to host the 2-D image and the color channels at each pixel. The inputs +start with three color channels, and the output only has a single grayscale channel. + +The calculation performed is a common reduction to luminance using weighted +values for the three channels: + +```mojo +gray = 0.21 * red + 0.71 * green + 0.07 * blue +``` + +And this is the per-thread function that expresses this to be run on the GPU: + +```mojo +fn color_to_grayscale_conversion( + width: Int, + height: Int, + image: TensorType, + out: TensorType, +): + row = block_dim.y * block_idx.y + thread_idx.y + col = block_dim.x * block_idx.x + thread_idx.x + + if col < width and row < height: + red = image[row, col, 0].cast[internal_float_dtype]() + green = image[row, col, 1].cast[internal_float_dtype]() + blue = image[row, col, 2].cast[internal_float_dtype]() + gray = 0.21 * red + 0.71 * green + 0.07 * blue + + out[row, col, 0] = gray.cast[channel_dtype]() +``` + +The setup, compilation, and execution of this function is much the same as in +the previous example, but in this case we're using rank-3 instead of rank-1 +`Tensor`s to hold the values. Also, the function is dispatched over a 2-D grid +of block, which looks like the following: + +```mojo +alias BLOCK_SIZE = 16 +num_col_blocks = ceildiv(IMAGE_WIDTH, BLOCK_SIZE) +num_row_blocks = ceildiv(IMAGE_HEIGHT, BLOCK_SIZE) + +gpu_function( + gpu_device, + IMAGE_WIDTH, + IMAGE_HEIGHT, + rgb_tensor.unsafe_slice(), + gray_tensor.unsafe_slice(), + grid_dim=Dim(num_col_blocks, num_row_blocks), + block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE), +) +``` + +To run this example, use the command: + +```sh +magic run grayscale +``` + +This will show a grid of numbers representing the grayscale values for a single +color broadcast across a simple input image. Try changing the image and block +sizes to see how this scales on the GPU. + +### Naive matrix multiplication + +The next example performs a very basic matrix multiplication, with no +optimizations to take advantage of hardware resources. The GPU function for +this looks like the following: + +```mojo +fn naive_matrix_multiplication( + i: Int, + j: Int, + k: Int, + m: TensorType, + n: TensorType, + p: TensorType, +): + row = block_dim.y * block_idx.y + thread_idx.y + col = block_dim.x * block_idx.x + thread_idx.x + + if row < i and col < k: + for j_index in range(j): + p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col] +``` + +The overall setup and execution of this function are extremely similar to the +previous example, with the primary change being the function that is run on the +GPU. + +To try out this example, run the command: + +```sh +magic run naive_matrix_multiplication +``` + +You will see the two input matrices printed to the console, as well as the +result of their multiplication. As with the previous examples, try out changing +the sizes of the matrices and how they are dispatched on the GPU. + +### Calculating the Mandelbrot set fractal + +The final example in this recipe shows a slightly more complex calculation +(pun intended): +[the Mandelbrot set fractal](https://en.wikipedia.org/wiki/Mandelbrot_set). +This custom operation takes no input tensors, only a set of scalar arguments, +and returns a 2-D matrix of integer values representing the number of +iterations it took to escape at that location in complex number space. + +The per-thread GPU function for this is as follows: + +```mojo +fn mandelbrot( + min_x: Scalar[float_dtype], + min_y: Scalar[float_dtype], + scale_x: Scalar[float_dtype], + scale_y: Scalar[float_dtype], + max_iterations: Scalar[int_dtype], + out: TensorType, +): + var row = block_dim.y * block_idx.y + thread_idx.y + var col = block_dim.x * block_idx.x + thread_idx.x + + var cx = min_x + col * scale_x + var cy = min_y + row * scale_y + var c = ComplexSIMD[float_dtype, 1](cx, cy) + + var z = ComplexSIMD[float_dtype, 1](0, 0) + var iters = Scalar[int_dtype](0) + + var in_set_mask: Scalar[DType.bool] = True + for _ in range(max_iterations): + if not any(in_set_mask): + break + in_set_mask = z.squared_norm() <= 4 + iters = in_set_mask.select(iters + 1, iters) + z = z.squared_add(c) + + out[row, col] = iters +``` + +This begins by calculating the complex number which represents a given location +in the output grid (C). Then, starting from `Z=0`, the calculation `Z=Z^2 + C` +is iteratively calculated until Z exceeds 4, the threshold we're using for when +Z will escape the set. This occurs up until a maximum number of iterations, +and the number of iterations to escape (or not, if the maximum is hit) is then +returned for each location in the grid. + +The area to examine in complex space, the resolution of the grid, and the +maximum number of iterations, are all provided as constants: + +```mojo +alias MIN_X: Scalar[float_dtype] = -2.0 +alias MAX_X: Scalar[float_dtype] = 0.7 +alias MIN_Y: Scalar[float_dtype] = -1.12 +alias MAX_Y: Scalar[float_dtype] = 1.12 +alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH +alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT +alias MAX_ITERATIONS = 100 +``` + +The Mandelbrot set can be calculated on GPU using this command: + +```sh +magic run mandelbrot +``` + +The result should be an ASCII art depiction of the region covered by the +calculation: + +``` +...................................,,,,c@8cc,,,............. +...............................,,,,,,cc8M @Mjc,,,,.......... +............................,,,,,,,ccccM@aQaM8c,,,,,........ +..........................,,,,,,,ccc88g.o. Owg8ccc,,,,...... +.......................,,,,,,,,c8888M@j, ,wMM8cccc,,..... +.....................,,,,,,cccMQOPjjPrgg, OrwrwMMMjjc,.... +..................,,,,cccccc88MaP @ ,pGa.g8c,... +...............,,cccccccc888MjQp. o@8cc,.. +..........,,,,c8jjMMMMMMMMM@@w. aj8c,,. +.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,. +..,,,,,,,cccccMMjwQ EpQ .8c,,, +.,,,,,,cc888MrajwJ MMcc,,, +.cc88jMMM@@jaG. oM8cc,,, +.cc88jMMM@@jaG. oM8cc,,, +.,,,,,,cc888MrajwJ MMcc,,, +..,,,,,,,cccccMMjwQ EpQ .8c,,, +.....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,. +..........,,,,c8jjMMMMMMMMM@@w. aj8c,,. +...............,,cccccccc888MjQp. o@8cc,.. +..................,,,,cccccc88MaP @ ,pGa.g8c,... +.....................,,,,,,cccMQOEjjPrgg, OrwrwMMMjjc,.... +.......................,,,,,,,,c8888M@j, ,wMM8cccc,,..... +..........................,,,,,,,ccc88g.o. Owg8ccc,,,,...... +............................,,,,,,,ccccM@aQaM8c,,,,,........ +...............................,,,,,,cc8M @Mjc,,,,.......... +``` + +Try changing the various parameters above to produce different resolution +grids, or look into different areas in the complex number space. + +## Conclusion + +In this recipe, we've demonstrated how to perform the very basics of +thread-parallel GPU programming in MAX using Mojo and the MAX Driver API. This +is a programming model that is very familiar to those used to CUDA C, and we +have used a series of common examples to show how to map concepts from CUDA to +MAX. + +MAX has far more power available for fully utilizing GPUs through +[building computational graphs](https://docs.modular.com/max/tutorials/get-started-with-max-graph-in-python) +and then [defining custom operations](https://docs.modular.com/max/tutorials/build-custom-ops) +in Mojo using hardware-optimized abstractions. MAX is an extremely flexible +framework for programming GPUs, with interfaces at many levels. In this recipe, +you've been introduced to the very basics of getting started with GPU +programming in MAX and Mojo, but there's much more to explore! + +## Next Steps + +* Try applying GPU programming in MAX for more complex workloads via tutorials + on the [MAX Graph API](https://docs.modular.com/max/tutorials/get-started-with-max-graph-in-python) + and [defining custom GPU graph operations in Mojo](https://docs.modular.com/max/tutorials/build-custom-ops). + +* Explore MAX's [documentation](https://docs.modular.com/max/) for additional + features. The [`gpu`](https://docs.modular.com/mojo/stdlib/gpu/) module has + detail on Mojo's GPU programming functions and types. + +* Join our [Modular Forum](https://forum.modular.com/) and [Discord community](https://discord.gg/modular) to share your experiences and get support. + +We're excited to see what you'll build with MAX! Share your projects and experiences with us using `#ModularAI` on social media. diff --git a/gpu-functions-mojo/grayscale.mojo b/gpu-functions-mojo/grayscale.mojo new file mode 100755 index 0000000..1681321 --- /dev/null +++ b/gpu-functions-mojo/grayscale.mojo @@ -0,0 +1,134 @@ +# ===----------------------------------------------------------------------=== # +# Copyright (c) 2025, Modular Inc. All rights reserved. +# +# Licensed under the Apache License v2.0 with LLVM Exceptions: +# https://llvm.org/LICENSE.txt +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ===----------------------------------------------------------------------=== # + + +from gpu.host import Dim +from gpu.id import block_dim, block_idx, thread_idx +from math import ceildiv +from max.driver import ( + Accelerator, + Device, + DynamicTensor, + Tensor, + accelerator, + cpu, +) +from sys import has_nvidia_gpu_accelerator + +alias channel_dtype = DType.uint8 +alias internal_float_dtype = DType.float32 +alias tensor_rank = 3 +alias TensorType = DynamicTensor[type=channel_dtype, rank=tensor_rank].Type + + +def print_image[h: Int, w: Int](t: Tensor[channel_dtype, 3]): + """A helper function to print out the grayscale channel intensities.""" + out = t.unsafe_slice() + for row in range(h): + for col in range(w): + var v = out[row, col, 0] + if v < 100: + print(" ", end="") + if v < 10: + print(" ", end="") + print(v, " ", end="") + print("") + + +fn color_to_grayscale_conversion( + width: Int, + height: Int, + image: TensorType, + out: TensorType, +): + """Converting each RGB pixel to grayscale, parallelized across the output tensor on the GPU. + """ + row = block_dim.y * block_idx.y + thread_idx.y + col = block_dim.x * block_idx.x + thread_idx.x + + if col < width and row < height: + red = image[row, col, 0].cast[internal_float_dtype]() + green = image[row, col, 1].cast[internal_float_dtype]() + blue = image[row, col, 2].cast[internal_float_dtype]() + gray = 0.21 * red + 0.71 * green + 0.07 * blue + + out[row, col, 0] = gray.cast[channel_dtype]() + + +def main(): + @parameter + if has_nvidia_gpu_accelerator(): + # Attempt to connect to a compatible GPU. If one is not found, this will + # error out and exit. + gpu_device = accelerator() + host_device = cpu() + + alias IMAGE_WIDTH = 5 + alias IMAGE_HEIGHT = 10 + alias NUM_CHANNELS = 3 + + # Allocate the input image tensor on the host. + rgb_tensor = Tensor[channel_dtype, tensor_rank]( + (IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS), host_device + ) + + # Fill the image with initial colors. + for row in range(IMAGE_HEIGHT): + for col in range(IMAGE_WIDTH): + rgb_tensor[row, col, 0] = row + col + rgb_tensor[row, col, 1] = row + col + 20 + rgb_tensor[row, col, 2] = row + col + 40 + + # Move the image tensor to the accelerator. + rgb_tensor = rgb_tensor.move_to(gpu_device) + + # Allocate a tensor on the accelerator to host the grayscale image. + gray_tensor = Tensor[channel_dtype, tensor_rank]( + (IMAGE_HEIGHT, IMAGE_WIDTH, 1), gpu_device + ) + + # Compile the function to run across a grid on the GPU. + gpu_function = Accelerator.compile[color_to_grayscale_conversion]( + gpu_device + ) + + # The grid is divided up into blocks, making sure there's an extra + # full block for any remainder. This hasn't been tuned for any specific + # GPU. + alias BLOCK_SIZE = 16 + num_col_blocks = ceildiv(IMAGE_WIDTH, BLOCK_SIZE) + num_row_blocks = ceildiv(IMAGE_HEIGHT, BLOCK_SIZE) + + # Launch the compiled function on the GPU. The target device is specified + # first, followed by all function arguments. The last two named parameters + # are the dimensions of the grid in blocks, and the block dimensions. + gpu_function( + gpu_device, + IMAGE_WIDTH, + IMAGE_HEIGHT, + rgb_tensor.unsafe_slice(), + gray_tensor.unsafe_slice(), + grid_dim=Dim(num_col_blocks, num_row_blocks), + block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE), + ) + + # Move the output tensor back onto the CPU so that we can read the results. + gray_tensor = gray_tensor.move_to(host_device) + + print("Resulting grayscale image:") + print_image[IMAGE_HEIGHT, IMAGE_WIDTH](gray_tensor) + else: + print( + "These examples require a MAX-compatible NVIDIA GPU and none was" + " detected." + ) diff --git a/gpu-functions-mojo/mandelbrot.mojo b/gpu-functions-mojo/mandelbrot.mojo new file mode 100755 index 0000000..b70c949 --- /dev/null +++ b/gpu-functions-mojo/mandelbrot.mojo @@ -0,0 +1,136 @@ +# ===----------------------------------------------------------------------=== # +# Copyright (c) 2025, Modular Inc. All rights reserved. +# +# Licensed under the Apache License v2.0 with LLVM Exceptions: +# https://llvm.org/LICENSE.txt +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ===----------------------------------------------------------------------=== # + +from collections.string import StringSlice +from complex import ComplexSIMD +from gpu.host import Dim +from gpu.id import thread_idx, block_dim, block_idx +from math import ceildiv +from max.driver import Accelerator, DynamicTensor, Tensor, accelerator, cpu +from sys import has_nvidia_gpu_accelerator + +alias float_dtype = DType.float32 +alias int_dtype = DType.int32 +alias TensorType = DynamicTensor[type=int_dtype, rank=2].Type + + +def draw_mandelbrot[h: Int, w: Int](t: Tensor[int_dtype, 2], max: Int): + """A helper function to visualize the Mandelbrot set in ASCII art.""" + alias sr = StringSlice("....,c8M@jawrpogOQEPGJ") + out = t.unsafe_slice() + for row in range(h): + for col in range(w): + var v = out[row, col] + if v < max: + var idx = Int(v % len(sr)) + var p = sr[idx] + print(p, end="") + else: + print(" ", end="") + print("") + + +fn mandelbrot( + min_x: Scalar[float_dtype], + min_y: Scalar[float_dtype], + scale_x: Scalar[float_dtype], + scale_y: Scalar[float_dtype], + max_iterations: Scalar[int_dtype], + out: TensorType, +): + """The per-element calculation of iterations to escape in the Mandelbrot set. + """ + # Obtain the position in the grid from the X, Y thread locations. + var row = block_dim.y * block_idx.y + thread_idx.y + var col = block_dim.x * block_idx.x + thread_idx.x + + # Calculate the complex C corresponding to that grid location. + var cx = min_x + col * scale_x + var cy = min_y + row * scale_y + var c = ComplexSIMD[float_dtype, 1](cx, cy) + + # Perform the Mandelbrot iteration loop calculation. + var z = ComplexSIMD[float_dtype, 1](0, 0) + var iters = Scalar[int_dtype](0) + + var in_set_mask: Scalar[DType.bool] = True + for _ in range(max_iterations): + if not any(in_set_mask): + break + in_set_mask = z.squared_norm() <= 4 + iters = in_set_mask.select(iters + 1, iters) + z = z.squared_add(c) + + # Write out the resulting iterations to escape. + out[row, col] = iters + + +def main(): + @parameter + if has_nvidia_gpu_accelerator(): + # Attempt to connect to a compatible GPU. If one is not found, this will + # error out and exit. + gpu_device = accelerator() + host_device = cpu() + + # Compile the function to run across a grid on the GPU. + gpu_function = Accelerator.compile[mandelbrot](gpu_device) + + # Set the resolution of the Mandelbrot set grid that will be calculated. + alias GRID_WIDTH = 60 + alias GRID_HEIGHT = 25 + + # The grid is divided up into blocks, making sure there's an extra + # full block for any remainder. This hasn't been tuned for any specific + # GPU. + alias BLOCK_SIZE = 16 + num_col_blocks = ceildiv(GRID_WIDTH, BLOCK_SIZE) + num_row_blocks = ceildiv(GRID_HEIGHT, BLOCK_SIZE) + + # Set the parameters for the area of the Mandelbrot set we'll be examining. + alias MIN_X: Scalar[float_dtype] = -2.0 + alias MAX_X: Scalar[float_dtype] = 0.7 + alias MIN_Y: Scalar[float_dtype] = -1.12 + alias MAX_Y: Scalar[float_dtype] = 1.12 + alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH + alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT + alias MAX_ITERATIONS = 100 + + # Allocate a tensor on the target device to hold the resulting set. + out_tensor = Tensor[int_dtype, 2]((GRID_HEIGHT, GRID_WIDTH), gpu_device) + + # Launch the compiled function on the GPU. The target device is specified + # first, followed by all function arguments. The last two named parameters + # are the dimensions of the grid in blocks, and the block dimensions. + gpu_function( + gpu_device, + MIN_X, + MIN_Y, + SCALE_X, + SCALE_Y, + MAX_ITERATIONS, + out_tensor.unsafe_slice(), + grid_dim=Dim(num_col_blocks, num_row_blocks), + block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE), + ) + + # Move the output tensor back onto the CPU so that we can read the results. + out_tensor = out_tensor.move_to(host_device) + + # Draw the final Mandelbrot set. + draw_mandelbrot[GRID_HEIGHT, GRID_WIDTH](out_tensor, max=MAX_ITERATIONS) + else: + print( + "These examples require a MAX-compatible NVIDIA GPU and none was" + " detected." + ) diff --git a/gpu-functions-mojo/metadata.yaml b/gpu-functions-mojo/metadata.yaml new file mode 100644 index 0000000..a28a12b --- /dev/null +++ b/gpu-functions-mojo/metadata.yaml @@ -0,0 +1,19 @@ +version: 1.0 +long_title: "GPU Functions in Mojo: A CUDA-style Programming Interface" +short_title: "GPU Functions in Mojo" +author: "Brad Larson" +author_image: "author/bradlarson.jpg" +author_url: "https://www.linkedin.com/in/brad-larson-3549a5291/" +github_repo: "https://github.com/modular/max-recipes/tree/main/gpu-functions-mojo" +date: "02-03-2025" +difficulty: "beginner" +tags: + - max-driver + - gpu-programming + - mojo + +tasks: + - magic run vector_addition + - magic run grayscale + - magic run naive_matrix_multiplication + - magic run mandelbrot diff --git a/gpu-functions-mojo/naive_matrix_multiplication.mojo b/gpu-functions-mojo/naive_matrix_multiplication.mojo new file mode 100755 index 0000000..9bf8df4 --- /dev/null +++ b/gpu-functions-mojo/naive_matrix_multiplication.mojo @@ -0,0 +1,120 @@ +# ===----------------------------------------------------------------------=== # +# Copyright (c) 2025, Modular Inc. All rights reserved. +# +# Licensed under the Apache License v2.0 with LLVM Exceptions: +# https://llvm.org/LICENSE.txt +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ===----------------------------------------------------------------------=== # + + +from gpu.host import Dim +from gpu.id import block_dim, block_idx, thread_idx +from math import ceildiv +from max.driver import ( + Accelerator, + Device, + DynamicTensor, + Tensor, + accelerator, + cpu, +) +from sys import has_nvidia_gpu_accelerator + +alias float_dtype = DType.float32 +alias tensor_rank = 2 +alias TensorType = DynamicTensor[type=float_dtype, rank=tensor_rank].Type + + +fn naive_matrix_multiplication( + i: Int, + j: Int, + k: Int, + m: TensorType, + n: TensorType, + p: TensorType, +): + """Naive matrix multiplication of M_ij x N_jk = P_ik.""" + row = block_dim.y * block_idx.y + thread_idx.y + col = block_dim.x * block_idx.x + thread_idx.x + + if row < i and col < k: + for j_index in range(j): + p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col] + + +def main(): + @parameter + if has_nvidia_gpu_accelerator(): + # Attempt to connect to a compatible GPU. If one is not found, this will + # error out and exit. + gpu_device = accelerator() + host_device = cpu() + + alias I = 5 + alias J = 4 + alias K = 6 + + # Allocate the two input matrices on the host. + m_tensor = Tensor[float_dtype, tensor_rank]((I, J), host_device) + n_tensor = Tensor[float_dtype, tensor_rank]((J, K), host_device) + + # Fill them with initial values. + for m_row in range(I): + for m_col in range(J): + m_tensor[m_row, m_col] = m_row - m_col + + for n_row in range(J): + for n_col in range(K): + n_tensor[n_row, n_col] = n_row + n_col + + print("M matrix:", m_tensor) + print("N matrix:", n_tensor) + + # Move the input matrices to the accelerator. + m_tensor = m_tensor.move_to(gpu_device) + n_tensor = n_tensor.move_to(gpu_device) + + # Allocate a tensor on the accelerator to host the calculation results. + p_tensor = Tensor[float_dtype, tensor_rank]((I, K), gpu_device) + + # Compile the function to run across a grid on the GPU. + gpu_function = Accelerator.compile[naive_matrix_multiplication]( + gpu_device + ) + + # The grid is divided up into blocks, making sure there's an extra + # full block for any remainder. This hasn't been tuned for any specific + # GPU. + alias BLOCK_SIZE = 16 + num_col_blocks = ceildiv(I, BLOCK_SIZE) + num_row_blocks = ceildiv(J, BLOCK_SIZE) + + # Launch the compiled function on the GPU. The target device is specified + # first, followed by all function arguments. The last two named parameters + # are the dimensions of the grid in blocks, and the block dimensions. + gpu_function( + gpu_device, + I, + J, + K, + m_tensor.unsafe_slice(), + n_tensor.unsafe_slice(), + p_tensor.unsafe_slice(), + grid_dim=Dim(num_col_blocks, num_row_blocks), + block_dim=Dim(BLOCK_SIZE, BLOCK_SIZE), + ) + + # Move the output tensor back onto the CPU so that we can read the results. + p_tensor = p_tensor.move_to(host_device) + + print("Resulting matrix:", p_tensor) + else: + print( + "These examples require a MAX-compatible NVIDIA GPU and none was" + " detected." + ) diff --git a/gpu-functions-mojo/pyproject.toml b/gpu-functions-mojo/pyproject.toml new file mode 100644 index 0000000..6187d93 --- /dev/null +++ b/gpu-functions-mojo/pyproject.toml @@ -0,0 +1,31 @@ +[project] +authors = [{ name = "Modular Inc", email = "hello@modular.com" }] +description = "GPU Functions in Mojo: A CUDA-style Programming Interface" +name = "gpu-functions-mojo" +requires-python = ">= 3.9,<3.13" +version = "0.1.0" + +[build-system] +build-backend = "hatchling.build" +requires = ["hatchling"] + +[tool.hatch.build.targets.wheel] +packages = ["."] + +[tool.pixi.project] +channels = [ + "conda-forge", + "https://conda.modular.com/max-nightly", + "https://conda.modular.com/max", + "https://repo.prefix.dev/modular-community", +] +platforms = ["linux-64", "linux-aarch64"] + +[tool.pixi.tasks] +vector_addition = { cmd = "mojo vector_addition.mojo" } +grayscale = { cmd = "mojo benchmarks.mojo" } +naive_matrix_multiplication = { cmd = "mojo naive_matrix_multiplication.mojo" } +mandelbrot = { cmd = "mojo mandelbrot.mojo" } + +[tool.pixi.dependencies] +max = ">=25.2.0.dev2025022205" diff --git a/gpu-functions-mojo/vector_addition.mojo b/gpu-functions-mojo/vector_addition.mojo new file mode 100755 index 0000000..900c845 --- /dev/null +++ b/gpu-functions-mojo/vector_addition.mojo @@ -0,0 +1,103 @@ +# ===----------------------------------------------------------------------=== # +# Copyright (c) 2025, Modular Inc. All rights reserved. +# +# Licensed under the Apache License v2.0 with LLVM Exceptions: +# https://llvm.org/LICENSE.txt +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ===----------------------------------------------------------------------=== # + +from gpu.host import Dim +from gpu.id import block_dim, block_idx, thread_idx +from math import ceildiv +from max.driver import ( + Accelerator, + Device, + DynamicTensor, + Tensor, + accelerator, + cpu, +) +from sys import has_nvidia_gpu_accelerator + +alias float_dtype = DType.float32 +alias tensor_rank = 1 +alias TensorType = DynamicTensor[type=float_dtype, rank=tensor_rank].Type + + +fn vector_addition( + length: Int, + lhs: TensorType, + rhs: TensorType, + out: TensorType, +): + """The calculation to perform across the vector on the GPU.""" + tid = block_dim.x * block_idx.x + thread_idx.x + if tid < length: + var result = lhs[tid] + rhs[tid] + out[tid] = result + + +def main(): + @parameter + if has_nvidia_gpu_accelerator(): + # Attempt to connect to a compatible GPU. If one is not found, this will + # error out and exit. + gpu_device = accelerator() + host_device = cpu() + + alias VECTOR_WIDTH = 10 + + # Allocate the two input tensors on the host. + lhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device) + rhs_tensor = Tensor[float_dtype, 1]((VECTOR_WIDTH), host_device) + + # Fill them with initial values. + for i in range(VECTOR_WIDTH): + lhs_tensor[i] = 1.25 + rhs_tensor[i] = 2.5 + + # Move the input tensors to the accelerator. + lhs_tensor = lhs_tensor.move_to(gpu_device) + rhs_tensor = rhs_tensor.move_to(gpu_device) + + # Allocate a tensor on the accelerator to host the calculation results. + out_tensor = Tensor[float_dtype, tensor_rank]( + (VECTOR_WIDTH), gpu_device + ) + + # Compile the function to run across a grid on the GPU. + gpu_function = Accelerator.compile[vector_addition](gpu_device) + + # The grid is divided up into blocks, making sure there's an extra + # full block for any remainder. This hasn't been tuned for any specific + # GPU. + alias BLOCK_SIZE = 16 + var num_blocks = ceildiv(VECTOR_WIDTH, BLOCK_SIZE) + + # Launch the compiled function on the GPU. The target device is specified + # first, followed by all function arguments. The last two named parameters + # are the dimensions of the grid in blocks, and the block dimensions. + gpu_function( + gpu_device, + VECTOR_WIDTH, + lhs_tensor.unsafe_slice(), + rhs_tensor.unsafe_slice(), + out_tensor.unsafe_slice(), + grid_dim=Dim(num_blocks), + block_dim=Dim(BLOCK_SIZE), + ) + + # Move the output tensor back onto the CPU so that we can read the results. + out_tensor = out_tensor.move_to(host_device) + + print("Resulting vector:", out_tensor) + else: + print( + "These examples require a MAX-compatible NVIDIA GPU and none was" + " detected." + )