
Add quizzes for examples #690

Merged
merged 21 commits on Nov 10, 2024

Commits (21):
3aee9e3
Added a test
skeqiqevian Aug 8, 2024
51d62f8
Merge branch 'main' into schedules_with_bugs
yamaguchi1024 Aug 9, 2024
d63b5be
Added another test
skeqiqevian Aug 14, 2024
2781362
Merge branch 'schedules_with_bugs' of github.com:exo-lang/exo into sc…
skeqiqevian Aug 14, 2024
10691f5
Merge branch 'main' of github.com:exo-lang/exo into schedules_with_bugs
yamaguchi1024 Aug 15, 2024
f011862
Merge branch 'schedules_with_bugs' of github.com:exo-lang/exo into sc…
yamaguchi1024 Aug 15, 2024
51c175c
add rvm example from Julien
yamaguchi1024 Aug 15, 2024
fca3c19
fixed navigation question, added reference count question
skeqiqevian Aug 19, 2024
fbd8b37
Merge branch 'schedules_with_bugs' of github.com:exo-lang/exo into sc…
skeqiqevian Aug 19, 2024
f01efcf
forgot to make reference count schedule wrong
skeqiqevian Aug 19, 2024
5da0917
Rewrite RVM example
yamaguchi1024 Aug 19, 2024
115ddcc
Merge branch 'main' of github.com:exo-lang/exo into schedules_with_bugs
yamaguchi1024 Aug 20, 2024
99a0279
Add incorrect code
yamaguchi1024 Aug 21, 2024
74204ba
Merge branch 'main' of github.com:exo-lang/exo into schedules_with_bugs
yamaguchi1024 Nov 10, 2024
d9772a3
Add pytest for examples
yamaguchi1024 Nov 10, 2024
838b9e9
Add quizzes
yamaguchi1024 Nov 10, 2024
77fb77a
delete generated c files
yamaguchi1024 Nov 10, 2024
3943ce7
delete debugging subdirectory
yamaguchi1024 Nov 10, 2024
d3f4d59
update readme
yamaguchi1024 Nov 10, 2024
4d06b84
update readmes
yamaguchi1024 Nov 10, 2024
8e0b2de
update quiz2 and 3 readmes
yamaguchi1024 Nov 10, 2024
3 changes: 3 additions & 0 deletions examples/README.md
@@ -8,3 +8,6 @@ If you are new to Exo, we recommend going through the examples in the following
2. [Cursor](./cursors/README.md): This example shows how to use Cursors to efficiently write schedules and define a new scheduling operator.

3. [RVM](./rvm_conv1d/README.md): This example illustrates how to use Exo to define and target a new hardware accelerator entirely in the user code.

4. Quizzes ([quiz1](./quiz1/README.md), [quiz2](./quiz2/README.md), [quiz3](./quiz3/README.md)): These present common scheduling mistakes in Exo, along with solutions that fix them. The best way to learn a programming language is by debugging code.

1 change: 1 addition & 0 deletions examples/quiz1/.gitignore
@@ -0,0 +1 @@
quiz1/
59 changes: 59 additions & 0 deletions examples/quiz1/README.md
@@ -0,0 +1,59 @@
# Quiz 1

Throughout the quiz, we provide incorrect code and the correct output as a reference. Your goal is to understand the code and fix the bug to match the correct output!

You can execute `quiz1.py` by running `exocc quiz1.py`. Without modification, it will show the incorrect output.
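For reference, the whole loop is a single command (the note about the output location is an inference from the `.gitignore` entry added in this PR, which ignores a generated `quiz1/` subdirectory):

```bash
# Executes quiz1.py; the script's print(w) shows the scheduled
# procedure, and generated C output lands in a quiz1/ subdirectory
exocc quiz1.py
```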

## Incorrect Output

The following output is incorrect because it makes no calls to vector intrinsics. While it matches the structure of SIMD vector code, it still executes one element at a time:

```python
def double(N: size, inp: f32[N] @ DRAM, out: f32[N] @ DRAM):
    assert N % 8 == 0
    two_vec: R[8] @ DRAM
    for ii in seq(0, 8):
        two_vec[ii] = 2.0
    for io in seq(0, N / 8):
        out_vec: f32[8] @ DRAM
        inp_vec: f32[8] @ DRAM
        for i0 in seq(0, 8):
            inp_vec[i0] = inp[i0 + 8 * io]
        for ii in seq(0, 8):
            out_vec[ii] = two_vec[ii] * inp_vec[ii]
        for i0 in seq(0, 8):
            out[i0 + 8 * io] = out_vec[i0]
```

## Correct Output

The correct output optimizes the function to use vectorized arithmetic operations to compute the result over the entire array:

```python
def double(N: size, inp: f32[N] @ DRAM, out: f32[N] @ DRAM):
    assert N % 8 == 0
    two_vec: R[8] @ AVX2
    vector_assign_two(two_vec[0:8])
    for io in seq(0, N / 8):
        out_vec: f32[8] @ AVX2
        inp_vec: f32[8] @ AVX2
        vector_load(inp_vec[0:8], inp[8 * io + 0:8 * io + 8])
        vector_multiply(out_vec[0:8], two_vec[0:8], inp_vec[0:8])
        vector_store(out[8 * io + 0:8 * io + 8], out_vec[0:8])
```

---

## Solution

Before calling `replace_all(p, avx_instrs)`, you need to set the buffers' memory annotations to AVX2, because `replace_all` is memory-aware and will only replace code with instructions whose memory annotations match.

Add the following code before the call to `replace_all`:

```python
# Set the memory types to be AVX2 vectors
for name in ["two", "out", "inp"]:
    p = set_memory(p, f"{name}_vec", AVX2)
```

This ensures that the memory annotations are set to AVX2 before the code is replaced with vector intrinsics. A full corrected schedule is sketched below.
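Putting it together, the corrected schedule might read as follows. This is a sketch: it is `wrong_schedule` from `quiz1.py` with only the `set_memory` loop added, and the name `fixed_schedule` is ours, not part of the PR:

```python
def fixed_schedule(p):
    p = rename(p, "vec_double_optimized")
    p = divide_loop(p, "i", 8, ["io", "ii"], perfect=True)

    # Create a vector of twos
    p = bind_expr(p, "2.0", "two_vec")
    two_alloc = p.find("two_vec: _")
    two_assign = p.find("two_vec = _")
    p = expand_dim(p, two_alloc, 8, "ii")

    # Hoist the allocation and assignment of the two vector
    p = lift_alloc(p, two_alloc, 2)
    p = fission(p, two_assign.after(), 2)
    p = remove_loop(p, two_assign.parent().parent())

    # Stage the input and output values into vectors
    innermost_loop = p.find_loop("ii #1")
    p = stage_mem(p, innermost_loop, "out[8*io:8*io+8]", "out_vec")
    p = stage_mem(p, innermost_loop, "inp[8*io:8*io+8]", "inp_vec")
    p = simplify(p)

    # The fix: annotate the staged buffers as AVX2 so replace_all
    # can match them against the AVX2 instruction procs
    for name in ["two", "out", "inp"]:
        p = set_memory(p, f"{name}_vec", AVX2)

    # Now replacement succeeds, because the memories match
    avx_instrs = [vector_assign_two, vector_multiply, vector_load, vector_store]
    p = replace_all(p, avx_instrs)

    return p
```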
84 changes: 84 additions & 0 deletions examples/quiz1/quiz1.py
@@ -0,0 +1,84 @@
from __future__ import annotations

from exo import *
from exo.libs.memories import AVX2
from exo.stdlib.scheduling import *


@instr("{dst_data} = _mm256_loadu_ps(&{src_data});")
def vector_load(dst: [f32][8] @ AVX2, src: [f32][8] @ DRAM):
    assert stride(src, 0) == 1
    assert stride(dst, 0) == 1

    for i in seq(0, 8):
        dst[i] = src[i]


@instr("_mm256_storeu_ps(&{dst_data}, {src_data});")
def vector_store(dst: [f32][8] @ DRAM, src: [f32][8] @ AVX2):
    assert stride(src, 0) == 1
    assert stride(dst, 0) == 1

    for i in seq(0, 8):
        dst[i] = src[i]


@instr("{out_data} = _mm256_mul_ps({x_data}, {y_data});")
def vector_multiply(out: [f32][8] @ AVX2, x: [f32][8] @ AVX2, y: [f32][8] @ AVX2):
    assert stride(out, 0) == 1
    assert stride(x, 0) == 1
    assert stride(y, 0) == 1

    for i in seq(0, 8):
        out[i] = x[i] * y[i]


@instr("{out_data} = _mm256_broadcast_ss(2.0);")
def vector_assign_two(out: [f32][8] @ AVX2):
    assert stride(out, 0) == 1

    for i in seq(0, 8):
        out[i] = 2.0


@proc
def vec_double(N: size, inp: f32[N], out: f32[N]):
    assert N % 8 == 0
    for i in seq(0, N):
        out[i] = 2.0 * inp[i]


def wrong_schedule(p):
    """
    Forgot to set the memory types to AVX2 vectors, so instruction
    replacement does not work as intended.
    """
    p = rename(p, "vec_double_optimized")
    p = divide_loop(p, "i", 8, ["io", "ii"], perfect=True)

    # Create a vector of twos
    p = bind_expr(p, "2.0", "two_vec")
    two_alloc = p.find("two_vec: _")
    two_assign = p.find("two_vec = _")
    p = expand_dim(p, two_alloc, 8, "ii")

    # Hoist the allocation and assignment of the two vector
    p = lift_alloc(p, two_alloc, 2)
    p = fission(p, two_assign.after(), 2)
    p = remove_loop(p, two_assign.parent().parent())

    # Create vectors for the input and output values
    innermost_loop = p.find_loop("ii #1")
    p = stage_mem(p, innermost_loop, "out[8*io:8*io+8]", "out_vec")
    p = stage_mem(p, innermost_loop, "inp[8*io:8*io+8]", "inp_vec")
    p = simplify(p)

    # Replace with AVX instructions
    avx_instrs = [vector_assign_two, vector_multiply, vector_load, vector_store]
    p = replace_all(p, avx_instrs)

    return p


w = wrong_schedule(vec_double)
print(w)
1 change: 1 addition & 0 deletions examples/quiz2/.gitignore
@@ -0,0 +1 @@
quiz2/
101 changes: 101 additions & 0 deletions examples/quiz2/README.md
@@ -0,0 +1,101 @@
# Quiz 2

This quiz is about loop fission bugs and debugging via printing cursors.

## Incorrect Output (compiler error)

As written, the schedule has a bug that attempts to fission a loop incorrectly.
```bash
Traceback (most recent call last):
  File "/home/yuka/.local/bin/exocc", line 8, in <module>
    sys.exit(main())
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/main.py", line 55, in main
    library = [
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/main.py", line 58, in <listcomp>
    for proc in get_procs_from_module(load_user_code(mod))
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/main.py", line 107, in load_user_code
    loader.exec_module(user_module)
  File "<frozen importlib._bootstrap_external>", line 790, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/yuka/exo/examples/quiz2/quiz2.py", line 42, in <module>
    w = wrong_schedule(scaled_add)
  File "/home/yuka/exo/examples/quiz2/quiz2.py", line 38, in wrong_schedule
    p = fission(p, vector_assign.after())
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/API_scheduling.py", line 100, in __call__
    return self.func(*bound_args.args, **bound_args.kwargs)
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/API_scheduling.py", line 2066, in fission
    ir, fwd = scheduling.DoFissionAfterSimple(
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/rewrite/LoopIR_scheduling.py", line 2385, in DoFissionAfterSimple
    alloc_check(pre, post)
  File "/home/yuka/.local/lib/python3.9/site-packages/exo/rewrite/LoopIR_scheduling.py", line 2352, in alloc_check
    raise SchedulingError(
exo.rewrite.new_eff.SchedulingError: <<<unknown directive>>>: Will not fission here, because doing so will hide the allocation of vec from a later use site.
```
## Correct Output
The correct output will divide the computation into individual, vectorizable loops.
```python
def scaled_add_scheduled(N: size, a: f32[N] @ DRAM, b: f32[N] @ DRAM,
                         c: f32[N] @ DRAM):
    assert N % 8 == 0
    for io in seq(0, N / 8):
        vec: R[8] @ DRAM
        vec_1: R[8] @ DRAM
        vec_2: f32[8] @ DRAM
        vec_3: R[8] @ DRAM
        vec_4: R[8] @ DRAM
        vec_5: f32[8] @ DRAM
        for ii in seq(0, 8):
            vec_1[ii] = 2
        for ii in seq(0, 8):
            vec_2[ii] = a[8 * io + ii]
        for ii in seq(0, 8):
            vec[ii] = vec_1[ii] * vec_2[ii]
        for ii in seq(0, 8):
            vec_4[ii] = 3
        for ii in seq(0, 8):
            vec_5[ii] = b[8 * io + ii]
        for ii in seq(0, 8):
            vec_3[ii] = vec_4[ii] * vec_5[ii]
        for ii in seq(0, 8):
            c[8 * io + ii] = vec[ii] + vec_3[ii]
```
---
## Solution
To understand the bug, let's first try printing right before the error. Add the following line after line 37:
```python
print(vector_assign.after())
```
This will output:
```python
for io in seq(0, N / 8):
    vec: R[8] @ DRAM
    for ii in seq(0, 8):
        vec_1: R @ DRAM
        vec_1 = 2
        [GAP - After]
...
```
The code is attempting to perform fission at the `[GAP - After]` location.
However, this is unsafe: the `vec_1: R` allocation sits inside the `ii` loop, before the fission point.
If `vec_1` is used after the fission point, the resulting code would no longer be valid Exo.
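Concretely, fissioning at that gap would have to produce something like the following (a hand-written sketch, not actual compiler output), where `vec_1` is allocated in the first loop but still referenced in the second, out of scope:

```python
for ii in seq(0, 8):
    vec_1: R @ DRAM   # allocation stays behind in the first loop...
    vec_1 = 2
for ii in seq(0, 8):
    vec_2: R @ DRAM
    vec_2 = a[8 * io + ii]
    vec[ii] = vec_1 * vec_2   # ...but vec_1 is still needed here
```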
To fix this issue, modify the code as follows:
```python
for i in range(num_vectors):
    vector_reg = p.find(f"vec: _ #{i}")
    p = expand_dim(p, vector_reg, 8, "ii")
    p = lift_alloc(p, vector_reg)
for i in range(num_vectors):
    vector_assign = p.find(f"vec = _ #{i}")
    p = fission(p, vector_assign.after())
```
By splitting the allocation lifting and the fissioning into two separate passes, you ensure that every allocation is lifted out of the loop before any fission is performed, so no fission point can hide an allocation from a later use site. A full corrected schedule is sketched below.
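A sketch of the corrected schedule, identical to `wrong_schedule` in `quiz2.py` except that the single loop is split into the two passes above (the name `fixed_schedule` is ours):

```python
def fixed_schedule(p):
    p = rename(p, "scaled_add_scheduled")
    num_vectors = 0

    p = divide_loop(p, "i", 8, ["io", "ii"], perfect=True)
    p, num_vectors = stage_exprs(p, num_vectors, p.find("c[_] = _"))

    # Pass 1: lift every staged vector allocation out of the ii loop
    for i in range(num_vectors):
        vector_reg = p.find(f"vec: _ #{i}")
        p = expand_dim(p, vector_reg, 8, "ii")
        p = lift_alloc(p, vector_reg)

    # Pass 2: fission only once no allocation remains inside the loop
    for i in range(num_vectors):
        vector_assign = p.find(f"vec = _ #{i}")
        p = fission(p, vector_assign.after())

    return p
```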
46 changes: 46 additions & 0 deletions examples/quiz2/quiz2.py
@@ -0,0 +1,46 @@
from __future__ import annotations

from exo import *
from exo.stdlib.scheduling import *


@proc
def scaled_add(N: size, a: f32[N], b: f32[N], c: f32[N]):
    assert N % 8 == 0
    for i in seq(0, N):
        c[i] = 2 * a[i] + 3 * b[i]


def stage_exprs(p, num_vectors, assign):
    if isinstance(assign.rhs(), BinaryOpCursor):
        p = bind_expr(p, assign.rhs().lhs(), "vec")
        num_vectors += 1
        p, num_vectors = stage_exprs(p, num_vectors, p.forward(assign).prev())

        p = bind_expr(p, assign.rhs().rhs(), "vec")
        num_vectors += 1
        p, num_vectors = stage_exprs(p, num_vectors, p.forward(assign).prev())
    return p, num_vectors


def wrong_schedule(p):
    p = rename(p, "scaled_add_scheduled")
    num_vectors = 0

    p = divide_loop(p, "i", 8, ["io", "ii"], perfect=True)

    p, num_vectors = stage_exprs(p, num_vectors, p.find("c[_] = _"))

    for i in range(num_vectors):
        vector_reg = p.find(f"vec: _ #{i}")
        p = expand_dim(p, vector_reg, 8, "ii")
        p = lift_alloc(p, vector_reg)

        vector_assign = p.find(f"vec = _ #{i}")
        p = fission(p, vector_assign.after())

    return p


w = wrong_schedule(scaled_add)
print(w)
1 change: 1 addition & 0 deletions examples/quiz3/.gitignore
@@ -0,0 +1 @@
quiz3/
89 changes: 89 additions & 0 deletions examples/quiz3/README.md
@@ -0,0 +1,89 @@
# Quiz 3

This quiz explores fixing subtle cursor navigation bugs.

## Correct Output
This code shrinks the `blur_x` memory allocation from `(H+2, W)` to `(34, 256)`. Since the code has been tiled, we don't need to store the entire intermediate `blur_x` buffer in memory; we can reuse the same intermediate buffer for each tile.

To do so, the schedule sinks the allocation into the tile, shrinks it to the bare minimum needed to compute that tile, and then lifts the allocation back up to the top-level scope.
```python
def tile_and_fused_blur(W: size, H: size, blur_y: ui16[H, W] @ DRAM,
                        inp: ui16[H + 2, W + 2] @ DRAM):
    assert H % 32 == 0
    assert W % 256 == 0
    blur_x: ui16[34, 256] @ DRAM
    for yo in seq(0, H / 32):
        for xo in seq(0, W / 256):
            for yi in seq(0, 34):
                for xi in seq(0, 256):
                    blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo] = (
                        inp[yi + 32 * yo, xi + 256 * xo] +
                        inp[yi + 32 * yo, 1 + xi + 256 * xo] +
                        inp[yi + 32 * yo, 2 + xi + 256 * xo]) / 3.0
            for yi in seq(0, 32):
                for xi in seq(0, 256):
                    blur_y[yi + 32 * yo, xi + 256 * xo] = (
                        blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo] +
                        blur_x[1 + yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo] +
                        blur_x[2 + yi + 32 * yo - 32 * yo, xi + 256 * xo - 256 * xo]) / 3.0
```

## Incorrect Output
This output is partially correct: it manages to reduce the height dimension from `H+2` to `34`. However, it fails to reduce the memory usage in the width direction.
```python
def tile_and_fused_blur(W: size, H: size, blur_y: ui16[H, W] @ DRAM,
                        inp: ui16[H + 2, W + 2] @ DRAM):
    assert H % 32 == 0
    assert W % 256 == 0
    blur_x: ui16[34, W] @ DRAM
    for yo in seq(0, H / 32):
        for xo in seq(0, W / 256):
            for yi in seq(0, 34):
                for xi in seq(0, 256):
                    blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo] = (
                        inp[yi + 32 * yo, xi + 256 * xo] +
                        inp[yi + 32 * yo, 1 + xi + 256 * xo] +
                        inp[yi + 32 * yo, 2 + xi + 256 * xo]) / 3.0
            for yi in seq(0, 32):
                for xi in seq(0, 256):
                    blur_y[yi + 32 * yo, xi + 256 * xo] = (
                        blur_x[yi + 32 * yo - 32 * yo, xi + 256 * xo] +
                        blur_x[1 + yi + 32 * yo - 32 * yo, xi + 256 * xo] +
                        blur_x[2 + yi + 32 * yo - 32 * yo, xi + 256 * xo]) / 3.0
```

---

## Solution

To understand the bug, let's insert print statements in these places:

```python
print(xo_loop)
loops_to_lower_allocation_into = get_loops_at_or_above(xo_loop)
for i, loop in enumerate(loops_to_lower_allocation_into):
    print(i, loop)
...
```

The `xo_loop` points to:
```python
for yo in seq(0, H / 32):
    for xo in seq(0, W / 256):  # <-- NODE
        ...
```

And on the first (and only) iteration, `loop` points to:
```python
for yo in seq(0, H / 32): # <-- NODE
    for xo in seq(0, W / 256):
        ...
```

This reveals the bug in `get_loops_at_or_above`: the list it returns contains only the loops *above* `xo_loop` (here, just the `yo` loop), not `xo_loop` itself.

To fix this bug, change `loops = []` to `loops = [cursor]` in the implementation of `get_loops_at_or_above`, as in the sketch below.
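Since the implementation of `get_loops_at_or_above` lives in `quiz3.py`, which this diff does not show, here is a hedged sketch of what the fixed helper could look like. It assumes an exo-style cursor API in which `.parent()` walks up the tree (as used in `quiz1.py`) and for-loop nodes are `ForCursor` instances; the loop-collection logic and the `ForCursor` name are assumptions, not the actual code:

```python
def get_loops_at_or_above(cursor):
    # The fix: seed the list with the starting cursor itself
    # (the buggy version used `loops = []`, dropping xo_loop).
    loops = [cursor]
    # Walk upward, collecting every enclosing for-loop.
    node = cursor.parent()
    while isinstance(node, ForCursor):
        loops.append(node)
        node = node.parent()
    return loops
```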