Adding advanced FAQ #47

Open · wants to merge 12 commits into `main`
Conversation

@EmilyWebber commented Jan 27, 2025

Issue #, if available:

Adding an advanced FAQ guide

Description of changes:

Description of common issues and how to resolve them

Testing:

Please see detailed unit test requirements in the CONTRIBUTING.md

  • The change is covered by numeric check using nki.baremetal
  • The change is covered by performance benchmark test using nki.benchmark
  • The change is covered by end-to-end integration test

Pull Request Checklist

  • I have filled in all the required fields in the template
  • I have tested locally that all the tests pass
  • By submitting this pull request, I confirm that my contribution is made under the terms of the MIT-0 license.

@emeryberger's comment was marked as resolved.

…instead of pre-declaring and using `var[...]`.

### When should I use the `+=` operator?
**Contributor** commented:

This can be removed. `+=`, `/=`, `-=`, and `*=` now work across sbuf and psum. The fix has not yet been released, but the issue is resolved and folks can go back to using this syntax.

Any other use of `+=` may trigger compiler bugs. Please do not use `+=` outside of matrix multiplication.

### What's the difference between `affine_range` and `sequential_range`?
- [`nl.affine_range`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.language.affine_range.html#nki.language.affine_range) creates a sequence of numbers for use as parallel loop iterators in NKI. This means the compiler will unroll your loop, identify the parts that can be parallelized, and then run them in parallel. For this reason, `affine_range` should be the default loop iterator choice when there is no loop-carried dependency; that is, you should only use `nl.affine_range` when there is no dependency between steps in your loop, i.e., when the steps can be executed in any order. Please note that associative reductions are not considered loop-carried dependencies in this context. A concrete example of an associative reduction is multiple `nl.matmul` or `nisa.nc_matmul` calls accumulating into the same output buffer defined outside of this loop level.
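As a plain-Python analogy (not NKI code) of the dependence rule, the first loop below has no loop-carried dependency, so an `affine_range`-style iterator could run its iterations in any order; the second is a true recurrence and would need `sequential_range`:

```python
# Plain-Python analogy (not NKI code) of the dependence rule behind
# nl.affine_range vs nl.sequential_range.

# Parallel-safe loop: each iteration writes its own slot and never reads
# another iteration's result, so iterations could run in any order.
squares = [0] * 8
for i in range(8):
    squares[i] = i * i

# Loop-carried recurrence: each iteration reads the previous iteration's
# result, so iterations must run in order. (Simple associative reductions,
# like accumulating matmuls into one buffer, are exempt in NKI.)
chain = [0] * 8
for i in range(1, 8):
    chain[i] = chain[i - 1] * 2 + 1

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(chain)    # [0, 1, 3, 7, 15, 31, 63, 127]
```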
**Contributor** commented:

At the moment, we should encourage everyone to use built-in range.

Nit on the affine_range definition.

Maybe integrate something like:

"When the compiler unrolls a loop specified by nl.affine_range, each iteration is treated as an independent slice that can be run concurrently without interacting with any other iterations. This type of loop works well for automatically writing simple vectorized loops when it does not contain loop-carried dependencies."

"When the compiler unrolls a loop specified by nl.sequential_range, each iteration is unrolled in order. In addition to supporting loop-carried dependencies, this loop type works well for manually writing vectorized loops."

**Contributor** commented:

Or suggest that `affine_range` is like OpenMP's `parallel for`.


Hi team, I would like to ask whether the code inside `affine_range` runs in parallel on the two cores of trn1. Compared to `nl.sequential_range`, does it introduce additional communication overhead?

**Author** commented:

Hey Dinghong! I think it's more about parallelizing execution across available engines, not necessarily across cores (but I am curious to know if others disagree).

Generally `affine_range` should give much better perf due to parallelizing.

The static range is just for getting started and debugging.



Both options can work, but option 2 will be more efficient: it lets the `nl.sequential_range()` turn back into `nl.affine_range()` and gets better throughput, since the loop iterations don't need to wait for a shared chunk of memory to be updated. Tensor `w_temp` is not used outside of the loop, so allocating it outside the loop just adds an unnecessary loop-carried dependency for the compiler to figure out.
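A plain-Python sketch of why this matters (not the actual kernel; `option1`, `option2`, and the data shapes are illustrative): hoisting the temporary out of the loop makes every iteration touch the same buffer, while allocating it inside keeps iterations independent.

```python
def option1(chunks):
    # Option 1: temp allocated once outside the loop and reused, which
    # makes every iteration read/write the same shared buffer -- a
    # loop-carried dependency that forces sequential execution.
    w_temp = [0] * 4
    results = []
    for chunk in chunks:
        for j in range(4):
            w_temp[j] = chunk[j] * 2   # every iteration reuses w_temp
        results.append(sum(w_temp))
    return results

def option2(chunks):
    # Option 2: temp allocated fresh inside the loop, so each iteration
    # is independent and an affine_range-style iterator could apply.
    results = []
    for chunk in chunks:
        w_temp = [c * 2 for c in chunk]  # private to this iteration
        results.append(sum(w_temp))
    return results

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(option1(data))  # [20, 52]
print(option1(data) == option2(data))  # True: same result, different dependence structure
```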
**Contributor** commented:

Nit: "can work, but"

**Solution:**
- Check tensor dimensions against hardware limits
- Verify tile sizes are within SBUF partition size
- Consider breaking large tensors into smaller tiles
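A minimal sketch of such a pre-flight shape check; the limit constants below are placeholders for illustration, not real hardware values (consult the Neuron documentation for the actual SBUF and tile limits on your target):

```python
# Hypothetical pre-flight check for tile shapes. The limit values below
# are assumed placeholders, not real hardware constants.
MAX_PARTITIONS = 128   # assumed partition-dimension limit
MAX_FREE_ELEMS = 512   # assumed free-dimension limit per tile

def check_tile(shape):
    """Return a list of problems with a (partition, free) tile shape."""
    problems = []
    partition_dim, free_dim = shape
    if partition_dim > MAX_PARTITIONS:
        problems.append(
            f"partition dim {partition_dim} exceeds limit {MAX_PARTITIONS}; "
            "break the tensor into smaller tiles")
    if free_dim > MAX_FREE_ELEMS:
        problems.append(
            f"free dim {free_dim} exceeds limit {MAX_FREE_ELEMS}; "
            "tile along the free dimension")
    return problems

print(check_tile((128, 512)))       # [] -> within the assumed limits
print(len(check_tile((256, 1024))))  # 2 -> both dims over the assumed limits
```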
**Contributor** commented:

  • If you are using masks, be sure you're also propagating them across all relevant compute operations.

**Common causes:**
- Using very small tiles for DMA operations
- Complex nested loops with small tile sizes
- Excessive unrolling
**Contributor** commented:

`static_range` will also cause this.

**Contributor** commented:

Instead of saying "Excessive unrolling", should we say "huge trip count in NKI loops"?
