Adding advanced FAQ #47

Open · wants to merge 12 commits into `main`
Conversation

@EmilyWebber commented Jan 27, 2025

Issue #, if available:

Adding an advanced FAQ guide

Description of changes:

Description of common issues and how to resolve them

Testing:

Please see detailed unit test requirements in the CONTRIBUTING.md

  • The change is covered by numeric check using nki.baremetal
  • The change is covered by performance benchmark test using nki.benchmark
  • The change is covered by end-to-end integration test

Pull Request Checklist

  • I have filled in all the required fields in the template
  • I have tested locally that all the tests pass
  • By submitting this pull request, I confirm that my contribution is made under the terms of the MIT-0 license.

@emeryberger's comment was marked as resolved.

…instead of pre-declaring and using `var[...]`.

### When should I use the `+=` operator?
**Contributor** commented:

This can be removed. `+=`, `/=`, `-=`, and `*=` now work across sbuf and psum. The fix has not yet been released, but the issue is resolved and folks can go back to using this syntax.

Any other use of `+=` may trigger compiler bugs. Please do not use `+=` outside of matrix multiplication.

### What's the difference between `affine_range` and `sequential_range`?
- [`nl.affine_range`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.language.affine_range.html#nki.language.affine_range) creates a sequence of numbers for use as parallel loop iterators in NKI. This means the compiler will unroll your loop, identify the parts that can be parallelized, and then run them in parallel. For this reason, `affine_range` should be the default loop iterator choice when there is no loop-carried dependency; that is, you should only use `nl.affine_range` when there is no dependency between steps in your loop, i.e., when the steps can be executed in any order. Please note that associative reductions are not considered loop-carried dependencies in this context. A concrete example of an associative reduction is multiple `nl.matmul` or `nisa.nc_matmul` calls accumulating into the same output buffer defined outside of this loop level.
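As a plain-Python analogy (not NKI code) of the dependence rule, the first loop below has no loop-carried dependency, so an `affine_range`-style iterator could run its iterations in any order; the second is a true recurrence and would need `sequential_range`:

```python
# Plain-Python analogy (not NKI code) of the dependence rule behind
# nl.affine_range vs nl.sequential_range.

# Parallel-safe loop: each iteration writes its own slot and never reads
# another iteration's result, so iterations could run in any order.
squares = [0] * 8
for i in range(8):
    squares[i] = i * i

# Loop-carried recurrence: each iteration reads the previous iteration's
# result, so iterations must run in order. (Simple associative reductions,
# like accumulating matmuls into one buffer, are exempt in NKI.)
chain = [0] * 8
for i in range(1, 8):
    chain[i] = chain[i - 1] * 2 + 1

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(chain)    # [0, 1, 3, 7, 15, 31, 63, 127]
```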
**Contributor** commented:

At the moment, we should encourage everyone to use built-in range.

Nit on the affine_range definition.

Maybe integrate something like:

"When the compiler unrolls a loop specified by nl.affine_range, each iteration is treated as an independent slice that can be run concurrently without interacting with any other iterations. This type of loop works well for automatically writing simple vectorized loops when it does not contain loop-carried dependencies."

"When the compiler unrolls a loop specified by nl.sequential_range, each iteration is unrolled in order. In addition to supporting loop-carried dependencies, this loop type works well for manually writing vectorized loops."

**Contributor** commented:

Or suggest that `affine_range` is like OpenMP's `parallel for`.


Hi team, I would like to ask whether the code inside `affine_range` runs in parallel on the two cores of trn1. Compared to `nl.sequential_range`, does it introduce additional communication overhead?

**Author** commented:

Hey Dinghong! I think it's more about parallelizing execution across available engines, not necessarily across cores (but I am curious to know if others disagree).

Generally `affine_range` should give much better perf due to parallelizing.

The static range is just for getting started and debugging.



Both options can work, but option 2 will be more efficient: it lets the `nl.sequential_range()` turn back into `nl.affine_range()` and gets better throughput, since the loop iterations don't need to wait for a shared chunk of memory to be updated. Tensor `w_temp` is not used outside of the loop, so allocating it outside the loop just adds an unnecessary loop-carried dependency for the compiler to figure out.
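A plain-Python sketch of why this matters (not the actual kernel; `option1`, `option2`, and the data shapes are illustrative): hoisting the temporary out of the loop makes every iteration touch the same buffer, while allocating it inside keeps iterations independent.

```python
def option1(chunks):
    # Option 1: temp allocated once outside the loop and reused, which
    # makes every iteration read/write the same shared buffer -- a
    # loop-carried dependency that forces sequential execution.
    w_temp = [0] * 4
    results = []
    for chunk in chunks:
        for j in range(4):
            w_temp[j] = chunk[j] * 2   # every iteration reuses w_temp
        results.append(sum(w_temp))
    return results

def option2(chunks):
    # Option 2: temp allocated fresh inside the loop, so each iteration
    # is independent and an affine_range-style iterator could apply.
    results = []
    for chunk in chunks:
        w_temp = [c * 2 for c in chunk]  # private to this iteration
        results.append(sum(w_temp))
    return results

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(option1(data))  # [20, 52]
print(option1(data) == option2(data))  # True: same result, different dependence structure
```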
**Contributor** commented:

Nit: "can work, but"

**Solution:**
- Check tensor dimensions against hardware limits
- Verify tile sizes are within SBUF partition size
- Consider breaking large tensors into smaller tiles
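A minimal sketch of such a pre-flight shape check; the limit constants below are placeholders for illustration, not real hardware values (consult the Neuron documentation for the actual SBUF and tile limits on your target):

```python
# Hypothetical pre-flight check for tile shapes. The limit values below
# are assumed placeholders, not real hardware constants.
MAX_PARTITIONS = 128   # assumed partition-dimension limit
MAX_FREE_ELEMS = 512   # assumed free-dimension limit per tile

def check_tile(shape):
    """Return a list of problems with a (partition, free) tile shape."""
    problems = []
    partition_dim, free_dim = shape
    if partition_dim > MAX_PARTITIONS:
        problems.append(
            f"partition dim {partition_dim} exceeds limit {MAX_PARTITIONS}; "
            "break the tensor into smaller tiles")
    if free_dim > MAX_FREE_ELEMS:
        problems.append(
            f"free dim {free_dim} exceeds limit {MAX_FREE_ELEMS}; "
            "tile along the free dimension")
    return problems

print(check_tile((128, 512)))       # [] -> within the assumed limits
print(len(check_tile((256, 1024))))  # 2 -> both dims over the assumed limits
```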
**Contributor** commented:

  • If you are using masks, be sure you're also propagating them across all relevant compute operations.

**Common causes:**
- Using very small tiles for DMA operations
- Complex nested loops with small tile sizes
- Excessive unrolling
**Contributor** commented:

`static_range` will also cause this.

**Contributor** commented:

Instead of saying "Excessive unrolling", should we say "huge trip count in NKI loops"?
