_posts/2024-06-20-accelerating-neural-network-training.md (+2 −2)
@@ -187,7 +187,7 @@ For our experiments, the ViT-L model is trained on ImageNet for 125k steps using
 </td>
 </tr>
 <tr>
-<td>40% sparse (40k sparse -> 85k dense steps)
+<td>40% sparse (50k sparse -> 75k dense steps)
 </td>
 <td><strong>82.9</strong>
 </td>
@@ -242,7 +242,7 @@ There are several areas of expansion for this work:
 * **Expansion to new sparsity patterns:** Researchers have created new sparsity patterns like [V:N:M](https://arxiv.org/pdf/2310.02065) sparsity that use the underlying semi-structured sparse kernels to allow for more flexibility. This is especially interesting for applying sparsity to LLMs, as 2:4 sparsity degrades accuracy too much, but we have seen some positive [results](https://arxiv.org/pdf/2310.06927) for more general N:M patterns.
 * **Performance optimizations for sparse fine-tuning:** This post covers sparse training from scratch, but oftentimes we want to fine-tune a foundation model. In this case, a static mask may be sufficient to preserve accuracy, which would enable us to make additional performance optimizations.
 * **More experiments on pruning strategy:** We calculate the mask at every training step, but calculating the mask every n steps may yield better training accuracy. Overall, figuring out the best strategy for using semi-structured sparsity during training is an open area of research.
-* **Compatibility with fp8:** The hardware also supports fp8 semi-structured sparsity (in the 4:8 format instead of 2:4), and this approach should work similarly with fp8 in principle. In practice, we would need to write similar sparsification kernels, and could possibly fuse them with the scaling of the tensors.
+* **Compatibility with fp8:** The hardware also supports fp8 semi-structured sparsity, and this approach should work similarly with fp8 in principle. In practice, we would need to write similar sparsification kernels, and could possibly fuse them with the scaling of the tensors.
 * **Activation sparsity:** Efficient sparsification kernels also make it possible to sparsify the activations during training. Because the sparsification overhead grows linearly with the size of the sparsified matrix, setups where the activation tensors are large compared to the weight tensors could benefit more from activation sparsity than from weight sparsity. Furthermore, activations are naturally sparse because of the use of ReLU or GELU activation functions, reducing accuracy degradation.

 If you are interested in these problems, please feel free to open an issue / PR in [torchao](https://github.com/pytorch/ao), a community we’re building for architecture optimization techniques like quantization and sparsity. Additionally, if you have a general interest in sparsity, please reach out in [CUDA-MODE](discord.gg/cudamode) (#sparsity).
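
To make the sparse fine-tuning and pruning-strategy bullets above more concrete, here is a minimal sketch of 2:4 magnitude pruning in plain PyTorch, showing both a static mask (compute once, fine-tune with it) and a mask refreshed every n steps. This is not the fused torchao kernel path; the `compute_24_mask` helper, the loop, and the shapes are illustrative assumptions, and the same topk-per-group-of-4 idea extends naturally to more general N:M patterns.

```python
import torch
import torch.nn.functional as F

def compute_24_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every contiguous group of 4 input columns."""
    out_f, in_f = weight.shape
    groups = weight.detach().abs().reshape(out_f, in_f // 4, 4)
    keep = groups.topk(2, dim=-1).indices           # top-2 indices per group of 4
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(out_f, in_f)

linear = torch.nn.Linear(128, 256)
mask = compute_24_mask(linear.weight)               # static mask: computed once for fine-tuning
refresh_every_n = None                              # set to e.g. 100 to recompute the mask periodically

for step in range(10):
    if refresh_every_n is not None and step % refresh_every_n == 0:
        mask = compute_24_mask(linear.weight)
    x = torch.randn(32, 128)
    out = F.linear(x, linear.weight * mask, linear.bias)  # mask applied in the forward pass
    out.sum().backward()                                  # gradients flow only to the kept weights
    # ... optimizer.step(), optimizer.zero_grad(), etc. ...
```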
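
Similarly, a rough sketch of the activation-sparsity idea: since ReLU/GELU outputs are already mostly (near-)zero, the 2:4 selection can be imposed on the activation tensor instead of the weights. Shapes and names are illustrative; a real implementation would emit the compressed 2:4 representation from a fused kernel rather than a dense masked tensor.

```python
import torch
import torch.nn.functional as F

batch, features = 32, 128
act = F.relu(torch.randn(batch, features))          # ReLU output is already mostly zero

# Keep the 2 largest-magnitude values in every contiguous group of 4 features.
groups = act.reshape(batch, features // 4, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
act_24 = (groups * mask).reshape(batch, features)

# In the fast path, a fused sparsification kernel would hand the compressed 2:4
# representation directly to the sparse matmul instead of materializing act_24.
```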