[Enhancement] KV Caching for inference speed #110
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), tensorflow (Related to TensorFlow), tests (Related to tests)
Is your feature request related to a problem? Please describe.
During autoregressive decoding, the self-attention mechanism recomputes the key and value matrices for every previously generated token at each step, so generating n tokens costs on the order of n^2 key/value projections. Caching the key and value matrices computed at earlier steps would cut this redundant computation and improve inference speed.
Describe the solution you'd like
Store the key and value projections for tokens that have already been processed, and at each decoding step compute projections only for the newly generated token, appending them to the cache. This removes the need to recompute the full key and value matrices in every iteration of the decoding process, leading to faster inference.
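A minimal sketch of one cached attention step in TensorFlow, assuming a single attention head; the function name `attend_with_cache`, the weight tensors, and the dict-based cache layout are illustrative assumptions, not this project's actual API:

```python
import tensorflow as tf

def attend_with_cache(x_t, w_q, w_k, w_v, cache):
    """One decoding step of single-head attention with a KV cache.

    x_t:   [batch, 1, d_model] embedding of the newly generated token
    cache: dict holding "k" and "v" tensors of shape [batch, t, d_head]
           from previous steps, or None on the first step
    """
    # Project only the new token; past keys/values come from the cache.
    q     = tf.einsum("btd,dh->bth", x_t, w_q)   # [batch, 1, d_head]
    k_new = tf.einsum("btd,dh->bth", x_t, w_k)   # [batch, 1, d_head]
    v_new = tf.einsum("btd,dh->bth", x_t, w_v)   # [batch, 1, d_head]

    if cache is None:
        k, v = k_new, v_new
    else:
        # Append one row instead of recomputing all past projections.
        k = tf.concat([cache["k"], k_new], axis=1)  # [batch, t+1, d_head]
        v = tf.concat([cache["v"], v_new], axis=1)

    d_head = tf.cast(tf.shape(q)[-1], q.dtype)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_head)  # [batch, 1, t+1]
    out = tf.matmul(tf.nn.softmax(scores, axis=-1), v)            # [batch, 1, d_head]
    return out, {"k": k, "v": v}

# Toy decoding loop: the cache grows by one row per generated token.
batch, d_model, d_head = 2, 8, 8
w_q, w_k, w_v = (tf.random.normal([d_model, d_head]) for _ in range(3))
cache = None
for _ in range(4):
    x_t = tf.random.normal([batch, 1, d_model])  # stand-in for the new token's embedding
    out, cache = attend_with_cache(x_t, w_q, w_k, w_v, cache)
print(cache["k"].shape)  # (2, 4, 8): keys for all four generated tokens
```

Note that `tf.concat` reallocates the cache at every step; a production version would likely preallocate a `[batch, max_len, d_head]` buffer and write the new row in place.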