Replies: 2 comments
-
@ispobock Hey Ke, could you help with this?
-
It's in the original implementation, and it is also mentioned in the paper.
-
Hi, I am confused: there is a layer normalization between the down-projection and up-projection of Q. However, this layer normalization is not shown in the DeepSeek-V2 paper.
Here is the code in sglang
Here is the formula in the paper
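
To make the question concrete, here is a minimal sketch of the query path being described, with a normalization placed between the down-projection and the up-projection of Q. It is not the sglang code or the paper's formula. The function names (`matvec`, `rms_norm`, `project_query`), the toy dimensions, and the choice of RMSNorm as the normalization are all assumptions made for illustration.

```python
import math
import random


def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def rms_norm(x, weight, eps=1e-6):
    """Rescale x to roughly unit root-mean-square, then apply a per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, weight)]


def project_query(h, w_down, norm_weight, w_up):
    """Hypothetical low-rank query path: down-project, normalize, up-project."""
    c_q = matvec(h, w_down)           # compressed latent for the query
    c_q = rms_norm(c_q, norm_weight)  # the normalization the question asks about
    return matvec(c_q, w_up)          # expand back to the query dimension


def rand_matrix(rows, cols):
    """Random Gaussian matrix, used only as toy weights for this sketch."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, q_rank, d_q = 8, 4, 6  # toy sizes, not the real model's
h = [random.gauss(0.0, 1.0) for _ in range(d_model)]
w_down = rand_matrix(d_model, q_rank)
w_up = rand_matrix(q_rank, d_q)
norm_weight = [1.0] * q_rank

q = project_query(h, w_down, norm_weight, w_up)
q_scaled = project_query([10.0 * v for v in h], w_down, norm_weight, w_up)
```

In this sketch, `q` and `q_scaled` come out almost identical even though the second input is ten times larger, because the normalization fixes the scale of the compressed latent before the up-projection.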