Fix byteswap for BF16 tensor on big-endian machine #452
Conversation
@Narsil Do you have time to look at this PR?
@Narsil Would it be possible to look at this PR?
Can we kindly ask someone to review and merge this PR?
Hi @kiszk, Sorry about the delay, I haven't been able to check every place I'm being pinged on (and far from it). I understand the issue and the fix.
I'll try to take a look at how I can improve that solution if you're ok with my assumptions.
I am guessing the existing test didn't catch this because it used zeros instead of randn, and zero has the same byte representation in bfloat16 regardless of byte order :( Therefore we could just update that test.
Managed to trigger the issue https://github.com/huggingface/safetensors/actions/runs/10109667229/job/27958053499?pr=507 on old tests with random values.
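To illustrate why a zeros-only test stays green, here is a minimal sketch (not from the safetensors test suite); numpy has no bfloat16, so `uint16` stands in for the raw 2-byte bf16 payload:

```python
import numpy as np

# Zero bytes are invariant under a byteswap, so a zeros-based test passes
# even if the byteswap is wrong or missing entirely.
zeros = np.zeros(4, dtype=np.uint16)
assert np.array_equal(zeros, zeros.byteswap())

# Random payloads are (almost surely) changed by a byteswap, so randn-style
# data exposes a missing or incorrect swap.
rand = np.random.randint(1, 2**16, size=4, dtype=np.uint16)
print(rand, "->", rand.byteswap())
```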
@Narsil, thank you for your response while you were super busy. Your points are correct.
Regarding the testing point, I think that this test uses zeros. I will work to update some of the tests if you are still super busy.
@kiszk Can you check the last implementation? I think the numpy byteswap can still work, since the bf16 byteswap and the f16 byteswap are the same. The only things needed are those changes: no issue with the torch version, no clone, faster byteswap. Could you confirm the branch works for you? (All the tests should show the issue now.)
Sure, I will check the latest main soon. |
The latest main branch does not work for me. As you said, the byteswap itself is the same for f16 and bf16. IMHO, the problem may come from the data conversions from bf16 to f16 and from f16 back to bf16. We have to do the byteswap without any data conversion. In detail, on a big-endian platform, an element of a bf16 tensor initially holds an uninterpretable value, since its bytes are bf16 in little-endian order. In the current implementation, that uninterpretable value is interpreted as bf16 and then converted from bf16 to f16; the resulting f16 value is meaningless. It is then byteswapped for big-endian in numpy. After the byteswap in numpy, the value is still not correct as bf16 on a big-endian platform.
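To illustrate the point above, here is a minimal sketch (assuming a little-endian host and PyTorch installed): a bf16 -> f16 conversion is a value conversion with a different exponent/mantissa split, not a byte reinterpretation, whereas a plain 16-bit byteswap on the raw payload is format-agnostic and round-trips the data:

```python
import torch

t = torch.tensor([1.5, -2.25, 3.0], dtype=torch.bfloat16)

# Value conversion changes the bit pattern.
bits_bf16 = t.view(torch.int16)                    # same 2 bytes, reinterpreted
bits_f16 = t.to(torch.float16).view(torch.int16)   # converted, different bits
print(bits_bf16.tolist())
print(bits_f16.tolist())

# A plain 16-bit byteswap on the raw payload is the same for f16 and bf16 and
# is its own inverse, so it round-trips the bf16 data without any conversion.
roundtrip = torch.from_numpy(
    bits_bf16.numpy().byteswap().byteswap().copy()
).view(torch.bfloat16)
assert torch.equal(roundtrip, t)
```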
@Narsil If I am mistaken, could you please let me know?
Superseded by #507. Basically, the issue is that the previous code was doing a value conversion between bf16 and f16 around the numpy byteswap instead of just reinterpreting the raw bytes.
Closing this assuming the bug is fixed. Let's reopen something if it's not.
What does this PR do?
This PR fixes an incorrect byteswap of a BF16 tensor's data on a big-endian machine. It uses `Storage.byteswap(datatype)` in PyTorch instead of `numpy.byteswap()`, because numpy does not support BF16; the unnecessary data conversions between BF16 and F16 caused incorrect results.

Fixes #448
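For illustration, a hedged sketch of what the described approach could look like in user code (not the PR's actual diff); the `untyped_storage()` accessor and the exact `byteswap(dtype)` signature are assumptions that depend on the installed PyTorch version:

```python
import torch

def byteswap_bf16_(tensor: torch.Tensor) -> torch.Tensor:
    """Swap the byte order of a BF16 tensor in place, with no dtype conversion."""
    assert tensor.dtype == torch.bfloat16
    # Storage.byteswap works on the raw storage and only needs the element
    # width implied by the dtype, so no bf16 <-> f16 value conversion is
    # involved and numpy (which has no BF16) is not needed at all.
    tensor.untyped_storage().byteswap(torch.bfloat16)
    return tensor
```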