Slightly better neon code. #3
I'm running this command on my ODROID-C2:
./mandel.arm64 -w 6000 -h 6000 > /dev/null
Before your change it takes 2.26s, and with your change it takes 2.40s.
So it's slower for me.
I also get a warning about dereferencing a type-punned pointer (i.e. a
strict-aliasing violation) in the return expression of is_zero(). This
suggests to me that that particular lane access is not valid and the
compiler isn't obligated to ensure it will work correctly. I'm not seeing
much documentation about this, though, so I don't know. The output is
bit-for-bit identical, but that could just be luck.
That said, I do like your change to the c0123 initialization. It's
valid, clean, _and_ has no performance impact. I'll definitely merge
that change even if we can't sort out is_zero.
I can give you access to a server-class AMD Softiron 1000. I think that @vielmetti might also be able to get you access to server-class ARM hardware.
This index access code is used in production all over (including at Google and Apple). But you can use vget_lane_u64(result,0) == 0 if you prefer.
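To make the two options concrete, here is a minimal sketch of zero tests that avoid the pointer cast. The names `is_zero_intrinsic`, `is_zero_memcpy` and `check_is_zero` are hypothetical, and the `uint64x1_t` parameter type is an assumption, since the real `is_zero()` from this PR is not reproduced in the thread. The NEON variant is compiled only on ARM targets; the `memcpy` variant is a portable stand-in that shows the same idea on lanes stored in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>

/* Assumed signature: the actual is_zero() in the PR is not shown here. */
static int is_zero_intrinsic(uint64x1_t result)
{
    /* Lane read through the intrinsic: no pointer cast is involved,
       so there is nothing for the strict-aliasing warning to flag. */
    return vget_lane_u64(result, 0) == 0;
}
#endif

/* Portable stand-in: test two 32-bit lanes held in memory as one
   64-bit value. memcpy is the well-defined way to reinterpret bytes. */
static int is_zero_memcpy(const uint32_t lanes[2])
{
    uint64_t bits;
    memcpy(&bits, lanes, sizeof bits);
    return bits == 0;
}

/* Hypothetical self-check for the variants above. */
static void check_is_zero(void)
{
    const uint32_t all_zero[2] = {0, 0};
    const uint32_t one_set[2]  = {0, 1};

    assert(is_zero_memcpy(all_zero));
    assert(!is_zero_memcpy(one_set));
#if defined(__aarch64__) || defined(__ARM_NEON)
    assert(is_zero_intrinsic(vdup_n_u64(0)));
    assert(!is_zero_intrinsic(vdup_n_u64(1)));
#endif
}
```

Whether either form compiles to the same instructions as the index access depends on the compiler and flags, so it is worth checking the generated assembly rather than assuming.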
Let us look at the assembly... for my proposal, we get
Your approach has (GCC 6.3)
The main question is how fast each version is. I wrote a benchmark that tests specifically this zero-test function. Here are my results on a Softiron 1000 server; your approach is clearly slower... could you check what you get?
I obviously have no vested interest in you merging this, but I do have an interest in figuring out what provides the best performance. So I'd be pleased if you could run my benchmark on your favorite system.
Here are the results on the Packet Type 2A server (Cavium ThunderX, which is not known to have a particularly wonderful NEON implementation).
@vielmetti Interesting.
I think that this will be faster. Feel free to drop this PR.