Slightly better neon code. #3

Open
wants to merge 1 commit into
base: master


@lemire lemire commented Mar 26, 2018

I think that this will be faster. Feel free to drop this PR.

@skeeto
Owner

skeeto commented Mar 29, 2018 via email

@lemire
Author

lemire commented Mar 30, 2018

I can give you access to a server-class AMD Softiron 1000. I think that @vielmetti might also be able to get you access to server-class ARM hardware.

I also get a warning about dereferencing a type-punned pointer (e.g.
strict aliasing) in the return expression of is_zero(). This suggests to
me that that particular lane access is not valid and the compiler isn't
obligated to ensure it will work correctly.

This index access code is used in production all over (including at Google and Apple). But you can use vget_lane_u64(result,0) == 0 if you prefer.
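
As a minimal sketch of the intrinsic-based spelling mentioned above (not necessarily the exact code in this PR): assuming the input is four 32-bit words loaded into a `uint32x4_t`, and that the narrowing step is `vqmovn_u64`, the 64-bit result can be read with `vget_lane_u64` instead of indexing the vector. The name `is_zero_narrow` and the scalar fallback for non-AArch64 builds are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Returns 1 if all 128 bits at p are zero, 0 otherwise (NEON path). */
static int is_zero_narrow(const uint32_t p[4])
{
    uint32x4_t v = vld1q_u32(p);
    /* Saturating narrow 2x64 -> 2x32: a non-zero lane stays non-zero. */
    uint32x2_t narrowed = vqmovn_u64(vreinterpretq_u64_u32(v));
    /* Read the packed 64-bit result through an intrinsic, not an index. */
    return vget_lane_u64(vreinterpret_u64_u32(narrowed), 0) == 0;
}
#else
/* Hypothetical scalar fallback so the sketch also builds elsewhere. */
static int is_zero_narrow(const uint32_t p[4])
{
    uint64_t lo, hi;
    memcpy(&lo, p, sizeof lo);      /* words 0 and 1 */
    memcpy(&hi, p + 2, sizeof hi);  /* words 2 and 3 */
    return (lo | hi) == 0;
}
#endif
```

Going through `vget_lane_u64` keeps the access inside the intrinsics API, which should avoid the type-punning warning without changing the generated code much.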

Before your change it takes 2.26s, and with your change it takes 2.40s.

Let us look at the assembly. For my proposal, we get

        uqxtn   v0.2s, v0.2d
        fmov    x0, d0
        cmp     x0, 0
        cset    w0, eq

Your approach gives (GCC 6.3)

        umov    w0, v0.s[0]
        cbz     w0, .L4
.L6:
        mov     w0, 0
        ret
.L4:
        umov    w0, v0.s[1]
        cbnz    w0, .L6
        umov    w0, v0.s[2]
        cbnz    w0, .L6
        umov    w0, v0.s[3]
        cmp     w0, 0
        cset    w0, eq
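
For context, source along the following lines could plausibly compile to a branchy per-lane sequence like the one above. This is a guess at the shape of the code, not the repository's actual function: the name `is_zero_lanes`, the array parameter, and the scalar fallback for non-AArch64 builds are all assumptions.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Returns 1 if all four 32-bit lanes are zero, testing one lane at a time. */
static int is_zero_lanes(const uint32_t p[4])
{
    uint32x4_t v = vld1q_u32(p);
    /* Each lane read is a vector-to-general-register move; the
       short-circuit && lets the compiler branch out early. */
    return vgetq_lane_u32(v, 0) == 0
        && vgetq_lane_u32(v, 1) == 0
        && vgetq_lane_u32(v, 2) == 0
        && vgetq_lane_u32(v, 3) == 0;
}
#else
/* Hypothetical scalar fallback so the sketch also builds elsewhere. */
static int is_zero_lanes(const uint32_t p[4])
{
    return p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] == 0;
}
#endif
```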

The main question is how fast uqxtn is... This obviously depends on your specific ARM hardware. On my AMD Softiron uqxtn has a throughput of 1 instruction per cycle, so it might be hard to beat.
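
To see why a single saturating narrow is enough for the zero test, here is a scalar model of what `uqxtn` does to the two 64-bit lanes. It is only an illustration of the semantics; the function names are made up for this sketch.

```c
#include <stdint.h>

/* Scalar model of one lane of uqxtn: clamp a 64-bit value to 32 bits. */
static uint32_t saturating_narrow_u64(uint64_t x)
{
    return x > UINT32_MAX ? UINT32_MAX : (uint32_t)x;
}

/* Model of the whole test: narrow both 64-bit halves, pack them into
   one 64-bit word, and compare that word with zero. */
static int is_zero_model(uint64_t lo, uint64_t hi)
{
    uint64_t packed = (uint64_t)saturating_narrow_u64(lo)
                    | ((uint64_t)saturating_narrow_u64(hi) << 32);
    return packed == 0;
}
```

The saturation matters: a plain truncating narrow would map a lane such as `1ULL << 32` to zero and report a false positive, whereas the saturating version maps it to `UINT32_MAX`.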

I wrote a benchmark that tests specifically this zero-test function:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/extra/neon/iszero/iszero.c
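
The linked file is the authoritative benchmark. As a rough idea of what a density-driven harness of this kind might look like, the sketch below fills a buffer so that each word is non-zero with a given probability and then counts all-zero 128-bit blocks. Every name here (`fill_with_density`, `is_zero_block`, `count_zero_blocks`), the LCG constants, and the scalar stand-in for the function under test are assumptions for illustration, not the benchmark's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: make each 32-bit word non-zero with probability
   `density`, using a small LCG so that runs are repeatable. */
static void fill_with_density(uint32_t *buf, size_t nwords, double density)
{
    uint32_t state = 12345u;
    for (size_t i = 0; i < nwords; i++) {
        state = state * 1664525u + 1013904223u;
        double u = (state >> 8) / 16777216.0;  /* uniform in [0, 1) */
        buf[i] = (u < density) ? 1u : 0u;
    }
}

/* Scalar stand-in for the function under test (one 128-bit block). */
static int is_zero_block(const uint32_t *p)
{
    return (p[0] | p[1] | p[2] | p[3]) == 0;
}

/* Count all-zero blocks; returning the count gives the loop an
   observable result so the compiler cannot discard the work. */
static size_t count_zero_blocks(const uint32_t *buf, size_t nwords)
{
    size_t count = 0;
    for (size_t i = 0; i + 4 <= nwords; i += 4)
        count += (size_t)is_zero_block(buf + i);
    return count;
}
```

A timing run would wrap `count_zero_blocks` in whatever cycle counter or clock the platform offers and repeat it for each density value.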

Here are my results on a Softiron 1000 server; your approach is clearly slower. Could you check what you get?

$ clang -O3 -o iszero iszero.c && ./iszero
density = 0.000001
rdtsc_overhead set to 0
run_is_zero(buffer,N)                                       	:  689.00000  (clock units)  per operation (best) 	759.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  997.00000  (clock units)  per operation (best) 	1002.00000  (clock units)  per operation (avg)

density = 0.000002
run_is_zero(buffer,N)                                       	:  664.00000  (clock units)  per operation (best) 	669.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  992.00000  (clock units)  per operation (best) 	995.00000  (clock units)  per operation (avg)

density = 0.000004
run_is_zero(buffer,N)                                       	:  661.00000  (clock units)  per operation (best) 	662.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000008
run_is_zero(buffer,N)                                       	:  660.00000  (clock units)  per operation (best) 	662.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000015
run_is_zero(buffer,N)                                       	:  659.00000  (clock units)  per operation (best) 	661.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000031
run_is_zero(buffer,N)                                       	:  659.00000  (clock units)  per operation (best) 	661.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.000061
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000122
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000244
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	657.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000488
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	657.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.000977
run_is_zero(buffer,N)                                       	:  659.00000  (clock units)  per operation (best) 	660.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.001953
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	660.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.003906
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	661.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.007812
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	660.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	992.00000  (clock units)  per operation (avg)

density = 0.015625
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	994.00000  (clock units)  per operation (avg)

density = 0.031250
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	662.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.062500
run_is_zero(buffer,N)                                       	:  660.00000  (clock units)  per operation (best) 	663.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

density = 0.125000
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	659.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.250000
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	659.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	993.00000  (clock units)  per operation (avg)

density = 0.500000
run_is_zero(buffer,N)                                       	:  658.00000  (clock units)  per operation (best) 	659.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  990.00000  (clock units)  per operation (best) 	990.00000  (clock units)  per operation (avg)

density = 1.000000
run_is_zero(buffer,N)                                       	:  657.00000  (clock units)  per operation (best) 	658.00000  (clock units)  per operation (avg)
run_is_zero_long(buffer,N)                                  	:  991.00000  (clock units)  per operation (best) 	991.00000  (clock units)  per operation (avg)

@lemire
Author

lemire commented Mar 30, 2018

I obviously have no vested interest in you merging this, but I do have an interest in figuring out what provides the best performance. So I'd be pleased if you could run my benchmark on your favorite system.

@vielmetti

Here are the results on the Packet Type 2A server (Cavium ThunderX, which is not known to have a particularly wonderful NEON implementation).

density = 0.000001 
rdtsc_overhead set to 2
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2231.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2095.00000  (clock units)  per operation (avg) 

density = 0.000002 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2224.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000004 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

density = 0.000008 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2094.00000  (clock units)  per operation (avg) 

density = 0.000015 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2090.00000  (clock units)  per operation (avg) 

density = 0.000031 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000061 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

density = 0.000122 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000244 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.000488 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2082.00000  (clock units)  per operation (best) 	2091.00000  (clock units)  per operation (avg) 

density = 0.000977 
run_is_zero(buffer,N)                                       	:  2213.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.001953 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.003906 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2228.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

density = 0.007812 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2094.00000  (clock units)  per operation (avg) 

density = 0.015625 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.031250 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.062500 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2222.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.125000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2220.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 0.250000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2088.00000  (clock units)  per operation (avg) 

density = 0.500000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2092.00000  (clock units)  per operation (avg) 

density = 1.000000 
run_is_zero(buffer,N)                                       	:  2214.00000  (clock units)  per operation (best) 	2223.00000  (clock units)  per operation (avg) 
run_is_zero_long(buffer,N)                                  	:  2083.00000  (clock units)  per operation (best) 	2089.00000  (clock units)  per operation (avg) 

bogus 46313650 

@lemire
Author

lemire commented Mar 30, 2018

@vielmetti Interesting.
