ENH: Implement SIMD versions of isnan,isinf, isfinite and signbit #22165

Developer-Ecosystem-Engineering
Contributor

NumPy has SIMD versions of float / double isnan, isinf, isfinite, and signbit for SSE2 and AVX-512. The changes here replace the SSE2 version with one that uses universal intrinsics. This allows other architectures to have SIMD versions of the functions too.
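For readers unfamiliar with what these four predicates test, each one reduces to a mask-and-compare on the IEEE 754 bit pattern, which is why they vectorize well. The following is a minimal pure-Python sketch for binary64 values, not NumPy's implementation; the helper names (`float_bits`, `isnan_bits`, and so on) are hypothetical and exist only for illustration.

```python
import math
import struct

# IEEE 754 binary64 field masks.
SIGN_MASK = 0x8000_0000_0000_0000
EXP_MASK = 0x7FF0_0000_0000_0000
FRAC_MASK = 0x000F_FFFF_FFFF_FFFF


def float_bits(x: float) -> int:
    """Reinterpret a Python float (binary64) as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def isnan_bits(x: float) -> bool:
    b = float_bits(x)
    # NaN: exponent field all ones and a non-zero fraction.
    return (b & EXP_MASK) == EXP_MASK and (b & FRAC_MASK) != 0


def isinf_bits(x: float) -> bool:
    # Infinity: exponent all ones and fraction zero, with either sign.
    return (float_bits(x) & (EXP_MASK | FRAC_MASK)) == EXP_MASK


def isfinite_bits(x: float) -> bool:
    # Finite: exponent field is anything other than all ones.
    return (float_bits(x) & EXP_MASK) != EXP_MASK


def signbit_bits(x: float) -> bool:
    # The sign bit is set for negative values, including -0.0.
    return (float_bits(x) & SIGN_MASK) != 0


# Cross-check against the standard library on a few special values.
for v in [0.0, -0.0, 1.5, -2.0, float("inf"), float("-inf"), float("nan")]:
    assert isnan_bits(v) == math.isnan(v)
    assert isinf_bits(v) == math.isinf(v)
    assert isfinite_bits(v) == math.isfinite(v)
```

A SIMD kernel applies the same kind of test to several lanes at once; the exact instruction sequence used in the PR may differ from this bit-mask formulation.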

Apple M1: up to 3.4x faster

-      93.5±0.3μs       89.9±0.3μs     0.96  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 4, 'd')
-     65.9±0.09μs       62.9±0.1μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 2, 'f')
-     66.7±0.09μs       63.4±0.1μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 1, 'd')
-      43.4±0.5μs      40.2±0.06μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 1, 'f')
-      73.3±0.3μs       66.6±0.4μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 1, 'd')
-        85.0±1μs       77.2±0.4μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 4, 'd')
-      69.9±0.1μs       63.1±0.1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 1, 'd')
-      69.7±0.7μs      62.4±0.08μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 2, 'd')
-      72.1±0.1μs       64.3±0.2μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 4, 'f')
-      75.9±0.2μs       67.1±0.2μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 1, 'd')
-      81.8±0.9μs       72.4±0.3μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 2, 'd')
-      96.8±0.8μs       85.3±0.5μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 4, 'd')
-      72.6±0.3μs       63.3±0.3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 1, 'd')
-      75.2±0.5μs      65.0±0.03μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 2, 'd')
-        89.2±2μs       77.0±0.5μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 4, 'd')
-       101±0.9μs       86.1±0.3μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 4, 'd')
-      79.0±0.8μs       67.1±0.1μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 1, 'd')
-      75.5±0.3μs       63.3±0.2μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 1, 'd')
-      85.6±0.7μs       71.7±0.4μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 2, 'd')
-      79.5±0.8μs      66.4±0.07μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 2, 'd')
-         102±2μs       84.2±0.8μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 4, 'd')
-        94.4±3μs       77.5±0.4μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 4, 'd')
-        87.0±1μs      70.7±0.08μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 2, 'd')
-      71.3±0.1μs      57.9±0.05μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 2, 'f')
-      96.0±0.2μs       77.3±0.5μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 4, 'd')
-      76.1±0.3μs       61.2±0.7μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 4, 'f')
-     73.9±0.05μs       58.7±0.1μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 2, 'f')
-      82.7±0.8μs      65.1±0.03μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 2, 'd')
-        79.4±1μs      61.8±0.06μs     0.78  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 4, 'f')
-      65.9±0.6μs      50.8±0.03μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 2, 'f')
-      69.2±0.2μs       53.2±0.1μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 4, 'f')
-     81.9±0.09μs      62.2±0.05μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 4, 'f')
-      77.7±0.8μs       58.1±0.6μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 2, 'f')
-     64.6±0.05μs      46.7±0.04μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 2, 'f')
-      89.8±0.1μs       62.7±0.3μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 4, 'd')
-        92.3±1μs       64.2±0.4μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 4, 'd')
-      69.9±0.8μs       48.7±0.2μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 4, 'f')
-     74.1±0.09μs       49.9±0.1μs     0.67  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 1, 'd')
-      50.5±0.5μs       33.7±0.1μs     0.67  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 1, 'f')
-      42.2±0.3μs      28.0±0.01μs     0.66  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 1, 'f')
-        96.5±1μs       63.4±0.2μs     0.66  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 4, 'd')
-      74.8±0.6μs       48.4±0.5μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 4, 'f')
-      49.1±0.3μs       31.7±0.1μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 1, 'f')
-      76.5±0.1μs       49.1±0.6μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 4, 'f')
-     78.0±0.07μs       50.0±0.1μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 2, 'd')
-      74.9±0.5μs       47.6±0.4μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 2, 'd')
-      71.7±0.7μs      44.5±0.03μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 2, 'f')
-      73.2±0.1μs      45.3±0.08μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 1, 'd')
-      73.3±0.7μs      45.1±0.02μs     0.61  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 2, 'f')
-      80.3±0.2μs      49.3±0.05μs     0.61  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 4, 'f')
-      77.2±0.7μs       45.6±0.4μs     0.59  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 2, 'f')
-      82.0±0.2μs       48.1±0.6μs     0.59  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 2, 'd')
-     42.0±0.04μs      24.0±0.01μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 1, 'f')
-     55.0±0.02μs      31.5±0.08μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 1, 'f')
-      79.6±0.2μs      45.4±0.09μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 1, 'd')
-      72.2±0.7μs      40.8±0.02μs     0.56  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 2, 'f')
-      76.3±0.1μs      41.9±0.03μs     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 4, 'f')
-      71.5±0.6μs       39.2±0.4μs     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 2, 'f')
-      77.9±0.8μs       42.5±0.1μs     0.55  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 4, 'f')
-      76.4±0.7μs       39.2±0.4μs     0.51  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 2, 'f')
-      82.5±0.1μs       41.2±0.2μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 4, 'f')
-     50.4±0.01μs       22.4±0.2μs     0.44  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 1, 'f')
-     48.6±0.04μs      20.3±0.04μs     0.42  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 1, 'f')
-      54.0±0.5μs       22.3±0.2μs     0.41  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 1, 'f')
-      49.5±0.2μs      17.5±0.01μs     0.35  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 1, 'f')
-     48.5±0.03μs      15.4±0.04μs     0.32  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 1, 'f')
-      53.9±0.5μs       15.4±0.1μs     0.29  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 1, 'f')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
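As a reading aid for the table above: the last numeric column is the asv ratio, computed as after/before, so values below 1 are speedups and the reciprocal gives the "x faster" figure. A small sketch of that arithmetic on the final `isfinite` (1, 1, 'f') row follows; the helper names are illustrative and not part of asv.

```python
def ratio(before_us: float, after_us: float) -> float:
    """asv-style ratio: time after the change divided by time before."""
    return after_us / before_us


def speedup(before_us: float, after_us: float) -> float:
    """How many times faster the new code is (reciprocal of the ratio)."""
    return before_us / after_us


# isfinite (1, 1, 'f') row: 53.9 us before, 15.4 us after.
print(f"ratio   = {ratio(53.9, 15.4):.2f}")
print(f"speedup = {speedup(53.9, 15.4):.1f}x")
```

The headline "up to 3.4x" may have been taken from the reciprocal of the rounded ratio (1/0.29), which is slightly lower than the figure computed from the raw timings.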

Apple M1 Rosetta: up to 2.1x faster

...
-       102±0.4μs       83.4±0.7μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 2, 'd')
-      97.1±0.1μs       78.9±0.4μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 1, 'd')
-     98.1±0.05μs      79.5±0.02μs     0.81  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 1, 'f')
-       119±0.1μs       79.3±0.3μs     0.67  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 4, 'd')
-       106±0.2μs      66.1±0.08μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 2, 'd')
-     98.7±0.05μs      58.9±0.01μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 1, 'd')
-      118±0.05μs       60.6±0.3μs     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 4, 'f')
-       113±0.2μs       55.2±0.2μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 2, 'f')
-      111±0.07μs      52.1±0.03μs     0.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 1, 'f')

iMacPro (AVX512): Similar. A handful of benchmarks are 15% faster and another handful are 15% slower, and which benchmarks land in which group changes from run to run. Averaging all gains and losses, we're at ~4% faster.

+       130±0.9μs         148±10μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 2, 'd')
+       160±0.7μs         180±10μs     1.13  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 1, 4, 'd')
+         132±1μs         146±10μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 4, 'f')
+        201±10μs          216±1μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 4, 'd')
+         250±2μs          266±1μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 4, 'd')
-         134±6μs          127±1μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 4, 2, 'f')
-         127±4μs          120±6μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 1, 'd')
-      67.1±0.5μs       63.2±0.4μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 1, 'f')
-         111±6μs          105±7μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 2, 'd')
-        95.1±3μs         89.0±2μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 1, 'f')
-        99.6±2μs         92.6±1μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 1, 1, 'd')
-         198±2μs          183±2μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 2, 'd')
-      87.0±0.9μs         80.4±1μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 2, 'f')
-        96.1±2μs         88.7±2μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 1, 'd')
-         118±1μs          108±2μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 2, 4, 'f')
-         140±3μs          129±2μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isinf'>, 4, 2, 'f')
-      67.4±0.7μs         61.4±1μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 1, 'f')
-         111±2μs        101±0.8μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 2, 'f')
-         139±8μs          126±4μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 2, 'f')
-         118±1μs        107±0.7μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 4, 'f')
-         142±5μs        128±0.8μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isfinite'>, 2, 2, 'd')
-         115±7μs          101±3μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 1, 2, 'd')
-         130±7μs        114±0.8μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 4, 1, 'f')
-       129±0.7μs        114±0.5μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 1, 'd')
-         118±6μs          104±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'isnan'>, 2, 4, 'f')
-         129±5μs        113±0.7μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 4, 1, 'f')
-         145±9μs        126±0.6μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 2, 2, 'd')
-         116±2μs       98.1±0.7μs     0.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'signbit'>, 1, 2, 'd')

@Developer-Ecosystem-Engineering
Contributor Author

Taking a look at the smoke_test failure!

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Aug 24, 2022
@Developer-Ecosystem-Engineering
Contributor Author

🤦‍♂️

@mattip
Member

mattip commented Sep 4, 2022

s390x and ppc64le builds on travis are failing, linux32 is crashing

@Developer-Ecosystem-Engineering
Contributor Author

Yeah, have an updated patch coming shortly

@mattip
Member

mattip commented Sep 12, 2022

32-bit linux tests are segfaulting.

@Developer-Ecosystem-Engineering
Contributor Author

Developer-Ecosystem-Engineering commented Sep 12, 2022

32-bit linux tests are segfaulting.

Yup! Thought we got it all, working on resolving. Once these are all sorted we will go back through the remaining open PRs and update them as well.

@seberg
Member

seberg commented Sep 27, 2022

Just to note that the 32-bit debug run is still failing (presumably some assert?).

@Developer-Ecosystem-Engineering
Contributor Author

Yeah, it's not reproducing locally, so we are trying to figure out why it's failing in CI. Open to suggestions =)

@seberg
Member

seberg commented Sep 27, 2022

Not rocket science, but since I don't have a better idea: I pushed a commit to see if we can get the gdb backtrace in the CI run (before we need to track down what is special about the run). Let's see if it works...

@seberg seberg force-pushed the simd_isnan_isinf_isfinite_signbit branch 2 times, most recently from f78280a to a05d960 Compare September 27, 2022 19:11
@seberg
Member

seberg commented Sep 27, 2022

Well, had to reproduce it locally to be smart enough to use yum install correctly ;).

Here is what I get locally. Please do not hesitate to simply force-push the CI addition away; I suspect it will give the same traceback now, so I will keep it in case it is useful to you.

Program received signal SIGABRT, Aborted.
0xf7fbb559 in __kernel_vsyscall ()
#0  0xf7fbb559 in __kernel_vsyscall ()
#1  0xf7d6d257 in raise () from /lib/libc.so.6
#2  0xf7d6ea93 in abort () from /lib/libc.so.6
#3  0xf7d660d7 in __assert_fail_base () from /lib/libc.so.6
#4  0xf7d66187 in __assert_fail () from /lib/libc.so.6
#5  0xf73f2623 in DOUBLE_isnan_SSE41 (args=0xff9c81e0, dimensions=0xff9c8150, steps=0xff9c8160, __NPY_UNUSED_TAGGEDfunc=0x0) at numpy/core/src/umath/loops_unary_fp.dispatch.c.src:732
#6  0xf766d7d6 in generic_wrapped_legacy_loop (__NPY_UNUSED_TAGGEDcontext=0xff9c8990, data=0xff9c81e0, dimensions=0xff9c8150, strides=0xff9c8160, auxdata=0xeb059a20) at numpy/core/src/umath/legacy_array_method.c:87
#7  0xf766f8df in try_trivial_single_output_loop (context=context@entry=0xff9c8990, op=op@entry=0xff9c8690, order=order@entry=NPY_KEEPORDER, arr_prep=0xff9c8490, full_args=..., errormask=521, extobj=0x0) at numpy/core/src/umath/ufunc_object.c:1367
#8  0xf7678571 in PyUFunc_GenericFunctionInternal (wheremask=0x0, full_args=..., output_array_prepare=0xff9c8490, order=NPY_KEEPORDER, casting=NPY_SAME_KIND_CASTING, extobj=0x0, op=0xff9c8690, operation_descrs=0xff9c8790, ufuncimpl=0xeb1ed338, ufunc=0xeb1e3928) at numpy/core/src/umath/ufunc_object.c:2686
#9  ufunc_generic_fastcall (ufunc=0xeb1e3928, args=<optimized out>, len_args=<optimized out>, kwnames=0x0, outer=<optimized out>) at numpy/core/src/umath/ufunc_object.c:4938

It seems to me the test is failing in the memory-check only, which is weird. And I am actually not sure which assert is failing here at all!

@seberg
Member

seberg commented Sep 27, 2022

OK, this seems more useful (adding a print beforehand probably helped it get reported more clearly):

python3: numpy/core/src/umath/loops_unary_fp.dispatch.c.src:713: DOUBLE_isnan_SSE41: Assertion `len <= 1 || (istep % ilsize == 0 && ostep % olsize == 0)' failed.

printing out:

    printf("%d, %d, %d, %d, %d\n", len, istep, ilsize, ostep, olsize);

gives:

2, 12, 8, 1, 1

Not sure if it is helpful, but maybe...
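To make the failing condition concrete, the asserted predicate can be evaluated directly on the printed values. This is a sketch that mirrors the expression quoted in the assertion message; reading `ilsize` and `olsize` as the input and output element (lane) sizes in bytes is an assumption, and `strides_ok` is a hypothetical name.

```python
def strides_ok(length: int, istep: int, ilsize: int, ostep: int, olsize: int) -> bool:
    """Mirror of the asserted condition.

    Either there is at most one element, or both byte strides are whole
    multiples of their element sizes.
    """
    return length <= 1 or (istep % ilsize == 0 and ostep % olsize == 0)


# Values printed in the failing CI run: len=2, istep=12, ilsize=8, ostep=1, olsize=1.
print(strides_ok(2, 12, 8, 1, 1))   # False, because 12 % 8 == 4
# A 16-byte input stride would have satisfied the condition.
print(strides_ok(2, 16, 8, 1, 1))   # True
```

So the input stride of 12 bytes, which is not a multiple of the 8-byte double, is what trips the assert.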

@Developer-Ecosystem-Engineering
Contributor Author

Alright CI, this is it. You can do it. Green Check Time.

@Developer-Ecosystem-Engineering
Contributor Author

Woo! @seberg thank you for the clues, was exactly what was missing =) Much appreciated!

@mattip
Member

mattip commented Oct 2, 2022

Could you confirm that the benchmarks run before the fixes are still valid after them?

@Developer-Ecosystem-Engineering
Contributor Author

Most certainly @mattip, will update with latest.

@Developer-Ecosystem-Engineering
Contributor Author

Developer-Ecosystem-Engineering commented Oct 3, 2022

Still looks good on our end!

       before           after         ratio
     [da6297b9]       [e58cf4a]
     <main>           <unary-fp/upstream-pr>
-      65.5±0.1μs       62.1±0.1μs     0.95  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 1, 1, 'd')
-      42.8±0.2μs      40.0±0.07μs     0.94  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 4, 1, 'f')
-      72.9±0.3μs       68.0±0.3μs     0.93  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 4, 1, 'd')
-      84.1±0.2μs       78.3±0.7μs     0.93  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 1, 4, 'd')
-        81.9±1μs       75.7±0.1μs     0.92  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 4, 2, 'd')
-        98.0±2μs       89.2±0.4μs     0.91  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 4, 4, 'd')
-      68.7±0.2μs       62.3±0.3μs     0.91  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 1, 2, 'd')
-      70.8±0.4μs       64.2±0.2μs     0.91  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 4, 4, 'f')
-       100±0.8μs         90.2±4μs     0.90  bench_ufunc_strides_isnan.UFunc.time_ufunc_types('isfinite')
-      70.1±0.2μs       62.9±0.1μs     0.90  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 2, 1, 'd')
-      67.7±0.6μs       60.7±0.1μs     0.90  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 2, 1, 'd')
-        72.2±1μs       64.2±0.2μs     0.89  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 4, 1, 'd')
-      73.8±0.4μs       64.1±0.1μs     0.87  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 4, 1, 'd')
-        99.9±2μs         86.5±2μs     0.87  bench_ufunc_strides_isnan.UFunc.time_ufunc_types('isinf')
-      75.0±0.1μs       64.9±0.1μs     0.86  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 2, 2, 'd')
-      81.9±0.2μs      70.8±0.06μs     0.86  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 4, 2, 'd')
-      89.7±0.4μs       77.5±0.3μs     0.86  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 2, 4, 'd')
-       100±0.2μs         85.4±1μs     0.85  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 4, 4, 'd')
-      98.1±0.9μs         83.4±1μs     0.85  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 4, 4, 'd')
-      76.0±0.2μs       64.4±0.3μs     0.85  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 4, 1, 'd')
-        84.0±1μs       70.8±0.2μs     0.84  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 4, 2, 'd')
-     78.9±0.06μs       65.1±0.1μs     0.82  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 2, 2, 'd')
-      73.8±0.3μs       60.9±0.3μs     0.82  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 2, 1, 'd')
-      94.2±0.4μs       77.4±0.5μs     0.82  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 2, 4, 'd')
-      86.8±0.3μs       70.7±0.1μs     0.82  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 4, 2, 'd')
-      64.7±0.4μs       52.7±0.4μs     0.81  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 2, 2, 'f')
-        74.8±1μs       60.4±0.2μs     0.81  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 4, 4, 'f')
-      72.5±0.2μs      58.5±0.04μs     0.81  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 4, 2, 'f')
-        96.0±2μs       77.3±0.1μs     0.80  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 2, 4, 'd')
-       105±0.9μs       84.1±0.3μs     0.80  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 4, 4, 'd')
-     70.7±0.09μs       56.4±0.1μs     0.80  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 4, 2, 'f')
-        81.9±1μs      64.9±0.05μs     0.79  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 2, 2, 'd')
-      78.0±0.1μs      61.4±0.05μs     0.79  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 4, 4, 'f')
-      68.9±0.1μs       53.8±0.1μs     0.78  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 2, 4, 'f')
-     81.7±0.08μs       60.7±0.1μs     0.74  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 4, 4, 'f')
-     76.5±0.08μs       56.8±0.1μs     0.74  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 4, 2, 'f')
-      64.1±0.1μs       46.4±0.1μs     0.72  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 1, 2, 'f')
-     41.9±0.05μs      29.9±0.05μs     0.71  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 2, 1, 'f')
-      90.1±0.7μs       63.9±0.5μs     0.71  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 1, 4, 'd')
-      91.9±0.2μs       65.0±0.7μs     0.71  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 1, 4, 'd')
-      69.6±0.1μs       47.8±0.4μs     0.69  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 1, 4, 'f')
-     49.5±0.06μs      33.5±0.06μs     0.68  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 4, 1, 'f')
-     72.7±0.08μs      48.5±0.07μs     0.67  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 1, 1, 'd')
-      53.3±0.3μs         35.4±1μs     0.66  bench_ufunc_strides_isnan.UFunc.time_ufunc_types('signbit')
-      95.7±0.9μs       62.9±0.5μs     0.66  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 1, 4, 'd')
-      72.0±0.3μs       46.6±0.1μs     0.65  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 2, 2, 'f')
-     76.2±0.07μs      48.8±0.04μs     0.64  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 1, 2, 'd')
-      74.6±0.3μs      47.6±0.06μs     0.64  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 2, 4, 'f')
-      76.0±0.3μs      48.4±0.03μs     0.64  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 2, 4, 'f')
-      74.8±0.6μs       47.4±0.9μs     0.63  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 1, 2, 'd')
-     48.6±0.06μs      30.7±0.04μs     0.63  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 4, 1, 'f')
-      72.6±0.9μs       44.7±0.7μs     0.62  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 1, 1, 'd')
-      71.0±0.3μs      43.4±0.06μs     0.61  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 2, 2, 'f')
-      80.3±0.2μs      48.2±0.09μs     0.60  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 2, 4, 'f')
-      80.1±0.2μs      47.5±0.08μs     0.59  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 1, 2, 'd')
-      75.9±0.2μs       44.8±0.1μs     0.59  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 2, 2, 'f')
-     53.6±0.03μs      31.2±0.07μs     0.58  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 4, 1, 'f')
-     76.6±0.06μs      44.3±0.05μs     0.58  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 1, 1, 'd')
-     41.8±0.05μs      23.9±0.01μs     0.57  bench_ufunc_strides_isnan.Unary.time_ufunc('isnan', 1, 1, 'f')
-     71.6±0.05μs      39.7±0.03μs     0.55  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 1, 2, 'f')
-      71.1±0.7μs       39.3±0.9μs     0.55  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 1, 2, 'f')
-      76.8±0.2μs       41.1±0.1μs     0.54  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 1, 4, 'f')
-        76.4±2μs      40.8±0.06μs     0.53  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 1, 4, 'f')
-      75.8±0.1μs       38.9±0.4μs     0.51  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 1, 2, 'f')
-      81.2±0.2μs       41.0±0.1μs     0.51  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 1, 4, 'f')
-     49.3±0.07μs       24.5±0.4μs     0.50  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 2, 1, 'f')
-      48.4±0.2μs      20.2±0.01μs     0.42  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 2, 1, 'f')
-     53.1±0.08μs       22.1±0.2μs     0.42  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 2, 1, 'f')
-      49.3±0.2μs      17.1±0.01μs     0.35  bench_ufunc_strides_isnan.Unary.time_ufunc('isinf', 1, 1, 'f')
-      48.4±0.3μs       15.4±0.3μs     0.32  bench_ufunc_strides_isnan.Unary.time_ufunc('signbit', 1, 1, 'f')
-      53.2±0.1μs      15.2±0.05μs     0.29  bench_ufunc_strides_isnan.Unary.time_ufunc('isfinite', 1, 1, 'f')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@mattip
Member

mattip commented Oct 4, 2022

Still looks good on our end!

Indeed, this is a really nice speedup for a commonly used operation

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering force-pushed the simd_isnan_isinf_isfinite_signbit branch from fb48797 to 41638be Compare October 6, 2022 03:28
@mattip
Member

mattip commented Oct 6, 2022

Any thoughts about the missing coverage for output strides?

@seiko2plus thoughts?

@Developer-Ecosystem-Engineering
Contributor Author

Any thoughts about the missing coverage for output strides?

@seiko2plus thoughts?

Coverage should be caught up now.

@mattip
Member

mattip commented Dec 7, 2022

@seiko2plus could you look over this?

@seiko2plus
Member

@mattip, I was looking at it. I'm preparing a patch to implement intrinsics for non-saturating packing and to test special FPU cases, which are needed to avoid complexity and guarantee fair optimization among all architectures, similar to #22306 and #22167.

NumPy has SIMD versions of float / double `isnan`, `isinf`, `isfinite`, and `signbit` for SSE2 and AVX-512. The changes here replace the SSE2 version with one that uses universal intrinsics. This allows other architectures to have SIMD versions of the functions too.
Use reinterpret to support casting across many compiler generations

Resolve deprecation warnings
Special case SSE
Fix PPC64 build
Only use vqtbl4q_u8 on A64
Stop trying to use optimizations on s390x
We don't see these failures but CI is hitting them, attempting to resolve
  On 32-bit Linux an assert fires when the stride (12) passed from ufunc_object (try_trivial_single_output_loop) to DOUBLE_isnan and DOUBLE_isfinite is not a multiple of the type size (8). We can relax this assert and fall back to the UNARY_LOOP path instead.
… it was reading random data and not reliable also didn't give additional coverage
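The fallback described in the 32-bit Linux commit message above can be pictured roughly as follows: when the byte stride is not a multiple of the element size, skip the vector kernel and walk the buffer one element at a time. This is only a Python sketch of that dispatch idea, not the C code from the PR; `isnan_dispatch` and `isnan_strided` are hypothetical names, and the "vector" branch is a stand-in that reuses the scalar loop.

```python
import math
import struct


def isnan_strided(buf: bytes, count: int, istep: int) -> list:
    """Scalar loop: read `count` little-endian doubles, one every `istep` bytes."""
    return [
        math.isnan(struct.unpack_from("<d", buf, i * istep)[0])
        for i in range(count)
    ]


def isnan_dispatch(buf: bytes, count: int, istep: int, elsize: int = 8):
    """Pick a path based on the stride, then compute the result.

    Assumption for illustration: the vector path requires the byte stride
    to be a whole multiple of the element size.
    """
    if count <= 1 or istep % elsize == 0:
        path = "vector"   # stand-in: a real kernel would use SIMD loads here
    else:
        path = "scalar"   # the generic per-element loop handles any stride
    return path, isnan_strided(buf, count, istep)


# Two doubles laid out 12 bytes apart (8 bytes of data plus 4 bytes of padding).
odd_buf = struct.pack("<d4x", float("nan")) + struct.pack("<d4x", 1.0)
print(isnan_dispatch(odd_buf, 2, 12))

# The same two doubles stored contiguously (8-byte stride).
packed_buf = struct.pack("<2d", float("nan"), 1.0)
print(isnan_dispatch(packed_buf, 2, 8))
```

Both layouts produce the same results; only the chosen path differs, which is the point of falling back rather than asserting.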
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering force-pushed the simd_isnan_isinf_isfinite_signbit branch from bee8071 to d7b19a4 Compare January 4, 2023 10:22
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering force-pushed the simd_isnan_isinf_isfinite_signbit branch from 663d61c to f277da4 Compare January 4, 2023 18:11
@Developer-Ecosystem-Engineering
Contributor Author

@mattip @seiko2plus Changes should all be integrated, thanks!

@Developer-Ecosystem-Engineering
Contributor Author

Any further issues to investigate on this PR?

Member

@seiko2plus seiko2plus left a comment

Well done, thank you.

Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery
5 participants