Skip to content

Commit

Permalink
fix(simd.h): fix longstanding probem with 16-wide bitcast for 8-wide …
Browse files Browse the repository at this point in the history
…HW (#4268)

We've been plagued on and off for a long time with spurious wrong
results in our SIMD versions of exp and log. It was really squirrelly,
came and went with changes to seemingly unrelated code, and had long
periods of time with no symptoms or ability to reproduce it. It just
started popping up again in CI in the dev-2.5 branch, consistently.

Today I think I finally tracked it down for real. I believe the problem
was actually in the SIMD bitcast_to_int and bitcast_to_float functions,
for the variety that was 8-wide when running on 4-wide hardware, and the
16-wide one when running on 8-wide (or less) hardware.

Background: When real 16-wide HW is not available, we define the 16-wide
vectors as an array of two 8-wide vectors. (Same for 8 vs 4, I'll spare
you explaining it all twice.) So most of the 16 wide functions boil down
to:

    vfloat16 op (vfloat16 arg, ...)
    {
    #if 16 wide HW is available
        return real_16_wide_intrinsic(arg);
    #else /* recursively define in terms of 8-wide halves */
        return vfloat16(op(arg.lo()), op(arg.hi()));
    #endif
    }

But for the bitcast operators, we did something different, a shade too
clever: we just did a pointer cast, dereference and return:

    vint16 bitcast_to_int (const vfloat16& x)
    {
    #if OIIO_SIMD_AVX >= 512
        return _mm512_castps_si512 (x.simd());
    #else
        return *(vint16 *)&x;
    #endif
    }

I believe that the problem is that this cast is undefined behavior for
the case where the vector x was broken into two halves, and by being in
the middle of a long code sequence of SIMD ops, the two halves happened
to be in registers at that moment. Basically, we are taking the address,
and it just doesn't seem to always understand that it needs to
materialize the whole thing in memory for the pointer cast to work
properly.

I removed the pointer cast and replaced with

    return vfloat16(bitcast_to_float(x.lo()), bitcast_to_float(x.hi()));

basically falling back on the usual recursive definition of 16-wide
operations to being in terms of 8-wide operations when 16-wide HW is not
available. This totally clears up the problem since we're no longer
potentially relying on needing the address of something that has no
address at that moment.

A couple other spots (andnot) also had dubious casts that I fixed in the
same way.

N.B.: The 16-wide cast from bool to int had to do something slightly
different, because it's represented differently and isn't a true
bitcast.

Signed-off-by: Larry Gritz <[email protected]>
  • Loading branch information
lgritz authored May 22, 2024
1 parent 2ed2a47 commit ce610f2
Show file tree
Hide file tree
Showing 2 changed files with 12 additions and 12 deletions.
2 changes: 1 addition & 1 deletion src/include/OpenImageIO/fmath.h
Original file line number Diff line number Diff line change
Expand Up @@ -1852,7 +1852,7 @@ template<typename T>
OIIO_FORCEINLINE OIIO_HOSTDEVICE T fast_exp2 (const T& xval) {
using namespace simd;
typedef typename T::vint_t intN;
#if OIIO_SIMD_SSE && !OIIO_MSVS_BEFORE_2022
#if OIIO_SIMD_SSE || OIIO_SIMD_NEON
// See float specialization for explanations
T x = clamp (xval, T(-126.0f), T(126.0f));
intN m (x); x -= T(m);
Expand Down
22 changes: 11 additions & 11 deletions src/include/OpenImageIO/simd.h
Original file line number Diff line number Diff line change
Expand Up @@ -5686,7 +5686,7 @@ OIIO_FORCEINLINE vint8 bitcast_to_int (const vbool8& x)
#if OIIO_SIMD_AVX
return _mm256_castps_si256 (x.simd());
#else
return *(vint8 *)&x;
return vint8(bitcast_to_int(x.lo()), bitcast_to_int(x.hi()));
#endif
}

Expand Down Expand Up @@ -7738,9 +7738,9 @@ OIIO_FORCEINLINE vfloat4 andnot (const vfloat4& a, const vfloat4& b) {
#if OIIO_SIMD_SSE
return _mm_andnot_ps (a.simd(), b.simd());
#else
const int *ai = (const int *)&a;
const int *bi = (const int *)&b;
return bitcast_to_float (vint4(~(ai[0]) & bi[0],
vint4 ai = bitcast_to_int(a);
vint4 bi = bitcast_to_int(b);
return bitcast_to_float(vint4(~(ai[0]) & bi[0],
~(ai[1]) & bi[1],
~(ai[2]) & bi[2],
~(ai[3]) & bi[3]));
Expand Down Expand Up @@ -9157,7 +9157,7 @@ OIIO_FORCEINLINE vint8 bitcast_to_int (const vfloat8& x)
#if OIIO_SIMD_AVX
return _mm256_castps_si256 (x.simd());
#else
return *(vint8 *)&x;
return vint8(bitcast_to_int(x.lo()), bitcast_to_int(x.hi()));
#endif
}

Expand All @@ -9166,7 +9166,7 @@ OIIO_FORCEINLINE vfloat8 bitcast_to_float (const vint8& x)
#if OIIO_SIMD_AVX
return _mm256_castsi256_ps (x.simd());
#else
return *(vfloat8 *)&x;
return vfloat8(bitcast_to_float(x.lo()), bitcast_to_float(x.hi()));
#endif
}

Expand Down Expand Up @@ -9395,9 +9395,9 @@ OIIO_FORCEINLINE vfloat8 andnot (const vfloat8& a, const vfloat8& b) {
#if OIIO_SIMD_AVX
return _mm256_andnot_ps (a.simd(), b.simd());
#else
const int *ai = (const int *)&a;
const int *bi = (const int *)&b;
return bitcast_to_float (vint8(~(ai[0]) & bi[0],
vint8 ai = bitcast_to_int(a);
vint8 bi = bitcast_to_int(b);
return bitcast_to_float(vint8(~(ai[0]) & bi[0],
~(ai[1]) & bi[1],
~(ai[2]) & bi[2],
~(ai[3]) & bi[3],
Expand Down Expand Up @@ -10030,7 +10030,7 @@ OIIO_FORCEINLINE vint16 bitcast_to_int (const vfloat16& x)
#if OIIO_SIMD_AVX >= 512
return _mm512_castps_si512 (x.simd());
#else
return *(vint16 *)&x;
return vint16(bitcast_to_int(x.lo()), bitcast_to_int(x.hi()));
#endif
}

Expand All @@ -10039,7 +10039,7 @@ OIIO_FORCEINLINE vfloat16 bitcast_to_float (const vint16& x)
#if OIIO_SIMD_AVX >= 512
return _mm512_castsi512_ps (x.simd());
#else
return *(vfloat16 *)&x;
return vfloat16(bitcast_to_float(x.lo()), bitcast_to_float(x.hi()));
#endif
}

Expand Down

0 comments on commit ce610f2

Please sign in to comment.