
Add PackFromRgbPlanes AVX2 vectorised implementation for Rgba32 and Rgb24 pixels #1242

Closed
john-h-k wants to merge 17 commits

Conversation

@john-h-k

WIP. A few initial questions raised

  • Is there a specific reason for no SSE intrinsics? They would be useful here
  • Why are the remainder paths for vectorised pixel operations marked as NoInlining? They are often going to be hot code
  • There were a few places where there were unnecessary span slices. Although not directly in the scope of this PR, I removed them
  • What other pixel types need this impl?

@CLAassistant commented Jun 19, 2020

CLA assistant check
All committers have signed the CLA.

@JimBobSquarePants (Member) commented Jun 21, 2020

Thanks @john-h-k for making a start on this, really appreciate it. I'll tag @antonfirsov and @saucecontrol here since they'll both provide much better input to these APIs than I ever will.

Is there a specific reason for no SSE intrinsics? They would be useful here

No reason I am aware of.

Why are the remainder paths for vectorised pixel operations marked as NoInlining? They are often going to be hot code

Possible oversight. I would inline them.

There were a few places where there were unnecessary span slices. Although not directly in the scope of this PR, I removed them

I think you misread them. There were two different slice overloads taking place. I've reverted those changes to fix the output and also changed what looked like a copypasta when slicing in your code.

What other pixel types need this impl?

All of the RGBA compatible types should be able to take advantage of the new API since it should be a case of channel swapping for individual implementations.

@john-h-k (Author) commented Jun 21, 2020

There were two different slice overloads taking place. I've reverted those changes to fix the output and also changed what looked like a copypasta when slicing in your code.

Ah yeah, that was a silly mistake, apologies. Shame there's not a Span<T> Split(int index, out Span<T> start, out Span<T> end) to avoid the double argument checks there.
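
For illustration, a minimal sketch of such a helper (hypothetical, not part of the BCL or ImageSharp) might look like this; note it still wraps two Slice calls, whereas a runtime-provided version could validate the index once:

using System;

internal static class SpanExtensions
{
    // Splits a span at 'index' into the leading and trailing parts in one call.
    public static void Split<T>(this Span<T> span, int index, out Span<T> start, out Span<T> end)
    {
        start = span.Slice(0, index);
        end = span.Slice(index);
    }
}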

Is there a specific reason for no SSE intrinsics? They would be useful here

No reason I am aware of.

For Intel desktop, post-2013 chips will have AVX2, but pre-2015 AMD desktop chips won't (I know at least one dev whose main system doesn't have AVX2, for example). I vaguely recall that mobile and laptop chips don't have AVX2 as frequently (but will have SSE), so I'll check whether that's right; if so, that would be a valuable reason to support SSE. Also, for various reasons consumers may want to disable AVX2 (the JIT supports an environment variable to do so), as it can have downclocking issues, in which case having SSE would again be useful.

@codecov bot commented Jun 21, 2020

Codecov Report

Merging #1242 (f4cabf2) into master (05659b8) will decrease coverage by 0.39%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master    #1242      +/-   ##
==========================================
- Coverage   83.68%   83.29%   -0.40%     
==========================================
  Files         734      735       +1     
  Lines       31990    32142     +152     
  Branches     3605     3616      +11     
==========================================
  Hits        26772    26772              
- Misses       4505     4657     +152     
  Partials      713      713              
Flag Coverage Δ
unittests 83.29% <0.00%> (-0.40%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...arp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs 70.66% <0.00%> (-1.94%) ⬇️
...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs 76.39% <0.00%> (-21.41%) ⬇️
src/ImageSharp/Common/Helpers/SimdUtils.cs 41.33% <0.00%> (-24.63%) ⬇️
...ntations/PixelOperations/Rgba32.PixelOperations.cs 70.83% <0.00%> (-29.17%) ⬇️
...mats/PixelImplementations/Rgb24.PixelOperations.cs 0.00% <0.00%> (ø)
...ImageSharp/PixelFormats/PixelOperations{TPixel}.cs 33.33% <0.00%> (-6.07%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@saucecontrol (Contributor) left a comment
Left a few comments for now on some logic errors and some general tips for the SIMD code. I'd be happy to help with more detail, but I figure if this is a learning project you may want to work through the solution yourself. Let me know if you have any questions.

On the question of the lack of SSE2 code, I believe the reason there isn't any for the existing float->byte conversion is that it was built onto an existing code path using Vector<T>, and that code path was only advantageous when Vector<float>.Count is 8 (which implies AVX2). @antonfirsov can correct me if I misexplained that. In any case, any new HWIntrinsics code should definitely have an SSE fallback, so don't let the fact it's missing from the other implementation stop you.


if (adjustedCount > 0)
{
channel0 = channel0.Slice(adjustedCount);
Contributor

This is the same Slice mistake @JimBobSquarePants pointed out from the other changes. You're slicing to the remainder before calling the method that's supposed to do the main part.
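
A minimal sketch of the intended ordering, with hypothetical stand-ins for the vectorised reduce and the scalar remainder (these are not the PR's actual methods, and the 4-bytes-per-pixel factor is only illustrative):

using System;

internal static class PackOrderingSketch
{
    public static void Pack(ReadOnlySpan<byte> channel0, Span<byte> dest)
    {
        // The vectorised body runs over the full span first and reports how much it handled.
        int adjustedCount = VectorizedReduce(channel0, dest);

        // Only then do we slice, so the scalar path sees just the leftover tail.
        if (adjustedCount > 0)
        {
            channel0 = channel0.Slice(adjustedCount);
            dest = dest.Slice(adjustedCount * 4); // assuming 4 output bytes per pixel
        }

        ScalarRemainder(channel0, dest);
    }

    // Placeholder: pretend the vectorised path packed the largest multiple of 32 pixels.
    private static int VectorizedReduce(ReadOnlySpan<byte> src, Span<byte> dst)
        => src.Length - (src.Length % 32);

    // Placeholder for the per-pixel fallback.
    private static void ScalarRemainder(ReadOnlySpan<byte> src, Span<byte> dst)
    {
    }
}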

@@ -91,6 +91,190 @@ public static class Avx2Intrinsics
}
}

internal static void PackBytesToUInt32SaturateChannel4Reduce(
Contributor

I don't know what name is best for this, but there's no saturation happening here because the input is already byte.

Author

by "saturate channel 4" i just meant set all bits to true. Not sure why i chose that wording

Contributor

Ah, I get it. I suck at names, so feel free to ignore my feedback on anything naming related 😆

// Deal with the remainder:
if (channel0.Length > 0)
{
PackBytesToUInt24(channel0, channel1, channel2, dest);
Contributor

This is calling itself recursively. Should be calling the remainder method.

Comment on lines 151 to 153
s0 = Avx2.Permute4x64(s0.AsUInt64(), 0b_11_01_10_00).AsByte();
s1 = Avx2.Permute4x64(s1.AsUInt64(), 0b_11_01_10_00).AsByte();
s2 = Avx2.Permute4x64(s2.AsUInt64(), 0b_11_01_10_00).AsByte();
Contributor

Permutes are expensive, so you'll do better to just let the upper/lower lanes stay as is and fix them at the end rather than trying to keep everything in order all the way through. It may help to write the SSE2 version first and have the AVX2 code match that right up until the end when you deal with the 128-bit lanes being interleaved.

Author

👍 That's actually news to me 😄, I was always under the impression permutes were only a couple of cycles.

Contributor

They're 3 cycles latency and 1 cycle throughput, but they're also only able to be scheduled on a single port, so they can easily be a bottleneck. I should have said 'relatively expensive', compared to most of the other instructions used here.

A good test to judge the impact would be to write up a benchmark for the SSE2 version of the code to see how the AVX2 version compares to that. Your goal would, of course, be 2x performance. You might be surprised what kind of impact all those permutes have in that context :)
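
If it helps, here is a hedged BenchmarkDotNet harness sketch for that comparison. The type and method names are placeholders, and the scalar loop exists only so the snippet compiles; you would point the two benchmark methods at the actual SSE2 and AVX2 implementations:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class PackFromRgbPlanesBenchmark
{
    private byte[] r, g, b, dest;

    [GlobalSetup]
    public void Setup()
    {
        const int count = 2048;
        this.r = new byte[count];
        this.g = new byte[count];
        this.b = new byte[count];
        this.dest = new byte[count * 3];
    }

    // Wire these to the real SSE2 and AVX2 paths under test; the scalar body below is a stand-in.
    [Benchmark(Baseline = true)]
    public void Sse2Path() => PackScalar(this.r, this.g, this.b, this.dest);

    [Benchmark]
    public void Avx2Path() => PackScalar(this.r, this.g, this.b, this.dest);

    private static void PackScalar(byte[] r, byte[] g, byte[] b, byte[] dest)
    {
        for (int i = 0; i < r.Length; i++)
        {
            dest[i * 3] = r[i];
            dest[(i * 3) + 1] = g[i];
            dest[(i * 3) + 2] = b[i];
        }
    }

    public static void Main() => BenchmarkRunner.Run<PackFromRgbPlanesBenchmark>();
}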


if (adjustedCount > 0)
{
channel0 = channel0.Slice(adjustedCount);
Contributor

Incorrect Slice use here as well

ref Vector256<byte> destBase =
ref Unsafe.As<byte, Vector256<byte>>(ref MemoryMarshal.GetReference(dest));

Vector256<byte> s0Mask0 = Vector256.Create(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1).AsByte();
Contributor

They'll direct you to use Vector256.Create like this over in the runtime repo because they've fixed the JIT to recognize it in 5.0. It's exceedingly expensive in 3.x, though, and since 3.1 has 2.5 years of LTS left, I would personally keep using the ReadOnlySpan<byte> trick to load these from static data.

@JimBobSquarePants do you have a preference there?

Member

Yeah, let's use the ROS trick here. We want to try to max out on all supported frameworks.
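
For reference, a minimal sketch of the ReadOnlySpan<byte> static-data trick (the mask bytes and names here are illustrative, not the PR's data): a new byte[] returned directly as ReadOnlySpan<byte> is compiled into a reference to the assembly's static data section, so there is no allocation and the load stays cheap on netcoreapp3.x.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class ShuffleMasks
{
    // Emitted into the assembly's data section; no array is allocated at run time.
    private static ReadOnlySpan<byte> PackMaskBytes => new byte[]
    {
        0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
        0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
    };

    // Reinterpret the 32 bytes of static data as a Vector256<byte>.
    public static Vector256<byte> PackMask
        => Unsafe.ReadUnaligned<Vector256<byte>>(ref MemoryMarshal.GetReference(PackMaskBytes));
}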

Comment on lines 252 to 254
Vector256<byte> loS0 = Avx2.Permute2x128(s0, s0, 0);
Vector256<byte> loS1 = Avx2.Permute2x128(s1, s1, 0);
Vector256<byte> loS2 = Avx2.Permute2x128(s2, s2, 0);
Contributor

Again, all these permutes and shuffles will be quite expensive. What I do for these 3 channel cases is treat them the same as 4 channel right up until the end (in this case with a dummy zero vector), and then do a single shuffle to pack the 12 good bytes out of each 16 together. That does mean overrunning the end of your buffer by 8 bytes for the AVX2 implementation, though, so you'd have to adjust your remainder/cleanup length by 8 to compensate. It's worth doing some benchmarks to see the difference if you're not clear on it.

Member

That single shuffle... How would that work? I can't figure it out without multiple permutes.

@saucecontrol (Contributor) commented Oct 25, 2020

So, assuming this one starts off the same as the 4-channel version, you'll end up with vectors that look like this:

BGRx:BGRx:BGRx:BGRx||BGRx:BGRx:BGRx:BGRx

With a shuffle, you can pack the BGR/RGB triplets together, putting the filler/alpha values at the end of each lane. Then with a permute, you can cram the 12-byte valid sections of each lane together, leaving all the dummy values in the last 8 bytes. Like such (forgive my poor ascii art skills):

BGRx:BGRx:BGRx:BGRx||BGRx:BGRx:BGRx:BGRx
vpshufb (0,1,2,4,5,6,8,9,a,c,d,e,3,7,b,f)
                   ||
                   \/
BGRB:GRBG:RBGR:xxxx||BGRB:GRBG:RBGR:xxxx
vpermd (0,1,2,4,5,6,3,7)
                   ||
                   \/
BGRB:GRBG:RBGR:BGRB||GRBG:RBGR:xxxx:xxxx

At that point, you can either write the full 32-byte vector out (but then only advance your output ref/ptr by 24 bytes), or you can write the lower lane and then movq the first half of the upper lane, depending on whether you're at the end of your out buffer or not.
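
In C# intrinsics terms, that last step might look roughly like the sketch below. The masks and names are illustrative only, and in practice the masks would be loaded via the ReadOnlySpan<byte> trick discussed earlier rather than built with Vector256.Create on 3.x:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class Rgb24PackSketch
{
    // 'bgrx' holds BGRx:BGRx:BGRx:BGRx in each 128-bit lane after the 4-channel style unpacks.
    public static Vector256<byte> PackTriplets(Vector256<byte> bgrx)
    {
        // vpshufb: pack the 12 useful bytes of each lane together, filler bytes last in the lane.
        Vector256<byte> shuffleMask = Vector256.Create(
            (byte)0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
            0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15);
        Vector256<byte> perLane = Avx2.Shuffle(bgrx, shuffleMask);

        // vpermd: cram the two 12-byte valid sections together; the last 8 bytes are filler.
        Vector256<int> permuteMask = Vector256.Create(0, 1, 2, 4, 5, 6, 3, 7);
        Vector256<byte> packed = Avx2.PermuteVar8x32(perLane.AsInt32(), permuteMask).AsByte();

        // The caller writes all 32 bytes but advances the destination by only 24, or writes the
        // lower lane plus the first 8 bytes of the upper lane when near the end of the buffer.
        return packed;
    }
}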

Contributor

BTW, because this one requires 3 permutes to start and then 4 to end, it may be one where SSE2 is faster than AVX2. Would be worth benchmarking.

Member

Yay! I couldn't figure out the first part. Thanks!

On another note, BroadcastVector128ToVector256 should absolutely have an overload accepting a Vector128<T>.


Contributor

Nice!

The vbroadcasti128 instruction only accepts a memory operand, so to use it with Vector128<T>, it would have to spill to memory. You can, however, use Vector256<T>.Create() with the same Vector128<T> for both args, and it will emit the correct permute for you.
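
A quick sketch of that (the method and parameter names are illustrative, not an ImageSharp API):

using System.Runtime.Intrinsics;

internal static class BroadcastSketch
{
    // Duplicates an in-register Vector128<byte> into both 128-bit lanes of a Vector256<byte>,
    // letting the JIT pick the in-register permute rather than spilling for vbroadcasti128.
    public static Vector256<byte> BroadcastToVector256(Vector128<byte> vec128)
        => Vector256.Create(vec128, vec128);
}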

@antonfirsov (Member)

@john-h-k thanks a lot for taking the time to look at this! Currently I'm on a holiday, AFK 99% of the time. I will have a deeper look next week. I trust @saucecontrol 100% about technical suggestions on SIMD, Span etc.

A few random answers in the meanwhile:

Is there a specific reason for no SSE intrinsics? They would be useful here

The focus was on server/cloud, since the vast majority of our users are in the cloud. I assumed that AVX2 should be available on most cloud VMs, but that was not based on actual research. It's fine to add other paths of course, but we should account for the added complexity it would introduce, e.g. make sure those paths are actually covered by unit tests -> add changes to test "infrastructure code" if necessary -> ....

(which implies AVX2). @antonfirsov can correct me if I misexplained that.

Yes, there might still be many places in the code where I used the term AVX2 as a synonym for Vector<T> of 8 elements. Recently I started to use the term Vector8 for those code paths (AVX2 belongs to the HWIntrinsics stuff now).

@antonfirsov (Member)

It would be really great to finish this! @john-h-k any chance you are around and have some time? I will have the necessary review capacity in the next 2 weeks.

@john-h-k (Author)

Yes, I should be able to finish this Friday or the coming weekend. I've been very busy with A-levels, so sorry for letting the PR go stale!

@JimBobSquarePants (Member)

A-levels first mate! Hope they went well!

@JimBobSquarePants (Member)

@john-h-k There's been a bit of churn in the repo around this area of the code as I added some new intrinsics methods. I've updated your fork though, so you don't have to deal with the merge conflicts.

Vector256<byte> s1 = Unsafe.Add(ref source1Base, i);
Vector256<byte> s2 = Unsafe.Add(ref source2Base, i);

s0 = Avx2.Permute4x64(s0.AsUInt64(), 0b_11_01_10_00).AsByte();
Member

If I'm not missing something here, we can pack these with less work in a manner similar to how it's done in the jpeg color converter?

internal static void ConvertCore(in ComponentValues values, Span<Vector4> result, float maxValue, float halfValue)

@saucecontrol (Contributor) commented Oct 24, 2020

Yep, it's the same problem. The unpack operations work in-lane, so whatever starts in the upper lane of the input will be in the upper lanes of the output. The current code re-permutes after each round to keep things in the right place throughout. The other options are:

  1. Deal with it at the end, by writing the 4 lower lanes to the output first followed by the 4 upper lanes, or permuting the 128-bit pairs together before writing.
  2. Rearrange the input so the blocks of 4 bytes (which will become blocks of 16 bytes) will come out paired correctly after the unpacks.

Option 2 will be cheaper since extract costs the same as a permute, and as with the YCbCr conversion, you can get by with only permuting 3 inputs for 4 outputs.

Member

Thought so, thanks for confirming. 👍

@JimBobSquarePants (Member)

I think we need to have a rethink here.

We're currently doing a very similar operation in our jpeg decoder color transforms except that we pack to Vector4.

The end goal for all our recent improvements is to be able to decode/resize/save a jpeg (and maybe other formats webp for example) in Rgb24 and Rgba32 as fast as possible.

So... We have two choices:

  1. Refactor the color transforms to convert from the input planes to r,g,b planes as byte to take advantage of this approach.
  2. Refactor the color transforms to convert from the input planes to r,g,b planes as single and convert these methods to normalize and saturate the result to bytes.

@john-h-k @saucecontrol @antonfirsov I'd really like to get a consensus here on the best approach so we can squeeze the absolute best results not only for speed but for reuse (both by ourselves and one day Clint when he goes xplat).

I also need someone to do it as it would take me weeks to figure out. 😝

@saucecontrol (Contributor)

Yeah, if I'm following correctly, these would have been used for the D3->D4 step in @antonfirsov's proposal in #1121, which is no longer a thing because #1411 added a planar YCbCr float -> chunky RGBA float conversion, right?

The most optimized path longer term will be to keep everything in fixed-point math all the way through the JPEG decoder as libjpeg-turbo does, which eliminates a couple of the existing conversion steps. Porting that will be a big effort, but I hope to be able to work on that next year.

@antonfirsov (Member) commented Nov 12, 2020

@saucecontrol a full integer pipeline is the next big goal, but it's a much bigger step than #1411. The idea is that with #1411 and this PR we can do the packing faster (pack 4x more pixels in one AVX2 batch), and it's not a big refactor to move there from our current state.

So it's still a thing, unless you say we shouldn't expect a significant difference compared to point 2 in #1242 (comment).

@saucecontrol (Contributor)

Gotcha. I think it comes down to a question of how much separation (or reusability) you want between the pipeline steps, or how much symmetry you want with your non-HWIntrinsic fallback pipeline steps. The fastest implementation would undoubtedly be a one-shot planar YCbCr float -> chunky RGB24 conversion, since that eliminates at least one round-trip through memory, if not two. The narrowing/saturating and unpacking/interleaving could easily be bolted on to the conversion from #1411 to do it all in one go and all in XMM/YMM registers.

The other question would be how you want to handle moving from a decode to resize stage. Convolution with 3 channels is more tricky than with 4, and if you're going to convert RGB24 back to RGBA128float for that step, you're doing a lot of unnecessary shuffling to eliminate and then restore that 4th channel. I do have a RGB96float convolution you can steal if you want to go that route.

@JimBobSquarePants (Member)

I think we'll leave resize as-is for now. We can optimize there by removing batch premultiplication and reverse from the operation since we will have knowledge of the pixel format alpha behavior. That in combination with reducing the amount of work that takes place during the final stages of jpeg decoding should mean a substantial improvement.

I don't think there has to be that much symmetry. Our existing non-HW intrinsics can remain as-is, and we can use preprocessor directives to establish a more optimized pipeline contract if we need to.

@antonfirsov (Member) commented Nov 28, 2020

Had a quick look after some introductory training on AVX2 shuffles. There is a lot of work left here:

  • Add comprehensive test coverage
  • Get rid of permutes
  • Improve the fallback implementations (PackBytesToUInt24Remainder, PackBytesToUInt32SaturateChannel4Remainder): we should walk the buffer in larger batches. I would consider defining custom structs for better readability and probably better JIT codegen (see the sketch after this list).
  • Benchmark (multiple variants of?) SIMD implementations against (multiple variants of?) fallback code.

Depending on how much we want to rush 1.1, we may decide to change the color converters to pack into float triplets instead, but I still think the universal byte-based packing would be better in the end.
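
To make the custom-struct idea above concrete, a hedged sketch of what the remainder path could look like (ByteTriplet and PackRemainder are hypothetical names, not the PR's or ImageSharp's API):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
internal struct ByteTriplet
{
    public byte R;
    public byte G;
    public byte B;
}

internal static class RemainderSketch
{
    // Writes each pixel as one 3-byte struct instead of three separate byte stores.
    public static void PackRemainder(
        ReadOnlySpan<byte> r, ReadOnlySpan<byte> g, ReadOnlySpan<byte> b, Span<byte> dest)
    {
        Span<ByteTriplet> pixels = MemoryMarshal.Cast<byte, ByteTriplet>(dest);
        for (int i = 0; i < r.Length; i++)
        {
            pixels[i] = new ByteTriplet { R = r[i], G = g[i], B = b[i] };
        }
    }
}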

#if SUPPORTS_RUNTIME_INTRINSICS
HwIntrinsics.PackBytesToUInt32SaturateChannel4Reduce(ref channel0, ref channel1, ref channel2, ref dest);

// I can't immediately see a way to do this operation efficiently with Vector<T> or Vector4<T>. TODO
Member

There is none :)

@antonfirsov (Member) commented Nov 29, 2020

@saucecontrol do you think the following may perform better than the code in the PR?

r0     = <R00, R01, R02, R03, R04, R05, R06, R07 | R08, R09, R10, R11, R12, R13, R14, R15 || R16, R17, R18, R19, R20, R21, R22, R23 | R24, R25, R26, R27, R28, R29, R30, R31>
g0     = <G00, G01, G02, G03, G04, G05, G06, G07 | G08, G09, G10, G11, G12, G13, G14, G15 || G16, G17, G18, G19, G20, G21, G22, G23 | G24, G25, G26, G27, G28, G29, G30, G31>
b0     = <B00, B01, B02, B03, B04, B05, B06, B07 | B08, B09, B10, B11, B12, B13, B14, B15 || B16, B17, B18, B19, B20, B21, B22, B23 | B24, B25, B26, B27, B28, B29, B30, B31>

....

mm256_unpacklo_epi8(r0, zero):
r1     = <R00, ___, R01, ___, R02, ___, R03, ___ | R04, ___, R05, ___, R06, ___, R07, ___ || R16, ___, R17, ___, R18, ___, R19, ___ | R20, ___, R21, ___, R22, ___, R23, ___>

mm256_unpacklo_epi8(b0, g0):
gb     = <B00, G00, B01, G01, B02, G02, B03, G03 | B04, G04, B05, G05, B06, G06, B07, G07 || B16, G16, B17, G17, B18, G18, B19, G19 | B20, G20, B21, G21, B22, G22, B23, G23>

_mm_unpacklo_epi16(gb, r1):
rgb1   = <B00, G00, R00, ___, B01, G01, R01, ___ | B02, G02, R02, ___, B03, G03, R03, ___ || B16, G16, R16, ___, B17, G17, R17, ___ | B18, G18, R18, ___, B19, G19, R19, ___>

shuffle(rgb1):
rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, ___, ___, ___, ___ || B16, G16, R16, B17, G17, R17, B18, G18 | R18, B19, G19, R19, ___, ___, ___, ___>

_mm_unpackhi_epi16(gb, r1):
rgb2   = <B04, G04, R04, ___, B05, G05, R05, ___ | B06, G06, R06, ___, B07, G07, R07, ___ || B20, G20, R20, ___, B21, G21, R21, ___ | B22, G22, R22, ___, B23, G23, R23, ___>

shuffle(rgb2):
rgb2   = <R04, G04, B04, R05, G05, B05, R06, G06 | B06, R07, G07, B07, ___, ___, ___, ___ || R20, G20, B20, R21, G21, B21, R22, G22 | B22, R23, G23, B23, ___, ___, ___, ___>

.....

mm256_unpackhi_epi8(r0, zero):
r1     = <R08, ___, R09, ___, R10, ___, R11, ___ | R12, ___, R13, ___, R14, ___, R15, ___ || R24, ___, R25, ___, R26, ___, R27, ___ | R28, ___, R29, ___, R30, ___, R31, ___>

mm256_unpackhi_epi8(b0, g0):
gb     = <B08, G08, B09, G09, B10, G10, B11, G11 | B12, G12, B13, G13, B14, G14, B15, G15 || B24, G24, B25, G25, B26, G26, B27, G27 | B28, G28, B29, G29, B30, G30, B31, G31>

_mm_unpacklo_epi16(gb, r1):
rgb3   = <B08, G08, R08, ___, B09, G09, R09, ___ | B10, G10, R10, ___, B11, G11, R11, ___ || B24, G24, R24, ___, B25, G25, R25, ___ | B26, G26, R26, ___, B27, G27, R27, ___>

shuffle(rgb3):
rgb3   = <R08, G08, B08, R09, G09, B09, R10, G10 | B10, R11, G11, B11, ___, ___, ___, ___ || R24, G24, B24, R25, G25, B25, R26, G26 | B26, R27, G27, B27, ___, ___, ___, ___>

_mm_unpackhi_epi16(gb, r1):
rgb4   = <B12, G12, R12, ___, B13, G13, R13, ___ | B14, G14, R14, ___, B15, G15, R15, ___ || B28, G28, R28, ___, B29, G29, R29, ___ | B30, G30, R30, ___, B31, G31, R31, ___>

shuffle(rgb4):
rgb4   = <R12, G12, B12, R13, G13, B13, R14, G14 | B14, R15, G15, B15, ___, ___, ___, ___ || R28, G28, B28, R29, G29, B29, R30, G30 | B30, R31, G31, B31, ___, ___, ___, ___>

............

rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, ___, ___, ___, ___ || B16, G16, R16, B17, G17, R17, B18, G18 | R18, B19, G19, R19, ___, ___, ___, ___>
rgb2   = <R04, G04, B04, R05, G05, B05, R06, G06 | B06, R07, G07, B07, ___, ___, ___, ___ || R20, G20, B20, R21, G21, B21, R22, G22 | B22, R23, G23, B23, ___, ___, ___, ___>
rgb3   = <R08, G08, B08, R09, G09, B09, R10, G10 | B10, R11, G11, B11, ___, ___, ___, ___ || R24, G24, B24, R25, G25, B25, R26, G26 | B26, R27, G27, B27, ___, ___, ___, ___>
rgb4   = <R12, G12, B12, R13, G13, B13, R14, G14 | B14, R15, G15, B15, ___, ___, ___, ___ || R28, G28, B28, R29, G29, B29, R30, G30 | B30, R31, G31, B31, ___, ___, ___, ___>

...........

store(rgb1)
store(rgb2)
store(rgb3)
store(rgb4)
store(_mm256_permute4x64_epi64(rgb1))
store(_mm256_permute4x64_epi64(rgb2))
store(_mm256_permute4x64_epi64(rgb3))
store(_mm256_permute4x64_epi64(rgb4))

Still has 4 permutes.

EDIT: Realized I can use maskstore, hope it's actually faster than permute+store.

@saucecontrol (Contributor) commented Nov 29, 2020

Yeah, that'll be faster. The permutes are not as big a deal as long as there is no dependency chain between them. Cross-lane permutes have a latency of 3, meaning it takes 3 cycles from when the instruction starts to when the result is ready. But they have a reciprocal throughput of 1, meaning it's a pipelined instruction that can start 1 and complete 1 per cycle. If you issue three in a row, the cycle after the third is started, the first is complete -- as long as they don't depend on each others' results.

The PR code will actually stall on the shuffle, shuffle, or, shuffle, or sequences since those have dependencies between them that cause them to run in serial. And it will likely run out of registers because of all the different shuffle masks and end up with some stack spilling.

I'd actually write this one with a permute round to start, in order to minimize the stores.

something like this
vpermd (mask 0,2,4,6,1,3,5,7) x3
r0     = <R00, R01, R02, R03, R08, R09, R10, R11 | R16, R17, R18, R19, R24, R25, R26, R27 || R04, R05, R06, R07, R12, R13, R14, R15 | R20, R21, R22, R23, R28, R29, R30, R31>
g0     = <G00, G01, G02, G03, G08, G09, G10, G11 | G16, G17, G18, G19, G24, G25, G26, G27 || G04, G05, G06, G07, G12, G13, G14, G15 | G20, G21, G22, G23, G28, G29, G30, G31>
b0     = <B00, B01, B02, B03, B08, B09, B10, B11 | B16, B17, B18, B19, B24, B25, B26, B27 || B04, B05, B06, B07, B12, B13, B14, B15 | B20, B21, B22, B23, B28, B29, B30, B31>

vpunpcklbw x2
rg     = <R00, G00, R01, G01, R02, G02, R03, G03 | R08, G08, R09, G09, R10, G10, R11, G11 || R04, G04, R05, G05, R06, G06, R07, G07 | R12, G12, R13, G13, R14, G14, R15, G15>
b0     = <B00, ___, B01, ___, B02, ___, B03, ___ | B08, ___, B09, ___, B10, ___, B11, ___ || B04, ___, B05, ___, B06, ___, B07, ___ | B12, ___, B13, ___, B14, ___, B15, ___>

vpunpcklwd
rgb1   = <R00, G00, B00, ___, R01, G01, B01, ___ | R02, G02, B02, ___, R03, G03, B03, ___ || R04, G04, B04, ___, R05, B05, G05, ___ | R06, G06, B06, ___, R07, G07, B07, ___>
vpunpckhwd
rgb2   = <R08, G08, B08, ___, R09, G09, B09, ___ | R10, G10, B10, ___, R11, G11, B11, ___ || R12, G12, B12, ___, R13, B13, G13, ___ | R14, G14, B14, ___, R15, G15, B15, ___>

vpunpckhbw x2
rg     = <R16, G16, ...                          | R24, G24, ...                          || R20, G20, ...                          | R28, G28, ...                         >
b0     = <B16, ___, ...                          | B24, ___, ...                          || B20, ___, ...                          | B28, ___, ...                         >

vpunpcklwd
rgb3   = <R16, G16, B16, ___, ...                | R18, G18, B18, ___, ...                || R20, G20, B20, ___, ...                | R22, G22, B22, ___, ...               >
vpunpckhwd
rgb4   = <R24, G24, B24, ___, ...                | R26, G26, B26, ___, ...                || R28, G28, B28, ___, ...                | R30, G30, B30, ___, ...               >

vpshufb x4
rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, ___, ___, ___, ___ || R04, G04, B04, R05, B05, G05, R06, G06 | B06, R07, G07, B07, ___, ___, ___, ___>
rgb2   = <R08, ...
rgb3   = <R16, ...
rgb4   = <R24, ...

vpermd (mask 0,1,2,4,5,6,3,7) x4
rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, R04, G04, B04, R05 || B05, G05, R06, G06, B06, R07, G07, B07 | ___, ___, ___, ___, ___, ___, ___, ___>
rgb2   = <R08, ...
rgb3   = <R16, ...
rgb4   = <R24, ...

store x4

I highly recommend chapter 9 in Agner's optimization guide for background on how latencies, throughputs, instruction units, and ports all interact. Should give you a better idea of what's fast without necessarily having to code it up.

You can check instruction timings on https://uops.info/table.html. Hint: vmaskmovdqu is an expensive one ;)

@antonfirsov (Member)

I'm closing this in favor of #1462. Regardless, @john-h-k thanks a lot for your contribution; it helped kick off the discussion around the topic, which ended up finding the optimal solution!
