
Add PackFromRgbPlanes AVX2 vectorised implementation for Rgba32 and Rgb24 pixels #1242

Closed
john-h-k wants to merge 17 commits

Conversation

@john-h-k

WIP. A few initial questions raised

  • Is there a specific reason for no SSE intrinsics? They would be useful here
  • Why are the remainder paths for vectorised pixel operations marked as NoInlining? They are often going to be hot code
  • There were a few places where there were unnecessary span slices. Although not directly in the scope of this PR, I removed them
  • What other pixel types need this impl?

@CLAassistant commented Jun 19, 2020

CLA assistant check
All committers have signed the CLA.

@JimBobSquarePants (Member) commented Jun 21, 2020

Thanks @john-h-k for making a start on this, really appreciate it. I'll tag @antonfirsov and @saucecontrol here since they'll both provide much better input to these APIs than I ever will.

Is there a specific reason for no SSE intrinsics? They would be useful here

No reason I am aware of.

Why are the remainder paths for vectorised pixel operations marked as NoInlining? They are often going to be hot code

Possible oversight. I would inline them.

There were a few places where there were unnecessary span slices. Although not directly in the scope of this PR, I removed them

I think you misread them. There were two different slice overloads taking place. I've reverted those changes to fix the output and also changed what looked like a copypasta when slicing in your code.

What other pixel types need this impl?

All of the RGBA compatible types should be able to take advantage of the new API since it should be a case of channel swapping for individual implementations.

@john-h-k (Author) commented Jun 21, 2020

There were two different slice overloads taking place. I've reverted those changes to fix the output and also changed what looked like a copypasta when slicing in your code.

Ah yeah, that was a silly mistake, apologies. Shame there's not a Span<T> Split(int index, out Span<T> start, out Span<T> end) to avoid the double argument checks there.
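
For illustration, a minimal sketch of such a helper (hypothetical, not part of the BCL or ImageSharp) might look like this; note it still wraps two Slice calls, whereas a runtime-provided version could validate the index once:

using System;

internal static class SpanExtensions
{
    // Splits a span at 'index' into the leading and trailing parts in one call.
    public static void Split<T>(this Span<T> span, int index, out Span<T> start, out Span<T> end)
    {
        start = span.Slice(0, index);
        end = span.Slice(index);
    }
}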

Is there a specific reason for no SSE intrinsics? They would be useful here

No reason I am aware of.

For Intel desktop, post-2013 chips will have AVX2, but pre-2015 AMD desktop chips won't (I know at least one dev whose main system doesn't have AVX2, for example). I vaguely recall that mobile and laptop chips don't have AVX2 as frequently (but will have SSE), so I'll check whether that's right; if so, that would be a valuable reason to support SSE. Also, for various reasons consumers may want to disable AVX2 (the JIT supports an environment variable to do so), as it can have downclocking issues, in which case having SSE would again be useful.

@codecov bot commented Jun 21, 2020

Codecov Report

Merging #1242 (f4cabf2) into master (05659b8) will decrease coverage by 0.39%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master    #1242      +/-   ##
==========================================
- Coverage   83.68%   83.29%   -0.40%     
==========================================
  Files         734      735       +1     
  Lines       31990    32142     +152     
  Branches     3605     3616      +11     
==========================================
  Hits        26772    26772              
- Misses       4505     4657     +152     
  Partials      713      713              
Flag Coverage Δ
unittests 83.29% <0.00%> (-0.40%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...arp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs 70.66% <0.00%> (-1.94%) ⬇️
...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs 76.39% <0.00%> (-21.41%) ⬇️
src/ImageSharp/Common/Helpers/SimdUtils.cs 41.33% <0.00%> (-24.63%) ⬇️
...ntations/PixelOperations/Rgba32.PixelOperations.cs 70.83% <0.00%> (-29.17%) ⬇️
...mats/PixelImplementations/Rgb24.PixelOperations.cs 0.00% <0.00%> (ø)
...ImageSharp/PixelFormats/PixelOperations{TPixel}.cs 33.33% <0.00%> (-6.07%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@saucecontrol (Contributor) left a comment
Left a few comments for now on some logic errors and some general tips for the SIMD code. I'd be happy to help with more detail, but I figure if this is a learning project you may want to work through the solution yourself. Let me know if you have any questions.

On the question of the lack of SSE2 code, I believe the reason there isn't any for the existing float->byte conversion is that it was built onto an existing code path using Vector<T>, and that code path was only advantageous when Vector<float>.Count is 8 (which implies AVX2). @antonfirsov can correct me if I misexplained that. In any case, any new HWIntrinsics code should definitely have an SSE fallback, so don't let the fact it's missing from the other implementation stop you.


if (adjustedCount > 0)
{
channel0 = channel0.Slice(adjustedCount);
Contributor

This is the same Slice mistake @JimBobSquarePants pointed out from the other changes. You're slicing to the remainder before calling the method that's supposed to do the main part.
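
A minimal sketch of the intended ordering, with hypothetical stand-ins for the vectorised reduce and the scalar remainder (these are not the PR's actual methods, and the 4-bytes-per-pixel factor is only illustrative):

using System;

internal static class PackOrderingSketch
{
    public static void Pack(ReadOnlySpan<byte> channel0, Span<byte> dest)
    {
        // The vectorised body runs over the full span first and reports how much it handled.
        int adjustedCount = VectorizedReduce(channel0, dest);

        // Only then do we slice, so the scalar path sees just the leftover tail.
        if (adjustedCount > 0)
        {
            channel0 = channel0.Slice(adjustedCount);
            dest = dest.Slice(adjustedCount * 4); // assuming 4 output bytes per pixel
        }

        ScalarRemainder(channel0, dest);
    }

    // Placeholder: pretend the vectorised path packed the largest multiple of 32 pixels.
    private static int VectorizedReduce(ReadOnlySpan<byte> src, Span<byte> dst)
        => src.Length - (src.Length % 32);

    // Placeholder for the per-pixel fallback.
    private static void ScalarRemainder(ReadOnlySpan<byte> src, Span<byte> dst)
    {
    }
}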

@@ -91,6 +91,190 @@ public static class Avx2Intrinsics
}
}

internal static void PackBytesToUInt32SaturateChannel4Reduce(
Contributor

I don't know what name is best for this, but there's no saturation happening here because the input is already byte.

Author

by "saturate channel 4" i just meant set all bits to true. Not sure why i chose that wording

Contributor

Ah, I get it. I suck at names, so feel free to ignore my feedback on anything naming related 😆

// Deal with the remainder:
if (channel0.Length > 0)
{
PackBytesToUInt24(channel0, channel1, channel2, dest);
Contributor

This is calling itself recursively. Should be calling the remainder method.

Comment on lines 151 to 153
s0 = Avx2.Permute4x64(s0.AsUInt64(), 0b_11_01_10_00).AsByte();
s1 = Avx2.Permute4x64(s1.AsUInt64(), 0b_11_01_10_00).AsByte();
s2 = Avx2.Permute4x64(s2.AsUInt64(), 0b_11_01_10_00).AsByte();
Contributor

Permutes are expensive, so you'll do better to just let the upper/lower lanes stay as is and fix them at the end rather than trying to keep everything in order all the way through. It may help to write the SSE2 version first and have the AVX2 code match that right up until the end when you deal with the 128-bit lanes being interleaved.

Author

👍 That's actually news to me 😄, I was always under the impression permutes were only a couple of cycles.

Contributor

They're 3 cycles latency and 1 cycle throughput, but they're also only able to be scheduled on a single port, so they can easily be a bottleneck. I should have said 'relatively expensive', compared to most of the other instructions used here.

A good test to judge the impact would be to write up a benchmark for the SSE2 version of the code to see how the AVX2 version compares to that. Your goal would, of course, be 2x performance. You might be surprised what kind of impact all those permutes have in that context :)
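
If it helps, here is a hedged BenchmarkDotNet harness sketch for that comparison. The type and method names are placeholders, and the scalar loop exists only so the snippet compiles; you would point the two benchmark methods at the actual SSE2 and AVX2 implementations:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class PackFromRgbPlanesBenchmark
{
    private byte[] r, g, b, dest;

    [GlobalSetup]
    public void Setup()
    {
        const int count = 2048;
        this.r = new byte[count];
        this.g = new byte[count];
        this.b = new byte[count];
        this.dest = new byte[count * 3];
    }

    // Wire these to the real SSE2 and AVX2 paths under test; the scalar body below is a stand-in.
    [Benchmark(Baseline = true)]
    public void Sse2Path() => PackScalar(this.r, this.g, this.b, this.dest);

    [Benchmark]
    public void Avx2Path() => PackScalar(this.r, this.g, this.b, this.dest);

    private static void PackScalar(byte[] r, byte[] g, byte[] b, byte[] dest)
    {
        for (int i = 0; i < r.Length; i++)
        {
            dest[i * 3] = r[i];
            dest[(i * 3) + 1] = g[i];
            dest[(i * 3) + 2] = b[i];
        }
    }

    public static void Main() => BenchmarkRunner.Run<PackFromRgbPlanesBenchmark>();
}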


if (adjustedCount > 0)
{
channel0 = channel0.Slice(adjustedCount);
Contributor

Incorrect Slice use here as well

ref Vector256<byte> destBase =
ref Unsafe.As<byte, Vector256<byte>>(ref MemoryMarshal.GetReference(dest));

Vector256<byte> s0Mask0 = Vector256.Create(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1).AsByte();
Contributor

They'll direct you to use Vector256.Create like this over in the runtime repo because they've fixed the JIT to recognize it in 5.0. It's exceedingly expensive in 3.x, though, and since 3.1 has 2.5 years of LTS left, I would personally keep using the ReadOnlySpan<byte> trick to load these from static data.

@JimBobSquarePants do you have a preference there?

Member

Yeah, let's use the ROS trick here. We want to try to max out on all supported frameworks.
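
For reference, a minimal sketch of the ReadOnlySpan<byte> static-data trick (the mask bytes and names here are illustrative, not the PR's data): a new byte[] returned directly as ReadOnlySpan<byte> is compiled into a reference to the assembly's static data section, so there is no allocation and the load stays cheap on netcoreapp3.x.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

internal static class ShuffleMasks
{
    // Emitted into the assembly's data section; no array is allocated at run time.
    private static ReadOnlySpan<byte> PackMaskBytes => new byte[]
    {
        0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
        0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
    };

    // Reinterpret the 32 bytes of static data as a Vector256<byte>.
    public static Vector256<byte> PackMask
        => Unsafe.ReadUnaligned<Vector256<byte>>(ref MemoryMarshal.GetReference(PackMaskBytes));
}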

Comment on lines 252 to 254
Vector256<byte> loS0 = Avx2.Permute2x128(s0, s0, 0);
Vector256<byte> loS1 = Avx2.Permute2x128(s1, s1, 0);
Vector256<byte> loS2 = Avx2.Permute2x128(s2, s2, 0);
Contributor

Again, all these permutes and shuffles will be quite expensive. What I do for these 3 channel cases is treat them the same as 4 channel right up until the end (in this case with a dummy zero vector), and then do a single shuffle to pack the 12 good bytes out of each 16 together. That does mean overrunning the end of your buffer by 8 bytes for the AVX2 implementation, though, so you'd have to adjust your remainder/cleanup length by 8 to compensate. It's worth doing some benchmarks to see the difference if you're not clear on it.

Member

That single shuffle... How would that work? I can't figure it out without multiple permutes.

@saucecontrol (Contributor) commented Oct 25, 2020

So, assuming this one starts off the same as the 4-channel version, you'll end up with vectors that look like this:

BGRx:BGRx:BGRx:BGRx||BGRx:BGRx:BGRx:BGRx

With a shuffle, you can pack the BGR/RGB triplets together, putting the filler/alpha values at the end of each lane. Then with a permute, you can cram the 12-byte valid sections of each lane together, leaving all the dummy values in the last 8 bytes. Like such (forgive my poor ascii art skills):

BGRx:BGRx:BGRx:BGRx||BGRx:BGRx:BGRx:BGRx
vpshufb (0,1,2,4,5,6,8,9,a,c,d,e,3,7,b,f)
                   ||
                   \/
BGRB:GRBG:RBGR:xxxx||BGRB:GRBG:RBGR:xxxx
vpermd (0,1,2,4,5,6,3,7)
                   ||
                   \/
BGRB:GRBG:RBGR:BGRB||GRBG:RBGR:xxxx:xxxx

At that point, you can either write the full 32-byte vector out (but then only advance your output ref/ptr by 24 bytes), or you can write the lower lane and then movq the first half of the upper lane, depending on whether you're at the end of your out buffer or not.
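
In C# intrinsics terms, that last step might look roughly like the sketch below. The masks and names are illustrative only, and in practice the masks would be loaded via the ReadOnlySpan<byte> trick discussed earlier rather than built with Vector256.Create on 3.x:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class Rgb24PackSketch
{
    // 'bgrx' holds BGRx:BGRx:BGRx:BGRx in each 128-bit lane after the 4-channel style unpacks.
    public static Vector256<byte> PackTriplets(Vector256<byte> bgrx)
    {
        // vpshufb: pack the 12 useful bytes of each lane together, filler bytes last in the lane.
        Vector256<byte> shuffleMask = Vector256.Create(
            (byte)0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
            0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15);
        Vector256<byte> perLane = Avx2.Shuffle(bgrx, shuffleMask);

        // vpermd: cram the two 12-byte valid sections together; the last 8 bytes are filler.
        Vector256<int> permuteMask = Vector256.Create(0, 1, 2, 4, 5, 6, 3, 7);
        Vector256<byte> packed = Avx2.PermuteVar8x32(perLane.AsInt32(), permuteMask).AsByte();

        // The caller writes all 32 bytes but advances the destination by only 24, or writes the
        // lower lane plus the first 8 bytes of the upper lane when near the end of the buffer.
        return packed;
    }
}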

Contributor

BTW, because this one requires 3 permutes to start and then 4 to end, it may be one where SSE2 is faster than AVX2. Would be worth benchmarking.

Member

Yay! I couldn't figure out the first part. Thanks!

On another note, BroadcastVector128ToVector256 should absolutely have an overload accepting a Vector128<T>.


Contributor

Nice!

The vbroadcasti128 instruction only accepts a memory operand, so to use it with Vector128<T>, it would have to spill to memory. You can, however, use Vector256<T>.Create() with the same Vector128<T> for both args, and it will emit the correct permute for you.
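
A quick sketch of that (the method and parameter names are illustrative, not an ImageSharp API):

using System.Runtime.Intrinsics;

internal static class BroadcastSketch
{
    // Duplicates an in-register Vector128<byte> into both 128-bit lanes of a Vector256<byte>,
    // letting the JIT pick the in-register permute rather than spilling for vbroadcasti128.
    public static Vector256<byte> BroadcastToVector256(Vector128<byte> vec128)
        => Vector256.Create(vec128, vec128);
}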

@antonfirsov (Member)

@john-h-k thanks a lot for taking the time to look at this! Currently I'm on a holiday, AFK 99% of the time. I will have a deeper look next week. I trust @saucecontrol 100% about technical suggestions on SIMD, Span etc.

A few random answers in the meanwhile:

Is there a specific reason for no SSE intrinsics? They would be useful here

The focus was on server/cloud, since the vast majority of our users are in the cloud. I assumed that AVX2 should be available on most cloud VMs, but that was not based on actual research. It's fine to add other paths of course, but we should account for the added complexity it would introduce, e.g. make sure those paths are actually covered by unit tests -> add changes to test "infrastructure code" if necessary -> ....

(which implies AVX2). @antonfirsov can correct me if I misexplained that.

Yes, there might still be many places in the code where I used the term AVX2 as a synonym for Vector<T> of 8 elements. Recently I started to use the term Vector8 for those code paths (AVX2 belongs to the HWIntrinsics stuff now).

@antonfirsov (Member)

It would be really great to finish this! @john-h-k any chance you are around and have some time? I will have the necessary review capacity in the next 2 weeks.

@john-h-k (Author)

Yes, I should be able to finish this Friday or the coming weekend. I've been very busy with A-levels, so sorry for letting the PR go stale!

@JimBobSquarePants (Member)

A-levels first mate! Hope they went well!

@JimBobSquarePants (Member)

@john-h-k There's been a bit of churn in the repo around this area of the code as I added some new intrinsics methods. I've updated your fork though, so you don't have to deal with the merge conflicts.

Vector256<byte> s1 = Unsafe.Add(ref source1Base, i);
Vector256<byte> s2 = Unsafe.Add(ref source2Base, i);

s0 = Avx2.Permute4x64(s0.AsUInt64(), 0b_11_01_10_00).AsByte();
Member

If I'm not missing something here, we can pack these with less work in a manner similar to how it's done in the jpeg color converter?

internal static void ConvertCore(in ComponentValues values, Span<Vector4> result, float maxValue, float halfValue)

@saucecontrol (Contributor) commented Oct 24, 2020

Yep, it's the same problem. The unpack operations work in-lane, so whatever starts in the upper lane of the input will be in the upper lanes of the output. The current code re-permutes after each round to keep things in the right place throughout. The other options are:

  1. Deal with it at the end, by writing the 4 lower lanes to the output first followed by the 4 upper lanes, or permuting the 128-bit pairs together before writing.
  2. Rearrange the input so the blocks of 4 bytes (which will become blocks of 16 bytes) will come out paired correctly after the unpacks.

Option 2 will be cheaper since extract costs the same as a permute, and as with the YCbCr conversion, you can get by with only permuting 3 inputs for 4 outputs.

Member

Thought so, thanks for confirming. 👍

@JimBobSquarePants (Member)

I think we need to have a rethink here.

We're currently doing a very similar operation in our jpeg decoder color transforms except that we pack to Vector4.

The end goal for all our recent improvements is to be able to decode/resize/save a jpeg (and maybe other formats webp for example) in Rgb24 and Rgba32 as fast as possible.

So... We have two choices:

  1. Refactor the color transforms to convert from the input planes to r,g,b planes as byte to take advantage of this approach.
  2. Refactor the color transforms to convert from the input planes to r,g,b planes as single and convert these methods to normalize and saturate the result to bytes.

@john-h-k @saucecontrol @antonfirsov I'd really like to get a consensus here on the best approach so we can squeeze the absolute best results not only for speed but for reuse (both by ourselves and one day Clint when he goes xplat).

I also need someone to do it as it would take me weeks to figure out. 😝

@saucecontrol (Contributor)

Yeah, if I'm following correctly, these would have been used for the D3->D4 step in @antonfirsov's proposal in #1121, which is no longer a thing because #1411 added a planar YCbCr float -> chunky RGBA float conversion, right?

The most optimized path longer term will be to keep everything in fixed-point math all the way through the JPEG decoder as libjpeg-turbo does, which eliminates a couple of the existing conversion steps. Porting that will be a big effort, but I hope to be able to work on that next year.

@antonfirsov (Member) commented Nov 12, 2020

@saucecontrol a full integer pipeline is the next big goal, but it's a much bigger step than #1411. The idea is that with #1411 and this PR we can do the packing faster (pack 4x more pixels in one AVX2 batch), and it's not a big refactor to move there from our current state.

So it's still a thing, unless you say we shouldn't expect a significant difference compared to point 2 in #1242 (comment).

@saucecontrol (Contributor)

Gotcha. I think it comes down to a question of how much separation (or reusability) you want between the pipeline steps, or how much symmetry you want with your non-HWIntrinsic fallback pipeline steps. The fastest implementation would undoubtedly be a one-shot planar YCbCr float -> chunky RGB24 conversion, since that eliminates at least one round-trip through memory, if not two. The narrowing/saturating and unpacking/interleaving could easily be bolted on to the conversion from #1411 to do it all in one go and all in XMM/YMM registers.

The other question would be how you want to handle moving from a decode to resize stage. Convolution with 3 channels is more tricky than with 4, and if you're going to convert RGB24 back to RGBA128float for that step, you're doing a lot of unnecessary shuffling to eliminate and then restore that 4th channel. I do have a RGB96float convolution you can steal if you want to go that route.

@JimBobSquarePants (Member)

I think we'll leave resize as-is for now. We can optimize there by removing batch premultiplication and reverse from the operation since we will have knowledge of the pixel format alpha behavior. That in combination with reducing the amount of work that takes place during the final stages of jpeg decoding should mean a substantial improvement.

I don't think there has to be that much symmetry. Our existing non-HW intrinsics can remain as-is, and we can use preprocessor directives to establish a more optimized pipeline contract if we need to.

@antonfirsov (Member) commented Nov 28, 2020

Had a quick look after some introductory training on AVX2 shuffles. There is a lot of work left here:

  • Add comprehensive test coverage
  • Get rid of permutes
  • Improve the fallback implementations (PackBytesToUInt24Remainder, PackBytesToUInt32SaturateChannel4Remainder): we should walk the buffer in larger batches. I would consider defining custom structs for better readability and probably better JIT codegen (see the sketch after this list).
  • Benchmark (multiple variants of?) SIMD implementations against (multiple variants of?) fallback code.

Depending on how much we want to rush 1.1, we may decide to change the color converters to pack into float triplets instead, but I still think the universal byte-based packing would be better in the end.
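
To make the custom-struct idea above concrete, a hedged sketch of what the remainder path could look like (ByteTriplet and PackRemainder are hypothetical names, not the PR's or ImageSharp's API):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
internal struct ByteTriplet
{
    public byte R;
    public byte G;
    public byte B;
}

internal static class RemainderSketch
{
    // Writes each pixel as one 3-byte struct instead of three separate byte stores.
    public static void PackRemainder(
        ReadOnlySpan<byte> r, ReadOnlySpan<byte> g, ReadOnlySpan<byte> b, Span<byte> dest)
    {
        Span<ByteTriplet> pixels = MemoryMarshal.Cast<byte, ByteTriplet>(dest);
        for (int i = 0; i < r.Length; i++)
        {
            pixels[i] = new ByteTriplet { R = r[i], G = g[i], B = b[i] };
        }
    }
}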

#if SUPPORTS_RUNTIME_INTRINSICS
HwIntrinsics.PackBytesToUInt32SaturateChannel4Reduce(ref channel0, ref channel1, ref channel2, ref dest);

// I can't immediately see a way to do this operation efficiently with Vector<T> or Vector4<T>. TODO
Member

There is none :)

@antonfirsov (Member) commented Nov 29, 2020

@saucecontrol do you think the following may perform better than the code in the PR?

r0     = <R00, R01, R02, R03, R04, R05, R06, R07 | R08, R09, R10, R11, R12, R13, R14, R15 || R16, R17, R18, R19, R20, R21, R22, R23 | R24, R25, R26, R27, R28, R29, R30, R31>
g0     = <G00, G01, G02, G03, G04, G05, G06, G07 | G08, G09, G10, G11, G12, G13, G14, G15 || G16, G17, G18, G19, G20, G21, G22, G23 | G24, G25, G26, G27, G28, G29, G30, G31>
b0     = <B00, B01, B02, B03, B04, B05, B06, B07 | B08, B09, B10, B11, B12, B13, B14, B15 || B16, B17, B18, B19, B20, B21, B22, B23 | B24, B25, B26, B27, B28, B29, B30, B31>

....

mm256_unpacklo_epi8(r0, zero):
r1     = <R00, ___, R01, ___, R02, ___, R03, ___ | R04, ___, R05, ___, R06, ___, R07, ___ || R16, ___, R17, ___, R18, ___, R19, ___ | R20, ___, R21, ___, R22, ___, R23, ___>

mm256_unpacklo_epi8(b0, g0):
gb     = <B00, G00, B01, G01, B02, G02, B03, G03 | B04, G04, B05, G05, B06, G06, B07, G07 || B16, G16, B17, G17, B18, G18, B19, G19 | B20, G20, B21, G21, B22, G22, B23, G23>

_mm_unpacklo_epi16(gb, r1):
rgb1   = <B00, G00, R00, ___, B01, G01, R01, ___ | B02, G02, R02, ___, B03, G03, R03, ___ || B16, G16, R16, ___, B17, G17, R17, ___ | B18, G18, R18, ___, B19, G19, R19, ___>

shuffle(rgb1):
rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, ___, ___, ___, ___ || B16, G16, R16, B17, G17, R17, B18, G18 | R18, B19, G19, R19, ___, ___, ___, ___>

_mm_unpackhi_epi16(gb, r1):
rgb2   = <B04, G04, R04, ___, B05, G05, R05, ___ | B06, G06, R06, ___, B07, G07, R07, ___ || B20, G20, R20, ___, B21, G21, R21, ___ | B22, G22, R22, ___, B23, G23, R23, ___>

shuffle(rgb2):
rgb2   = <R04, G04, B04, R05, G05, B05, R06, G06 | B06, R07, G07, B07, ___, ___, ___, ___ || R20, G20, B20, R21, G21, B21, R22, G22 | B22, R23, G23, B23, ___, ___, ___, ___>

.....

mm256_unpackhi_epi8(r0, zero):
r1     = <R08, ___, R09, ___, R10, ___, R11, ___ | R12, ___, R13, ___, R14, ___, R15, ___ || R24, ___, R25, ___, R26, ___, R27, ___ | R28, ___, R29, ___, R30, ___, R31, ___>

mm256_unpackhi_epi8(b0, g0):
gb     = <B08, G08, B09, G09, B10, G10, B11, G11 | B12, G12, B13, G13, B14, G14, B15, G15 || B24, G24, B25, G25, B26, G26, B27, G27 | B28, G28, B29, G29, B30, G30, B31, G31>

_mm_unpacklo_epi16(gb, r1):
rgb3   = <B08, G08, R08, ___, B09, G09, R09, ___ | B10, G10, R10, ___, B11, G11, R11, ___ || B24, G24, R24, ___, B25, G25, R25, ___ | B26, G26, R26, ___, B27, G27, R27, ___>

shuffle(rgb3):
rgb3   = <R08, G08, B08, R09, G09, B09, R10, G10 | B10, R11, G11, B11, ___, ___, ___, ___ || R24, G24, B24, R25, G25, B25, R26, G26 | B26, R27, G27, B27, ___, ___, ___, ___>

_mm_unpackhi_epi16(gb, r1):
rgb4   = <B12, G12, R12, ___, B13, G13, R13, ___ | B14, G14, R14, ___, B15, G15, R15, ___ || B28, G28, R28, ___, B29, G29, R29, ___ | B30, G30, R30, ___, B31, G31, R31, ___>

shuffle(rgb4):
rgb4   = <R12, G12, B12, R13, G13, B13, R14, G14 | B14, R15, G15, B15, ___, ___, ___, ___ || R28, G28, B28, R29, G29, B29, R30, G30 | B30, R31, G31, B31, ___, ___, ___, ___>

............

rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, ___, ___, ___, ___ || B16, G16, R16, B17, G17, R17, B18, G18 | R18, B19, G19, R19, ___, ___, ___, ___>
rgb2   = <R04, G04, B04, R05, G05, B05, R06, G06 | B06, R07, G07, B07, ___, ___, ___, ___ || R20, G20, B20, R21, G21, B21, R22, G22 | B22, R23, G23, B23, ___, ___, ___, ___>
rgb3   = <R08, G08, B08, R09, G09, B09, R10, G10 | B10, R11, G11, B11, ___, ___, ___, ___ || R24, G24, B24, R25, G25, B25, R26, G26 | B26, R27, G27, B27, ___, ___, ___, ___>
rgb4   = <R12, G12, B12, R13, G13, B13, R14, G14 | B14, R15, G15, B15, ___, ___, ___, ___ || R28, G28, B28, R29, G29, B29, R30, G30 | B30, R31, G31, B31, ___, ___, ___, ___>

...........

store(rgb1)
store(rgb2)
store(rgb3)
store(rgb4)
store(_mm256_permute4x64_epi64(rgb1))
store(_mm256_permute4x64_epi64(rgb2))
store(_mm256_permute4x64_epi64(rgb3))
store(_mm256_permute4x64_epi64(rgb4))

Still has 4 permutes.

EDIT: Realized I can use maskstore, hope it's actually faster than permute+store.

@saucecontrol (Contributor) commented Nov 29, 2020

Yeah, that'll be faster. The permutes are not as big a deal as long as there is no dependency chain between them. Cross-lane permutes have a latency of 3, meaning it takes 3 cycles from when the instruction starts to when the result is ready. But they have a reciprocal throughput of 1, meaning it's a pipelined instruction that can start 1 and complete 1 per cycle. If you issue three in a row, the cycle after the third is started, the first is complete -- as long as they don't depend on each others' results.

The PR code will actually stall on the shuffle, shuffle, or, shuffle, or sequences since those have dependencies between them that cause them to run in serial. And it will likely run out of registers because of all the different shuffle masks and end up with some stack spilling.

I'd actually write this one with a permute round to start, in order to minimize the stores.

something like this
vpermd (mask 0,2,4,6,1,3,5,7) x3
r0     = <R00, R01, R02, R03, R08, R09, R10, R11 | R16, R17, R18, R19, R24, R25, R26, R27 || R04, R05, R06, R07, R12, R13, R14, R15 | R20, R21, R22, R23, R28, R29, R30, R31>
g0     = <G00, G01, G02, G03, G08, G09, G10, G11 | G16, G17, G18, G19, G24, G25, G26, G27 || G04, G05, G06, G07, G12, G13, G14, G15 | G20, G21, G22, G23, G28, G29, G30, G31>
b0     = <B00, B01, B02, B03, B08, B09, B10, B11 | B16, B17, B18, B19, B24, B25, B26, B27 || B04, B05, B06, B07, B12, B13, B14, B15 | B20, B21, B22, B23, B28, B29, B30, B31>

vpunpcklbw x2
rg     = <R00, G00, R01, G01, R02, G02, R03, G03 | R08, G08, R09, G09, R10, G10, R11, G11 || R04, G04, R05, G05, R06, G06, R07, G07 | R12, G12, R13, G13, R14, G14, R15, G15>
b0     = <B00, ___, B01, ___, B02, ___, B03, ___ | B08, ___, B09, ___, B10, ___, B11, ___ || B04, ___, B05, ___, B06, ___, B07, ___ | B12, ___, B13, ___, B14, ___, B15, ___>

vpunpcklwd
rgb1   = <R00, G00, B00, ___, R01, G01, B01, ___ | R02, G02, B02, ___, R03, G03, B03, ___ || R04, G04, B04, ___, R05, B05, G05, ___ | R06, G06, B06, ___, R07, G07, B07, ___>
vpunpckhwd
rgb2   = <R08, G08, B08, ___, R09, G09, B09, ___ | R10, G10, B10, ___, R11, G11, B11, ___ || R12, G12, B12, ___, R13, B13, G13, ___ | R14, G14, B14, ___, R15, G15, B15, ___>

vpunpckhbw x2
rg     = <R16, G16, ...                          | R24, G24, ...                          || R20, G20, ...                          | R28, G28, ...                         >
b0     = <B16, ___, ...                          | B24, ___, ...                          || B20, ___, ...                          | B28, ___, ...                         >

vpunpcklwd
rgb3   = <R16, G16, B16, ___, ...                | R18, G18, B18, ___, ...                || R20, G20, B20, ___, ...                | R22, G22, B22, ___, ...               >
vpunpckhwd
rgb4   = <R24, G24, B24, ___, ...                | R26, G26, B26, ___, ...                || R28, G28, B28, ___, ...                | R30, G30, B30, ___, ...               >

vpshufb x4
rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, ___, ___, ___, ___ || R04, G04, B04, R05, B05, G05, R06, G06 | B06, R07, G07, B07, ___, ___, ___, ___>
rgb2   = <R08, ...
rgb3   = <R16, ...
rgb4   = <R24, ...

vpermd (mask 0,1,2,4,5,6,3,7) x4
rgb1   = <R00, G00, B00, R01, G01, B01, R02, G02 | B02, R03, G03, B03, R04, G04, B04, R05 || B05, G05, R06, G06, B06, R07, G07, B07 | ___, ___, ___, ___, ___, ___, ___, ___>
rgb2   = <R08, ...
rgb3   = <R16, ...
rgb4   = <R24, ...

store x4

I highly recommend chapter 9 in Agner's optimization guide for background on how latencies, throughputs, instruction units, and ports all interact. Should give you a better idea of what's fast without necessarily having to code it up.

You can check instruction timings on https://uops.info/table.html. Hint: vmaskmovdqu is an expensive one ;)

@antonfirsov (Member)

I'm closing this in favor of #1462. Regardless, @john-h-k thanks a lot for your contribution; it helped kick off the discussion around the topic, which ended up finding the optimal solution!
