Add PackFromRgbPlanes AVX2 vectorised implementation for Rgba32 and Rgb24 pixels #1242
Conversation
Thanks @john-h-k for making a start on this, really appreciate it. I'll tag @antonfirsov and @saucecontrol here since they'll both provide much better input to these APIs than I ever will.
No reason I am aware of.
Possible oversight. I would inline them.
I think you misread them. There were two different Slice overloads in play. I've reverted those changes to fix the output and also fixed what looked like a copy-paste error when slicing in your code.
All of the RGBA compatible types should be able to take advantage of the new API, since it should be a case of channel swapping for the individual implementations.
Ah yeah, that was a silly mistake, apologies. Shame there's not a
For Intel desktop, post-2013 chips will have AVX2, but pre-2015 AMD desktop chips won't (I know at least one dev whose main system doesn't have AVX2, for example). I vaguely recall mobile and laptop chips don't have AVX2 as frequently (but will have SSE), so I'll check whether that's right; if so, that would be a valuable reason to support SSE. Also, for various reasons consumers may want to disable AVX2 (the JIT supports an environment variable to do so), as it can have downclocking issues, in which case having SSE would again be useful.
Codecov Report
@@            Coverage Diff             @@
##           master    #1242      +/-  ##
==========================================
- Coverage   83.68%   83.29%    -0.40%
==========================================
  Files         734      735        +1
  Lines       31990    32142      +152
  Branches     3605     3616       +11
==========================================
  Hits        26772    26772
- Misses       4505     4657      +152
  Partials      713      713
Left a few comments for now on some logic errors and some general tips for the SIMD code. I'd be happy to help with more detail, but I figure if this is a learning project you may want to work through the solution yourself. Let me know if you have any questions.
On the question of the lack of SSE2 code, I believe the reason there isn't any for the existing float->byte conversion is that it was built onto an existing code path using Vector<T>, and that code path was only advantageous when Vector<float>.Count is 8 (which implies AVX2). @antonfirsov can correct me if I misexplained that. In any case, any new HWIntrinsics code should definitely have an SSE fallback, so don't let the fact it's missing from the other implementation stop you.
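For what it's worth, the usual shape of such a dispatch is sketched below; the names and structure are illustrative, not taken from this PR:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

internal static class PackDispatch
{
    // Hypothetical entry point: pick the widest supported path at runtime.
    internal static void PackFromRgbPlanes(
        ReadOnlySpan<byte> r, ReadOnlySpan<byte> g, ReadOnlySpan<byte> b, Span<byte> dest)
    {
        if (Avx2.IsSupported)
        {
            // 256-bit path: 32 pixels per iteration.
        }
        else if (Ssse3.IsSupported)
        {
            // 128-bit path; byte shuffles (pshufb) need SSSE3, plain SSE2 covers the unpacks.
        }
        else
        {
            // Scalar fallback.
        }
    }
}
```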
if (adjustedCount > 0)
{
    channel0 = channel0.Slice(adjustedCount);
This is the same Slice mistake @JimBobSquarePants pointed out from the other changes. You're slicing to the remainder before calling the method that's supposed to do the main part.
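For illustration, a sketch of the intended order, with hypothetical helper names (PackVectorised and PackRemainder are stand-ins, not the PR's actual methods):

```csharp
// Process the largest prefix that is a multiple of the vector width first...
int adjustedCount = channel0.Length - (channel0.Length % Vector256<byte>.Count);
if (adjustedCount > 0)
{
    PackVectorised(
        channel0.Slice(0, adjustedCount),
        channel1.Slice(0, adjustedCount),
        channel2.Slice(0, adjustedCount),
        dest);

    // ...and only then slice the spans down to the unprocessed tail.
    channel0 = channel0.Slice(adjustedCount);
    channel1 = channel1.Slice(adjustedCount);
    channel2 = channel2.Slice(adjustedCount);
    dest = dest.Slice(adjustedCount * 4);
}

// Finish the tail with the scalar remainder method.
PackRemainder(channel0, channel1, channel2, dest);
```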
@@ -91,6 +91,190 @@ public static class Avx2Intrinsics
}
}
internal static void PackBytesToUInt32SaturateChannel4Reduce( |
I don't know what name is best for this, but there's no saturation happening here because the input is already byte.
by "saturate channel 4" i just meant set all bits to true. Not sure why i chose that wording
Ah, I get it. I suck at names, so feel free to ignore my feedback on anything naming related 😆
// Deal with the remainder:
if (channel0.Length > 0)
{
    PackBytesToUInt24(channel0, channel1, channel2, dest);
This is calling itself recursively. Should be calling the remainder method.
s0 = Avx2.Permute4x64(s0.AsUInt64(), 0b_11_01_10_00).AsByte();
s1 = Avx2.Permute4x64(s1.AsUInt64(), 0b_11_01_10_00).AsByte();
s2 = Avx2.Permute4x64(s2.AsUInt64(), 0b_11_01_10_00).AsByte();
Permutes are expensive, so you'll do better to just let the upper/lower lanes stay as is and fix them at the end rather than trying to keep everything in order all the way through. It may help to write the SSE2 version first and have the AVX2 code match that right up until the end when you deal with the 128-bit lanes being interleaved.
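For example (a minimal sketch, assuming two results a and b whose 128-bit lanes came out interleaved), the fix-up at the end can be a single Permute2x128 per output:

```csharp
// Merge the lanes once at the end instead of re-permuting every round.
Vector256<byte> lo = Avx2.Permute2x128(a, b, 0x20); // a.lower : b.lower
Vector256<byte> hi = Avx2.Permute2x128(a, b, 0x31); // a.upper : b.upper
```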
👍 That's actually news to me 😄. I was always under the impression permutes were only a couple of cycles.
They're 3 cycles latency and 1 cycle throughput, but they're also only able to be scheduled on a single port, so they can easily be a bottleneck. I should have said 'relatively expensive', compared to most of the other instructions used here.
A good test to judge the impact would be to write up a benchmark for the SSE2 version of the code to see how the AVX2 version compares to that. Your goal would, of course, be 2x performance. You might be surprised what kind of impact all those permutes have in that context :)
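As a sketch of what that harness could look like with BenchmarkDotNet (the kernel names below are hypothetical placeholders, standing in for the PR's SSE2 and AVX2 implementations):

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class PackFromRgbPlanesBench
{
    private readonly byte[] r = new byte[4096];
    private readonly byte[] g = new byte[4096];
    private readonly byte[] b = new byte[4096];
    private readonly byte[] dest = new byte[4096 * 4];

    [Benchmark(Baseline = true)]
    public void Sse() => PackSse(this.r, this.g, this.b, this.dest);   // hypothetical 128-bit kernel

    [Benchmark]
    public void Avx2() => PackAvx2(this.r, this.g, this.b, this.dest); // hypothetical 256-bit kernel

    // Stand-ins for the real implementations under test.
    private static void PackSse(Span<byte> c0, Span<byte> c1, Span<byte> c2, Span<byte> d) { }
    private static void PackAvx2(Span<byte> c0, Span<byte> c1, Span<byte> c2, Span<byte> d) { }
}
```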
if (adjustedCount > 0)
{
    channel0 = channel0.Slice(adjustedCount);
Incorrect Slice use here as well
ref Vector256<byte> destBase =
    ref Unsafe.As<byte, Vector256<byte>>(ref MemoryMarshal.GetReference(dest));
Vector256<byte> s0Mask0 = Vector256.Create(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1).AsByte();
They'll direct you to use Vector256.Create like this over in the runtime repo, because they've fixed the JIT to recognize it in 5.0. It's exceedingly expensive in 3.x, though, and since 3.1 has 2.5 years of LTS left, I would personally keep using the ReadOnlySpan<byte> trick to load these from static data.
@JimBobSquarePants do you have a preference there?
Yeah, let's use the ROS trick here. We want to try to max out on all supported frameworks.
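For reference, the ROS trick being discussed looks roughly like this (the mask bytes below are illustrative). The C# compiler emits a ReadOnlySpan<byte> property over a constant array initializer as static data in the binary, so no array is ever allocated, and the load is a single unaligned read:

```csharp
// Emitted as an RVA (static data) field; no allocation, works on 3.x and 5.0.
private static ReadOnlySpan<byte> ShuffleMaskData => new byte[]
{
    0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
    0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
};

// Load once per call (Unsafe from System.Runtime.CompilerServices,
// MemoryMarshal from System.Runtime.InteropServices):
Vector256<byte> mask = Unsafe.ReadUnaligned<Vector256<byte>>(
    ref MemoryMarshal.GetReference(ShuffleMaskData));
```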
Vector256<byte> loS0 = Avx2.Permute2x128(s0, s0, 0);
Vector256<byte> loS1 = Avx2.Permute2x128(s1, s1, 0);
Vector256<byte> loS2 = Avx2.Permute2x128(s2, s2, 0);
Again, all these permutes and shuffles will be quite expensive. What I do for these 3-channel cases is treat them the same as 4-channel right up until the end (in this case with a dummy zero vector), and then do a single shuffle to pack the 12 good bytes out of each 16 together. That does mean overrunning the end of your buffer by 8 bytes for the AVX2 implementation, though, so you'd have to adjust your remainder/cleanup length by 8 to compensate. It's worth doing some benchmarks to see the difference if you're not clear on it.
That single shuffle... How would that work? I can't figure it out without multiple permutes.
So, assuming this one starts off the same as the 4-channel version, you'll end up with vectors that look like this:
BGRx:BGRx:BGRx:BGRx||BGRx:BGRx:BGRx:BGRx
With a shuffle, you can pack the BGR/RGB triplets together, putting the filler/alpha values at the end of each lane. Then with a permute, you can cram the 12-byte valid sections of each lane together, leaving all the dummy values in the last 8 bytes. Like such (forgive my poor ascii art skills):
BGRx:BGRx:BGRx:BGRx||BGRx:BGRx:BGRx:BGRx
vpshufb (0,1,2,4,5,6,8,9,a,c,d,e,3,7,b,f)
||
\/
BGRB:GRBG:RBGR:xxxx||BGRB:GRBG:RBGR:xxxx
vpermd (0,1,2,4,5,6,3,7)
||
\/
BGRB:GRBG:RBGR:BGRB||GRBG:RBGR:xxxx:xxxx
At that point, you can either write the full 32-byte vector out (but then only advance your output ref/ptr by 24 bytes), or you can write the lower lane and then movq the first half of the upper lane, depending on whether you're at the end of your out buffer or not.
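In C# intrinsics, those two steps might look roughly like the following sketch (the masks are transcribed from the diagram above; bgrx stands for the assumed 4-byte-per-pixel intermediate vector):

```csharp
// vpshufb mask: pack the 12 valid bytes of each lane to the front,
// pushing the filler/alpha bytes to the back of the lane.
Vector256<byte> shufMask = Vector256.Create(
    (byte)0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15,
    0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15);

// vpermd control: cram the two 12-byte runs together, dummies last.
Vector256<int> permMask = Vector256.Create(0, 1, 2, 4, 5, 6, 3, 7);

Vector256<byte> packed = Avx2.Shuffle(bgrx, shufMask);
Vector256<byte> result = Avx2.PermuteVar8x32(packed.AsInt32(), permMask).AsByte();
// Store all 32 bytes, but advance the destination by only 24.
```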
BTW, because this one requires 3 permutes to start and then 4 to end, it may be one where SSE2 is faster than AVX2. Would be worth benchmarking.
Nice!
The vbroadcasti128 instruction only accepts a memory operand, so to use it with Vector128<T>, it would have to spill to memory. You can, however, use Vector256<T>.Create() with the same Vector128<T> for both args, and it will emit the correct permute for you.
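In other words, a minimal sketch of the workaround:

```csharp
Vector128<byte> half = Vector128.Create((byte)0);    // any computed Vector128 value
Vector256<byte> both = Vector256.Create(half, half); // JIT emits a register-to-register
                                                     // permute; no spill to memory as
                                                     // vbroadcasti128 would require
```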
@john-h-k thanks a lot for taking the time to look at this! Currently I'm on holiday, AFK 99% of the time. I will have a deeper look next week. I trust @saucecontrol 100% on technical suggestions about SIMD, Span etc. A few random answers in the meantime:
The focus was on server/cloud, since the vast majority of our users are in the cloud. I had an assumption that AVX2 should be available on most cloud VMs, but it was not based on actual research. It's fine to add other paths of course, but we should account for the added complexity it would introduce. Eg: make sure those paths are actually covered by unit tests -> add changes to test "infrastructure code" if necessary -> ....
Yes, there might still be many places in the code where I used the term AVX2 as a synonym for `Vector<T>` of 8 elements. Recently I started to use the term Vector8 for those code paths (AVX2 belongs to the HWIntrinsics stuff now).
It would be really great to finish this! @john-h-k any chance you are around and have some time? I will have the necessary review capacity in the next 2 weeks.
Yes, I should be able to finish this Friday or the coming weekend. I've been very busy with A-levels, so sorry for letting the PR go stale!
A-levels first mate! Hope they went well!
@john-h-k There's been a bit of churn in the repo around this area of the code as I added some new intrinsics methods. I've updated your fork though, so you don't have to deal with the merge conflicts.
Vector256<byte> s1 = Unsafe.Add(ref source1Base, i);
Vector256<byte> s2 = Unsafe.Add(ref source2Base, i);

s0 = Avx2.Permute4x64(s0.AsUInt64(), 0b_11_01_10_00).AsByte();
If I'm not missing something here, we can pack these with less work, in a manner similar to how it's done in the jpeg color converter?
Line 44 in 120080b:
internal static void ConvertCore(in ComponentValues values, Span<Vector4> result, float maxValue, float halfValue)
Yep, it's the same problem. The unpack operations work in-lane, so whatever starts in the upper lane of the input will be in the upper lanes of the output. The current code re-permutes after each round to keep things in the right place throughout. The other options are:
- Deal with it at the end, by writing the 4 lower lanes to the output first followed by the 4 upper lanes, or permuting the 128-bit pairs together before writing.
- Rearrange the input so the blocks of 4 bytes (which will become blocks of 16 bytes) will come out paired correctly after the unpacks.
Option 2 will be cheaper since extract costs the same as a permute, and as with the YCbCr conversion, you can get by with only permuting 3 inputs for 4 outputs.
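A rough sketch of option 2 for the byte case follows; note the vpermd control value is my own derivation, an assumption to verify with tests, not code from the PR or the jpeg converter:

```csharp
// Pre-permute the three real inputs once so that two rounds of in-lane
// unpacks emit the four output vectors already in memory order.
Vector256<int> ctrl = Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7);
Vector256<byte> rP = Avx2.PermuteVar8x32(r.AsInt32(), ctrl).AsByte();
Vector256<byte> gP = Avx2.PermuteVar8x32(g.AsInt32(), ctrl).AsByte();
Vector256<byte> bP = Avx2.PermuteVar8x32(b.AsInt32(), ctrl).AsByte();
Vector256<byte> a  = Vector256<byte>.Zero; // dummy 4th channel is uniform, needs no permute

Vector256<byte> rgLo = Avx2.UnpackLow(rP, gP);
Vector256<byte> rgHi = Avx2.UnpackHigh(rP, gP);
Vector256<byte> baLo = Avx2.UnpackLow(bP, a);
Vector256<byte> baHi = Avx2.UnpackHigh(bP, a);

Vector256<byte> px0 = Avx2.UnpackLow(rgLo.AsUInt16(), baLo.AsUInt16()).AsByte();  // pixels 0-7
Vector256<byte> px1 = Avx2.UnpackHigh(rgLo.AsUInt16(), baLo.AsUInt16()).AsByte(); // pixels 8-15
Vector256<byte> px2 = Avx2.UnpackLow(rgHi.AsUInt16(), baHi.AsUInt16()).AsByte();  // pixels 16-23
Vector256<byte> px3 = Avx2.UnpackHigh(rgHi.AsUInt16(), baHi.AsUInt16()).AsByte(); // pixels 24-31
```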
Thought so, thanks for confirming. 👍
I think we need to have a rethink here. We're currently doing a very similar operation in our jpeg decoder color transforms, except that we pack to
The end goal for all our recent improvements is to be able to decode/resize/save a jpeg (and maybe other formats, webp for example) in
So... We have two choices:
@john-h-k @saucecontrol @antonfirsov I'd really like to get a consensus here on the best approach, so we can squeeze the absolute best results not only for speed but for reuse (both by ourselves and, one day, Clint when he goes xplat). I also need someone to do it, as it would take me weeks to figure out. 😝
Yeah, if I'm following correctly, these would have been used for the D3->D4 step in @antonfirsov's proposal in #1121, which is no longer a thing because #1411 added a planar YCbCr float -> chunky RGBA float conversion, right? The most optimized path longer term will be to keep everything in fixed-point math all the way through the JPEG decoder as libjpeg-turbo does, which eliminates a couple of the existing conversion steps. Porting that will be a big effort, but I hope to be able to work on that next year.
@saucecontrol the full integer pipeline is the next big goal, but it's a much bigger step than #1411. The idea is that with #1411 and this PR we can do the packing faster (pack 4x more pixels in one AVX2 batch), and it's not a big refactor to move there from our current state. So it's still a thing, unless you say we shouldn't expect a significant difference compared to point 2 in #1242 (comment).
Gotcha. I think it comes down to a question of how much separation (or reusability) you want between the pipeline steps, or how much symmetry you want with your non-HWIntrinsics fallback pipeline steps. The fastest implementation would undoubtedly be a one-shot planar YCbCr float -> chunky RGB24, since that eliminates at least one round-trip through memory, if not two. The narrowing/saturating and unpacking/interleaving could easily be bolted on to the conversion from #1411 to do it all in one go, all in XMM/YMM registers. The other question would be how you want to handle moving from a decode to a resize stage. Convolution with 3 channels is more tricky than with 4, and if you're going to convert RGB24 back to RGBA128 float for that step, you're doing a lot of unnecessary shuffling to eliminate and then restore that 4th channel. I do have an RGB96 float convolution you can steal if you want to go that route.
I think we'll leave resize as-is for now. We can optimize there by removing batch premultiplication and its reverse from the operation, since we will have knowledge of the pixel format's alpha behavior. That, in combination with reducing the amount of work that takes place during the final stages of jpeg decoding, should mean a substantial improvement. I don't think there has to be that much symmetry. Our existing non-HWIntrinsics paths can remain as-is, and we can use preprocessor directives to establish a more optimized pipeline contract if we need to.
Had a quick look, after an introductory training in AVX2 shuffles. There is a lot of work left here:
Depending on how much we want to rush 1.1, we may decide to change the color converters to pack into
#if SUPPORTS_RUNTIME_INTRINSICS
HwIntrinsics.PackBytesToUInt32SaturateChannel4Reduce(ref channel0, ref channel1, ref channel2, ref dest);

// I can't immediately see a way to do this operation efficiently with Vector<T> or Vector4<T>. TODO
There is none :)
@saucecontrol do you think the following may perform better than the code in the PR?
Still has 4 permutes. EDIT: Realized I can use
Yeah, that'll be faster. The permutes are not as big a deal as long as there is no dependency chain between them. Cross-lane permutes have a latency of 3, meaning it takes 3 cycles from when the instruction starts to when the result is ready. But they have a reciprocal throughput of 1, meaning it's a pipelined instruction that can start 1 and complete 1 per cycle. If you issue three in a row, then on the cycle after the third is started, the first is complete -- as long as they don't depend on each other's results. The PR code will actually stall on the
I'd actually write this one with a permute round to start, in order to minimize the stores. Something like this
I highly recommend chapter 9 in Agner's optimization guide for background on how latencies, throughputs, instruction units, and ports all interact. It should give you a better idea of what's fast without necessarily having to code it up. You can check instruction timings at https://uops.info/table.html. Hint:
WIP. A few initial questions raised:
Why NoInlining? They are often going to be hot code.