Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression on .NET 9, compared to .NET 8 #108096

Closed
KeterSCP opened this issue Sep 21, 2024 · 17 comments · Fixed by #112046
Closed

Performance regression on .NET 9, compared to .NET 8 #108096

KeterSCP opened this issue Sep 21, 2024 · 17 comments · Fixed by #112046
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Milestone

Comments

@KeterSCP
Copy link

Description

After migrating a genetic algorithm project to .NET 9, I noticed that there was a performance regression, and I was able to track it down to a single method that performs inverse mutation of the given chromosome (in my case it is just an array of ints). I tried many different seeds, and the result is the same: about 10-15% regression. I also set <CETCompat>false</CETCompat> in the .csproj, but it did not help.

The problem is that this method is on a hot path, and eventually this adds up to milliseconds of difference between .NET 8 and .NET 9.

Benchmark code:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Benchmarks>();

[MemoryDiagnoser, DisassemblyDiagnoser, SimpleJob(RuntimeMoniker.Net80), SimpleJob(RuntimeMoniker.Net90)]
public class Benchmarks
{
    private const int Seed = 100;

    private static readonly int[] Chromosome = new int[59];
    private static readonly Random Random = new(Seed);

    public Benchmarks()
    {
        var genes = Enumerable.Range(1, 59).ToArray();
        Random.Shuffle(genes);
        genes.CopyTo(Chromosome.AsSpan());
    }

    [Benchmark]
    public int[] InversionMutation()
    {
        var idx1 = 0;
        var idx2 = 0;

        var chromosomeLength = Chromosome.Length;

        while (idx1 == idx2)
        {
            idx1 = Random.Next(chromosomeLength);
            idx2 = Random.Next(chromosomeLength);
        }

        if (idx1 > idx2)
        {
            (idx1, idx2) = (idx2, idx1);
        }

        Array.Reverse(Chromosome, idx1, idx2 - idx1 + 1);

        return Chromosome;
    }
}

Regression?

Yes

Data


BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-rc.2.24468.2
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.46307), X64 RyuJIT AVX2


Method Job Runtime Mean Error StdDev Code Size Allocated
InversionMutation .NET 8.0 .NET 8.0 27.10 ns 0.020 ns 0.018 ns 794 B -
InversionMutation .NET 9.0 .NET 9.0 31.14 ns 0.039 ns 0.033 ns 914 B -
@KeterSCP KeterSCP added the tenet-performance Performance related issue label Sep 21, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 21, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 21, 2024
@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 21, 2024
@AndyAyersMS
Copy link
Member

I can repro this ...

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-rc.1.24452.12
[Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Job-ATRFFH : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
Job-OSNKXL : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2

EvaluateOverhead=False OutlierMode=DontRemove InvocationCount=16777216
IterationCount=15 UnrollFactor=16 WarmupCount=1

Method Runtime Mean Error StdDev Ratio
InversionMutation .NET 8.0 37.76 ns 0.224 ns 0.209 ns 1.00
InversionMutation .NET 9.0 41.48 ns 0.199 ns 0.186 ns 1.10

Looks like the issue is in the benchmark method.

.NET 8

06.28%   6.08E+06    ?        Unknown
62.50%   6.049E+07   Tier-1   [r108096]Benchmarks.InversionMutation()
28.75%   2.782E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.Reverse(int32&,unsigned int)
01.90%   1.84E+06    Tier-1   [de65ffca-e259-4478-ba18-546e4ea8ee1d]Runnable_0.WorkloadActionUnroll(int64)

.NET 9

05.39%   5.78E+06    ?        Unknown
65.86%   7.062E+07   Tier-1   [r108096]Benchmarks.InversionMutation()
25.60%   2.745E+07   Tier-1   [System.Private.CoreLib]SpanHelpers.Reverse(int32&,unsigned int)
01.47%   1.58E+06    FullOpt  [a32d8f19-cdba-4847-977e-9727065bc356]Runnable_0.WorkloadActionUnroll(int64)

@filipnavara
Copy link
Member

I cannot resist being off-topic... you can always remove the Random.Next loop and just do:

idx1 = Random.Next(chromosomeLength);
idx2 = (idx1 + Random.Next(chromosomeLength - 1) + 1) % chromosomeLength;

@KeterSCP
Copy link
Author

Thanks @filipnavara! I will definitely use that in my code.

It seems like the problem is connected with seeded random. Using Random.Shared shows no regression.

Benchmark code
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Benchmarks>();

[MemoryDiagnoser, DisassemblyDiagnoser, SimpleJob(RuntimeMoniker.Net80), SimpleJob(RuntimeMoniker.Net90)]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByJob)]
public class Benchmarks
{
    private const int Seed = 100;

    private static readonly int[] Chromosome = new int[59];
    private static readonly Random Random = new(Seed);

    public Benchmarks()
    {
        var genes = Enumerable.Range(1, 59).ToArray();
        Random.Shuffle(genes);
        genes.CopyTo(Chromosome.AsSpan());
    }

    [Benchmark]
    public int[] InversionMutationSharedRandom()
    {
        var chromosomeLength = Chromosome.Length;

        var idx1 = Random.Shared.Next(chromosomeLength);
        var idx2 = (idx1 + Random.Shared.Next(chromosomeLength - 1) + 1) % chromosomeLength;

        if (idx1 > idx2)
        {
            (idx1, idx2) = (idx2, idx1);
        }

        Array.Reverse(Chromosome, idx1, idx2 - idx1 + 1);

        return Chromosome;
    }

    [Benchmark]
    public int[] InversionMutationSeededRandom()
    {
        var chromosomeLength = Chromosome.Length;

        var idx1 = Random.Next(chromosomeLength);
        var idx2 = (idx1 + Random.Next(chromosomeLength - 1) + 1) % chromosomeLength;

        if (idx1 > idx2)
        {
            (idx1, idx2) = (idx2, idx1);
        }

        Array.Reverse(Chromosome, idx1, idx2 - idx1 + 1);

        return Chromosome;
    }
}

Results:


BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-rc.2.24468.2
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.46307), X64 RyuJIT AVX2


Method Job Runtime Mean Error StdDev Code Size Allocated
InversionMutationSharedRandom .NET 8.0 .NET 8.0 22.95 ns 0.051 ns 0.048 ns 892 B -
InversionMutationSeededRandom .NET 8.0 .NET 8.0 26.28 ns 0.053 ns 0.050 ns 757 B -
InversionMutationSharedRandom .NET 9.0 .NET 9.0 22.59 ns 0.094 ns 0.088 ns 892 B -
InversionMutationSeededRandom .NET 9.0 .NET 9.0 29.60 ns 0.062 ns 0.058 ns 858 B -

@agocke
Copy link
Member

agocke commented Sep 21, 2024

cc @stephentoub I seem to remember that the implementation of seeded random has changed recently

@stephentoub
Copy link
Member

cc @stephentoub I seem to remember that the implementation of seeded random has changed recently

Not the APIs used here.

@AndyAyersMS
Copy link
Member

Codegen diff

;; net 8

       vxorps   xmm0, xmm0, xmm0
       vcvtsi2sd xmm0, xmm0, eax
       vmulsd   xmm0, xmm0, qword ptr [reloc @RWD00]
       vmulsd   xmm0, xmm0, qword ptr [reloc @RWD08]
       vcvttsd2si  r14d, xmm0
 
;; net 9

       vxorps   xmm0, xmm0, xmm0
       vcvtsi2sd xmm0, xmm0, eax
       vmulsd   xmm0, xmm0, qword ptr [reloc @RWD00]
       vmulsd   xmm0, xmm0, qword ptr [reloc @RWD08]
       vmovddup xmm1, xmm0
       vmovddup xmm2, xmm0
       vmovddup xmm0, xmm0
       vcmppd   xmm1, xmm2, xmm1, 0
       vandpd   xmm0, xmm1, xmm0
       vcmppd   xmm1, xmm0, xmmword ptr [reloc @RWD16], 13
       vcvttsd2si ecx, xmm0
       vmovd    xmm0, ecx
       vpbroadcastd xmm0, xmm0
       vpblendvb xmm0, xmm0, xmmword ptr [reloc @RWD32], xmm1
       vmovd    r14d, xmm0

@AndyAyersMS
Copy link
Member

Looks like this is a consequence of #97529, which standardized the behavior of some floating to integer conversions. We expected some regressions in performance.

The legacy RNG involves floating to integer conversion (even for Next(int)) and so is impacted by this:

internal double Sample() =>
// Including the division at the end gives us significantly improved random number distribution.
InternalSample() * (1.0 / int.MaxValue);

The vmulsd above are carrying out the division / multiplication here... the code that follows is the double to int conversion.

We might be able to use the faster "legacy" conversion here since the values from InternalSample() are constrained to fall in [0, 1).

@AndyAyersMS
Copy link
Member

FYI @tannergooding

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Sep 22, 2024
@skyoxZ
Copy link
Contributor

skyoxZ commented Sep 22, 2024

This would be less affected by Random.Next:

while (idx1 == idx2)
{
    idx1 = Random.Next(chromosomeLength * chromosomeLength);
    idx2 = idx1 % chromosomeLength;
    idx1 /= chromosomeLength;
}

@KeterSCP
Copy link
Author

Hi @skyoxZ! While I agree that reducing number of calls to random.Next would reduce the regression (which is obvious), I cannot use that example as it would introduce strong correlation between mutation points (for different random seeds, idx2 will always be exactly related to idx1), which is not acceptable in terms of genetic algorithm.

@skyoxZ
Copy link
Contributor

skyoxZ commented Sep 23, 2024

Hi @skyoxZ! While I agree that reducing number of calls to random.Next would reduce the regression (which is obvious), I cannot use that example as it would introduce strong correlation between mutation points (for different random seeds, idx2 will always be exactly related to idx1), which is not acceptable in terms of genetic algorithm.

idx1 and idx2 are not related. Consider when chromosomeLength = 10, the random number is between 00 to 99. Then idx1 always takes the ten's digit and idx2 always takes the unit's digit.

@KeterSCP
Copy link
Author

KeterSCP commented Sep 23, 2024

@skyoxZ as I mentioned above, the example introduces a fixed relationship:

Since idx1 and idx2 are both derived from the same random number, they are not independent. For a given random number, idx1 and idx2 will always have the same corresponding values. If the random number is 34, idx1 = 3 and idx2 = 4, and this will always hold true.

No matter how many times (or with what random seed) you generate the number 34, idx1 will always be 3 and idx2 will always be 4. The relationship is fixed, meaning idx1 and idx2 are dependent.

@skyoxZ
Copy link
Contributor

skyoxZ commented Sep 23, 2024

@KeterSCP please think again. My random number contains two independent digits.

@KeterSCP
Copy link
Author

@skyoxZ While it may seem like the two digits are independent because they come from different parts of the random number, they are not independent in a probabilistic sense:

The two digits are derived from a single random number, not two separate random events. Both idx1 and idx2 come from the same number generated by Random.Next(), which ties them together. The process of splitting a single number into two parts does not introduce true independence between those parts. True probabilistic independence requires separate random events.

@skyoxZ
Copy link
Contributor

skyoxZ commented Sep 23, 2024

@KeterSCP I believe my algorithm is ok. It's just like: there are 2 dice, one red and one blue. You throw them one by one while I throw them both at the same time.

And you may be interested in the source code of Random.Shared.NextBytes.

@KeterSCP
Copy link
Author

@skyoxZ thanks for the willingness to help, but I cannot accept that solution due to the aforementioned correlation between these numbers. Let's not continue the off-top

@tannergooding
Copy link
Member

We might be able to use the faster "legacy" conversion here since the values from InternalSample() are constrained to fall in [0, 1).

The consideration isn't in that sample is [0, 1) but rather what the result of (_parent.Sample() * maxValue) is, which is meant to place it in the range of [0, maxValue). For int it would be safe to always use the older algorithm (call double.ConvertToIntegerNative), since we're operating on double and all int are less than 2^52.

I'll get a fix up for .NET 10. However, barring some real world samples showing the impact to a hot path, I don't think it meets the bar for backport. This is because the regression is 4ns and even in extreme cases of generating 250 million random numbers (which would add up to a 1s regression), it is likely going to be lost to the general noise of the rest of the system (the overhead of the for loop, the branch predictor, the latency of accessing the memory which is often going to be 10ns or more even on the best machines, cache hits/misses, etc). -- CC. @jeffhandley

@tannergooding tannergooding added this to the 10.0.0 milestone Sep 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants