Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InlineIfLambda doesn't seem to apply when using |> #12388

Closed
mrange opened this issue Nov 13, 2021 · 11 comments
Closed

InlineIfLambda doesn't seem to apply when using |> #12388

mrange opened this issue Nov 13, 2021 · 11 comments

Comments

@mrange
Copy link
Contributor

mrange commented Nov 13, 2021

Repro steps

I am experimenting with InlineIfLambda and tried to implement a simple push stream library to see if I get some performance gains. I do see some pretty exciting ones from using InlineIfLambda but when I use the |> forward pipeline it doesn't seem inlining applies.

I have written a simple push stream library with some tests to illustrate the issue I see. Make sure to run it in Release mode without debugger applied.

open System
open System.Diagnostics

type 'T PushStream = ('T->bool)->bool

module PushStream =
  let inline ofRange b e ([<InlineIfLambda>] r) =
      let mutable i = b
      while i <= e && r i do
        i <- i + 1
      i >= e
    
  let inline filter ([<InlineIfLambda>] f) ([<InlineIfLambda>] ps) ([<InlineIfLambda>] r) =
    ps (fun v -> if f v then r v else true)

  let inline map ([<InlineIfLambda>] f) ([<InlineIfLambda>] ps) ([<InlineIfLambda>] r) =
    ps (fun v -> r (f v))

  let inline fold ([<InlineIfLambda>] f) z ([<InlineIfLambda>] ps) =
    let mutable s = z
    ps (fun v -> s <- f s v; true) |> ignore
    s

// Simplistic performance measurement
let now =
  let sw = Stopwatch ()
  sw.Start ()
  fun () -> sw.ElapsedMilliseconds

let time n a =
  let inline cc i = GC.CollectionCount i

  GC.Collect (2, GCCollectionMode.Forced, true) |> ignore
  GC.WaitForFullGCComplete () |> ignore

  let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
  let before            = now ()
  for i = 1 to n do
    a () |> ignore
  let after             = now ()
  let acc0, acc1, acc2  = cc 0, cc 1, cc 2

  let cc      = acc0 - bcc0, acc1 - bcc1, acc2 - bcc2
  let elapsed = after - before
  elapsed, cc

open PushStream


// The baseline, the pipeline implemented as plain old loop
let baseLine () =
  let mutable s = 0L
  for i = 0 to 10000 do
    let i = i + 1
    if (i &&& 1) = 0 then
      s <- s + int64 i
  s

// How I would like the code to work
let pipeLine () =
  ofRange 0 10000
  |> map ((+) 1)
  |> filter (fun v -> (v &&& 1) = 0)
  |> map int64
  |> fold (+) 0L

// Trying to improve performance by avoid use of |>
let implicit () =
  fold (+) 0L (map int64 (filter (fun v -> (v &&& 1) = 0) (map ((+) 1) (ofRange 0 10000))))

// Trying to improve performance by writing explicit lambdas
let explicit () =
  fold (+) 0L (fun r -> map int64 (fun r -> filter (fun v -> (v &&& 1) = 0) (fun r -> map ((+) 1) (fun r -> ofRange 0 10000 r) r) r) r)

let testCases = 
  [|
    "baseLine"  , baseLine
    "pipeLine"  , pipeLine
    "implicit"  , implicit
    "explicit"  , explicit
  |]

for nm, tc in testCases do
  let r = tc ()
  printfn "Result of %s : %d" nm r

for nm, tc in testCases do
  printfn "Warming up: %s" nm
  for i = 0 to 1000 do
    tc () |> ignore

for nm, tc in testCases do
  printfn "Running: %s ..." nm
  let elapsed, cc = time 20000 tc
  printfn "  ... took %d ms with cc: %A" elapsed cc

The numbers I see

Running: baseLine ...
  ... took 137 ms with cc: (0, 0, 0)
Running: pipeLine ...
  ... took 801 ms with cc: (1, 0, 0)
Running: implicit ...
  ... took 185 ms with cc: (0, 0, 0)
Running: explicit ...
  ... took 172 ms with cc: (0, 0, 0) 

First it's pretty exciting to see that functional pipeline can come close to matching the baseline but I would like the pipeline code to get the same gains as the explicit case.

Expected behavior

The testCases pipeLine, explicit and implicit all have the same performance which is roughly close to the baseline.

Actual behavior

Testcase pipeLine performs worse than explicit and implicit as it seems the inlining is not applied when using |>

Known workarounds

Don't use |> when inlining lambdas has significant performance benefits.

Related information

  • Operating system: Windows 10
  • .NET Runtime kind: .NET 6.0.100
@mrange mrange added the Bug label Nov 13, 2021
@mrange
Copy link
Contributor Author

mrange commented Nov 13, 2021

This might be more of a feature request than bug request when I think about it.

@mrange mrange changed the title InlineIfLambda doesn't apply when using |> InlineIfLambda doesn't seem to apply when using |> Nov 13, 2021
@En3Tho
Copy link
Contributor

En3Tho commented Nov 13, 2021

I think it's better to make a benchmark using Benchmark.Net to better measure perf and allocations (which might come from non-inlined code, eg closures).

Judging from your numbers in case of pipe compiler can't really optimize the code so real currying takes action (calling nested closures one by one) resulting in a substantially worse perf.

@mrange
Copy link
Contributor Author

mrange commented Nov 14, 2021

Benchmark.NET numbers:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1348 (21H2)
Intel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.100
  [Host]     : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT DEBUG
  DefaultJob : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT


|   Method |       Mean |     Error |    StdDev |
|--------- |-----------:|----------:|----------:|
| Baseline |   6.821 us | 0.0540 us | 0.0505 us |
|     Linq | 149.053 us | 0.2072 us | 0.1837 us |
| PipeLine |  34.643 us | 0.0765 us | 0.0716 us |
| Implicit |   9.110 us | 0.0487 us | 0.0456 us |
| Explicit |   8.298 us | 0.0136 us | 0.0121 us |

@mrange
Copy link
Contributor Author

mrange commented Nov 14, 2021

Decompiled into C# (no decompile to F# yet in dnspy)

	public long Baseline()
	{
		long s = 0L;
		for (int i = 0; i < 10001; i++)
		{
			int j = i + 1;
			if ((j & 1) == 0)
			{
				s += (long)j;
			}
		}
		return s;
	}

	public long Linq()
	{
		return Enumerable.Range(0, 10001).Select(new Func<int, int>(Program.Linq@40.Invoke)).Where(new Func<int, bool>(Program.Linq@40-1.Invoke)).Select(new Func<int, long>(Program.Linq@40-2.Invoke)).Sum();
	}

	public long PipeLine()
	{
		FSharpFunc<FSharpFunc<int, bool>, bool> _instance = Program.PipeLine@45.@_instance;
		FSharpFunc<FSharpFunc<int, bool>, bool> arg = new Program.PipeLine@46-2(@_instance);
		FSharpFunc<FSharpFunc<long, bool>, bool> fsharpFunc = new Program.PipeLine@47-4(arg);
		FSharpRef<long> fsharpRef = new FSharpRef<long>(0L);
		bool flag = fsharpFunc.Invoke(new Program.PipeLine@48-6(fsharpRef));
		return fsharpRef.contents;
	}

	public long Implicit()
	{
		long num = 0L;
		int num2 = 0;
		for (;;)
		{
			bool flag;
			if (num2 <= 10000)
			{
				int num3 = num2;
				int num4 = 1 + num3;
				if ((num4 & 1) == 0)
				{
					long num5 = (long)num4;
					num += num5;
					flag = true;
				}
				else
				{
					flag = true;
				}
			}
			else
			{
				flag = false;
			}
			if (!flag)
			{
				break;
			}
			num2++;
		}
		bool flag2 = num2 > 10000;
		return num;
	}

	public long Explicit()
	{
		FSharpRef<long> fsharpRef = new FSharpRef<long>(0L);
		int num = 0;
		for (;;)
		{
			bool flag;
			if (num <= 10000)
			{
				int num2 = num;
				flag = Program.r@56(fsharpRef, 1 + num2);
			}
			else
			{
				flag = false;
			}
			if (!flag)
			{
				break;
			}
			num++;
		}
		bool flag2 = num > 10000;
		return fsharpRef.contents;
	}
	
        internal static bool r@56(FSharpRef<long> s, int v)
	{
		if ((v & 1) == 0)
		{
			long num = (long)v;
			s.contents += num;
			return true;
		}
		return true;
	}

From this I would not suspect Implicit and Explicit to perform differently but at least on my machine there seems to be a diff.

@En3Tho
Copy link
Contributor

En3Tho commented Nov 14, 2021

Can you please add MemoryDiagnoser attribute to your benchmark class?

I think the compiler could benefit from "un-currying" and "lambdafying" to perform better in pipeline cases.

Also, I think if you stick to "explicit pipeline" style, your perf will be more or less the same with "explicit" one, eg map (fun x -> x + 1) etc

@mrange
Copy link
Contributor Author

mrange commented Nov 14, 2021

I added MemoryAnalyzer and this is what I got

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1348 (21H2)
Intel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.100
  [Host]    : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT DEBUG
  RyuJitX64 : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT

Job=RyuJitX64  Jit=RyuJit  Platform=X64

|   Method |       Mean |     Error |    StdDev | Allocated |
|--------- |-----------:|----------:|----------:|----------:|
| Baseline |   6.866 us | 0.0690 us | 0.0646 us |         - |
|     Linq | 147.643 us | 0.4201 us | 0.3929 us |     400 B |
| PipeLine |  34.323 us | 0.1181 us | 0.1105 us |     168 B |
| Implicit |   9.064 us | 0.0410 us | 0.0383 us |         - |
| Explicit |   8.260 us | 0.0234 us | 0.0219 us |      24 B |

@mrange
Copy link
Contributor Author

mrange commented Nov 14, 2021

The code I use with Benchmark.NET

open System
open System.Linq

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module PushStream =
  let inline ofRange b e ([<InlineIfLambda>] r) =
      let mutable i = b
      while i <= e && r i do
        i <- i + 1
      i > e
    
  let inline filter ([<InlineIfLambda>] f) ([<InlineIfLambda>] ps) ([<InlineIfLambda>] r) =
    ps (fun v -> if f v then r v else true)

  let inline map ([<InlineIfLambda>] f) ([<InlineIfLambda>] ps) ([<InlineIfLambda>] r) =
    ps (fun v -> r (f v))

  let inline fold ([<InlineIfLambda>] f) z ([<InlineIfLambda>] ps) =
    let mutable s = z
    ps (fun v -> s <- f s v; true) |> ignore
    s

open PushStream

[<MemoryDiagnoser>]
[<RyuJitX64Job>]
type FsInlineBenchmark() =
  class
    [<Benchmark>]
    member x.Baseline() =
      let mutable s = 0L
      for i = 0 to 10000 do
        let i = i + 1
        if (i &&& 1) = 0 then
          s <- s + int64 i
      s

    [<Benchmark>]
    member x.Linq() =
      Enumerable.Range(0,10001).Select((+) 1).Where(fun v -> (v &&& 1) = 0).Select(int64).Sum()

    [<Benchmark>]
    member x.PipeLine () =
      ofRange 0 10000
      |> map ((+) 1)
      |> filter (fun v -> (v &&& 1) = 0)
      |> map int64
      |> fold (+) 0L

    [<Benchmark>]
    member x.Implicit () =
      fold (+) 0L (map int64 (filter (fun v -> (v &&& 1) = 0) (map ((+) 1) (ofRange 0 10000))))

    [<Benchmark>]
    member x.Explicit () =
      fold (+) 0L (fun r -> map int64 (fun r -> filter (fun v -> (v &&& 1) = 0) (fun r -> map ((+) 1) (fun r -> ofRange 0 10000 r) r) r) r)

  end
BenchmarkRunner.Run<FsInlineBenchmark>() |> ignore

@mrange
Copy link
Contributor Author

mrange commented Nov 14, 2021

Found a potential work-around

let inline (|>>) ([<InlineIfLambda>] v : _ -> _) ([<InlineIfLambda>] f : _ -> _) = f v

let pipeLineVariant () =
  ofRange 0 10000
  |>> map ((+) 1)
  |>> filter (fun v -> (v &&& 1) = 0)
  |>> map int64
  |>> fold (+) 0L
let v = pipeLineVariant ()

This seems to generate code that is much more optimal than if I use |>. I will update the benchmark with this later.

@mrange
Copy link
Contributor Author

mrange commented Nov 14, 2021

PipeLineVariant is the updated pipeline example using |>> above

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1348 (21H2)
Intel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.100
  [Host]    : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT DEBUG
  RyuJitX64 : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT

Job=RyuJitX64  Jit=RyuJit  Platform=X64

|          Method |       Mean |     Error |    StdDev | Allocated |
|---------------- |-----------:|----------:|----------:|----------:|
|        Baseline |   6.906 us | 0.0613 us | 0.0573 us |         - |
|            Linq | 149.370 us | 0.2530 us | 0.2113 us |     400 B |
|        PipeLine |  34.717 us | 0.1054 us | 0.0986 us |     168 B |
| PipeLineVariant |   9.149 us | 0.0313 us | 0.0293 us |         - |
|        Implicit |   9.144 us | 0.0283 us | 0.0264 us |         - |
|        Explicit |   8.341 us | 0.0149 us | 0.0132 us |      24 B |

@En3Tho
Copy link
Contributor

En3Tho commented Feb 5, 2022

Related: fsharp/fslang-suggestions#1105

@dsyme
Copy link
Contributor

dsyme commented Feb 7, 2022

This might be more of a feature request than bug request when I think about it.

Yes, please make this a language suggestion or feature request. It is by-design as things stand

@dsyme dsyme closed this as completed Feb 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants