perf: try to avoid buffer allocations #1998

ronag · 2022-01-04T16:25:20Z

No description provided.

lib/sender.js

ronag · 2022-01-05T10:56:01Z

const net = require('net')

const server = net.createServer((socket) => {
  socket.resume()
  socket.on('error', () => {})
}).listen(0)

const socket = net.connect(server.address().port, '0.0.0.0', () => {
  let str = '1'.repeat(4096)
  for (let n = 0; n < 1024*128; ++n) {
    // socket.write(str)
    socket.write(toBuffer(str))
  }
  socket.end()
  socket.on('finish', () => {
    socket.on('error', () => {})
    server.close()
  })
})

function toBuffer(data) {
  toBuffer.readOnly = true;

  if (Buffer.isBuffer(data)) return data;

  let buf;

  if (data instanceof ArrayBuffer) {
    buf = Buffer.from(data);
  } else if (ArrayBuffer.isView(data)) {
    buf = Buffer.from(data.buffer, data.byteOffset, data.byteLength);
  } else {
    buf = Buffer.from(data);
    toBuffer.readOnly = false;
  }

  return buf;
}

ws$ time node bench.js 

real    0m0.324s
user    0m0.144s
sys     0m0.204s
ws$ time node bench.js 

real    0m0.857s
user    0m0.687s
sys     0m0.235s

It's more than 2x faster without toBuffer.

lib/sender.js

lpinca · 2022-01-05T14:15:08Z

lib/sender.js

+    const buf =
+      perMessageDeflate || typeof data !== 'string' ? toBuffer(data) : data;


The buf.length >= perMessageDeflate._threshold check below might not work as expected when data is a string because it does not use the string bytes length.

Also, when data is a string and permessage-deflate is enabled, toBuffer() should ideally be called only when the message will actually be compressed (this._compress === true).

It would be better to make this permessage-deflate agnostic. Removing the perMessageDeflate check here and passing the string to perMessageDeflate.compress() should work. The problem is that enqueue() and dequeue() currently use the buffer length to update the _bufferedBytes value.

That branch is only used if perMessageDeflate is true; in which case data is converted to a buffer.

While I agree that we could improve it, I don't think there is an issue here currently. Only that it's a little hard to read.

Correct. Anyway atm this improvement only works when permessage-deflate is disabled. What about websocket.ping() and websocket.pong()?

What about websocket.ping() and websocket.pong()?

Don't see them as a hot path.

lpinca · 2022-01-05T14:45:54Z

I think the benchmark is a bit fabricated. It writes 512 MiB synchronously by calling socket.write() 1024 * 128 times. There are 1024 * 128 allocations when using the toBuffer() variant but a lot less when using the string variant because there is only one allocation for the data buffered in the socket writable queue. Also in the string variant the string is always the same. There is basically no memory usage.

I don't think this is realistic.

ronag · 2022-01-05T14:49:46Z

I think the benchmark is a bit fabricated. It writes 512 MiB synchronously by calling socket.write() 1024 * 128 times. There are 1024 * 128 allocations when using the toBuffer() variant but a lot less when using the string variant because there is only one allocation for the data buffered in the socket writable queue. Also in the string variant the string is always the same. There is basically no memory usage.

I mean the point here is all of those buffer allocations are not necessary. The strings are already allocated. This just demonstrates the overhead of creating all of those buffers that you would otherwise not create. Which is basically what happens in production when you are sending a large amount of strings to ws.

i.e. what this PR is trying to do is to avoid some of the overhead of going string -> Buffer. Hence the benchmark should only show the difference between doing the conversion in js before writing to the socket instance or inside the native writeGeneric.

I don't think this is realistic.

While it could be better maybe I don't see why it wouldn't be realistic. It's basically what's happening in my case, although not with quite such large values as 512 MiB.

ronag · 2022-01-05T14:56:57Z

Even with smaller values e.g. 4 MiB there is improvement:

ws$ time node bench

real    0m0.058s
user    0m0.043s
sys     0m0.013s
ws$ time node bench

real    0m0.041s
user    0m0.031s
sys     0m0.009s
ws$

ronag · 2022-01-05T15:03:49Z

Added a cronometro benchmark

╔══════════════╤═════════╤═══════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │        Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼───────────────┼───────────┼─────────────────────────╢
║ toBuffer     │     100 │ 141.63 op/sec │  ± 1.44 % │                         ║
╟──────────────┼─────────┼───────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │        Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼───────────────┼───────────┼─────────────────────────╢
║ string       │     100 │ 495.38 op/sec │  ± 2.48 % │ + 249.78 %              ║
╚══════════════╧═════════╧═══════════════╧═══════════╧═════════════════════════╝

ronag · 2022-01-05T15:06:22Z

Without pre-allocating the string:

╔══════════════╤═════════╤══════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ toBuffer     │      99 │ 49.33 op/sec │  ± 0.39 % │                         ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ string       │     100 │ 71.23 op/sec │  ± 1.36 % │ + 44.40 %               ║
╚══════════════╧═════════╧══════════════╧═══════════╧═════════════════════════╝

lpinca · 2022-01-05T15:13:56Z

╔══════════════╤═════════╤══════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ toBuffer     │     100 │ 41.60 op/sec │  ± 1.00 % │                         ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │       Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼──────────────┼───────────┼─────────────────────────╢
║ string       │     100 │ 42.91 op/sec │  ± 1.66 % │ + 3.15 %                ║
╚══════════════╧═════════╧══════════════╧═══════════╧═════════════════════════╝

with Buffer.from() instead of toBuffer(), that is removing the Buffer, ArrayBuffer, DataView checks from it.

Hence the benchmark should only show the difference between doing the conversion in "user space" or inside the native writeGeneric

The problem with this is that the difference is only the result of the huge amount of buffered data in the socket writable queue. The difference should not be noticeable if data is regularly flowing, because the number of allocations would basically be 1:1.

ronag · 2022-01-05T15:20:34Z

with Buffer.from() instead of toBuffer(), that is removing the Buffer, ArrayBuffer, DataView checks from it.

Sorry. I don't understand? We're using toBuffer just as it would with ws.

The problem with this is that the difference is only the result of the huge amount of buffered data in the socket writable queue. The difference should not be noticeable if data is regularly flowing, because the number of allocations would basically be 1:1.

This is with 4MiB of back pressure, which is not unrealistic.

ronag · 2022-01-05T15:22:12Z

I don't quite see why you think the benchmark is not relevant. According to you, how would a "correct" benchmark look?

lpinca · 2022-01-05T15:22:35Z

Sorry. I don't understand? We're using toBuffer just as it would with ws. We are not using Buffer.from

Yes but we could optimize it for the string case instead of dealing with the string case only at the end.

ronag · 2022-01-05T15:24:18Z

Ok updated benchmark:

Backpressure down to 16 * 4096
Added toBuffer2 which check for string earlier

╔══════════════╤═════════╤════════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ toBuffer2    │    1000 │ 1522.18 op/sec │  ± 1.60 % │                         ║
║ toBuffer     │     999 │ 1735.66 op/sec │  ± 0.89 % │ + 14.02 %               ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ string       │     999 │ 2606.76 op/sec │  ± 0.72 % │ + 71.25 %               ║
╚══════════════╧═════════╧════════════════╧═══════════╧═════════════════════════╝

lpinca · 2022-01-05T15:26:42Z

It's 15% on my machine but only 3% with Buffer.from(), that 12% only for the toBuffer() wrapper is odd.

ronag · 2022-01-05T15:31:39Z

On an EPYC server + Node v17.3.0:

╔══════════════╤═════════╤════════════════╤═══════════╤═════════════════════════╗
║ Slower tests │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ toBuffer     │    1000 │ 1920.40 op/sec │  ± 1.36 % │                         ║
║ bufferFrom   │     999 │ 2024.28 op/sec │  ± 0.93 % │ + 5.41 %                ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ Fastest test │ Samples │         Result │ Tolerance │ Difference with slowest ║
╟──────────────┼─────────┼────────────────┼───────────┼─────────────────────────╢
║ string       │    1000 │ 3084.64 op/sec │  ± 1.56 % │ + 60.63 %               ║
╚══════════════╧═════════╧════════════════╧═══════════╧═════════════════════════╝

lpinca · 2022-01-05T15:52:05Z

How to make cronometro wait for the socket to be connected?

ronag · 2022-01-05T16:58:34Z

How to make cronometro wait for the socket to be connected?

cronometro does a warmup run so waiting for the socket doesn't really make any difference.

lpinca · 2022-01-06T07:24:01Z

Ok, always sending the same string does not make sense and invalidates the test.

ronag · 2022-01-06T07:24:33Z

Ok, always sending the same string does not make sense and invalidates the test.

Added another one. See above.

lpinca · 2022-01-06T07:25:46Z

That's a ~5% improvement.

ronag · 2022-01-06T07:26:02Z

and reduced gc pressure

ronag · 2022-01-06T07:26:46Z

btw, I kind of did manage to resolve that with the current version by setting Buffer.poolSize = 128 * 1024.

lpinca · 2022-01-06T07:33:20Z

Yes, it's a small improvement that has some readability and maintenance drawbacks. We can move this forward but I want to make it general and handle websocket.ping(), websocket.pong(), and permessage-deflate. It shouldn't be hard, we only have to precompute the message length in bytes and pass it along to sender.enqueue(), sender.dequeue(), and Sender.frame() with a dedicated option.

Feel free to reopen and do that. I might open a new PR for this later.

lpinca · 2022-01-06T07:39:10Z

Anyway, thank you for bearing with me.

ronag · 2022-01-06T12:31:46Z

Something like this?

ronag · 2022-01-06T13:26:12Z

Tried cleaning it up a little more with proper commit msgs. A little exhausted on this one ATM. Hope this can be included in some form. Feel free to take over.

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

lpinca reviewed Jan 4, 2022

View reviewed changes

lib/sender.js Outdated Show resolved Hide resolved

lpinca reviewed Jan 4, 2022

View reviewed changes

lib/sender.js Outdated Show resolved Hide resolved

lpinca reviewed Jan 5, 2022

View reviewed changes

lib/sender.js Show resolved Hide resolved

lpinca reviewed Jan 5, 2022

View reviewed changes

lib/sender.js Show resolved Hide resolved

ronag force-pushed the avoid-buf-allocations branch 7 times, most recently from 8875b42 to b163d6c Compare January 5, 2022 13:00

lpinca reviewed Jan 5, 2022

View reviewed changes

ronag force-pushed the avoid-buf-allocations branch from 34cc4e0 to 8ad1e6d Compare January 5, 2022 15:04

ronag force-pushed the avoid-buf-allocations branch from 0555b10 to 16a7034 Compare January 5, 2022 15:25

ronag force-pushed the avoid-buf-allocations branch from e9a4ca2 to c26cb06 Compare January 5, 2022 15:34

ronag closed this Jan 6, 2022

ronag reopened this Jan 6, 2022

ronag requested a review from lpinca January 6, 2022 12:31

ronag force-pushed the avoid-buf-allocations branch 6 times, most recently from a5cf313 to c285e36 Compare January 6, 2022 13:15

perf: try to avoid buffer allocations

3bd436c

ronag force-pushed the avoid-buf-allocations branch from c285e36 to 3bd436c Compare January 6, 2022 13:18

ronag added 2 commits January 6, 2022 14:20

refactor: single return

02d4afc

perf: avoid sendFrame

7b034f1

lpinca added a commit that referenced this pull request Jan 6, 2022

[perf] Reduce buffer allocations

b1a6e81

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

lpinca mentioned this pull request Jan 6, 2022

Reduce buffer allocations #2000

Merged

lpinca added a commit that referenced this pull request Jan 6, 2022

[perf] Reduce buffer allocations

94299f0

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

lpinca added a commit that referenced this pull request Jan 7, 2022

[perf] Reduce buffer allocations

dbb0cf4

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

lpinca added a commit that referenced this pull request Jan 7, 2022

[perf] Reduce buffer allocations

4461ac6

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

lpinca added a commit that referenced this pull request Jan 7, 2022

[perf] Reduce buffer allocations

8735894

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

ronag closed this Jan 9, 2022

lpinca added a commit that referenced this pull request Jan 13, 2022

[perf] Reduce buffer allocations (#2000)

5b7fbb0

Do not convert strings to `Buffer`s if data does not need to be masked. Refs: #1998

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: try to avoid buffer allocations #1998

perf: try to avoid buffer allocations #1998

ronag commented Jan 4, 2022

ronag commented Jan 5, 2022 •

edited

Loading

lpinca Jan 5, 2022 •

edited

Loading

ronag Jan 5, 2022 •

edited

Loading

ronag Jan 5, 2022

lpinca Jan 5, 2022

ronag Jan 5, 2022 •

edited

Loading

lpinca commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

lpinca commented Jan 5, 2022

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022

lpinca commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

lpinca commented Jan 5, 2022

ronag commented Jan 5, 2022 •

edited

Loading

lpinca commented Jan 5, 2022

ronag commented Jan 5, 2022 •

edited

Loading

lpinca commented Jan 6, 2022 •

edited

Loading

ronag commented Jan 6, 2022

lpinca commented Jan 6, 2022

ronag commented Jan 6, 2022 •

edited

Loading

ronag commented Jan 6, 2022

lpinca commented Jan 6, 2022 •

edited

Loading

lpinca commented Jan 6, 2022

ronag commented Jan 6, 2022

ronag commented Jan 6, 2022

		const buf =
		perMessageDeflate \|\| typeof data !== 'string' ? toBuffer(data) : data;

perf: try to avoid buffer allocations #1998

perf: try to avoid buffer allocations #1998

Conversation

ronag commented Jan 4, 2022

ronag commented Jan 5, 2022 • edited Loading

lpinca Jan 5, 2022 • edited Loading

Choose a reason for hiding this comment

ronag Jan 5, 2022 • edited Loading

Choose a reason for hiding this comment

ronag Jan 5, 2022

Choose a reason for hiding this comment

lpinca Jan 5, 2022

Choose a reason for hiding this comment

ronag Jan 5, 2022 • edited Loading

Choose a reason for hiding this comment

lpinca commented Jan 5, 2022 • edited Loading

ronag commented Jan 5, 2022 • edited Loading

ronag commented Jan 5, 2022 • edited Loading

ronag commented Jan 5, 2022 • edited Loading

ronag commented Jan 5, 2022 • edited Loading

lpinca commented Jan 5, 2022

ronag commented Jan 5, 2022 • edited Loading

ronag commented Jan 5, 2022

lpinca commented Jan 5, 2022 • edited Loading

ronag commented Jan 5, 2022 • edited Loading

lpinca commented Jan 5, 2022

ronag commented Jan 5, 2022 • edited Loading

lpinca commented Jan 5, 2022

ronag commented Jan 5, 2022 • edited Loading

lpinca commented Jan 6, 2022 • edited Loading

ronag commented Jan 6, 2022

lpinca commented Jan 6, 2022

ronag commented Jan 6, 2022 • edited Loading

ronag commented Jan 6, 2022

lpinca commented Jan 6, 2022 • edited Loading

lpinca commented Jan 6, 2022

ronag commented Jan 6, 2022

ronag commented Jan 6, 2022

ronag commented Jan 5, 2022 •

edited

Loading

lpinca Jan 5, 2022 •

edited

Loading

ronag Jan 5, 2022 •

edited

Loading

ronag Jan 5, 2022 •

edited

Loading

lpinca commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

lpinca commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

ronag commented Jan 5, 2022 •

edited

Loading

lpinca commented Jan 6, 2022 •

edited

Loading

ronag commented Jan 6, 2022 •

edited

Loading

lpinca commented Jan 6, 2022 •

edited

Loading