Increase escapeHTML performance up to 10 times! #27

frenzzy · 2020-09-09T12:02:39Z

I compared performance for escapeHTML function using @dougwilson's benchmark from escape-html repo.

Before:

After:

Bundle size will increase a bit but it shouldn't matter if this code is mostly used on backend side during server-side rendering where the performance is much more important.

ryansolid · 2020-09-09T16:05:26Z

Awesome. Escaping is a huge part of the performance cost in SSR. I look forward to checking this out.

ryansolid · 2020-09-10T02:55:42Z

Wow.. people need to know about this. It seems to work but seems too good to be true. This just doubled Solid's score on the isomorphic benchmarks(https://github.com/marko-js/isomorphic-ui-benchmarks). Shooting it past even Marko. I'm sure we will use the same technique in Marko but for a few days we can enjoy being the fastest on the server and the client.

Seriously ignoring Inferno's lack of escaping attributes skewing results in the color picker look at applying this escaping does:

Running "search-results"...

Running benchmark "marko"...

marko x 4,948 ops/sec ±1.33% (89 runs sampled)

Running benchmark "preact"...

preact x 524 ops/sec ±1.06% (89 runs sampled)

Running benchmark "react"...

react x 561 ops/sec ±5.58% (74 runs sampled)

Running benchmark "vue"...

vue x 1,789 ops/sec ±3.99% (72 runs sampled)

Running benchmark "inferno"...

inferno x 2,047 ops/sec ±2.26% (84 runs sampled)

Running benchmark "solid"...

solid x 13,515 ops/sec ±1.03% (94 runs sampled)

Running benchmark "svelte"...

svelte x 2,860 ops/sec ±5.03% (73 runs sampled)

Fastest is solid

Running "color-picker"...

Running benchmark "marko"...

marko x 17,367 ops/sec ±5.15% (79 runs sampled)

Running benchmark "preact"...

preact x 2,973 ops/sec ±3.60% (87 runs sampled)

Running benchmark "react"...

react x 3,211 ops/sec ±5.48% (78 runs sampled)

Running benchmark "vue"...

vue x 6,114 ops/sec ±5.10% (70 runs sampled)

Running benchmark "inferno"...

inferno x 17,168 ops/sec ±3.07% (90 runs sampled)

Running benchmark "solid"...

solid x 33,844 ops/sec ±0.81% (93 runs sampled)

Running benchmark "svelte"...

svelte x 7,406 ops/sec ±3.18% (79 runs sampled)

Fastest is solid

DONE!

dougwilson · 2020-09-10T02:59:30Z

The benefit of micro module is full attention is given to the one thing it actually does.

rturnq · 2020-09-10T17:28:07Z

I think there may be an issue keeping the regexes outside of the escapeHTML function scope. Using RegExp.prototype.exec is stateful when using a regex with y or g flags, and it stores the last index internally. This will make the results non-deterministic and some inputs will not be fully escaped or escaped at all. Moving the regexes into the function fixes it but may degrade performance. Another option would be to call exec('') before returning which would reset its state, not sure if this is faster than creating the regex.

Here is an example from using the console:

rturnq · 2020-09-10T19:11:17Z

It looks like the regex's last index is writable (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex), so I think the fastest option is to set it to 0 right before the final return.

Ha. Just looked at the original and since it only cares about the first match, the g can just be removed from the regex.

ryansolid · 2020-09-10T19:34:56Z

Oh right of course.. before with replace we needed all the matches.. but since we are just using this to determine where to start the first match is sufficient. Ha good call.

Remove g flag from escape regex, fixes #27

ryansolid · 2020-09-11T02:24:50Z

Yeah the change didn't affect the benchmarks. Everything still looking really good.

xandris · 2020-09-21T03:25:30Z

Is there a reason not to use let index = html.search(regex) since all you want from the match is the index?

frenzzy · 2020-09-21T10:49:32Z

@xandris with str.search(regexp) the case when there are no special chars (the most common case) works 12% slower:

exec vs search

xandris · 2020-09-21T18:49:48Z

Well you can't deny the evidence but WHAT! Node, what are you doing? Sorry for butting in, thanks for the numbers

xandris · 2020-09-22T01:41:21Z

No one asked but this implementation is ~50% more ops/sec on my machine than the one in this PR. It's gnarly though haha

function escapeHtml(s, attr) {
  const delim = attr ? '"' : '<';
  const escDelim = attr ? '&quot;' : '&lt;';
  let iDelim = s.indexOf(delim);
  let iAmp = s.indexOf('&');

  if(iDelim < 0 && iAmp < 0)
    return s;

  let left = 0, out = '';

  while(iDelim >= 0 && iAmp >= 0) {
    if(iDelim < iAmp) {
      if(left < iDelim)
        out += s.substring(left, iDelim);
      out += escDelim;
      left = iDelim + 1;
      iDelim = s.indexOf(delim, left);
    } else {
      if(left < iAmp)
        out += s.substring(left, iAmp);
      out += '&amp;';
      left = iAmp + 1;
      iAmp = s.indexOf('&', left);
    }
  }

  if(iDelim >= 0 ) {
    do {
      if(left < iDelim)
        out += s.substring(left, iDelim);
      out += escDelim;
      left = iDelim+1;
      iDelim = s.indexOf(delim, left);
    } while(iDelim >= 0);
  } else while(iAmp >= 0) {
    if(left < iAmp)
      out += s.substring(left, iAmp);
    out += '&amp;'
    left = iAmp+1;
    iAmp = s.indexOf('&', left);
  }

  return left < s.length ? out + s.substring(left) : out;
}

ryansolid · 2020-09-22T04:13:22Z

Size to this degree doesn't matter on the server. So not particularly concerned about size, but I do agree we want to optimize for the no special characters. And we need test anything handwritten fairly thoroughly.

But interesting approach. Given we know we know we are in an attribute vs script, I think you are correct to look for the single character (and the ampersand).

xandris · 2020-09-22T17:09:06Z

I only have this machine to try it on, so I don't know if it's faster for anyone else, but if you're interested I could open a PR. I'm going to be using this on my project's server-side code :)

ryansolid · 2020-09-22T17:10:49Z

It looks like it produces about a 5-8% boost in the final SSR test I'm doing but it's close and I haven't counted it out to variance. I'd like to test it the same way @frenzzy has been.

ryansolid · 2020-10-16T08:44:57Z

It's taken me some time to get back to it, but having tested perf directly @xandris it looks good and I will be incorporating in the next minor version. Thank you.

joshgoebel · 2020-11-23T09:44:06Z

I know we just borrowed the code here... but it's worth stopping to ask:

If regex is SO fast that using it to bootstrap the search is a good idea - why does is it also suddenly so slow that it's avoided entirely later - and instead the string is scanned one character at a time? :-)

This implementation is essentially a very very limited rewrite of a single regex - coded in Javascript. Turns out that's far slower than just using the built-in RegExp engine. :-)

This change makes escape-html 30% faster and the version here 75% faster (because you are encoding fewer characters).

component/escape-html#28

If someone could please explain to me WHY the built-in replace is darn crazy slow though... I'll truly owe you a coffee or something.

ryansolid · 2020-11-23T17:24:41Z

@joshgoebel Thank you for spending the time to look into this. I'd already moved on when this came in and was focused elsewhere. It's good to know that that this is just something brutally slow around replace. Thank you for bringing it to my attention and perceivably anyone else who is following down this path as well. It's clear there is still more to understand here. Thank you.

joshgoebel · 2020-11-24T00:24:53Z

It also seems there is perhaps a huge difference between large and small strings and the frequency of the letters to be encoded in the data. So that makes it hard to even benchmark without first deciding on representative sample data looks like. For example right now the escape-html test suits is showing opposite results (from mine) in 2 of it's 3 cases, but it's also using 10 character strings where-as the test case I used was 1.2mb of actual source code.

And it also seems possible that if someone was encoding data where every few characters needed to be replaced that this "letter by letter" approach might still be superior. I don't think we good enough benchmarks (covering enough cases) yet to be certain.

Anyways I just thought if the speed was critical that someone might want to know there is quite possibly more to be had. :)

ryansolid · 2020-11-24T00:42:37Z

@joshgoebel Also great information. Thank you. I do think given these are small template inserts where the static parts are already processed at compile time the average is going to be in that 10 character region. For the most part I was testing as part of not a direct benchmark but in a full SSR benchmark where escaping makes a big impact but still isn't most of the overhead by any means. This 10x improvement was more like a 2-3x there. So it lacks the precision to see most differences.

joshgoebel · 2020-11-24T00:46:01Z

I do think given these are small template inserts where the static parts are already processed at compile time the average is going to be in that 10 character region.

Oh wow. :-) Ah, I was imagining dropping in processed user input like posting a question of StackOverflow or a forum post or something... (which would be much longer than 10, but not 1.2mb by any stretch) Your use case then is very different than mine. I kind of wonder if a truly optimal solution would have two entirely different code paths based on the size of the string. It seems there may be some real "startup costs" with just engaging regex on tiny strings...

xandris · 2020-11-25T17:59:34Z

My approach isn't letter-by-letter, so I'm kind of confused. I was reaching for a strpbrk or strcspn in JavaScript, couldn't find one, so I made a very limited version using indexOf. I tried a lot of variations on regexes to find the next character and couldn't get them to perform as well as indexOf...

joshgoebel · 2020-11-25T18:12:47Z

@xandris I was commenting on the code in the PR here, not your posted code. I'm looking at this in a lot of other contexts and often there are 5 characters that need escaping, so indexOf quickly starts to be a real mess when you have to run it 5 times (for all 5 special chars), sort those results to find the "next" char, etc... I'd be surprised if at that point it's still faster than regex...

Although perhaps it if was done sequentially it'd be much simpler... ie:

replaceChar(str, "&", "&amp;")
replaceChar(...
replaceChar(...

Where replaceChar can focus on a single letter at a time... the code would probably be much simpler. If your's is 50% faster though that's impressive. :)

joshgoebel · 2020-11-25T18:43:21Z

@xandris Oh wow you might be onto something with indexOf. With a 5 character replace:

  1000000 characters (0% special) (current)  x 1,444 ops/sec ±0.82% (11 runs sampled)
  1000000 characters (0% special) Regex      x 1,443 ops/sec ±1.29% (11 runs sampled)
  1000000 characters (5% special) (current)  x   147 ops/sec ±0.86% (10 runs sampled)
  1000000 characters (5% special) Regex      x   235 ops/sec ±0.55% (10 runs sampled)
  1000000 characters (10% special) (current) x 63.42 ops/sec ±1.55% (6 runs sampled)
  1000000 characters (10% special) Regex     x   132 ops/sec ±1.42% (10 runs sampled)
  1000000 characters (15% special) (current) x 71.44 ops/sec ±0.89% (9 runs sampled)
  1000000 characters (15% special) Regex     x 77.82 ops/sec ±0.49% (10 runs sampled)
  1000000 characters (20% special) (current) x 81.13 ops/sec ±2.05% (9 runs sampled)
  1000000 characters (20% special) Regex     x 51.33 ops/sec ±1.89% (8 runs sampled)
  1000000 characters (25% special) (current) x 18.96 ops/sec ±19.20% (5 runs sampled)
  1000000 characters (25% special) Regex     x 50.50 ops/sec ±0.48% (9 runs sampled)
  1000000 characters (30% special) (current) x 17.99 ops/sec ±18.10% (6 runs sampled)
  1000000 characters (30% special) Regex     x 48.01 ops/sec ±12.32% (8 runs sampled)

Ignore regex it should say indexOf. :)

Much simpler linear solution:

function replaceChar(str, char, replacement) {
  var html = ""
  var index = 0
  var lastIndex = 0
  while ((index = str.indexOf(char, lastIndex))!=-1) {
    html += str.substring(lastIndex, index) + replacement
    lastIndex = index + 1
  }
  return html + str.substring(lastIndex)
}

// console.log(replaceChar("boo<<<hoo", "<", "&lt;"));

/**
 * Escape special characters in the given string of text.
 *
 * @param  {string} string The string to escape for inserting into HTML
 * @return {string}
 * @public
 */

var matchHtmlRegExpFast = /["'&<>]/
function escapeHtmlFast (string) {
  var str = '' + string
  if (!matchHtmlRegExpFast.test(str)) return str;

  var out = replaceChar(string, "&", "&amp;")
  out = replaceChar(out, "<", "&lt;")
  out = replaceChar(out, ">", "&gt;")
  out = replaceChar(out, "'", '&#39;')
  out = replaceChar(out, '"', '&quot;')

  return out;
}

One must be very carefully with sample cases though. The performance of all these things seems to very widely based not only on the density of characters needing replacement but also on the size of the input test. There may also be variance based on whether the input is randomized or has some structure.

  657579 characters (5% special) (current)  x   230 ops/sec ±2.88% (10 runs sampled)
  657579 characters (5% special) Regex      x   270 ops/sec ±0.60% (11 runs sampled)
  Long HTML page                            x   255 ops/sec ±5.96% (10 runs sampled)
  Long HTML page REGEX                      x   163 ops/sec ±3.09% (10 runs sampled)

Same size of data but the actual real-life HTML is slower and the generated data is faster despite the density being almost the same.

joshgoebel · 2020-11-25T19:05:31Z

As mentioned over in the escape_html library thread optimally we'd collect a bunch of real-life representative samples of various sizes representing actual data users are likely to encode and use to tune the performance.

xandris · 2020-11-25T19:16:20Z

The version here only needed to replace two characters, so I didn't need to compute five indices and sort them or keep intermediate results that get discarded. But yeah it's maddening trying to do this in JavaScript. I wish there were a strcspn available.

frenzzy added 3 commits September 9, 2020 14:49

Improve escapeHTML performance

b083bb8

Update utils.js

ae441e2

Update runtime.js

c583e76

ryansolid merged commit 3029e0f into ryansolid:master Sep 10, 2020

frenzzy deleted the patch-1 branch September 10, 2020 04:07

frenzzy added a commit to frenzzy/dom-expressions that referenced this pull request Sep 10, 2020

Remove g flag from escape regex, fixes ryansolid#27

cb34b85

frenzzy mentioned this pull request Sep 10, 2020

Remove g flag from escape regex, fixes #27 #28

Merged

ryansolid added a commit that referenced this pull request Sep 10, 2020

Merge pull request #28 from frenzzy/patch-1

e4ba45b

Remove g flag from escape regex, fixes #27

RunDevelopment mentioned this pull request Sep 18, 2020

Core: Token.stringify will now call util.encode PrismJS/prism#1844

Closed

Rich-Harris added a commit to sveltejs/svelte that referenced this pull request Nov 20, 2020

faster SSR, modelled on ryansolid/dom-expressions#27

aa15e9a

Rich-Harris mentioned this pull request Nov 20, 2020

Faster SSR sveltejs/svelte#5701

Merged

4 tasks

RunDevelopment mentioned this pull request Nov 21, 2020

[Request] Faster HTML escaping? highlightjs/highlight.js#2885

Closed

leeoniya mentioned this pull request May 26, 2021

SSR -> ludicrous speed domvm/domvm#158

Open

pkozlowski-opensource mentioned this pull request Aug 17, 2022

HTML escaper performance could be improved by adopting the same html escaping code as Svelte and Solid JS angular/angular#46950

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase escapeHTML performance up to 10 times! #27

Increase escapeHTML performance up to 10 times! #27

frenzzy commented Sep 9, 2020

ryansolid commented Sep 9, 2020

ryansolid commented Sep 10, 2020

dougwilson commented Sep 10, 2020

rturnq commented Sep 10, 2020 •

edited

Loading

rturnq commented Sep 10, 2020 •

edited

Loading

ryansolid commented Sep 10, 2020 •

edited

Loading

ryansolid commented Sep 11, 2020

xandris commented Sep 21, 2020

frenzzy commented Sep 21, 2020 •

edited

Loading

xandris commented Sep 21, 2020

xandris commented Sep 22, 2020

ryansolid commented Sep 22, 2020

xandris commented Sep 22, 2020

ryansolid commented Sep 22, 2020

ryansolid commented Oct 16, 2020

joshgoebel commented Nov 23, 2020 •

edited

Loading

ryansolid commented Nov 23, 2020

joshgoebel commented Nov 24, 2020 •

edited

Loading

ryansolid commented Nov 24, 2020

joshgoebel commented Nov 24, 2020

xandris commented Nov 25, 2020

joshgoebel commented Nov 25, 2020 •

edited

Loading

joshgoebel commented Nov 25, 2020 •

edited

Loading

joshgoebel commented Nov 25, 2020

xandris commented Nov 25, 2020

Increase escapeHTML performance up to 10 times! #27

Increase escapeHTML performance up to 10 times! #27

Conversation

frenzzy commented Sep 9, 2020

ryansolid commented Sep 9, 2020

ryansolid commented Sep 10, 2020

dougwilson commented Sep 10, 2020

rturnq commented Sep 10, 2020 • edited Loading

rturnq commented Sep 10, 2020 • edited Loading

ryansolid commented Sep 10, 2020 • edited Loading

ryansolid commented Sep 11, 2020

xandris commented Sep 21, 2020

frenzzy commented Sep 21, 2020 • edited Loading

xandris commented Sep 21, 2020

xandris commented Sep 22, 2020

ryansolid commented Sep 22, 2020

xandris commented Sep 22, 2020

ryansolid commented Sep 22, 2020

ryansolid commented Oct 16, 2020

joshgoebel commented Nov 23, 2020 • edited Loading

ryansolid commented Nov 23, 2020

joshgoebel commented Nov 24, 2020 • edited Loading

ryansolid commented Nov 24, 2020

joshgoebel commented Nov 24, 2020

xandris commented Nov 25, 2020

joshgoebel commented Nov 25, 2020 • edited Loading

joshgoebel commented Nov 25, 2020 • edited Loading

joshgoebel commented Nov 25, 2020

xandris commented Nov 25, 2020

rturnq commented Sep 10, 2020 •

edited

Loading

rturnq commented Sep 10, 2020 •

edited

Loading

ryansolid commented Sep 10, 2020 •

edited

Loading

frenzzy commented Sep 21, 2020 •

edited

Loading

joshgoebel commented Nov 23, 2020 •

edited

Loading

joshgoebel commented Nov 24, 2020 •

edited

Loading

joshgoebel commented Nov 25, 2020 •

edited

Loading

joshgoebel commented Nov 25, 2020 •

edited

Loading