Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase escapeHTML performance up to 10 times! #27

Merged
merged 3 commits into from
Sep 10, 2020

Conversation

frenzzy
Copy link
Contributor

@frenzzy frenzzy commented Sep 9, 2020

I compared performance for escapeHTML function using @dougwilson's benchmark from escape-html repo.

Before:

Screenshot 2020-09-09 at 14 58 25

After:

Screenshot 2020-09-09 at 14 58 41

Bundle size will increase a bit but it shouldn't matter if this code is mostly used on backend side during server-side rendering where the performance is much more important.

@ryansolid
Copy link
Owner

Awesome. Escaping is a huge part of the performance cost in SSR. I look forward to checking this out.

@ryansolid
Copy link
Owner

Wow.. people need to know about this. It seems to work but seems too good to be true. This just doubled Solid's score on the isomorphic benchmarks(https://github.com/marko-js/isomorphic-ui-benchmarks). Shooting it past even Marko. I'm sure we will use the same technique in Marko but for a few days we can enjoy being the fastest on the server and the client.

Seriously ignoring Inferno's lack of escaping attributes skewing results in the color picker look at applying this escaping does:

Running "search-results"...

Running benchmark "marko"...

marko x 4,948 ops/sec ±1.33% (89 runs sampled)

Running benchmark "preact"...

preact x 524 ops/sec ±1.06% (89 runs sampled)

Running benchmark "react"...

react x 561 ops/sec ±5.58% (74 runs sampled)

Running benchmark "vue"...

vue x 1,789 ops/sec ±3.99% (72 runs sampled)

Running benchmark "inferno"...

inferno x 2,047 ops/sec ±2.26% (84 runs sampled)

Running benchmark "solid"...

solid x 13,515 ops/sec ±1.03% (94 runs sampled)

Running benchmark "svelte"...

svelte x 2,860 ops/sec ±5.03% (73 runs sampled)

Fastest is solid


Running "color-picker"...

Running benchmark "marko"...

marko x 17,367 ops/sec ±5.15% (79 runs sampled)

Running benchmark "preact"...

preact x 2,973 ops/sec ±3.60% (87 runs sampled)

Running benchmark "react"...

react x 3,211 ops/sec ±5.48% (78 runs sampled)

Running benchmark "vue"...

vue x 6,114 ops/sec ±5.10% (70 runs sampled)

Running benchmark "inferno"...

inferno x 17,168 ops/sec ±3.07% (90 runs sampled)

Running benchmark "solid"...

solid x 33,844 ops/sec ±0.81% (93 runs sampled)

Running benchmark "svelte"...

svelte x 7,406 ops/sec ±3.18% (79 runs sampled)

Fastest is solid


DONE!

@dougwilson
Copy link

The benefit of micro module is full attention is given to the one thing it actually does.

@ryansolid ryansolid merged commit 3029e0f into ryansolid:master Sep 10, 2020
@frenzzy frenzzy deleted the patch-1 branch September 10, 2020 04:07
@rturnq
Copy link

rturnq commented Sep 10, 2020

I think there may be an issue keeping the regexes outside of the escapeHTML function scope. Using RegExp.prototype.exec is stateful when using a regex with y or g flags, and it stores the last index internally. This will make the results non-deterministic and some inputs will not be fully escaped or escaped at all. Moving the regexes into the function fixes it but may degrade performance. Another option would be to call exec('') before returning which would reset its state, not sure if this is faster than creating the regex.

Here is an example from using the console:
image

@rturnq
Copy link

rturnq commented Sep 10, 2020

It looks like the regex's last index is writable (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex), so I think the fastest option is to set it to 0 right before the final return.

Ha. Just looked at the original and since it only cares about the first match, the g can just be removed from the regex.

@ryansolid
Copy link
Owner

ryansolid commented Sep 10, 2020

Oh right of course.. before with replace we needed all the matches.. but since we are just using this to determine where to start the first match is sufficient. Ha good call.

frenzzy added a commit to frenzzy/dom-expressions that referenced this pull request Sep 10, 2020
ryansolid added a commit that referenced this pull request Sep 10, 2020
Remove g flag from escape regex, fixes #27
@ryansolid
Copy link
Owner

Yeah the change didn't affect the benchmarks. Everything still looking really good.

@xandris
Copy link

xandris commented Sep 21, 2020

Is there a reason not to use let index = html.search(regex) since all you want from the match is the index?

@frenzzy
Copy link
Contributor Author

frenzzy commented Sep 21, 2020

@xandris with str.search(regexp) the case when there are no special chars (the most common case) works 12% slower:

exec vs search
Screenshot 2020-09-21 at 13 46 15 Screenshot 2020-09-21 at 13 46 30

@xandris
Copy link

xandris commented Sep 21, 2020

Well you can't deny the evidence but WHAT! Node, what are you doing? Sorry for butting in, thanks for the numbers

@xandris
Copy link

xandris commented Sep 22, 2020

No one asked but this implementation is ~50% more ops/sec on my machine than the one in this PR. It's gnarly though haha

function escapeHtml(s, attr) {
  const delim = attr ? '"' : '<';
  const escDelim = attr ? '&quot;' : '&lt;';
  let iDelim = s.indexOf(delim);
  let iAmp = s.indexOf('&');

  if(iDelim < 0 && iAmp < 0)
    return s;

  let left = 0, out = '';

  while(iDelim >= 0 && iAmp >= 0) {
    if(iDelim < iAmp) {
      if(left < iDelim)
        out += s.substring(left, iDelim);
      out += escDelim;
      left = iDelim + 1;
      iDelim = s.indexOf(delim, left);
    } else {
      if(left < iAmp)
        out += s.substring(left, iAmp);
      out += '&amp;';
      left = iAmp + 1;
      iAmp = s.indexOf('&', left);
    }
  }

  if(iDelim >= 0 ) {
    do {
      if(left < iDelim)
        out += s.substring(left, iDelim);
      out += escDelim;
      left = iDelim+1;
      iDelim = s.indexOf(delim, left);
    } while(iDelim >= 0);
  } else while(iAmp >= 0) {
    if(left < iAmp)
      out += s.substring(left, iAmp);
    out += '&amp;'
    left = iAmp+1;
    iAmp = s.indexOf('&', left);
  }

  return left < s.length ? out + s.substring(left) : out;
}

@ryansolid
Copy link
Owner

Size to this degree doesn't matter on the server. So not particularly concerned about size, but I do agree we want to optimize for the no special characters. And we need test anything handwritten fairly thoroughly.

But interesting approach. Given we know we know we are in an attribute vs script, I think you are correct to look for the single character (and the ampersand).

@xandris
Copy link

xandris commented Sep 22, 2020

I only have this machine to try it on, so I don't know if it's faster for anyone else, but if you're interested I could open a PR. I'm going to be using this on my project's server-side code :)

@ryansolid
Copy link
Owner

It looks like it produces about a 5-8% boost in the final SSR test I'm doing but it's close and I haven't counted it out to variance. I'd like to test it the same way @frenzzy has been.

@ryansolid
Copy link
Owner

It's taken me some time to get back to it, but having tested perf directly @xandris it looks good and I will be incorporating in the next minor version. Thank you.

@joshgoebel
Copy link

joshgoebel commented Nov 23, 2020

I know we just borrowed the code here... but it's worth stopping to ask:

  • If regex is SO fast that using it to bootstrap the search is a good idea - why does is it also suddenly so slow that it's avoided entirely later - and instead the string is scanned one character at a time? :-)

This implementation is essentially a very very limited rewrite of a single regex - coded in Javascript. Turns out that's far slower than just using the built-in RegExp engine. :-)

This change makes escape-html 30% faster and the version here 75% faster (because you are encoding fewer characters).

component/escape-html#28


If someone could please explain to me WHY the built-in replace is darn crazy slow though... I'll truly owe you a coffee or something.

@ryansolid
Copy link
Owner

@joshgoebel Thank you for spending the time to look into this. I'd already moved on when this came in and was focused elsewhere. It's good to know that that this is just something brutally slow around replace. Thank you for bringing it to my attention and perceivably anyone else who is following down this path as well. It's clear there is still more to understand here. Thank you.

@joshgoebel
Copy link

joshgoebel commented Nov 24, 2020

It also seems there is perhaps a huge difference between large and small strings and the frequency of the letters to be encoded in the data. So that makes it hard to even benchmark without first deciding on representative sample data looks like. For example right now the escape-html test suits is showing opposite results (from mine) in 2 of it's 3 cases, but it's also using 10 character strings where-as the test case I used was 1.2mb of actual source code.

And it also seems possible that if someone was encoding data where every few characters needed to be replaced that this "letter by letter" approach might still be superior. I don't think we good enough benchmarks (covering enough cases) yet to be certain.

Anyways I just thought if the speed was critical that someone might want to know there is quite possibly more to be had. :)

@ryansolid
Copy link
Owner

@joshgoebel Also great information. Thank you. I do think given these are small template inserts where the static parts are already processed at compile time the average is going to be in that 10 character region. For the most part I was testing as part of not a direct benchmark but in a full SSR benchmark where escaping makes a big impact but still isn't most of the overhead by any means. This 10x improvement was more like a 2-3x there. So it lacks the precision to see most differences.

@joshgoebel
Copy link

I do think given these are small template inserts where the static parts are already processed at compile time the average is going to be in that 10 character region.

Oh wow. :-) Ah, I was imagining dropping in processed user input like posting a question of StackOverflow or a forum post or something... (which would be much longer than 10, but not 1.2mb by any stretch) Your use case then is very different than mine. I kind of wonder if a truly optimal solution would have two entirely different code paths based on the size of the string. It seems there may be some real "startup costs" with just engaging regex on tiny strings...

@xandris
Copy link

xandris commented Nov 25, 2020

My approach isn't letter-by-letter, so I'm kind of confused. I was reaching for a strpbrk or strcspn in JavaScript, couldn't find one, so I made a very limited version using indexOf. I tried a lot of variations on regexes to find the next character and couldn't get them to perform as well as indexOf...

@joshgoebel
Copy link

joshgoebel commented Nov 25, 2020

@xandris I was commenting on the code in the PR here, not your posted code. I'm looking at this in a lot of other contexts and often there are 5 characters that need escaping, so indexOf quickly starts to be a real mess when you have to run it 5 times (for all 5 special chars), sort those results to find the "next" char, etc... I'd be surprised if at that point it's still faster than regex...

Although perhaps it if was done sequentially it'd be much simpler... ie:

replaceChar(str, "&", "&amp;")
replaceChar(...
replaceChar(...

Where replaceChar can focus on a single letter at a time... the code would probably be much simpler. If your's is 50% faster though that's impressive. :)

@joshgoebel
Copy link

joshgoebel commented Nov 25, 2020

@xandris Oh wow you might be onto something with indexOf. With a 5 character replace:

  1000000 characters (0% special) (current)  x 1,444 ops/sec ±0.82% (11 runs sampled)
  1000000 characters (0% special) Regex      x 1,443 ops/sec ±1.29% (11 runs sampled)
  1000000 characters (5% special) (current)  x   147 ops/sec ±0.86% (10 runs sampled)
  1000000 characters (5% special) Regex      x   235 ops/sec ±0.55% (10 runs sampled)
  1000000 characters (10% special) (current) x 63.42 ops/sec ±1.55% (6 runs sampled)
  1000000 characters (10% special) Regex     x   132 ops/sec ±1.42% (10 runs sampled)
  1000000 characters (15% special) (current) x 71.44 ops/sec ±0.89% (9 runs sampled)
  1000000 characters (15% special) Regex     x 77.82 ops/sec ±0.49% (10 runs sampled)
  1000000 characters (20% special) (current) x 81.13 ops/sec ±2.05% (9 runs sampled)
  1000000 characters (20% special) Regex     x 51.33 ops/sec ±1.89% (8 runs sampled)
  1000000 characters (25% special) (current) x 18.96 ops/sec ±19.20% (5 runs sampled)
  1000000 characters (25% special) Regex     x 50.50 ops/sec ±0.48% (9 runs sampled)
  1000000 characters (30% special) (current) x 17.99 ops/sec ±18.10% (6 runs sampled)
  1000000 characters (30% special) Regex     x 48.01 ops/sec ±12.32% (8 runs sampled)

Ignore regex it should say indexOf. :)

Much simpler linear solution:

function replaceChar(str, char, replacement) {
  var html = ""
  var index = 0
  var lastIndex = 0
  while ((index = str.indexOf(char, lastIndex))!=-1) {
    html += str.substring(lastIndex, index) + replacement
    lastIndex = index + 1
  }
  return html + str.substring(lastIndex)
}

// console.log(replaceChar("boo<<<hoo", "<", "&lt;"));

/**
 * Escape special characters in the given string of text.
 *
 * @param  {string} string The string to escape for inserting into HTML
 * @return {string}
 * @public
 */

var matchHtmlRegExpFast = /["'&<>]/
function escapeHtmlFast (string) {
  var str = '' + string
  if (!matchHtmlRegExpFast.test(str)) return str;

  var out = replaceChar(string, "&", "&amp;")
  out = replaceChar(out, "<", "&lt;")
  out = replaceChar(out, ">", "&gt;")
  out = replaceChar(out, "'", '&#39;')
  out = replaceChar(out, '"', '&quot;')

  return out;
}

One must be very carefully with sample cases though. The performance of all these things seems to very widely based not only on the density of characters needing replacement but also on the size of the input test. There may also be variance based on whether the input is randomized or has some structure.

  657579 characters (5% special) (current)  x   230 ops/sec ±2.88% (10 runs sampled)
  657579 characters (5% special) Regex      x   270 ops/sec ±0.60% (11 runs sampled)
  Long HTML page                            x   255 ops/sec ±5.96% (10 runs sampled)
  Long HTML page REGEX                      x   163 ops/sec ±3.09% (10 runs sampled)

Same size of data but the actual real-life HTML is slower and the generated data is faster despite the density being almost the same.

@joshgoebel
Copy link

As mentioned over in the escape_html library thread optimally we'd collect a bunch of real-life representative samples of various sizes representing actual data users are likely to encode and use to tune the performance.

@xandris
Copy link

xandris commented Nov 25, 2020

The version here only needed to replace two characters, so I didn't need to compute five indices and sort them or keep intermediate results that get discarded. But yeah it's maddening trying to do this in JavaScript. I wish there were a strcspn available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants