-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase escapeHTML performance up to 10 times! #27
Conversation
Awesome. Escaping is a huge part of the performance cost in SSR. I look forward to checking this out. |
Wow.. people need to know about this. It seems to work but seems too good to be true. This just doubled Solid's score on the isomorphic benchmarks(https://github.com/marko-js/isomorphic-ui-benchmarks). Shooting it past even Marko. I'm sure we will use the same technique in Marko but for a few days we can enjoy being the fastest on the server and the client. Seriously ignoring Inferno's lack of escaping attributes skewing results in the color picker look at applying this escaping does: Running "search-results"... Running benchmark "marko"... marko x 4,948 ops/sec ±1.33% (89 runs sampled) Running benchmark "preact"... preact x 524 ops/sec ±1.06% (89 runs sampled) Running benchmark "react"... react x 561 ops/sec ±5.58% (74 runs sampled) Running benchmark "vue"... vue x 1,789 ops/sec ±3.99% (72 runs sampled) Running benchmark "inferno"... inferno x 2,047 ops/sec ±2.26% (84 runs sampled) Running benchmark "solid"... solid x 13,515 ops/sec ±1.03% (94 runs sampled) Running benchmark "svelte"... svelte x 2,860 ops/sec ±5.03% (73 runs sampled) Fastest is solid Running "color-picker"... Running benchmark "marko"... marko x 17,367 ops/sec ±5.15% (79 runs sampled) Running benchmark "preact"... preact x 2,973 ops/sec ±3.60% (87 runs sampled) Running benchmark "react"... react x 3,211 ops/sec ±5.48% (78 runs sampled) Running benchmark "vue"... vue x 6,114 ops/sec ±5.10% (70 runs sampled) Running benchmark "inferno"... inferno x 17,168 ops/sec ±3.07% (90 runs sampled) Running benchmark "solid"... solid x 33,844 ops/sec ±0.81% (93 runs sampled) Running benchmark "svelte"... svelte x 7,406 ops/sec ±3.18% (79 runs sampled) Fastest is solid DONE! |
The benefit of micro module is full attention is given to the one thing it actually does. |
I think there may be an issue keeping the regexes outside of the |
It looks like the regex's last index is writable (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex), so I think the fastest option is to set it to 0 right before the final return. Ha. Just looked at the original and since it only cares about the first match, the |
Oh right of course.. before with replace we needed all the matches.. but since we are just using this to determine where to start the first match is sufficient. Ha good call. |
Remove g flag from escape regex, fixes #27
Yeah the change didn't affect the benchmarks. Everything still looking really good. |
Is there a reason not to use |
@xandris with |
Well you can't deny the evidence but WHAT! Node, what are you doing? Sorry for butting in, thanks for the numbers |
No one asked but this implementation is ~50% more ops/sec on my machine than the one in this PR. It's gnarly though haha function escapeHtml(s, attr) {
const delim = attr ? '"' : '<';
const escDelim = attr ? '"' : '<';
let iDelim = s.indexOf(delim);
let iAmp = s.indexOf('&');
if(iDelim < 0 && iAmp < 0)
return s;
let left = 0, out = '';
while(iDelim >= 0 && iAmp >= 0) {
if(iDelim < iAmp) {
if(left < iDelim)
out += s.substring(left, iDelim);
out += escDelim;
left = iDelim + 1;
iDelim = s.indexOf(delim, left);
} else {
if(left < iAmp)
out += s.substring(left, iAmp);
out += '&';
left = iAmp + 1;
iAmp = s.indexOf('&', left);
}
}
if(iDelim >= 0 ) {
do {
if(left < iDelim)
out += s.substring(left, iDelim);
out += escDelim;
left = iDelim+1;
iDelim = s.indexOf(delim, left);
} while(iDelim >= 0);
} else while(iAmp >= 0) {
if(left < iAmp)
out += s.substring(left, iAmp);
out += '&'
left = iAmp+1;
iAmp = s.indexOf('&', left);
}
return left < s.length ? out + s.substring(left) : out;
} |
Size to this degree doesn't matter on the server. So not particularly concerned about size, but I do agree we want to optimize for the no special characters. And we need test anything handwritten fairly thoroughly. But interesting approach. Given we know we know we are in an attribute vs script, I think you are correct to look for the single character (and the ampersand). |
I only have this machine to try it on, so I don't know if it's faster for anyone else, but if you're interested I could open a PR. I'm going to be using this on my project's server-side code :) |
It looks like it produces about a 5-8% boost in the final SSR test I'm doing but it's close and I haven't counted it out to variance. I'd like to test it the same way @frenzzy has been. |
It's taken me some time to get back to it, but having tested perf directly @xandris it looks good and I will be incorporating in the next minor version. Thank you. |
I know we just borrowed the code here... but it's worth stopping to ask:
This implementation is essentially a very very limited rewrite of a single regex - coded in Javascript. Turns out that's far slower than just using the built-in RegExp engine. :-) This change makes If someone could please explain to me WHY the built-in |
@joshgoebel Thank you for spending the time to look into this. I'd already moved on when this came in and was focused elsewhere. It's good to know that that this is just something brutally slow around |
It also seems there is perhaps a huge difference between large and small strings and the frequency of the letters to be encoded in the data. So that makes it hard to even benchmark without first deciding on representative sample data looks like. For example right now the And it also seems possible that if someone was encoding data where every few characters needed to be replaced that this "letter by letter" approach might still be superior. I don't think we good enough benchmarks (covering enough cases) yet to be certain. Anyways I just thought if the speed was critical that someone might want to know there is quite possibly more to be had. :) |
@joshgoebel Also great information. Thank you. I do think given these are small template inserts where the static parts are already processed at compile time the average is going to be in that 10 character region. For the most part I was testing as part of not a direct benchmark but in a full SSR benchmark where escaping makes a big impact but still isn't most of the overhead by any means. This 10x improvement was more like a 2-3x there. So it lacks the precision to see most differences. |
Oh wow. :-) Ah, I was imagining dropping in processed user input like posting a question of StackOverflow or a forum post or something... (which would be much longer than 10, but not 1.2mb by any stretch) Your use case then is very different than mine. I kind of wonder if a truly optimal solution would have two entirely different code paths based on the size of the string. It seems there may be some real "startup costs" with just engaging regex on tiny strings... |
My approach isn't letter-by-letter, so I'm kind of confused. I was reaching for a |
@xandris I was commenting on the code in the PR here, not your posted code. I'm looking at this in a lot of other contexts and often there are 5 characters that need escaping, so indexOf quickly starts to be a real mess when you have to run it 5 times (for all 5 special chars), sort those results to find the "next" char, etc... I'd be surprised if at that point it's still faster than regex... Although perhaps it if was done sequentially it'd be much simpler... ie: replaceChar(str, "&", "&")
replaceChar(...
replaceChar(... Where replaceChar can focus on a single letter at a time... the code would probably be much simpler. If your's is 50% faster though that's impressive. :) |
@xandris Oh wow you might be onto something with indexOf. With a 5 character replace:
Ignore regex it should say indexOf. :) Much simpler linear solution: function replaceChar(str, char, replacement) {
var html = ""
var index = 0
var lastIndex = 0
while ((index = str.indexOf(char, lastIndex))!=-1) {
html += str.substring(lastIndex, index) + replacement
lastIndex = index + 1
}
return html + str.substring(lastIndex)
}
// console.log(replaceChar("boo<<<hoo", "<", "<"));
/**
* Escape special characters in the given string of text.
*
* @param {string} string The string to escape for inserting into HTML
* @return {string}
* @public
*/
var matchHtmlRegExpFast = /["'&<>]/
function escapeHtmlFast (string) {
var str = '' + string
if (!matchHtmlRegExpFast.test(str)) return str;
var out = replaceChar(string, "&", "&")
out = replaceChar(out, "<", "<")
out = replaceChar(out, ">", ">")
out = replaceChar(out, "'", ''')
out = replaceChar(out, '"', '"')
return out;
} One must be very carefully with sample cases though. The performance of all these things seems to very widely based not only on the density of characters needing replacement but also on the size of the input test. There may also be variance based on whether the input is randomized or has some structure.
Same size of data but the actual real-life HTML is slower and the generated data is faster despite the density being almost the same. |
As mentioned over in the |
The version here only needed to replace two characters, so I didn't need to compute five indices and sort them or keep intermediate results that get discarded. But yeah it's maddening trying to do this in JavaScript. I wish there were a strcspn available. |
I compared performance for
escapeHTML
function using @dougwilson's benchmark from escape-html repo.Before:
After:
Bundle size will increase a bit but it shouldn't matter if this code is mostly used on backend side during server-side rendering where the performance is much more important.