-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Porting roslyn regex tests to corefx #29178
Conversation
// See the LICENSE file in the project root for more information. | ||
|
||
using System.Runtime.CompilerServices; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've tried to avoid doing this, preferring reflection if necessary instead. Other issues aside, now that this will prevent the Linker from shaking away dead code.
cc: @weshaggard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know... I could change the RegexParseException
and RegexParseError
to be public in the src assembly only, whitelistening them and remove the extra RegexParser checks (https://github.com/dotnet/corefx/pull/29178/files#diff-e55ccc955dc3ddac57a0e77d1a8fc944R1001) in the test code. That would remove the InternalsVisibleTo requirement but would expose two types in the src assembly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could change the RegexParseException and RegexParseError to be public in the src assembly only
Why do they need to be public? What can't you achieve using reflection?
As a principle, we do not want to negatively impact shipping assemblies in order to make testing a little easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't be able to use attribute based testing if the RegexParseError
enum isn't public: https://github.com/dotnet/corefx/pull/29178/files#diff-e55ccc955dc3ddac57a0e77d1a8fc944R947 and I wouldn't be able to catch on a the specific exception. I could workaround the second issue easily by catching the parent exception and then checking the exact type via reflection and its error code but I'm not sure how to handle the first issue. We are speaking about ~1000 test cases here, switching to strings or ints instead of the enum value doesn't seem right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't be able to use attribute based testing if the RegexParseError enum isn't public
Sure you would. Either just use the numeric values in the attributes, or for strong-typing, create your own copy of the enum in your tests (or compile the same enum definition into your test), and cast to int.
I wouldn't be able to catch on a the specific exception
You can easily work around that, e.g. pseudo-code
private static readonly Type s_parsingExceptionType = typeof(Regex).Assembly.GetType(...);
private static void AssertParsingException(Action action) => Assert.Throws(s_parsingExceptionType, action);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes lets please try and avoid InternalsVisibleTo as it causes coupling that is too tight to one particular implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using reflection instead of IVT also makes it easier for us in Mono to consume the tests since our test assemblies have different names 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for the input.
@@ -420,7 +423,7 @@ private RegexNode ScanRegex() | |||
break; | |||
|
|||
default: | |||
throw MakeException(SR.InternalError); | |||
throw MakeException(RegexParseError.InternalError, SR.InternalError); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem like this should be a parse/argument exception. this should just be something like an InvalidOperationException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a breaking change right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically you literally should not be able to hit this. This is an internal error in the regex lirary library itself. It's the difference between improper input from teh user, versus a bug in your own code.
It's no more of a breaking change than the change you made to stop throwing an 'index out of bounds error' based off of the bugs i filed. Technically someone could have been dependent on internal details about how your code crashes... but i don't think .net ever tries to preserve internal crashing behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran all the existing tests + the roslyn ones and never hit the internal error. I agree with @CyrusNajmabadi that changing to InvalidOperationException make more sense here. Are you ok with that Dan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you, but I could turn this on its head and say that if we never throw it, it doesn't matter enough to change the type!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just read through the code and both places where the internal errors are thrown can never be reached unless someone changes the detection (the helper functions) of what a quantifier or a special character is. We shouldn't bother too much with that.
[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you intending to implement this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the code line that Github displays here seems off. Are you talking about the TODO? If yes then my answer is yes :)
[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are probably fine to run with subtree checks, now that @danmosemsft put a limit on the memory allocation for these types of regexes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After implementing the subtree check I will run all the excluded ones again and move if not failing.
[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be no problem for you guys, i don't tink. You don't use actual stack recursion when parsing. This was a special test for Roslyn because my parser is recursive descent, and so i needed to make sure something complex like this bailed out properly without causing a StackOverflowException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are talking about the DeepRecursion test, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not keep it, tests ideally shouldn't assume the choice of implementation
[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this and the 'if' above coudl just be made into a simple assert right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry which code line are referring to? probably not the one that Github displays here...
[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm unfamliar with what subtree means in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SubTree is something i did in the roslyn tests. specifically, given any input string we mangle it in many different ways and validate that we dont' crash on any of the mangled names. Examples of how i mangled:
- take the starting string and independently remove each character from it. i.e.
"cat" -> ["at", "ct", "ca"]
. test all those cases. - take the starting string and remove larger and larger prefixes/suffixes. i.e.
"cat" -> ["at", "t", "ca", "c"]
.
etc. etc.
basically, a minor form of fuzzing for regexes. Note: actual fuzzing would be highly appropriate here. Regex parsing is basically the poster child for that sort of an approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and believe we should do at least a one-shot fuzzing over our existing engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: if you have an actual fuzzer that .net uses, let me know. If you create a test suite, i would lke to run it against the roslyn parser. thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll follow up on this. We do have some internal tools.
@@ -1143,7 +1143,9 @@ private static string SetFromProperty(string capname, bool invert, string patter | |||
} | |||
} | |||
} | |||
throw new ArgumentException(SR.Format(SR.MakeException, pattern, SR.Format(SR.UnknownProperty, capname))); | |||
|
|||
throw new RegexParseException(RegexParseError.UnknownProperty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orthogonal to this change, but at some point we could include the offset into the pattern, which might be helpful for large patterns or such things like mismatched parens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. That information should already be accessible in some of the private fields in the RegexParser therefore adding them in the custom exception should be straightforward.
UnknownProperty // Unicode block, \p{Property} | ||
} | ||
|
||
internal class RegexParseException : ArgumentException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have much precedent for throwing internal exception types to the user? If they have interesting extra information on them I suspect that enough users will use reflection to get that info that we effectively have made it public anyway. Probably we should follow up and get approval for making this public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious about this myself. Especially the use of enum error codes. It seems like you're signign yourself up to very specific ways of parsing in order to preserve the error codes you emit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The owners of the System.Xml parser did not want to include error codes for exactly this reason. Given we're now SxS, the concern about breaks is a little lower. Certainly changing the error code in an exception is much less likely to break folks than changing a success case to an error case. And changing to another existing code a little less likely than changing to a newly added error code. I think it's probably OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: it's potentially worth pointing out that Roslyn (which exposes parses and handles tons of input error cases) specifically does not expose any type of enum for this sort of thing. Turns out, in practice, it's not that useful, and it's a bit of a maintenance burden.
instead, it simply returns back trees along with "diagnostics" which are effectively opaque objects that just say "here's a message about what's wrong, and here's the location it happened". That's been highly sufficient for our needs (and we specifically expose this stuff so people can analyze code effectively).
This is also the model i took for the new Roslyn regex parser.
- outside of stack-overflows, you always get a tree back.
- the tree always contains all input text of the original user input (including whitespace/comments).
- issues are returned as diagnostics that are provided along with the tree. diagnostics are simply a message, along with a position/span of relevance. This is used for doing things like placing a squiggle appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to think of why a caller might want the error code and the only use case I can think of is something like what you're doing, ie., a syntactic checker for regexes. Not a mainstream scenario. How did you end up doing this -- did you copy and instrument our parser, or write your own?
The offset is the most broadly interesting thing that's missing, and we're not adding it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you end up doing this -- did you copy and instrument our parser, or write your own?
i rewrote it entirely, using the reference specification, and then diving into your impl to see how things deviated at times. For example, character class parsing is exceptionally strange (and almost certainly unintentionally buggy) in the native parser. But i replicated it in on our end because compatibility with the core parser is the #1 priority. Tons of tests were added (i.e. the basis of htis PR) to validate the same observable behavior between the two systems.
Note: i originally tried to just fork and use the native code. However, it was far too difficult as the existing code is designed to generate a lossy AST in place, which it continually mutates as it proceeds. Trying to work off of that was just too difficult.
When necessary, i also took code wholesale. For example, the unicode handling code was copied practically verbatim here: https://github.com/dotnet/roslyn/pull/23984/files#diff-f689c52f48c471beeaea4252031ed8f3
Currently, i know of only one place where i deviated slightly. And the only observable effect of that deviation is that roslyn will not emit an error with a very strange regex when parsing in ecmascript mode. I was ok with that deviation because it was such a corner corner corner case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you're signign yourself up to very specific ways of parsing in order to preserve the error codes you emit.
Given it's a field on an internal type, I do think we would be OK with letting this value change with parser modifications. If the field is public I think it's something to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point why I added the enum in the first place was that I wanted to assert that specific patterns cause specific errors. With that I can be more precise and catch only cases that I expect to fail. I noticed that in your test bed you catched on various error cases like ArgumentException, InvalidOperationException, ... globally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: my tests only cared about ArgumentException. The other exceptions were catching the internal parser errors that i reported and dan previously fixed :)
The enum would certainly be helpful, to ensure that the expected error is happening, and it's not some other 'argument' issue. However, as this enum is internal, that won't really help out roslyn that much :) Given that, i have no horse in this race since this choice is entirely internal and just about helping you test things.
If you do end up making these public, i would have a lot more feedback on things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the current goal is to enrich the current test suite with your tests. At a later point we can definitely talk about exposing more diagnostics to converge different parser implementations.
NestedQuantify, | ||
QuantifyAfterNothing, | ||
TooManyParens, | ||
UnknownProperty // Unicode block, \p{Property} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"UnknownUnicodeProperty" mayber?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming the options is on my todo, though will post-pone till the other points and the subexpression check is done.
|
||
[Theory] | ||
// \d, \D, \s, \S, \w, \W, \P, \p inside character range | ||
[InlineData(@"cat([a-\d]*)dog", RegexOptions.None)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you end up retaining all these existing tests?
that's probably a good idea unless there is blatant duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jep, I ported/moved all the tests directly to RegexParserTetsts. Nothing should have slipped out of my radar.
[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)] | ||
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add comments explaining what this verification is doing/why
I imagine you'll have a small number you'll need to disable for NETFX runs (since we fixed a few parser bugs) but otherwise I hope to see all these pass on both NETFX and Core. |
The entire RegexParserTests.cs file is currently disabled for Netfx because I was using internals but as I now changed it to reflection I could reenable them. But I would need to add additional checks for netfx because we don't have a RegexParseException there but an ArgumentException instead. I notice that it seems important to you to run these tests also on netfx so I will add some extra code to handle that scenario. |
Ha, yes you're right: reason is that our regex code has diverged significantly from netfx, and netfx is the the only oracle that we can use to verify we aren't going off course. That is the fixed point that makes it safe for us to do the kind of refactoring you're doing. |
|
||
namespace System.Text.RegularExpressions | ||
{ | ||
internal class RegexParseException : ArgumentException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be [Serializable] right? It's not something that's inherited, so without that, you're breaking serializing the exceptoins from the parser.
And you'll need to add that test case ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks! And I also need to mask it as an ArgumentException via SetType to not break deserializability on netfx...
{ | ||
return new ArgumentException(SR.Format(SR.MakeException, _pattern, message)); | ||
return new RegexParseException(error,_currentPos, SR.Format(SR.MakeException, _pattern, message)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit spacing
/// <summary> | ||
/// The offset in the supplied pattern. | ||
/// </summary> | ||
public int Position { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since users currently can't get to this (without reflection) maybe include it in the message?
Tests are now all passing (except the ones I disabled for netfx 😁). Now I will tackle the remaining two tasks. |
Finished with the planned work here. PTAL |
so you think the substring manipulation doesn't bring any value here? |
The substring manipulations are very valuable (they uncovered most of the bugs i found in the .net parser). However, my point is simply, the roslyn tests use strings of the form: this is because we're generating a c# literal token first and then operating on that. You don't need/want to do that. You want the contents inside that string. |
|
||
namespace System.Text.RegularExpressions | ||
{ | ||
[Serializable] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test that breaks when this attribute (or SetType below) is missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added one.
[InlineData(@"""(?#)(?#)""", RegexOptions.IgnorePatternWhitespace, null, true)] | ||
[InlineData(@"""(?#) (?#)""", RegexOptions.None, null, true)] | ||
[InlineData(@"""(?#) (?#)""", RegexOptions.IgnorePatternWhitespace, null, true)] | ||
[InlineData(@"@""[a\p{Lu}(?#)b]""", RegexOptions.None, null, true)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the nested @ for? I wasn't aware it was special in regexes - but maybe I'm wrong as I see char ch = '@'; // nonspecial ch, means at beginning
in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested @""...""
should be removed. This is a roslyn-test-ism since i'm first generating a C# string token. This is because the entire Roslyn embedded-regex stack starts with a String-token which then gets analyzed. I probably should have written my tests to not have that code and to instead just have the regex contents . I could have wrapped those tests in the test harness with @"..."
instead. But i wasn't thinking about these being reused here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CyrusNajmabadi now I feel bad that you have to explain several times. When Dan asked I was nearly done changing the test data 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel bad i put you through this! It was a silly shortcut i didn't think about when writing these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't feel bad :) You helped us a lot with the test cases. And I know that you are waiting for a response in the other thread, I'll answer as soon as possible.
e8ae26f
to
a239c6c
Compare
Status: I'm currently trying to repro the error on various machines... |
…num, compile parser tests on netfx
a239c6c
to
e0dcc0f
Compare
d0888cf
to
62474a1
Compare
@ViktorHofer Any thoughts about the probable bug in character class parsing htat i pulled you into the discussion on? |
* Add regex roslyn tests * Offset & serialization in exception, remove internal error in error enum, compile parser tests on netfx * Move match parser tests over to RegexParserTests * Add subexpression parsing * Add test case for SetType in RegexParseException Commit migrated from dotnet/corefx@455f969
Fixes https://github.com/dotnet/corefx/issues/28508
Fixes https://github.com/dotnet/corefx/issues/27618
RegexParseException
that derives from ArgumentException to provide internal error informationRegexParseError
.RegexParseError
for negative tests to be more precise (existing tests + roslyn tests)TODO: