Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Porting roslyn regex tests to corefx #29178

Merged
merged 19 commits into from
May 1, 2018

Conversation

ViktorHofer
Copy link
Member

@ViktorHofer ViktorHofer commented Apr 18, 2018

Fixes https://github.com/dotnet/corefx/issues/28508
Fixes https://github.com/dotnet/corefx/issues/27618

  • Adding internal RegexParseException that derives from ArgumentException to provide internal error information RegexParseError.
  • Porting test cases from the roslyn PR over to corefx and changing them to InlineData instead of a method per test case.
  • Asserting on the specific RegexParseError for negative tests to be more precise (existing tests + roslyn tests)

TODO:

  • Adding sub expression checks (probably without asserting on the specific RegexParseError)
  • Refine RegexParseError options and add comments to each.

// See the LICENSE file in the project root for more information.

using System.Runtime.CompilerServices;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've tried to avoid doing this, preferring reflection if necessary instead. Other issues aside, now that this will prevent the Linker from shaking away dead code.
cc: @weshaggard

Copy link
Member Author

@ViktorHofer ViktorHofer Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know... I could change the RegexParseException and RegexParseError to be public in the src assembly only, whitelistening them and remove the extra RegexParser checks (https://github.com/dotnet/corefx/pull/29178/files#diff-e55ccc955dc3ddac57a0e77d1a8fc944R1001) in the test code. That would remove the InternalsVisibleTo requirement but would expose two types in the src assembly.

Copy link
Member

@stephentoub stephentoub Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change the RegexParseException and RegexParseError to be public in the src assembly only

Why do they need to be public? What can't you achieve using reflection?

As a principle, we do not want to negatively impact shipping assemblies in order to make testing a little easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be able to use attribute based testing if the RegexParseError enum isn't public: https://github.com/dotnet/corefx/pull/29178/files#diff-e55ccc955dc3ddac57a0e77d1a8fc944R947 and I wouldn't be able to catch on a the specific exception. I could workaround the second issue easily by catching the parent exception and then checking the exact type via reflection and its error code but I'm not sure how to handle the first issue. We are speaking about ~1000 test cases here, switching to strings or ints instead of the enum value doesn't seem right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't be able to use attribute based testing if the RegexParseError enum isn't public

Sure you would. Either just use the numeric values in the attributes, or for strong-typing, create your own copy of the enum in your tests (or compile the same enum definition into your test), and cast to int.

I wouldn't be able to catch on a the specific exception

You can easily work around that, e.g. pseudo-code

private static readonly Type s_parsingExceptionType = typeof(Regex).Assembly.GetType(...);
private static void AssertParsingException(Action action) => Assert.Throws(s_parsingExceptionType, action);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes lets please try and avoid InternalsVisibleTo as it causes coupling that is too tight to one particular implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using reflection instead of IVT also makes it easier for us in Mono to consume the tests since our test assemblies have different names 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the input.

@@ -420,7 +423,7 @@ private RegexNode ScanRegex()
break;

default:
throw MakeException(SR.InternalError);
throw MakeException(RegexParseError.InternalError, SR.InternalError);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem like this should be a parse/argument exception. this should just be something like an InvalidOperationException.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a breaking change right?

Copy link
Member

@CyrusNajmabadi CyrusNajmabadi Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you literally should not be able to hit this. This is an internal error in the regex lirary library itself. It's the difference between improper input from teh user, versus a bug in your own code.

It's no more of a breaking change than the change you made to stop throwing an 'index out of bounds error' based off of the bugs i filed. Technically someone could have been dependent on internal details about how your code crashes... but i don't think .net ever tries to preserve internal crashing behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran all the existing tests + the roslyn ones and never hit the internal error. I agree with @CyrusNajmabadi that changing to InvalidOperationException make more sense here. Are you ok with that Dan?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, but I could turn this on its head and say that if we never throw it, it doesn't matter enough to change the type!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just read through the code and both places where the internal errors are thrown can never be reached unless someone changes the detection (the helper functions) of what a quantifier or a special character is. We shouldn't bother too much with that.

[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you intending to implement this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the code line that Github displays here seems off. Are you talking about the TODO? If yes then my answer is yes :)

[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are probably fine to run with subtree checks, now that @danmosemsft put a limit on the memory allocation for these types of regexes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After implementing the subtree check I will run all the excluded ones again and move if not failing.

[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be no problem for you guys, i don't tink. You don't use actual stack recursion when parsing. This was a special test for Roslyn because my parser is recursive descent, and so i needed to make sure something complex like this bailed out properly without causing a StackOverflowException.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are talking about the DeepRecursion test, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep it, tests ideally shouldn't assume the choice of implementation

[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the 'if' above coudl just be made into a simple assert right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry which code line are referring to? probably not the one that Github displays here...

[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm unfamliar with what subtree means in this context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubTree is something i did in the roslyn tests. specifically, given any input string we mangle it in many different ways and validate that we dont' crash on any of the mangled names. Examples of how i mangled:

  1. take the starting string and independently remove each character from it. i.e. "cat" -> ["at", "ct", "ca"]. test all those cases.
  2. take the starting string and remove larger and larger prefixes/suffixes. i.e. "cat" -> ["at", "t", "ca", "c"].

etc. etc.

basically, a minor form of fuzzing for regexes. Note: actual fuzzing would be highly appropriate here. Regex parsing is basically the poster child for that sort of an approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and believe we should do at least a one-shot fuzzing over our existing engine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if you have an actual fuzzer that .net uses, let me know. If you create a test suite, i would lke to run it against the roslyn parser. thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll follow up on this. We do have some internal tools.

@@ -1143,7 +1143,9 @@ private static string SetFromProperty(string capname, bool invert, string patter
}
}
}
throw new ArgumentException(SR.Format(SR.MakeException, pattern, SR.Format(SR.UnknownProperty, capname)));

throw new RegexParseException(RegexParseError.UnknownProperty,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orthogonal to this change, but at some point we could include the offset into the pattern, which might be helpful for large patterns or such things like mismatched parens.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. That information should already be accessible in some of the private fields in the RegexParser therefore adding them in the custom exception should be straightforward.

UnknownProperty // Unicode block, \p{Property}
}

internal class RegexParseException : ArgumentException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have much precedent for throwing internal exception types to the user? If they have interesting extra information on them I suspect that enough users will use reflection to get that info that we effectively have made it public anyway. Probably we should follow up and get approval for making this public.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious about this myself. Especially the use of enum error codes. It seems like you're signign yourself up to very specific ways of parsing in order to preserve the error codes you emit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The owners of the System.Xml parser did not want to include error codes for exactly this reason. Given we're now SxS, the concern about breaks is a little lower. Certainly changing the error code in an exception is much less likely to break folks than changing a success case to an error case. And changing to another existing code a little less likely than changing to a newly added error code. I think it's probably OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: it's potentially worth pointing out that Roslyn (which exposes parses and handles tons of input error cases) specifically does not expose any type of enum for this sort of thing. Turns out, in practice, it's not that useful, and it's a bit of a maintenance burden.

instead, it simply returns back trees along with "diagnostics" which are effectively opaque objects that just say "here's a message about what's wrong, and here's the location it happened". That's been highly sufficient for our needs (and we specifically expose this stuff so people can analyze code effectively).

This is also the model i took for the new Roslyn regex parser.

  1. outside of stack-overflows, you always get a tree back.
  2. the tree always contains all input text of the original user input (including whitespace/comments).
  3. issues are returned as diagnostics that are provided along with the tree. diagnostics are simply a message, along with a position/span of relevance. This is used for doing things like placing a squiggle appropriately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to think of why a caller might want the error code and the only use case I can think of is something like what you're doing, ie., a syntactic checker for regexes. Not a mainstream scenario. How did you end up doing this -- did you copy and instrument our parser, or write your own?

The offset is the most broadly interesting thing that's missing, and we're not adding it here.

Copy link
Member

@CyrusNajmabadi CyrusNajmabadi Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you end up doing this -- did you copy and instrument our parser, or write your own?

i rewrote it entirely, using the reference specification, and then diving into your impl to see how things deviated at times. For example, character class parsing is exceptionally strange (and almost certainly unintentionally buggy) in the native parser. But i replicated it in on our end because compatibility with the core parser is the #1 priority. Tons of tests were added (i.e. the basis of htis PR) to validate the same observable behavior between the two systems.

Note: i originally tried to just fork and use the native code. However, it was far too difficult as the existing code is designed to generate a lossy AST in place, which it continually mutates as it proceeds. Trying to work off of that was just too difficult.

When necessary, i also took code wholesale. For example, the unicode handling code was copied practically verbatim here: https://github.com/dotnet/roslyn/pull/23984/files#diff-f689c52f48c471beeaea4252031ed8f3

Currently, i know of only one place where i deviated slightly. And the only observable effect of that deviation is that roslyn will not emit an error with a very strange regex when parsing in ecmascript mode. I was ok with that deviation because it was such a corner corner corner case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you're signign yourself up to very specific ways of parsing in order to preserve the error codes you emit.

Given it's a field on an internal type, I do think we would be OK with letting this value change with parser modifications. If the field is public I think it's something to think about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point why I added the enum in the first place was that I wanted to assert that specific patterns cause specific errors. With that I can be more precise and catch only cases that I expect to fail. I noticed that in your test bed you catched on various error cases like ArgumentException, InvalidOperationException, ... globally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: my tests only cared about ArgumentException. The other exceptions were catching the internal parser errors that i reported and dan previously fixed :)

The enum would certainly be helpful, to ensure that the expected error is happening, and it's not some other 'argument' issue. However, as this enum is internal, that won't really help out roslyn that much :) Given that, i have no horse in this race since this choice is entirely internal and just about helping you test things.

If you do end up making these public, i would have a lot more feedback on things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the current goal is to enrich the current test suite with your tests. At a later point we can definitely talk about exposing more diagnostics to converge different parser implementations.

NestedQuantify,
QuantifyAfterNothing,
TooManyParens,
UnknownProperty // Unicode block, \p{Property}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"UnknownUnicodeProperty" mayber?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming the options is on my todo, though will post-pone till the other points and the subexpression check is done.


[Theory]
// \d, \D, \s, \S, \w, \W, \P, \p inside character range
[InlineData(@"cat([a-\d]*)dog", RegexOptions.None)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you end up retaining all these existing tests?
that's probably a good idea unless there is blatant duplication.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jep, I ported/moved all the tests directly to RegexParserTetsts. Nothing should have slipped out of my radar.

[InlineData(@"@""(?<=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?!""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?=""", RegexOptions.None, RegexParseError.NotEnoughParens)]
[InlineData(@"@""(?imn )""", RegexOptions.None, RegexParseError.UnrecognizedGro
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments explaining what this verification is doing/why

@danmoseley
Copy link
Member

I imagine you'll have a small number you'll need to disable for NETFX runs (since we fixed a few parser bugs) but otherwise I hope to see all these pass on both NETFX and Core.

@ViktorHofer
Copy link
Member Author

I imagine you'll have a small number you'll need to disable for NETFX runs (since we fixed a few parser bugs) but otherwise I hope to see all these pass on both NETFX and Core.

The entire RegexParserTests.cs file is currently disabled for Netfx because I was using internals but as I now changed it to reflection I could reenable them. But I would need to add additional checks for netfx because we don't have a RegexParseException there but an ArgumentException instead. I notice that it seems important to you to run these tests also on netfx so I will add some extra code to handle that scenario.

@danmoseley
Copy link
Member

I notice that it seems important to you to run these tests also on netfx so I will add some extra code to handle that scenario.

Ha, yes you're right: reason is that our regex code has diverged significantly from netfx, and netfx is the the only oracle that we can use to verify we aren't going off course. That is the fixed point that makes it safe for us to do the kind of refactoring you're doing.


namespace System.Text.RegularExpressions
{
internal class RegexParseException : ArgumentException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be [Serializable] right? It's not something that's inherited, so without that, you're breaking serializing the exceptoins from the parser.

And you'll need to add that test case ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! And I also need to mask it as an ArgumentException via SetType to not break deserializability on netfx...

{
return new ArgumentException(SR.Format(SR.MakeException, _pattern, message));
return new RegexParseException(error,_currentPos, SR.Format(SR.MakeException, _pattern, message));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit spacing

/// <summary>
/// The offset in the supplied pattern.
/// </summary>
public int Position { get; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since users currently can't get to this (without reflection) maybe include it in the message?

@ViktorHofer
Copy link
Member Author

Tests are now all passing (except the ones I disabled for netfx 😁). Now I will tackle the remaining two tasks.

@ViktorHofer
Copy link
Member Author

Finished with the planned work here. PTAL

@karelz karelz added this to the 2.2.0 milestone Apr 21, 2018
@ViktorHofer
Copy link
Member Author

so you think the substring manipulation doesn't bring any value here?

@CyrusNajmabadi
Copy link
Member

The substring manipulations are very valuable (they uncovered most of the bugs i found in the .net parser). However, my point is simply, the roslyn tests use strings of the form: @"..."

this is because we're generating a c# literal token first and then operating on that. You don't need/want to do that. You want the contents inside that string.


namespace System.Text.RegularExpressions
{
[Serializable]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that breaks when this attribute (or SetType below) is missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added one.

[InlineData(@"""(?#)(?#)""", RegexOptions.IgnorePatternWhitespace, null, true)]
[InlineData(@"""(?#) (?#)""", RegexOptions.None, null, true)]
[InlineData(@"""(?#) (?#)""", RegexOptions.IgnorePatternWhitespace, null, true)]
[InlineData(@"@""[a\p{Lu}(?#)b]""", RegexOptions.None, null, true)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the nested @ for? I wasn't aware it was special in regexes - but maybe I'm wrong as I see char ch = '@'; // nonspecial ch, means at beginning in the code.

Copy link
Member

@CyrusNajmabadi CyrusNajmabadi Apr 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested @""..."" should be removed. This is a roslyn-test-ism since i'm first generating a C# string token. This is because the entire Roslyn embedded-regex stack starts with a String-token which then gets analyzed. I probably should have written my tests to not have that code and to instead just have the regex contents . I could have wrapped those tests in the test harness with @"..." instead. But i wasn't thinking about these being reused here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CyrusNajmabadi now I feel bad that you have to explain several times. When Dan asked I was nearly done changing the test data 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel bad i put you through this! It was a silly shortcut i didn't think about when writing these tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't feel bad :) You helped us a lot with the test cases. And I know that you are waiting for a response in the other thread, I'll answer as soon as possible.

@ViktorHofer
Copy link
Member Author

Status: I'm currently trying to repro the error on various machines...

@ViktorHofer ViktorHofer merged commit 455f969 into dotnet:master May 1, 2018
@ViktorHofer ViktorHofer deleted the RegexTestsRoslyn branch May 1, 2018 14:06
@CyrusNajmabadi
Copy link
Member

@ViktorHofer Any thoughts about the probable bug in character class parsing htat i pulled you into the discussion on?

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Add regex roslyn tests

* Offset & serialization in exception, remove internal error in error enum, compile parser tests on netfx

* Move match parser tests over to RegexParserTests

* Add subexpression parsing

* Add test case for SetType in RegexParseException


Commit migrated from dotnet/corefx@455f969
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants