BadTokenizationError caused by Unicode control character c2a0 in document body #945
Comments
This issue came from inconsistent stripping of whitespace characters. I spent the last few days hunting down every strip() call and either making sure it does what it is supposed to do or changing the documentation to reflect what it actually does.
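To illustrate the kind of inconsistency described above, here is a minimal Python sketch. It is not taken from the pymarkdown code base; the sample string and variable names are made up for illustration. It shows that `str.strip()` with no argument removes all Unicode whitespace, including NO-BREAK SPACE (U+00A0), while `strip()` with an explicit ASCII character set leaves it in place.

```python
# Illustrative only: not pymarkdown's actual code.
NBSP = "\u00a0"  # NO-BREAK SPACE; encoded as bytes c2 a0 in UTF-8

# Hypothetical line with an NBSP at each end.
line = f"{NBSP} heading text {NBSP}"

# No argument: strips every character Python treats as whitespace, NBSP included.
unicode_stripped = line.strip()

# Explicit set: strips only spaces and tabs, so the leading/trailing NBSP stop it.
ascii_stripped = line.strip(" \t")

print(repr(unicode_stripped))
print(repr(ascii_stripped))
```

If one part of a parser uses the first form and another part uses the second, the two can disagree about where the content of a line starts, which is one plausible way for a tokenizer to end up in an inconsistent state.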
jackdewinter added a commit that referenced this issue Jan 26, 2024
jackdewinter added a commit that referenced this issue Jan 26, 2024
jackdewinter added a commit that referenced this issue Jan 26, 2024
jackdewinter added a commit that referenced this issue Jan 27, 2024
Closing due to lack of confirmation after double-checking.
Hi, all!
I have a lot of Markdown documents that were converted from OneNote notebooks a long time ago. The conversion scattered NO-BREAK SPACE characters (U+00A0, bytes c2 a0 in UTF-8) through random files. Every time a file containing this character is run through pymarkdownlnt, it causes a BadTokenizationError.
I have created a demo file that causes the issue: test.md
Here are a few c2a0-s: [Edit: GitHub converted them back to normal spaces. Download the file above.]
Call:
pymarkdownlnt --log-level DEBUG scan test.md
Here is the exact error and last several lines of DEBUG-level logs:
Here is also a hash dump of the file above:
I will be patching the offending files, but I am also reporting the issue, since a stray whitespace character should not break the program.
I hope this helps,
Bugzey
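For anyone who needs the same workaround, patching the offending files could look roughly like the Python sketch below. The function names (`find_nbsp`, `replace_nbsp`, `patch_file`) are hypothetical helpers written for this example, and the sketch assumes the files are UTF-8 encoded and that replacing each NO-BREAK SPACE with an ordinary space is acceptable for the documents in question.

```python
from pathlib import Path

NBSP = "\u00a0"            # NO-BREAK SPACE as a Python string
NBSP_UTF8 = b"\xc2\xa0"    # the same character as UTF-8 bytes


def find_nbsp(text: str) -> list[tuple[int, int]]:
    """Return 1-based (line, column) positions of every NO-BREAK SPACE."""
    return [
        (line_no, col)
        for line_no, line in enumerate(text.splitlines(), start=1)
        for col, ch in enumerate(line, start=1)
        if ch == NBSP
    ]


def replace_nbsp(text: str) -> str:
    """Replace every NO-BREAK SPACE with an ordinary space."""
    return text.replace(NBSP, " ")


def patch_file(path: Path) -> int:
    """Rewrite a UTF-8 file in place and return how many characters were replaced.

    Working on bytes leaves line endings and all other content untouched.
    """
    data = path.read_bytes()
    count = data.count(NBSP_UTF8)
    if count:
        path.write_bytes(data.replace(NBSP_UTF8, b" "))
    return count


sample = "a\u00a0b\nplain\n\u00a0x"
print(find_nbsp(sample))
print(repr(replace_nbsp(sample)))
```

Running `find_nbsp` first makes it possible to review where the characters are before `patch_file` changes anything on disk.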