BadTokenizationError caused by Unicode control character c2a0 in document body #945

Closed
Bugzey opened this issue Jan 5, 2024 · 2 comments


Bugzey commented Jan 5, 2024

Hi, all!

I have a lot of Markdown documents that were converted from OneNote notebooks a long time ago. The conversion strewed NO-BREAK SPACE characters (U+00A0, which UTF-8 encodes as the bytes c2 a0) across random files. Every time a file containing this character is run through pymarkdownlnt, the scan fails with a BadTokenizationError.

I have created a demo file that causes the issue: test.md

Here are a few c2a0-s:                                     [Edit: GitHub converted them back to normal spaces. Download the file above.]

Call: pymarkdownlnt --log-level DEBUG scan test.md

Here is the exact error and last several lines of DEBUG-level logs:

DEBUG:pymarkdown.general.tokenized_markdown:cob-end>>['[atx(1,1):1:0:]', '[text(1,1):Title:\t]', '[end-atx::]', '[BLANK(2,1):]', '[para(3,1):]', '[text(3,1):The next line contains UTF characters c2a0 (NO-BREAK SPACE)::]', '[end-para:::True]', '[BLANK(4,1):]']
DEBUG:pymarkdown.general.tokenized_markdown:cob-end>>new_tokens>>[]
DEBUG:pymarkdown.general.tokenized_markdown:list_stack_index>>0
DEBUG:pymarkdown.general.tokenized_markdown:block_stack_index>>0
DEBUG:pymarkdown.general.tokenized_markdown:hbl>>new_tokens>>[]
INFO:pymarkdown.file_scan_helper:Ending file 'test.md' with exception.
WARNING:pymarkdown.main:Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.
Traceback (most recent call last):
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 191, in __parse_blocks_pass
    ) = self.__parse_blocks_pass_next_line(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 252, in __parse_blocks_pass_next_line
    ) = self.__main_pass_did_not_start_close(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 344, in __main_pass_did_not_start_close
    ) = self.__handle_blank_line(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 834, in __handle_blank_line
    assert non_whitespace_index == len(input_line)
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 128, in __transform
    first_pass_results = self.__parse_blocks_pass(do_add_end_of_stream_token)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 201, in __parse_blocks_pass
    raise BadTokenizationError(error_message) from this_exception
pymarkdown.general.bad_tokenization_error.BadTokenizationError: A project assertion failed on line 5 of the current document.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/main.py", line 418, in main
    scan_result = self.__scan_files_if_no_errors(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/main.py", line 355, in __scan_files_if_no_errors
    did_fix_any_files = fsh.process_files_to_scan(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 90, in process_files_to_scan
    self.__scan_specific_file(next_file, next_file)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 131, in __scan_specific_file
    self.__scan_file(source_provider, next_file_name)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 152, in __scan_file
    actual_tokens = self.__tokenizer.transform_from_provider(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 97, in transform_from_provider
    return self.__transform(do_add_end_of_stream_token)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 144, in __transform
    raise BadTokenizationError(
pymarkdown.general.bad_tokenization_error.BadTokenizationError: An unhandled error occurred processing the document.


Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.

Here is also a hex dump of the file above:

$ xxd test.md
00000000: 2309 5469 746c 650a 0a54 6865 206e 6578  #.Title..The nex
00000010: 7420 6c69 6e65 2063 6f6e 7461 696e 7320  t line contains
00000020: 5554 4620 6368 6172 6163 7465 7273 2063  UTF characters c
00000030: 3261 3020 284e 4f2d 4252 4541 4b20 5350  2a0 (NO-BREAK SP
00000040: 4143 4529 3a0a 0ac2 a0c2 a0c2 a0c2 a0c2  ACE):...........
00000050: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000060: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000070: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000080: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000090: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
000000a0: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
000000b0: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
000000c0: a0c2 a0c2 a0c2 a0c2 a0c2 a00a 0a54 6869  .............Thi
000000d0: 7320 7061 6765 2073 686f 756c 6420 6272  s page should br
000000e0: 6561 6b20 7079 6d61 726b 646f 776e 6c6e  eak pymarkdownln
000000f0: 740a                                     t.
  • Pymarkdownlnt version: 0.9.15
  • Python version: Python 3.10.12
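Since GitHub normalizes the characters when they are pasted, the file can also be rebuilt from the hex dump above. The following is a minimal sketch that assumes the dump is complete; `build_demo_text` and `write_demo_file` are names made up for this example and are not part of pymarkdownlnt.

```python
from pathlib import Path

NBSP = "\u00a0"  # NO-BREAK SPACE, encoded in UTF-8 as the bytes c2 a0


def build_demo_text(nbsp_count: int = 66) -> str:
    """Rebuild the text of test.md as shown in the hex dump.

    The dump contains 132 bytes of repeated c2 a0, i.e. 66 NO-BREAK SPACEs.
    """
    lines = [
        "#\tTitle",  # the dump shows a tab (09) after the '#'
        "",
        "The next line contains UTF characters c2a0 (NO-BREAK SPACE):",
        "",
        NBSP * nbsp_count,  # line 5, the line named in the error message
        "",
        "This page should break pymarkdownlnt",
    ]
    return "\n".join(lines) + "\n"


def write_demo_file(path: Path) -> int:
    """Write the demo document as UTF-8 bytes and return the byte count."""
    data = build_demo_text().encode("utf-8")
    path.write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # The hex dump ends at offset 0xf2, so 242 bytes are expected.
    print(write_demo_file(Path("test.md")))
```

Running `pymarkdownlnt --log-level DEBUG scan test.md` against the resulting file should then exercise the same input as the attached one.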

I will be patching the offending files, but I am also reporting the issue, since a stray whitespace character should not crash the program.
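For anyone hitting the same error, the patching could look roughly like the sketch below. It assumes the files are UTF-8 and that every NO-BREAK SPACE can safely become a regular space, which is not true for documents that use the character on purpose. `replace_nbsp_bytes` and `patch_file` are hypothetical helper names.

```python
from pathlib import Path

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 NO-BREAK SPACE


def replace_nbsp_bytes(data: bytes) -> bytes:
    """Replace every UTF-8 encoded NO-BREAK SPACE with an ASCII space.

    Working on bytes leaves line endings and all other content untouched.
    In valid UTF-8, the pair c2 a0 can only ever mean U+00A0.
    """
    return data.replace(NBSP_UTF8, b" ")


def patch_file(path: Path) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    original = path.read_bytes()
    patched = replace_nbsp_bytes(original)
    if patched == original:
        return False
    path.write_bytes(patched)
    return True


if __name__ == "__main__":
    # Patch every Markdown file below the current directory.
    for markdown_file in sorted(Path(".").rglob("*.md")):
        if patch_file(markdown_file):
            print(f"patched {markdown_file}")
```

Keeping a backup or running this inside version control is advisable, because the rewrite is done in place.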

I hope this helps,
Bugzey

@jackdewinter
Owner

This was an issue where the stripping of whitespace characters was inconsistent. I spent the last few days hunting down every strip() call and either ensuring that it does what it is supposed to do or changing the documentation to properly reflect what it should be doing.
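To illustrate the kind of inconsistency described here, the sketch below shows how two whitespace checks can disagree about a line made only of NO-BREAK SPACEs. It is not the pymarkdown code: `first_non_whitespace_index` is a hypothetical helper that skips only spaces and tabs, written to mirror the `non_whitespace_index == len(input_line)` assertion in the traceback.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE
line = NBSP * 3

# str.strip() with no argument removes all Unicode whitespace,
# and U+00A0 counts as whitespace for Python (str.isspace() is True).
looks_blank = line.strip() == ""


def first_non_whitespace_index(text: str, whitespace: str = " \t") -> int:
    """Hypothetical helper: index of the first character not in `whitespace`."""
    index = 0
    while index < len(text) and text[index] in whitespace:
        index += 1
    return index


index = first_non_whitespace_index(line)

# One check says the line is blank, the other stops at position 0,
# so an assertion like `index == len(line)` would fail.
print(looks_blank, index, len(line))  # prints: True 0 3
```

Passing an explicit character set, as in `line.strip(" \t")`, keeps the two checks in agreement, because only the listed characters are removed.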

@jackdewinter
Owner

Closing due to lack of confirmation after double-checking.
