BadTokenizationError caused by Unicode control character c2a0 in document body #945

Bugzey · 2024-01-05T12:03:59Z

Hi, all!

I have a lot of Markdown documents that were converted from OneNote notebooks a long time ago. The conversion strewed NO-BREAK SPACE characters (U+00A0, encoded in UTF-8 as the bytes c2 a0) through random files. Every time a file containing this character is run through pymarkdownlnt, it causes a BadTokenizationError.

I have created a demo file that causes the issue: test.md

Here are a few c2a0-s: [Edit: GitHub converted them back to normal spaces. Download the file above.]

Call: pymarkdownlnt --log-level DEBUG scan test.md

Here is the exact error and last several lines of DEBUG-level logs:

DEBUG:pymarkdown.general.tokenized_markdown:cob-end>>['[atx(1,1):1:0:]', '[text(1,1):Title:\t]', '[end-atx::]', '[BLANK(2,1):]', '[para(3,1):]', '[text(3,1):The next line contains UTF characters c2a0 (NO-BREAK SPACE)::]', '[end-para:::True]', '[BLANK(4,1):]']
DEBUG:pymarkdown.general.tokenized_markdown:cob-end>>new_tokens>>[]
DEBUG:pymarkdown.general.tokenized_markdown:list_stack_index>>0
DEBUG:pymarkdown.general.tokenized_markdown:block_stack_index>>0
DEBUG:pymarkdown.general.tokenized_markdown:hbl>>new_tokens>>[]
INFO:pymarkdown.file_scan_helper:Ending file 'test.md' with exception.
WARNING:pymarkdown.main:Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.
Traceback (most recent call last):
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 191, in __parse_blocks_pass
    ) = self.__parse_blocks_pass_next_line(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 252, in __parse_blocks_pass_next_line
    ) = self.__main_pass_did_not_start_close(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 344, in __main_pass_did_not_start_close
    ) = self.__handle_blank_line(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 834, in __handle_blank_line
    assert non_whitespace_index == len(input_line)
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 128, in __transform
    first_pass_results = self.__parse_blocks_pass(do_add_end_of_stream_token)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 201, in __parse_blocks_pass
    raise BadTokenizationError(error_message) from this_exception
pymarkdown.general.bad_tokenization_error.BadTokenizationError: A project assertion failed on line 5 of the current document.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/main.py", line 418, in main
    scan_result = self.__scan_files_if_no_errors(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/main.py", line 355, in __scan_files_if_no_errors
    did_fix_any_files = fsh.process_files_to_scan(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 90, in process_files_to_scan
    self.__scan_specific_file(next_file, next_file)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 131, in __scan_specific_file
    self.__scan_file(source_provider, next_file_name)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 152, in __scan_file
    actual_tokens = self.__tokenizer.transform_from_provider(
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 97, in transform_from_provider
    return self.__transform(do_add_end_of_stream_token)
  File "/home/rdimitrov/.local/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 144, in __transform
    raise BadTokenizationError(
pymarkdown.general.bad_tokenization_error.BadTokenizationError: An unhandled error occurred processing the document.


Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.

Here is also a hex dump of the file above:

$ xxd test.md
00000000: 2309 5469 746c 650a 0a54 6865 206e 6578  #.Title..The nex
00000010: 7420 6c69 6e65 2063 6f6e 7461 696e 7320  t line contains
00000020: 5554 4620 6368 6172 6163 7465 7273 2063  UTF characters c
00000030: 3261 3020 284e 4f2d 4252 4541 4b20 5350  2a0 (NO-BREAK SP
00000040: 4143 4529 3a0a 0ac2 a0c2 a0c2 a0c2 a0c2  ACE):...........
00000050: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000060: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000070: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000080: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
00000090: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
000000a0: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
000000b0: a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2 a0c2  ................
000000c0: a0c2 a0c2 a0c2 a0c2 a0c2 a00a 0a54 6869  .............Thi
000000d0: 7320 7061 6765 2073 686f 756c 6420 6272  s page should br
000000e0: 6561 6b20 7079 6d61 726b 646f 776e 6c6e  eak pymarkdownln
000000f0: 740a                                     t.
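
For anyone who cannot download the attachment, the dump above can be turned back into a file. The following is a minimal Python sketch reconstructed from the xxd output, not the original attachment itself. The helper name `build_test_document` and the output path are illustrative, and the count of 66 NO-BREAK SPACE characters is read off the dump (bytes 0x47 through 0xca).

```python
from pathlib import Path

NBSP = "\u00a0"  # NO-BREAK SPACE, encoded in UTF-8 as the bytes c2 a0


def build_test_document(nbsp_count: int = 66) -> bytes:
    """Rebuild the bytes shown in the hex dump (hypothetical helper)."""
    lines = [
        "#\tTitle",
        "",
        "The next line contains UTF characters c2a0 (NO-BREAK SPACE):",
        "",
        NBSP * nbsp_count,  # line 5: nothing but NO-BREAK SPACE characters
        "",
        "This page should break pymarkdownlnt",
    ]
    # Every line, including the last, ends with a single LF (0a).
    return ("\n".join(lines) + "\n").encode("utf-8")


if __name__ == "__main__":
    # Assumed output path; adjust as needed.
    Path("test.md").write_bytes(build_test_document())
```

The resulting file should be 242 bytes long, matching the final offset of the dump (0xf0 plus two bytes).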

Pymarkdownlnt version: 0.9.15
Python version: Python 3.10.12

I will be patching the offending files, but I am also reporting the issue, since a stray whitespace character should not break the program.
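
One possible way to do that patching is a small script that swaps each NO-BREAK SPACE for an ordinary space. This is only a sketch: `replace_nbsp` and `patch_file` are hypothetical helper names, and it assumes the files are valid UTF-8 and that a plain space is an acceptable replacement. It works on raw bytes so that line endings and all other content are left untouched.

```python
from pathlib import Path

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 NO-BREAK SPACE


def replace_nbsp(data: bytes) -> bytes:
    """Replace every UTF-8 encoded NO-BREAK SPACE with an ASCII space."""
    return data.replace(NBSP_UTF8, b" ")


def patch_file(path: Path) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    original = path.read_bytes()
    patched = replace_nbsp(original)
    if patched == original:
        return False
    path.write_bytes(patched)
    return True
```

A call such as `patch_file(Path("test.md"))` would rewrite one file; looping over `Path(".").rglob("*.md")` is one way to cover a whole directory tree.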

I hope this helps,
Bugzey


jackdewinter · 2024-01-26T04:11:56Z

This was an issue where the stripping of whitespace characters was inconsistent. I spent the last few days hunting down every strip() call and either ensuring that it does what it is supposed to do or changing the documentation to properly reflect what it should be doing.
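
To illustrate the kind of inconsistency described above, here is a minimal Python sketch. It is not pymarkdown's code: `skip_spaces_and_tabs` is a hypothetical stand-in for a scanner that treats only space and tab as whitespace (which is how CommonMark defines a blank line), while argument-less `str.strip()` also removes U+00A0. When two such checks are applied to the same line, they can disagree in a way that resembles the failed assertion in the traceback.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE
line = NBSP * 3  # a line made up only of NO-BREAK SPACE characters


def skip_spaces_and_tabs(text: str) -> int:
    """Return the index of the first character that is not a space or tab.

    Hypothetical helper, written only for this illustration.
    """
    index = 0
    while index < len(text) and text[index] in " \t":
        index += 1
    return index


# Check 1: str.strip() with no arguments removes all Unicode whitespace,
# including U+00A0, so the line appears to be blank.
looks_blank = not line.strip()

# Check 2: a scanner that only skips space and tab stops immediately.
non_whitespace_index = skip_spaces_and_tabs(line)

print(looks_blank)                        # True
print(non_whitespace_index == len(line))  # False
```

Passing an explicit character set, as in `line.strip(" \t")`, is one way to make the first check agree with the second.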

* #945 * updating packages

jackdewinter · 2024-03-19T01:46:22Z

Closing due to lack of confirmation after double-checking.

jackdewinter added and removed the "waiting for issue fix verification" label Jan 20, 2024

jackdewinter added a commit that referenced this issue Jan 26, 2024

https://github.com/jackdewinter/pymarkdown/issues/945

a2e5e26

jackdewinter added a commit that referenced this issue Jan 26, 2024

https://github.com/jackdewinter/pymarkdown/issues/945

3d1b0ff

jackdewinter mentioned this issue Jan 26, 2024

https://github.com/jackdewinter/pymarkdown/issues/945 #961

Merged

jackdewinter added a commit that referenced this issue Jan 26, 2024

https://github.com/jackdewinter/pymarkdown/issues/945

6fec2c0

jackdewinter added a commit that referenced this issue Jan 27, 2024

#945 (#961)

67ee195

* #945 * updating packages

jackdewinter added the "waiting for issue fix verification" label Jan 28, 2024

jackdewinter closed this as completed Mar 19, 2024
