German text recognized incorrectly #13226

Closed
rettetdemdativ opened this issue Apr 12, 2021 · 1 comment · Fixed by #13424
@rettetdemdativ

https://boersengefluester.de/wp-content/uploads/assets/annuals/2019/578560.pdf

Configuration:

  • PDF.js version: 2.7.570

Steps to reproduce the problem:

  1. Load the linked PDF
  2. On the second page, highlight the word "Mitarbeiterinnen" by double-clicking it

What is the expected behavior?
The word should be recognized as a single word and highlighted as such.

What went wrong?
Instead, it's recognized as 3 words: "Mit", "arbei", "terinnen"

This happens for many words in this PDF.

Is there anything that can be changed in how I use pdfjs-dist to make it recognize this as a single word? Just looking at the PDF, I can't tell what would cause this behavior.

@rettetdemdativ changed the title from "German text incorrectly recognized" to "German text recognized incorrectly" on Apr 23, 2021
@calixteman
Contributor

calixteman commented May 2, 2021

In the PDF, we have this code to display "Mitarbeiterinnen":

[(M)5 (i)5 (t)]TJ
/Span<</ActualText<FEFF00AD>>> BDC 
1.516 0 Td
( )Tj
EMC 
T*
(arbei)Tj
/Span<</ActualText<FEFF00AD>>> BDC 
( )Tj
EMC 
2.306 0 Td
[(terinnen und Mitarbeiter arbeiten in \374ber 100 L\344ndern engagiert im\
 Dienste)5 ( )]TJ

There are two extra spaces (the ( )Tj operators).
Since there is no Td after the Tj, the spaces end up behind the character that follows them.
In any case, those spaces sit inside marked-content sections (BDC/EMC), and I didn't find anything in the spec on how to deal with that case.
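
For illustration only, here is a rough Python sketch of one possible way to read those marked-content sections. It is not how PDF.js handles them. It assumes the ActualText entry is meant to replace whatever is drawn between BDC and EMC, and it works on a hand-written, simplified list of operations (the names OPS, decode_pdf_hex_text and extract_text are made up for this example) rather than on a real parsed content stream. The hex string <FEFF00AD> is a UTF-16BE byte order mark followed by U+00AD, the soft hyphen.

```python
SOFT_HYPHEN = "\u00ad"


def decode_pdf_hex_text(hex_string):
    """Decode a PDF hex string such as '<FEFF00AD>' into a Python str."""
    digits = "".join(hex_string.strip().strip("<>").split())
    if len(digits) % 2:
        # A missing final hex digit is treated as 0.
        digits += "0"
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):
        # The FE FF byte order mark signals UTF-16BE text.
        return raw[2:].decode("utf-16-be")
    # Simplification: Latin-1 stands in for PDFDocEncoding here.
    return raw.decode("latin-1")


# Hypothetical, hand-simplified version of the fragment above:
# (operator, operand) pairs with positioning operators and kerning dropped.
# In this model every BDC operand is the ActualText hex string of the span.
OPS = [
    ("TJ", "Mit"),
    ("BDC", "<FEFF00AD>"),
    ("Tj", " "),
    ("EMC", None),
    ("Tj", "arbei"),
    ("BDC", "<FEFF00AD>"),
    ("Tj", " "),
    ("EMC", None),
    ("TJ", "terinnen"),
]


def extract_text(ops):
    """Join shown text, substituting ActualText for marked-content spans."""
    parts = []
    depth = 0  # nesting depth of open marked-content spans
    for operator, operand in ops:
        if operator == "BDC":
            if depth == 0:
                # Emit the replacement text once, for the outermost span.
                parts.append(decode_pdf_hex_text(operand))
            depth += 1
        elif operator == "EMC":
            depth = max(depth - 1, 0)
        elif operator in ("Tj", "TJ") and depth == 0:
            # Text drawn inside a span is skipped; ActualText replaces it.
            parts.append(operand)
    return "".join(parts)


text = extract_text(OPS)
word = text.replace(SOFT_HYPHEN, "")
```

Under that assumption the two ( )Tj spaces are replaced by soft hyphens, so text is "Mit", "arbei" and "terinnen" joined by U+00AD, and stripping the soft hyphens leaves a single word.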
