Glyphless font in pdf leads to spaces between characters #373
Comments
Please do me a favor and take a look at 2.pdf which is an attachment towards the bottom of |
Funny thing: Alfresco uses pdf.js as its PDF viewer, and the search in pdf.js actually works, meaning pdf.js doesn't put extra spaces between the characters. 2.pdf shows the problem neither in pdf.js nor when the text is extracted with PDFBox. So to summarize:
See attached pdf, which displays those problems: test-out-git.zip |
Good, because that is the future for Tesseract PDF output. 2.pdf has minor changes in the metrics of both the embedded font and the PDF itself. I can't guarantee that this is going to work with every document, because PDF text extraction relies heavily on heuristics. (Root cause: the PDF spec.) |
Is there any roadmap for this rewritten (as I understand) pdf generation? |
It's more of a tweak than a rewrite. For logistical reasons, I hand all my changes to Ray who then merges them into the git repo. Ray is awesome in almost every way, but he is notoriously slow at this. I've already done the handoff. |
Note to other people who run into this problem with pdfsandwich and end up here suspecting Tesseract: this is actually a problem with Ghostscript. pdfsandwich converts the images to PPM and hands those to Tesseract. Since those files carry no resolution/DPI information, Tesseract outputs a huge PDF (0.9 by 1.20 metres for A4), but with correct text (i.e. without spaces between letters). pdfsandwich then runs this PDF through Ghostscript to resize it back to A4, and this step is what actually messes up the words. The author of pdfsandwich has a pre-release version 0.1.5 which uses TIF images instead of PPM. Those contain resolution information, so the PDF Tesseract produces already has the correct page size. (Side note: Tesseract seems to ignore resolution information from PNG files.) |
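To make the page-size arithmetic concrete, here is a minimal sketch. It assumes an A4 page scanned at 300 DPI (2480 x 3508 pixels, a hypothetical example) and assumes that Tesseract falls back to 70 DPI when the image carries no resolution information; neither number comes from this thread, and the function name is made up for illustration.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px: int, height_px: int, dpi: float) -> tuple[float, float]:
    """Physical page size implied by pixel dimensions at a given resolution."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


# Hypothetical A4 scan at 300 DPI.
SCAN_PX = (2480, 3508)

# With the real resolution the page comes out at roughly A4 (210 x 297 mm).
with_dpi = page_size_mm(*SCAN_PX, dpi=300)

# Assumed fallback of 70 DPI when the file has no resolution metadata.
without_dpi = page_size_mm(*SCAN_PX, dpi=70)

print(f"with DPI info:    {with_dpi[0]:.0f} x {with_dpi[1]:.0f} mm")
print(f"without DPI info: {without_dpi[0]:.0f} x {without_dpi[1]:.0f} mm")
```

Under these assumptions the second page is about 0.9 m wide, which is in the same range as the oversized output described above. The exact height depends on the scan resolution and on the fallback value actually used.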
Wait, what? That's not expected at all. Please provide an example PNG file demonstrating |
Back to the spaces thing, I'd appreciate a retest once Tesseract pdf.ttf font matches the $ md5sum pdf.ttf |
The new pdf.ttf has landed in master and the 3.05 branch. |
Tried to re-test this, but got the following error when running pdfsandwich + tesseract.
... And this goes on and on |
Please test with Tesseract alone, and please provide the exact command you use to run it. |
I tried that after with this command: tesseract -l nld+eng pdfsandwich45aaf9.tif -pdf |
You used the wrong command. It should be something like this: |
Same error, I'm afraid. |
The tessdata repository contains data files for 4.00, but you are using Tesseract 3.05... |
Okay, that was a bit silly on my end. |
I am trying to use Tesseract to directly generate PDFs with an OCR'ed text layer. This is one of several steps pdfsandwich takes to create searchable PDFs.
The result of the Tesseract subprocess is a PDF with an image and a text layer, and it is perfectly searchable.
Probably because of the high-resolution input, the dimensions of the resulting PDF are very large, which pdfsandwich solves by resizing the pages to more reasonable dimensions.
After this resize, when I open the file in, for example, Acrobat Reader DC, all recognized characters are separated by extra spaces. Where the text used to read 'hello', it now reads 'h e l l o', so a search for 'hello' finds nothing.
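For anyone who wants to check extracted text for this symptom automatically, here is a rough sketch. It is only a heuristic, not part of Tesseract, pdfsandwich, or PDFBox: the function name and the 0.8 threshold are assumptions, and it expects text that has already been extracted from the PDF by some other tool.

```python
def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Guess whether extracted text has a space between every character.

    Returns True when at least `threshold` of the whitespace-separated
    tokens are exactly one character long. Very short inputs are
    ignored because they give no reliable signal.
    """
    tokens = text.split()
    if len(tokens) < 4:
        return False
    single_char_tokens = sum(1 for token in tokens if len(token) == 1)
    return single_char_tokens / len(tokens) >= threshold


print(looks_letter_spaced("h e l l o"))                 # symptom present
print(looks_letter_spaced("hello world, this is fine")) # normal text
```

The check only flags the problem; it cannot repair the text reliably, because once every character is a separate token the original word boundaries are lost.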
A more technical explanation about this problem is in this thread: http://bugs.ghostscript.com/show_bug.cgi?id=696116
I thought I had a work around for this, by specifying a smaller DW than the default 500:
This solves the issue in Acrobat reader.
But when I put this file into Alfresco DMS, which uses PDFBox 1.8.4, I get the same problem again: I can only find words when I put spaces between the characters.
Setting the DW to a number smaller than 250 compromises the text in the OCR'ed layer, so that's not an option.
Is there any way to give the font a proper width, so that most PDF tools can read the text correctly?