Glyphless font in pdf leads to spaces between characters #373

Closed
ebogaard opened this issue Jul 19, 2016 · 17 comments

@ebogaard

I am trying to use tesseract to directly generate pdfs with an ocr'ed text layer. This is one of several steps pdfsandwich uses to create searchable pdfs.

The result of the tesseract subprocess is a pdf with an image and a text layer, and it is perfectly searchable.
Probably because of the high-resolution input, the dimensions of the resulting pdf are very large, which pdfsandwich solves by resizing the pages to more reasonable dimensions.

After this resize, when I open the file in, for example, Acrobat Reader DC, all recognized characters are separated by extra spaces. Where it used to read 'hello', it now reads 'h e l l o', so a search for 'hello' finds nothing.
A more technical explanation about this problem is in this thread: http://bugs.ghostscript.com/show_bug.cgi?id=696116

I thought I had a workaround for this, by specifying a smaller DW than the default 500:

--- api/pdfrenderer.cpp-orig       2016-07-14 14:55:53.299744815 +0200
+++ api/pdfrenderer.cpp    2016-07-14 15:16:23.619204071 +0200
@@ -543,7 +543,7 @@
                "  /FontDescriptor %ld 0 R\n"
                "  /Subtype /CIDFontType2\n"
                "  /Type /Font\n"
-               "  /DW %d\n"
+               "  /DW 250\n"
                ">>\n"
                "endobj\n",
                5L,         // CIDToGIDMap

This solves the issue in Acrobat Reader.
But when I put this file in Alfresco DMS, which uses PDFBox 1.8.4, I get the same problem again: I can only find words when I put spaces between the characters.

Setting the DW to a number smaller than 250 compromises the text in the ocr'ed layer, so that's not an option.
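
For readers unfamiliar with /DW: it is the default glyph width of a CIDFont, expressed in glyph-space units where 1000 units correspond to one unit of text space. The following is only a minimal sketch of what lowering it from 500 to 250 does to the horizontal advance of each glyph. The function name, the font size of 12 and the 100% horizontal scaling are illustrative assumptions, not values taken from Tesseract's renderer, and character and word spacing are assumed to be zero.

```python
def glyph_advance(dw, font_size, horizontal_scale_percent=100.0):
    """Horizontal advance of one glyph in unscaled text-space units.

    dw is the /DW value in glyph-space units (1000 units = 1 text-space unit).
    Character spacing and word spacing are assumed to be zero in this sketch.
    """
    return dw / 1000.0 * font_size * (horizontal_scale_percent / 100.0)


# Illustrative values only: a 12-unit font with the default and the patched /DW.
default_advance = glyph_advance(500, 12)
patched_advance = glyph_advance(250, 12)
print(default_advance, patched_advance)
```

Text extractors generally compare advances like these with the actual glyph positions when deciding where words begin and end, which is presumably why the declared width affects whether spaces appear.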

Is there any way to change the font type to a proper width, so most pdf-tools can properly read the text?

@jbreiden
Contributor

Please do me a favor and take a look at 2.pdf which is an attachment towards the bottom of
the following bug. Tell me if that demonstrates the same incompatibility.

mozilla/pdf.js#6863

@ebogaard
Author

Funny thing: Alfresco uses pdf.js as its pdf viewer, and the search in pdf.js actually works. Meaning: pdf.js doesn't put extra spaces between the characters.

2.pdf doesn't show the problem, either in pdf.js or when the text is extracted with PDFBox.

So to summarize:

  1. By default, there are extra spaces when converting or extracting text from pdfs generated by tesseract.
  2. I found a reasonable workaround by decreasing the '/DW' from 500 to 250. Because of this, the text isn't overlaid perfectly, but that is something I can live with for now.
  3. After this change, searching and copying/extracting text works for Acrobat Reader DC, ghostscript and pdf.js, but not for PDFBox.

See attached pdf, which displays those problems: test-out-git.zip

@jbreiden
Contributor

2.pdf doesn't show the problem, either in pdf.js or when the text is extracted with PDFBox.

Good, because that is the future of Tesseract PDF output. 2.pdf has minor changes to the metrics of both the embedded font and the PDF itself. I can't guarantee that this will work with every document, because PDF text extraction relies heavily on heuristics. (Root cause: the PDF spec.)

@ebogaard
Author

ebogaard commented Aug 2, 2016

Is there any roadmap for this rewritten (as I understand) pdf generation?

@jbreiden
Contributor

jbreiden commented Aug 2, 2016

It's more of a tweak than a rewrite. For logistical reasons, I hand all my changes to Ray who then merges them into the git repo. Ray is awesome in almost every way, but he is notoriously slow at this. I've already done the handoff.

@mbirth

mbirth commented Oct 28, 2016

Note to other people running into this problem with pdfsandwich and ending up here, suspecting Tesseract: this is actually a problem with Ghostscript. pdfsandwich converts the images to PPM and hands those to Tesseract. Since those files lack resolution/DPI information, Tesseract outputs a huge PDF (0.9 by 1.2 metres for A4), but with correct text (i.e. without spaces between letters). pdfsandwich then runs this PDF through Ghostscript to resize it back to A4, and this step is what actually messes up the words.

The author of pdfsandwich has a pre-release version 0.1.5 which now uses TIF images instead of PPM. And those contain resolution information, so the PDF Tesseract spits out is already in the correct format.
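
The effect of missing resolution information on page size can be illustrated with a little arithmetic. This is only a sketch under stated assumptions: the scan is assumed to be an A4 page at 300 dpi (2480 by 3508 pixels), and the fallback resolution is assumed to be 70 dpi. Neither value is given in this thread, and `page_size_mm` is a hypothetical helper.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Physical page size implied by a pixel size and a resolution."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


# Assumption: an A4 page scanned at 300 dpi.
A4_AT_300_DPI = (2480, 3508)

# With the true resolution the page comes out at A4 size (about 210 x 297 mm).
with_dpi = page_size_mm(*A4_AT_300_DPI, dpi=300)

# Assumption: a fallback of 70 dpi is used when the image carries no resolution.
without_dpi = page_size_mm(*A4_AT_300_DPI, dpi=70)

print("with resolution:    %.0f x %.0f mm" % with_dpi)
print("without resolution: %.0f x %.0f mm" % without_dpi)
```

Under these assumptions the page without resolution information is roughly 0.9 m wide and well over a metre tall, the same order of magnitude as the size reported above.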

(Side note: Tesseract seems to ignore resolution information from PNG files.)

@jbreiden
Contributor

Tesseract seems to ignore resolution information from PNG files.

Wait, what? That's not expected at all. Please provide an example PNG file demonstrating
the problem, and it will get attention right away.

@jbreiden
Contributor

Back to the spaces thing: I'd appreciate a retest once Tesseract's pdf.ttf font matches the
following checksum. (It currently does not.)

$ md5sum pdf.ttf
e436074b54ed9cc5bf4789f79059b01b pdf.ttf
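
The same check can be scripted, for example before re-running a test. This is a minimal sketch: the checksum is the one quoted above, while the helper names and any path passed to them (such as a `pdf.ttf` inside a local tessdata directory) are assumptions.

```python
import hashlib

# Checksum of the updated pdf.ttf, as quoted in this thread.
EXPECTED_MD5 = "e436074b54ed9cc5bf4789f79059b01b"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_is_updated(path):
    """True if the file at path matches the expected pdf.ttf checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `font_is_updated("pdf.ttf")` on the installed font file then tells you whether a retest is meaningful yet.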

@zdenop
Contributor

zdenop commented Nov 28, 2016

The new pdf.ttf has landed in master and the 3.05 branch.
@ebogaard: Can you re-test?

@ebogaard
Author

Tried to re-test this, but got the following error when running pdfsandwich + tesseract.
This is with a just-checked out and compiled tesseract-3.05-branch:

ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line ConvNL

... and this goes on and on, with binary garbage interleaved with more "ParamsModel::Incomplete line" and "ParamsModel::Unknown parameter" messages.

@zdenop
Contributor

zdenop commented Nov 29, 2016

Please test with tesseract alone, and please provide the command you use to run it.

@ebogaard
Author

I tried that afterwards with this command: tesseract -l nld+eng pdfsandwich45aaf9.tif -pdf
Same problem.

@zdenop
Contributor

zdenop commented Nov 29, 2016

You used the wrong command. It should be something like this:
tesseract pdfsandwich45aaf9.tif pdfsandwich45aaf9 -l nld+eng pdf
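
When tesseract is called from a script, the argument order shown above (input image, output base name, options, then the `pdf` config name last) is easy to get wrong. A minimal sketch of a wrapper follows; the function names are hypothetical, and it assumes a `tesseract` binary is on the PATH.

```python
import subprocess


def tesseract_pdf_command(image_path, output_base, languages):
    """Build the argument list: image, output base, -l LANGS, then 'pdf'."""
    return ["tesseract", image_path, output_base,
            "-l", "+".join(languages), "pdf"]


def run_tesseract_pdf(image_path, output_base, languages):
    """Run tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_pdf_command(image_path, output_base, languages),
                   check=True)


# Reproduces the command line from this comment.
print(" ".join(tesseract_pdf_command(
    "pdfsandwich45aaf9.tif", "pdfsandwich45aaf9", ["nld", "eng"])))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.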

@ebogaard
Author

Same error, I'm afraid.
I just downloaded new nld.traineddata & eng.traineddata files from here: https://github.com/tesseract-ocr/tessdata/
Might that have something to do with it?

@zdenop
Contributor

zdenop commented Nov 29, 2016

The tessdata repository contains 4.00 data files, and you are using Tesseract 3.05.
This is not supported. You need to use data files from the same or a lower Tesseract version (e.g. 3.04).

@amitdo
Collaborator

amitdo commented Nov 29, 2016

https://github.com/tesseract-ocr/tessdata/tree/3.04.00

@ebogaard
Author

Okay, that was a bit silly on my end.
After swapping the traineddata for the 3.04 versions, both tesseract and pdfsandwich+tesseract work.
The resulting pdfs from both tesseract and pdfsandwich look good, have a text layer and don't have any extra spaces between characters.
So this seems to be solved. Great!
