Glyphless font in pdf leads to spaces between characters #373

ebogaard · 2016-07-19T14:43:42Z

I am trying to use tesseract to directly generate pdfs with an OCR'ed text layer. This is one of the several steps pdfsandwich uses to create searchable pdfs.

The result of the tesseract subprocess is a pdf with an image and a text layer, and it is perfectly searchable.
Probably because of the high-resolution input, the dimensions of the resulting pdf are very large, which pdfsandwich solves by resizing the pages to more reasonable dimensions.

After this resize, when I open the file in, for example, Acrobat Reader DC, all recognized characters are separated by extra spaces. Where it used to read 'hello', it now reads 'h e l l o', so a search for 'hello' finds nothing.
A more technical explanation about this problem is in this thread: http://bugs.ghostscript.com/show_bug.cgi?id=696116

I thought I had a workaround for this, by specifying a smaller DW than the default 500:

--- api/pdfrenderer.cpp-orig       2016-07-14 14:55:53.299744815 +0200
+++ api/pdfrenderer.cpp    2016-07-14 15:16:23.619204071 +0200
@@ -543,7 +543,7 @@
                "  /FontDescriptor %ld 0 R\n"
                "  /Subtype /CIDFontType2\n"
                "  /Type /Font\n"
-               "  /DW %d\n"
+               "  /DW 250\n"
                ">>\n"
                "endobj\n",
                5L,         // CIDToGIDMap

This solves the issue in Acrobat Reader.
But when I put this file in Alfresco DMS, which uses PDFBox 1.8.4, I get the same problem again: I can only find words when I put spaces between the characters.

Setting the DW to a number smaller than 250 compromises the text in the OCR'ed layer, so that's not an option.
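
For anyone following along, here is a rough sketch of what the /DW value changes. It assumes the usual PDF text-positioning rule that glyph widths are given in thousandths of the font size; the 12 pt font size is made up for illustration, and how a given extractor turns these advances into spaces is a heuristic that this sketch does not model.

```python
def advance_points(width_units: float, font_size: float,
                   h_scale_percent: float = 100.0) -> float:
    """Horizontal advance of one glyph in points.

    width_units is the glyph width in 1/1000 text-space units (the value
    that /DW supplies when no per-glyph width is listed). Character and
    word spacing are ignored here.
    """
    return width_units / 1000.0 * font_size * h_scale_percent / 100.0


# Hypothetical 12 pt text, comparing the default /DW with the patched one.
for dw in (500, 250):
    print(f"/DW {dw}: each glyph advances {advance_points(dw, 12.0)} pt")
```

Halving /DW halves the advance the font declares for every glyph, which is presumably why the patched file behaves differently in some readers.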

Is there any way to change the font type to a proper width, so most pdf-tools can properly read the text?


jbreiden · 2016-07-19T16:52:45Z

Please do me a favor and take a look at 2.pdf which is an attachment towards the bottom of
the following bug. Tell me if that demonstrates the same incompatibility.

mozilla/pdf.js#6863

ebogaard · 2016-07-20T08:29:02Z

Funny thing: Alfresco uses pdf.js as its pdf viewer, and the search in pdf.js actually works. That is, pdf.js doesn't put extra spaces between the characters.

2.pdf doesn't show the problem, either in pdf.js or when the text is extracted with pdfbox.

So to summarize:

By default, there are extra spaces when converting or extracting text from pdfs generated by tesseract.
I found a reasonable workaround by decreasing the '/DW' from 500 to 250. Because of this, the text isn't overlaid perfectly, but that is something I can live with for now.
After this change, searching and copying/extracting text works for Acrobat Reader DC, ghostscript and pdf.js, but not for pdfbox.

See attached pdf, which displays those problems: test-out-git.zip

jbreiden · 2016-07-20T17:11:26Z

> 2.pdf doesn't show the problems in both pdf.js and when the text is extracted with pdfbox.

Good, because that is the future for Tesseract PDF output. 2.pdf has minor changes in the metrics of both the embedded font and the metrics in the PDF. I can't guarantee that this is going to work with every document, because PDF text extraction relies heavily on heuristics. (Root cause: PDF spec)

ebogaard · 2016-08-02T15:16:29Z

Is there any roadmap for this rewritten (as I understand) pdf generation?

jbreiden · 2016-08-02T18:54:45Z

It's more of a tweak than a rewrite. For logistical reasons, I hand all my changes to Ray who then merges them into the git repo. Ray is awesome in almost every way, but he is notoriously slow at this. I've already done the handoff.

mbirth · 2016-10-28T00:07:14Z

Note to other people running into this problem with pdfsandwich and ending up here, suspecting Tesseract: This is actually a problem with Ghostscript. pdfsandwich converts the images to PPM and hands those to Tesseract. Since those files are missing resolution/DPI information, Tesseract outputs a huge PDF (0.9 by 1.20 metres for A4) but with correct text (i.e. without spaces between letters). Then pdfsandwich runs this PDF through Ghostscript to resize it back to A4, and this step is what actually messes up the words.

The author of pdfsandwich has a pre-release version 0.1.5 which now uses TIF images instead of PPM. And those contain resolution information, so the PDF Tesseract spits out is already in the correct format.
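
To make the size arithmetic concrete, here is a small sketch. It assumes an A4 page scanned at 300 dpi (2480 x 3508 pixels) and a fallback of 70 dpi when the image carries no resolution. Both numbers are my assumptions for illustration, not something measured from pdfsandwich or Tesseract.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px: int, height_px: int, dpi: float) -> tuple:
    """Physical page size implied by pixel dimensions at a given resolution."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


# Assumed A4 scan at 300 dpi.
A4_PIXELS = (2480, 3508)

with_dpi = page_size_mm(*A4_PIXELS, dpi=300)    # resolution stored in the file
without_dpi = page_size_mm(*A4_PIXELS, dpi=70)  # assumed fallback resolution

print("with resolution info:    %.0f x %.0f mm" % with_dpi)
print("without resolution info: %.0f x %.0f mm" % without_dpi)
```

With the resolution present the page comes out at about A4 size. Without it the same pixels come out roughly 0.9 m wide, in the same ballpark as the huge page described above.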

(Side note: Tesseract seems to ignore resolution information from PNG files.)

jbreiden · 2016-10-28T17:04:56Z

> Tesseract seems to ignore resolution information from PNG files.

Wait, what? That's not expected at all. Please provide an example PNG file demonstrating
the problem, and it will get attention right away.

jbreiden · 2016-11-27T21:03:57Z

Back to the spaces thing, I'd appreciate a retest once Tesseract's pdf.ttf font matches the
following checksum. (It currently does not.)

$ md5sum pdf.ttf
e436074b54ed9cc5bf4789f79059b01b pdf.ttf
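
For anyone without md5sum at hand, the same check can be sketched in Python with the standard hashlib module. The helper names are made up for this example, and the location of pdf.ttf depends on your installation, so the path is left as a parameter.

```python
import hashlib

# Checksum quoted in the comment above.
EXPECTED_MD5 = "e436074b54ed9cc5bf4789f79059b01b"


def md5_of(path: str) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_is_current(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file at path has the expected checksum."""
    return md5_of(path) == expected
```

Calling font_is_current() with the path to your installed pdf.ttf tells you whether the updated font is in place.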

zdenop · 2016-11-28T18:36:23Z

The new pdf.ttf has landed in master and the 3.05 branch.
@ebogaard: Can you re-test?

ebogaard · 2016-11-29T13:23:58Z

Tried to re-test this, but got the following error when running pdfsandwich + tesseract.
This is with a freshly checked-out and compiled tesseract 3.05 branch:

ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line 
ParamsModel::Incomplete line ConvNL

[Output continues with more "ParamsModel::Incomplete line" and "ParamsModel::Unknown parameter" messages, each followed by unprintable binary data; omitted here.]

... And this goes on and on

zdenop · 2016-11-29T14:15:24Z

Please test with tesseract alone, and please provide the command you used to run it.

ebogaard · 2016-11-29T14:19:13Z

I tried that afterwards with this command: tesseract -l nld+eng pdfsandwich45aaf9.tif -pdf
Same problem.

zdenop · 2016-11-29T14:27:19Z

You used the wrong command. It should be something like this:
tesseract pdfsandwich45aaf9.tif pdfsandwich45aaf9 -l nld+eng pdf
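
If you call tesseract from a script, the argument order above can be captured in a small helper. This is only a sketch: the function names are made up, and it assumes a tesseract binary on your PATH.

```python
import subprocess


def tesseract_pdf_command(image: str, output_base: str, languages: list) -> list:
    """Build the argument list: input image, output base name, -l LANGS, 'pdf'."""
    return ["tesseract", image, output_base, "-l", "+".join(languages), "pdf"]


def run_ocr(image: str, output_base: str, languages: list) -> None:
    """Run tesseract and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(tesseract_pdf_command(image, output_base, languages), check=True)
```

For the file in this thread, tesseract_pdf_command("pdfsandwich45aaf9.tif", "pdfsandwich45aaf9", ["nld", "eng"]) yields the same arguments as the command line above. Note that the output base name has no extension and that "pdf" is given as a bare config name, not as "-pdf".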

ebogaard · 2016-11-29T14:32:52Z

Same error, I'm afraid.
I just downloaded new nld.traineddata & eng.traineddata from here: https://github.com/tesseract-ocr/tessdata/
Might that have something to do with it?

zdenop · 2016-11-29T14:41:57Z

The tessdata repository contains 4.00 data files, and you are using tesseract 3.05...
This is not supported. You need to use data files from the same or a lower tesseract version (e.g. 3.04).
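
The rule of thumb can be written down as a tiny check. This is purely illustrative: the function names are made up, it is not a Tesseract API, and it assumes you already know which version your data files were made for.

```python
def parse_version(text: str, width: int = 3) -> tuple:
    """Turn '3.05' or '3.04.00' into a zero-padded tuple such as (3, 5, 0)."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def data_is_supported(engine_version: str, data_version: str) -> bool:
    """Data files must come from the same or a lower version than the engine."""
    return parse_version(data_version) <= parse_version(engine_version)


for data in ("4.00", "3.04"):
    status = "ok" if data_is_supported("3.05", data) else "not supported"
    print(f"tesseract 3.05 with {data} data: {status}")
```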

amitdo · 2016-11-29T14:46:22Z

https://github.com/tesseract-ocr/tessdata/tree/3.04.00

ebogaard · 2016-11-29T14:51:11Z

Okay, that was a bit silly on my end.
But after exchanging the traineddata for the 3.04 versions, both tesseract and pdfsandwich+tesseract work.
The resulting pdfs from both tesseract and pdfsandwich look good, have a text layer and don't have any extra spaces between characters.
So this seems to be solved. Great!

edubya mentioned this issue Aug 9, 2016

Problem copying text from PDFs sumatrapdfreader/sumatrapdf#544

Open

edubya mentioned this issue Aug 17, 2016

Text is garbled in pdf.js (Cygwin / UB Mannheim binaries) #382

Closed

zdenop added the PDF label Sep 1, 2016

mbirth mentioned this issue Oct 28, 2016

Resolution information in PNG files is ignored #453

Closed

zdenop closed this as completed Nov 29, 2016

zdenop mentioned this issue Dec 23, 2016

ParamsModel::Incomplete line errors when using Vietnamese / VIE language #602

Closed

baskerville mentioned this issue Jun 10, 2018

[darwin] tesseract: tessdata version mismatch TritonDataCenter/pkgsrc#113

Closed

jh0274 mentioned this issue Oct 12, 2023

[BUG] Inserts extraneous spaces in PDF scambier/obsidian-text-extractor#41

Closed
