Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files #618

yeus · 2021-05-07T22:21:24Z

Bug report

A description of the bug
All textlines are offset in y-direction by a constant amount.

Here is an example where I plotted the text boxes against the background pdf. As you can see,
graphics elements have the "correct" coordinates (dotted lines match up with the grey pdf graphics), but the
text lines have a constant offset.

The example pdf file can be found here: https://files.sma.de/downloads/SBSxx-10-DS-en-30.pdf

I am not sure what's causing this. I suspect it might have something to do with the font? Any help would be appreciated.

yeus · 2021-05-08T00:09:50Z

so after a little more investigation I am pretty sure that the error happens somewhere around here:

pdfminer.six/pdfminer/layout.py

Line 306 in 22f9052

bbox_lower_left = (0, descent + rise)

The position of the character's baseline from the matrix is actually correct. That means that when calculating the bounding box,
bbox_lower_left has to be wrong.

In my case, for example, the offset is descent + rise, where rise is zero. So one of the two seems to have the wrong value ...

--> I am not sure where rise is coming from, but descent and rise seem to come from two different sources:
rise is extracted much earlier, while descent comes directly from the font object. I am not sure if that makes sense, or whether "rise" should actually be "ascent" from the font object, just like descent ...

At least when I calculate descent + ascent it gives me the right bounding box for the character (in y-direction), although I haven't checked more pdfs.
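For reference, the calculation under discussion can be sketched in a few lines. This is a standalone approximation, not the pdfminer.six code: only the `bbox_lower_left = (0, descent + rise)` line is quoted from `layout.py` above. The upper-right corner, the helper `apply_matrix_pt`, the function name `char_bbox`, and all numeric values are assumptions made for illustration.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]


def apply_matrix_pt(m: Matrix, p: Point) -> Point:
    """Map a point through a PDF-style affine matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    x, y = p
    return (a * x + c * y + e, b * x + d * y + f)


def char_bbox(matrix: Matrix, fontsize: float, descent: float,
              rise: float, adv: float) -> BBox:
    """Bounding box of one glyph; descent and rise are in text-space units."""
    # Lower-left corner as in the quoted line.
    lower_left = (0.0, descent + rise)
    # Assumption: the box is one font size tall and one advance wide.
    upper_right = (adv, descent + rise + fontsize)
    x0, y0 = apply_matrix_pt(matrix, lower_left)
    x1, y1 = apply_matrix_pt(matrix, upper_right)
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Made-up example: upright text with its baseline origin at (100, 200).
matrix = (1.0, 0.0, 0.0, 1.0, 100.0, 200.0)
good = char_bbox(matrix, fontsize=10.0, descent=-2.0, rise=0.0, adv=5.0)
bad = char_bbox(matrix, fontsize=10.0, descent=-5.0, rise=0.0, adv=5.0)

# The baseline (from the matrix) is identical; only the box moves.
shift = good[1] - bad[1]
print(good, bad, shift)
```

In this sketch a descent that is too large in magnitude moves every box by the same amount while the baseline from the matrix stays put, which would be consistent with the constant offset described here.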

I would appreciate if someone with more knowledge of the codebase could help me out here ..

yeus · 2021-05-08T07:31:21Z

Just checked with some other pdfs, and in all pdfs that I used, the rise variable is always set to zero, even though it works correctly. Using the ascent would make those not work anymore. It seems the blame is solely on the descent variable and its calculation. Not sure why it is too large for this pdf...

I am not sure how this could happen, though; from what I see, the calculation of the descent variable is pretty straightforward.

sreeni5493 · 2021-08-11T09:59:13Z

Did you figure out any solution?

sreeni5493 · 2021-08-11T10:10:26Z

I tried removing descent. It worked for some cases. Need to check more cases.

yeus · 2021-08-15T21:45:42Z

> Did you figure out any solution?

No :(. Still no idea what exactly is causing this

yeus · 2022-01-27T12:21:59Z

> I tried removing descent. It worked for some cases. Need to check more cases.

Also tried it for some cases and it worked....
Also working on checks for more pdfs right now ;).

yeus · 2022-01-27T12:38:02Z

@sreeni5493

I am not sure yet, but

> I tried removing descent. It worked for some cases. Need to check more cases.

--> this doesn't work... in other pdfs I get completely wrong character locations this way as well..

I wonder if we have the same problem here as this one from pdf.js:

mozilla/pdf.js#6863

yeus · 2022-01-27T12:45:55Z

And this one here might be related as well (at least it looks very similar):

#490

yeus · 2022-01-27T13:00:41Z

So in mozilla/pdf.js#6863 they say that the problem was taking the wrong metric from the font instead of the pdf and switching PDF height metrics to 1000 (mozilla/pdf.js#6863 (comment)).

Wonder if that would resolve our issue. As you can see, we are also using the font information to calculate the "descent" of the font:

pdfminer.six/pdfminer/layout.py

Line 305 in c8cceb7

descent = font.get_descent() * fontsize

My assumption is that that's the wrong way to do it...

yeus · 2022-01-27T15:39:13Z

ok... even more debugging:

I tried repairing the pdf file using this line:

gs -o repaired.pdf -sDEVICE=pdfwrite input.pdf

And it seems to correct the issue..
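For checking more than a handful of files, the same Ghostscript call can be scripted. The following is only a sketch: it assumes `gs` is on the PATH, and the helper names `gs_rewrite_command` and `rewrite_pdf` are made up for illustration. It uses the same arguments as the command line above and nothing else.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def gs_rewrite_command(src: Path, dst: Path, gs_binary: str = "gs") -> List[str]:
    """Build the argument list for: gs -o DST -sDEVICE=pdfwrite SRC."""
    return [gs_binary, "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def rewrite_pdf(src: Path, dst: Path) -> bool:
    """Rewrite src through Ghostscript; return True if dst was produced."""
    if shutil.which("gs") is None:
        # Ghostscript is not installed, nothing to do.
        return False
    result = subprocess.run(gs_rewrite_command(src, dst), check=False)
    return result.returncode == 0 and dst.exists()


# Show the command that would be run for the file names used above.
command = gs_rewrite_command(Path("input.pdf"), Path("repaired.pdf"))
print(" ".join(command))
```

Extracting the text boxes from both the original and the rewritten file then makes it possible to compare the two side by side.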

What I find interesting are the completely different values that get read out for the font object in this case:

original (descent = fontdescent * 0.001 * fontsize = -0.507):

font.descent= -507
fontsize = 1
matrix: (0, 9, -9, 0, 42.1831, 607.5866)

and repaired (descent = fontdescent * 0.001 * fontsize = -2.088):

font.descent= -232
fontsize = 9
matrix: (0.0, 1.0, -1.0, 0.0, 42.1832, 607.587)

So it seems that in the original version of the pdf, the font information somehow gets extracted the wrong way.. fontsize 9 is obviously correct in the above pdf.
And no idea where it gets the descent value of -507 from. Additionally, the font size somehow ends up in the matrix? Maybe the original pdf is corrupted and the content stream orders this information in the wrong way? But why do I not have problems with this in other pdf viewing software? Also, I have had these problems in quite a few pdfs...
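Plugging the two readouts above into the descent formula gives a rough size for the discrepancy. The sketch below assumes the bounding box is mapped through the matrix, so that the descent ends up multiplied by the matrix's y-axis scale. `effective_descent` is a hypothetical helper, not a pdfminer.six function.

```python
import math
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]


def effective_descent(font_descent: float, fontsize: float, matrix: Matrix) -> float:
    """Descent in page units under the stated assumption.

    font_descent is in 1/1000 em, as in the readouts above.
    """
    a, b, c, d, e, f = matrix
    # Length of the matrix's y-axis basis vector, i.e. its vertical scale.
    y_scale = math.hypot(c, d)
    return font_descent * 0.001 * fontsize * y_scale


# Values copied from the two readouts above.
original = effective_descent(-507, 1, (0, 9, -9, 0, 42.1831, 607.5866))
repaired = effective_descent(-232, 9, (0.0, 1.0, -1.0, 0.0, 42.1832, 607.587))

print(round(original, 3), round(repaired, 3), round(repaired - original, 3))
```

Under that assumption the combined scale is 9 in both files (1 × 9 versus 9 × 1), so the font size sitting in the matrix would not move the box by itself. What differs is the font descent (-507 versus -232), which works out to a difference of roughly 2.475 units between the two files.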

pietermarsman · 2022-01-29T14:01:46Z

It would be very helpful if you can determine if this is an error in your example pdf, or if this is a structural thing. Also use the pdf reference section 5.2 to check how the bounding boxes should be computed.

yeus changed the title ~~Text lines have a wrong constant offset in y-direction in some extracted pdf files~~ Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files May 7, 2021

yeus changed the title ~~Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files~~ Text line coordinates/boundingboxes have a wrong constant offset in y-direction (maybe also x) in some extracted pdf files May 8, 2021

yeus changed the title ~~Text line coordinates/boundingboxes have a wrong constant offset in y-direction (maybe also x) in some extracted pdf files~~ Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files May 8, 2021

sreeni5493 mentioned this issue Jun 2, 2021

Text extraction generated words have location that is completely off from where they actually are present jsvine/pdfplumber#425

Closed

yeus mentioned this issue Jan 27, 2022

Math formula position are detected wrongly #490

Open
