Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files #618
Comments
So after a little more investigation I am pretty sure that the error happens somewhere around here: pdfminer.six/pdfminer/layout.py Line 306 in 22f9052
The position of the character's baseline from the matrix is actually correct. That means the offset must be introduced when calculating the bounding box. In my case, for example --> I am not sure where rise is coming from, but descent and rise seem to come from two different sources, at least when I calculate I would appreciate it if someone with more knowledge of the codebase could help me out here. |
Just checked with some other pdfs, and in all pdfs that I used, the I am not sure how this could happen, though. From what I see, the calculation of the descent variable is pretty straightforward. |
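To make the suspicion above concrete, here is a minimal sketch of how a wrong descent turns into a constant y-offset even when the baseline is correct. It assumes the bottom of a glyph box is baseline + descent + rise and that the box is one font size tall. The helper name `glyph_vertical_extent` and all numbers are made up for illustration; this is not the code from pdfminer.six's layout.py.

```python
from typing import NamedTuple


class VerticalExtent(NamedTuple):
    y0: float  # bottom edge of the glyph box
    y1: float  # top edge of the glyph box


def glyph_vertical_extent(baseline_y: float, descent: float,
                          rise: float, fontsize: float) -> VerticalExtent:
    """Hypothetical helper (assumption, not pdfminer.six code).

    The bottom edge is the baseline shifted by descent and rise;
    the top edge sits one font size above the bottom edge.
    """
    y0 = baseline_y + descent + rise
    return VerticalExtent(y0, y0 + fontsize)


# Same baseline, same rise, same font size; only the descent differs.
# The values below are illustrative, not taken from the example pdf.
expected = glyph_vertical_extent(baseline_y=100.0, descent=-2.0, rise=0.0, fontsize=9.0)
shifted = glyph_vertical_extent(baseline_y=100.0, descent=-0.5, rise=0.0, fontsize=9.0)

# Every glyph box moves by the same amount: the difference in descent.
offset = shifted.y0 - expected.y0
print(offset)
```

Because the difference in descent does not depend on the glyph or its position, every text line would be displaced by the same amount, which would match the constant offset described in the bug report.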
Did you figure out any solution? |
I tried removing descent. It worked for some cases. Need to check more cases. |
No :(. Still no idea what exactly is causing this. |
I am not sure yet, but
I wonder if we have the same problem here as this one from pdf.js: |
And this one here might be related as well (at least it looks very similar): |
So in mozilla/pdf.js#6863 they say that the problem was taking the wrong metric from the font instead of the pdf, and switching PDF height metrics to 1000 (mozilla/pdf.js#6863 (comment)). I wonder if that would resolve our issue. As you can see, we are also using the font information in order to calculate the "descent" of the font: pdfminer.six/pdfminer/layout.py Line 305 in c8cceb7
My assumption is that that's the wrong way to do it... |
OK... even more debugging: I tried repairing the pdf file using this line:
And it seems to correct the issue. What I find interesting are the completely different values that get read out for the font object in this case: original (descent = fontdescent * 0.001 * fontsize = -0.507):
and repaired (descent = fontdescent * 0.001 * fontsize = -2.088):
So it seems that in the original version of the pdf, the font information somehow gets extracted the wrong way. A fontsize of 9 is obviously correct in the above pdf. |
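As a rough check of the numbers above, the formula descent = fontdescent * 0.001 * fontsize can be inverted to see which raw font descent each version of the file implies. This is only a sketch: the function names are hypothetical, and the only inputs taken from the thread are the two scaled descents (-0.507 and -2.088) and the font size of 9.

```python
GLYPH_UNIT = 0.001  # font metrics are given in 1/1000 of a text-space unit


def scaled_descent(font_descent: float, fontsize: float) -> float:
    """Apply the formula quoted in the comment above."""
    return font_descent * GLYPH_UNIT * fontsize


def implied_font_descent(descent: float, fontsize: float) -> float:
    """Hypothetical inverse: recover the raw font descent from a scaled one."""
    return descent / (GLYPH_UNIT * fontsize)


FONTSIZE = 9  # stated in the thread to be the correct size for this pdf

# Scaled descents reported for the original and the repaired file.
original_raw = implied_font_descent(-0.507, FONTSIZE)  # roughly -56.3
repaired_raw = implied_font_descent(-2.088, FONTSIZE)  # roughly -232

# Vertical shift between the two versions, in text-space units.
y_shift = -0.507 - (-2.088)

print(original_raw, repaired_raw, y_shift)
```

Under these assumptions the two files yield raw descents that differ by roughly a factor of four, and the text boxes would end up about 1.58 units apart in y, which is the kind of constant offset reported here.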
It would be very helpful if you can determine if this is an error in your example pdf, or if this is a structural thing. Also use the pdf reference section 5.2 to check how the bounding boxes should be computed. |
Bug report
Here is an example where I plotted the textboxes against the background pdf. As you can see, the
graphics elements have the "correct" coordinates (the dotted lines match up with the grey pdf graphics), but the
textlines have a constant offset.
The example pdf file can be found here: https://files.sma.de/downloads/SBSxx-10-DS-en-30.pdf
I am not sure what's causing this. I suspect it might have something to do with the font. Any help would be appreciated.