Doubled output from dumppdf.py #176
Comments
I can replicate this issue. |
I think this has to do with the … See here: pdfminer.six/pdfminer/pdfdocument.py, line 553 in 0b1741b
|
We can fix this by using … A more thorough fix is to only append the … I know very little about xrefs and the impact of changing this. @wlbentley, do you know more about this? Which solution would you recommend? |
I did some thinking: the problem is basically what kind of default behavior we want:
I prefer 2 because it just shows you the PDF. If you don't like the PDF, you can repair it with something like mutool and check with pdfminer.six whether it is fixed. If we go for option 2, you cannot run dumppdf.py on a PDF without xrefs anymore. @wlbentley, what do you prefer? |
Is the xref actually corrupt in our samples? Or only missing? (And if they are corrupt, should we try to repair them?) Either way, since I don't know how often bad/missing xrefs occur, I don't have a preference for the default behavior. I was just drawing attention to an anomaly in the samples. But if we go with your option 2, and it prevents functioning in edge-cases, we should note that in the API component docs as a deliberate choice. Whichever feels cleaner to you. |
simple1.pdf and simple3.pdf are actually missing the xref. So the single output is the fallback xref.
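The xref is just a pointer to the location of all the objects in the PDF. It is easy to replace by iterating over all the objects. I think it is a nice thing to do. Overall, I think I prefer 2. That is to say, change dumptrailers() such that it only prints trailers that are actually in the PDF. Maybe raise a warning if none exists, with a flag to add the fallback xref. |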
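To illustrate why a missing xref is easy to replace, here is a minimal sketch that rebuilds an object-offset table by scanning the raw bytes for `N G obj` headers. It is not pdfminer.six's implementation; the function name `build_fallback_xref` and the sample bytes are made up for illustration, and the scan ignores complications such as object streams or matching text inside stream data.

```python
import re

# Matches the header of an indirect object, e.g. b"12 0 obj".
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def build_fallback_xref(data: bytes) -> dict:
    """Map (object number, generation) -> byte offset by scanning the file.

    Hypothetical helper: a simplified stand-in for a fallback xref.
    """
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # If an object is defined more than once, the last definition wins.
        offsets[key] = match.start()
    return offsets


# Made-up, minimal PDF-like bytes with two objects and no xref table.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)
fallback = build_fallback_xref(sample)
```

Because the table is derived from the objects themselves, it can always be regenerated, which is why it arguably should not be reported as if it were a trailer stored in the file.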
An informative warning sounds like a good solution to me. Better than silently ignoring an anomaly, and certainly better than the current double output behavior.
On May 21, 2020, 07:50 -0700, Pieter Marsman ***@***.***>, wrote:
> > Is the xref actually corrupt in our samples? Or only missing?
> simple1.pdf and simple3.pdf are actually missing the xref. So the single output is the fallback xref.
> > And if they are corrupt, should we try to repair them?
> The xref is just a pointer to the location of all the objects in the PDF. It is easy to replace by iterating over all the objects. I think it is a nice thing to do.
> Overall, I think I prefer 2. That is to say, change dumptrailers() such that it only prints trailers that are actually in the PDF. Maybe raise a warning if none exists with a flag to add the fallback xref.
|
I've implemented the informative warning in #431. Do you have time for a review? |
…ag to enable it
Fixes #176
* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a PDF without xrefs
* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a PDF that does not have an xref
* Use warning instead of error, because not outputting xrefs is just fine (there aren't any), but it is something the user should know
* Add changelog
* Extend help message
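The behavior discussed in this thread (print only trailers that really exist, warn when there are none, and let a flag opt in to the fallback) could look roughly like the sketch below. This is not the code from #431: `select_trailers` and `show_fallback` are hypothetical names, trailers are modeled as plain dicts, and a standard `UserWarning` stands in for whatever warning or exception class the project uses.

```python
import warnings


def select_trailers(real_trailers, fallback_trailer, show_fallback=False):
    """Return the trailers that should be dumped.

    Hypothetical helper illustrating the agreed behavior:
    - real trailers are always dumped, and the fallback is never added
      on top of them (this avoids the doubled output);
    - without real trailers, the fallback is dumped only on request;
    - otherwise nothing is dumped and the user is warned.
    """
    if real_trailers:
        return list(real_trailers)
    if show_fallback:
        return [fallback_trailer]
    warnings.warn(
        "This PDF does not have an xref. "
        "Pass show_fallback=True to display the fallback xref.",
        stacklevel=2,
    )
    return []


# A PDF with one real trailer yields exactly one trailer to dump.
dumped = select_trailers([{"Root": "1 0 R"}], {"Size": 3})
```

Returning an empty list together with a warning keeps the output honest about what is in the file while still telling the user why nothing was printed.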
The top level of the samples/ directory contains 4 sample PDF files. When I run dumppdf.py against each of them, the output for two (simple2.pdf and jo.pdf) is doubled: the <trailer> block is repeated. For the other two (simple1.pdf and simple3.pdf), the output is a single <trailer> block, as expected. Each of the PDF files has only one actual trailer section.

See this gist for the actual output for each PDF file.

Expected behavior:
The dumppdf.py script should output only one <trailer> block for each of the four PDFs.

Python version: 3.7
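The expected behavior can be checked mechanically by counting `<trailer>` opening tags in the dumped text. The sketch below assumes the dump has already been captured as a string; `count_trailer_blocks` is a hypothetical helper and the two sample strings are made up, not real dumppdf.py output.

```python
def count_trailer_blocks(dump_output: str) -> int:
    """Count opening <trailer> tags in captured dump text (hypothetical helper)."""
    return dump_output.count("<trailer>")


# Made-up stand-ins for a correct dump and a doubled dump.
single = "<pdf>\n<trailer>\n...\n</trailer>\n</pdf>"
doubled = "<pdf>\n<trailer>\n...\n</trailer>\n<trailer>\n...\n</trailer>\n</pdf>"

single_count = count_trailer_blocks(single)
doubled_count = count_trailer_blocks(doubled)
```

A count greater than one for a file with a single trailer section reproduces the problem reported here.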