Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doubled output from dumppdf.py #176

Closed
wlbentley opened this issue Aug 20, 2018 · 8 comments · Fixed by #431
Closed

Doubled output from dumppdf.py #176

wlbentley opened this issue Aug 20, 2018 · 8 comments · Fixed by #431
Labels
component:document Related to PDFDocument type: bug

Comments

@wlbentley
Copy link
Contributor

wlbentley commented Aug 20, 2018

The top level of the samples/ directory contains 4 sample PDF files. When I run dumppdf.py against each of them, the output for two (simple2.pdf and jo.pdf) is doubled: the <trailer> block is repeated. For the other two (simple1.pdf and simple3.pdf), the output is a single <trailer> block, as expected. Each of the PDF files has only one actual trailer section.

See this gist for the actual output for each PDF file.

Expected behavior:
The dumppdf.py script should output only one <trailer> block for each of the four PDFs.

Python version: 3.7

@pietermarsman
Copy link
Member

I can replicate this issue.

@pietermarsman
Copy link
Member

I think this has to do with the PDFXRefFallback(). By default PDFDocument adds a fallback xref that just enumerates all the objects.

See here:

xref = PDFXRefFallback()

@pietermarsman pietermarsman added the component:document Related to PDFDocument label Jan 22, 2020
@pietermarsman
Copy link
Member

We can fix this by using PDFDocument(..., fallback=False) in dumppdf.py.

A more thorough fix is to only append the PDFXRefFallback() to the self.xrefs if one of the xrefs from the document failed (and fallback is True).

I know very little about xrefs and the impact of changing this. @wlbentley, do you know more about this? Which solution would you recommend?

@pietermarsman
Copy link
Member

I did some thinking: the problem is basically what kind of default behavior we want:

  1. Always print the all trailers and objects, even if the pdf's xref is corrupted or non-existing (current behavior).
  2. Only print trailers and objects that are actually in the pdf's xref.
  3. Something dynamic; e.g. using a fallback xref if it does not exist in the pdf.

I prefer 2 because it just shows you the pdf. If you don't like the pdf, you can repair it with something like mutools and check with pdfminer.six if it is fixed. If we go for option 2, you cannot dumppdf.py a PDF without xrefs anymore.

@wlbentley what do you prefer?

@wlbentley
Copy link
Contributor Author

Is the xref actually corrupt in our samples? Or only missing? (And if they are corrupt, should we try to repair them?)

Either way, since I don't know how often bad/missing xrefs occur, I don't have a preference for the default behavior. I was just drawing attention to an anomaly in the samples.

But if we go with your option 2, and it prevents functioning in edge-cases, we should note that in the API component docs as a deliberate choice.

Whichever feels cleaner to you.

@pietermarsman
Copy link
Member

Is the xref actually corrupt in our samples? Or only missing?

simple1.pdf and simple3.pdf are actually missing the xref. So the single output is the fallback xref.

And if they are corrupt, should we try to repair them?

The xref is just a pointer to the location of all the objects in the PDF. It is easy to replace by iterating over all the objects. I think it is a nice thing to do.

Overall, I think I prefer 2. That is to say, change dumptrailers() such that it only prints trailers that are actually in the PDF. Maybe raise a warning if none exists with a flag to add the fallback xref.

@wlbentley
Copy link
Contributor Author

wlbentley commented May 21, 2020 via email

@pietermarsman
Copy link
Member

I've implemented the infomative warning in #431. Do you have time for a review?

pietermarsman added a commit that referenced this issue May 23, 2020
…ag to enable it

Fixes #176 

* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xref's

* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref

* Use warning instead of error, because not output xrefs is just fine (there aren't any) but it is something the user should know

* Adding changelog

* Extend help message
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:document Related to PDFDocument type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants