Doubled output from dumppdf.py #176

wlbentley · 2018-08-20T05:07:14Z

The top level of the samples/ directory contains 4 sample PDF files. When I run dumppdf.py against each of them, the output for two (simple2.pdf and jo.pdf) is doubled: the <trailer> block is repeated. For the other two (simple1.pdf and simple3.pdf), the output is a single <trailer> block, as expected. Each of the PDF files has only one actual trailer section.

See this gist for the actual output for each PDF file.

Expected behavior:
The dumppdf.py script should output only one <trailer> block for each of the four PDFs.

Python version: 3.7


pietermarsman · 2019-11-04T08:59:39Z

I can replicate this issue.

pietermarsman · 2020-01-14T21:45:42Z

I think this has to do with the PDFXRefFallback(). By default PDFDocument adds a fallback xref that just enumerates all the objects.

See here:

pdfminer.six/pdfminer/pdfdocument.py

Line 553 in 0b1741b

xref = PDFXRefFallback()

pietermarsman · 2020-03-14T15:21:57Z

We can fix this by using PDFDocument(..., fallback=False) in dumppdf.py.

A more thorough fix is to only append the PDFXRefFallback() to the self.xrefs if one of the xrefs from the document failed (and fallback is True).
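A minimal sketch of the "more thorough" variant, written as a toy model rather than against the real pdfminer.six classes. `build_xrefs`, `XRefParseError`, and the string placeholders are hypothetical names used only for illustration. Treating "no xref at all" the same as "an xref failed to load" is also an assumption of this sketch.

```python
class XRefParseError(Exception):
    """Hypothetical stand-in for the error raised by a broken xref."""


def build_xrefs(loaders, fallback=True):
    """Return the xrefs that loaded, plus a fallback only if one is needed.

    `loaders` is a list of zero-argument callables; each returns one parsed
    xref or raises XRefParseError. All names here are illustrative.
    """
    xrefs = []
    failed = False
    for load in loaders:
        try:
            xrefs.append(load())
        except XRefParseError:
            failed = True
    # Append the fallback only when a document xref failed or none was found
    # (the second condition is an assumption of this sketch).
    if fallback and (failed or not xrefs):
        xrefs.append("fallback-xref")
    return xrefs


def broken():
    """Simulate a corrupt xref section."""
    raise XRefParseError("corrupt xref table")


print(build_xrefs([lambda: "xref-1"]))        # healthy document
print(build_xrefs([broken]))                  # corrupt xref
print(build_xrefs([broken], fallback=False))  # fallback disabled
```

With this shape, a document whose xref loads cleanly never gets a second, synthetic entry, so there would be only one trailer to dump.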

I know very little about xrefs and the impact of changing this. @wlbentley, do you know more about this? Which solution would you recommend?

pietermarsman · 2020-05-17T16:33:39Z

I did some thinking: the problem is basically what kind of default behavior we want:

1. Always print all the trailers and objects, even if the pdf's xref is corrupted or non-existent (current behavior).
2. Only print trailers and objects that are actually in the pdf's xref.
3. Something dynamic; e.g. using a fallback xref if it does not exist in the pdf.

I prefer 2 because it just shows you the pdf. If you don't like the pdf, you can repair it with something like mutools and check with pdfminer.six if it is fixed. If we go for option 2, you cannot dumppdf.py a PDF without xrefs anymore.
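To compare the three options side by side, here is a small toy model. `Policy` and `trailers_to_dump` are hypothetical names, and the xrefs are plain strings rather than pdfminer.six objects; the sketch only counts how many trailers each option would print for a PDF with one xref and for a PDF with none.

```python
from enum import Enum


class Policy(Enum):
    ALWAYS_ALL = 1     # option 1: document xrefs plus the fallback
    DOCUMENT_ONLY = 2  # option 2: only xrefs found in the PDF itself
    DYNAMIC = 3        # option 3: fallback only when the PDF has none


def trailers_to_dump(document_xrefs, policy):
    """Return the xrefs whose trailers would be printed under `policy`."""
    if policy is Policy.ALWAYS_ALL:
        return list(document_xrefs) + ["fallback"]
    if policy is Policy.DOCUMENT_ONLY:
        return list(document_xrefs)
    # DYNAMIC: use the fallback only if the document has no xref of its own.
    return list(document_xrefs) or ["fallback"]


with_xref = ["xref"]  # a file that has one real xref
without_xref = []     # a file whose xref is missing

for policy in Policy:
    print(policy.name,
          len(trailers_to_dump(with_xref, policy)),
          len(trailers_to_dump(without_xref, policy)))
```

In this model, option 1 yields two trailers for a file with an xref and one for a file without, option 2 yields one and zero, and option 3 yields one in both cases.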

@wlbentley what do you prefer?

wlbentley · 2020-05-19T06:22:31Z

Is the xref actually corrupt in our samples? Or only missing? (And if they are corrupt, should we try to repair them?)

Either way, since I don't know how often bad/missing xrefs occur, I don't have a preference for the default behavior. I was just drawing attention to an anomaly in the samples.

But if we go with your option 2, and it prevents functioning in edge-cases, we should note that in the API component docs as a deliberate choice.

Whichever feels cleaner to you.

pietermarsman · 2020-05-21T14:50:16Z

> Is the xref actually corrupt in our samples? Or only missing?

simple1.pdf and simple3.pdf are actually missing the xref. So the single output is the fallback xref.

> And if they are corrupt, should we try to repair them?

The xref is just a pointer to the location of all the objects in the PDF. It is easy to replace by iterating over all the objects. I think it is a nice thing to do.
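To illustrate "iterating over all the objects", here is a rough sketch that rebuilds an offset table by scanning raw bytes for `N G obj` headers. It is not the pdfminer.six implementation; `rebuild_offsets` and `OBJ_HEADER` are hypothetical names, and a naive scan like this can be fooled by byte sequences inside stream data that happen to look like object headers.

```python
import re

# Matches an indirect object header such as b"12 0 obj" at a token boundary.
OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")


def rebuild_offsets(data):
    """Map object number -> (generation, byte offset) by scanning `data`."""
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        objid = int(match.group(1))
        generation = int(match.group(2))
        # A later definition of the same object replaces an earlier one.
        offsets[objid] = (generation, match.start())
    return offsets


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
print(rebuild_offsets(sample))
```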

Overall, I think I prefer 2. That is to say, change dumptrailers() such that it only prints trailers that are actually in the PDF. Maybe raise a warning if none exists with a flag to add the fallback xref.
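Roughly, the behavior proposed here could look like the sketch below. It is a toy model: `dump_trailers`, the `show_fallback_xref` flag, and the warning text are assumptions for illustration, and the xrefs are plain strings rather than pdfminer.six objects.

```python
import warnings


def dump_trailers(document_xrefs, fallback_xref, show_fallback_xref=False):
    """Return the trailers to print; warn if the PDF itself has none."""
    selected = list(document_xrefs)
    if show_fallback_xref:
        # The flag opts in to the synthetic xref.
        selected.append(fallback_xref)
    if not selected:
        warnings.warn(
            "This PDF does not have an xref; "
            "set show_fallback_xref=True to dump the fallback xref."
        )
    return selected


print(dump_trailers(["xref"], "fallback"))
print(dump_trailers([], "fallback", show_fallback_xref=True))
```

A warning rather than an exception keeps the tool usable on such files while still telling the user why no trailer was printed.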

wlbentley · 2020-05-21T16:09:37Z

An informative warning sounds like a good solution to me. Better than silently ignoring an anomaly, and certainly better than the current double output behavior.


pietermarsman · 2020-05-21T16:15:04Z

I've implemented the informative warning in #431. Do you have time for a review?

…ag to enable it

Fixes #176

* Add failing test for dumping simple1.pdf and simple3.pdf, because they should raise an error when dumppdf.py tries to dump a pdf without xrefs
* Raise PDFNoValidXRef with explanation if dumppdf.py is called on a pdf that does not have an xref
* Use warning instead of error, because not outputting xrefs is just fine (there aren't any) but it is something the user should know
* Adding changelog
* Extend help message

pietermarsman added the type: bug label Oct 13, 2019

pietermarsman added the component:document Related to PDFDocument label Jan 22, 2020

pietermarsman mentioned this issue May 21, 2020

Add failing test for dumping simple1.pdf and simple3.pdf, because the… #431

Merged

6 tasks

pietermarsman closed this as completed in #431 May 23, 2020
