Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization for FlateStream_getCode, making more pdfs parsable. #4897

Merged
merged 1 commit into from
Jun 9, 2014
Merged

Optimization for FlateStream_getCode, making more pdfs parsable. #4897

merged 1 commit into from
Jun 9, 2014

Conversation

CodingFabian
Copy link
Contributor

No description provided.

@CodingFabian
Copy link
Contributor Author

After having spent some time on the method, I found a PDF by accident which fails in getCode():

http://www.cafeculture.com/wp-content/uploads/2014/03/V-CM-BR-086-04002-1346-0258-GP-Brazil-Fazenda-Cafe-Cambara-Terra-Preta-Microlot-Sample-0460-13-Pulped-Natural-60Kg.pdf

The reason that it fails is that maxLen says 11 codes are to be found (but max, does mean it can be less?), but the document only contains 10. Therefor it errors out.

When I turn the "error" into a break, the remaining decoding and sanity check pass, and the PDF renders nicely (and looks the same as in macos preview)

So I think my modifications are legit. MaxLen really means max, and it is ok to find less.

With what I learned, I refined the sanity check below. if no codes are to be found, or the size is smaller than len, the error is still thrown.

@CodingFabian CodingFabian changed the title minor optimization for FlateStream_getCode Optimization for FlateStream_getCode, making more pdfs parsable. Jun 6, 2014
@yurydelendik
Copy link
Contributor

Normally DEFLATE stream shall have explicit end-of-block code to break the loop. Looking at your patch, you have incomplete DEFLATE stream in the PDF, which makes this PDF corrupted.

When I turn the "error" into a break, the remaining decoding and sanity check pass, and the PDF renders nicely (and looks the same as in macos preview)

Every PDF viewer handles broken PDF different way since there is no specification on how to recover from the corrupted files. The proposed patch is not a good solution to recover from bad PDFs -- no feedback to the developer about corrupted DEFLATE stream. We need a better solution, e.g. stop parsing of the existing operator list and display what was parsed successfully, and also show the info message about stream corruption.

@yurydelendik
Copy link
Contributor

... unless there are no need for those insignificant bits for this table/codes. In this case, check the codeLen of the incomplete code before deciding on throwing the exception.

if ((b = str.getByte()) === -1) {
error('Bad encoding in flate stream');
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment, e.g. // we might still have valid code, codeSize < codeLen check below will guard against incomplete codeVal

This commit cleans up the FlateStream_getCode method, and removes a few error
conditions.
Previously it would fail if the codeSize is less than maxLen if end of stream
is reached. However in the document linked below there is a sub-stream
(the one starting at pos 337) which has maxLen set to 11, but actually
contains only 10. After breaking the sanity check still applies, and in this
case passes validating codeSize(10)==codeLen(10).

 http://www.cafeculture.com/wp-content/uploads/2014/03/V-CM-BR-086-04002-1346-0258-GP-Brazil-Fazenda-Cafe-Cambara-Terra-Preta-Microlot-Sample-0460-13-Pulped-Natural-60Kg.pdf
@yurydelendik
Copy link
Contributor

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Jun 9, 2014

From: Bot.io (Windows)


Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/f77ec3f2e0cd33f/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jun 9, 2014

From: Bot.io (Linux)


Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/09734c007515f11/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jun 9, 2014

From: Bot.io (Windows)


Failed

Full output at http://107.22.172.223:8877/f77ec3f2e0cd33f/output.txt

Total script time: 23.01 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/f77ec3f2e0cd33f/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented Jun 9, 2014

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/09734c007515f11/output.txt

Total script time: 26.13 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

yurydelendik added a commit that referenced this pull request Jun 9, 2014
Optimization for FlateStream_getCode, making more pdfs parsable.
@yurydelendik yurydelendik merged commit 806aa36 into mozilla:master Jun 9, 2014
@yurydelendik
Copy link
Contributor

Thank you

@CodingFabian CodingFabian deleted the optimize-stream-getCode branch June 20, 2014 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants