Optimization for FlateStream_getCode, making more pdfs parsable. #4897

CodingFabian · 2014-06-05T14:50:33Z

No description provided.

CodingFabian · 2014-06-06T18:08:37Z

After having spent some time on the method, I found a PDF by accident which fails in getCode():

http://www.cafeculture.com/wp-content/uploads/2014/03/V-CM-BR-086-04002-1346-0258-GP-Brazil-Fazenda-Cafe-Cambara-Terra-Preta-Microlot-Sample-0460-13-Pulped-Natural-60Kg.pdf

The reason that it fails is that maxLen says 11 codes are to be found (but max, does mean it can be less?), but the document only contains 10. Therefor it errors out.

When I turn the "error" into a break, the remaining decoding and sanity check pass, and the PDF renders nicely (and looks the same as in macos preview)

So I think my modifications are legit. MaxLen really means max, and it is ok to find less.

With what I learned, I refined the sanity check below. if no codes are to be found, or the size is smaller than len, the error is still thrown.

yurydelendik · 2014-06-09T18:33:13Z

Normally DEFLATE stream shall have explicit end-of-block code to break the loop. Looking at your patch, you have incomplete DEFLATE stream in the PDF, which makes this PDF corrupted.

When I turn the "error" into a break, the remaining decoding and sanity check pass, and the PDF renders nicely (and looks the same as in macos preview)

Every PDF viewer handles broken PDF different way since there is no specification on how to recover from the corrupted files. The proposed patch is not a good solution to recover from bad PDFs -- no feedback to the developer about corrupted DEFLATE stream. We need a better solution, e.g. stop parsing of the existing operator list and display what was parsed successfully, and also show the info message about stream corruption.

yurydelendik · 2014-06-09T18:39:02Z

... unless there are no need for those insignificant bits for this table/codes. In this case, check the codeLen of the incomplete code before deciding on throwing the exception.

yurydelendik · 2014-06-09T18:49:32Z

src/core/stream.js

      if ((b = str.getByte()) === -1) {
-        error('Bad encoding in flate stream');
+        break;


Add a comment, e.g. // we might still have valid code, codeSize < codeLen check below will guard against incomplete codeVal

This commit cleans up the FlateStream_getCode method, and removes a few error conditions. Previously it would fail if the codeSize is less than maxLen if end of stream is reached. However in the document linked below there is a sub-stream (the one starting at pos 337) which has maxLen set to 11, but actually contains only 10. After breaking the sanity check still applies, and in this case passes validating codeSize(10)==codeLen(10). http://www.cafeculture.com/wp-content/uploads/2014/03/V-CM-BR-086-04002-1346-0258-GP-Brazil-Fazenda-Cafe-Cambara-Terra-Preta-Microlot-Sample-0460-13-Pulped-Natural-60Kg.pdf

yurydelendik · 2014-06-09T18:57:37Z

/botio test

pdfjsbot · 2014-06-09T18:57:38Z

From: Bot.io (Windows)

Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/f77ec3f2e0cd33f/output.txt

pdfjsbot · 2014-06-09T18:57:38Z

From: Bot.io (Linux)

Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/09734c007515f11/output.txt

pdfjsbot · 2014-06-09T19:20:39Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/f77ec3f2e0cd33f/output.txt

Total script time: 23.01 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/f77ec3f2e0cd33f/reftest-analyzer.html#web=eq.log

pdfjsbot · 2014-06-09T19:23:46Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/09734c007515f11/output.txt

Total script time: 26.13 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

Optimization for FlateStream_getCode, making more pdfs parsable.

yurydelendik · 2014-06-09T19:51:51Z

Thank you

CodingFabian changed the title ~~minor optimization for FlateStream_getCode~~ Optimization for FlateStream_getCode, making more pdfs parsable. Jun 6, 2014

yurydelendik reviewed Jun 9, 2014
View reviewed changes

yurydelendik added a commit that referenced this pull request Jun 9, 2014

Merge pull request #4897 from CodingFabian/optimize-stream-getCode

806aa36

Optimization for FlateStream_getCode, making more pdfs parsable.

yurydelendik merged commit 806aa36 into mozilla:master Jun 9, 2014

CodingFabian deleted the optimize-stream-getCode branch June 20, 2014 13:54

Snuffleupagus mentioned this pull request Oct 4, 2015

Add a (linked) test-case for PR 4897 #6500

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization for FlateStream_getCode, making more pdfs parsable. #4897

Optimization for FlateStream_getCode, making more pdfs parsable. #4897

CodingFabian commented Jun 5, 2014

CodingFabian commented Jun 6, 2014

yurydelendik commented Jun 9, 2014

yurydelendik commented Jun 9, 2014

yurydelendik Jun 9, 2014

yurydelendik commented Jun 9, 2014

pdfjsbot commented Jun 9, 2014

pdfjsbot commented Jun 9, 2014

pdfjsbot commented Jun 9, 2014

pdfjsbot commented Jun 9, 2014

yurydelendik commented Jun 9, 2014

Optimization for FlateStream_getCode, making more pdfs parsable. #4897

Optimization for FlateStream_getCode, making more pdfs parsable. #4897

Conversation

CodingFabian commented Jun 5, 2014

CodingFabian commented Jun 6, 2014

yurydelendik commented Jun 9, 2014

yurydelendik commented Jun 9, 2014

yurydelendik Jun 9, 2014

Choose a reason for hiding this comment

yurydelendik commented Jun 9, 2014

pdfjsbot commented Jun 9, 2014

From: Bot.io (Windows)

Received

pdfjsbot commented Jun 9, 2014

From: Bot.io (Linux)

Received

pdfjsbot commented Jun 9, 2014

From: Bot.io (Windows)

Failed

pdfjsbot commented Jun 9, 2014

From: Bot.io (Linux)

Success

yurydelendik commented Jun 9, 2014