-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization for FlateStream_getCode, making more pdfs parsable. #4897
Optimization for FlateStream_getCode, making more pdfs parsable. #4897
Conversation
After having spent some time on the method, I found a PDF by accident which fails in getCode(): The reason that it fails is that maxLen says 11 codes are to be found (but max, does mean it can be less?), but the document only contains 10. Therefor it errors out. When I turn the "error" into a break, the remaining decoding and sanity check pass, and the PDF renders nicely (and looks the same as in macos preview) So I think my modifications are legit. MaxLen really means max, and it is ok to find less. With what I learned, I refined the sanity check below. if no codes are to be found, or the size is smaller than len, the error is still thrown. |
Normally DEFLATE stream shall have explicit end-of-block code to break the loop. Looking at your patch, you have incomplete DEFLATE stream in the PDF, which makes this PDF corrupted.
Every PDF viewer handles broken PDF different way since there is no specification on how to recover from the corrupted files. The proposed patch is not a good solution to recover from bad PDFs -- no feedback to the developer about corrupted DEFLATE stream. We need a better solution, e.g. stop parsing of the existing operator list and display what was parsed successfully, and also show the info message about stream corruption. |
... unless there are no need for those insignificant bits for this table/codes. In this case, check the |
if ((b = str.getByte()) === -1) { | ||
error('Bad encoding in flate stream'); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment, e.g. // we might still have valid code, codeSize < codeLen check below will guard against incomplete codeVal
This commit cleans up the FlateStream_getCode method, and removes a few error conditions. Previously it would fail if the codeSize is less than maxLen if end of stream is reached. However in the document linked below there is a sub-stream (the one starting at pos 337) which has maxLen set to 11, but actually contains only 10. After breaking the sanity check still applies, and in this case passes validating codeSize(10)==codeLen(10). http://www.cafeculture.com/wp-content/uploads/2014/03/V-CM-BR-086-04002-1346-0258-GP-Brazil-Fazenda-Cafe-Cambara-Terra-Preta-Microlot-Sample-0460-13-Pulped-Natural-60Kg.pdf
/botio test |
From: Bot.io (Windows)ReceivedCommand cmd_test from @yurydelendik received. Current queue size: 0 Live output at: http://107.22.172.223:8877/f77ec3f2e0cd33f/output.txt |
From: Bot.io (Linux)ReceivedCommand cmd_test from @yurydelendik received. Current queue size: 0 Live output at: http://107.21.233.14:8877/09734c007515f11/output.txt |
From: Bot.io (Windows)FailedFull output at http://107.22.172.223:8877/f77ec3f2e0cd33f/output.txt Total script time: 23.01 mins
Image differences available at: http://107.22.172.223:8877/f77ec3f2e0cd33f/reftest-analyzer.html#web=eq.log |
From: Bot.io (Linux)SuccessFull output at http://107.21.233.14:8877/09734c007515f11/output.txt Total script time: 26.13 mins
|
Optimization for FlateStream_getCode, making more pdfs parsable.
Thank you |
No description provided.