Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor searching for end of inline (EI) JPEG image streams #5631

Merged
merged 2 commits into from
Jan 12, 2015
Merged

Refactor searching for end of inline (EI) JPEG image streams #5631

merged 2 commits into from
Jan 12, 2015

Conversation

Snuffleupagus
Copy link
Collaborator

This patch changes searching for EI image streams to rely on the EOI (end-of-image) marker for DCTDecode filters (i.e. JPEG images).

Note: This code was initially included in PR #5383, but removed from it based on this comment from @yurydelendik: #5383 (comment).
It should be quite easy to, if necessary, ignore that particular marker sequence, but unfortunately I've been unable to find any information about it.

@yurydelendik
Copy link
Contributor

Okay, I found it at http://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=95&zoom=auto,-89,622 see "F.1.2.3 Byte stuffing". Looks like JPEG encoding takes care of it by itself. Let me think if EOI can potentially appear in other tags, e.g. looking at "B.2.2 Frame header syntax" there is 'Lf' field present, so we can skip the bytes and avoid false-EOI inside SOF body.

@yurydelendik
Copy link
Contributor

Looks like "Table B.1 – Marker code assignments" has everything we need. Can we use length field for all tags that are not marked with (*) and not reserved to skip bytes?

@Snuffleupagus
Copy link
Collaborator Author

Since JPEG uses two byte markers, it seemed nice/easy to just use the getUint16 methods of {Stream, DecodeStream, ChunkedStream}. Doing so, I realised that those methods will return garbage values if getByte() === -1 (i.e. when we've reached the end of the stream). Hence the first commit, which I hope is a good way to address that.

After carefully reading through the relevant parts of the marker spec, I hope I've managed to produce a good implement of the searching for EOI. The code will now read the length of the various marker segments, and jump straight to the next marker.
This seems to work very well, and in all cases I've checked this reduces the number of iterations necessary to find the EOI. Especially for small inline images, the number of iterations are considerably reduced.

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/dbd34b02f4b4dd5/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/f272c826fa7eff1/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/dbd34b02f4b4dd5/output.txt

Total script time: 17.31 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/f272c826fa7eff1/output.txt

Total script time: 23.35 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@yurydelendik
Copy link
Contributor

Yep, that's what I was looking for. However reading by 16-bit is throwing me off a bit. Can you modify the outer loop to read by bytes and to whitelist tags with length instead of blacklisting know to be without length? Something like this will explicitly specify which marker really have length field and reduce amount of skip(-1) calls:

while ((b = stream.readByte()) !== -1) {
  if (b !== 0xFF) continue; // not a marker
  switch (stream.readByte()) {
    case 0xFF: // padding
      steam.skip(-1); // needing to rewind
      break;
    case 0xC0: // SOF0
    case 0xC1: // SOF1
      var markerLength = stream.readUint16();
      if (markerLength > 2) {
        stream.skip(makerLength - 2);
      }
      break;
  }
}

This patch changes searching for EI image streams to rely on the EOI (end-of-image) marker for DCTDecode filters (i.e. JPEG images).
@Snuffleupagus
Copy link
Collaborator Author

Thanks for the feedback! How does it look now?
I completely agree that white-listing is better, even though the list became quite long :-)

I also added a special-case for the 0xFF00 "marker", since in many images it's by far the single most common 0xFF.. sequence, and this lets us bypass a bunch of pointless comparisons further down.

Edit: In practice, the fill byte case (i.e. sequences of 0xFF) doesn't seem particularly common. In the handful PDF files I used when developing the patch, I never actually came across that one. So it would appear the with the new code, we very rarely need to move back in the stream.

@yurydelendik
Copy link
Contributor

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/84110b7dec5a8c2/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux)


Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/d4b81b63f8c2666/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/84110b7dec5a8c2/output.txt

Total script time: 17.50 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/d4b81b63f8c2666/output.txt

Total script time: 23.04 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

yurydelendik added a commit that referenced this pull request Jan 12, 2015
…i-marker

Refactor searching for end of inline (EI) JPEG image streams
@yurydelendik yurydelendik merged commit 36dd6c1 into mozilla:master Jan 12, 2015
@yurydelendik
Copy link
Contributor

Looks good, thank you!

@Snuffleupagus Snuffleupagus deleted the inline-jpeg-image-find-eoi-marker branch January 12, 2015 18:48
speedplane pushed a commit to speedplane/pdf.js that referenced this pull request Feb 24, 2015
…find-eoi-marker

Refactor searching for end of inline (EI) JPEG image streams
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants