Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse the font file to determine the correct type/subtype, rather than relying on the (often incorrect) data in the font dictionary #9961

Merged
merged 3 commits into from
Aug 5, 2018

Conversation

Snuffleupagus
Copy link
Collaborator

@Snuffleupagus Snuffleupagus commented Aug 4, 2018

The current font type/subtype detection code is quite inconsistent/unwieldy. In some cases it will simply assume that the font dictionary is correct, in others it will somewhat "arbitrarily" check the actual font file (more of these cases have been added over the years to fix specific bugs).

As is evident from e.g. issue #9949, the font type/subtype detection code is continuing to cause issues. In an attempt to get rid of these hacks once and for all, this patch instead re-factors the type/subtype detection to always parse the font file.

Please note that, as far as I can tell, we still appear to need to rely on the composite font detection based on the font dictionary. However, even if the composite/non-composite detection would get it wrong, that shouldn't really matter too much given that there's basically only two different code-paths (for "TrueType-like" vs "Type1-like" fonts).

This PR provides a complete fix for issue #9949, in addition to the generally useful workaround implemented by PR #9958. Furthermore, this PR also ought to prevent any future font type/subtype detection bugs too.

@Snuffleupagus
Copy link
Collaborator Author

Since I've not run any tests locally, I suppose that this could all fail spectacularly :-)

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/f0223b1439add40/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/31cfe6bf0474055/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/31cfe6bf0474055/output.txt

Total script time: 28.26 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/31cfe6bf0474055/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/f0223b1439add40/output.txt

Total script time: 35.84 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/f0223b1439add40/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Copy link
Collaborator Author

The failures should hopefully all be caused by a pre-existing existing, since the isTrueTypeFile helper function wasn't able to detect certain Mac fonts.

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/fa6d144f1ee0ca8/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/1e43c45a6f0fe7c/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/1e43c45a6f0fe7c/output.txt

Total script time: 28.13 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented Aug 4, 2018

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/fa6d144f1ee0ca8/output.txt

Total script time: 36.04 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@Snuffleupagus Snuffleupagus force-pushed the getFontFileType branch 2 times, most recently from 9e57a80 to 18b9c5b Compare August 5, 2018 08:32
@Snuffleupagus Snuffleupagus force-pushed the getFontFileType branch 3 times, most recently from e8714c5 to 189fa4d Compare August 5, 2018 08:41
Compared to most other font formats, the CFF doesn't have a constant header which makes is slightly more difficult to detect such font files.

Please refer to the Compact Font Format specification: https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5176.CFF.pdf#G3.32094
…n relying on the (often incorrect) data in the font dictionary

The current font type/subtype detection code is quite inconsistent/unwieldy. In some cases it will simply assume that the font dictionary is correct, in others it will somewhat "arbitrarily" check the actual font file (more of these cases have been added over the years to fix specific bugs).

As is evident from e.g. issue 9949, the font type/subtype detection code is continuing to cause issues. In an attempt to get rid of these hacks once and for all, this patch instead re-factors the type/subtype detection to *always* parse the font file.

Please note that, as far as I can tell, we still appear to need to rely on the composite font detection based on the font dictionary. However, even if the composite/non-composite detection would get it wrong, that shouldn't really matter too much given that there's basically only two different code-paths (for "TrueType-like" vs "Type1-like" fonts).
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Aug 5, 2018

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/475437f1133487f/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 5, 2018

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/f121e937e602f73/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 5, 2018

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/f121e937e602f73/output.txt

Total script time: 28.22 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented Aug 5, 2018

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/475437f1133487f/output.txt

Total script time: 35.44 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

pdfjsbot commented Aug 5, 2018

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a981105c1fd19e7/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Aug 5, 2018

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/a981105c1fd19e7/output.txt

Total script time: 6.91 mins

Published

@timvandermeij timvandermeij merged commit eec7e18 into mozilla:master Aug 5, 2018
@timvandermeij
Copy link
Contributor

Thank you for improving this; much more readable this way and I verified that it fixes the issue too.

@Snuffleupagus Snuffleupagus deleted the getFontFileType branch August 5, 2018 17:09
@brendandahl
Copy link
Contributor

brendandahl commented Aug 6, 2018

Another possibility is doing something like tx and peek into the file:

https://github.com/adobe-type-tools/afdko/blob/d9f9d28ac6959d7c12858c28a8a488e63e3dd174/c/tx/source/tx.c#L6767

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented Aug 6, 2018

Another possibility is doing something like tx and peek into the file:

But... that's what this PR does :-)

The newly added getFontFileType function will "peek" into the font files themselves to determine their actual types, by making use of these existing helper functions:

pdf.js/src/core/fonts.js

Lines 656 to 700 in c8ee633

function isTrueTypeFile(file) {
var header = file.peekBytes(4);
return (readUint32(header, 0) === 0x00010000 ||
bytesToString(header) === 'true');
}
function isTrueTypeCollectionFile(file) {
let header = file.peekBytes(4);
return bytesToString(header) === 'ttcf';
}
function isOpenTypeFile(file) {
var header = file.peekBytes(4);
return bytesToString(header) === 'OTTO';
}
function isType1File(file) {
var header = file.peekBytes(2);
// All Type1 font programs must begin with the comment '%!' (0x25 + 0x21).
if (header[0] === 0x25 && header[1] === 0x21) {
return true;
}
// ... obviously some fonts violate that part of the specification,
// please refer to the comment in |Type1Font| below.
if (header[0] === 0x80 && header[1] === 0x01) { // pfb file header.
return true;
}
return false;
}
/**
* Compared to other font formats, the header in CFF files is not constant
* but contains version numbers. To reduce the possibility of misclassifying
* font files as CFF, it's recommended to check for other font formats first.
*/
function isCFFFile(file) {
const header = file.peekBytes(4);
if (/* major version, [1, 255] */ header[0] >= 1 &&
/* minor version, [0, 255]; header[1] */
/* header size, [0, 255]; header[2] */
/* offset(0) size, [1, 4] */ (header[3] >= 1 && header[3] <= 4)) {
return true;
}
return false;
}

@brendandahl
Copy link
Contributor

brendandahl commented Aug 6, 2018

Hah, okay, I was lazy and didn't look into what all the is<SomeType> method did. I assumed we were just looking at the font dictionary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants