The font name from an embedded font contains a strange prefix #349

pietermarsman · 2019-12-30T17:08:39Z

Describe the bug

This is a follow-up to #72. When converting test.pdf to HTML, the browser does not recognize the font because its name is extracted incorrectly.

There are actually two issues here:

The font name has a strange prefix, e.g. "VZWISY+Georgia" instead of "Georgia".
The font-family CSS property has a bytes string as its value, including the b'' prefix.
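
The second symptom is what Python 3 produces when a `bytes` value is interpolated into a `str` template. The sketch below reproduces it in isolation. It is not pdfminer's actual code: `render_style` and `decode_font_name` are hypothetical helpers, and the choice of UTF-8 for decoding is an assumption.

```python
def render_style(font_name, font_size):
    """Format a CSS style attribute by plain string interpolation (hypothetical helper)."""
    return "font-family: %s; font-size:%dpx" % (font_name, font_size)


def decode_font_name(font_name):
    """Return the font name as str, decoding it first if it arrives as bytes."""
    if isinstance(font_name, bytes):
        # Assumption: the name is UTF-8/ASCII compatible; invalid bytes are replaced.
        return font_name.decode("utf-8", errors="replace")
    return font_name


# Interpolating bytes directly leaks the b'' wrapper into the output.
broken = render_style(b"VZWISY+Georgia", 12)

# Decoding first yields a plain name (the subset tag is still present here).
fixed = render_style(decode_font_name(b"VZWISY+Georgia"), 12)
```

Decoding only addresses the b'' wrapper; the "VZWISY+" prefix is a separate problem.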

To Reproduce

Convert test.pdf to HTML:

pdf2txt.py test.pdf -t html -o test.html

Observe the span element:

<span style="font-family: b'VZWISY+Georgia'; font-size:12px">The Portable Document Format (PDF) is the world’s leading language for describing 
<br>the printed page</span>

Expected behavior

Extract the font name correctly, or rename it in HTMLConverter so that the browser recognizes Georgia.


pietermarsman · 2019-12-30T19:50:10Z

I figured out what the strange prefix is. According to Section 5.5.3 of the PDF Reference:

For a font subset, the PostScript name of the font - the value of the font's BaseFont entry and the font descriptor's FontName entry - begins with a tag followed by a plus sign (+). The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags. For example, EOODIA+Poetica is the name of a subset of Poetica, a Type 1 font.
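
Given that definition, the tag can be removed with a short pattern match: exactly six uppercase letters followed by a plus sign at the start of the name. This is a minimal sketch, not necessarily how the eventual fix was implemented; `strip_subset_tag` and `SUBSET_TAG` are hypothetical names, and the input is assumed to be a `str` that has already been decoded.

```python
import re

# Six uppercase ASCII letters followed by "+", anchored at the start of the name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(font_name):
    """Remove a leading font-subset tag such as 'VZWISY+' if one is present."""
    return SUBSET_TAG.sub("", font_name)


georgia = strip_subset_tag("VZWISY+Georgia")
poetica = strip_subset_tag("EOODIA+Poetica")
untouched = strip_subset_tag("Georgia")
```

Anchoring the pattern at the start and requiring exactly six letters leaves names that merely contain a plus sign unchanged.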

pietermarsman mentioned this issue Dec 30, 2019: PDF to HTML conversion issues #72 (Closed)

pietermarsman added the type: bug label Dec 30, 2019

pietermarsman mentioned this issue Jan 9, 2020: Fix font name by removing subset tag #357 (Merged, 5 tasks)

pietermarsman closed this as completed in #357 Jan 16, 2020
