Describe the bug
This is a follow-up to #72. When converting test.pdf to HTML, the browser does not recognize the font because the font name is written out incorrectly.
There are actually two issues here:
The font name has a strange prefix, e.g. "VZWISY+Georgia" instead of "Georgia".
The font-family CSS property has the string representation of a bytes object as its value, including the b'' prefix.
To Reproduce
Convert test.pdf to HTML:
pdf2txt.py test.pdf -t html -o test.html
Observe the span element:
<span style="font-family: b'VZWISY+Georgia'; font-size:12px">The Portable Document Format (PDF) is the world’s leading language for describing
<br>the printed page</span>
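The b'' prefix in the output above is what Python 3 produces when a bytes object is interpolated into a str. The snippet below is a minimal sketch of that effect, not code taken from pdfminer; the variable names and the format string are assumptions chosen to mirror the observed span.

```python
# Hypothetical reproduction of the symptom; not pdfminer's actual code.
fontname = b"VZWISY+Georgia"  # font name held as raw bytes

# Interpolating bytes into a str uses the bytes repr, so the b'' wrapper leaks
# into the CSS value.
style = "font-family: %s; font-size:%dpx" % (fontname, 12)
print(style)  # font-family: b'VZWISY+Georgia'; font-size:12px
```

If this is indeed the cause, decoding the name to str before formatting would remove the prefix.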
Expected behavior
Extract the font name correctly, or rename it in HTMLConverter so that Georgia is recognized by the browser.
I figured out what the strange prefix is. According to Section 5.5.3 of the PDF Reference:
For a font subset, the PostScript name of the font - the value of the font's BaseFont entry and the font descriptor's FontName entry - begins with a tag followed by a plus sign (+). The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags. For example, EOODIA+Poetica is the name of a subset of Poetica, a Type 1 font.
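Based on the rule quoted above (exactly six uppercase letters followed by a plus sign), the subset tag could be stripped before the name is written into the CSS. The following is only a sketch: `clean_font_name` and `SUBSET_TAG` are hypothetical names, not part of pdfminer, and decoding with latin-1 is an assumption about how the bytes name should be turned into text.

```python
import re

# Subset tag per the quoted rule: six uppercase letters, then "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def clean_font_name(fontname):
    """Return a font name suitable for CSS (hypothetical helper)."""
    # Assumption: the name may arrive as bytes; latin-1 decoding never fails.
    if isinstance(fontname, bytes):
        fontname = fontname.decode("latin-1")
    # Drop the subset tag if present; other names pass through unchanged.
    return SUBSET_TAG.sub("", fontname)


print(clean_font_name(b"VZWISY+Georgia"))  # Georgia
print(clean_font_name("EOODIA+Poetica"))   # Poetica
```

Anchoring the pattern at the start of the string keeps names that merely contain a plus sign elsewhere untouched.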