[BUG]: All page lines are merged into a single line #16209

Closed
DoneDeal0 opened this issue Mar 25, 2023 · 4 comments

Comments

@DoneDeal0

DoneDeal0 commented Mar 25, 2023

Attach (recommended) or Link to PDF file here:
pdf example link

Configuration
Osx 10.13.4
"pdfjs-dist": "^3.2.146",

Steps to reproduce the problem

public async readFile(
    pageParser: (data: PDFLine[]) => any
  ): Promise<void> {
    try {
      PDFJS.GlobalWorkerOptions.workerSrc = pdfjsWorker;
      const doc = await PDFJS.getDocument(this.pdfPath).promise;
      const pages = doc.numPages;
      for (let i = 1; i <= pages; i++) {
        const page = await doc.getPage(i);
        const data = await page.getTextContent({
          disableCombineTextItems: true,
        });
        pageParser(data.items as PDFLine[]);
      }
    } catch (err) {
      return err;
    }
  }

What is the expected behavior?
Each line of the pdf should be considered as a unique line.

What went wrong?
All lines of the page are merged into one or two lines (instead of 40+ lines). The line breaks are not properly taken into account. Please note that it doesn't always happen; it depends on the PDF. Sometimes it works. PDF.js parsing is inconsistent.

@Snuffleupagus
Collaborator

"pdfjs-dist": "^3.2.146",

That version is no longer supported, please find the latest releases at https://mozilla.github.io/pdf.js/getting_started/#download

Steps to reproduce the problem

Please see https://github.com/mozilla/pdf.js/blob/master/.github/CONTRIBUTING.md (emphasis mine):

If you are developing a custom solution, first check the examples at https://github.com/mozilla/pdf.js#learning and search existing issues. If this does not help, please prepare a short well-documented example that demonstrates the problem and make it accessible online on your website, JS Bin, GitHub, etc. before opening a new issue or contacting us in the Matrix room -- keep in mind that just code snippets won't help us troubleshoot the problem.

Each line of the pdf should be considered as a unique line.

I checked a couple of different pages in that document with the browser dev-tools, calling the getTextContent method manually, and generally it seemed that "lines" were handled correctly.
It's not really clear here which specific page(s) you're referring to!?

All lines of the page are merged into one or two lines (instead of 40+ lines). The line breaks are not properly taken into account. Please note that it doesn't always happen; it depends on the PDF. Sometimes it works. PDF.js parsing is inconsistent.

That's a limitation of the PDF format itself, since in most cases glyphs are absolutely positioned and there's simply no concept of "lines" in the majority of all PDF documents.
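To illustrate what absolute positioning means for a consumer of getTextContent, here is a minimal sketch that rebuilds lines purely from item positions. It assumes an unrotated page where transform[4] and transform[5] are the x and y of each item, as in the output shown in this thread. The names TextItemLike and groupByBaseline and the tolerance value are hypothetical and are not part of pdf.js.

```typescript
// Minimal shape of a text item; only the fields this sketch needs.
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]
}

// Groups items whose y values lie within `tolerance` of an existing line,
// then orders lines top to bottom and items left to right.
// `tolerance` is an assumed value, not something pdf.js prescribes.
function groupByBaseline(items: TextItemLike[], tolerance = 2): string[] {
  const lines: { y: number; items: TextItemLike[] }[] = [];
  for (const item of items) {
    if (item.str === "") continue; // skip empty marker items
    const y = item.transform[5];
    const line = lines.find((l) => Math.abs(l.y - y) <= tolerance);
    if (line) {
      line.items.push(item);
    } else {
      lines.push({ y, items: [item] });
    }
  }
  // Default PDF user space has its origin at the bottom-left,
  // so a larger y is higher on the page.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((l) =>
    l.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((i) => i.str)
      .join(" ")
  );
}
```

A heuristic like this can misfire on multi-column layouts or rotated text, which is one reason "lines" cannot be recovered reliably from every PDF.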

@DoneDeal0
Author

It's not really clear here which specific page(s) you're referring to!?

All of the pages actually. For example, on page 1 (just after the title page), I should receive:
lines[0].str = "INT. FITTS HOUSE - RICKY'S BEDROOM - NIGHT"
lines[1].str = "On VIDEO: JANE BURNHAM lays in bed, wearing a tank top. She's"
lines[2].str = "sixteen, with dark, intense eyes."
etc.

Instead, I receive only 2 lines. The first one has a huge width of 6559px:
lines[0].str = "INT. FITTS HOUSE - RICKY'S BEDROOM - NIGHT On VIDEO: JANE BURNHAM lays in bed, wearing a tank top. She's (...) Of course, I don't know that yet."
lines[1].str = "(CONTINUED)"

So it doesn't work. I've upgraded pdfjs to the latest version: 3.4.120.

@Snuffleupagus
Collaborator

Opening your PDF document with https://mozilla.github.io/pdf.js/web/viewer.html (using e.g. drag-and-drop) and executing await PDFViewerApplication.pdfViewer.getPageView(1).pdfPage.getTextContent() in the console gives reasonable looking results:

{
  "items": [
    {
      "str": "INT. FITTS HOUSE - RICKY'S BEDROOM - NIGHT",
      "dir": "ltr",
      "width": 294.5279999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        711
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "On VIDEO: JANE BURNHAM lays in bed, wearing a tank top. She's",
      "dir": "ltr",
      "width": 427.67999999999984,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        687
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "sixteen, with dark, intense eyes.",
      "dir": "ltr",
      "width": 231.45599999999988,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        675
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "JANE",
      "dir": "ltr",
      "width": 28.04399999999997,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        252,
        651
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "I need a father who's a role model,",
      "dir": "ltr",
      "width": 245.47199999999978,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        639
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "not some horny geek-boy who's gonna",
      "dir": "ltr",
      "width": 245.47199999999978,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        627
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "spray his shorts whenever I bring a",
      "dir": "ltr",
      "width": 245.47199999999978,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        615
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "girlfriend home from school.",
      "dir": "ltr",
      "width": 196.41599999999977,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        603
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "(snorts)",
      "dir": "ltr",
      "width": 56.08799999999994,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        209.004,
        591
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "What a lame-o. Somebody really",
      "dir": "ltr",
      "width": 210.4319999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        579
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "should put him out of his misery.",
      "dir": "ltr",
      "width": 231.45599999999973,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        567
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "Her mind wanders for a beat.",
      "dir": "ltr",
      "width": 196.41599999999985,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        543
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "RICKY (O.S.)",
      "dir": "ltr",
      "width": 84.15599999999989,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        252,
        519
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "Want me to kill him for you?",
      "dir": "ltr",
      "width": 196.41599999999977,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        507
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "Jane looks at us and sits up.",
      "dir": "ltr",
      "width": 203.42399999999986,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        483
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "JANE",
      "dir": "ltr",
      "width": 28.04399999999997,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        252,
        459
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "(deadpan)",
      "dir": "ltr",
      "width": 63.07199999999993,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        209.004,
        447
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "Yeah, would you?",
      "dir": "ltr",
      "width": 112.13999999999986,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        435
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "FADE TO BLACK:",
      "dir": "ltr",
      "width": 98.14799999999997,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        411
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "FADE IN:",
      "dir": "ltr",
      "width": 56.08799999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        387
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "EXT. ROBIN HOOD TRAIL - EARLY MORNING",
      "dir": "ltr",
      "width": 259.48799999999983,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        351
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "We're FLYING above suburban America, DESCENDING SLOWLY toward",
      "dir": "ltr",
      "width": 427.67999999999995,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        327
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "a tree-lined street.",
      "dir": "ltr",
      "width": 140.1239999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        315
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "LESTER (V.O.)",
      "dir": "ltr",
      "width": 91.15199999999989,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        252,
        291
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "My name is Lester Burnham. This is",
      "dir": "ltr",
      "width": 238.46399999999977,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        279
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "my neighborhood. This is my street.",
      "dir": "ltr",
      "width": 245.4719999999997,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        267
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "This... is my life. I'm forty-two",
      "dir": "ltr",
      "width": 231.4559999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        255
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "years old. In less than a year,",
      "dir": "ltr",
      "width": 217.43999999999977,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        243
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "I'll be dead.",
      "dir": "ltr",
      "width": 91.15199999999993,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        231
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "INT. BURNHAM HOUSE - MASTER BEDROOM - CONTINUOUS",
      "dir": "ltr",
      "width": 336.5759999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        195
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "We're looking down at a king-sized BED from OVERHEAD:",
      "dir": "ltr",
      "width": 371.61599999999976,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        171
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "LESTER BURNHAM lies sleeping amidst expensive bed linens,",
      "dir": "ltr",
      "width": 399.6479999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        147
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "face down, wearing PAJAMAS. An irritating ALARM CLOCK RINGS.",
      "dir": "ltr",
      "width": 420.67199999999997,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        135
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "Lester gropes blindly to shut it off.",
      "dir": "ltr",
      "width": 259.4879999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        108,
        123
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "LESTER (V.O.)",
      "dir": "ltr",
      "width": 91.15199999999989,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        252,
        99
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "Of course, I don't know that yet.",
      "dir": "ltr",
      "width": 231.4559999999998,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        180,
        87
      ],
      "fontName": "g_d1_f3",
      "hasEOL": false
    },
    {
      "str": "",
      "dir": "ltr",
      "width": 0,
      "height": 0,
      "transform": [
        12,
        0,
        0,
        12,
        463,
        39
      ],
      "fontName": "g_d1_f3",
      "hasEOL": true
    },
    {
      "str": "(CONTINUED)",
      "dir": "ltr",
      "width": 77.16000000000007,
      "height": 12,
      "transform": [
        12,
        0,
        0,
        12,
        463,
        39
      ],
      "fontName": "g_d1_f3",
      "hasEOL": false
    }
  ],
  "styles": {
    "g_d1_f3": {
      "fontFamily": "monospace",
      "ascent": 0.79150390625,
      "descent": -0.21630859375,
      "vertical": false
    }
  }
}
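Given output shaped like the above, one way to turn the items into plain lines is to concatenate str values until an item with hasEOL set is reached. This is only a sketch: EolItem and linesFromEol are hypothetical names, and it assumes hasEOL is populated as in this output, which may not hold for every document.

```typescript
// Only the two fields this sketch relies on.
interface EolItem {
  str: string;
  hasEOL: boolean;
}

// Concatenates item text and emits a line whenever hasEOL is true.
// Empty lines are dropped, so blank rows on the page are not preserved.
function linesFromEol(items: EolItem[]): string[] {
  const lines: string[] = [];
  let current = "";
  for (const item of items) {
    current += item.str;
    if (item.hasEOL) {
      if (current !== "") lines.push(current);
      current = "";
    }
  }
  // The final item may have hasEOL: false, as "(CONTINUED)" does above.
  if (current !== "") lines.push(current);
  return lines;
}
```

Note that in the output above the last dialogue line has hasEOL: false and is followed by an empty item with hasEOL: true; accumulating text before checking the flag handles that case.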

@DoneDeal0
Author

Ok, the error came from the option "disableCombineTextItems: true". Thank you very much for your help.
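For readers landing here with the same symptom, a sketch of the reading loop from the original report with the option removed might look like the following. PageLike, DocLike, and readPages are hypothetical names that cover only the calls already used in this thread (numPages, getPage, getTextContent); they are not the pdf.js type definitions.

```typescript
// Hypothetical minimal interfaces, covering only what the loop uses.
interface PageLike {
  getTextContent(params?: Record<string, unknown>): Promise<{ items: unknown[] }>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Reads every page and hands its text items to pageParser.
// getTextContent is called without options, which is the change the
// author reported as fixing the merged lines.
async function readPages(
  doc: DocLike,
  pageParser: (items: unknown[], pageNumber: number) => void
): Promise<void> {
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const data = await page.getTextContent();
    pageParser(data.items, i);
  }
}
```

In the original snippet, the document would come from PDFJS.getDocument(this.pdfPath).promise. Unlike that snippet, this sketch lets errors propagate to the caller instead of returning them from a catch block, so a failed page load is not silently swallowed.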
