Allow text to be selectable/findable #10

Closed
westonruter opened this issue Jun 15, 2011 · 6 comments

Comments

@westonruter

I'm sure this feature has been considered, but this library would be an order of magnitude cooler if the text in the PDF were interactive, that is, if it could be selected or traversed by the browser's find functionality.

I'm sure there are many reasons why the text should be embedded directly into the canvas (e.g. for layering), but could transparent text be layered on top of the canvas to allow it to still be selected? This text can be absolutely positioned and have a color of rgba(0,0,0,0.0). See demo: http://jsfiddle.net/westonruter/UGZWE/
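
A minimal sketch of the overlay idea described above, not pdf.js code. The `textRun` shape (`text`, `x`, `y`, `fontSize`, `fontFamily`) and both function names are assumptions made for illustration. The style computation is kept separate from the DOM work so it can be checked outside a browser.

```javascript
// Hypothetical input shape: { text, x, y, fontSize, fontFamily },
// with x/y in CSS pixels measured from the top-left of the canvas.
function transparentTextStyle(textRun) {
  return {
    position: "absolute",
    left: `${textRun.x}px`,
    top: `${textRun.y}px`,
    fontSize: `${textRun.fontSize}px`,
    fontFamily: textRun.fontFamily,
    color: "rgba(0,0,0,0.0)", // fully transparent, but still real text
    whiteSpace: "pre", // keep spacing exactly as given
  };
}

// Browser-only part: add one <div> per run on top of the canvas.
// `container` is assumed to be a position:relative wrapper that
// holds the canvas and has the same size.
function buildTextLayer(container, textRuns) {
  for (const run of textRuns) {
    const div = document.createElement("div");
    div.textContent = run.text;
    Object.assign(div.style, transparentTextStyle(run));
    container.appendChild(div);
  }
}
```

Because the divs contain ordinary text nodes, the browser's own selection and find-in-page should be able to see them even though nothing is painted.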

@vingtetun
Contributor

This is definitely something we want to do. Chris Jones discusses some possible directions at the end of his blog post http://blog.mozilla.com/cjones/2011/06/15/overview-of-pdf-js-guts/

For the moment we're still learning about PDF and looking at what's missing on the browser side and what existing technologies (such as SVG) can do about it. Nothing has been decided about the right way to implement the selection feature, and we are open to suggestions, and even more open to patches! :)

Also, the text inside a PDF is stored as chunks or individual letters, so to implement a search/selection feature one needs an algorithm that rebuilds the strings and determines which chunks belong together.
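
One naive way to approach that regrouping, as a sketch only: bucket chunks into lines by baseline, then join them left to right and insert a space wherever the horizontal gap is large. The chunk shape (`str`, `x`, `y`, `width`), the function name, and both thresholds are assumptions for illustration, and real PDFs (rotated text, columns, kerning tricks) would need more than this.

```javascript
// Hypothetical chunk shape: { str, x, y, width }
//   x = left edge, y = baseline, width = advance width, all in page units.
function rebuildLines(chunks, lineTolerance = 2, spaceGap = 1.5) {
  // 1. Bucket chunks into lines: baselines within lineTolerance of the
  //    line's first chunk are treated as the same line.
  const lines = [];
  for (const chunk of [...chunks].sort((a, b) => a.y - b.y)) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(chunk.y - last.y) <= lineTolerance) {
      last.chunks.push(chunk);
    } else {
      lines.push({ y: chunk.y, chunks: [chunk] });
    }
  }

  // 2. Within each line, read left to right and add a space when the
  //    gap between two chunks is wider than spaceGap.
  return lines.map((line) => {
    line.chunks.sort((a, b) => a.x - b.x);
    let text = "";
    let prevEnd = null;
    for (const c of line.chunks) {
      if (prevEnd !== null && c.x - prevEnd > spaceGap) text += " ";
      text += c.str;
      prevEnd = c.x + c.width;
    }
    return text;
  });
}

const lines = rebuildLines([
  { str: "world", x: 30, y: 10, width: 25 },
  { str: "Hel", x: 0, y: 10, width: 15 },
  { str: "lo", x: 15, y: 10.5, width: 10 },
  { str: "Next", x: 0, y: 30, width: 20 },
]);
console.log(lines); // [ 'Hello world', 'Next' ]
```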

On my side, I'm busy working on font extraction in order to render Type1 fonts via @font-face (they are not natively supported by the browser) and on rewriting, on the fly, the badly formed TrueType fonts embedded in PDF documents (so that they pass the browser's font sanitizer...), but I would be more than happy to provide directions for implementing something or to discuss a solution.

@joneschrisg
Contributor

Basically, we have two options.
(1) Convert PDF to SVG, let browser do text selection/find. This is obviously attractive because the browser does the "hard work".
(2) Do text selection/find from within pdf.js, on canvas or SVG. This has the highest upside because we can use heuristics specific to known PDFs to decide what text to select. For example, a vertical line extending most of the length of the page is probably a column separator. The browser can't assume these things.

Since (1) is less work for us, we're targeting that first. We'll have to see whether that works well enough for us to drop (2). There are probably many other ways to approach this problem.
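
The column-separator heuristic mentioned in option (2) could look roughly like the sketch below. The segment shape (`x1`, `y1`, `x2`, `y2`), the function names, and the 60% coverage threshold are all assumptions chosen for illustration, not anything pdf.js is known to do.

```javascript
// Hypothetical segment shape: { x1, y1, x2, y2 } for each stroked line
// on the page, in page units.
function findColumnSeparators(segments, pageHeight, minCoverage = 0.6, maxSlant = 1) {
  return segments
    .filter((s) => Math.abs(s.x1 - s.x2) <= maxSlant) // (nearly) vertical
    .filter((s) => Math.abs(s.y1 - s.y2) >= minCoverage * pageHeight) // spans most of the page
    .map((s) => (s.x1 + s.x2) / 2) // x position of the separator
    .sort((a, b) => a - b);
}

// Assign each text chunk (only its left edge `x` is used here) to a
// column, so that selection could stay within one column.
function splitIntoColumns(chunks, separators) {
  const columns = Array.from({ length: separators.length + 1 }, () => []);
  for (const chunk of chunks) {
    // Column index = number of separators to the left of the chunk.
    const index = separators.filter((x) => x <= chunk.x).length;
    columns[index].push(chunk);
  }
  return columns;
}
```

A browser selecting DOM or SVG text has no access to this kind of page-level knowledge, which is the upside the comment points at.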

@notmasteryet
Contributor

Selectable text prototype: https://github.com/notmasteryet/pdf.js/tree/text-1, via div and no-color text. Uses mozCurrentTransform, so it will work only with Beta, Aurora and Nightly. Something to play with...
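
To illustrate why the current transform matters for this approach: a div has to be placed where the canvas actually drew the text, which means mapping the text origin through the canvas transformation matrix. The sketch below assumes the transform is available as the six-element array `[a, b, c, d, e, f]` in the same order as the arguments of the canvas `setTransform(a, b, c, d, e, f)` call; the function name is hypothetical and this is not taken from the prototype branch.

```javascript
// Map a point from user space to canvas pixel space with a 2D affine
// transform given as [a, b, c, d, e, f]:
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
function applyTransform([a, b, c, d, e, f], x, y) {
  return { x: a * x + c * y + e, y: b * x + d * y + f };
}

// Example: scale by 2, flip the y axis (PDF space has y pointing up),
// and shift so the result lands inside the canvas.
const position = applyTransform([2, 0, 0, -2, 10, 500], 5, 100);
console.log(position); // { x: 20, y: 300 }
```

The resulting coordinates could then be used as the `left` and `top` of an absolutely positioned div.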

@notmasteryet
Contributor

@arturadib
Contributor

Added to Milestone.

Who wants to get self-assigned to this issue?

@wfwalker

@arturadib
Contributor

Text selection has been implemented. There's another open issue for text search (see #819). Closing, please reopen if we missed something.

bovardtiberi-wf referenced this issue in Workiva/pdf.js Jul 15, 2013
First round of instructions generated from our artificial canvas context
movsb pushed a commit to movsb/pdf.js that referenced this issue Jul 14, 2018
…pageNumbers

PR 7341 added special handling for `nameddest`s that look like pageNumbers, to prevent issues since we previously *incorrectly* supported specifying a pageNumber directly in the hash; i.e. `#10` versus the correct `#page=10` format.

Since this behaviour wasn't correct, PR 7757 fixed and deprecated the old format, which means that we no longer need to maintain the `nameddest` hack in multiple files.
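
As a rough illustration of the distinction that commit message draws, the sketch below accepts only the `#page=10` form and ignores a bare number in the hash. The function name is hypothetical and this is not the viewer's actual hash handling.

```javascript
// Return the page number from a URL hash such as "#page=10", or null
// when no valid `page` parameter is present (a bare "#10" yields null).
function pageFromHash(hash) {
  const params = new URLSearchParams(hash.replace(/^#/, ""));
  const page = Number.parseInt(params.get("page"), 10);
  return Number.isInteger(page) && page > 0 ? page : null;
}

console.log(pageFromHash("#page=10")); // 10
console.log(pageFromHash("#10")); // null
```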

5 participants