Scanned Paper with Images #224

Closed
mealCode opened this issue Oct 22, 2019 · 1 comment

Comments

mealCode commented Oct 22, 2019

This [pdf link](https://drive.google.com/file/d/1aFdRKjujeHDHTLw9SSGz0keuAhRcgbhR/view?usp=sharing) is a scanned paper with images. Can the images within the scanned paper be extracted? Right now, images are extracted, but they come out as just dark images.

Please see the PDF in the link above. Thanks for your kind support. Cheers

Owner

Hopding commented Dec 23, 2019

Hello @mealCode! pdf-lib does not provide any APIs for image extraction. However, this does not mean that image extraction is impossible to do with pdf-lib. It certainly is possible, but doing so requires writing a substantial amount of logic yourself.

I've provided an example that is able to extract some types of images from PDF files here: #83 (comment). But, as I note in that thread, it doesn't work for all images:

First off, it is possible to extract all image types from a PDF using pdf-lib. The question is, how much code will you have to write on top of pdf-lib to do this. It turns out, you'll have to write a fair amount of code if you want to handle all possible images in any type of PDF file.

pdf.js is an open source PDF rendering engine maintained by Mozilla. It's all written in JavaScript. So, of course, this library must be able to extract and render all types of images. This makes it a very good reference to see how this might be done using pdf-lib.

In particular, its PDFImage class is worth looking at. All of this logic would need to be ported over to use pdf-lib in order to handle all possible types of images. This is because the embedded image format outlined in the PDF specification is pretty long and complicated (as are many things in PDF files).
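To give a rough sense of where that complexity comes from, here is a minimal sketch of one decision an extractor has to make: whether an image stream's bytes can be written to disk as-is, or whether they have to be decoded and re-encoded first. The `planExtraction` helper and its return shape are hypothetical (they are not part of pdf-lib or pdf.js). The filter names are the standard ones from the PDF specification, and the `jp2` extension is just an assumed convention. The sketch also ignores color spaces, `Decode` arrays, and soft masks, all of which can affect how an extracted image looks.

```javascript
// Hypothetical helper, not a pdf-lib or pdf.js API.
// Filters whose stream bytes already form a standalone image file.
const PASSTHROUGH_FILTERS = new Map([
  ['DCTDecode', 'jpg'], // stream data is JPEG
  ['JPXDecode', 'jp2'], // stream data is JPEG 2000 (extension is an assumption)
]);

// `filter` mirrors the /Filter entry of an image stream dictionary:
// absent, a single name, or an array of names applied in order.
function planExtraction(filter) {
  const names = filter == null ? [] : Array.isArray(filter) ? filter : [filter];

  // Only a lone DCTDecode/JPXDecode filter lets us dump the bytes directly.
  if (names.length === 1 && PASSTHROUGH_FILTERS.has(names[0])) {
    return { passthrough: true, extension: PASSTHROUGH_FILTERS.get(names[0]), filters: names };
  }

  // Everything else (no filter, FlateDecode, LZWDecode, CCITTFaxDecode,
  // JBIG2Decode, filter chains, ...) yields raw or specially coded samples
  // that must be decoded and then re-encoded into an image format.
  return { passthrough: false, extension: null, filters: names };
}

console.log(planExtraction('DCTDecode'));
console.log(planExtraction('CCITTFaxDecode'));
console.log(planExtraction(['ASCII85Decode', 'DCTDecode']));
```

Only the first call reports `passthrough: true`. Scanned documents often use filters such as `CCITTFaxDecode` or `JBIG2Decode`, which fall into the second branch, so writing their bytes straight to a file would not be expected to produce a viewable image.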

I do not have the time to port over all the logic myself. I am simply too busy with other pdf-lib maintenance and development work. But perhaps sometime in the future I'll port it over when other more pressing things are taken care of.

Or, if any enterprising individual would like to do it themselves, I would be willing to consider merging a PR for it. Either that, or it could be provided as a standalone "addon" library for pdf-lib. Whichever would make the most sense. I think it would mostly depend on (a) the amount of code required, and (b) whether any new external dependencies would be required.

I hope this helps. Please let me know if you have any additional questions!
