Scanned Paper with Images #224

mealCode · 2019-10-22T11:48:51Z

This [PDF](https://drive.google.com/file/d/1aFdRKjujeHDHTLw9SSGz0keuAhRcgbhR/view?usp=sharing) is a scanned paper with images. Can the images within the scanned paper be extracted? Right now, images are extracted, but they come out as just dark images.

Please see the PDF at the link above. Thanks for your kind support. Cheers

Hopding · 2019-12-23T18:18:00Z

Hello @mealCode! pdf-lib does not provide any APIs for image extraction. However, this does not mean that image extraction is impossible to do with pdf-lib. It certainly is possible, but doing so requires writing a substantial amount of logic yourself.

I've provided an example that is able to extract some types of images from PDF files here: #83 (comment). But, as I note in that thread, it doesn't work for all images:

First off, it is possible to extract all image types from a PDF using pdf-lib. The question is, how much code will you have to write on top of pdf-lib to do this. It turns out, you'll have to write a fair amount of code if you want to handle all possible images in any type of PDF file.
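To give a rough idea of why the amount of code grows, here is a minimal sketch of just the first decision an extractor has to make: what to do with an image stream based on its filters. The `ExportPlan` type and the `planImageExport` function are hypothetical names made up for illustration and are not part of pdf-lib. The filter names are the standard ones from the PDF specification, and the `filters` argument is assumed to hold the names found under `/Filter` in the image dictionary (reading them out of the document is left out here).

```typescript
// Hypothetical types and helper, not part of pdf-lib.
interface ExportPlan {
  writeAsIs: boolean;       // true if the stream bytes already form a standalone image file
  extension: string | null; // suggested file extension when writeAsIs is true
  note: string;             // what still has to be done, or what to watch out for
}

function planImageExport(filters: string[]): ExportPlan {
  if (filters.length === 0) {
    return { writeAsIs: false, extension: null, note: "raw samples; must be wrapped in an image format" };
  }
  if (filters.length > 1) {
    // A filter chain has to be undone in order before the image data is usable.
    return { writeAsIs: false, extension: null, note: `filter chain: ${filters.join(" -> ")}` };
  }
  switch (filters[0]) {
    case "DCTDecode":
      // The stream is JPEG data. It can still look wrong in a viewer if the
      // image dictionary carries extra information such as /Decode or /SMask.
      return { writeAsIs: true, extension: "jpg", note: "JPEG data" };
    case "JPXDecode":
      // JPEG 2000 data; usually viewable as-is, but this is not guaranteed.
      return { writeAsIs: true, extension: "jp2", note: "JPEG 2000 data" };
    case "FlateDecode":
    case "LZWDecode":
    case "RunLengthDecode":
      return { writeAsIs: false, extension: null, note: "decompress, then interpret the raw samples" };
    case "CCITTFaxDecode":
      return { writeAsIs: false, extension: null, note: "fax-encoded data; needs a decoder" };
    case "JBIG2Decode":
      return { writeAsIs: false, extension: null, note: "needs a JBIG2 decoder" };
    case "ASCIIHexDecode":
    case "ASCII85Decode":
      return { writeAsIs: false, extension: null, note: "text-encoded; decode to bytes first" };
    default:
      return { writeAsIs: false, extension: null, note: `unknown filter: ${filters[0]}` };
  }
}

console.log(planImageExport(["DCTDecode"]));
console.log(planImageExport(["FlateDecode"]));
```

Only the first two cases can be saved directly. Every other branch is a place where real decoding logic would have to be written.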

pdf.js is an open source PDF rendering engine maintained by Mozilla. It's all written in JavaScript. So, of course, this library must be able to extract and render all types of images. This makes it a very good reference to see how this might be done using pdf-lib.

In particular, its PDFImage class is worth looking at. All of this logic would need to be ported over to use pdf-lib in order to handle all possible types of images. This is because the embedded image format outlined in the PDF specification is pretty long and complicated (as are many things in PDF files).
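As one small illustration of the kind of detail involved, here is a sketch of expanding packed grayscale samples to one byte per pixel. It is not taken from pdf.js or pdf-lib, and `unpackGraySamples` is a hypothetical name. It assumes a single-component (DeviceGray-like) image whose filters have already been undone, a `bitsPerComponent` of 1, 2, 4, or 8, and a two-element `/Decode` array. It relies on two rules from the PDF specification: samples are packed high-order bit first, and each row starts on a byte boundary.

```typescript
// Hypothetical helper, not part of pdf-lib or pdf.js.
function unpackGraySamples(
  data: Uint8Array,
  width: number,
  height: number,
  bitsPerComponent: 1 | 2 | 4 | 8,
  decode: [number, number] = [0, 1], // default /Decode for a gray image
): Uint8Array {
  const maxSample = (1 << bitsPerComponent) - 1;
  // Rows are padded so that every row begins on a byte boundary.
  const bytesPerRow = Math.ceil((width * bitsPerComponent) / 8);
  if (data.length < bytesPerRow * height) {
    throw new Error("not enough sample data");
  }
  const [dMin, dMax] = decode;
  const out = new Uint8Array(width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const bitIndex = x * bitsPerComponent;
      const byte = data[y * bytesPerRow + (bitIndex >> 3)];
      // Samples are stored high-order bit first within each byte.
      const shift = 8 - bitsPerComponent - (bitIndex & 7);
      const sample = (byte >> shift) & maxSample;
      // Map the sample linearly into the /Decode range, then to 0..255.
      const value = dMin + (sample * (dMax - dMin)) / maxSample;
      out[y * width + x] = Math.round(Math.min(1, Math.max(0, value)) * 255);
    }
  }
  return out;
}

// A 10x2 one-bit image: each row occupies 2 bytes, with 6 padding bits.
const packed = new Uint8Array([0b10100000, 0b01000000, 0xff, 0xc0]);
console.log(Array.from(unpackGraySamples(packed, 10, 2, 1)));
console.log(Array.from(unpackGraySamples(packed, 10, 2, 1, [1, 0]))); // inverted
```

Passing `[1, 0]` as the `decode` argument inverts the output. Color spaces with more components, palettes, masks, and predictors each add further cases on top of this.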

I do not have the time to port over all the logic myself. I am simply too busy with other pdf-lib maintenance and development work. But perhaps sometime in the future I'll port it over when other more pressing things are taken care of.

Or, if any enterprising individual would like to do it themselves, I would be willing to consider merging a PR for it. Either that, or it could be provided as a standalone "addon" library for pdf-lib. Whichever would make the most sense. I think it would mostly depend on (a) the amount of code required, and (b) whether any new external dependencies would be required.

I hope this helps. Please let me know if you have any additional questions!

Hopding closed this as completed Dec 23, 2019

Hopding mentioned this issue Dec 27, 2019

extract and replace image #175

Closed
