Extract images from a pdf page #83
Comments
Hello @totorelmatador! There are a couple of ways to go about this, some more challenging than others. I wrote a Node script that "scans" an existing PDF, finds all the images it contains, and redraws them all on a new page: You just need to unzip the file and run The script will also log some information about each image in the document, e.g.
Let me know if this is what you're looking for, or if you have any questions! |
Hello @Hopding ! Thank you so much for your answer ! This is exactly the kind of thing I was trying to do !
But it doesn't work well. Some saved images can't be opened: only one of the two images is saved in a form that can be opened. I think the cause is the transparency of the image, and I would like to know if it is possible to work around this issue... Thank you again for your time ! |
@totorelmatador Sorry for taking so long to respond to this. I've been swamped with work and school lately, so I haven't had a lot of time to devote to this. However, I've made some progress on creating an example script that shows how to do this (though there are some limitations). I'll try to post a more detailed response soon. |
Thank you a lot for your time @Hopding !
where every |
Hello @totorelmatador! I've finally gotten some time to finish up my investigation into this. First off, it is possible to extract all image types from a PDF using pdf-lib. The question is, how much code will you have to write on top of pdf-lib to do this. It turns out, you'll have to write a fair amount of code if you want to handle all possible images in any type of PDF file.
In particular, it's All of that being said, I created a script that extracts the more common image formats from PDF files. Here it is: You just need to unzip the file and run Again, this does not extract all possible types of images. Just the more common formats. It could certainly be improved by porting some code from I did a bit of googling to see if I think that adding proper support for image extraction would be an interesting feature to implement in |
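Part of the extra code mentioned above comes from the fact that an image stored with the `FlateDecode` filter is compressed sample data, not a PNG file, so it has to be wrapped in a PNG container before a viewer can open it. The following is only a minimal sketch of that wrapping step, not the script from this comment. It assumes the pixels are already inflated into 8-bit RGBA and ignores predictors and colour spaces; `crc32`, `pngChunk` and `encodeRgbaAsPng` are hypothetical helper names, and only Node's built-in `zlib` is used.

```javascript
const zlib = require("zlib");

// CRC-32 lookup table, as required for PNG chunk checksums.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c >>> 0;
});

function crc32(buf) {
  let c = 0xffffffff;
  for (const byte of buf) c = CRC_TABLE[(c ^ byte) & 0xff] ^ (c >>> 8);
  return (c ^ 0xffffffff) >>> 0;
}

// A PNG chunk is: data length, 4-letter type, data, CRC over type + data.
function pngChunk(type, data) {
  const length = Buffer.alloc(4);
  length.writeUInt32BE(data.length, 0);
  const typeAndData = Buffer.concat([Buffer.from(type, "ascii"), data]);
  const crc = Buffer.alloc(4);
  crc.writeUInt32BE(crc32(typeAndData), 0);
  return Buffer.concat([length, typeAndData, crc]);
}

// rgba: Uint8Array (or Buffer) holding width * height * 4 bytes.
function encodeRgbaAsPng(rgba, width, height) {
  const stride = width * 4;
  if (rgba.length !== stride * height) {
    throw new Error("unexpected pixel data size");
  }

  const ihdr = Buffer.alloc(13);
  ihdr.writeUInt32BE(width, 0);
  ihdr.writeUInt32BE(height, 4);
  ihdr[8] = 8; // bit depth
  ihdr[9] = 6; // colour type 6 = RGB with alpha
  // Bytes 10..12 (compression, filter, interlace) stay 0.

  // Every scanline is prefixed with a filter-type byte (0 = no filter).
  const scanlines = Buffer.alloc((stride + 1) * height);
  for (let y = 0; y < height; y++) {
    const rowStart = y * (stride + 1);
    scanlines.set(rgba.subarray(y * stride, (y + 1) * stride), rowStart + 1);
  }

  const signature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
  return Buffer.concat([
    signature,
    pngChunk("IHDR", ihdr),
    pngChunk("IDAT", zlib.deflateSync(scanlines)),
    pngChunk("IEND", Buffer.alloc(0)),
  ]);
}
```

The returned buffer could then be written with `fs.writeFileSync("image.png", png)`. JPEG images (`DCTDecode`) do not need this step, which may be why they are the easier case.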
It does work perfectly, thank you a lot ! |
thanks a lot, this helps me 29/07/2019 :) |
Thank you so much! |
This does not work on scanned pdf. It results in an "unknown compression method error" |
The pages are filter = JBIG2Decode |
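An "unknown compression method" error is likely what you get when data that is not zlib-compressed is passed to an inflate routine, which would fit a `JBIG2Decode` stream. A small guard that inspects the filter name before decoding can make this failure explicit. The filter names below come from the PDF specification; the helper name `classifyImageFilter` and the labels it returns are made up for this sketch, and reading the `Filter` entry from the stream dictionary is not shown.

```javascript
// Hypothetical helper: decide how an image stream should be handled based on
// the name found in its /Filter entry (passed here as a plain string).
function classifyImageFilter(filterName) {
  switch (filterName) {
    case "DCTDecode":
      return { kind: "jpeg", note: "JPEG data, can usually be saved as .jpg as-is" };
    case "JPXDecode":
      return { kind: "jpeg2000", note: "JPEG 2000 data" };
    case "FlateDecode":
      return { kind: "raw-deflate", note: "zlib-compressed samples, must be re-encoded (e.g. as PNG)" };
    case "JBIG2Decode":
      return { kind: "unsupported", note: "JBIG2 (common in scanned PDFs) needs a dedicated decoder" };
    case "CCITTFaxDecode":
      return { kind: "unsupported", note: "CCITT fax data needs a dedicated decoder" };
    default:
      return { kind: "unknown", note: `unhandled filter: ${filterName}` };
  }
}

// Skip, rather than try to inflate, anything that is not zlib data.
const info = classifyImageFilter("JBIG2Decode");
if (info.kind === "unsupported") {
  console.warn(`Skipping image: ${info.note}`);
}
```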
@Hopding thanks for taking the time to post such a useful reply and provide the script. I was wondering if there's an easy way to get the x/y position of the image as well as the width/height? |
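One possible approach, not taken from the scripts in this thread: in a PDF, an image is painted into a unit square that the current transformation matrix maps onto the page, so the position and size follow from that matrix. The sketch below assumes the six matrix numbers have already been obtained from the page's content stream (the `cm` operator in effect before the image's `Do`; nested `cm` operators multiply). Parsing the content stream is not shown, and `imageBoundsFromMatrix` is a hypothetical helper name.

```javascript
// Compute the page-space bounding box of an image from the matrix
// [a, b, c, d, e, f] in effect when the image is painted.
function imageBoundsFromMatrix([a, b, c, d, e, f]) {
  // Corners of the unit square in image space.
  const corners = [[0, 0], [1, 0], [0, 1], [1, 1]];
  // PDF maps (x, y) to (a*x + c*y + e, b*x + d*y + f).
  const xs = corners.map(([x, y]) => a * x + c * y + e);
  const ys = corners.map(([x, y]) => b * x + d * y + f);
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return {
    x: minX,
    y: minY,
    width: Math.max(...xs) - minX,
    height: Math.max(...ys) - minY,
  };
}

// Example: content stream contains "200 0 0 100 50 700 cm" then "/Im0 Do".
console.log(imageBoundsFromMatrix([200, 0, 0, 100, 50, 700]));
```

Using the bounding box of all four corners keeps the result meaningful for rotated or flipped images, where `a` and `d` alone are not the width and height. Coordinates are in PDF user space, with the origin normally at the bottom-left of the page.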
My PDF contains images and tables. I need to remove the images from all pages of the PDF, keep the tables as they are, and save the result as a new document. Is that possible? |
Hi, I tried the solution of this comment :
When saving the PNG files, I noticed that the alphaLayer used is the image itself and not the real alphaLayer that we get So I changed it, and added The problem is now that the image doesn't load completely, as if the smask and the image itself had different dimensions. Has anyone encountered this error before? Thanks :) |
Hi hafsa110, I solved this problem by simply removing the "- 1" in the savePng function. |
Just a small refresh of the proposed code in another context, this one should work as-is in the browser:
|
Is it possible to get the x,y position of the images? |
In the original extract-image project, this image in existing1.pdf:
Any other ideas? Follow-Up |
Hi, I know this is an old thread but I ran into a similar problem. I'm extracting images from a specific page of the pdf to apply additional exif metadata. Next, I put the image buffers back inside the pdf... Except when I extract them again, the exif metadata is completely gone. I'm sure I applied the metadata correctly because if I try to save the image, the metadata is there. The problem therefore arises from re-insertion into the PDF. This is my code for putting back the image into the pdf:
This is how I extract the image from the pdf:
In between, this is the function that puts the new metadata into the image buffer:
Any solution to this? |
Exif data is appended to the end of the image file.
1 it is not part of the image
2 it makes the file larger
I am not sure exif tags can be added to embedded images
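To narrow down where the metadata goes missing, it may help to inspect the raw bytes at each step (before insertion, and right after reading them back from the PDF). For JPEG data specifically, EXIF is normally carried in an APP1 segment (marker `FF E1`, identifier `Exif\0\0`) that sits before the compressed image data. The sketch below only checks for the presence of such a segment; it does not apply to PNG or other formats, and `jpegHasExif` is a hypothetical helper name.

```javascript
// Report whether a JPEG byte array contains an EXIF APP1 segment.
function jpegHasExif(bytes) {
  // A JPEG file starts with the SOI marker FF D8.
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) return false;

  let offset = 2;
  while (offset + 4 <= bytes.length) {
    if (bytes[offset] !== 0xff) return false; // not positioned on a marker
    const marker = bytes[offset + 1];
    // SOS (start of scan) or EOI (end of image): no more header segments.
    if (marker === 0xda || marker === 0xd9) return false;

    // Segment length is big-endian and includes its own two bytes.
    const length = (bytes[offset + 2] << 8) | bytes[offset + 3];
    if (length < 2) return false;

    if (marker === 0xe1) {
      const id = String.fromCharCode(...bytes.slice(offset + 4, offset + 10));
      if (id === "Exif\0\0") return true;
    }
    offset += 2 + length;
  }
  return false;
}
```

If this returns `true` for the buffer you insert but `false` for the buffer you extract, the bytes were changed somewhere between the two steps (for example by a re-encode), rather than by the PDF container itself.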
On Oct 31, 2023, at 12:52 PM, jappoman wrote:
This is my code for putting back the image into the pdf:
const { PDFName, PDFRawStream } = require("pdf-lib");

const replaceImagesInPdf = async (pdfDoc, currentPage, newImages) => {
  console.log(`Replacing images in page ${currentPage}...`);
  console.time("replaceImagesInPdfForPage" + currentPage);
  for (const newImage of newImages) {
    // Cycle through the images of the only page in the PDF.
    const imageData = newImage.data;
    const imageRef = newImage.ref;
    const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
    // Note: `ref` is the position of the object in the enumeration, not a
    // PDFRef, so `imageRef` must be the index recorded at extraction time.
    enumeratedIndirectObjects.forEach(([pdfRef, pdfObject], ref) => {
      if (!(pdfObject instanceof PDFRawStream)) return;
      const { dict } = pdfObject;
      const subtype = dict.get(PDFName.of("Subtype"));
      if (subtype === PDFName.of("Image") && ref == imageRef) {
        // Overwrite the stream bytes of the matching image object.
        pdfObject.contents = imageData;
      }
    });
  }
  console.log("Replaced images into page " + currentPage + ".");
  console.timeEnd("replaceImagesInPdfForPage" + currentPage);
  return pdfDoc;
};
Any solution to this?
|
@K-R-M Could you elaborate on how you got round this issue as I'm facing the same issue with PNGs. I'm already using the Jimp library for other purposes but can't seem to get around the triple image issue. Thanks |
@search-acumen, unfortunately, I don't remember exactly how I did it. I got laid off and no longer have access to the source code that handled this correctly. |
@K-R-M No problem, thanks for replying anyway. Has anyone else managed to solve this issue? |
@search-acumen |
For those who want to extract only the form field images in the PDF and not all of them, I updated yannbertrand's code in the following way (the returned image is in base64 format):
The key point here is that I'm finding the
|
I found this thread very helpful, but unfortunately the code did not work for me. I tried many other options for extracting images from PDFs, but none worked. Most options handle JPG easily but fail on PNG data. Since the technique discussed here at least created PNGs, albeit garbled ones, I decided to debug this solution, which took many hours. I'm sharing what is working for me right now. The core problem was matching and properly indexing the alpha layer to the raw image layer. The original code relied on "ref", which is a number, to match against smaskRef, which is an object. The solution was to use pdfRef to match against smaskRef. There was also a bug in the original code, called out by hafsa110, where the image layer itself was used as the alpha layer instead of the real alpha layer. Because the alpha layer is a single-band greyscale image, not a three-band RGB one, after making this correction we can no longer use the image layer's pixel indexer to reference pixel data in the alpha layer. To solve this I created a new alpha layer pixel indexer. I marked key changes with comments below.
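The indexing point can be illustrated without pdf-lib. This is not the code posted in the comment, only a minimal sketch assuming both layers are already inflated, use 8 bits per component, and have the same pixel dimensions: three bytes per pixel for the image layer and one byte per pixel for the alpha layer. The function and parameter names are hypothetical.

```javascript
// Combine an RGB image layer with a separate single-band alpha layer into
// RGBA, using a distinct index for each layer.
function mergeRgbWithAlpha(rgb, alpha, width, height) {
  const pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new Error("RGB layer has unexpected size");
  }
  if (alpha.length !== pixelCount) {
    throw new Error("alpha layer has unexpected size");
  }

  const rgba = new Uint8Array(pixelCount * 4);
  for (let pixel = 0; pixel < pixelCount; pixel++) {
    const rgbIdx = pixel * 3; // three bytes per pixel in the image layer
    const alphaIdx = pixel; // one byte per pixel in the alpha layer
    const outIdx = pixel * 4; // four bytes per pixel in the output
    rgba[outIdx] = rgb[rgbIdx];
    rgba[outIdx + 1] = rgb[rgbIdx + 1];
    rgba[outIdx + 2] = rgb[rgbIdx + 2];
    rgba[outIdx + 3] = alpha[alphaIdx];
  }
  return rgba;
}
```

Reusing `rgbIdx` to read from the alpha layer would run past the end of the alpha data after a third of the pixels, which is one way an image ends up only partially rendered. The size checks also surface the case where the soft mask and the image have different dimensions.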
|
@thomaspurk for an implementation, check out codesandbox |
I did try pdf-image-extractor, among several others. This module was able to handle the JPEGs in my PDFs but threw an error on the PNGs. I just again verified this using the code sandbox link you provided. It was the same behavior as I saw with my test. It gets the JPGs but not the PNGs |
@thomaspurk have you solved it? |
Absolutely. The code I posted above is working well for me! |
@thomaspurk what version of pdf-lib is your code using? I am running this on node.js 20, and |
Node v20.11.0 As I recall, Hopding's original code (posted elsewhere not in this issue) did not work for me. I can only assume there has been some refactoring to the module's class names over the versions between 0.6.1 and 1.17.1. The current documentation references the use of PDFDocument.load. See the examples here, https://pdf-lib.js.org/ |
Thanks @thomaspurk ! |
Hi everyone !
I am trying to extract all images from a pdf page. I don't know if it is possible, but I would like to do something like this website does.
I am currently manipulating the pdf as follows :
const pdfDoc = PDFDocumentFactory.load('pdf/path');
const pages = pdfDoc.getPages();
const existingPage = pages[0];
Thank you for your answers :)