Can I use pdf.js in node.js to extract the pdf file's data like image, text(or json), fonts ..., and save them in a folder? #7813
Comments
There is https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2svg.js . Closing as answered. |
Hi @yurydelendik, thanks for the link you shared, but I noticed that the script just converts each page of the PDF to an SVG. How about one that extracts the images from all the pages, something like what this does? |
Hi @zeddysoft, I am looking for a way to extract images from a PDF page too. Have you found something? |
Yes, I have. |
Haha nice, can you share it? :) I tried many things, but everything failed x) |
You can find it here: https://github.com/zeddysoft/pdf-processor/blob/master/handler.js Don't forget to star the repo :) |
Thank you, I will check that! And I won't forget the star ;) |
Hi @zeddysoft, could you explain how to use that repo? |
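Guys, I figured it out, or so I think. :D You will need a canvas library to extract the image if you are using Node.js (and maybe a converter), but other than that, I got it working. Sorry the code isn't great; I will explain.
// assumed imports (not shown in the original comment): pdfjs-dist as `pdf` and node-canvas as `ccc`; the pdfjs-dist entry point may differ between versions
const fs = require('fs');
const pdf = require('pdfjs-dist');
const ccc = require('canvas');
// first, open the document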
pdf.getDocument('solar.pdf').promise.then(async function(pdfObj) {
// because I am testing, I just wanted to get page 7
const page = await pdfObj.getPage(7);
// now I need to get the image information and for that I get the operator list
const operators = await page.getOperatorList();
// this matches paintImageXObject; there are other image operators, like paintJpegImage, which I assume work the same way. The result is the list of indexes in the operator list where an image was painted
const rawImgOperator = operators.fnArray.map((f, index) => f === pdf.OPS.paintImageXObject ? index : null).filter((n) => n !== null);
// now you need the image's object name (called filename here). In this example I just pick the first one from my array; your array may be empty, but I knew for sure that page 7 had an image. In real code you would loop. The info is in argsArray: the first arg is the name, followed by the width and height, but the name will suffice here
const filename = operators.argsArray[rawImgOperator[0]][0];
// now we get the object itself from page.objs using the filename
page.objs.get(filename, async (arg) => {
// and here is where we need the canvas, the object contains information such as width and height
const canvas = ccc.createCanvas(arg.width, arg.height);
const ctx = canvas.getContext('2d');
// now you need a new clamped array, because the original one may not contain RGBA data and the canvas wants RGBA. I think a simple check of the array size should work: if it is width*height*3, the data is RGB and must be converted; if it is width*height*4, it is already RGBA and can be used as is. In my case it had to be converted, and I think that will be the most common case
const data = new Uint8ClampedArray(arg.width * arg.height * 4);
let k = 0;
let i = 0;
while (i < arg.data.length) {
data[k] = arg.data[i]; // r
data[k + 1] = arg.data[i + 1]; // g
data[k + 2] = arg.data[i + 2]; // b
data[k + 3] = 255; // a
i += 3;
k += 4;
}
// now create the ImageData object and fill it with the RGBA pixels
const imgData = ctx.createImageData(arg.width, arg.height);
imgData.data.set(data);
ctx.putImageData(imgData, 0, 0);
// get myself a buffer
const buff = canvas.toBuffer();
// and I wrote the file, which worked like a charm. This buffer encodes a PNG image, which can be rather large; an image conversion utility like sharp may give better results by compressing it.
fs.writeFileSync('test.png', buff);
});
}); |
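The comment above mentions two things it leaves out: looping over every image on the page, and checking whether the pixel data is RGB or RGBA before converting. A minimal sketch of both, as plain functions with no pdf.js or canvas dependency, might look like this. The names `collectImageNames` and `toRgba` are hypothetical helpers, not part of pdf.js, and the size check is the heuristic described in the comment, not a documented guarantee.

```javascript
// Hypothetical helper: collect the object names of every image painted on a page.
// `fnArray` and `argsArray` are the arrays returned by page.getOperatorList();
// `imageOps` is the list of operator codes to match, e.g. [pdf.OPS.paintImageXObject].
function collectImageNames(fnArray, argsArray, imageOps) {
  const names = [];
  fnArray.forEach((fn, index) => {
    if (imageOps.includes(fn)) {
      names.push(argsArray[index][0]); // the first arg is the object name
    }
  });
  return names;
}

// Hypothetical helper: return RGBA pixels for an image object with
// `width`, `height` and `data`, deciding the layout from the buffer size.
function toRgba(img) {
  const pixels = img.width * img.height;
  if (img.data.length === pixels * 4) {
    return new Uint8ClampedArray(img.data); // already RGBA, copy as is
  }
  if (img.data.length !== pixels * 3) {
    // Assumption: anything else is a layout this sketch does not handle.
    throw new Error(`Unsupported pixel layout: ${img.data.length} bytes for ${pixels} pixels`);
  }
  const rgba = new Uint8ClampedArray(pixels * 4);
  for (let i = 0, k = 0; i < img.data.length; i += 3, k += 4) {
    rgba[k] = img.data[i];         // r
    rgba[k + 1] = img.data[i + 1]; // g
    rgba[k + 2] = img.data[i + 2]; // b
    rgba[k + 3] = 255;             // a, fully opaque
  }
  return rgba;
}
```

With these, the body of the snippet would iterate over `collectImageNames(operators.fnArray, operators.argsArray, [pdf.OPS.paintImageXObject])`, call `page.objs.get(name, ...)` for each name, and pass `toRgba(arg)` to `imgData.data.set(...)`.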
@onzag Thank you for this solution you provided. Do you know if there is any way to do something similar but without using Canvas? In other words, need to convert |
Just like you said, I want to extract images directly from a PDF file instead of drawing them (with canvas or whatever, I dislike it). But the PDF file format is very complicated; I only succeeded in extracting all BMP images from a PDF file (without canvas, of course). For other image types (say PNG), you need to properly handle the raw data read from the PDF file and construct the image format you want (say BMP). This requires knowledge of image formats, and it just takes too much time to learn ... |
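As a rough illustration of constructing an image format by hand, here is a sketch that writes an uncompressed 24-bit BMP using only Node's `Buffer`, with no canvas. It assumes the input is tightly packed RGB (3 bytes per pixel, top row first), which is the layout the earlier comment reported for its decoded image; `encodeBmp24` is a hypothetical name, and other layouts would need converting first.

```javascript
// Hypothetical helper: encode packed RGB pixels as a 24-bit uncompressed BMP.
function encodeBmp24(rgb, width, height) {
  const rowSize = Math.ceil((width * 3) / 4) * 4; // each row is padded to a multiple of 4 bytes
  const pixelBytes = rowSize * height;
  const headerBytes = 14 + 40; // file header + BITMAPINFOHEADER
  const out = Buffer.alloc(headerBytes + pixelBytes); // zero-filled, so padding stays 0

  // File header
  out.write('BM', 0, 'ascii');
  out.writeUInt32LE(headerBytes + pixelBytes, 2); // total file size
  out.writeUInt32LE(headerBytes, 10);             // offset to pixel data

  // BITMAPINFOHEADER
  out.writeUInt32LE(40, 14);         // header size
  out.writeInt32LE(width, 18);
  out.writeInt32LE(height, 22);      // positive height means rows are stored bottom-up
  out.writeUInt16LE(1, 26);          // color planes
  out.writeUInt16LE(24, 28);         // bits per pixel
  out.writeUInt32LE(0, 30);          // BI_RGB, no compression
  out.writeUInt32LE(pixelBytes, 34); // size of the pixel data

  // Pixel data: bottom row first, each pixel stored as B, G, R
  for (let y = 0; y < height; y++) {
    const dstRow = headerBytes + (height - 1 - y) * rowSize;
    for (let x = 0; x < width; x++) {
      const src = (y * width + x) * 3;
      const dst = dstRow + x * 3;
      out[dst] = rgb[src + 2];     // b
      out[dst + 1] = rgb[src + 1]; // g
      out[dst + 2] = rgb[src];     // r
    }
  }
  return out;
}
```

The result can be written with `fs.writeFileSync('image.bmp', encodeBmp24(arg.data, arg.width, arg.height))`. BMP is uncompressed, so files will be large; it is used here only because the format is simple enough to build by hand.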
I just want to convert a PDF file into a folder that includes text/JSON, images, and fonts, and render them in the browser myself.
Are there any relevant solutions?