Can I use pdf.js in node.js to extract the pdf file's data like image, text(or json), fonts ..., and save them in a folder? #7813

Closed
l5oo00 opened this issue Nov 15, 2016 · 11 comments

Comments

@l5oo00

l5oo00 commented Nov 15, 2016

I just want to convert a PDF file into a folder that includes text/JSON, images, and fonts, and then render them in the browser myself.

Are there relevant solutions?

@yurydelendik
Contributor

There is https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2svg.js . Closing as answered.

@zeddysoft

zeddysoft commented Feb 23, 2019

Hi @yurydelendik, thanks for the link you shared, but I noticed that the script just converts each page of the PDF to an SVG. How about one that extracts the images from all the pages, something like what this does?

@totorelmatador

Hi @zeddysoft, I am looking for a way to extract images from a PDF page too. Have you found something?

@zeddysoft

Yes, I have.

@totorelmatador

Haha nice, and can you share it? :) Because I tried many things... but everything failed x)

@zeddysoft

You can find it here

https://github.com/zeddysoft/pdf-processor/blob/master/handler.js

Don't forget to star the repo :)

@totorelmatador

Thank you, I will check that! And I won't forget the star ;)

@sasiprojs

Hi @zeddysoft, could you explain how to use that repo?

@onzag

onzag commented Jan 8, 2022

Guys guys I figured it out, or so I think. :D

If you are using Node.js, you are going to need a canvas library to extract the image (and maybe an image converter), but other than that, I got it working.

Sorry for the code not being too great; I explain it in the comments.

// imports this snippet relies on; the module paths are assumptions and depend on the versions you have installed
const fs = require('fs');
const ccc = require('canvas'); // node-canvas
const pdf = require('pdfjs-dist/legacy/build/pdf.js');

// first, I open the document
pdf.getDocument('solar.pdf').promise.then(async function(pdfObj) {
  // because I am testing, I just wanted to get page 7
  const page = await pdfObj.getPage(7);

  // to get the image information I need the operator list
  const operators = await page.getOperatorList();

  // this collects the indexes of every paintImageXObject operator, i.e. every place an image was inserted; there are other image operators, such as paintJpegXObject, which I assume work the same way
  const rawImgOperator = operators.fnArray.map((f, index) => f === pdf.OPS.paintImageXObject ? index : null).filter((n) => n !== null);

  // now you need the image's object name, which is the first entry in argsArray for that operator (the width and height follow it); here I just pick the first image because I knew page 7 had one, but your array may be empty, and real code would loop over all of them
  const imgName = operators.argsArray[rawImgOperator[0]][0];

  // now we get the object itself from page.objs using that name
  page.objs.get(imgName, (arg) => {

    // this is where we need the canvas; the object carries information such as width and height
    const canvas = ccc.createCanvas(arg.width, arg.height);
    const ctx = canvas.getContext('2d');

    // the canvas wants RGBA, but the original data may be RGB; a size check tells them apart: width*height*3 bytes is RGB and must be converted, width*height*4 bytes is RGBA and can be used as is; in my case it had to be converted, and I think that will be the most common case
    const pixels = arg.width * arg.height;
    let data;
    if (arg.data.length === pixels * 4) {
      data = new Uint8ClampedArray(arg.data);
    } else {
      // assumed to be RGB; other layouts (e.g. grayscale) are not handled here
      data = new Uint8ClampedArray(pixels * 4);
      for (let i = 0, k = 0; i < arg.data.length; i += 3, k += 4) {
        data[k] = arg.data[i]; // r
        data[k + 1] = arg.data[i + 1]; // g
        data[k + 2] = arg.data[i + 2]; // b
        data[k + 3] = 255; // a
      }
    }

    // now I create the image data and paint it onto the canvas
    const imgData = ctx.createImageData(arg.width, arg.height);
    imgData.data.set(data);
    ctx.putImageData(imgData, 0, 0);

    // get myself a buffer; by default this is PNG-encoded
    const buff = canvas.toBuffer();

    // and I write the file; this worked like a charm, but a PNG can be rather large, so an image conversion utility like sharp may give better results by compressing it
    fs.writeFileSync('test.png', buff);
  });
});
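
The comments in the snippet above mention that real code would loop over every image operator, and that a length check can tell RGB from RGBA data. A minimal sketch of those two steps as plain functions follows. The names collectImageNames and toRgba are hypothetical helpers, not part of pdf.js, and the sketch assumes the operator list has the fnArray/argsArray shape used above.

```javascript
// Hypothetical helpers; neither function is part of pdf.js.

// Return the object name of every image-painting operator in an operator list.
// `imageOps` is the list of operator codes to match, e.g. [pdf.OPS.paintImageXObject].
function collectImageNames(operatorList, imageOps) {
  const names = [];
  operatorList.fnArray.forEach((fn, index) => {
    if (imageOps.includes(fn)) {
      // The first argument of the operator is the image's object name.
      names.push(operatorList.argsArray[index][0]);
    }
  });
  return names;
}

// Expand raw pixel data to RGBA using the length check described above.
function toRgba(data, width, height) {
  const pixels = width * height;
  if (data.length === pixels * 4) {
    return new Uint8ClampedArray(data); // already RGBA, copy as is
  }
  if (data.length !== pixels * 3) {
    throw new Error(`Unsupported pixel layout: ${data.length} bytes for ${pixels} pixels`);
  }
  const rgba = new Uint8ClampedArray(pixels * 4);
  for (let i = 0, k = 0; i < data.length; i += 3, k += 4) {
    rgba[k] = data[i];         // r
    rgba[k + 1] = data[i + 1]; // g
    rgba[k + 2] = data[i + 2]; // b
    rgba[k + 3] = 255;         // a (fully opaque)
  }
  return rgba;
}
```

With these, the body of the snippet could iterate over collectImageNames(operators, [pdf.OPS.paintImageXObject]) and call page.objs.get for each name, passing arg.data through toRgba before putImageData. Throwing on an unexpected length is a design choice here so that grayscale or other layouts are not silently misread as RGB.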

@GitMurf

GitMurf commented Nov 11, 2022

@onzag Thank you for the solution you provided. Do you know if there is any way to do something similar without using Canvas? In other words, I need to convert arg.data to something like a Blob or Base64 string that can be converted or written to a file as an image (PNG or JPEG). I have problems with node-canvas in Electron (long story), so I need a way to convert image data from a PDF to an image file without using Canvas. Is this possible?
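
One possible canvas-free route, sketched here under assumptions rather than taken from the thread: if arg.data holds 8-bit RGB or RGBA pixels (as in the snippet earlier in this issue), a PNG file can be assembled with only Node's built-in zlib module. The function names encodePng, pngChunk, and crc32 are hypothetical. The sketch writes an uncompressed-filter, non-interlaced PNG and does not cover grayscale or palette data.

```javascript
const zlib = require('zlib');

// CRC-32 lookup table; PNG stores a CRC-32 after every chunk.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  return c >>> 0;
});

function crc32(buf) {
  let c = 0xffffffff;
  for (const byte of buf) {
    c = CRC_TABLE[(c ^ byte) & 0xff] ^ (c >>> 8);
  }
  return (c ^ 0xffffffff) >>> 0;
}

// Build one PNG chunk: length, type, data, CRC over type + data.
function pngChunk(type, data) {
  const typeAndData = Buffer.concat([Buffer.from(type, 'ascii'), data]);
  const out = Buffer.alloc(4 + typeAndData.length + 4);
  out.writeUInt32BE(data.length, 0);
  typeAndData.copy(out, 4);
  out.writeUInt32BE(crc32(typeAndData), 4 + typeAndData.length);
  return out;
}

// Encode 8-bit pixel data as PNG. `channels` is 3 for RGB or 4 for RGBA.
// `pixels` must be a typed array such as Uint8Array or Uint8ClampedArray.
function encodePng(pixels, width, height, channels) {
  if (pixels.length !== width * height * channels) {
    throw new Error('Pixel data length does not match width * height * channels');
  }
  const ihdr = Buffer.alloc(13);
  ihdr.writeUInt32BE(width, 0);
  ihdr.writeUInt32BE(height, 4);
  ihdr[8] = 8;                      // bit depth
  ihdr[9] = channels === 4 ? 6 : 2; // color type: 6 = RGBA, 2 = RGB
  // bytes 10..12 (compression, filter, interlace) stay 0

  // Each scanline is prefixed with a filter-type byte; 0 means "no filter".
  const stride = width * channels;
  const raw = Buffer.alloc((stride + 1) * height);
  for (let y = 0; y < height; y++) {
    raw.set(pixels.subarray(y * stride, (y + 1) * stride), y * (stride + 1) + 1);
  }

  return Buffer.concat([
    Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]), // PNG signature
    pngChunk('IHDR', ihdr),
    pngChunk('IDAT', zlib.deflateSync(raw)),
    pngChunk('IEND', Buffer.alloc(0)),
  ]);
}
```

Assuming arg.data is RGB, usage could look like fs.writeFileSync('image.png', encodePng(arg.data, arg.width, arg.height, 3)), and encodePng(...).toString('base64') would give a Base64 string instead of a file.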

@dreamer2q

dreamer2q commented Jan 8, 2023

> @onzag Thank you for the solution you provided. Do you know if there is any way to do something similar without using Canvas? In other words, I need to convert arg.data to something like a Blob or Base64 string that can be converted or written to a file as an image (PNG or JPEG). I have problems with node-canvas in Electron (long story), so I need a way to convert image data from a PDF to an image file without using Canvas. Is this possible?

As you said, I also want to extract images directly from a PDF file instead of drawing them (with canvas or anything else, which I dislike).

However, the PDF file format is quite complicated. So far I have only succeeded in extracting all BMP images from a PDF file (without canvas, of course).

For other image types (say PNG), you need to handle the raw data read from the PDF file properly and construct the image format you want to get (say BMP). This requires knowledge of image file formats, and it just takes too much time to learn ...
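
To illustrate what "constructing the image format" from raw data can involve, here is a minimal sketch of a 24-bit BMP writer in Node.js. It is not the approach used by anyone in this thread; the function name encodeBmp24 is hypothetical, and it assumes the input is tightly packed 8-bit RGB with rows ordered top to bottom.

```javascript
// Hypothetical helper: wrap raw 8-bit RGB pixels in an uncompressed 24-bit BMP.
function encodeBmp24(rgb, width, height) {
  if (rgb.length !== width * height * 3) {
    throw new Error('Expected width * height * 3 bytes of RGB data');
  }
  const rowSize = Math.ceil((width * 3) / 4) * 4; // BMP rows are padded to 4 bytes
  const pixelBytes = rowSize * height;
  const headerSize = 14 + 40; // file header + BITMAPINFOHEADER
  const out = Buffer.alloc(headerSize + pixelBytes);

  // File header
  out.write('BM', 0, 'ascii');
  out.writeUInt32LE(out.length, 2);  // total file size
  out.writeUInt32LE(headerSize, 10); // offset of the pixel data

  // BITMAPINFOHEADER
  out.writeUInt32LE(40, 14);         // header size
  out.writeInt32LE(width, 18);
  out.writeInt32LE(height, 22);      // positive height means rows are stored bottom-up
  out.writeUInt16LE(1, 26);          // color planes
  out.writeUInt16LE(24, 28);         // bits per pixel
  out.writeUInt32LE(0, 30);          // BI_RGB, no compression
  out.writeUInt32LE(pixelBytes, 34); // size of the pixel data

  // Pixel data: bottom row first, each pixel stored as B, G, R.
  for (let y = 0; y < height; y++) {
    const dstRow = headerSize + (height - 1 - y) * rowSize;
    for (let x = 0; x < width; x++) {
      const src = (y * width + x) * 3;
      const dst = dstRow + x * 3;
      out[dst] = rgb[src + 2];     // b
      out[dst + 1] = rgb[src + 1]; // g
      out[dst + 2] = rgb[src];     // r
    }
  }
  return out;
}
```

BMP is used here only because its layout is simple: a fixed header followed by padded rows. Compressed formats such as PNG or JPEG need considerably more work, which matches the point made in the comment above.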
