Can I use pdf.js in node.js to extract the pdf file's data like image, text(or json), fonts ..., and save them in a folder? #7813

l5oo00 · 2016-11-15T07:02:12Z

I just want to convert a PDF file into a folder that includes text/JSON, images, and fonts, and then render them in the browser myself.

Are there relevant solutions?

yurydelendik · 2016-11-15T13:50:04Z

There is https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2svg.js . Closing as answered.

zeddysoft · 2019-02-23T04:21:40Z

Hi @yurydelendik, thanks for the link you shared, but I noticed that the script just converts each page of the PDF to an SVG. How about one that extracts the images from all the pages, something like what this does?

totorelmatador · 2019-03-15T15:50:29Z

Hi @zeddysoft , I am looking for a way to extract images from a pdf page too. Have you found something ?

zeddysoft · 2019-03-18T12:36:05Z

Yes i have

totorelmatador · 2019-03-18T12:50:42Z

Haha nice, and can you share it ? :) Because i tried many things... but everything failed x)

zeddysoft · 2019-03-18T13:02:57Z

You can find it here

https://github.com/zeddysoft/pdf-processor/blob/master/handler.js

Don't forget to star the repo :)

totorelmatador · 2019-03-18T13:12:26Z

Thank you, i will check that ! And i won't forget the star ;)

sasiprojs · 2020-05-12T05:34:31Z

Hi @zeddysoft Could you explain how to use that repo

onzag · 2022-01-08T17:01:08Z

Guys guys I figured it out, or so I think. :D

You are going to need a canvas library to extract the image (if you are using nodejs), (and maybe a converter) but other than that, I got it working.

Sorry for the code not being too great, I will explain.

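// imports were not shown in the original post; these are assumptions (pdfjs-dist, node-canvas, Node's fs), and the pdfjs-dist entry path depends on the installed version
const pdf = require('pdfjs-dist/legacy/build/pdf.js');
const ccc = require('canvas');
const fs = require('fs');

// first here I open the document
pdf.getDocument('solar.pdf').promise.then(async function(pdfObj) {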
   // because I am testing, I just wanted to get page 7
   const page = await pdfObj.getPage(7);

   // now I need to get the image information and for that I get the operator list
   const operators = await page.getOperatorList();

   // this handles the paintImageXObject operator; there are other image operators, like paintJpegXObject, which I assume should work the same way. The result is the list of indexes in the operator list where an image was inserted
   const rawImgOperator = operators.fnArray.map((f, index) => f === pdf.OPS.paintImageXObject ? index : null).filter((n) => n !== null);

   // now you need the image's object name (called "filename" here), which is in the argsArray: the first arg is the name, followed by the width and height, but the name will suffice here. In this example I just picked the first entry from my array; your array may be empty, but I knew for sure that page 7 had an image. In your actual code you would loop over all the indexes
   const filename = operators.argsArray[rawImgOperator[0]][0];

  // now we get the object itself from page.objs using the filename
   page.objs.get(filename, async (arg) => {

      // and here is where we need the canvas, the object contains information such as width and height
       const canvas = ccc.createCanvas(arg.width, arg.height);
       const ctx = canvas.getContext('2d');

       // now you need a new clamped array, because the original one may not contain RGBA data and the canvas expects RGBA. I think a simple check of the array size should work: if the length is width*height*3, it is RGB and must be converted; if it is width*height*4, it is already RGBA and can be used as is. In my case it had to be converted, and I think that will be the most common case
       const data = new Uint8ClampedArray(arg.width * arg.height * 4);
       let k = 0;
       let i = 0;
       while (i < arg.data.length) {
        data[k] = arg.data[i]; // r
        data[k + 1] = arg.data[i + 1]; // g
        data[k + 2] = arg.data[i + 2]; // b
        data[k + 3] = 255; // a

        i += 3;
        k += 4;
       }

       // now here I create the image data context
       const imgData = ctx.createImageData(arg.width, arg.height);
       imgData.data.set(data);
       ctx.putImageData(imgData, 0, 0);

       // get myself a buffer
       const buff = canvas.toBuffer();

       // and I wrote the file, which worked like a charm. The buffer is PNG-encoded and can be rather large; an image conversion utility like sharp may give better results by compressing it.
       // writeFileSync is used here because fs.writeFile throws without a callback in current Node versions
       fs.writeFileSync("test.png", buff);
   });
});
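
The snippet above takes only the first image and assumes RGB data. A minimal sketch of the two generalizations mentioned in its comments (looping over every image operator, and checking whether the data is RGB or RGBA before converting) could look like the following. The helper names `imageObjectNames` and `toRgba` are hypothetical and not part of pdf.js; the sketch only assumes the `fnArray`/`argsArray` shape used above and handles nothing beyond 3 or 4 bytes per pixel.

```javascript
// Hypothetical helpers for illustration; these names are not part of pdf.js.

// Collect the object name of every matching image operator on a page.
// `operatorList` has the { fnArray, argsArray } shape returned by
// page.getOperatorList(); `imageOps` is the list of operator codes to match,
// for example [pdf.OPS.paintImageXObject].
function imageObjectNames(operatorList, imageOps) {
  const names = [];
  operatorList.fnArray.forEach((fn, index) => {
    if (imageOps.includes(fn)) {
      // The first argument of the operator is the object name.
      names.push(operatorList.argsArray[index][0]);
    }
  });
  return names;
}

// Return RGBA pixels for a width x height image whose data is RGB or RGBA.
// Other layouts (grayscale, 1 bit per pixel) are rejected, not converted.
function toRgba(data, width, height) {
  const pixels = width * height;
  if (data.length === pixels * 4) {
    return Uint8ClampedArray.from(data); // already RGBA, copy as is
  }
  if (data.length !== pixels * 3) {
    throw new Error("unsupported pixel layout: " + data.length + " bytes");
  }
  const out = new Uint8ClampedArray(pixels * 4);
  for (let i = 0, k = 0; i < data.length; i += 3, k += 4) {
    out[k] = data[i];         // r
    out[k + 1] = data[i + 1]; // g
    out[k + 2] = data[i + 2]; // b
    out[k + 3] = 255;         // a (fully opaque)
  }
  return out;
}
```

With these, the body of the callback would call `toRgba(arg.data, arg.width, arg.height)` instead of the hand-written loop, once for each name returned by `imageObjectNames(operators, [pdf.OPS.paintImageXObject])`.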

GitMurf · 2022-11-11T22:12:42Z

@onzag Thank you for this solution you provided. Do you know if there is any way to do something similar but without using Canvas? In other words, I need to convert arg.data to something like a blob or b64 that can be converted/written to file as an image (PNG or JPEG). I have problems with Electron using node-canvas (long story), so I need a way to convert image data from a PDF to an image file without using Canvas. Is this possible?

dreamer2q · 2023-01-08T16:25:59Z

Quoting @GitMurf: "Do you know if there is any way to do something similar but without using Canvas? [...] I have problems with Electron using node-canvas (long story), so I need a way to convert image data from a PDF to an image file without using Canvas. Is this possible?"

Just like what you said, I want to directly extract images from a pdf file instead of drawing it (canvas or whatever, I dislike it).

But pdf file format is somehow very complicated, I only succeeded in extracting all bmp images from a pdf file (of course without canvas).

When it comes to other images types (say png), you need to properly handle the raw data (read from a pdf file) and construct the needed image format you want to get (say bmp). This requires the knowledge of handling image formats, and it just takes too much time to learn ...
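
For the canvas-free case asked about above, one possible approach (a sketch, not something confirmed in this thread) is to write the PNG container by hand around Node's built-in zlib. It assumes the pixels are already 8-bit RGBA in a typed array, for example after converting `arg.data` from RGB. The names `crc32`, `chunk`, and `encodePng` are hypothetical helpers; the output uses filter type 0 on every row, so files will be larger than an optimizing encoder would produce.

```javascript
const zlib = require("zlib");

// CRC-32 lookup table, as used for PNG chunk checksums.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  return c >>> 0;
});

function crc32(buf) {
  let c = 0xffffffff;
  for (const byte of buf) {
    c = CRC_TABLE[(c ^ byte) & 0xff] ^ (c >>> 8);
  }
  return (c ^ 0xffffffff) >>> 0;
}

// A PNG chunk is: 4-byte length, 4-byte type, data, CRC over type + data.
function chunk(type, data) {
  const body = Buffer.concat([Buffer.from(type, "ascii"), data]);
  const length = Buffer.alloc(4);
  length.writeUInt32BE(data.length, 0);
  const crc = Buffer.alloc(4);
  crc.writeUInt32BE(crc32(body), 0);
  return Buffer.concat([length, body, crc]);
}

// Encode 8-bit RGBA pixels (Uint8Array or Uint8ClampedArray) as a PNG buffer.
function encodePng(rgba, width, height) {
  if (rgba.length !== width * height * 4) {
    throw new Error("expected width * height * 4 bytes of RGBA data");
  }
  const ihdr = Buffer.alloc(13);
  ihdr.writeUInt32BE(width, 0);
  ihdr.writeUInt32BE(height, 4);
  ihdr[8] = 8; // bit depth
  ihdr[9] = 6; // colour type 6 = RGBA
  // bytes 10..12 (compression, filter, interlace) stay 0

  // Each scanline is prefixed with a filter-type byte (0 = no filter).
  const stride = width * 4;
  const raw = Buffer.alloc((stride + 1) * height);
  for (let y = 0; y < height; y++) {
    raw.set(rgba.subarray(y * stride, (y + 1) * stride), y * (stride + 1) + 1);
  }

  const signature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
  return Buffer.concat([
    signature,
    chunk("IHDR", ihdr),
    chunk("IDAT", zlib.deflateSync(raw)),
    chunk("IEND", Buffer.alloc(0)),
  ]);
}
```

The result could then be written with something like `fs.writeFileSync("image.png", encodePng(rgba, arg.width, arg.height))`, or turned into base64 with `.toString("base64")`. JPEG output would still need a real encoder.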

yurydelendik closed this as completed Nov 15, 2016

Hopding mentioned this issue Apr 28, 2019

Extract images from a pdf page Hopding/pdf-lib#83

Closed
