Extract images from a pdf page #83

Closed
totorelmatador opened this issue Mar 18, 2019 · 32 comments

Comments

@totorelmatador

Hi everyone !
I am trying to extract all images from a pdf page. I don't know if it is possible, but I would like to do something like this website does.
I am currently manipulating the pdf as follows :
const pdfDoc = PDFDocumentFactory.load('pdf/path');
const pages = pdfDoc.getPages();
const existingPage = pages[0];
Thank you for your answers :)

@Hopding
Owner

Hopding commented Mar 21, 2019

Hello @totorelmatador!

There are a couple of ways to go about this, some more challenging than others. I wrote a Node script that "scans" an existing PDF, finds all the images it contains, and redraws them all on a new page:

redraw-images.zip

You just need to unzip the file and run yarn install (or npm install) and then run node index.js. The script will write the new PDF to modified.pdf. Here's what modified.pdf looks like:

modified.pdf

The script will also log some information about each image in the document, e.g.

Images in PDF:
Name: JfImage0001
  Width: 176
  Height: 157
  Bits Per Component: 1
  Data: Uint8Array(1778)
  Ref: 20 0 R
...
Name: JfImage0036
  Width: 556
  Height: 271
  Bits Per Component: 8
  Data: Uint8Array(461)
  Ref: 58 0 R

Let me know if this is what you're looking for, or if you have any questions!

@totorelmatador
Author

Hello @Hopding !

Thank you so much for your answer ! This is exactly the kind of thing I was trying to do !
But I still have a question. My final objective is to save these images as separate files. I tried to do so with the following code (added to your file index.js) :

var i = 0;
imagesInDoc.forEach(image => {
  fs.writeFile("./images/out"+i+".png", image.data, 'base64', function(err) {
    console.log(err);
  });
  i+=1;
});

But it doesn't work well. Saved images can't be opened...
The funny thing is that the code works for some images. When I try on this document :

existing.pdf

Only one of the two images is saved in a form that can be opened. I think that the cause is the transparency of the image, but I would like to know if it is possible to work around this issue...

Thank you again for your time !
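
A possible explanation for why only some of the saved files open: a stream whose Filter is DCTDecode already holds a complete JPEG file, whereas most other image streams hold compressed raw pixel samples with no file header, so writing them to a .png file does not produce a valid PNG. (Node also ignores the encoding argument of fs.writeFile when the data is a buffer, so 'base64' has no effect here.) The sketch below checks for the JPEG and PNG file signatures. sniffImageFile is a hypothetical helper name, not part of pdf-lib.

```javascript
// File signatures ("magic bytes") of the two formats discussed in this thread.
const JPEG_SIGNATURE = [0xff, 0xd8, 0xff];
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// True when `bytes` begins with every value in `signature`.
const startsWith = (bytes, signature) =>
  bytes.length >= signature.length &&
  signature.every((value, i) => bytes[i] === value);

// Hypothetical helper: returns 'jpg' or 'png' when the bytes are already a
// standalone image file, or null when they are raw/compressed pixel data.
function sniffImageFile(bytes) {
  if (startsWith(bytes, JPEG_SIGNATURE)) return 'jpg';
  if (startsWith(bytes, PNG_SIGNATURE)) return 'png';
  return null;
}
```

Only buffers for which this returns a type can be written to disk unchanged; the others need to be decoded and re-encoded first.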

@Hopding
Owner

Hopding commented Apr 10, 2019

@totorelmatador Sorry for taking so long to respond to this. I've been swamped with work and school lately, so I haven't had a lot of time to devote to this. However, I've made some progress on creating an example script that shows how to do this (though there are some limitations). I'll try to post a more detailed response soon.

@totorelmatador
Author

Thank you a lot for your time @Hopding !
I have observed something. When we add a png image in a pdf file, we use the following function:
[imgRef, imgDims] = pdfDoc.embedPNG(PNGimage)
The type of imgRef is PDFIndirectReference, the one of imgDims is PNGXObjectFactory, and PNGimage is an image buffer.
When we find all the image objects in the PDF we use the following code:

pdfDoc.index.index.forEach((pdfObject, ref) => {
  objectIdx += 1;

  if (!(pdfObject instanceof PDFRawStream)) return;

  const { lookupMaybe } = pdfDoc.index;
  const { dictionary: dict } = pdfObject;

  const subtype = lookupMaybe(dict.getMaybe('Subtype'));
  const width = lookupMaybe(dict.getMaybe('Width'));
  const height = lookupMaybe(dict.getMaybe('Height'));
  const name = lookupMaybe(dict.getMaybe('Name'));
  const bitsPerComponent = lookupMaybe(dict.getMaybe('BitsPerComponent'));

  if (subtype === PDFName.from('Image')) {
    imagesInDoc.push({
      ref,
      name: name ? name.key : `Object${objectIdx}`,
      width: width.number,
      height: height.number,
      bitsPerComponent: bitsPerComponent.number,
      data: pdfObject.content,
    });
  }
});

where every pdfObject is a PDFRawStream object and ref is a PDFIndirectReference. Is it possible to extract the image buffer associated with the pair pdfObject & ref?

@Hopding
Owner

Hopding commented Apr 28, 2019

Hello @totorelmatador! I've finally gotten some time to finish up my investigation into this.

First off, it is possible to extract all image types from a PDF using pdf-lib. The question is, how much code will you have to write on top of pdf-lib to do this. It turns out, you'll have to write a fair amount of code if you want to handle all possible images in any type of PDF file.

pdf.js is an open source PDF rendering engine maintained by Mozilla. It's all written in JavaScript. So, of course, this library must be able to extract and render all types of images. This makes it a very good reference to see how this might be done using pdf-lib.

In particular, its PDFImage class is worth looking at. All of this logic would need to be ported over to use pdf-lib in order to handle all possible types of images. This is because the embedded image format outlined in the PDF specification is pretty long and complicated (as are many things in PDF files).


All of that being said, I created a script that extracts the more common image formats from PDF files. Here it is:

extract-images.zip

You just need to unzip the file and run yarn install (or npm install) and then run node index.js existing1.pdf or node index.js existing2.pdf. The script will extract as many embedded images as it can from the PDF into the images/ directory.

Again, this does not extract all possible types of images. Just the more common formats. It could certainly be improved by porting some code from pdf.js.
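
To illustrate one of the "more common formats": a FlateDecode image stream is zlib-compressed sample data, and after inflating it the byte count should match the image geometry, with each row padded to a whole byte. The following is a minimal sketch using only Node's built-in zlib module. expectedRawLength and inflateImageStream are hypothetical names, and the sketch assumes no Predictor is set in DecodeParms (PNG predictors add one extra byte per row, which would show up as a length mismatch here).

```javascript
const zlib = require('zlib');

// Bytes a decoded image stream should contain; every row starts on a byte boundary.
function expectedRawLength(width, height, componentsPerPixel, bitsPerComponent) {
  const bytesPerRow = Math.ceil((width * componentsPerPixel * bitsPerComponent) / 8);
  return bytesPerRow * height;
}

// Inflate a FlateDecode stream and check it against the declared geometry.
function inflateImageStream(contents, { width, height, componentsPerPixel, bitsPerComponent }) {
  const pixels = zlib.inflateSync(contents);
  const expected = expectedRawLength(width, height, componentsPerPixel, bitsPerComponent);
  if (pixels.length !== expected) {
    throw new Error(`Decoded ${pixels.length} bytes, expected ${expected}`);
  }
  return pixels;
}

// Synthetic data standing in for a 4x2, 8-bit DeviceRGB image stream.
const fakeStream = zlib.deflateSync(Buffer.alloc(4 * 2 * 3, 0x7f));
const pixels = inflateImageStream(fakeStream, {
  width: 4,
  height: 2,
  componentsPerPixel: 3,
  bitsPerComponent: 8,
});
console.log(pixels.length); // 24
```

A mismatch between the two lengths is a hint that the stream uses a predictor, an indexed palette, or a different bit depth than the code assumes.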


I did a bit of googling to see if pdf.js has an API to extract images from PDFs. It looks like this may be possible for certain types of images: mozilla/pdf.js#7813 mozilla/pdf.js#7043. But full support doesn't yet seem available.

I think that adding proper support for image extraction would be an interesting feature to implement in pdf-lib. I imagine it would be quite useful to many developers. However, unless somebody from the community decides to work on this, there are several other things I have to work on first. So it'll be a while before this feature lands in pdf-lib.

@Hopding Hopding closed this as completed Apr 28, 2019
@totorelmatador
Author

It does work perfectly, thank you a lot !

@mealCode

thanks a lot, this helps me 29/07/2019 :)

@danielhanford

Thank you so much!
Exactly what I needed and worked perfectly.
Aces!

@mcmspark

mcmspark commented Jul 2, 2020

This does not work on scanned pdf. It results in an "unknown compression method error"

@mcmspark

mcmspark commented Jul 2, 2020

The page images use Filter = JBIG2Decode
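
The "unknown compression method" message is most likely what the inflate step reports when it is handed data that is not zlib-compressed, which would be the case for a JBIG2Decode stream. A minimal sketch of checking the filter name before choosing how to treat a stream follows. The filter names come from the PDF specification, while the handling labels and the planForFilter helper are illustrative assumptions, not part of pdf-lib or the script above.

```javascript
// Image-related stream filters and one possible way to treat each of them.
// The labels on the right are illustrative, not an established vocabulary.
const FILTER_HANDLING = new Map([
  ['DCTDecode', 'jpeg-file'], // stream bytes are a complete JPEG
  ['JPXDecode', 'jpeg2000-data'], // stream bytes are JPEG 2000 data
  ['FlateDecode', 'zlib-pixels'], // inflate, then re-encode the samples
  ['LZWDecode', 'needs-dedicated-decoder'],
  ['RunLengthDecode', 'needs-dedicated-decoder'],
  ['CCITTFaxDecode', 'needs-dedicated-decoder'],
  ['JBIG2Decode', 'needs-dedicated-decoder'],
]);

// Accepts a filter name with or without the leading slash, e.g. '/JBIG2Decode'.
function planForFilter(filterName) {
  const key = String(filterName).replace(/^\//, '');
  return FILTER_HANDLING.get(key) || 'unknown';
}

console.log(planForFilter('/JBIG2Decode')); // needs-dedicated-decoder
```

Skipping or reporting streams that are not 'zlib-pixels' avoids passing them to an inflate call that cannot decode them.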

@jowo-io

jowo-io commented Nov 9, 2020

@Hopding thanks for taking the time to post such a useful reply and provide the script. I was wondering if there's an easy way to get the x/y position of the image as well as the width/height?

@Swapnil-Kunjir

My pdf contains images and tables. I need to remove the images from all pages of the pdf, keep the tables as they are, and save the result as a new document. Is this possible?

@hafsa-dmnt

Hi, I tried the solution from @Hopding's comment above (the extract-images.zip script).

When saving the png files, I noticed that the alphaLayer used is the image itself and not the real alphaLayer that we get

image

So I changed it, and added
image.alphaLayer = smaskimg;

The problem now is that the image doesn't load completely, as if the smask and the image itself had different dimensions. Has anyone encountered this error before?

Thanks :)

ps :
the full image without smask
image

the full image when adding smaskimg
image

@Dragon3DGraff

Hi, hafsa110. I solved this problem by simply removing the "- 1" in the savePng function:
image
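
One guess at the underlying cause: a partially rendered image is what one would expect if the alpha array is read at the colour byte offset (three bytes per pixel) instead of the pixel number (one byte per pixel), because the alpha data then runs out early. The sketch below interleaves the two arrays by pixel index. It assumes 8-bit DeviceRGB colour data and an 8-bit DeviceGray SMask with the same width and height, both already inflated; mergeRgbWithAlpha is a hypothetical helper, not code from the script.

```javascript
// Interleave RGB samples (3 bytes per pixel) with a soft mask (1 byte per pixel).
function mergeRgbWithAlpha(rgb, alpha, width, height) {
  const pixelCount = width * height;
  if (rgb.length < pixelCount * 3) {
    throw new Error('RGB data is shorter than width * height * 3');
  }
  if (alpha.length < pixelCount) {
    throw new Error('Alpha data is shorter than width * height');
  }

  const rgba = new Uint8Array(pixelCount * 4);
  for (let pixel = 0; pixel < pixelCount; pixel++) {
    rgba[pixel * 4] = rgb[pixel * 3];
    rgba[pixel * 4 + 1] = rgb[pixel * 3 + 1];
    rgba[pixel * 4 + 2] = rgb[pixel * 3 + 2];
    rgba[pixel * 4 + 3] = alpha[pixel]; // one alpha byte per pixel
  }
  return rgba;
}

// A 2x1 image: a red pixel at alpha 10 and a green pixel at alpha 20.
const merged = mergeRgbWithAlpha(
  Uint8Array.from([255, 0, 0, 0, 255, 0]),
  Uint8Array.from([10, 20]),
  2,
  1
);
console.log(Array.from(merged)); // [255, 0, 0, 10, 0, 255, 0, 20]
```

The length checks make a dimension mismatch between the image and its SMask fail loudly instead of silently producing transparent pixels.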

@yannbertrand

Just a small refresh of the proposed code in another context, this one should work as-is in the browser:

<html>
  <head>
    <meta charset="utf-8" />
    <script src="https://unpkg.com/[email protected]/browser.js"></script>
    <script src="https://unpkg.com/[email protected]/dist/pdf-lib.js"></script>
    <script src="https://unpkg.com/[email protected]/dist/pako.js"></script>
  </head>
  <body>
    <input type="file" id="ticket" />
    <div id="images"></div>

    <script>
      const fileInput = document.getElementById('ticket');
      const imagesContainer = document.getElementById('images');
      fileInput.addEventListener('change', async (event) => {
        imagesContainer.innerHTML = '';
        const buffer = await event.target.files[0].arrayBuffer();
        await extractPdfImages(buffer);
      });

      const extractPdfImages = async (pdfBytes) => {
        const pdfDoc = await PDFLib.PDFDocument.load(pdfBytes);
        const enumeratedIndirectObjects =
          pdfDoc.context.enumerateIndirectObjects();
        const imagesInDoc = [];
        let objectIdx = 0;
        enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
          objectIdx += 1;

          if (!(pdfObject instanceof PDFLib.PDFRawStream)) return;

          const { dict } = pdfObject;

          const smaskRef = dict.get(PDFLib.PDFName.of('SMask'));
          const colorSpace = dict.get(PDFLib.PDFName.of('ColorSpace'));
          const subtype = dict.get(PDFLib.PDFName.of('Subtype'));
          const width = dict.get(PDFLib.PDFName.of('Width'));
          const height = dict.get(PDFLib.PDFName.of('Height'));
          const name = dict.get(PDFLib.PDFName.of('Name'));
          const bitsPerComponent = dict.get(
            PDFLib.PDFName.of('BitsPerComponent')
          );
          const filter = dict.get(PDFLib.PDFName.of('Filter'));

          if (subtype == PDFLib.PDFName.of('Image')) {
            imagesInDoc.push({
              ref: pdfRef, // store the PDFRef itself; forEach's second argument is only the array index
              smaskRef,
              colorSpace,
              name: name ? name.key : `Object${objectIdx}`,
              width: width.numberValue,
              height: height.numberValue,
              bitsPerComponent: bitsPerComponent.numberValue,
              data: pdfObject.contents,
              type: filter === PDFLib.PDFName.of('DCTDecode') ? 'jpg' : 'png',
            });
          }
        });

        // Find and mark SMasks as alpha layers
        // Note: doesn't work in all PDFs, I decided to remove it
        // imagesInDoc.forEach((image) => {
        //   if (image.type === 'png' && image.smaskRef) {
        //     const smaskImg = imagesInDoc.find(
        //       ({ ref }) => ref === image.smaskRef
        //     );
        //     smaskImg.isAlphaLayer = true;
        //     image.alphaLayer = image;
        //   }
        // });

        // Log info about the images we found in the PDF
        console.log(`===== ${imagesInDoc.length} Images found in PDF =====`);
        imagesInDoc.forEach((image) => {
          console.log(
            'Name:',
            image.name,
            '\n  Type:',
            image.type,
            '\n  Color Space:',
            image.colorSpace.toString(),
            '\n  Has Alpha Layer?',
            image.alphaLayer ? true : false,
            // '\n  Is Alpha Layer?',
            // image.isAlphaLayer || false,
            '\n  Width:',
            image.width,
            '\n  Height:',
            image.height,
            '\n  Bits Per Component:',
            image.bitsPerComponent,
            '\n  Data:',
            `Uint8Array(${image.data.length})`,
            '\n  Ref:',
            image.ref.toString()
          );
        });

        const PngColorTypes = {
          Grayscale: 0,
          Rgb: 2,
          GrayscaleAlpha: 4,
          RgbAlpha: 6,
        };
        const ComponentsPerPixelOfColorType = {
          [PngColorTypes.Rgb]: 3,
          [PngColorTypes.Grayscale]: 1,
          [PngColorTypes.RgbAlpha]: 4,
          [PngColorTypes.GrayscaleAlpha]: 2,
        };

        const readBitAtOffsetOfByte = (byte, bitOffset) => {
          const bit = (byte >> bitOffset) & 1;
          return bit;
        };

        const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
          const byteOffset = Math.floor(bitOffsetWithinArray / 8);
          const byte = uint8Array[uint8Array.length - byteOffset];
          const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8);
          return readBitAtOffsetOfByte(byte, bitOffsetWithinByte);
        };

        const savePng = (image) =>
          new Promise((resolve, reject) => {
            const isGrayscale =
              image.colorSpace === PDFLib.PDFName.of('DeviceGray');
            const colorPixels = pako.inflate(image.data);
            const alphaPixels = image.alphaLayer
              ? pako.inflate(image.alphaLayer.data)
              : undefined;

            const colorType =
              isGrayscale && alphaPixels
                ? PngColorTypes.GrayscaleAlpha
                : !isGrayscale && alphaPixels
                ? PngColorTypes.RgbAlpha
                : isGrayscale
                ? PngColorTypes.Grayscale
                : PngColorTypes.Rgb;

            const colorByteSize = 1;
            const width = image.width * colorByteSize;
            const height = image.height * colorByteSize;
            const inputHasAlpha = [
              PngColorTypes.RgbAlpha,
              PngColorTypes.GrayscaleAlpha,
            ].includes(colorType);

            const pngData = new png.PNG({
              width,
              height,
              colorType,
              inputColorType: colorType,
              inputHasAlpha,
            });

            const componentsPerPixel = ComponentsPerPixelOfColorType[colorType];
            pngData.data = new Uint8Array(width * height * componentsPerPixel);

            let colorPixelIdx = 0;
            let pixelIdx = 0;

            while (pixelIdx < pngData.data.length) {
              if (colorType === PngColorTypes.Rgb) {
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
              } else if (colorType === PngColorTypes.RgbAlpha) {
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1];
              } else if (colorType === PngColorTypes.Grayscale) {
                const bit =
                  readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
                    ? 0x00
                    : 0xff;
                pngData.data[pngData.data.length - pixelIdx++] = bit;
              } else if (colorType === PngColorTypes.GrayscaleAlpha) {
                const bit =
                  readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
                    ? 0x00
                    : 0xff;
                pngData.data[pngData.data.length - pixelIdx++] = bit;
                pngData.data[pngData.data.length - pixelIdx++] =
                  alphaPixels[colorPixelIdx - 1];
              } else {
                throw new Error(`Unknown colorType=${colorType}`);
              }
            }

            const buffer = [];
            pngData
              .pack()
              .on('data', (data) => buffer.push(...data))
              .on('end', () => resolve(Uint8Array.from(buffer)))
              .on('error', (err) => reject(err));
          });

        for (const image of imagesInDoc) {
          if (!image.isAlphaLayer) {
            const imageData =
              image.type === 'jpg' ? image.data : await savePng(image);
            const imgElement = document.createElement('img');
            imgElement.setAttribute(
              'src',
              URL.createObjectURL(
                new Blob([imageData], { type: `image/${image.type}` })
              )
            );
            imgElement.setAttribute('width', image.width);
            imgElement.setAttribute('height', image.height);

            imagesContainer.appendChild(imgElement);
          }
        }
      };
    </script>
  </body>
</html>

@zivni

zivni commented Jul 3, 2023

Is it possible to get the x,y position of the images?

@K-R-M

K-R-M commented Jul 11, 2023

In the original extract-image project, this image in existing1.pdf:
OriginalExtractImage
gets output in triplicate in /images/out21.png, like so:
out21
Does anyone know what causes this? I've got the same issue happening when I extract images from a PDF. I have a feeling it's because this code ignores the /Mask entry and the image's Indexed ColorSpace array, which gives the hival (255) and points to another stream or string holding the palette (in this case object 37 0 R), like in this image dictionary:

<<
/Type /XObject
/Subtype /Image
/Filter /FlateDecode
/Width 567
/Height 234
/BitsPerComponent 8
/Length 8636
/ColorSpace [ /Indexed /DeviceRGB 255 37 0 R ]
/Mask [ 251 251 ]
>>

Any other ideas?
An indexed ColorSpace is described in section 7.6.6.2 of the Acrobat SDK.

Follow-Up
I ended up working around this by using the Jimp library, instead of PNGJS, to handle the output of any image that uses a separate color palette, and it works fine.
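
For an Indexed color space like the one in the dictionary above, each 8-bit sample is an index into a palette rather than a color, so treating the data as 3 bytes per pixel would place three source rows side by side in each output row, which would plausibly explain the triplicate output. A minimal sketch of the expansion follows, assuming 8 bits per component, a DeviceRGB base, and a palette that has already been read and decoded from the lookup object. expandIndexedToRgba is a hypothetical helper, and the optional range models a color key /Mask such as [ 251 251 ].

```javascript
// Expand 8-bit palette indices into RGBA using a DeviceRGB lookup table.
// `palette` holds 3 bytes (R, G, B) per entry; `colorKeyRange` is an optional
// [min, max] pair of indices that should become fully transparent.
function expandIndexedToRgba(indices, palette, colorKeyRange) {
  const [maskMin, maskMax] = colorKeyRange || [Infinity, -Infinity];
  const rgba = new Uint8Array(indices.length * 4);

  indices.forEach((index, pixel) => {
    const base = index * 3;
    if (base + 2 >= palette.length) {
      throw new Error(`Index ${index} is outside the palette`);
    }
    rgba[pixel * 4] = palette[base];
    rgba[pixel * 4 + 1] = palette[base + 1];
    rgba[pixel * 4 + 2] = palette[base + 2];
    rgba[pixel * 4 + 3] = index >= maskMin && index <= maskMax ? 0 : 255;
  });
  return rgba;
}

// Two-entry palette (red, green); index 1 is masked out by the color key.
const examplePalette = Uint8Array.from([255, 0, 0, 0, 255, 0]);
const expanded = expandIndexedToRgba(Uint8Array.from([0, 1, 1]), examplePalette, [1, 1]);
console.log(Array.from(expanded)); // [255, 0, 0, 255, 0, 255, 0, 0, 0, 255, 0, 0]
```

The resulting RGBA buffer has four bytes per pixel and can be handed to a PNG encoder configured for RGB with alpha.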

@jappoman

jappoman commented Oct 31, 2023

Hi, I know this is an old thread but I ran into a similar problem. I'm extracting images from a specific page of the pdf to apply additional exif metadata. Next, I put the image buffers back inside the pdf... Except when I extract them again, the exif metadata is completely gone. I'm sure I applied the metadata correctly because if I try to save the image, the metadata is there. The problem therefore arises from re-insertion into the PDF.

This is my code for putting back the image into the pdf:

const replaceImagesInPdf = async (pdfDoc, currentPage, newImages) => {
  console.log(`Replacing images in page ${currentPage}...`);
  console.time("replaceImagesInPdfForPage" + currentPage);

  for (let newImage of newImages) {
    // Cycling through the images of the only page in the pdf
    const imageData = newImage.data;
    const imageRef = newImage.ref;

    const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
    let objectIdx = 0;
    enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
      objectIdx += 1;

      if (!(pdfObject instanceof PDFRawStream)) return;

      const { dict } = pdfObject;
      const subtype = dict.get(PDFName.of("Subtype"));

      if (subtype == PDFName.of("Image") && ref == imageRef) {
        pdfObject.contents = imageData;
      }
    });
  }

  console.log("Replaced images into page " + currentPage + ".");
  console.timeEnd("replaceImagesInPdfForPage" + currentPage);

  return pdfDoc;
};

This is how I extract the image from the pdf:

const indexPDFImages = async (pdfDoc) => {
  const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
  const imagesInDoc = [];
  let objectIdx = 0;

  enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
    objectIdx += 1;

    if (!(pdfObject instanceof PDFRawStream)) return;

    const { dict } = pdfObject;

    const subtype = dict.get(PDFName.of("Subtype"));
    if (subtype !== PDFName.of("Image")) return; // If it's not an image, return

    const filter = dict.get(PDFName.of("Filter"));
    let imageType = null;

    switch (filter) {
      case PDFName.of("DCTDecode"):
        imageType = "jpg";
        break;
      case PDFName.of("FlateDecode"):
        imageType = "png";
        break;
      case PDFName.of("JPXDecode"):
        imageType = "jpeg2000"; // JPX is typically used for JPEG2000 in PDFs
        break;
      // ... Add more cases for other PDF stream filters (e.g. CCITTFaxDecode, JBIG2Decode)
      default:
        console.log(
          `Unsupported image format detected for ref: ${pdfRef}. Filter used: ${filter}`
        );
        return; // Unsupported filter: skip this image
    }

    // Extract other image information
    const smaskRef = dict.get(PDFName.of("SMask"));
    const colorSpace = dict.get(PDFName.of("ColorSpace"));
    const width = dict.get(PDFName.of("Width"));
    const height = dict.get(PDFName.of("Height"));
    const name = dict.get(PDFName.of("Name"));
    const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"));

    imagesInDoc.push({
      ref,
      smaskRef,
      colorSpace,
      name: name ? name.key : `Object${objectIdx}`,
      width: width.numberValue,
      height: height.numberValue,
      pxsize: width.numberValue * height.numberValue,
      bitsPerComponent: bitsPerComponent.numberValue,
      data: pdfObject.contents,
      type: imageType,
    });
  });

  return imagesInDoc;
};

In between, you have the function that puts the new metadata into the image buffer:

async function generateImageMetadataWatermark(
  imageBufferObj,
  currentPage,
  watermark,
) {
  console.log(`Generating ImageMetadataWatermark for page ${currentPage}...`);
  console.time("generateImageMetadataWatermarkForPage" + currentPage);
  try {
    // Extracting image data and reference
    const actualImageBuffer = imageBufferObj.image;
    const imageRef = imageBufferObj.ref;

    //Convert the full image buffer to base 64
    const base64Image =
      "data:image/jpeg;base64," + actualImageBuffer.toString("base64");
    const exifObj = piexifjs.load(base64Image);

    // Add watermark string in the EXIF data. Using "0th" ImageDescription.
    exifObj["0th"][piexifjs.ImageIFD.ImageDescription] = watermark;
    // Create new EXIF binary string
    const exifBytes = piexifjs.dump(exifObj);
    // Insert the new EXIF data into the image
    const newImageBase64 = piexifjs.insert(exifBytes, base64Image);
    // Convert base64 image to buffer
    const newImageBuffer = Buffer.from(newImageBase64.split(",")[1], "base64");

    // Returning the modified image
    const modifiedImage = {
      watermarkType: "imageMetadata",
      ref: imageRef,
      data: newImageBuffer,
    };
    console.log(`ImageMetadataWatermark for page ${currentPage} generated.`);
    console.timeEnd("generateImageMetadataWatermarkForPage" + currentPage);
    return modifiedImage;
  } catch (e) {
    throw e;
  }
}

Any solution to this?
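
One way to narrow down where the metadata disappears is to check for an Exif segment in the buffer right before it is assigned to pdfObject.contents, and again in the buffer read back after saving and reloading the PDF. The sketch below walks the JPEG marker segments looking for an APP1 segment that starts with "Exif". hasExifSegment is a hypothetical diagnostic helper, and the parser is simplified: it assumes well-formed segments and stops at the start of the scan data.

```javascript
// Walk JPEG marker segments and report whether an APP1 "Exif" segment exists.
function hasExifSegment(jpeg) {
  // Every JPEG starts with the SOI marker FF D8.
  if (jpeg.length < 4 || jpeg[0] !== 0xff || jpeg[1] !== 0xd8) return false;

  let offset = 2;
  while (offset + 4 <= jpeg.length && jpeg[offset] === 0xff) {
    const marker = jpeg[offset + 1];
    if (marker === 0xda || marker === 0xd9) break; // start of scan / end of image

    // Big-endian segment length; it counts the two length bytes themselves.
    const length = (jpeg[offset + 2] << 8) | jpeg[offset + 3];
    if (length < 2) break; // malformed segment

    if (marker === 0xe1) {
      const id = String.fromCharCode(...jpeg.slice(offset + 4, offset + 8));
      if (id === 'Exif') return true;
    }
    offset += 2 + length;
  }
  return false;
}

// Synthetic JPEG header: SOI, a small APP0 segment, an APP1 "Exif" segment, EOI.
const withExif = Uint8Array.from([
  0xff, 0xd8,
  0xff, 0xe0, 0x00, 0x04, 0x00, 0x00,
  0xff, 0xe1, 0x00, 0x08, 0x45, 0x78, 0x69, 0x66, 0x00, 0x00,
  0xff, 0xd9,
]);
console.log(hasExifSegment(withExif)); // true
```

If the check passes before insertion but fails on the re-extracted bytes, the buffer being read back is probably not the one that was written.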

@mcmspark

mcmspark commented Nov 1, 2023 via email

@search-acumen

@K-R-M Could you elaborate on how you got round this issue as I'm facing the same issue with PNGs. I'm already using the Jimp library for other purposes but can't seem to get around the triple image issue. Thanks

@K-R-M

K-R-M commented Mar 14, 2024

@search-acumen, unfortunately, I don't remember exactly how I did it. I got laid off and no longer have access to the source code that handled this correctly.

@search-acumen

@K-R-M No problem, thanks for replying anyway. Has anyone else managed to solve this issue?

@devanshsinghvaluecoders

@search-acumen
try this package to extract the images
https://www.npmjs.com/package/pdf-image-extractor

@AhmadrezaHK

AhmadrezaHK commented Apr 22, 2024

For those who want to extract only the form field images in the pdf and not all of them, I updated yannbertrand's code in the following way (the returned image is in base64 format):

export async function extractFormImages(pdfDoc, imageFieldNameList) {
  const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects()
  const imagesInDoc = []
  let objectIdx = 0

  const form = pdfDoc.getForm()
  const imageRefMap = new Map()

  imageFieldNameList.forEach((fName) => {
    const image = form
      .getButton(fName)
      .acroField.getWidgets()[0]
      .getAppearances()?.normal

    const imageRef = [
      ...image.dict
        .get(PDFName.of("Resources"))
        .dict.get(PDFName.of("XObject"))
        .dict.values(),
    ][0]

    imageRefMap.set(imageRef.toString(), fName)
  })

  enumeratedIndirectObjects.forEach(([pdfRef, pdfObject], ref) => {
    objectIdx += 1

    if (!(pdfObject instanceof PDFRawStream)) return

    const { dict } = pdfObject

    const smaskRef = dict.get(PDFName.of("SMask"))
    const colorSpace = dict.get(PDFName.of("ColorSpace"))
    const subtype = dict.get(PDFName.of("Subtype"))
    const width = dict.get(PDFName.of("Width"))
    const height = dict.get(PDFName.of("Height"))
    const name = dict.get(PDFName.of("Name"))
    const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"))
    const filter = dict.get(PDFName.of("Filter"))

    if (subtype == PDFName.of("Image") && imageRefMap.has(pdfRef.toString())) {
      imagesInDoc.push({
        ref: pdfRef, // store the PDFRef itself; forEach's second argument is only the array index
        smaskRef,
        colorSpace,
        name: name ? name.key : `Object${objectIdx}`,
        width: width.numberValue,
        height: height.numberValue,
        bitsPerComponent: bitsPerComponent.numberValue,
        data: pdfObject.contents,
        type: filter === PDFName.of("DCTDecode") ? "jpg" : "png",
        fieldName: imageRefMap.get(pdfRef.toString()),
      })
    }
  })

  const PngColorTypes = {
    Grayscale: 0,
    Rgb: 2,
    GrayscaleAlpha: 4,
    RgbAlpha: 6,
  }
  const ComponentsPerPixelOfColorType = {
    [PngColorTypes.Rgb]: 3,
    [PngColorTypes.Grayscale]: 1,
    [PngColorTypes.RgbAlpha]: 4,
    [PngColorTypes.GrayscaleAlpha]: 2,
  }

  const readBitAtOffsetOfByte = (byte, bitOffset) => {
    const bit = (byte >> bitOffset) & 1
    return bit
  }

  const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
    const byteOffset = Math.floor(bitOffsetWithinArray / 8)
    const byte = uint8Array[uint8Array.length - byteOffset]
    const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8)
    return readBitAtOffsetOfByte(byte, bitOffsetWithinByte)
  }

  const savePng = (image) =>
    new Promise((resolve, reject) => {
      const isGrayscale = image.colorSpace === PDFName.of("DeviceGray")
      const colorPixels = pako.inflate(image.data)
      const alphaPixels = image.alphaLayer
        ? pako.inflate(image.alphaLayer.data)
        : undefined

      const colorType =
        isGrayscale && alphaPixels
          ? PngColorTypes.GrayscaleAlpha
          : !isGrayscale && alphaPixels
          ? PngColorTypes.RgbAlpha
          : isGrayscale
          ? PngColorTypes.Grayscale
          : PngColorTypes.Rgb

      const colorByteSize = 1
      const width = image.width * colorByteSize
      const height = image.height * colorByteSize
      const inputHasAlpha = [
        PngColorTypes.RgbAlpha,
        PngColorTypes.GrayscaleAlpha,
      ].includes(colorType)

      const pngData = new png.PNG({
        width,
        height,
        colorType,
        inputColorType: colorType,
        inputHasAlpha,
      })

      const componentsPerPixel = ComponentsPerPixelOfColorType[colorType]
      pngData.data = new Uint8Array(width * height * componentsPerPixel)

      let colorPixelIdx = 0
      let pixelIdx = 0

      while (pixelIdx < pngData.data.length) {
        if (colorType === PngColorTypes.Rgb) {
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
        } else if (colorType === PngColorTypes.RgbAlpha) {
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1]
        } else if (colorType === PngColorTypes.Grayscale) {
          const bit =
            readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
              ? 0x00
              : 0xff
          pngData.data[pngData.data.length - pixelIdx++] = bit
        } else if (colorType === PngColorTypes.GrayscaleAlpha) {
          const bit =
            readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
              ? 0x00
              : 0xff
          pngData.data[pngData.data.length - pixelIdx++] = bit
          pngData.data[pngData.data.length - pixelIdx++] =
            alphaPixels[colorPixelIdx - 1]
        } else {
          throw new Error(`Unknown colorType=${colorType}`)
        }
      }

      const buffer = []
      pngData
        .pack()
        .on("data", (data) => buffer.push(...data))
        .on("end", () => resolve(Uint8Array.from(buffer)))
        .on("error", (err) => reject(err))
    })

  let result = {}
  for (const img of imagesInDoc) {
    if (!img.isAlphaLayer) {
      const imageData = img.type === "jpg" ? img.data : await savePng(img)

      const imageBase64 = await new Promise((resolve, reject) => {
        const reader = new FileReader()
        reader.onloadend = () => resolve(reader.result)
        reader.onerror = reject
        reader.readAsDataURL(
          // "image/jpeg" is the registered MIME type; "image/jpg" is not
          new Blob([imageData], {
            type: img.type === "jpg" ? "image/jpeg" : "image/png",
          })
        )
      })
      result[img.fieldName] = imageBase64
    }
  }

  return result
}

The key point here is that I find the PDFRef of the image attached to each form field and use it to recognise the matching PDFObject:

imageFieldNameList.forEach((fName) => {
  const image = form
    .getButton(fName)
    .acroField.getWidgets()[0]
    .getAppearances()?.normal

  const imageRef = [
    ...image.dict
      .get(PDFName.of("Resources"))
      .dict.get(PDFName.of("XObject"))
      .dict.values(),
  ][0]

  imageRefMap.set(imageRef.toString(), fName)
})

.
.
.

if (subtype == PDFName.of("Image") && imageRefMap.has(pdfRef.toString())) {

.
.
.
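
To illustrate the lookup described above in isolation, here is a minimal sketch that needs no PDF at all. The makeRef helper and the field names are made up for the example; the only assumption carried over from the real code is that a reference prints as a string such as "20 0 R", so two separate reference objects for the same PDF object produce the same Map key.

```javascript
// Stand-in for a pdf-lib reference: only toString() matters for this sketch.
// makeRef is a hypothetical helper, not a pdf-lib function.
const makeRef = (objectNumber, generationNumber = 0) => ({
  toString: () => `${objectNumber} ${generationNumber} R`,
});

// Step 1: record which form field each image reference belongs to.
// The field names are invented for the example.
const imageRefMap = new Map();
imageRefMap.set(makeRef(20).toString(), "signatureField");
imageRefMap.set(makeRef(58).toString(), "logoField");

// Step 2: while walking every indirect object, keep only the ones whose
// reference string was recorded in step 1.
const indirectObjects = [
  [makeRef(12), { kind: "font" }],
  [makeRef(20), { kind: "image" }],
  [makeRef(58), { kind: "image" }],
];

const matched = indirectObjects
  .filter(([ref]) => imageRefMap.has(ref.toString()))
  .map(([ref]) => ({
    ref: ref.toString(),
    fieldName: imageRefMap.get(ref.toString()),
  }));

console.log(matched);
```

Keying the Map by the string form rather than by the object avoids depending on whether two references to the same PDF object are the same JavaScript instance.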

@thomaspurk

thomaspurk commented Apr 25, 2024

I found this thread very helpful, but unfortunately the code did not work for me. I tried many other options for extracting images from PDFs, and none worked: most handle JPG easily but fail on PNG data. Since the technique discussed here at least produced PNGs, albeit garbled ones, I decided to debug it, which took many hours. I'm sharing what is working for me right now.

The core problem was matching the alpha layer to the raw image layer and indexing it properly. The original code tried to match "ref", which is a number, against smaskRef, which is an object. The solution was to match pdfRef against smaskRef instead. There was also a bug in the original code, called out by hafsa110, where the image itself was assigned as its own alpha layer instead of the SMask image. Because the alpha layer is a single-band greyscale image rather than a three-band RGB image, once this is corrected we can no longer use the image layer's pixel index to read pixel data from the alpha layer. To solve this, I created a separate pixel index for the alpha layer. I marked the key changes with comments below.

const fs = require("fs");
const { PDFDocument, PDFRawStream, PDFName } = require("pdf-lib");
const rimraf = require("rimraf");
const { PNG } = require("pngjs");
const pako = require("pako");

async function getImageFromPdf(inPath) {
  const existingPdfBytes = fs.readFileSync(inPath);
  const pdfDoc = await PDFDocument.load(existingPdfBytes);
  const imagesInDoc = [];

  pdfDoc.context
    .enumerateIndirectObjects()
    .forEach(([pdfRef, pdfObject], ref) => {
      if (!(pdfObject instanceof PDFRawStream)) {
        return;
      }
      const { dict } = pdfObject;
      const smaskRef = dict.get(PDFName.of("SMask"));
      const colorSpace = dict.get(PDFName.of("ColorSpace"));
      const subtype = dict.get(PDFName.of("Subtype"));
      const width = dict.get(PDFName.of("Width"));
      const height = dict.get(PDFName.of("Height"));
      const name = dict.get(PDFName.of("Name"));
      const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"));
      const filter = dict.get(PDFName.of("Filter"));

      if (subtype == PDFName.of("Image")) {
        imagesInDoc.push({
          pdfRef, // added, must use pdfRef to locate alpha layers
          ref,
          smaskRef,
          colorSpace,
          name: name ? name.key : `Object${ref}`,
          width: width.numberValue,
          height: height.numberValue,
          bitsPerComponent: bitsPerComponent.numberValue,
          data: pdfObject.contents,
          type: filter === PDFName.of("DCTDecode") ? "jpg" : "png",
        });
      }
    });

  // Log info about the images we found in the PDF
  console.log(`===== ${imagesInDoc.length} Images found in PDF =====`);
  imagesInDoc.forEach((image) => {
    // Find and mark SMasks as alpha layers
    if (image.type === "png" && image.smaskRef) {
      const smaskImg = imagesInDoc.find((sm) => {
        return image.smaskRef == sm.pdfRef; // ref cannot match to smaskRef, must use pdfRef
      });
      if (smaskImg) {
        smaskImg.isAlphaLayer = true;
        //image.alphaLayer = image; // change suggested by hafsa110, but it creates an alpha layer pixel indexing problem (see savePng)
        image.alphaLayer = smaskImg;
      }
    }
  });

  imagesInDoc.forEach((image) => {
    // Log the details of each image

    console.log(
      "Name:",
      image.name,
      "\n  Type:",
      image.type,
      "\n  Color Space:",
      image.colorSpace.toString(),
      "\n  Has Alpha Layer?",
      image.alphaLayer ? image.alphaLayer : false,
      "\n  Is Alpha Layer?",
      image.isAlphaLayer, // change, true or undefined
      "\n  SmaskRef:",
      image.smaskRef, // added to debug the smaskRef
      "\n  Width:",
      image.width,
      "\n  Height:",
      image.height,
      "\n  Bits Per Component:",
      image.bitsPerComponent,
      "\n  Data:",
      `Uint8Array(${image.data.length})`,
      "\n  Ref:",
      image.ref.toString()
    );
  });

  // changed to hard code my folder
  rimraf("./pdf2json/test//*.{jpg,png}", async (err) => {
    if (err) console.error(err);
    else {
      for (const img of imagesInDoc) {
        if (!img.isAlphaLayer) {
          const imageData = img.type === "jpg" ? img.data : await savePng(img);
          fs.writeFileSync(`./pdf2json/test/${img.ref}.` + img.type, imageData);
        }
      }
      console.log();
      console.log("Images written to ./pdf2json/test/");
    }
  });

  console.log("done");
}

const PngColorTypes = {
  Grayscale: 0,
  Rgb: 2,
  GrayscaleAlpha: 4,
  RgbAlpha: 6,
};

const ComponentsPerPixelOfColorType = {
  [PngColorTypes.Rgb]: 3,
  [PngColorTypes.Grayscale]: 1,
  [PngColorTypes.RgbAlpha]: 4,
  [PngColorTypes.GrayscaleAlpha]: 2,
};

const readBitAtOffsetOfByte = (byte, bitOffset) => {
  const bit = (byte >> bitOffset) & 1;
  return bit;
};

const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
  const byteOffset = Math.floor(bitOffsetWithinArray / 8);
  const byte = uint8Array[uint8Array.length - byteOffset];
  const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8);
  return readBitAtOffsetOfByte(byte, bitOffsetWithinByte);
};

const savePng = (image) =>
  new Promise((resolve, reject) => {
    const isGrayscale = image.colorSpace === PDFName.of("DeviceGray");
    const colorPixels = pako.inflate(image.data);
    const alphaPixels = image.alphaLayer
      ? pako.inflate(image.alphaLayer.data)
      : undefined;

    // prettier-ignore
    const colorType =
        isGrayscale  && alphaPixels ? PngColorTypes.GrayscaleAlpha
      : !isGrayscale && alphaPixels ? PngColorTypes.RgbAlpha
      : isGrayscale                 ? PngColorTypes.Grayscale
      : PngColorTypes.Rgb;

    const colorByteSize = 1;
    const width = image.width * colorByteSize;
    const height = image.height * colorByteSize;
    const inputHasAlpha = [
      PngColorTypes.RgbAlpha,
      PngColorTypes.GrayscaleAlpha,
    ].includes(colorType);

    const png = new PNG({
      width,
      height,
      colorType,
      inputColorType: colorType,
      inputHasAlpha,
    });

    const componentsPerPixel = ComponentsPerPixelOfColorType[colorType];
    png.data = new Uint8Array(width * height * componentsPerPixel);

    let colorPixelIdx = 0;
    let alphaPixelIdx = 0; // added: new index tracker for the alpha layer
    let pixelIdx = 0;
    // prettier-ignore
    while (pixelIdx < png.data.length) {
      if (colorType === PngColorTypes.Rgb) {
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
      } 
      else if (colorType === PngColorTypes.RgbAlpha) {
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        //png.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1]; // must reference alpha layer pixel index here
        png.data[pixelIdx++] = alphaPixels[alphaPixelIdx++]; // post-increment already yields 0 first, so no "- 1"

      } 
      else if (colorType === PngColorTypes.Grayscale) {
        const bit = readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0 
          ? 0x00 
          : 0xff;
        png.data[png.data.length - (pixelIdx++)] = bit
      } 
      else if (colorType === PngColorTypes.GrayscaleAlpha) {
        const bit =
          readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
            ? 0x00
            : 0xff;
        png.data[png.data.length - pixelIdx++] = bit;
        //png.data[png.data.length - pixelIdx++] = alphaPixels[colorPixelIdx - 1]; // must reference alpha layer pixel index here
        png.data[png.data.length - pixelIdx++] = alphaPixels[alphaPixelIdx++]; // post-increment already yields 0 first, so no "- 1"
      } 
      else {
        throw new Error(`Unknown colorType=${colorType}`);
      }
    }

    const buffer = [];
    png
      .pack()
      .on("data", (data) => buffer.push(...data))
      .on("end", () => resolve(Buffer.from(buffer)))
      .on("error", (err) => reject(err));
  });

const pdfSource = "./documents/1960782.pdf";
getImageFromPdf(pdfSource);
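
The separate alpha index described above can be sketched on its own, without pdf-lib or pngjs. This is only an illustration under stated assumptions: 8 bits per component, an RGB colour layer, and an alpha layer with the same pixel dimensions. The interleaveRgbAlpha name is made up for the example.

```javascript
// Minimal sketch: interleave an RGB buffer with a separate one-channel alpha
// buffer, advancing an independent index for each input.
// interleaveRgbAlpha is a hypothetical helper, not part of pdf-lib or pngjs.
function interleaveRgbAlpha(rgbPixels, alphaPixels) {
  const pixelCount = alphaPixels.length;
  if (rgbPixels.length !== pixelCount * 3) {
    throw new Error("RGB data must hold exactly 3 bytes per alpha byte");
  }

  const rgba = new Uint8Array(pixelCount * 4);
  let rgbIdx = 0; // advances 3 bytes per pixel
  let alphaIdx = 0; // advances 1 byte per pixel
  let outIdx = 0; // advances 4 bytes per pixel

  while (outIdx < rgba.length) {
    rgba[outIdx++] = rgbPixels[rgbIdx++];
    rgba[outIdx++] = rgbPixels[rgbIdx++];
    rgba[outIdx++] = rgbPixels[rgbIdx++];
    rgba[outIdx++] = alphaPixels[alphaIdx++];
  }
  return rgba;
}

// Two pixels: red at roughly half opacity, blue fully opaque.
const rgb = Uint8Array.from([255, 0, 0, 0, 0, 255]);
const alpha = Uint8Array.from([128, 255]);
const rgbaPixels = interleaveRgbAlpha(rgb, alpha);
console.log(Array.from(rgbaPixels));
```

Reusing the colour index for the alpha layer would read every third alpha byte and then run past the end of the buffer, which is one way to end up with garbled transparency.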

@devanshsingh7727

@thomaspurk
Try this to get PNG and JPEG images from a PDF file:
https://www.npmjs.com/package/pdf-image-extractor

For an implementation, check out the CodeSandbox.

@thomaspurk

thomaspurk commented Apr 26, 2024

> @thomaspurk Try this to get PNG and JPEG images from a PDF file: https://www.npmjs.com/package/pdf-image-extractor
>
> For an implementation, check out the CodeSandbox.

I did try pdf-image-extractor, among several others. This module was able to handle the JPEGs in my PDFs but threw an error on the PNGs.

I just verified this again using the CodeSandbox link you provided, and it behaved the same as in my test: it gets the JPGs but not the PNGs.
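
One plausible reason for the JPG/PNG difference, going by the code in this thread: a DCTDecode stream is already JPEG data and can be written to disk unchanged, while the other image streams handled here are Flate-compressed raw samples that have to be inflated and then wrapped in a PNG container. The sketch below shows that difference with made-up sample data, using Node's built-in zlib as a stand-in for pako. It ignores predictors, indexed colour spaces and other PDF details that a real extractor would need to handle.

```javascript
const zlib = require("zlib");

// JPEG data begins with the start-of-image marker 0xFF 0xD8.
const startsWithJpegMarker = (bytes) =>
  bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xd8;

// Made-up sample data: two RGB pixels (red, blue) as raw 8-bit samples,
// deflated the way a Flate-compressed image stream would store them.
const rawSamples = Uint8Array.from([255, 0, 0, 0, 0, 255]);
const flateStream = zlib.deflateSync(rawSamples);

// The compressed stream is not an image file on its own...
console.log(startsWithJpegMarker(flateStream));

// ...and inflating it yields bare samples that still need a PNG wrapper
// (signature, header and data chunks), which is the job pngjs does in savePng.
const inflatedSamples = zlib.inflateSync(flateStream);
console.log(Array.from(inflatedSamples));
```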

@Alexufo

Alexufo commented May 14, 2024

@thomaspurk have you solved it?

@thomaspurk

> @thomaspurk have you solved it?

Absolutely. The code I posted above is working well for me!

@hanifanggawi

> > @thomaspurk have you solved it?
>
> Absolutely. The code I posted above is working well for me!

@thomaspurk what version of pdf-lib is your code using? I am running this on Node.js 20 with pdf-lib version 0.6.1, and I get a TypeError: PDFDocument.load is not a function. The original code that Hopding wrote does not have this issue, since it uses PDFDocumentFactory to load the PDF.

@thomaspurk

> > > @thomaspurk have you solved it?
> >
> > Absolutely. The code I posted above is working well for me!
>
> @thomaspurk what version of pdf-lib is your code using? I am running this on Node.js 20 with pdf-lib version 0.6.1, and I get a TypeError: PDFDocument.load is not a function. The original code that Hopding wrote does not have this issue, since it uses PDFDocumentFactory to load the PDF.

Node v20.11.0
pdf-lib 1.17.1

As I recall, Hopding's original code (posted elsewhere, not in this issue) did not work for me. I can only assume the module's class names were refactored somewhere between versions 0.6.1 and 1.17.1. The current documentation uses PDFDocument.load; see the examples at https://pdf-lib.js.org/
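
If a script has to run against either version, one option is to pick whichever loader the installed module exposes. This is only a sketch: pickPdfLoader is a made-up helper, and the two stubs stand in for the module shapes reported in this thread (PDFDocumentFactory.load on 0.6.1, PDFDocument.load on 1.17.1) rather than the real library.

```javascript
// Hypothetical helper: choose the document loader exposed by whichever
// pdf-lib version is installed. pdfLib is the module object, e.g. the
// result of require("pdf-lib").
function pickPdfLoader(pdfLib) {
  if (pdfLib.PDFDocument && typeof pdfLib.PDFDocument.load === "function") {
    // Newer releases (1.17.1 in this thread): PDFDocument.load
    return (bytes) => pdfLib.PDFDocument.load(bytes);
  }
  if (
    pdfLib.PDFDocumentFactory &&
    typeof pdfLib.PDFDocumentFactory.load === "function"
  ) {
    // Older releases (0.6.1 in this thread): PDFDocumentFactory.load
    return (bytes) => pdfLib.PDFDocumentFactory.load(bytes);
  }
  throw new Error("No known pdf-lib document loader found");
}

// Stubs standing in for the two module shapes, so the sketch runs
// without pdf-lib installed.
const modernStub = { PDFDocument: { load: async () => "modern document" } };
const legacyStub = { PDFDocumentFactory: { load: () => "legacy document" } };

const modernResult = pickPdfLoader(modernStub)(new Uint8Array());
const legacyResult = pickPdfLoader(legacyStub)(new Uint8Array());
console.log(modernResult instanceof Promise, legacyResult);
```

Calling the result with await works for both shapes, since await accepts plain values as well as promises. The rest of the API may also differ between these versions, so upgrading pdf-lib is probably the simpler route.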

@LucaSorvillo

Thanks @thomaspurk !
