This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Pdfminer incorrectly raises PDFNotImplementedError: Unsupported filter error #174

Open
marcelhekking opened this issue Mar 8, 2017 · 0 comments

Comments

@marcelhekking

I am trying to get the text of this PDF:
factuur___0098559514___20170303.pdf.
It results in an error (pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: PDFObjRef:18). However, on inspecting the object PDFObjRef:18, I found that calling its resolve method returns a one-item list containing a valid filter. When this resolved filter is used instead of the reference object, pdfminer runs without error.

How to reproduce:
pdf2txt.py -o text.txt factuur___0098559514___20170303.pdf results in an error.

Change made to module pdftypes in order to run it without error:

    def decode(self):
        assert self.data is None and self.rawdata is not None
        data = self.rawdata
        if self.decipher:
            # Handle encryption
            data = self.decipher(self.objid, self.genno, data)
        filters = self.get_filters()
        if not filters:
            self.data = data
            self.rawdata = None
            return
        for f in filters:
            params = self.get_any(('DP', 'DecodeParms', 'FDecodeParms'), {})

            # ----difference with original `decode` method starts here--------
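            try:
                # f may be an indirect reference (PDFObjRef) that resolves
                # to a one-item list holding the actual filter name
                f = f.resolve()[0]
            except AttributeError:
                # f is already a direct filter name; use it unchanged
                pass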
            # ----and ends here-----------------------------------------------

            if f in LITERALS_FLATE_DECODE:
                # will get errors if the document is encrypted.
                try:
                    data = zlib.decompress(data)
                except zlib.error as e:
                    if STRICT:
                        raise PDFException('Invalid zlib bytes: %r, %r' % (e, data))
                    data = ''
            elif f in LITERALS_LZW_DECODE:
                data = lzwdecode(data)
            elif f in LITERALS_ASCII85_DECODE:
                data = ascii85decode(data)
            elif f in LITERALS_ASCIIHEX_DECODE:
                data = asciihexdecode(data)
            elif f in LITERALS_RUNLENGTH_DECODE:
                data = rldecode(data)
            elif f in LITERALS_CCITTFAX_DECODE:
                data = ccittfaxdecode(data, params)
            elif f == LITERAL_CRYPT:
                # not yet..
                raise PDFNotImplementedError('/Crypt filter is unsupported')
            else:
                raise PDFNotImplementedError('Unsupported filter: %r' % f)
            # apply predictors
            if 'Predictor' in params:
                pred = int_value(params['Predictor'])
                if pred == 1:
                    # no predictor
                    pass
                elif 10 <= pred:
                    # PNG predictor
                    colors = int_value(params.get('Colors', 1))
                    columns = int_value(params.get('Columns', 1))
                    bitspercomponent = int_value(params.get('BitsPerComponent', 8))
                    data = apply_png_predictor(pred, colors, columns, bitspercomponent, data)
                else:
                    raise PDFNotImplementedError('Unsupported predictor: %r' % pred)
        self.data = data
        self.rawdata = None
        return
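
The change above assumes the reference always resolves to a one-item list. A slightly more general variant, shown here only as a sketch, would resolve each filter entry and flatten whatever it yields before the `if f in ...` checks run. The names `FakeObjRef`, `resolve_all` and `flatten_filters` are hypothetical and not part of pdfminer; `FakeObjRef` merely stands in for an object with a `resolve()` method, and plain strings stand in for pdfminer's filter literals.

```python
class FakeObjRef(object):
    """Hypothetical stand-in for an indirect reference such as PDFObjRef."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        # Return the object this reference points to.
        return self._target


def resolve_all(obj):
    """Follow .resolve() until an object without that method is reached."""
    while hasattr(obj, 'resolve'):
        obj = obj.resolve()
    return obj


def flatten_filters(filters):
    """Return a flat list of direct filter names.

    Each entry may be a direct name, a reference to a name, or a
    reference to a (possibly nested) list of names.
    """
    flat = []
    for f in filters:
        f = resolve_all(f)
        if isinstance(f, (list, tuple)):
            flat.extend(flatten_filters(f))
        else:
            flat.append(f)
    return flat


if __name__ == '__main__':
    ref = FakeObjRef(['FlateDecode'])
    print(flatten_filters([ref]))
    print(flatten_filters(['ASCII85Decode', ref]))
```

With something along these lines, the loop in `decode` could iterate over the flattened list, so a reference that resolves to several filters would not silently lose all but the first one. Whether real documents contain such references is an assumption I have not verified.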