Class COSParser

    • Field Detail

      • PDF_DEFAULT_VERSION

        private static final java.lang.String PDF_DEFAULT_VERSION
        See Also:
        Constant Field Values
      • FDF_DEFAULT_VERSION

        private static final java.lang.String FDF_DEFAULT_VERSION
        See Also:
        Constant Field Values
      • XREF_TABLE

        private static final char[] XREF_TABLE
      • XREF_STREAM

        private static final char[] XREF_STREAM
      • STARTXREF

        private static final char[] STARTXREF
      • ENDSTREAM

        private static final byte[] ENDSTREAM
      • ENDOBJ

        private static final byte[] ENDOBJ
      • strmBuf

        private final byte[] strmBuf
      • keyStoreInputStream

        private java.io.InputStream keyStoreInputStream
      • password

        private java.lang.String password
      • keyAlias

        private java.lang.String keyAlias
      • SYSPROP_PARSEMINIMAL

        public static final java.lang.String SYSPROP_PARSEMINIMAL
        Only parse the PDF file minimally allowing access to basic information.
        See Also:
        Constant Field Values
      • SYSPROP_EOFLOOKUPRANGE

        public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
        The range within which the %%EOF marker will be searched. Useful if there are additional characters after %%EOF within the PDF.
        See Also:
        Constant Field Values
      • DEFAULT_TRAIL_BYTECOUNT

        private static final int DEFAULT_TRAIL_BYTECOUNT
        How many trailing bytes to read for EOF marker.
        See Also:
        Constant Field Values
      • EOF_MARKER

        protected static final char[] EOF_MARKER
        EOF-marker.
      • OBJ_MARKER

        protected static final char[] OBJ_MARKER
        obj-marker.
      • TRAILER_MARKER

        private static final char[] TRAILER_MARKER
        trailer-marker.
      • OBJ_STREAM

        private static final char[] OBJ_STREAM
        ObjStream-marker.
      • trailerOffset

        private long trailerOffset
      • fileLen

        protected long fileLen
        file length.
      • isLenient

        private boolean isLenient
        Whether the parser uses its auto-healing capability.
      • initialParseDone

        protected boolean initialParseDone
      • trailerWasRebuild

        private boolean trailerWasRebuild
      • bfSearchCOSObjectKeyOffsets

        private java.util.Map<COSObjectKey,​java.lang.Long> bfSearchCOSObjectKeyOffsets
        Contains all found objects of a brute force search.
      • lastEOFMarker

        private java.lang.Long lastEOFMarker
      • bfSearchXRefTablesOffsets

        private java.util.List<java.lang.Long> bfSearchXRefTablesOffsets
      • bfSearchXRefStreamsOffsets

        private java.util.List<java.lang.Long> bfSearchXRefStreamsOffsets
      • securityHandler

        protected SecurityHandler securityHandler
        The security handler.
      • readTrailBytes

        private int readTrailBytes
        how many trailing bytes to read for EOF marker.
      • LOG

        private static final org.apache.commons.logging.Log LOG
      • xrefTrailerResolver

        protected XrefTrailerResolver xrefTrailerResolver
        Collects all Xref/trailer objects and resolves them into a single object using the startxref reference.
      • TMP_FILE_PREFIX

        public static final java.lang.String TMP_FILE_PREFIX
        The prefix for the temp file being used.
        See Also:
        Constant Field Values
      • streamCopyBuf

        private final byte[] streamCopyBuf
    • Constructor Detail

      • COSParser

        public COSParser​(RandomAccessRead source)
        Default constructor.
        Parameters:
        source - input representing the pdf.
      • COSParser

        public COSParser​(RandomAccessRead source,
                         java.lang.String password,
                         java.io.InputStream keyStore,
                         java.lang.String keyAlias)
        Constructor for encrypted pdfs.
        Parameters:
        source - input representing the pdf.
        password - password to be used for decryption.
        keyStore - key store to be used for decryption when using public key security
        keyAlias - alias to be used for decryption when using public key security
    • Method Detail

      • setEOFLookupRange

        public void setEOFLookupRange​(int byteCount)
        Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

        The new value is checked to be at least 16. For practical use cases, however, it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.

        If the system property SYSPROP_EOFLOOKUPRANGE is defined, this value is set on initialization but can be overwritten later.

        Parameters:
        byteCount - number of trailing bytes
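
        The following is a minimal, self-contained sketch of the behaviour described above, not the PDFBox implementation. The class name, the property name "sketch.eoflookuprange", the default of 2048 bytes, and the choice to silently ignore values below 16 are all assumptions made for illustration.

```java
final class EofLookupRangeSketch {
    // Assumed default; this section does not state the value of DEFAULT_TRAIL_BYTECOUNT.
    static final int ASSUMED_DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Hypothetical property name used only by this sketch.
    static final String SKETCH_PROPERTY = "sketch.eoflookuprange";

    private int readTrailBytes = ASSUMED_DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Apply the system property on initialization, if it is defined and numeric.
        String configured = System.getProperty(SKETCH_PROPERTY);
        if (configured != null) {
            try {
                setEOFLookupRange(Integer.parseInt(configured.trim()));
            } catch (NumberFormatException e) {
                // Not a number: keep the default.
            }
        }
    }

    // Accepts the new range only if it is at least 16 bytes (assumption: smaller values are ignored).
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= 16) {
            readTrailBytes = byteCount;
        }
    }

    int getReadTrailBytes() {
        return readTrailBytes;
    }
}
```

        A caller can still override the range after construction, which mirrors the "can be overwritten later" note above.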
      • retrieveTrailer

        protected COSDictionary retrieveTrailer()
                                         throws java.io.IOException
        Read the trailer information and provide a COSDictionary containing the trailer information.
        Returns:
        a COSDictionary containing the trailer information
        Throws:
        java.io.IOException - if something went wrong
      • parseXref

        protected COSDictionary parseXref​(long startXRefOffset)
                                   throws java.io.IOException
        Parses cross reference tables.
        Parameters:
        startXRefOffset - start offset of the first table
        Returns:
        the trailer dictionary
        Throws:
        java.io.IOException - if something went wrong
      • parseXrefObjStream

        private long parseXrefObjStream​(long objByteOffset,
                                        boolean isStandalone)
                                 throws java.io.IOException
        Parses an xref object stream starting with indirect object id.
        Returns:
        value of PREV item in dictionary or -1 if no such item exists
        Throws:
        java.io.IOException
      • getStartxrefOffset

        protected final long getStartxrefOffset()
                                         throws java.io.IOException
        Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)) and then go back to find startxref.
        Returns:
        the offset of StartXref
        Throws:
        java.io.IOException - If something went wrong.
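
        The lookup described above can be sketched on an in-memory byte array. This is an illustrative simplification, assuming the whole file fits in memory; the class and method names are hypothetical and the real parser works on a RandomAccessRead source.

```java
import java.nio.charset.StandardCharsets;

final class StartxrefSketch {
    // Returns the number that follows the last "startxref" located before the
    // last "%%EOF" inside the trailing window, or -1 if either marker is missing.
    static long findStartxref(byte[] file, int trailByteCount) {
        int windowStart = Math.max(0, file.length - trailByteCount);
        String tail = new String(file, windowStart, file.length - windowStart,
                StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Go back from the EOF marker to find the keyword.
        int marker = tail.lastIndexOf("startxref", eof);
        if (marker < 0) {
            return -1;
        }
        int pos = marker + "startxref".length();
        while (pos < eof && Character.isWhitespace(tail.charAt(pos))) {
            pos++;
        }
        int digitsStart = pos;
        while (pos < eof && tail.charAt(pos) >= '0' && tail.charAt(pos) <= '9') {
            pos++;
        }
        if (pos == digitsStart) {
            return -1;
        }
        try {
            return Long.parseLong(tail.substring(digitsStart, pos));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a valid offset
        }
    }
}
```

        With trailing garbage after %%EOF, a window that is too small misses the marker entirely, which is why the lookup range is configurable.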
      • lastIndexOf

        protected int lastIndexOf​(char[] pattern,
                                  byte[] buf,
                                  int endOff)
        Searches for the last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.
        Parameters:
        pattern - pattern to search for
        buf - buffer to search pattern in
        endOff - offset (exclusive) where lookup starts at
        Returns:
        start offset of pattern within buffer or -1 if pattern could not be found
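
        A straightforward backward search with the same signature might look like the sketch below. It is not the PDFBox code; clamping endOff to the buffer length and returning -1 for an empty pattern are assumptions of this example.

```java
final class LastIndexOfSketch {
    // Finds the last occurrence of pattern that ends at or before endOff (exclusive).
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0) {
            return -1; // assumption: an empty pattern is never found
        }
        int end = Math.min(endOff, buf.length);
        // Try every candidate start position, walking back towards 0.
        for (int start = end - pattern.length; start >= 0; start--) {
            boolean match = true;
            for (int i = 0; i < pattern.length; i++) {
                if (buf[start + i] != (byte) pattern[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```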
      • isLenient

        public boolean isLenient()
        Returns true if the parser is lenient, meaning the auto-healing capabilities of the parser are used.
        Returns:
        true if parser is lenient
      • setLenient

        public void setLenient​(boolean lenient)
        Change the parser leniency flag. This method can only be called before the parsing of the file.
        Parameters:
        lenient - try to handle malformed PDFs.
      • getObjectId

        private long getObjectId​(COSObject obj)
        Creates a unique object id using object number and object generation number (requires object number < 2^31).
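
        One common way to combine two numbers into a single unique long is to place the object number in the upper 32 bits and the generation number in the lower 32 bits. The sketch below assumes that packing scheme for illustration; the names are hypothetical.

```java
final class ObjectIdSketch {
    // Packs the object number into the high 32 bits and the generation number
    // into the low 32 bits. The result stays non-negative and unique as long
    // as objectNumber < 2^31.
    static long objectId(long objectNumber, int generationNumber) {
        return (objectNumber << 32) | (generationNumber & 0xFFFFFFFFL);
    }
}
```

        The restriction on the object number follows from the shift: a value of 2^31 or more would spill into the sign bit of the long.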
      • addNewToList

        private void addNewToList​(java.util.Queue<COSBase> toBeParsedList,
                                  java.util.Collection<COSBase> newObjects,
                                  java.util.Set<java.lang.Long> addedObjects)
        Adds each element of newObjects to toBeParsedList if it is not a COSObject or if this COSObject has not been added already (checked via addedObjects).
      • addNewToList

        private void addNewToList​(java.util.Queue<COSBase> toBeParsedList,
                                  COSBase newObject,
                                  java.util.Set<java.lang.Long> addedObjects)
        Adds newObject to toBeParsedList if it is not a COSObject or if this COSObject has not been added already (checked via addedObjects). Simple objects are not added because nothing is done with them when toBeParsedList is processed.
      • parseDictObjects

        protected void parseDictObjects​(COSDictionary dict,
                                        COSName... excludeObjects)
                                 throws java.io.IOException
        Will parse every object necessary to load a single page from the pdf document. We try our best to order objects according to offset in file before reading to minimize seek operations.
        Parameters:
        dict - the COSObject from the parent pages.
        excludeObjects - dictionary object reference entries with these names will not be parsed
        Throws:
        java.io.IOException - if something went wrong
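
        The idea of ordering objects by file offset before reading them can be sketched as follows. This is a simplified illustration, assuming the offsets are already known and keyed by a numeric object id; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class OffsetOrderSketch {
    // Returns the object ids sorted by ascending file offset, so that reading
    // them in this order moves forward through the file with few seeks.
    static List<Long> orderByOffset(Map<Long, Long> offsetsById) {
        List<Map.Entry<Long, Long>> entries = new ArrayList<>(offsetsById.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<Long> ordered = new ArrayList<>(entries.size());
        for (Map.Entry<Long, Long> entry : entries) {
            ordered.add(entry.getKey());
        }
        return ordered;
    }
}
```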
      • addExcludedToList

        private void addExcludedToList​(COSName[] excludeObjects,
                                       COSDictionary dict,
                                       java.util.Set<java.lang.Long> parsedObjects)
      • parseObjectDynamically

        protected final COSBase parseObjectDynamically​(COSObject obj,
                                                       boolean requireExistingNotCompressedObj)
                                                throws java.io.IOException
        This will parse the next object from the stream and add it to the local state.
        Parameters:
        obj - object to be parsed (we only take object number and generation number for lookup start offset)
        requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        java.io.IOException - If an IO error occurs.
      • parseObjectDynamically

        protected COSBase parseObjectDynamically​(long objNr,
                                                 int objGenNr,
                                                 boolean requireExistingNotCompressedObj)
                                          throws java.io.IOException
        This will parse the next object from the stream and add it to the local state. It's reduced to parsing an indirect object.
        Parameters:
        objNr - object number of object to be parsed
        objGenNr - object generation number of object to be parsed
        requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        java.io.IOException - If an IO error occurs.
      • parseFileObject

        private void parseFileObject​(java.lang.Long offsetOrObjstmObNr,
                                     COSObjectKey objKey,
                                     COSObject pdfObject)
                              throws java.io.IOException
        Throws:
        java.io.IOException
      • parseObjectStream

        private void parseObjectStream​(int objstmObjNr)
                                throws java.io.IOException
        Throws:
        java.io.IOException
      • getLength

        private COSNumber getLength​(COSBase lengthBaseObj,
                                    COSName streamType)
                             throws java.io.IOException
        Returns length value referred to or defined in given object.
        Throws:
        java.io.IOException
      • parseCOSStream

        protected COSStream parseCOSStream​(COSDictionary dic)
                                    throws java.io.IOException
        This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
        Parameters:
        dic - dictionary that goes with this stream.
        Returns:
        parsed pdf stream.
        Throws:
        java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
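
        A length-driven read can be sketched on a byte array as shown below. This is an illustration under simplifying assumptions (in-memory data, an already resolved length, and optional whitespace before the keyword); the names are hypothetical and this is not the PDFBox code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthBasedStreamSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    // Copies exactly 'length' bytes starting at dataStart without inspecting them,
    // then requires the keyword 'endstream' (after optional whitespace).
    static byte[] readStreamData(byte[] source, int dataStart, int length) throws IOException {
        if (length < 0 || dataStart < 0 || dataStart > source.length - length) {
            throw new IOException("stream too short for declared length " + length);
        }
        byte[] data = Arrays.copyOfRange(source, dataStart, dataStart + length);

        int pos = dataStart + length;
        while (pos < source.length
                && (source[pos] == '\r' || source[pos] == '\n' || source[pos] == ' ')) {
            pos++;
        }
        boolean found = pos + ENDSTREAM.length <= source.length
                && Arrays.equals(Arrays.copyOfRange(source, pos, pos + ENDSTREAM.length), ENDSTREAM);
        if (!found) {
            throw new IOException("expected 'endstream' after " + length + " bytes of stream data");
        }
        return data;
    }
}
```

        Because the data bytes are never compared against keywords, a payload that itself contains the text "endstream" is copied intact.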
      • readUntilEndStream

        private void readUntilEndStream​(java.io.OutputStream out)
                                 throws java.io.IOException
        This method will read through the current stream object until we find the keyword "endstream" meaning we're at the end of this object. Some pdf files, however, forget to write some endstream tags and just close off objects with an "endobj" tag so we have to handle this case as well. This method is optimized using buffered IO and reduced number of byte compare operations.
        Parameters:
        out - stream we write out to.
        Throws:
        java.io.IOException - if something went wrong
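
        The fallback scan can be illustrated with a naive byte-by-byte search that stops at whichever of the two keywords appears first. Unlike the method described above, this sketch is not optimized and uses no buffered IO; it also leaves any end-of-line bytes before the keyword in the output. All names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class EndStreamScanSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);

    // Writes everything from 'from' up to the first 'endstream' or 'endobj'
    // keyword to out and returns the position of that keyword, or
    // source.length if neither keyword occurs.
    static int copyUntilEndMarker(byte[] source, int from, OutputStream out) throws IOException {
        for (int pos = from; pos < source.length; pos++) {
            if (startsWith(source, pos, ENDSTREAM) || startsWith(source, pos, ENDOBJ)) {
                out.write(source, from, pos - from);
                return pos;
            }
        }
        out.write(source, from, source.length - from);
        return source.length;
    }

    private static boolean startsWith(byte[] source, int pos, byte[] keyword) {
        if (pos + keyword.length > source.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (source[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }
}
```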
      • readValidStream

        private void readValidStream​(java.io.OutputStream out,
                                     COSNumber streamLengthObj)
                              throws java.io.IOException
        Throws:
        java.io.IOException
      • validateStreamLength

        private boolean validateStreamLength​(long streamLength)
                                      throws java.io.IOException
        Throws:
        java.io.IOException
      • checkXRefOffset

        private long checkXRefOffset​(long startXRefOffset)
                              throws java.io.IOException
        Check if the cross reference table/stream can be found at the current offset.
        Parameters:
        startXRefOffset -
        Returns:
        the revised offset
        Throws:
        java.io.IOException
      • checkXRefStreamOffset

        private boolean checkXRefStreamOffset​(long startXRefOffset)
                                       throws java.io.IOException
        Check if the cross reference stream can be found at the current offset.
        Parameters:
        startXRefOffset - the expected start offset of the XRef stream
        Returns:
        true if the cross reference stream was found at the given offset
        Throws:
        java.io.IOException - if something went wrong
      • calculateXRefFixedOffset

        private long calculateXRefFixedOffset​(long objectOffset,
                                              boolean streamsOnly)
                                       throws java.io.IOException
        Try to find a fixed offset for the given xref table/stream.
        Parameters:
        objectOffset - the given offset where to look at
        streamsOnly - search for xref streams only
        Returns:
        the fixed offset
        Throws:
        java.io.IOException - if something went wrong
      • validateXrefOffsets

        private boolean validateXrefOffsets​(java.util.Map<COSObjectKey,​java.lang.Long> xrefOffset)
                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • checkXrefOffsets

        private void checkXrefOffsets()
                               throws java.io.IOException
        Check the XRef table by dereferencing all objects and fixing the offset if necessary.
        Throws:
        java.io.IOException - if something went wrong.
      • checkObjectKey

        private boolean checkObjectKey​(COSObjectKey objectKey,
                                       long offset)
                                throws java.io.IOException
        Check if the given object can be found at the given offset.
        Parameters:
        objectKey - the object we are looking for
        offset - the offset where to look
        Returns:
        returns true if the given object can be dereferenced at the given offset
        Throws:
        java.io.IOException - if something went wrong
      • bfSearchForObjects

        private void bfSearchForObjects()
                                 throws java.io.IOException
        Brute force search for every object in the pdf.
        Throws:
        java.io.IOException - if something went wrong
      • bfSearchForXRef

        private long bfSearchForXRef​(long xrefOffset,
                                     boolean streamsOnly)
                              throws java.io.IOException
        Search for the offset of the given xref table/stream among those found by a brute force search.
        Parameters:
        streamsOnly - search for xref streams only
        Returns:
        the offset of the xref entry
        Throws:
        java.io.IOException - if something went wrong
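
        Picking a replacement offset from a list of brute-force candidates can be sketched as a nearest-value search. The selection rule used here (smallest absolute distance, first candidate wins on ties, -1 for an empty list) is an assumption for illustration, and the names are hypothetical.

```java
import java.util.List;

final class NearestOffsetSketch {
    // Returns the candidate closest to the expected offset, or -1 if there are none.
    static long nearest(List<Long> candidates, long expectedOffset) {
        long best = -1;
        long bestDistance = Long.MAX_VALUE;
        for (long candidate : candidates) {
            long distance = Math.abs(expectedOffset - candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```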
      • searchNearestValue

        private long searchNearestValue​(java.util.List<java.lang.Long> values,
                                        long offset)
      • bfSearchForTrailer

        private boolean bfSearchForTrailer​(COSDictionary trailer)
                                    throws java.io.IOException
        Brute force search for all trailer markers.
        Throws:
        java.io.IOException - if something went wrong
      • bfSearchForLastEOFMarker

        private void bfSearchForLastEOFMarker()
                                       throws java.io.IOException
        Brute force search for the last EOF marker.
        Throws:
        java.io.IOException - if something went wrong
      • bfSearchForObjStreams

        private void bfSearchForObjStreams()
                                    throws java.io.IOException
        Brute force search for all object streams.
        Throws:
        java.io.IOException - if something went wrong
      • bfSearchForXRefTables

        private void bfSearchForXRefTables()
                                    throws java.io.IOException
        Brute force search for all xref entries (tables).
        Throws:
        java.io.IOException - if something went wrong
      • bfSearchForXRefStreams

        private void bfSearchForXRefStreams()
                                     throws java.io.IOException
        Brute force search for all /XRef entries (streams).
        Throws:
        java.io.IOException - if something went wrong
      • rebuildTrailer

        protected final COSDictionary rebuildTrailer()
                                              throws java.io.IOException
        Rebuild the trailer dictionary if startxref can't be found.
        Returns:
        the rebuilt trailer dictionary
        Throws:
        java.io.IOException - if something went wrong
      • searchForTrailerItems

        private boolean searchForTrailerItems​(COSDictionary trailer)
                                       throws java.io.IOException
        Search for the different parts of the trailer dictionary.
        Parameters:
        trailer -
        Returns:
        true if the root was found, false if not.
        Throws:
        java.io.IOException
      • retrieveCOSDictionary

        private COSDictionary retrieveCOSDictionary​(COSObject object)
                                             throws java.io.IOException
        Throws:
        java.io.IOException
      • retrieveCOSDictionary

        private COSDictionary retrieveCOSDictionary​(COSObjectKey key,
                                                    long offset)
                                             throws java.io.IOException
        Throws:
        java.io.IOException
      • checkPages

        protected void checkPages​(COSDictionary root)
        Check if all entries of the pages dictionary are present. Those which can't be dereferenced are removed.
        Parameters:
        root - the root dictionary of the pdf
      • checkPagesDictionary

        private int checkPagesDictionary​(COSDictionary pagesDict,
                                         java.util.Set<COSObject> set)
      • isCatalog

        protected boolean isCatalog​(COSDictionary dictionary)
        Tell if the dictionary is a PDF catalog. Override this for an FDF catalog.
        Parameters:
        dictionary -
        Returns:
        true if the given dictionary is a root dictionary
      • isInfo

        private boolean isInfo​(COSDictionary dictionary)
        Tell if the dictionary is an info dictionary.
        Parameters:
        dictionary -
        Returns:
        true if the given dictionary is an info dictionary
      • parseStartXref

        private long parseStartXref()
                             throws java.io.IOException
        This will parse the startxref section from the stream. The startxref value is ignored.
        Returns:
        the startxref value or -1 on parsing error
        Throws:
        java.io.IOException - If an IO error occurs.
      • isString

        private boolean isString​(byte[] string)
                          throws java.io.IOException
        Checks if the given string can be found at the current offset.
        Parameters:
        string - the bytes of the string to look for
        Returns:
        true if the bytes are in place, false if not
        Throws:
        java.io.IOException - if something went wrong
      • isString

        private boolean isString​(char[] string)
                          throws java.io.IOException
        Checks if the given string can be found at the current offset.
        Parameters:
        string - the characters of the string to look for
        Returns:
        true if the characters are in place, false if not
        Throws:
        java.io.IOException - if something went wrong
      • parseTrailer

        private boolean parseTrailer()
                              throws java.io.IOException
        This will parse the trailer from the stream and add it to the state.
        Returns:
        false on parsing error
        Throws:
        java.io.IOException - If an IO error occurs.
      • parsePDFHeader

        protected boolean parsePDFHeader()
                                  throws java.io.IOException
        Parse the header of a pdf.
        Returns:
        true if a PDF header was found
        Throws:
        java.io.IOException - if something went wrong
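
        Header detection with a fallback version can be sketched as below. The marker "%PDF-", the "digit.digit" version shape, and the fallback behaviour are assumptions made for this example; this section does not specify how the header is actually parsed, and the names are hypothetical.

```java
final class HeaderSketch {
    // Returns the version found after headerMarker, defaultVersion if the marker
    // is present but no "d.d" version follows it, or null if the marker is absent.
    static String parseVersion(String firstLine, String headerMarker, String defaultVersion) {
        int idx = firstLine.indexOf(headerMarker);
        if (idx < 0) {
            return null; // no header found
        }
        String rest = firstLine.substring(idx + headerMarker.length());
        if (rest.matches("\\d\\.\\d.*")) {
            return rest.substring(0, 3);
        }
        return defaultVersion;
    }
}
```

        Searching with indexOf rather than startsWith tolerates leading garbage before the marker, which is one plausible form of lenient parsing.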
      • parseFDFHeader

        protected boolean parseFDFHeader()
                                  throws java.io.IOException
        Parse the header of an FDF.
        Returns:
        true if an FDF header was found
        Throws:
        java.io.IOException - if something went wrong
      • parseHeader

        private boolean parseHeader​(java.lang.String headerMarker,
                                    java.lang.String defaultVersion)
                             throws java.io.IOException
        Throws:
        java.io.IOException
      • parseXrefTable

        protected boolean parseXrefTable​(long startByteOffset)
                                  throws java.io.IOException
        This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
        Parameters:
        startByteOffset - the offset to start at
        Returns:
        false on parsing error
        Throws:
        java.io.IOException - If an IO error occurs.
      • parseXrefStream

        private void parseXrefStream​(COSStream stream,
                                     long objByteOffset,
                                     boolean isStandalone)
                              throws java.io.IOException
        Fills XRefTrailerResolver with data of given stream. Stream must be of type XRef.
        Parameters:
        stream - the stream to be read
        objByteOffset - the offset to start at
        isStandalone - should be set to true if the stream is not part of a hybrid xref table
        Throws:
        java.io.IOException - if there is an error parsing the stream
      • getDocument

        public COSDocument getDocument()
                                throws java.io.IOException
        This will get the document that was parsed. The document must be parsed before this is called. When you are done with this document you must call close() on it to release resources.
        Returns:
        The document that was parsed.
        Throws:
        java.io.IOException - If there is an error getting the document.
      • getEncryption

        public PDEncryption getEncryption()
                                   throws java.io.IOException
        This will get the encryption dictionary. The document must be parsed before this is called.
        Returns:
        The encryption dictionary of the document that was parsed.
        Throws:
        java.io.IOException - If there is an error getting the document.
      • getAccessPermission

        public AccessPermission getAccessPermission()
                                             throws java.io.IOException
        This will get the AccessPermission. The document must be parsed before this is called.
        Returns:
        The access permission of the document that was parsed.
        Throws:
        java.io.IOException - If there is an error getting the document.
      • parseTrailerValuesDynamically

        protected COSBase parseTrailerValuesDynamically​(COSDictionary trailer)
                                                 throws java.io.IOException
        Parse the values of the trailer dictionary and return the root object.
        Parameters:
        trailer - The trailer dictionary.
        Returns:
        The parsed root object.
        Throws:
        java.io.IOException - If an IO error occurs or if the root object is missing in the trailer dictionary.
      • prepareDecryption

        private void prepareDecryption()
                                throws java.io.IOException
        Prepare for decryption.
        Throws:
        InvalidPasswordException - If the password is incorrect.
        java.io.IOException - if something went wrong
      • parseDictionaryRecursive

        private void parseDictionaryRecursive​(COSObject dictionaryObject)
                                       throws java.io.IOException
        Resolves all not already parsed objects of a dictionary recursively.
        Parameters:
        dictionaryObject - dictionary to be parsed
        Throws:
        java.io.IOException - if something went wrong