Package com.tom_roush.pdfbox.pdfparser
Class COSParser
- java.lang.Object
-
- com.tom_roush.pdfbox.pdfparser.BaseParser
-
- com.tom_roush.pdfbox.pdfparser.COSParser
-
public class COSParser extends BaseParser
PDF parser which first reads startxref and the xref tables in order to know the valid objects, and then parses only these objects. This class can be used as a PDFParser replacement. First, PDFParser.parse() or FDFParser.parse() must be called before page objects can be retrieved, e.g. via PDFParser.getPDDocument(). This class is a much enhanced version of QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
-
-
Field Summary
Fields (Modifier and Type, Field: Description)
- static byte[] ENDOBJ
- static byte[] ENDSTREAM
- protected static char[] EOF_MARKER: EOF-marker.
- protected long fileLen: file length.
- protected boolean initialParseDone
- protected static char[] OBJ_MARKER: obj-marker.
- protected SecurityHandler securityHandler: The security handler.
- protected RandomAccessRead source
- static String SYSPROP_EOFLOOKUPRANGE: The range within which the %%EOF marker will be searched.
- static String SYSPROP_PARSEMINIMAL: Only parse the PDF file minimally allowing access to basic information.
- static String TMP_FILE_PREFIX: The prefix for the temp file being used.
- protected XrefTrailerResolver xrefTrailerResolver: Collects all Xref/trailer objects and resolves them into single object using startxref reference.
-
Fields inherited from class com.tom_roush.pdfbox.pdfparser.BaseParser
A, ASCII_CR, ASCII_LF, B, D, DEF, document, E, ENDOBJ_STRING, ENDSTREAM_STRING, J, M, N, O, R, S, seqSource, STREAM_STRING, T
-
-
Constructor Summary
Constructors (Constructor: Description)
- COSParser(RandomAccessRead source): Default constructor.
-
Method Summary
Methods (Modifier and Type, Method: Description)
- COSDocument getDocument(): This will get the document that was parsed.
- protected long getStartxrefOffset(): Looks for and parses startxref.
- boolean isLenient(): Return true if parser is lenient.
- protected int lastIndexOf(char[] pattern, byte[] buf, int endOff): Searches last appearance of pattern within buffer.
- protected COSStream parseCOSStream(COSDictionary dic): This will read a COSStream from the input stream using length attribute within dictionary.
- protected void parseDictObjects(COSDictionary dict, COSName... excludeObjects): Will parse every object necessary to load a single page from the pdf document.
- protected boolean parseFDFHeader(): Parse the header of an FDF.
- protected COSBase parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj): This will parse the next object from the stream and add it to the local state.
- protected COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj): This will parse the next object from the stream and add it to the local state.
- protected boolean parsePDFHeader(): Parse the header of a pdf.
- protected long parseStartXref(): This will parse the startxref section from the stream.
- protected boolean parseTrailer(): This will parse the trailer from the stream and add it to the state.
- protected COSBase parseTrailerValuesDynamically(COSDictionary trailer): Parse the values of the trailer dictionary and return the root object.
- protected COSDictionary parseXref(long startXRefOffset): Parses cross reference tables.
- void parseXrefStream(COSStream stream, long objByteOffset, boolean isStandalone): Fills XRefTrailerResolver with data of given stream.
- protected boolean parseXrefTable(long startByteOffset): This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
- protected COSDictionary rebuildTrailer(): Rebuild the trailer dictionary if startxref can't be found.
- void setEOFLookupRange(int byteCount): Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
- void setLenient(boolean lenient): Change the parser leniency flag.
-
Methods inherited from class com.tom_roush.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isDigit, isDigit, isEndOfName, isEOL, isEOL, isSpace, isSpace, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedChar, readExpectedString, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, skipSpaces, skipWhiteSpace
-
-
-
-
Field Detail
-
ENDSTREAM
public static final byte[] ENDSTREAM
-
ENDOBJ
public static final byte[] ENDOBJ
-
source
protected final RandomAccessRead source
-
SYSPROP_PARSEMINIMAL
public static final String SYSPROP_PARSEMINIMAL
Only parse the PDF file minimally allowing access to basic information.
- See Also:
- Constant Field Values
-
SYSPROP_EOFLOOKUPRANGE
public static final String SYSPROP_EOFLOOKUPRANGE
The range within which the %%EOF marker will be searched. Useful if there are additional characters after %%EOF within the PDF.
- See Also:
- Constant Field Values
-
EOF_MARKER
protected static final char[] EOF_MARKER
EOF-marker.
-
OBJ_MARKER
protected static final char[] OBJ_MARKER
obj-marker.
-
fileLen
protected long fileLen
file length.
-
initialParseDone
protected boolean initialParseDone
-
securityHandler
protected SecurityHandler securityHandler
The security handler.
-
xrefTrailerResolver
protected XrefTrailerResolver xrefTrailerResolver
Collects all Xref/trailer objects and resolves them into single object using startxref reference.
-
TMP_FILE_PREFIX
public static final String TMP_FILE_PREFIX
The prefix for the temp file being used.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
COSParser
public COSParser(RandomAccessRead source)
Default constructor.
-
-
Method Detail
-
setEOFLookupRange
public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used. We check that the new value is at least 16. However, for practical use cases this value should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.
If the system property SYSPROP_EOFLOOKUPRANGE is defined, this value will be set on initialization but can be overwritten later.
- Parameters:
byteCount - number of trailing bytes
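The rules above can be pictured with a small stand-alone sketch. It is not the library's code: the class name EofLookupRange, the default of 2048 bytes, and the choice to silently ignore values below 16 or unparsable property values are all assumptions made for illustration. Only the minimum of 16 and the "property first, setter may override later" order come from the description.

```java
// Hypothetical illustration only; not part of COSParser.
final class EofLookupRange {
    // Assumed default; the real DEFAULT_TRAIL_BYTECOUNT value is not stated above.
    static final int ASSUMED_DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Minimum taken from the description ("at least 16").
    static final int MIN_TRAIL_BYTECOUNT = 16;

    private int readTrailBytes = ASSUMED_DEFAULT_TRAIL_BYTECOUNT;

    /** Applies an optional property value on initialization. */
    EofLookupRange(String propertyValue) {
        if (propertyValue != null) {
            try {
                setEOFLookupRange(Integer.parseInt(propertyValue.trim()));
            } catch (NumberFormatException e) {
                // Assumption: an unparsable property leaves the default in place.
            }
        }
    }

    /** Assumption: values below the minimum are ignored rather than rejected. */
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= MIN_TRAIL_BYTECOUNT) {
            readTrailBytes = byteCount;
        }
    }

    int getReadTrailBytes() {
        return readTrailBytes;
    }
}
```

A caller would pass the value of the system property (or null if it is unset) to the constructor and may still call setEOFLookupRange afterwards.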
-
parseXref
protected COSDictionary parseXref(long startXRefOffset) throws IOException
Parses cross reference tables.
- Parameters:
startXRefOffset - start offset of the first table
- Returns:
- the trailer dictionary
- Throws:
IOException - if something went wrong
-
getStartxrefOffset
protected final long getStartxrefOffset() throws IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker (within the last DEFAULT_TRAIL_BYTECOUNT bytes, or within the range set via setEOFLookupRange(int)) and then go back to find startxref.
- Returns:
- the offset of StartXref
- Throws:
IOException - If something went wrong.
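The two-step lookup described here (last '%%EOF' in the tail, then the 'startxref' before it) can be sketched on a plain byte array. This is a minimal illustration under stated assumptions, not the library's implementation: the class and method names are hypothetical, the file is held fully in memory, and -1 is used to signal "not found".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of COSParser.
final class StartxrefLocator {
    /**
     * Returns the number written after the last 'startxref' that precedes the
     * last '%%EOF' within the final lookupRange bytes, or -1 if not found.
     */
    static long findStartxrefOffset(byte[] file, int lookupRange) {
        int tailLen = Math.min(file.length, Math.max(lookupRange, 0));
        // ISO-8859-1 maps each byte to one char, so string indices equal byte indices.
        String tail = new String(file, file.length - tailLen, tailLen,
                StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        // Skip whitespace after the keyword, then collect decimal digits.
        int pos = keyword + "startxref".length();
        while (pos < eof && Character.isWhitespace(tail.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < eof && tail.charAt(pos) >= '0' && tail.charAt(pos) <= '9') {
            pos++;
        }
        return pos > start ? Long.parseLong(tail.substring(start, pos)) : -1;
    }
}
```

With trailing garbage after '%%EOF', a lookup range that is too small no longer covers the marker and the sketch returns -1, which is the situation a larger lookup range is meant to address.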
-
lastIndexOf
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.
- Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
- Returns:
- start offset of pattern within buffer or -1 if pattern could not be found
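A backward search with this contract can be sketched as follows. This is a simple quadratic version written for readability, with a hypothetical class name; the library may well use a different matching loop.

```java
// Hypothetical illustration only; not part of COSParser.
final class BackwardSearch {
    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * completely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // Candidate start positions run from the rightmost possible fit down to 0.
        int start = Math.min(endOff, buf.length) - pattern.length;
        for (; start >= 0; start--) {
            int i = 0;
            // Compare bytes as unsigned values against the pattern characters.
            while (i < pattern.length && (buf[start + i] & 0xFF) == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```

Passing the offset of a previous hit as endOff continues the search further towards the start of the buffer.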
-
isLenient
public boolean isLenient()
Return true if parser is lenient, meaning that the auto-healing capabilities of the parser are used.
- Returns:
- true if parser is lenient
-
setLenient
public void setLenient(boolean lenient)
Change the parser leniency flag. This method can only be called before the parsing of the file.
- Parameters:
lenient - try to handle malformed PDFs.
-
parseDictObjects
protected void parseDictObjects(COSDictionary dict, COSName... excludeObjects) throws IOException
Will parse every object necessary to load a single page from the pdf document. We try our best to order objects according to offset in file before reading to minimize seek operations.
- Parameters:
dict - the COSObject from the parent pages.
excludeObjects - dictionary object reference entries with these names will not be parsed
- Throws:
IOException - if something went wrong
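The idea of ordering objects by file offset before reading can be shown with plain collections. This is only a sketch of the ordering step: the class name and the map from object number to byte offset are hypothetical stand-ins for whatever bookkeeping the parser actually keeps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of COSParser.
final class OffsetOrdering {
    /** Returns the object numbers sorted by ascending file offset. */
    static List<Long> orderByOffset(Map<Long, Long> offsetByObjectNumber) {
        List<Map.Entry<Long, Long>> entries =
                new ArrayList<>(offsetByObjectNumber.entrySet());
        // Reading in offset order keeps seeks moving forward through the file.
        entries.sort(Map.Entry.comparingByValue());
        List<Long> ordered = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : entries) {
            ordered.add(entry.getKey());
        }
        return ordered;
    }
}
```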
-
parseObjectDynamically
protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state.
- Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true, the object to be parsed must not be contained within a compressed stream
- Returns:
- the parsed object (which is also added to document object)
- Throws:
IOException - If an IO error occurs.
-
parseObjectDynamically
protected COSBase parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. It's reduced to parsing an indirect object.
- Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
- Returns:
- the parsed object (which is also added to document object)
- Throws:
IOException - If an IO error occurs.
-
parseCOSStream
protected COSStream parseCOSStream(COSDictionary dic) throws IOException
This will read a COSStream from the input stream using length attribute within dictionary. If length attribute is an indirect reference it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj' and thus it is no problem if these keywords occur within stream. We require 'endstream' to be found after stream data is read.
- Parameters:
dic - dictionary that goes with this stream.
- Returns:
- parsed pdf stream.
- Throws:
IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
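The length-driven read described above can be sketched on a byte array. This is an illustration under assumptions rather than the library's code: the class and method names are hypothetical, the length is taken as already resolved, and end-of-line bytes between the data and 'endstream' are simply skipped.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of COSParser.
final class LengthBasedStreamReader {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Copies exactly length bytes starting at dataStart, then requires the
     * 'endstream' keyword (after optional end-of-line bytes).
     */
    static byte[] readStream(byte[] file, int dataStart, int length) throws IOException {
        if (length < 0 || dataStart < 0 || dataStart > file.length - length) {
            throw new IOException("Stream too short for declared length " + length);
        }
        // The data is copied blindly: keywords inside it are never inspected.
        byte[] data = Arrays.copyOfRange(file, dataStart, dataStart + length);

        int pos = dataStart + length;
        while (pos < file.length && (file[pos] == '\r' || file[pos] == '\n')) {
            pos++;
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (pos + i >= file.length || file[pos + i] != ENDSTREAM[i]) {
                throw new IOException("Expected 'endstream' after stream data");
            }
        }
        return data;
    }
}
```

Because the copy is driven by the length alone, stream data that itself contains the text 'endstream' is read correctly, while a wrong length is detected when the keyword check fails.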
-
rebuildTrailer
protected final COSDictionary rebuildTrailer() throws IOException
Rebuild the trailer dictionary if startxref can't be found.
- Returns:
- the rebuilt trailer dictionary
- Throws:
IOException - if something went wrong
-
parseStartXref
protected long parseStartXref() throws IOException
This will parse the startxref section from the stream. The startxref value is ignored.
- Returns:
- the startxref value or -1 on parsing error
- Throws:
IOException - If an IO error occurs.
-
parseTrailer
protected boolean parseTrailer() throws IOException
This will parse the trailer from the stream and add it to the state.
- Returns:
- false on parsing error
- Throws:
IOException - If an IO error occurs.
-
parsePDFHeader
protected boolean parsePDFHeader() throws IOException
Parse the header of a pdf.
- Returns:
- true if a PDF header was found
- Throws:
IOException - if something went wrong
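As a rough picture of what a header check involves, the sketch below looks for a marker such as %PDF-1.7 or %FDF-1.2 in a line of text. The class name, the regular expression, and the tolerance for leading garbage before the marker are assumptions for illustration; the rules the parser really applies are not specified here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of COSParser.
final class HeaderSniffer {
    // Matches a marker such as %PDF-1.7 or %FDF-1.2 anywhere in the line.
    private static final Pattern HEADER = Pattern.compile("%(PDF|FDF)-(\\d\\.\\d)");

    /** Returns the version for the given marker ("PDF" or "FDF"), or null if absent. */
    static String findVersion(String line, String marker) {
        Matcher m = HEADER.matcher(line);
        while (m.find()) {
            if (m.group(1).equals(marker)) {
                return m.group(2);
            }
        }
        return null;
    }
}
```

A boolean result like the one documented above then corresponds to whether findVersion returns a non-null value.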
-
parseFDFHeader
protected boolean parseFDFHeader() throws IOException
Parse the header of an FDF.
- Returns:
- true if an FDF header was found
- Throws:
IOException - if something went wrong
-
parseXrefTable
protected boolean parseXrefTable(long startByteOffset) throws IOException
This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
- Parameters:
startByteOffset - the offset to start at
- Returns:
- false on parsing error
- Throws:
IOException - If an IO error occurs.
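For readers unfamiliar with the layout being read, a classic cross-reference table consists of the keyword xref, then subsections that each start with "first-object-number count" followed by one "offset generation n|f" entry per object. The sketch below parses that layout from a string. It is a simplified illustration, not the library's parser: the class name is hypothetical, parsing errors are reported as null instead of false, and the sketch keeps the offsets of in-use entries, which says nothing about what the library does with them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of COSParser.
final class XrefTableSketch {
    /**
     * Parses a classic cross-reference table and returns object number to byte
     * offset for in-use ('n') entries, or null on a parsing error.
     */
    static Map<Long, Long> parse(String table) {
        String[] lines = table.trim().split("\\s*[\\r\\n]\\s*");
        if (!lines[0].equals("xref")) {
            return null;
        }
        Map<Long, Long> offsets = new LinkedHashMap<>();
        int i = 1;
        try {
            while (i < lines.length && !lines[i].startsWith("trailer")) {
                // Subsection header: first object number and entry count.
                String[] header = lines[i++].split("\\s+");
                if (header.length != 2) {
                    return null;
                }
                long first = Long.parseLong(header[0]);
                int count = Integer.parseInt(header[1]);
                for (int n = 0; n < count; n++) {
                    if (i >= lines.length) {
                        return null; // table ends before all entries were read
                    }
                    // Entry: byte offset, generation number, 'n' (in use) or 'f' (free).
                    String[] entry = lines[i++].split("\\s+");
                    if (entry.length != 3) {
                        return null;
                    }
                    if (entry[2].equals("n")) {
                        offsets.put(first + n, Long.parseLong(entry[0]));
                    } else if (!entry[2].equals("f")) {
                        return null;
                    }
                }
            }
        } catch (NumberFormatException e) {
            return null;
        }
        return offsets;
    }
}
```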
-
parseXrefStream
public void parseXrefStream(COSStream stream, long objByteOffset, boolean isStandalone) throws IOException
Fills XRefTrailerResolver with data of given stream. Stream must be of type XRef.
- Parameters:
stream - the stream to be read
objByteOffset - the offset to start at
isStandalone - should be set to true if the stream is not part of a hybrid xref table
- Throws:
IOException - if there is an error parsing the stream
-
getDocument
public COSDocument getDocument() throws IOException
This will get the document that was parsed. parse() must be called before this is called. When you are done with this document you must call close() on it to release resources.
- Returns:
- The document that was parsed.
- Throws:
IOException - If there is an error getting the document.
-
parseTrailerValuesDynamically
protected COSBase parseTrailerValuesDynamically(COSDictionary trailer) throws IOException
Parse the values of the trailer dictionary and return the root object.
- Parameters:
trailer - The trailer dictionary.
- Returns:
- The parsed root object
- Throws:
IOException - If an IO error occurs or if the root object is missing in the trailer dictionary
-
-