Package org.apache.tika.parser.microsoft
Class OfficeParser
java.lang.Object
org.apache.tika.parser.AbstractParser
org.apache.tika.parser.microsoft.AbstractOfficeParser
org.apache.tika.parser.microsoft.OfficeParser
- All Implemented Interfaces:
Serializable, Parser
Defines a Microsoft document content extractor.
Constructor Summary
Constructors
OfficeParser()
Method Summary
Modifier and Type, Method, Description
static void extractMacros(POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor)
Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader.
Set<MediaType> getSupportedTypes(ParseContext context)
Returns the set of media types supported by this parser when used with the given parse context.
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
Extracts properties and text from an MS Document input stream.
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getExtractAllAlternativesFromMSG, getExtractMacros, getIncludeDeletedContent, getIncludeMoveFromContent, getUseSAXDocxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
Constructor Details
-
OfficeParser
public OfficeParser()
-
-
Method Details
-
getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Parameters:
context - parse context
- Returns:
immutable set of media types
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Extracts properties and text from an MS Document input stream.
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
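Because parse reports document content as XHTML SAX events, the caller decides what to do with them by supplying the handler. The following is a minimal sketch of one such handler, using only JDK classes. The class name TextCollectingHandler and its collect helper are hypothetical and not part of Tika. For demonstration, the events come from the JDK SAX parser reading a hand-written XHTML string, which stands in for the events OfficeParser would emit. Since DefaultHandler implements ContentHandler, an instance of this class could presumably be passed as the handler argument of parse.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a Tika class): collects character data and
// counts <p> elements from a stream of XHTML SAX events.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphs = 0;

    // Namespace-aware sources fill localName; others only fill qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(name(localName, qName))) {
            paragraphs++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Separate paragraphs with a newline so their text does not run together.
        if ("p".equals(name(localName, qName))) {
            text.append('\n');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString().trim();
    }

    int paragraphs() {
        return paragraphs;
    }

    // Demo driver: the JDK SAX parser stands in for the event source.
    static TextCollectingHandler collect(String xhtml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not process sample XHTML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TextCollectingHandler h =
                collect("<html><body><p>Hello</p><p>World</p></body></html>");
        System.out.println(h.paragraphs() + " paragraphs");
        System.out.println(h.text());
    }
}
```

Keeping the handler free of Tika types makes it easy to exercise in isolation before wiring it to a real document stream.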
-
extractMacros
public static void extractMacros(POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, NullPointerException and other runtime exceptions are swallowed.
- Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
- Throws:
IOException - if an I/O error occurs while extracting the embedded document
SAXException - if writing to xhtml fails
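The error contract described for extractMacros (runtime exceptions swallowed, checked I/O failures still reported) can be illustrated with a small JDK-only sketch. This is not Tika's implementation. LenientMacroExtraction, MacroSource, extractAll, and demo are hypothetical names, and the macro strings are made-up sample data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented contract; not Tika code.
class LenientMacroExtraction {

    // Stand-in for a step that reads one macro module and may fail.
    interface MacroSource {
        String read() throws IOException;
    }

    // Reads every source. A RuntimeException (for example an NPE caused by
    // a reader bug) skips that source; an IOException still propagates.
    static List<String> extractAll(List<MacroSource> sources) throws IOException {
        List<String> macros = new ArrayList<>();
        for (MacroSource source : sources) {
            try {
                macros.add(source.read());
            } catch (RuntimeException e) {
                // Swallowed on purpose: one broken module should not
                // abort extraction of the remaining ones.
            }
        }
        return macros;
    }

    // Demo with made-up data: the second source simulates a reader bug.
    static List<String> demo() {
        List<MacroSource> sources = new ArrayList<>();
        sources.add(() -> "Sub Hello()");
        sources.add(() -> { throw new NullPointerException("simulated reader bug"); });
        sources.add(() -> "Sub World()");
        try {
            return extractAll(sources);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off of this pattern is that a swallowed failure yields partial output without any signal to the caller, so logging inside the catch block may be worth adding in real code.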
-