Package org.apache.poi.hwpf.extractor
Class WordExtractor
java.lang.Object
  org.apache.poi.extractor.POITextExtractor
    org.apache.poi.extractor.POIOLE2TextExtractor
      org.apache.poi.hwpf.extractor.WordExtractor

All Implemented Interfaces:
Closeable, AutoCloseable
public final class WordExtractor extends POIOLE2TextExtractor
Class to extract the text from a Word Document. You should use either getParagraphText() or getText() unless you have a strong reason otherwise.
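As a rough illustration of the two recommended entry points, the sketch below opens a .doc file through the InputStream constructor, reads it with getParagraphText() and getText(), and closes the extractor with try-with-resources (the class implements Closeable). The class name WordExtractorUsage, the joinParagraphs helper, and the command-line argument handling are assumptions made for this example and are not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical demo class; only WordExtractor and its methods come from POI.
final class WordExtractorUsage {

    // Hypothetical helper: drops trailing line terminators from each
    // paragraph and joins the results with a single newline.
    static String joinParagraphs(String[] paragraphs) {
        String[] cleaned = new String[paragraphs.length];
        for (int i = 0; i < paragraphs.length; i++) {
            cleaned[i] = paragraphs[i].replaceAll("[\\r\\n]+$", "");
        }
        return String.join("\n", cleaned);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WordExtractorUsage <file.doc>");
            return;
        }
        // Both the stream and the extractor are closed when the block exits.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             WordExtractor extractor = new WordExtractor(in)) {

            // One String per paragraph.
            String[] paragraphs = extractor.getParagraphText();
            System.out.println(paragraphs.length + " paragraphs");
            System.out.println(joinParagraphs(paragraphs));

            // The whole document as a single String.
            String fullText = extractor.getText();
            System.out.println(fullText.length() + " characters in total");
        }
    }
}
```

Compiling and running this requires the Apache POI jars that contain the HWPF classes on the classpath.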
-
Constructor Summary
Constructor                          Description
WordExtractor(InputStream is)        Create a new Word Extractor
WordExtractor(HWPFDocument doc)      Create a new Word Extractor
WordExtractor(DirectoryNode dir)
WordExtractor(POIFSFileSystem fs)    Create a new Word Extractor
-
Method Summary
Modifier and Type   Method                      Description
String[]            getCommentsText()
String[]            getEndnoteText()
String              getFooterText()             Deprecated. 3.8 beta 4
String[]            getFootnoteText()
String              getHeaderText()             Deprecated. 3.8 beta 4
String[]            getMainTextboxText()
String[]            getParagraphText()          Get the text from the word file, as an array with one String per paragraph
String              getText()                   Grab the text, based on the WordToTextConverter.
String              getTextFromPieces()         Grab the text out of the text pieces.
static void         main(String[] args)         Command line extractor, so people will stop moaning that they can't just run this.
static String       stripFields(String text)    Removes any fields (eg macros, page markers etc) from the string.
-
Methods inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getDocument, getMetadataTextExtractor, getRoot, getSummaryInformation
-
Methods inherited from class org.apache.poi.extractor.POITextExtractor
close, setFilesystem
-
Constructor Detail
-
WordExtractor
public WordExtractor(InputStream is) throws IOException
Create a new Word Extractor
Parameters:
is - InputStream containing the word file
Throws:
IOException
-
WordExtractor
public WordExtractor(POIFSFileSystem fs) throws IOException
Create a new Word Extractor
Parameters:
fs - POIFSFileSystem containing the word file
Throws:
IOException
-
WordExtractor
public WordExtractor(DirectoryNode dir) throws IOException
Throws:
IOException
-
WordExtractor
public WordExtractor(HWPFDocument doc)
Create a new Word Extractor
Parameters:
doc - The HWPFDocument to extract from
-
Method Detail
-
main
public static void main(String[] args) throws IOException
Command line extractor, so people will stop moaning that they can't just run this.
Throws:
IOException
-
getParagraphText
public String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph
-
getFootnoteText
public String[] getFootnoteText()
-
getMainTextboxText
public String[] getMainTextboxText()
-
getEndnoteText
public String[] getEndnoteText()
-
getCommentsText
public String[] getCommentsText()
-
getHeaderText
@Deprecated public String getHeaderText()
Deprecated. 3.8 beta 4
Grab the text from the headers
-
getFooterText
@Deprecated public String getFooterText()
Deprecated. 3.8 beta 4
Grab the text from the footers
-
getTextFromPieces
public String getTextFromPieces()
Grab the text out of the text pieces. Might also include various bits of crud, but will work in cases where the text piece -> paragraph mapping is broken. Fast too.
-
getText
public String getText()
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but slower than getTextFromPieces().
Specified by:
getText in class POITextExtractor
Returns:
All the text from the document
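The method summary lists stripFields(String text) as removing fields (macros, page markers and so on) from a string. To give a feel for what that involves, the sketch below assumes the usual Word binary convention in which a field is bracketed by the control characters U+0013 (begin), U+0014 (separator) and U+0015 (end), with the field instruction before the separator and the displayed result after it. FieldStripSketch and stripFieldCodes are hypothetical names. This is not POI's implementation, and its treatment of nested or unterminated fields may differ from the library's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, stdlib-only illustration of field stripping; not POI code.
final class FieldStripSketch {

    // Assumed field marker characters of the Word binary format.
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    // Drops field instructions and marker characters, keeps field results.
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while still inside its instruction part.
        Deque<Boolean> open = new ArrayDeque<>();
        int hidden = 0; // number of open fields currently in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR && !open.isEmpty() && open.peek()) {
                // Instruction part ends here; the result part follows.
                open.pop();
                open.push(Boolean.FALSE);
                hidden--;
            } else if (c == FIELD_END && !open.isEmpty()) {
                // A field without a separator is all instruction.
                if (open.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Ordinary text, or a stray marker outside any field.
                out.append(c);
            }
            // Simplification: text after an unterminated FIELD_BEGIN is dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page " + FIELD_BEGIN + "PAGE" + FIELD_SEPARATOR + "3" + FIELD_END + " of 10";
        System.out.println(stripFieldCodes(raw)); // prints "Page 3 of 10"
    }
}
```

In application code, calling WordExtractor.stripFields(text) directly is preferable to a hand-rolled helper like this one.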