Package org.apache.poi.hwpf.extractor
Class WordExtractor
java.lang.Object
org.apache.poi.hwpf.extractor.WordExtractor
- All Implemented Interfaces:
Closeable, AutoCloseable, POIOLE2TextExtractor, POITextExtractor
Class to extract the text from a Word Document.
You should use either getParagraphText() or getText() unless you have a
strong reason otherwise.
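As a minimal sketch of the two recommended entry points, the following reads a .doc file with getText() and with getParagraphText(). The class name WordExtractorUsage, the helpers readAll and readParagraphs, and the command-line argument are illustrative assumptions. The sketch also assumes the Apache POI HWPF classes (the poi-scratchpad artifact) are on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical helper class; only WordExtractor and its methods come from POI.
class WordExtractorUsage {

    // Returns the whole document text via getText().
    static String readAll(Path docFile) throws IOException {
        // WordExtractor is Closeable, so try-with-resources releases it.
        try (InputStream in = Files.newInputStream(docFile);
             WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getText();
        }
    }

    // Returns one String per paragraph via getParagraphText().
    static String[] readParagraphs(Path docFile) throws IOException {
        try (InputStream in = Files.newInputStream(docFile);
             WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getParagraphText();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WordExtractorUsage <file.doc>");
            return;
        }
        Path docFile = Paths.get(args[0]);
        System.out.println(readAll(docFile));
        System.out.println(readParagraphs(docFile).length + " paragraphs");
    }
}
```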
-
Constructor Summary
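Constructors
Constructor / Description
WordExtractor(InputStream is) - Create a new Word Extractor
WordExtractor(POIFSFileSystem fs) - Create a new Word Extractor
WordExtractor(HWPFDocument doc) - Create a new Word Extractor
-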
Method Summary
Modifier and Type / Method / Description
String[] getCommentsText()
getDocument() - Return the underlying POIDocument
String[] getEndnoteText()
getFilesystem()
getFooterText() - Deprecated. 3.8 beta 4
String[] getFootnoteText()
getHeaderText() - Deprecated. 3.8 beta 4
String[] getMainTextboxText()
String[] getParagraphText() - Get the text from the word file, as an array with one String per paragraph
getText() - Grab the text, based on the WordToTextConverter.
getTextFromPieces() - Grab the text out of the text pieces.
boolean isCloseFilesystem()
void setCloseFilesystem(boolean doCloseFilesystem)
static String stripFields(String text) - Removes any fields (e.g. macros, page markers, etc.) from the string.
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getMetadataTextExtractor, getRoot, getSummaryInformation
Methods inherited from interface org.apache.poi.extractor.POITextExtractor
close
-
Constructor Details
-
WordExtractor
Create a new Word Extractor
Parameters:
is - InputStream containing the word file
Throws:
IOException
-
WordExtractor
Create a new Word Extractor
Parameters:
fs - POIFSFileSystem containing the word file
Throws:
IOException
-
WordExtractor
Throws:
IOException
-
WordExtractor
Create a new Word Extractor
Parameters:
doc - The HWPFDocument to extract from
-
-
Method Details
-
getParagraphText
Get the text from the word file, as an array with one String per paragraph
-
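Because getParagraphText() returns one String per paragraph, callers often post-process the array. The sketch below drops blank paragraphs and trailing line terminators. It assumes, which the description above does not state, that a returned paragraph may end with "\r" or "\r\n". The class ParagraphCleanup and the method nonBlankParagraphs are hypothetical, and the sample array only stands in for a real getParagraphText() result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing helper; not part of POI.
class ParagraphCleanup {

    // Strips trailing CR/LF from each paragraph and skips paragraphs
    // that contain only whitespace.
    static List<String> nonBlankParagraphs(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String withoutTerminator = paragraph.replaceAll("[\\r\\n]+$", "");
            if (!withoutTerminator.trim().isEmpty()) {
                result.add(withoutTerminator);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for extractor.getParagraphText(); assumed shape only.
        String[] sample = {"Title\r\n", "\r\n", "Body text\r\n"};
        System.out.println(nonBlankParagraphs(sample)); // prints [Title, Body text]
    }
}
```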
getFootnoteText
-
getMainTextboxText
-
getEndnoteText
-
getCommentsText
-
getHeaderText
Deprecated. 3.8 beta 4
Grab the text from the headers
-
getTextFromPieces
Grab the text out of the text pieces. Might also include various bits of crud, but will work in cases where the text piece -> paragraph mapping is broken. It is also fast.
-
getText
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but is slower than getTextFromPieces().
Specified by:
getText in interface POITextExtractor
Returns:
All the text from the document
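The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: try the cleaner method first and use the more tolerant one if it fails. The sketch below models this with plain Supplier<String> arguments so that it runs without a document. The class TextFallback, the method firstSuccessful, and the choice to catch RuntimeException are assumptions, not documented POI behaviour.

```java
import java.util.function.Supplier;

// Hypothetical helper illustrating a "clean first, tolerant second" strategy.
class TextFallback {

    // Returns the primary result, or the fallback result if the primary
    // supplier throws an unchecked exception.
    static String firstSuccessful(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // Simulates a document whose paragraph mapping cannot be read.
        Supplier<String> broken = () -> {
            throw new IllegalStateException("paragraph mapping broken");
        };
        System.out.println(firstSuccessful(broken, () -> "text from pieces"));
        // prints: text from pieces
    }
}
```

With a WordExtractor instance named extractor, the call would plausibly be firstSuccessful(extractor::getText, extractor::getTextFromPieces).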
-
stripFields
Removes any fields (e.g. macros, page markers, etc.) from the string.
-
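To make "removes any fields" concrete, here is a simplified stand-in for stripFields; it is not POI's implementation. It assumes the Word binary convention that a field is delimited by the control characters 0x13 (begin), 0x14 (separator) and 0x15 (end). It drops the field code, keeps the displayed result text, and does not handle nested fields. The class FieldStripSketch and the method stripSimpleFields are hypothetical; real code should call WordExtractor.stripFields(text).

```java
// Hypothetical illustration of field stripping; not part of POI.
class FieldStripSketch {
    // Field delimiter characters, assumed from the Word binary format.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Removes field markers and field codes, keeping any result text
    // that follows the separator. Nested fields are not supported.
    static String stripSimpleFields(String text) {
        StringBuilder out = new StringBuilder();
        boolean inFieldCode = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inFieldCode = true;          // start skipping the field code
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                inFieldCode = false;         // result text or normal text resumes
            } else if (!inFieldCode) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 PAGEREF x \u0014page 3\u0015 for details";
        System.out.println(stripSimpleFields(raw)); // prints: See page 3 for details
    }
}
```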
getDocument
Description copied from interface: POIOLE2TextExtractor
Return the underlying POIDocument
Specified by:
getDocument in interface POIOLE2TextExtractor
Specified by:
getDocument in interface POITextExtractor
Returns:
the underlying POIDocument
-
setCloseFilesystem
public void setCloseFilesystem(boolean doCloseFilesystem)
Specified by:
setCloseFilesystem in interface POITextExtractor
Parameters:
doCloseFilesystem - true (default), if underlying resources/filesystem should be closed on POITextExtractor.close()
-
isCloseFilesystem
public boolean isCloseFilesystem()
Specified by:
isCloseFilesystem in interface POITextExtractor
Returns:
true, if resources/filesystem should be closed on POITextExtractor.close()
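The close-filesystem flag matters when the caller owns the underlying filesystem and wants to keep using it after the extractor is closed. The following is a sketch under assumptions: the class SharedFilesystemExample and the method extractedLength are hypothetical, Apache POI (poi and poi-scratchpad) is assumed to be on the classpath, and the exact resources released by close() are not verified here beyond what the parameter description states.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical example class; only the POI types and methods are real.
class SharedFilesystemExample {

    // Extracts the text while the caller-owned filesystem stays under the
    // control of the outer try-with-resources block.
    static int extractedLength(Path docFile) throws IOException {
        try (InputStream in = Files.newInputStream(docFile);
             POIFSFileSystem fs = new POIFSFileSystem(in)) {
            String text;
            try (WordExtractor extractor = new WordExtractor(fs)) {
                // Ask the extractor not to close the filesystem on close().
                extractor.setCloseFilesystem(false);
                text = extractor.getText();
            }
            // fs is closed by the outer block, after any further use.
            return text.length();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SharedFilesystemExample <file.doc>");
            return;
        }
        System.out.println(extractedLength(Paths.get(args[0])) + " characters extracted");
    }
}
```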
-
getFilesystem
Specified by:
getFilesystem in interface POITextExtractor
Returns:
The underlying resources/filesystem
-