Class WordExtractor

java.lang.Object
org.apache.poi.hwpf.extractor.WordExtractor
All Implemented Interfaces:
Closeable, AutoCloseable, POIOLE2TextExtractor, POITextExtractor

public final class WordExtractor extends Object implements POIOLE2TextExtractor
Extracts the text from a Word document in the binary .doc format handled by HWPF. Prefer getParagraphText() or getText() unless you have a strong reason to use one of the other methods.
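A minimal usage sketch, assuming Apache POI's HWPF classes are on the classpath and that the first command-line argument is the path to a .doc file. The class name ExtractWordText and the helper nonEmptyParagraphs are illustrative and not part of POI. The extractor is opened in try-with-resources, which works because WordExtractor implements Closeable.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hwpf.extractor.WordExtractor;

class ExtractWordText {

    // Illustrative helper (not part of POI): trims each paragraph and
    // drops the ones that are empty after trimming.
    static List<String> nonEmptyParagraphs(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a .doc file (an assumption of this sketch)
        try (InputStream in = new FileInputStream(args[0]);
             WordExtractor extractor = new WordExtractor(in)) {
            // Whole document as a single String
            System.out.println(extractor.getText());

            // One String per paragraph
            for (String paragraph : nonEmptyParagraphs(extractor.getParagraphText())) {
                System.out.println("[paragraph] " + paragraph);
            }
        }
    }
}
```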
  • Constructor Details

    • WordExtractor

      public WordExtractor(InputStream is) throws IOException
      Create a new Word Extractor
      Parameters:
      is - InputStream containing the Word file
      Throws:
      IOException
    • WordExtractor

      public WordExtractor(POIFSFileSystem fs) throws IOException
      Create a new Word Extractor
      Parameters:
      fs - POIFSFileSystem containing the Word file
      Throws:
      IOException
    • WordExtractor

      public WordExtractor(DirectoryNode dir) throws IOException
      Create a new Word Extractor
      Parameters:
      dir - DirectoryNode containing the Word file
      Throws:
      IOException
    • WordExtractor

      public WordExtractor(HWPFDocument doc)
      Create a new Word Extractor
      Parameters:
      doc - The HWPFDocument to extract from
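The four constructors differ only in where the document comes from. The sketch below shows one way to call each of them, assuming Apache POI is on the classpath. The class WordExtractorSources and its method names are hypothetical. It relies on the documented default that closing the extractor also closes the underlying resources.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

class WordExtractorSources {

    // WordExtractor(InputStream)
    static String fromStream(File file) throws IOException {
        try (InputStream in = new FileInputStream(file);
             WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getText();
        }
    }

    // WordExtractor(POIFSFileSystem), with the file opened read-only
    static String fromFileSystem(File file) throws IOException {
        try (POIFSFileSystem fs = new POIFSFileSystem(file, true);
             WordExtractor extractor = new WordExtractor(fs)) {
            return extractor.getText();
        }
    }

    // WordExtractor(DirectoryNode), using the root directory of the filesystem
    static String fromDirectory(File file) throws IOException {
        try (POIFSFileSystem fs = new POIFSFileSystem(file, true);
             WordExtractor extractor = new WordExtractor(fs.getRoot())) {
            return extractor.getText();
        }
    }

    // WordExtractor(HWPFDocument), for a document that is already loaded
    static String fromDocument(File file) throws IOException {
        try (InputStream in = new FileInputStream(file);
             WordExtractor extractor = new WordExtractor(new HWPFDocument(in))) {
            return extractor.getText();
        }
    }
}
```

The HWPFDocument variant is the natural choice when the same document object is also needed for other work, since the document is parsed only once.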
  • Method Details

    • getParagraphText

      public String[] getParagraphText()
      Get the text from the Word file, as an array with one String per paragraph
    • getFootnoteText

      public String[] getFootnoteText()
    • getMainTextboxText

      public String[] getMainTextboxText()
    • getEndnoteText

      public String[] getEndnoteText()
    • getCommentsText

      public String[] getCommentsText()
    • getHeaderText

      @Deprecated public String getHeaderText()
      Deprecated.
      Since 3.8 beta 4.
      Grab the text from the headers
    • getFooterText

      @Deprecated public String getFooterText()
      Deprecated.
      Since 3.8 beta 4.
      Grab the text from the footers
    • getTextFromPieces

      public String getTextFromPieces()
      Grab the text out of the text pieces. The result might include stray non-text content, but this method works in cases where the text piece -> paragraph mapping is broken. It is also fast.
    • getText

      public String getText()
      Grab the text, based on the WordToTextConverter. The result shouldn't include stray non-text content, but this method is slower than getTextFromPieces().
      Specified by:
      getText in interface POITextExtractor
      Returns:
      All the text from the document
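The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: try the cleaner method first and use the text pieces only if it fails. The sketch below is one possible policy, not something POI prescribes. It assumes that a failure in getText() surfaces as a RuntimeException, and the class TextWithFallback and its methods are hypothetical.

```java
import java.util.function.Supplier;

import org.apache.poi.hwpf.extractor.WordExtractor;

class TextWithFallback {

    // Illustrative helper: returns the primary result, or the fallback
    // result if the primary supplier throws a RuntimeException.
    static String firstSuccessful(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    // Prefer the clean text; fall back to the raw text pieces.
    // Assumption: a broken document makes getText() throw.
    static String extract(WordExtractor extractor) {
        return firstSuccessful(extractor::getText, extractor::getTextFromPieces);
    }
}
```

Text obtained through the fallback path may still need cleaning, for example with stripFields(String).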
    • stripFields

      public static String stripFields(String text)
      Removes any fields (e.g. macros, page markers, etc.) from the string.
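Because stripFields is static, it can be applied to any String, including text returned by getTextFromPieces(). The sketch below assumes that fields in the raw text are delimited by the control characters U+0013 (field begin), U+0014 (separator) and U+0015 (field end), as in the binary Word format. The sample string and the class StripFieldsExample are made up for illustration.

```java
import org.apache.poi.hwpf.extractor.WordExtractor;

class StripFieldsExample {

    // Thin wrapper so the call site reads clearly in the sketch.
    static String clean(String raw) {
        return WordExtractor.stripFields(raw);
    }

    public static void main(String[] args) {
        // Hypothetical raw text containing a PAGE field whose result is "3".
        // The field markers U+0013, U+0014 and U+0015 are an assumption.
        String raw = "Page \u0013 PAGE \u00143\u0015 of 10";
        System.out.println(clean(raw));

        // Text without field markers is expected to pass through unchanged.
        System.out.println(clean("plain text"));
    }
}
```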
    • getDocument

      public HWPFDocument getDocument()
      Description copied from interface: POIOLE2TextExtractor
      Return the underlying POIDocument
      Specified by:
      getDocument in interface POIOLE2TextExtractor
      Specified by:
      getDocument in interface POITextExtractor
      Returns:
      the underlying POIDocument
    • setCloseFilesystem

      public void setCloseFilesystem(boolean doCloseFilesystem)
      Specified by:
      setCloseFilesystem in interface POITextExtractor
      Parameters:
      doCloseFilesystem - true (default), if underlying resources/filesystem should be closed on POITextExtractor.close()
    • isCloseFilesystem

      public boolean isCloseFilesystem()
      Specified by:
      isCloseFilesystem in interface POITextExtractor
      Returns:
      true, if resources/filesystem should be closed on POITextExtractor.close()
    • getFilesystem

      public HWPFDocument getFilesystem()
      Specified by:
      getFilesystem in interface POITextExtractor
      Returns:
      The underlying resources/filesystem
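setCloseFilesystem(false) is useful when the caller owns the filesystem and wants to keep using it after the extractor is closed. The sketch below assumes Apache POI is on the classpath. The class SharedFileSystemExample and its method are hypothetical, and the entry count is just one example of continued use of the filesystem.

```java
import java.io.File;
import java.io.IOException;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

class SharedFileSystemExample {

    // Extracts the text, then keeps using the same filesystem.
    static String textAndEntryCount(File file) throws IOException {
        try (POIFSFileSystem fs = new POIFSFileSystem(file, true)) {
            String text;
            try (WordExtractor extractor = new WordExtractor(fs)) {
                // The caller owns fs, so the extractor must not close it.
                extractor.setCloseFilesystem(false);
                text = extractor.getText();
            }
            // fs is still open here because the extractor left it alone.
            int entries = fs.getRoot().getEntryCount();
            return entries + " top-level entries; text length " + text.length();
        }
    }
}
```

With the default setting (true), the outer try-with-resources would be redundant, because closing the extractor already closes the underlying resources.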