Class TikaInputStream

All Implemented Interfaces:
Closeable, AutoCloseable

public class TikaInputStream extends TaggedInputStream
Input stream with extended capabilities. The purpose of this class is to allow files and other resources and information to be associated with the InputStream instance passed through the Parser interface and other similar APIs.

TikaInputStream instances can be created using the various static get() factory methods. Most of these methods take an optional Metadata argument that is then filled with the available input metadata from the given resource. The created TikaInputStream instance keeps track of the original resource used to create it, while behaving otherwise just like a normal, buffered InputStream. A TikaInputStream instance is also guaranteed to support the mark(int) feature.

Code that wants to access the underlying file or other resources associated with a TikaInputStream should first use the get(InputStream) factory method to cast or wrap a given InputStream into a TikaInputStream instance.

TikaInputStream includes a few safety features to protect against parsers that may fail to check for an EOF or may incorrectly rely on the unreliable value returned from FileInputStream.skip(long). These parser failures can lead to infinite loops. We strongly encourage the use of TikaInputStream.

Since:
Apache Tika 0.8
  • Method Details

    • isTikaInputStream

      public static boolean isTikaInputStream(InputStream stream)
      Checks whether the given stream is a TikaInputStream instance. The given stream can be null, in which case the return value is false.
      Parameters:
      stream - input stream, possibly null
      Returns:
      true if the stream is a TikaInputStream instance, false otherwise
    • get

      public static TikaInputStream get(InputStream stream, TemporaryResources tmp)
      Casts or wraps the given stream to a TikaInputStream instance. This method can be used to access the functionality of this class even when given just a normal input stream instance.

      The given temporary file provider is used for any temporary files, and should be disposed when the returned stream is no longer used.

      Use this method instead of the get(InputStream) alternative when you don't explicitly close the returned stream. The recommended access pattern is:

       try (TemporaryResources tmp = new TemporaryResources()) {
           TikaInputStream stream = TikaInputStream.get(..., tmp);
           // process stream but don't close it
       }
       

      The given stream instance will not be closed when the TemporaryResources.close() method is called by the try-with-resources statement. The caller is expected to explicitly close the original stream when it's no longer used.

      Parameters:
      stream - normal input stream
      Returns:
      a TikaInputStream instance
      Since:
      Apache Tika 0.10
    • get

      public static TikaInputStream get(InputStream stream)
      Casts or wraps the given stream to a TikaInputStream instance. This method can be used to access the functionality of this class even when given just a normal input stream instance.

      Use this method instead of the get(InputStream, TemporaryResources) alternative when you do explicitly close the returned stream. The recommended access pattern is:

       try (TikaInputStream stream = TikaInputStream.get(...)) {
           // process stream
       }
       

      The given stream instance will be closed along with any other resources associated with the returned TikaInputStream instance when the close() method is called by the try-with-resources statement.

      Parameters:
      stream - normal input stream
      Returns:
      a TikaInputStream instance
    • cast

      public static TikaInputStream cast(InputStream stream)
      Returns the given stream cast to a TikaInputStream, or null if the stream is not a TikaInputStream.
      Parameters:
      stream - normal input stream
      Returns:
      a TikaInputStream instance, or null if the given stream is not a TikaInputStream
      Since:
      Apache Tika 0.10
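
      The difference between a type check, a pure cast, and "cast or wrap" can be illustrated without Tika on the classpath. The sketch below is not Tika code: it uses java.io.BufferedInputStream as a stand-in for TikaInputStream, and the class name CastOrWrapSketch and its method bodies are assumptions made for illustration only.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Objects;

// Stand-in sketch: BufferedInputStream plays the role of TikaInputStream.
final class CastOrWrapSketch {

    /** Null-safe type check, in the spirit of isTikaInputStream. */
    static boolean isBuffered(InputStream in) {
        return in instanceof BufferedInputStream; // instanceof is false for null
    }

    /** Cast only: returns null instead of wrapping, in the spirit of cast. */
    static BufferedInputStream cast(InputStream in) {
        return in instanceof BufferedInputStream ? (BufferedInputStream) in : null;
    }

    /** Cast or wrap: reuses an existing instance, in the spirit of get(InputStream). */
    static BufferedInputStream get(InputStream in) {
        Objects.requireNonNull(in, "stream"); // null handling is a choice of this sketch
        BufferedInputStream existing = cast(in);
        return existing != null ? existing : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream plain = new ByteArrayInputStream(new byte[] {1, 2, 3});
        BufferedInputStream wrapped = get(plain);
        System.out.println("plain is buffered: " + isBuffered(plain));
        System.out.println("get reuses instance: " + (get(wrapped) == wrapped));
        System.out.println("cast of plain is null: " + (cast(plain) == null));
    }
}
```

      The point of the pattern is that callers can ask for the extended type unconditionally: a stream that already has it is returned as is, and any other stream is wrapped once.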
    • get

      public static TikaInputStream get(byte[] data)
      Creates a TikaInputStream from the given array of bytes.

      Note that you must always explicitly close the returned stream as in some cases it may end up writing the given data to a temporary file.

      Parameters:
      data - input data
      Returns:
      a TikaInputStream instance
    • get

      public static TikaInputStream get(byte[] data, Metadata metadata)
      Creates a TikaInputStream from the given array of bytes. The length of the array is stored as input metadata in the given metadata instance.

      Note that you must always explicitly close the returned stream as in some cases it may end up writing the given data to a temporary file.

      Parameters:
      data - input data
      metadata - metadata instance
      Returns:
      a TikaInputStream instance
      Throws:
      IOException
    • get

      public static TikaInputStream get(Path path) throws IOException
      Creates a TikaInputStream from the file at the given path.

      Note that you must always explicitly close the returned stream to prevent leaking open file handles.

      Parameters:
      path - input file
      Returns:
      a TikaInputStream instance
      Throws:
      IOException - if an I/O error occurs
    • get

      public static TikaInputStream get(Path path, Metadata metadata) throws IOException
      Creates a TikaInputStream from the file at the given path. The file name and length are stored as input metadata in the given metadata instance.

      Note that you must always explicitly close the returned stream to prevent leaking open file handles.

      Parameters:
      path - input file
      metadata - metadata instance
      Returns:
      a TikaInputStream instance
      Throws:
      IOException - if an I/O error occurs
    • get

      @Deprecated public static TikaInputStream get(File file) throws FileNotFoundException
      Deprecated.
      Use get(Path) instead. In Tika 2.0, this method will be removed or modified to throw an IOException.
      Creates a TikaInputStream from the given file.

      Note that you must always explicitly close the returned stream to prevent leaking open file handles.

      Parameters:
      file - input file
      Returns:
      a TikaInputStream instance
      Throws:
      FileNotFoundException - if the file does not exist
    • get

      @Deprecated public static TikaInputStream get(File file, Metadata metadata) throws FileNotFoundException
      Deprecated.
      Use get(Path, Metadata) instead. In Tika 2.0, this method will be removed or modified to throw an IOException.
      Creates a TikaInputStream from the given file. The file name and length are stored as input metadata in the given metadata instance.

      Note that you must always explicitly close the returned stream to prevent leaking open file handles.

      Parameters:
      file - input file
      metadata - metadata instance
      Returns:
      a TikaInputStream instance
      Throws:
      FileNotFoundException - if the file does not exist or cannot be opened for reading
    • get

      public static TikaInputStream get(Blob blob) throws SQLException
      Creates a TikaInputStream from the given database BLOB.

      Note that the result set containing the BLOB may need to be kept open until the returned TikaInputStream has been processed and closed. You must also always explicitly close the returned stream as in some cases it may end up writing the blob data to a temporary file.

      Parameters:
      blob - database BLOB
      Returns:
      a TikaInputStream instance
      Throws:
      SQLException - if BLOB data can not be accessed
    • get

      public static TikaInputStream get(Blob blob, Metadata metadata) throws SQLException
      Creates a TikaInputStream from the given database BLOB. The BLOB length (if available) is stored as input metadata in the given metadata instance.

      Note that the result set containing the BLOB may need to be kept open until the returned TikaInputStream has been processed and closed. You must also always explicitly close the returned stream as in some cases it may end up writing the blob data to a temporary file.

      Parameters:
      blob - database BLOB
      metadata - metadata instance
      Returns:
      a TikaInputStream instance
      Throws:
      SQLException - if BLOB data can not be accessed
    • get

      public static TikaInputStream get(URI uri) throws IOException
      Creates a TikaInputStream from the resource at the given URI.

      Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

      Parameters:
      uri - resource URI
      Returns:
      a TikaInputStream instance
      Throws:
      IOException - if the resource can not be accessed
    • get

      public static TikaInputStream get(URI uri, Metadata metadata) throws IOException
      Creates a TikaInputStream from the resource at the given URI. The available input metadata is stored in the given metadata instance.

      Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

      Parameters:
      uri - resource URI
      metadata - metadata instance
      Returns:
      a TikaInputStream instance
      Throws:
      IOException - if the resource can not be accessed
    • get

      public static TikaInputStream get(URL url) throws IOException
      Creates a TikaInputStream from the resource at the given URL.

      Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

      Parameters:
      url - resource URL
      Returns:
      a TikaInputStream instance
      Throws:
      IOException - if the resource can not be accessed
    • get

      public static TikaInputStream get(URL url, Metadata metadata) throws IOException
      Creates a TikaInputStream from the resource at the given URL. The available input metadata is stored in the given metadata instance.

      Note that you must always explicitly close the returned stream as in some cases it may end up writing the resource to a temporary file.

      Parameters:
      url - resource URL
      metadata - metadata instance
      Returns:
      a TikaInputStream instance
      Throws:
      IOException - if the resource can not be accessed
    • peek

      public int peek(byte[] buffer) throws IOException
      Fills the given buffer with upcoming bytes from this stream without advancing the current stream position. The buffer is filled up unless the end of stream is encountered before that. This method will block if not enough bytes are immediately available.
      Parameters:
      buffer - byte buffer
      Returns:
      number of bytes written to the buffer
      Throws:
      IOException - if the stream can not be read
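
      The behaviour described for peek (fill the buffer, stop early only at end of stream, leave the position unchanged) can be reproduced on any stream that supports mark/reset. The following is a minimal sketch using only java.io, not Tika's implementation; the class name PeekSketch and the helper signature are assumptions.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class PeekSketch {

    /**
     * Copies up to buffer.length upcoming bytes into buffer and then rewinds,
     * so the next read sees the same bytes again. The stream must support
     * mark/reset (for example a BufferedInputStream).
     */
    static int peek(InputStream in, byte[] buffer) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(buffer.length);
        int total = 0;
        try {
            // Loop until the buffer is full or the stream ends: a single
            // read call is allowed to return fewer bytes than requested.
            while (total < buffer.length) {
                int n = in.read(buffer, total, buffer.length - total);
                if (n == -1) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset(); // restore the position recorded by mark
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {1, 2, 3}));
        byte[] head = new byte[2];
        int n = peek(in, head);
        System.out.println(n + " bytes peeked, next read() = " + in.read());
    }
}
```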
    • getOpenContainer

      public Object getOpenContainer()
      Returns the open container object, such as a POIFS FileSystem in the event of an OLE2 document being detected and processed by the OLE2 detector.
    • setOpenContainer

      public void setOpenContainer(Object container)
      Stores the open container object against the stream, for example after a Zip contents detector has loaded the file to decide what it contains.
    • hasFile

      public boolean hasFile()
    • getPath

      public Path getPath() throws IOException
      If this TikaInputStream was created from a file, returns the path of that original file. Otherwise, spools the entire stream to a temporary file, which is deleted when this TikaInputStream is closed.
      Returns:
      the path of the original file, or of the temporary file that holds the spooled stream
      Throws:
      IOException
    • getPath

      public Path getPath(int maxBytes) throws IOException
      Parameters:
      maxBytes - maximum number of bytes to spool to a temporary file; if this is less than 0 and no underlying file exists yet, the full stream is spooled to disk
      Returns:
      the original path used in the initialization of this TikaInputStream, a temporary file if the stream was shorter than maxBytes, or null if the underlying stream was longer than maxBytes.
      Throws:
      IOException
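
      The size-limited spooling described above can be sketched with java.nio.file alone. This is an illustration under stated assumptions, not Tika's implementation: the class name SpoolSketch is hypothetical, a stream of exactly maxBytes bytes is treated as fitting, the caller deletes the returned file, and the sketch consumes the input without rewinding it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class SpoolSketch {

    /**
     * Copies the stream to a new temporary file. Returns the file path, or
     * null if the stream holds more than maxBytes bytes. A negative maxBytes
     * means "no limit". The caller is responsible for deleting the file.
     */
    static Path spool(InputStream in, int maxBytes) throws IOException {
        Path tmp = Files.createTempFile("spool-sketch-", ".tmp");
        boolean keep = false;
        try (OutputStream out = Files.newOutputStream(tmp)) {
            byte[] chunk = new byte[4096];
            long written = 0;
            int n;
            while ((n = in.read(chunk)) != -1) {
                written += n;
                if (maxBytes >= 0 && written > maxBytes) {
                    return null; // too long: give up, the finally block cleans up
                }
                out.write(chunk, 0, n);
            }
            keep = true;
            return tmp;
        } finally {
            // Runs after the output stream has been closed.
            if (!keep) {
                Files.deleteIfExists(tmp);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5};
        Path full = spool(new ByteArrayInputStream(data), -1);
        System.out.println("spooled bytes: " + Files.size(full));
        Files.deleteIfExists(full);
        System.out.println("over limit: " + spool(new ByteArrayInputStream(data), 3));
    }
}
```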
    • getFile

      public File getFile() throws IOException
      Throws:
      IOException
    • getFileChannel

      public FileChannel getFileChannel() throws IOException
      Throws:
      IOException
    • hasLength

      public boolean hasLength()
    • getLength

      public long getLength() throws IOException
      Returns the length (in bytes) of this stream. Note that if the length was not available when this stream was instantiated, then this method will use the getPath() method to buffer the entire stream to a temporary file in order to calculate the stream length. This case will only work if the stream has not yet been consumed.
      Returns:
      stream length
      Throws:
      IOException - if the length can not be determined
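
      The caveat about consumed streams follows directly from how a length is measured by spooling: only the bytes still unread can be counted. The sketch below shows this with plain java.nio.file calls; it is not Tika's implementation, and the class name LengthSketch is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class LengthSketch {

    /**
     * Measures a stream by copying everything that is left in it to a
     * temporary file. Files.copy returns the number of bytes copied.
     */
    static long lengthViaTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("length-sketch-", ".tmp");
        try {
            // createTempFile already created the file, so allow overwriting it.
            return Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5};

        InputStream fresh = new ByteArrayInputStream(data);
        System.out.println("fresh stream: " + lengthViaTempFile(fresh));

        InputStream partlyRead = new ByteArrayInputStream(data);
        partlyRead.read();
        partlyRead.read();
        // Two bytes are already gone, so the measured length is too small.
        System.out.println("partly read stream: " + lengthViaTempFile(partlyRead));
    }
}
```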
    • getPosition

      public long getPosition()
      Returns the current position within the stream.
      Returns:
      stream position
    • skip

      public long skip(long ln) throws IOException
      Skips bytes via IOUtils.skip(InputStream, long), which ensures that the number of bytes reported as skipped was actually skipped.
      Overrides:
      skip in class ProxyInputStream
      Parameters:
      ln - the number of bytes to skip
      Returns:
      the number of bytes skipped
      Throws:
      IOException - if the number of bytes requested to be skipped does not match the number of bytes skipped or if there's an IOException during the read.
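
      One common way to make skipping reliable is to read and discard bytes instead of trusting the return value of InputStream.skip(long). The sketch below shows that approach and the "fail if fewer bytes were skipped" contract described above. It is an illustration only: the class name SkipSketch and the method skipExactly are hypothetical, and this is not the code of IOUtils.skip.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SkipSketch {

    /**
     * Skips exactly n bytes by reading and discarding them.
     * Throws IOException if the stream ends before n bytes were skipped.
     */
    static long skipExactly(InputStream in, long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        byte[] scratch = new byte[4096];
        long remaining = n;
        while (remaining > 0) {
            int want = (int) Math.min(scratch.length, remaining);
            int got = in.read(scratch, 0, want);
            if (got == -1) {
                throw new IOException("asked to skip " + n
                        + " bytes, but only " + (n - remaining) + " were available");
            }
            remaining -= got;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        skipExactly(in, 3);
        System.out.println("next byte after skipping 3: " + in.read());
        try {
            skipExactly(in, 10);
        } catch (IOException e) {
            System.out.println("short skip rejected: " + e.getMessage());
        }
    }
}
```

      Because every loop iteration either makes progress or throws, a caller that skips in a loop cannot spin forever on a stream that has reached its end.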
    • mark

      public void mark(int readlimit)
      Description copied from class: ProxyInputStream
      Invokes the delegate's mark(int) method.
      Overrides:
      mark in class ProxyInputStream
      Parameters:
      readlimit - read ahead limit
    • markSupported

      public boolean markSupported()
      Description copied from class: ProxyInputStream
      Invokes the delegate's markSupported() method.
      Overrides:
      markSupported in class ProxyInputStream
      Returns:
      true if mark is supported, otherwise false
    • reset

      public void reset() throws IOException
      Description copied from class: ProxyInputStream
      Invokes the delegate's reset() method.
      Overrides:
      reset in class ProxyInputStream
      Throws:
      IOException - if an I/O error occurs
    • close

      public void close() throws IOException
      Description copied from class: ProxyInputStream
      Invokes the delegate's close() method.
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class ProxyInputStream
      Throws:
      IOException - if an I/O error occurs
    • toString

      public String toString()
      Overrides:
      toString in class TaggedInputStream