Class FST

  • All Implemented Interfaces:
    Iterable<ByteBuffer>
    Direct Known Subclasses:
    ConstantArcSizeFST, ImmutableFST

    public abstract class FST
    extends Object
    implements Iterable<ByteBuffer>
    This is a top abstract class for handling finite state automata. These automata are arc-based, a design described in Jan Daciuk's Incremental Construction of Finite-State Automata and Transducers, and Their Use in the Natural Language Processing (PhD thesis, Technical University of Gdansk).
    • Constructor Detail

      • FST

        public FST()
    • Method Detail

      • readRemaining

        protected static byte[] readRemaining​(InputStream in,
                                              int length)
                                       throws IOException
        Parameters:
        in - The input stream.
        length - Length of input to be read
        Returns:
        Reads remaining bytes upto length from an input stream and returns them as a byte array. Null if no data was read
        Throws:
        IOException - Rethrown if an I/O exception occurs.
      • read

        public static FST read​(InputStream stream,
                               boolean hasOutputSymbols,
                               PinotDataBufferMemoryManager memoryManager)
                        throws IOException
        A factory for reading automata in any of the supported versions.
        Parameters:
        stream - The input stream to read automaton data from. The stream is not closed.
        Returns:
        Returns an instantiated automaton. Never null.
        Throws:
        IOException - If the input stream does not represent an automaton or is otherwise invalid.
      • read

        public static <T extends FST> T read​(InputStream stream,
                                             Class<? extends T> clazz,
                                             boolean hasOutputSymbols)
                                      throws IOException
        A factory for reading a specific FST subclass, including proper casting.
        Type Parameters:
        T - A subclass of FST to cast the read automaton to.
        Parameters:
        stream - The input stream to read automaton data from. The stream is not closed.
        clazz - A subclass of FST to cast the read automaton to.
        Returns:
        Returns an instantiated automaton. Never null.
        Throws:
        IOException - If the input stream does not represent an automaton, is otherwise invalid or the class of the automaton read from the input stream is not assignable to clazz.
      • printToString

        public static String printToString​(FST fst)
        Print to String
      • getRootNode

        public abstract int getRootNode()
        Returns:
        Returns the identifier of the root node of this automaton. Returns 0 if the start node is also the end node (the automaton is empty).
      • getFirstArc

        public abstract int getFirstArc​(int node)
        Parameters:
        node - Identifier of the node.
        Returns:
        Returns the identifier of the first arc leaving node or 0 if the node has no outgoing arcs.
      • getNextArc

        public abstract int getNextArc​(int arc)
        Parameters:
        arc - The arc's identifier.
        Returns:
        Returns the identifier of the next arc after arc and leaving node. Zero is returned if no more arcs are available for the node.
      • getArc

        public abstract int getArc​(int node,
                                   byte label)
        Parameters:
        node - Identifier of the node.
        label - The arc's label.
        Returns:
        Returns the identifier of an arc leaving node and labeled with label. An identifier equal to 0 means the node has no outgoing arc labeled label.
      • getArcLabel

        public abstract byte getArcLabel​(int arc)
        Parameters:
        arc - The arc's identifier.
        Returns:
        Return the label associated with a given arc.
      • getOutputSymbol

        public abstract int getOutputSymbol​(int arc)
        Get output symbol for the given arc
        Parameters:
        arc - Arc for which the output symbol is requested
        Returns:
        Output symbol, null if not present
      • isArcFinal

        public abstract boolean isArcFinal​(int arc)
        Parameters:
        arc - The arc's identifier.
        Returns:
        Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.
      • isArcTerminal

        public abstract boolean isArcTerminal​(int arc)
        Parameters:
        arc - The arc's identifier.
        Returns:
        Returns true if this arc does not have a terminating node (@link getEndNode(int) will throw an exception). Implies isArcFinal(int).
      • getEndNode

        public abstract int getEndNode​(int arc)
        Parameters:
        arc - The arc's identifier.
        Returns:
        Return the end node pointed to by a given arc. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.
      • getFlags

        public abstract Set<FSTFlags> getFlags()
        Returns:
        Returns a set of flags for this FST instance.
      • getRightLanguageCount

        public int getRightLanguageCount​(int node)
        Parameters:
        node - Identifier of the node.
        Returns:
        Returns the number of sequences reachable from the given state if the automaton was compiled with FSTFlags.NUMBERS. The size of the right language of the state, in other words.
        Throws:
        UnsupportedOperationException - If the automaton was not compiled with FSTFlags.NUMBERS. The value can then be computed by manual count of getSequences(int).
      • getSequences

        public Iterable<ByteBuffer> getSequences​(int node)
        Returns an iterator over all binary sequences starting at the given FST state (node) and ending in final nodes. This corresponds to a set of suffixes of a given prefix from all sequences stored in the automaton.

        The returned iterator is a ByteBuffer whose contents changes on each call to Iterator.next(). The keep the contents between calls to Iterator.next(), one must copy the buffer to some other location.

        Important. It is guaranteed that the returned byte buffer is backed by a byte array and that the content of the byte buffer starts at the array's index 0.

        Parameters:
        node - Identifier of the starting node from which to return subsequences.
        Returns:
        An iterable over all sequences encoded starting at the given node.
      • iterator

        public final Iterator<ByteBuffer> iterator()
        Returns an iterator over all binary sequences starting from the initial FST state (node) and ending in final nodes. The returned iterator is a ByteBuffer whose contents changes on each call to Iterator.next(). The keep the contents between calls to Iterator.next(), one must copy the buffer to some other location.

        Important. It is guaranteed that the returned byte buffer is backed by a byte array and that the content of the byte buffer starts at the array's index 0.

        Specified by:
        iterator in interface Iterable<ByteBuffer>
      • visitInPostOrder

        public <T extends StateVisitor> T visitInPostOrder​(T v)
        Same as visitInPostOrder(StateVisitor, int), starting from root automaton node.
        Type Parameters:
        T - A subclass of StateVisitor.
        Parameters:
        v - Visitor to receive traversal calls.
        Returns:
        Returns the argument (for access to anonymous class fields).
      • visitInPostOrder

        public <T extends StateVisitor> T visitInPostOrder​(T v,
                                                           int node)
        Visits all states reachable from node in postorder. Returning false from StateVisitor.accept(int) immediately terminates the traversal.
        Type Parameters:
        T - A subclass of StateVisitor.
        Parameters:
        v - Visitor to receive traversal calls.
        node - Identifier of the node.
        Returns:
        Returns the argument (for access to anonymous class fields).
      • visitInPreOrder

        public <T extends StateVisitor> T visitInPreOrder​(T v)
        Same as visitInPreOrder(StateVisitor, int), starting from root automaton node.
        Type Parameters:
        T - A subclass of StateVisitor.
        Parameters:
        v - Visitor to receive traversal calls.
        Returns:
        Returns the argument (for access to anonymous class fields).
      • visitInPreOrder

        public <T extends StateVisitor> T visitInPreOrder​(T v,
                                                          int node)
        Visits all states in preorder. Returning false from StateVisitor.accept(int) skips traversal of all sub-states of a given state.
        Type Parameters:
        T - A subclass of StateVisitor.
        Parameters:
        v - Visitor to receive traversal calls.
        node - Identifier of the node.
        Returns:
        Returns the argument (for access to anonymous class fields).
      • buildMap

        protected Map<Integer,​Integer> buildMap​(String inputString)
        Build a map from a serialized string
        Parameters:
        inputString - Serialized string
        Returns:
      • isArcLast

        public abstract boolean isArcLast​(int arc)
      • toString

        public String toString()
        Returns a string representation of this automaton.
        Overrides:
        toString in class Object