public class FactorSequence
This class builds a list of factors of a character sequence as induced by its Lempel-Ziv (LZ78) factorisation. Each factor is enconded as the longest factor previously seen plus one character.
The input can come from any source, provided it is encapsulated in a proper Reader instance. The stream is expected to be ready (i.e. the next read operation must return the first character of the sequence) and it is not closed when its end is reached, so the client is allowed to reset it and maybe use it for another purpose.
Sequences can contain letters only although lines started with the COMMENT_CHAR character ('>') are regarded as comments and are completely skipped. White spaces (including tabs, line feeds and carriage returns) are also ignored throughout.
This class uses a class Trie to keep track of a list of factors. Each node of the trie contains a class Factor of the text. As the sequence is read from the input, the trie is traversed as far as possible. When a leaf node is reached (which means that the longest prefix of the input has been found), two tasks are accomplished:
Factor is created with the character at the current position of the input and the leaf node's factor; Each factor also receives a serial number according to the order they are found and a pointer to the next factor (in that order) for fast access. This pointer, together with the factor's ancestor pointer forms a doubly-linked list of factors. The original text can then be reconstructed simply by following the linked list and writing out its factors.
As an example, the sequence ACTAAACCGCATTAATAATAAAA is parsed into the following 12 factors:
0 ( , ) = empty
1 (0,A) = A
2 (0,C) = C
3 (0,T) = T
4 (1,A) = AA
5 (1,C) = AC
6 (2,G) = CG
7 (2,A) = CA
8 (3,T) = TT
9 (4,T) = AAT
10 (9,A) = AATA
11 (4,A) = AAA
serial # (prefix, new char) = factor text
This class is used by class CrochemoreLandauZivUkelson algorithm to speed up the classic dynamic programming approach to sequence alignment.
protected static char COMMENT_CHAR
The character used to start a comment line in a sequence file. When this character is found, the rest of the line is ignored.
protected Factor root_factor
A pointer to the root factor, the one that starts the list of factors.
protected int num_chars
The numbers of character represented by this sequence.
protected int num_factors
The numbers of factors generated by the LZ78 parsing of the sequence.
public FactorSequence(java.io.Reader reader)
Creates a new instance of a FactorSequence, loading the sequence data from the Reader input stream. A doubly-linked list of factors is built according to its LZ78 factorisation.
reader - source of characters for this sequenceIOException - if an I/O exception occurs when reading the inputInvalidSequenceException - if the input does not contain a valid sequencepublic Factor getRootFactor()
Returns the root factor, the one that starts the list of factors.
public int numFactors()
Returns the number of factors produced by the LZ78 parsing of the text.
public int numChars()
Returns the number of characters of the original sequence.
public java.lang.String toString()
Reconstructs the sequence from the list of factors induced by the LZ78 parsing of the text.
public java.lang.String printFactors()
Returns a string representation of the actual list of factors produced by the LZ78 parsing of the text. Each factor is printed out in a separate line, in the order they appear in the text, with its serial number, its ancestor's serial number, its new character, length and a string representation of the factor itself.