public class CrochemoreLandauZivUkelson
extends PairwiseAlignmentAlgorithm
This abstract class is the superclass of both global and local sequence alignment algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and Michal Ziv-Ukelson (2002).
This implementation derives from the paper of M.Crochemore, G.Landau and M.Ziv-Ukelson, A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices (available here as PDF or Postscript).
It employs Lempel-Ziv compression techniques to speed up the classic dynamic programmin approach to sequence alignment (see class NeedlemanWunsch and class SmithWaterman classes). It reduces the time and space complexity from O(n2) to O(n2/log n). In fact, the complexity is actually O(h n2/log n), where 0 <= h <= 1 is a real number denoting the entropy of the text (a measure of how compressible it is). This means that, the more compressible a sequence is, the less memory the algorithm will require, and the faster it will run.
The idea behind this improvement is to identify repetitions in the sequences and reuse the computation of their alignments. The first step is, therefore, to parse the sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished by the class FactorSequence class (for more information about this factorisation, please refer to the specification of that class) that builds a doubly-linked list of factors. Each factor is an instance of the class Factor class (refer to the specification of this class for more information).
Once the sequences have been parsed, the algorithm builds a matrix of blocks, called block table, that is vaguely similar to the dynamic programming matrix used by both NeedlemanWunsch and SmithWaterman. Each block contains an instance of an class AlignmentBlock (please refer to its specification for more information on what information is stored) and represents the alignment beteween one factor of each sequence. This block table is, in fact, a partition of the alignment graph.
Consider a block B which corresponds to the alignment of factor F1 = xa from sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of S1 with character a, while F2 extends a previous factor of S2 with character b. We can define the input border of B as the set of values at the left and top borders of block B, and the output border as the set of values at the right and bottom borders of B. Moreover, we can define the following prefix blocks of B:
Note that each factor has a pointer to its prefix factor, called ancestor (see the specification of the Factor class). This pointer makes it easy to retrieve any of the three prefix blocks of B in constant time.
Rather than computing each value of the alignment block B, the algorithm will only compute the values on its input and output borders. This is precisely what makes it more efficient.
In this class there is a general specification of how the block table is computed (see the method for details). The actual method depends on the subclasses. In general, there are two phases:net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson$computeBlockTable()
In fact, for each block, only one column of the DIST matrix needs to be computed, all other columns are actually retrieved from its prefix blocks. This is precisely what is accomplished by the method found in this class (it is general enough for both global and local alignment versions of the algorithm.net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson$assembleDistMatrix(net.maizegenetics.analysis.gbs.neobio.AlignmentBlock,kotlin.Int,kotlin.Int,kotlin.Int,kotlin.Int)
From the DIST matrix, we obtain the OUT matrix defined as OUT[i,j] = I[i] + DIST[i,j] where I is the input border array. This means that the OUT matrix is the DIST matrix updated by the input border of a block. The output border is then computed from the OUT matrix by taking the maximum value of each column. This class also have a general method for assembling the input border (see net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson$assembleInputBorder(kotlin.Int,kotlin.Int,kotlin.Int,kotlin.Int)
The OUT matrix is encoded by the class OutMatrix class that takes as both a DIST matrix and an input border array. Note that it does not compute the OUT matrix, it just stores the necessary information to retrieve a value at any position of the matrix.
A naive approach to compute the output border of a block from the OUT matrix of size n x n would take a time proportional to n2. However, it happens that, due to the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which implies that they are also totally monotone. This property allows the computation of the output border of B in linear time with the SMAWK algorithm (see the specification of the class Smawk class for more information on SMAWK).
This class contains a general specification that is pertinent to both global and local versions of the algorithm. For more information on each version of, please refer to the appropriate subclass.
A note about the performance of these algorithms. Although theoretical results suggest that these algorithms are faster and consume less memory than the classical methods, in practice it is hard to realise their performance gains.
These algorithms are extremely complex and require the storage of many extra pointers and other auxiliary data for each block (see the AlignmentBlock class for more details). Hence, even though the space requirement is O(n2/log n), which is less than O(n2), in practice, for most of the cases these algorithms will take more time and memory space than their clasical counterparts (we have to keep in mind that the Big Oh notation ignores all constants involved).
Therefore, in order to realise the full power of these algorithms, they have to be used with extremly large and redundant sequences. This will allow a proper reutilisation of the computations and, maybe, provide an improvement in terms of space and run time. For instance, it is easy to devise such a sequence if we use a one-character alphabet because, in this case, a sequence is factorised into a series of factors that are a prefix of the next.
class NeedlemanWunsch,
class SmithWaterman,
class FactorSequence,
class Factor,
class AlignmentBlock,
net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson$computeBlockTable(),
net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson$assembleDistMatrix(net.maizegenetics.analysis.gbs.neobio.AlignmentBlock,kotlin.Int,kotlin.Int,kotlin.Int,kotlin.Int),
net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson$assembleInputBorder(kotlin.Int,kotlin.Int,kotlin.Int,kotlin.Int),
class OutMatrix,
class Smawk,
class CrochemoreLandauZivUkelsonGlobalAlignment,
class CrochemoreLandauZivUkelsonLocalAlignment,
class NeedlemanWunsch,
class SmithWaterman,
class FactorSequence,
class AlignmentBlock,
class OutMatrix,
class Smawk,
#computeBlockTable,
#assembleDistMatrixprotected static byte STOP_DIRECTION
A constant that indicates that the source of an optimal path has been reached in a block and that the trace back procedure to retrieve a high scoring alignment can stop.
protected static byte LEFT_DIRECTION
A constant that indicates that the left direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.
protected static byte DIAGONAL_DIRECTION
A constant that indicates that the diagonal direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.
protected static byte TOP_DIRECTION
A constant that indicates that the top direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.
protected FactorSequence seq1
The first factorised sequence being aligned.
protected FactorSequence seq2
The second factorised sequence being aligned.
protected AlignmentBlock[] block_table
The block table, which is a matrix of alignment blocks where each block represents the alignment between one factor of each sequence.
protected int num_rows
Number of rows of the block table. It is determined by the number of factors of the first sequence.
protected int num_cols
Number of columns of the block table. It is determined by the number of factors of the second sequence.
protected Smawk smawk
An instance of the Smawk class that implements the SMAWK algorithm to compute the column maxima of a totally monotone matrix. It is used to speed up the computation of the output border of a block.
protected OutMatrix out_matrix
An instance of the OutMatrix class that encodes the OUT matrix of a block when supplied with the DIST matrix and the input border array of a block. Note that it does not compute the OUT matrix itselft, it just stores the necessary information to retrieve a value at any position of the matrix.
This object is then used to compute the output border of a block with the Smawk class. Note that the OutMatrix class implements the Matrix interface as required by the Smawk class.
interface Matrix,
class Smawkprotected void loadSequencesInternal(java.io.Reader input1,
java.io.Reader input2)
Loads sequences into FactorSequence instances. In case of any error, an exception is raised by the constructor of FactorSequence (please check the specification of that class for specific requirements).
A FactorSequence is an LZ78-like factorisation of the sequences being aligned.
input1 - Input for first sequenceinput2 - Input for second sequenceIOException - If an I/O error occurs when reading the sequencesInvalidSequenceException - If the sequences are not validclass FactorSequenceprotected void unloadSequencesInternal()
Frees pointers to loaded sequences and the the block table so that their data can be garbage collected.
protected PairwiseAlignment computePairwiseAlignment()
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to retrieve an optimal alignment between the loaded sequences. The actual product depends on the subclasses which can produce a global (see CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment (see CrochemoreLandauZivUkelsonLocalAlignment).
Subclasses are required to implement the buildOptimalAlignment abstract method defined by this class according to their own method.
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.class CrochemoreLandauZivUkelsonGlobalAlignment,
class CrochemoreLandauZivUkelsonLocalAlignment,
#computeBlockTable,
#buildOptimalAlignmentprotected int computeScore()
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to locate the score of the highest scoring alignment between the two sequences in the block table. The result depends on the subclasses, and either a global alignment (see CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment score (see CrochemoreLandauZivUkelsonLocalAlignment) will be produced.
Subclasses are required to implement the locateScore abstract method defined by this class according to their own method.
Note that this method calculates the similarity value only (it doesn't trace back into the block table to retrieve the alignment itself).
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.class CrochemoreLandauZivUkelsonGlobalAlignment,
class CrochemoreLandauZivUkelsonLocalAlignment,
#locateScoreprotected void computeBlockTable()
Computes the block table. This method is a general specification of how the block table should be computed. It creates the block table according to the number of factors of each sequence. It then goes over each position of the block table, retrieves the corresponding factors from each sequence, and repasses the information to the subclasses that will do the actual computation of each block using the scoring scheme previously set.
There are four different cases that defines four abstract methods in this class, which subclasses must implement:
Note that each factor has a serial number which indicates its order in the list of factors of a sequence. This number will match with the row and column index of a block in the block table. For instance, if a block has factors F1 and F2 with serial numbers 12 and 53, respectively, this means that this block is found at row 12, column 53 of the block table.
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.#createRootBlock,
#createFirstColumnBlock,
#createFirstRowBlock,
#createBlockprotected kotlin.Array[] assembleDistMatrix(AlignmentBlock block, int dim, int r, int c, int lc)
Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix blocks. In fact, it also stores pointers to the owner block for each column retrieved. These pointers are later used during the trace back procedure that builds an optimal alignment from the information computed in the block table. This method is general enough to suit both global and local alignment versions of the algorithm.
block - the block for which the DIST matrix is neededdim - the dimension of the DIST matrixr - the row index of this block in the block tablec - the column index of this block in the block tablelc - the number of columns of the alignment blockprotected kotlin.Array[] assembleInputBorder(int dim,
int r,
int c,
int lr)
Assembles the input border of a block by retrieving the values at the output borders of the left and top blocks. This method is general enough to suit both global and local alignment versions of the algorithm. Note that it can be used to assemble the input border of any block but the root one (it will cause an ArrayIndexOutOfBoundsException.
dim - dimension of the input borderr - row index of the block in the block tablec - column index of the block in the block tablelr - number of row of the blockprotected void traverseBlock(AlignmentBlock block, int source, java.lang.StringBuffer gapped_seq1, java.lang.StringBuffer tag_line, java.lang.StringBuffer gapped_seq2)
Traverses a block to retrieve a part of an optimal alignment from the specified source in the output border to an entry in the input border.
block - the block to be traversedsource - the source of the path in the output bordergapped_seq1 - the StringBuffer to where the gapped sequence 1 is written totag_line - the StringBuffer to where the tag_line is written togapped_seq2 - the StringBuffer to where the gapped sequence 2 is written toIncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.protected AlignmentBlock getLeftPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the left prefix of a block from the block table.
block - the blockprotected AlignmentBlock getDiagonalPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the diagonal prefix of a block from the block table.
block - the blockprotected AlignmentBlock getTopPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the top prefix of a block from the block table.
block - the blockprotected AlignmentBlock createRootBlock(Factor factor1, Factor factor2)
Computes the root block of the block table. See subclasses for actual implementation.
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedIncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.protected AlignmentBlock createFirstRowBlock(Factor factor1, Factor factor2, int col)
Computes a block at the first row (row zero) of the block table, which corresponds to an alignment block between one factor of sequence 2 and an empty string. See subclasses for actual implementation.
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedcol - the column index of block in the block tableIncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.protected AlignmentBlock createFirstColumnBlock(Factor factor1, Factor factor2, int row)
Computes a block at the first column (column zero) of the block table, which corresponds to an alignment block between one factor of sequence 1 and an empty string. See subclasses for actual implementation.
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedrow - the row index of block in the block tableIncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.protected AlignmentBlock createBlock(Factor factor1, Factor factor2, int row, int col)
Computes a block of the block table, which corresponds to an alignment block between one factor of sequence 1 and one factor of sequence 2. See subclasses for actual implementation.
factor1 - the factor of the first sequence being alignedfactor2 - the factor of the second sequence being alignedrow - the row index of block in the block tablecol - the column index of block in the block tableIncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.protected PairwiseAlignment buildOptimalAlignment()
Retrieves an optimal alignment between the loaded sequences. See subclasses for actual implementation.
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.protected int locateScore()
Locates the score of the highest scoring alignment between the two sequences in the block table after is thas been computed. See subclasses for actual implementation.
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.