public class BaseEncoder
Utility class for encoding tags into longs.
Sequencing reads are split into 32-bp chunks, each recorded in a 64-bit long. Only A (00), C (01), G (10), and T (11) are encoded. Any other character sets the entire long to -1. Missing data at the end is padded with poly-A (0). This padded end is tracked by the tag length attribute.
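As a rough illustration of the encoding described above, the following sketch packs up to 32 bases into a long. It is not the BaseEncoder source: the class and method names are hypothetical, it assumes the first base occupies the highest-order bits (so the poly-A padding falls at the low-order end), and returning -1 for over-long input is also an assumption.

```java
// Hypothetical sketch of the 2-bit encoding; not the BaseEncoder source.
final class TwoBitEncodeSketch {
    static final int CHUNK_SIZE = 32; // bases per 64-bit long

    // Packs at most 32 bases into a long. Returns -1 for any character
    // other than A, C, G, T, and (by assumption) for over-long input.
    static long encode(String seq) {
        if (seq.length() > CHUNK_SIZE) {
            return -1L;
        }
        long v = 0L;
        for (int i = 0; i < seq.length(); i++) {
            int code = "ACGT".indexOf(seq.charAt(i)); // A=0, C=1, G=2, T=3
            if (code < 0) {
                return -1L;
            }
            v = (v << 2) | code;
        }
        // Shift left so the unused low-order pairs read as A (00).
        return v << (2 * (CHUNK_SIZE - seq.length()));
    }
}
```

Under these assumptions, encode("ACGT") gives 0x1B00000000000000L (bits 00 01 10 11 followed by 28 padded As), and encode("ACGN") gives -1.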
Some of these methods should be transitioned to NucleotideAlignmentConstants; however, BaseEncoder only supports four states, while NucleotideAlignment includes gaps, insertions, and missing.
See also: NucleotideAlignmentConstants
public static int chunkSize
Defines the number of bases that fit in a long.
public static int chunkSizeForInt
public static kotlin.Array[] bases
Defines the base order.
public static long getLongFromSeq(java.lang.String seq)
Returns a long for a sequence in a String
seq - a String containing a DNA sequence.
public static kotlin.Array[] getLongArrayFromSeq(java.lang.String seq)
seq - A String containing a DNA sequence.
public static kotlin.Array[] getLongArrayFromSeq(java.lang.String seq,
int paddedLength)
seq - A String containing a DNA sequence.
public static int getIntFromSeq(java.lang.String seq)
Returns an int for a sequence in a String. NOTE: this version leaves the padding at the FRONT of the sequence. This is to facilitate SPARK machine learning, where it is preferable to have a smaller int when creating the sequence; padding at the end gives a larger value. Currently this is only used for MonetDB encoding. The ints can be converted back to sequence by the existing getSequenceFromInt() method. The user needs to know where padding was added to analyze the sequence correctly.
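To make the effect of padding position concrete, here is a small sketch comparing front padding with end padding for a 16-bp int. The class and method names are hypothetical, and throwing on invalid input is a choice made for this sketch only, not documented behavior of getIntFromSeq.

```java
// Hypothetical sketch contrasting front and end padding in a 2-bit int.
final class IntPaddingSketch {
    static final int CHUNK_SIZE_FOR_INT = 16; // bases per 32-bit int

    // Packs the bases into the low-order bits, so any padding (A = 00)
    // sits in front of the sequence.
    static int frontPadded(String seq) {
        if (seq.length() > CHUNK_SIZE_FOR_INT) {
            throw new IllegalArgumentException("more than 16 bases");
        }
        int v = 0;
        for (int i = 0; i < seq.length(); i++) {
            int code = "ACGT".indexOf(seq.charAt(i)); // A=0, C=1, G=2, T=3
            if (code < 0) {
                throw new IllegalArgumentException("not A, C, G or T");
            }
            v = (v << 2) | code;
        }
        return v;
    }

    // Shifts the packed bases to the high-order end, so the padding
    // follows the sequence instead.
    static int endPadded(String seq) {
        return frontPadded(seq) << (2 * (CHUNK_SIZE_FOR_INT - seq.length()));
    }
}
```

For "CG", frontPadded returns 6 while endPadded returns 0x60000000 (1610612736), which shows why front padding keeps the value small.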
seq - a String containing a DNA sequence.
public static long getReverseComplement(long seq,
byte len)
Returns the reverse complement of a sequence already encoded in a 2-bit long.
Note: poly-A is used to represent unknown, but the reverse complement changes it to poly-T, which does not mean the same thing. Sometimes it is best to reverse complement by text, using the String version of getReverseComplement below.
seq - 2-bit encoded sequence
len - length of the sequence
public static long getReverseComplement(long seq)
Returns the reverse complement of a sequence already encoded in a 2-bit long. The entire long (32 bp) is reverse complemented.
Note: poly-A is used to represent unknown, but the reverse complement changes it to poly-T, which does not mean the same thing. Sometimes it is best to reverse complement by text, using the String version of getReverseComplement below.
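The poly-A to poly-T effect noted above can be seen in a minimal sketch of a whole-long reverse complement. The class and method names are hypothetical and this is not the BaseEncoder source; it assumes the first base sits in the highest-order bits.

```java
// Hypothetical sketch of reverse complementing all 32 bases of a 2-bit long.
final class ReverseComplementLongSketch {
    static long reverseComplement(long seq) {
        // Inverting every bit complements each base:
        // A (00) <-> T (11) and C (01) <-> G (10).
        long comp = ~seq;
        long rev = 0L;
        // Move the 2-bit pairs from the low end of comp to the high end of rev.
        for (int i = 0; i < 32; i++) {
            rev = (rev << 2) | (comp & 3L);
            comp >>>= 2;
        }
        return rev;
    }
}
```

With ACGT followed by 28 padded As (0x1B00000000000000L) as input, the result is 0xFFFFFFFFFFFFFF1BL: 28 Ts followed by ACGT. The padding has become leading poly-T, which no longer reads as missing data.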
seq - 2-bit encoded sequence
public static kotlin.Array[] getReverseComplement(kotlin.Array[] seq)
Returns the reverse complement of an array of sequences already encoded in 2-bit longs.
Note: poly-A is used to represent unknown, but the reverse complement changes it to poly-T, which does not mean the same thing. Sometimes it is best to reverse complement by text, using the String version of getReverseComplement below.
seq - array of 2-bit encoded sequences
public static java.lang.String getReverseComplement(java.lang.String seq)
Returns a string-based reverse complement. This gets around issues with the poly-A tailing in the 2-bit encoding approach.
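A text-based reverse complement only touches the bases actually present, so no padding is involved. The sketch below uses hypothetical names, and mapping unrecognized characters to 'N' is an assumption of this example rather than documented behavior.

```java
// Hypothetical sketch of a string-based reverse complement.
final class ReverseComplementTextSketch {
    // Complement of a single base; anything other than A, C, G, T
    // becomes 'N' (an assumption made for this sketch).
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    // Walks the sequence from the last base to the first,
    // appending the complement of each base.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

Here reverseComplement("AACG") returns "CGTT", with no poly-T introduced by padding.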
seq - DNA sequence
public static char getComplementBase(char base)
Returns the complement of a single base.
base - a DNA base
public static kotlin.Array[] getByteSeqFromLong(long val)
Returns the byte representation used by TASSEL for the 2-bit encoded long.
e.g. A > 2-bit encode 00 > byte (0)
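The mapping in the example above (each 2-bit pair becomes one byte with value 0 to 3) might be unpacked along these lines. The names are hypothetical, and reading the first base from the highest-order bits is an assumption.

```java
// Hypothetical sketch: unpack a 2-bit encoded long into 32 bytes (values 0-3).
final class LongToBytesSketch {
    static byte[] toBytes(long val) {
        byte[] out = new byte[32];
        for (int i = 0; i < 32; i++) {
            // Base i occupies bits (63 - 2i) and (62 - 2i).
            out[i] = (byte) ((val >>> (62 - 2 * i)) & 3L);
        }
        return out;
    }
}
```

For 0x1B00000000000000L (ACGT plus padding) the first four bytes are 0, 1, 2, 3 and the remaining 28 are 0.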
val - 2-bit encoded DNA sequence
See also: NucleotideAlignmentConstants
public static kotlin.Array[] getByteSeqFromLong(kotlin.Array[] valA)
Returns the byte representation used by TASSEL for the 2-bit encoded long.
e.g. A > 2-bit encode 00 > byte (0)
valA - array of 2-bit encoded DNA sequences
See also: NucleotideAlignmentConstants
public static long getLongSeqFromByteArray(kotlin.Array[] b)
Returns the 2-bit encoded long represented by up to 32 bytes in the NucleotideAlignmentConstants representation. The result is padded with As if the array is shorter than 32 bytes; -1 is returned if it is longer than 32. The byte array values must be 0-3; if the array contains a value outside that range, -1 is returned.
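The rules stated above (pad with As when short, -1 when too long or out of range) can be sketched as follows. The names are hypothetical and the high-bits-first layout is an assumption; this is not the BaseEncoder source.

```java
// Hypothetical sketch: pack up to 32 bytes (values 0-3) into a 2-bit long.
final class BytesToLongSketch {
    static long fromBytes(byte[] b) {
        if (b.length > 32) {
            return -1L; // too long for a single long
        }
        long v = 0L;
        for (byte code : b) {
            if (code < 0 || code > 3) {
                return -1L; // value outside the 0-3 range
            }
            v = (v << 2) | code;
        }
        // Left-align so the missing positions read as A (00).
        return v << (2 * (32 - b.length));
    }
}
```

For example, fromBytes(new byte[]{0, 1, 2, 3}) gives 0x1B00000000000000L, while an array containing 4, or one with 33 elements, gives -1.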
b - array of bytes in the NucleotideAlignmentConstants encoding
See also: NucleotideAlignmentConstants
public static java.lang.String getSequenceFromLong(long val,
byte len)
Returns a string representation of the 2-bit encoded long.
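Decoding with an explicit length keeps the poly-A padding out of the result. The following is a sketch under the assumption that the first base is stored in the highest-order bits; the names are hypothetical and the helper takes an int length for convenience.

```java
// Hypothetical sketch: decode the first len bases of a 2-bit encoded long.
final class DecodeLongSketch {
    static String decode(long val, int len) {
        if (len < 0 || len > 32) {
            throw new IllegalArgumentException("len must be between 0 and 32");
        }
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            int code = (int) ((val >>> (62 - 2 * i)) & 3L);
            sb.append("ACGT".charAt(code)); // 0=A, 1=C, 2=G, 3=T
        }
        return sb.toString();
    }
}
```

With val = 0x1B00000000000000L, decode(val, 4) returns "ACGT", whereas decode(val, 6) returns "ACGTAA" because two padded positions are included.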
val - 2-bit encoded sequence
len - length of the sequence
public static java.lang.String getSequenceFromLong(kotlin.Array[] val)
Returns a string representation of an array of 2-bit encoded longs.
val - array of 2-bit encoded sequences
public static kotlin.Array[] getIntFromLong(long val)
Splits a 2-bit encoded long into 2 integers.
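Splitting a 32-bp long into two 16-bp ints amounts to separating its upper and lower 32 bits. In this sketch the names are hypothetical, and placing the upper half (the first 16 bases) at index 0 is an assumption about the ordering.

```java
// Hypothetical sketch: split a 2-bit long into two ints and join them again.
final class SplitLongSketch {
    // Index 0 holds the upper 32 bits, index 1 the lower 32 bits.
    static int[] split(long val) {
        return new int[] { (int) (val >>> 32), (int) val };
    }

    // Masks the low half so a negative int does not sign-extend
    // into the upper 32 bits.
    static long join(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }
}
```

For 0x1B000000FFFFFFFFL, split returns {0x1B000000, -1}, and join restores the original long.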
val - 2-bit encoded long sequence
public static java.lang.String getSequenceFromInt(int val)
Returns a string representation of the 2-bit encoded int (16 bp).
val - 2-bit encoded sequence
public static int getFirstLowQualityPos(java.lang.String quality,
int minQual)
Returns the position of the first low-quality base, based on a FASTQ quality string.
quality - FASTQ quality string
minQual - minimum quality threshold
public static int getFirstLowQualityPos(java.lang.String quality,
int minQual,
int qualBase)
Returns the position of the first low-quality base, based on a FASTQ quality string.
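A quality scan of this kind could look like the sketch below. The names are hypothetical. Two things are assumptions rather than documented behavior: that qualBase is the ASCII offset subtracted from each quality character (33 in the common Phred+33 FASTQ encoding), and that the string length is returned when no base falls below the threshold.

```java
// Hypothetical sketch: find the first base whose quality is below minQual.
final class LowQualitySketch {
    static int firstLowQualityPos(String quality, int minQual, int qualBase) {
        for (int i = 0; i < quality.length(); i++) {
            // Quality score = ASCII code of the character minus the offset.
            int score = quality.charAt(i) - qualBase;
            if (score < minQual) {
                return i;
            }
        }
        // Assumption: no low-quality base found, so report the full length.
        return quality.length();
    }
}
```

With the quality string "IIII#II", minQual 20, and qualBase 33, the scores are 40 for 'I' and 2 for '#', so the result is 4.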
quality - FASTQ quality string
minQual - minimum quality threshold
public static java.lang.String getSequenceFromLong(long val)
Returns a string representation of the 2-bit encoded long.
val - 2-bit encoded sequence
public static byte seqDifferences(long seq1,
long seq2,
int maxDivergence)
Returns the number of bp differences between two 2-bit encoded longs. The maximum divergence is used to save time when only very similar sequences are of interest.
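One way to count base differences with an early exit is to XOR the two longs and inspect each 2-bit pair. This is a sketch with hypothetical names; in particular, what is returned once the threshold is exceeded (here, the count reached so far) is an assumption and may differ from seqDifferences.

```java
// Hypothetical sketch: count differing bases between two 2-bit encoded longs.
final class SeqDifferencesSketch {
    static int differences(long seq1, long seq2, int maxDivergence) {
        long diff = seq1 ^ seq2; // a non-zero pair marks a mismatching base
        int count = 0;
        // Stop scanning as soon as the count exceeds maxDivergence.
        for (int i = 0; i < 32 && count <= maxDivergence; i++) {
            if ((diff & 3L) != 0) {
                count++;
            }
            diff >>>= 2;
        }
        return count;
    }
}
```

Comparing 0x1B00000000000000L (ACGT...) with 0x1800000000000000L (ACGA...) gives 1. Comparing 0L with -1L gives 32 when maxDivergence is 32, but the scan stops at 3 when maxDivergence is 2.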
seq1 - 2-bit encoded sequence
seq2 - 2-bit encoded sequence
maxDivergence - threshold up to which divergence is counted
public static byte seqDifferences(long seq1,
long seq2)
Returns the number of bp differences between two 2-bit encoded longs.
seq1 - 2-bit encoded sequence
seq2 - 2-bit encoded sequence
public static byte seqDifferencesForSubset(long seq1,
long seq2,
int lengthOfComp,
int maxDivergence)
Returns the number of sequence differences between two 2-bit encoded longs. The maximum divergence is used to save time when only very similar sequences are of interest.
seq1 - 2-bit encoded sequence
seq2 - 2-bit encoded sequence
lengthOfComp - number of sites to compare
maxDivergence - threshold up to which divergence is counted
public static java.lang.String removePolyAFromEnd(java.lang.String s)
Trims the poly-A off the end of the sequence string.
s - input sequence
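Trimming trailing As can be sketched in a few lines. The names are hypothetical, and returning an empty string for an all-A input is an assumption of this example, not documented behavior of removePolyAFromEnd.

```java
// Hypothetical sketch: strip trailing 'A' characters from a sequence string.
final class TrimPolyASketch {
    static String trimPolyA(String s) {
        int end = s.length();
        // Step back over every trailing 'A'.
        while (end > 0 && s.charAt(end - 1) == 'A') {
            end--;
        }
        return s.substring(0, end);
    }
}
```

Here trimPolyA("ACGTAAAA") returns "ACGT". Genuine trailing A bases are removed along with the padding, which is presumably why the tag length is tracked separately.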