public class EstimateLibraryComplexity extends AbstractOpticalDuplicateFinderCommandLineProgram
Attempts to estimate library complexity from sequence alone. Does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
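The grouping and comparison described above can be sketched as follows. This is a minimal illustration and not the Picard implementation: the class `PrefixDuplicateSketch` and its methods are hypothetical, reads are modelled as plain strings, and the mismatch rate is assumed to be the number of mismatched positions divided by the compared length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group reads by their first N bases, then compare within a group.
final class PrefixDuplicateSketch {

    /** Buckets reads by their first minIdenticalBases bases; shorter reads are skipped. */
    static Map<String, List<String>> groupByPrefix(List<String> reads, int minIdenticalBases) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String read : reads) {
            if (read.length() < minIdenticalBases) {
                continue; // too short to build a grouping key
            }
            String key = read.substring(0, minIdenticalBases);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(read);
        }
        return groups;
    }

    /**
     * Ungapped, position-by-position comparison. Returns true when the
     * mismatch rate over the shared length is at most maxDiffRate.
     * (Assumption: the rate is mismatches divided by compared length.)
     */
    static boolean isDuplicate(String a, String b, double maxDiffRate) {
        int length = Math.min(a.length(), b.length());
        if (length == 0) {
            return false;
        }
        int mismatches = 0;
        for (int i = 0; i < length; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return mismatches / (double) length <= maxDiffRate;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("ACGTACGTAC", "ACGTACGTAT", "TTGTACGTAC");
        Map<String, List<String>> groups = groupByPrefix(reads, 5);
        System.out.println("groups: " + groups.keySet());
        System.out.println("duplicate at rate 0.2: "
                + isDuplicate(reads.get(0), reads.get(1), 0.2));
    }
}
```

Only reads that share a prefix are compared with each other, which is why the prefix length trades accuracy against the cost of the pairwise comparisons.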
Reads of poor quality are filtered out so as to provide a more accurate estimate. The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than MIN_MEAN_QUALITY across either the first or second read.
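A rough sketch of this filter is shown below. It is illustrative only: the class `QualityFilterSketch` and its methods are hypothetical, a no-call is assumed to be the base `N`, and the mean quality is assumed to be taken over all bases of a read.

```java
// Hypothetical sketch of the read-pair quality filter described above.
final class QualityFilterSketch {

    /** Returns true if a single read survives both checks. */
    static boolean readPasses(String bases, int[] quals, int minIdenticalBases, int minMeanQuality) {
        // Check 1: no no-calls (assumed to be 'N') within the first N bases.
        int prefix = Math.min(minIdenticalBases, bases.length());
        for (int i = 0; i < prefix; i++) {
            if (bases.charAt(i) == 'N') {
                return false;
            }
        }
        // Check 2: mean base quality must not fall below the threshold.
        if (quals.length == 0) {
            return false;
        }
        long sum = 0;
        for (int q : quals) {
            sum += q;
        }
        return sum / (double) quals.length >= minMeanQuality;
    }

    /** A pair is kept only when both the first and the second read pass. */
    static boolean pairPasses(String bases1, int[] quals1, String bases2, int[] quals2,
                              int minIdenticalBases, int minMeanQuality) {
        return readPasses(bases1, quals1, minIdenticalBases, minMeanQuality)
                && readPasses(bases2, quals2, minIdenticalBases, minMeanQuality);
    }

    public static void main(String[] args) {
        int[] good = {30, 30, 30, 30, 30, 30, 30, 30};
        int[] poor = {10, 10, 10, 10, 10, 10, 10, 10};
        System.out.println(pairPasses("ACGTACGT", good, "TTGCATGC", good, 5, 20));
        System.out.println(pairPasses("ACGTACGT", good, "TTGCATGC", poor, 5, 20));
    }
}
```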
The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the calculation of library size. Also, since there is no alignment to screen out technical reads, one further filter is applied to the data. After all reads have been examined, a Histogram is built of [#reads in duplicate set -> #of duplicate sets]; all bins that contain exactly one duplicate set are then removed from the Histogram as outliers before library size is estimated.
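The outlier filter on the histogram can be pictured with a small sketch. It uses a plain `SortedMap` in place of the Histogram class, and the class `DuplicateSetHistogramSketch` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: histogram of duplicate-set size -> number of sets of that size.
final class DuplicateSetHistogramSketch {

    /** Counts how many duplicate sets exist for each set size. */
    static SortedMap<Integer, Integer> buildHistogram(List<Integer> duplicateSetSizes) {
        SortedMap<Integer, Integer> histogram = new TreeMap<>();
        for (int size : duplicateSetSizes) {
            histogram.merge(size, 1, Integer::sum);
        }
        return histogram;
    }

    /** Returns a copy without the bins that contain exactly one duplicate set. */
    static SortedMap<Integer, Integer> dropSingleSetBins(SortedMap<Integer, Integer> histogram) {
        SortedMap<Integer, Integer> filtered = new TreeMap<>();
        for (Map.Entry<Integer, Integer> bin : histogram.entrySet()) {
            if (bin.getValue() != 1) {
                filtered.put(bin.getKey(), bin.getValue());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        // Three sets of size 1, two sets of size 2, and a lone set of size 7.
        List<Integer> sizes = Arrays.asList(1, 1, 1, 2, 2, 7);
        SortedMap<Integer, Integer> histogram = buildHistogram(sizes);
        System.out.println("before: " + histogram);
        System.out.println("after:  " + dropSingleSetBins(histogram));
    }
}
```

In this toy input the lone set of seven reads is the kind of bin that gets dropped, since a single unusually large set is more likely to be a technical artifact than a feature of the library.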
| Modifier and Type | Field and Description |
|---|---|
| java.lang.String | BARCODE_TAG |
| java.util.List<java.io.File> | INPUT |
| double | MAX_DIFF_RATE |
| int | MAX_GROUP_RATIO |
| int | MIN_IDENTICAL_BASES |
| int | MIN_MEAN_QUALITY |
| java.io.File | OUTPUT |
| java.lang.String | READ_ONE_BARCODE_TAG |
| java.lang.String | READ_TWO_BARCODE_TAG |
Fields inherited from class AbstractOpticalDuplicateFinderCommandLineProgram: LOG, OPTICAL_DUPLICATE_PIXEL_DISTANCE, opticalDuplicateFinder, READ_NAME_REGEX

Fields inherited from class CommandLineProgram: COMPRESSION_LEVEL, CREATE_INDEX, CREATE_MD5_FILE, GA4GH_CLIENT_SECRETS, MAX_RECORDS_IN_RAM, QUIET, REFERENCE_SEQUENCE, TMP_DIR, VALIDATION_STRINGENCY, VERBOSITY

| Constructor and Description |
|---|
| EstimateLibraryComplexity() |
| Modifier and Type | Method and Description |
|---|---|
| protected int | doWork() Method that does most of the work. |
| int | getBarcodeValue(htsjdk.samtools.SAMRecord record) |
| static int | getReadBarcodeValue(htsjdk.samtools.SAMRecord record, java.lang.String tag) |
| static void | main(java.lang.String[] args) Stock main method. |
Methods inherited from class AbstractOpticalDuplicateFinderCommandLineProgram: customCommandLineValidation, setupOpticalDuplicateFinder

Methods inherited from class CommandLineProgram: getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getNestedOptions, getNestedOptionsForHelp, getStandardUsagePreamble, getVersion, instanceMain, instanceMainWithExit, parseArgs, setDefaultHeaders

@Option(shortName="I", doc="One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.") public java.util.List<java.io.File> INPUT
@Option(shortName="O", doc="Output file to write per-library metrics to.") public java.io.File OUTPUT
@Option(doc="The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.") public int MIN_IDENTICAL_BASES
@Option(doc="The maximum rate of differences between two reads to call them identical.") public double MAX_DIFF_RATE
@Option(doc="The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.") public int MIN_MEAN_QUALITY
@Option(doc="Do not process self-similar groups that are this many times over the mean expected group size. I.e. if the input contains 10m read pairs and MIN_IDENTICAL_BASES is set to 5, then the mean expected group size would be approximately 10 reads.") public int MAX_GROUP_RATIO
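The arithmetic behind the "approximately 10 reads" figure can be reproduced with a short sketch. It rests on an assumption that is not stated in the option text: the grouping key is taken to be the first MIN_IDENTICAL_BASES bases of both reads in a pair, giving 4^(2N) possible keys, since that is what makes 10 million read pairs with N = 5 come out near 10. The class `GroupSizeSketch`, its methods, and the ratio of 500 used in `main` are hypothetical.

```java
// Hypothetical sketch of the expected-group-size arithmetic for MAX_GROUP_RATIO.
final class GroupSizeSketch {

    /**
     * Mean expected group size, assuming the key spans the first N bases of
     * BOTH reads of a pair, so there are 4^(2N) equally likely keys.
     */
    static double meanExpectedGroupSize(long readPairs, int minIdenticalBases) {
        return readPairs / Math.pow(4, 2.0 * minIdenticalBases);
    }

    /** True when a group is more than maxGroupRatio times the mean expected size. */
    static boolean exceedsLimit(long groupSize, long readPairs, int minIdenticalBases, int maxGroupRatio) {
        double limit = maxGroupRatio * meanExpectedGroupSize(readPairs, minIdenticalBases);
        return groupSize > limit;
    }

    public static void main(String[] args) {
        long readPairs = 10_000_000L;
        // 10,000,000 / 4^10 = 10,000,000 / 1,048,576, roughly 9.5
        System.out.println(meanExpectedGroupSize(readPairs, 5));
        // With an illustrative ratio of 500 the cut-off is a few thousand reads.
        System.out.println(exceedsLimit(5000, readPairs, 5, 500));
    }
}
```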
@Option(doc="Barcode SAM tag (ex. BC for 10X Genomics)", optional=true) public java.lang.String BARCODE_TAG
@Option(doc="Read one barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public java.lang.String READ_ONE_BARCODE_TAG
public int getBarcodeValue(htsjdk.samtools.SAMRecord record)
public static int getReadBarcodeValue(htsjdk.samtools.SAMRecord record,
java.lang.String tag)
public static void main(java.lang.String[] args)
protected int doWork()
Specified by: doWork in class CommandLineProgram