public class MarkerDNARun_IFLFilePlugin
extends AbstractPlugin
BEWARE: Whenever you "pull" to update the GOBII projects, there could be changes that affect these plugins. Check whether the .nmap or .dupmap files have changed, and make corresponding changes here as required.

This class takes one or more hmp.txt or vcf files, together with a mapping file, and creates the intermediate files for the marker, marker_linkage_group, dataset_marker, dnarun, and dataset_dnarun tables.

The inputFile variable can be a file or a directory. If it is a directory, the code looks for all files with format *(hmp.txt,hmp.txt.gz,vcf,vcf.gz) and processes them. All files are assumed to use the same taxa. Because of that assumption, the dnarun and dataset_dnarun files are created from the first *.hmp.txt file processed. These are the intermediate files that map this dnarun to a dataset. They contain one entry for each taxon, holding the taxa name (in the name field), libraryPrepID (as the code field), and ids into the experiment and dnasample tables.

The "mapping file" needs to contain columns for the following data:
taxaname: as it appears in the vcf/hmp file
name: taxa name it maps to (do I need this?)
MGID: MGID for this taxa name
GID: GID for this dnarun
libraryID: same as in dnasample file
project_name: db will be queried to get project_id from project name. Needed by IFL to get dnasample_id
experiment_name: name of experiment needed for dnarun table (IFL maps to ID)
platform_name: name of platform needed for marker table (IFL maps to ID)
reference_name: name of reference table (IFL maps to ID)
dataset_name: needed for dataset_dnarun and dataset_marker tables (IFL maps to ID)
samplename: will be used for the dnasample.name field

The mapping file needs an entry for every taxon that may appear in the data input file. It is ok if multiple taxa names appear with the same MGID/GID/etc. These are synonyms. We mostly aren't storing the names, just the MGID. It must be identified in the mapping file.
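The mapping file layout described above can be checked before any intermediate files are written. The following is a minimal sketch, not the plugin's actual code: the class MappingFileSketch and its methods are hypothetical, the column names are copied from the description above, and matching header names case-insensitively is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of TASSEL or GOBII.
class MappingFileSketch {
    // Column names as listed in the class description.
    static final List<String> REQUIRED = Arrays.asList(
            "taxaname", "name", "MGID", "GID", "libraryID", "project_name",
            "experiment_name", "platform_name", "reference_name",
            "dataset_name", "samplename");

    // Returns the required columns that are absent from a tab-delimited header line.
    // Assumption: header names are compared case-insensitively.
    static List<String> missingColumns(String headerLine) {
        List<String> present = new ArrayList<>();
        for (String col : headerLine.split("\t", -1)) {
            present.add(col.trim().toLowerCase(Locale.ROOT));
        }
        List<String> missing = new ArrayList<>();
        for (String req : REQUIRED) {
            if (!present.contains(req.toLowerCase(Locale.ROOT))) {
                missing.add(req);
            }
        }
        return missing;
    }

    // Parses header plus data lines into taxaname -> (lower-cased column name -> value).
    // Synonyms (different taxa names sharing one MGID) are simply separate keys.
    static Map<String, Map<String, String>> parse(List<String> lines) {
        String[] header = lines.get(0).split("\t", -1);
        int taxaIdx = -1;
        for (int i = 0; i < header.length; i++) {
            header[i] = header[i].trim().toLowerCase(Locale.ROOT);
            if (header[i].equals("taxaname")) {
                taxaIdx = i;
            }
        }
        if (taxaIdx < 0) {
            throw new IllegalArgumentException("mapping file has no taxaname column");
        }
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (int r = 1; r < lines.size(); r++) {
            String line = lines.get(r);
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t", -1);
            if (fields.length <= taxaIdx) {
                throw new IllegalArgumentException("line " + (r + 1) + " has no taxaname value");
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length && i < fields.length; i++) {
                row.put(header[i], fields[i].trim());
            }
            rows.put(fields[taxaIdx].trim(), row);
        }
        return rows;
    }
}
```

Calling missingColumns on the header line first gives a clear error message for an incomplete mapping file, and looking up a taxon from the hmp/vcf header in the parsed map then shows whether that taxon has an entry at all.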
The dataset id must be obtained from the database: take the dataset name from the mapping file and query the database to get the dataset_id. GOBII creates the data_table and data_file names from the GUI when it creates the ID. It always names them DS_.h5 and DS_ for the table. We must do this by hand as we want to maintain consistency.

The marker_linkage_group: Their mapping now requires both marker_name and platform_id, so platform_name must also be a parameter. The software queries the db to get the platform_id from the platform name. Both the marker and the marker_linkage_group intermediate files need the platform_name. This could be moved to the mapping file, but currently is an input parameter.

VCF file headers have these fields:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT taxa1 taxa2 ...
HMP.txt file headers have these fields:
rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode taxa1 ...

Class Gobii_IFLUtils is used to find the chrom, pos, alt and strand values based on the file type (hmp or vcf). The type of file is determined by the file suffix (hmp.txt, hmp.txt.gz, vcf or vcf.gz).

July 6, 2016: In addition, this method will create a file to be used with PostProcessMarkerPlugin(). This file will contain the marker name, platformid and alts array. It may be used at a future date to find existing markers in the DB and update the alts array. See PostProcessMarkerPlugin() for details. Currently any allele in A/C/G/T that is not the reference will appear on the alt list. This is per Ed, who says that given a large enough population, each allele will appear as an alternate.

August 3: Because we continue to change the data that makes up the sample name (was GID:plate:well, now is extraction_id), I have added a column called "SampleName" to the mapping file. The software will take whatever is stored here and use it as the dnasample name. Biologists can then change it at will without needing to change the software.

Problems with GOBII loaders: The GOBII dnarun.nmap file now takes "num" instead of "platename". Either one can be a problem for BL as they are not required fields, and we often don't have values for these columns. The IFL script preprocessor_ifile.py does not check for IS NULL. It merely checks whether the input file and existing db column match. You can't compare blank to null in postgres. Because of this, I changed our copy of the dnarun.nmap file and removed "num" as a mapping criterion. I have left it in this code to alert me if I do a "pull" on GOBII and move over new scripts. Uploading DS_X.dnarun should result in no entries, which will hopefully remind me to change the mapping script again.
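Two rules from the class description above are small enough to sketch directly: choosing the file type from the file suffix, and building the alt list as every allele in A/C/G/T other than the reference. This is an illustration under stated assumptions, not the code in Gobii_IFLUtils: the class IflRulesSketch, its enum and its method names are hypothetical, and suffix matching is assumed to be case-sensitive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the real logic lives in Gobii_IFLUtils and may differ.
class IflRulesSketch {
    enum FileType { HMP, VCF, UNKNOWN }

    // Determines the file type from the suffix, as the description states.
    // Assumption: the comparison is case-sensitive.
    static FileType fileType(String fileName) {
        if (fileName.endsWith(".hmp.txt") || fileName.endsWith(".hmp.txt.gz")) {
            return FileType.HMP;
        }
        if (fileName.endsWith(".vcf") || fileName.endsWith(".vcf.gz")) {
            return FileType.VCF;
        }
        return FileType.UNKNOWN;
    }

    // Returns every allele in A/C/G/T that is not the reference allele.
    // A reference outside A/C/G/T (for example N) yields all four alleles.
    static List<Character> altAlleles(char ref) {
        char upperRef = Character.toUpperCase(ref);
        List<Character> alts = new ArrayList<>();
        for (char allele : new char[] {'A', 'C', 'G', 'T'}) {
            if (allele != upperRef) {
                alts.add(allele);
            }
        }
        return alts;
    }
}
```

With these helpers, a directory scan would keep only files for which fileType returns HMP or VCF, and each marker row would carry the three alleles that remain once its reference allele is excluded.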
public MarkerDNARun_IFLFilePlugin(java.awt.Frame parentFrame,
boolean isInteractive)
public MarkerDNARun_IFLFilePlugin()
protected void preProcessParameters(DataSet input)
protected void postProcessParameters()
public javax.swing.ImageIcon getIcon()
public java.lang.String getButtonName()
public java.lang.String getToolTipText()
public java.util.HashMap<java.lang.String,net.maizegenetics.analysis.gobii.MarkerDNARun_IFLFilePlugin.HmpTaxaData> createTaxaMap(java.sql.Connection conn,
java.lang.String mappingFile)
public static void main(java.lang.String[] args)
public java.lang.String dbConfigFile()
DB connection config file
public MarkerDNARun_IFLFilePlugin dbConfigFile(java.lang.String value)
Set dbConfigFile. DB connection config file
value - dbConfigFile
public java.lang.String inputFile()
hmp.txt file, including the header line, which will be used to create marker-related and dnarun-related intermediary files for GOBII loading
public MarkerDNARun_IFLFilePlugin inputFile(java.lang.String value)
Set inputFile. hmp.txt file, including the header line, which will be used to create marker-related and dnarun-related intermediary files for GOBII loading
value - inputFile
public java.lang.String outputFileDir()
Directory where created files will be written
public MarkerDNARun_IFLFilePlugin outputFileDir(java.lang.String value)
Set outputFileDir. Directory where created files will be written
value - outputFileDir
public java.lang.String refFile()
Species reference file used to determine ref allele at marker position
public MarkerDNARun_IFLFilePlugin refFile(java.lang.String value)
Set Reference File. Species reference file used to determine ref allele at marker position
value - Reference File
public java.lang.String mappingFile()
Tab-delimited file containing columns for taxaname, name, MGID, libraryID, project_id, experiment_name, platform_name, reference_name and dataset_name
public MarkerDNARun_IFLFilePlugin mappingFile(java.lang.String value)
Set mappingFile. Tab-delimited file containing columns for taxaname, name, MGID, libraryID, project_id, experiment_name, platform_name, reference_name and dataset_name
value - mappingFile
public java.lang.String mapsetName()
Integer identifying the mapset_id from the linkage group table to use when mapping to marker_linkage_group.
public MarkerDNARun_IFLFilePlugin mapsetName(java.lang.String value)
Set mapsetId
value - mapsetId
public java.lang.String expName()
Name of experiment to which this data belongs. Must match an experiment name from the db experiment table.
public MarkerDNARun_IFLFilePlugin expName(java.lang.String value)
Set Experiment Name. Name of experiment to which this data belongs. Must match an experiment name from the db experiment table.
value - Experiment Name
public java.lang.String platformName()
The platform on which this data set was run, e.g. GBSv27. Must match a platform name from the platform db table.
public MarkerDNARun_IFLFilePlugin platformName(java.lang.String value)
Set Platform Name. The platform on which this data set was run, e.g. GBSv27. Must match a platform name from the platform db table.
value - Platform Name
public java.lang.String refName()
Name of reference, e.g. agpv2. Must match the name of an entry in the reference table in the db.
public MarkerDNARun_IFLFilePlugin refName(java.lang.String value)
Set Reference Name. Name of reference, e.g. agpv2. Must match the name of an entry in the reference table in the db.
value - Reference Name
public java.lang.String datasetName()
Name of dataset for this data. Must match the name of one of the administered datasets in the db.
public MarkerDNARun_IFLFilePlugin datasetName(java.lang.String value)
Set Dataset Name. Name of dataset for this data. Must match the name of one of the administered datasets in the db.
value - Dataset Name