| Class | Description |
|---|---|
| BMSConnection | |
| GOBIIDbUtils |
These methods were largely borrowed from Jeff's privatemaizegenetics GenomeAnnosDB.java, then modified as needed for GOBII postgres access. Currently this class ONLY connects to postgres, not monetdb. If plugins are created that need to connect to monetdb, a new connectToDB() method for monetdb must be written, or the existing one changed to take the database type as a parameter. For an example of connecting to a monetdb instance, see privatemaizegenetics.lynn.MonetdbFileProcessing.MonetDBQtoPosList.connectToDBOrDie.
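The postgres-only connection above can be sketched as follows. Both function names here are hypothetical illustrations, not the actual GOBIIDbUtils API; the URL format matches the connection strings used elsewhere in these notes, and psycopg2 is an assumed driver choice.

```python
def build_pg_url(host, db, user, password="", port=5432):
    # Build a postgres connection URL of the form used by the gobii_ifl.py
    # scripts, e.g. postgresql://lcj34:@localhost:5432/gobii_maize2
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"

def connect_to_db(host, db, user, password=""):
    # Hypothetical postgres-only stand-in for connectToDB(); a monetdb
    # variant would need a different driver and URL scheme, as noted above.
    import psycopg2  # assumed available; not part of the stdlib
    return psycopg2.connect(host=host, dbname=db, user=user, password=password)
```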
|
| GOBIIPlugin | |
| GOBIIPostgresConnection | |
| GOBII_IFLUtils |
This class contains utility methods for pulling values out of hmp or vcf files that are needed when creating the intermediate files for loading into GOBII postgres and monetdb instances.
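As a rough illustration of the kind of utility this class provides (function names below are hypothetical, not the actual GOBII_IFLUtils methods): file type is decided by suffix, and chrom/pos come from different columns depending on that type.

```python
def is_vcf(filename):
    # File type is determined by the file suffix:
    # hmp.txt / hmp.txt.gz vs vcf / vcf.gz
    return filename.endswith((".vcf", ".vcf.gz"))

def chrom_and_pos(filename, fields):
    # VCF header:  #CHROM POS ID REF ALT ...        -> columns 0 and 1
    # HMP header:  rs# alleles chrom pos strand ... -> columns 2 and 3
    if is_vcf(filename):
        return fields[0], int(fields[1])
    return fields[2], int(fields[3])
```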
|
| GWAS_IFLPlugin |
The tables populated by this plugin are described in TAS-1162. The plugin takes files of GWAS data and adds them to the gwas_data table in a GOBII instance. It is assumed the gwas_method and gwas_experiment tables associated with this data have already been populated. These are all proprietary tables currently only in use by the Buckler Lab. To speed up processing, the experimentId and methodId values are hard-coded in the JUnit that calls this plugin. Those tables are generally small, and if you have to go to them to get the name, you might as well just input the ID and save GOBII IFL processing time. The values stored in the "values" column are stored as "real" in the gwas_data table. This is because there is a single "value" field, which holds values for all data, of any type; the method table provides the specifics on how to interpret each statistic. In addition to the .gz files of GWAS data, a mapping file of phenotype names to IDs is created, with data pulled from the b4R table.
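The single "value" column design above can be sketched like this: each statistic for a marker becomes one row, typed as a float ("real"), with the method id carrying the interpretation. The function name and row shape are hypothetical, not the plugin's actual output format.

```python
def gwas_rows(marker, experiment_id, method_ids, stats):
    # Every statistic lands in the one "value" column as a float ("real");
    # the gwas_method row for each method_id says how to interpret it.
    return [(marker, experiment_id, mid, float(v))
            for mid, v in zip(method_ids, stats)]
```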
|
| HapBreakpoints_IFLFilePlugin |
This is defined in TAS-1098. This plugin takes a haplotype breakpoint file and creates intermediate files for the hap_breakpoint and breakpoint_set tables. These three tables are currently proprietary tables created by the Buckler Lab to be added to the GOBII postgres DB. The tables are created from create_hapBrkptTables.sql on cbsudc01 in directory /workdir/lcj34/postgresFiles/gobii_ifl_filesToLoad/gobii_hapbreakpoints. The tables to be created have these entries:

hap_breakpoint table:
- hap_breakpoint_id int
- taxa (GID) int
- position_range int4range (start/stop stored as an integer range)
- donor1 (GID) int
- donor2 (GID) int
- breakpoint_set_id int (maps to the breakpoint_set table)

breakpoint_set table:
- breakpoint_set_id int
- name text
- method text (method used to create the breakpoint file, e.g. FILLIN)
- mapset_id int

projection_align table:
- projection_align_id int
- name text
- het_resolution text
- breakpoint_set_id int
- dataset_id int (gives us the donor file)

The expectation is that the GOBII IFL scripts will operate successfully on these files. This requires the tables to be created in the DB beforehand, and mapping files to be created and stored on the db server that can be used to process these files for bulk loading. This method takes the set name and uses it to populate the breakpoint table. We need to create the index into each table like GOBII does, with auto increment; they live in pg_catalog.

Restrictions: the breakpoint file format must be as defined by Ed. The first line must contain 2 tab-delimited values, the first indicating the number of donors and the second indicating the number of "blocks" to process. All lines beginning with "#" are considered comment lines. There are 2 mapping files required: one for donors, and one for taxa. In the Ames example, the donor mapping file is the same file used to curate the WGS dataset 5, which is named ZeaWGS_hmp321_raw_AGPv3. The taxa mapping file is the file Cinta provides with each subset, e.g. for Ames in Cornell Box.
If the TaxaColumn in the mapping file contains a libraryID (e.g. name:libraryID) the code removes it, leaving taxa set to just the "name" portion. This is because Kelly's pa.txt.gz files do NOT have the library portion, though some of Cinta's files do.

The breakpoint file has donors at the top. Donors are Feb-48, PHG84, PHG83, etc.; their GIDs come from the WGS mapping file Cinta gave for curating that large vcf set. For example:

1210 2711
#Donor Haplotypes
0 Feb-48
1 PHG84
2 PHG83

Taxa breakpoint blocks are at the bottom. "taxa" is the first column; it gets its GID from the germplasminformation_Ames_20160810.txt file Cinta provided in Cornell Box with the Ames data. Below, 12E and 37 are taxa:

#Blocks are defined chr:startPos:endPos:donor1:donor2 (-1 means no hypothesis)
12E
1:299497909:299752426:958:1070
1:299777120:300064510:79:958
1:300064521:300335541:1050:1202
1:300351838:300818858:162:7
37
2:233507851:234268676:1015:1015
2:234268677:234350614:820:820
2:234350639:234411604:950:950
2:234411637:234464003:191:1

The third table, projection_alignment, will be populated by another plugin. Users will create their own projection alignment analyses for the breakpoint sets of their choice.

Sept 14, 2016: Trying with Cinta's "name2" field, as the taxa are not matching from the taxa column of the taxa mapping file. Donors are working.
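The two parsing rules described above (stripping an optional libraryID from the taxon name, and splitting a block line on ":") can be sketched as below. Function names are hypothetical, not the plugin's actual methods.

```python
def strip_library_id(taxon):
    # Kelly's pa.txt.gz files have no libraryID, some of Cinta's do:
    # "name:libraryID" -> "name"
    return taxon.split(":", 1)[0]

def parse_block(line):
    # Blocks are chr:startPos:endPos:donor1:donor2 (-1 means no hypothesis)
    chrom, start, end, d1, d2 = line.split(":")
    return {"chr": chrom, "start": int(start), "end": int(end),
            "donor1": int(d1), "donor2": int(d2)}
```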
|
| MarkerDNARun_IFLFilePlugin |
BEWARE: whenever you "pull" to update the GOBII projects, there could be changes that affect these plugins. Check whether the .nmap or .dupmap files have changed, and make corresponding changes here as required.

This class takes hmp.txt file(s) or vcf file(s) with a mapping file and creates the intermediate files for the marker, marker_linkage_group, dataset_marker, dnarun, and dataset_dnarun tables. The inputFile variable can be a file or a directory. If it is a directory, the code will look for all files matching *(hmp.txt, hmp.txt.gz, vcf, vcf.gz) and process them. It is assumed all files use the same taxa. Because of that assumption, the dnarun and dataset_dnarun files are created from the first *.hmp.txt file processed. These are the intermediate files that map this dnarun to a dataset; they contain one entry for each taxon, holding the taxa name (in the name field), libraryPrepID (as the code field), and ids into the experiment and dnasample tables.

The "mapping file" needs to contain columns for the following data:
- taxaname: as it appears in the vcf/hmp file
- name: taxa name it maps to (do I need this?)
- MGID: MGID for this taxa name
- GID: GID for this dnarun
- libraryID: same as in the dnasample file
- project_name: the db will be queried to get project_id from project name; needed by IFL to get dnasample_id
- experiment_name: name of experiment needed for the dnarun table (IFL maps to id)
- platform_name: name of platform needed for the marker table (IFL maps to id)
- reference_name: name of reference table (IFL maps to id)
- dataset_name: needed for the dataset_dnarun and dataset_marker tables (IFL maps to id)
- samplename: will be used for the dnasample.name field

The mapping file needs an entry for every taxon that may appear in the data input file; each one must be identified there. It is ok if multiple taxa names appear with the same MGID/GID/etc.; these are synonyms. We mostly aren't storing the names, just the MGID.
The dataset id must be gotten from the database: take the dataset name from the mapping file and query the database for the dataset_id. GOBII creates the data_table and data_file names from the GUI when it creates the ID. It always names them DS_.h5 and DS_ for the table. We must do this by hand, as we want to maintain consistency.

marker_linkage_group: its mapping now requires both marker_name and platform_id, so platform_name must also be a parameter. The software will query the db to get the platform_id from the platform name. Both the marker and the marker_linkage_group intermediate files need the platform_name. This could be moved to the mapping file, but currently it is an input parameter.

VCF file headers have these fields: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT taxa1 taxa2 ...
HMP.txt file headers have these fields: rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode taxa1 ...

Class GOBII_IFLUtils is used to find chrom, pos, alt and strand values based on file type of hmp or vcf. The type of file is determined by the file suffix (hmp.txt, hmp.txt.gz, vcf or vcf.gz).

July 6, 2016: In addition, this method will create a file to be used with PostProcessMarkerPlugin(). This file will contain the marker name, platform id and alts array. It may be used at a future date to find existing markers in the DB and update the alts array. See PostProcessMarkerPlugin() for details. Currently any allele in A/C/G/T that is not the reference will appear on the alt list. This is per Ed, who says that given a large enough population, each allele will appear as an alternate.

August 3: Because we continue to change the data that makes up the sample name (was GID:plate:well, now is extraction_id), I have added a column called "SampleName" to the mapping file. The software will take whatever is stored there and use it as the dnasample name. Biologists can then change it at will without a need to change the software.

Problems with GOBII loaders: the GOBII dnarun.nmap file now takes "num" instead of "platename". Either one can be a problem for BL, as they are not required fields and we often don't have values for these columns. The IFL script preprocessor_ifile.py does not check for IS NULL; it merely checks whether the input file and existing db column match, and you can't compare blank to null in postgres. Because of this, I changed our copy of the dnarun.nmap file and removed "num" as a mapping criterion. I have left it in this code to alert me if I do a "pull" on GOBII and move over new scripts: uploading DS_X.dnarun should then result in no entries, which will hopefully remind me to change the mapping script again.
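Ed's rule for building the alt list (any A/C/G/T allele that is not the reference) can be sketched as a one-liner. The function name is hypothetical, not the plugin's actual method.

```python
def alt_alleles(ref, observed=("A", "C", "G", "T")):
    # Per Ed: given a large enough population, every allele will appear
    # as an alternate, so any A/C/G/T allele that is not the reference
    # goes on the alt list.
    return [a for a in observed if a != ref]
```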
|
| MonetDB_IFLFilePlugin |
This class has methods to create GOBII intermediate files to be used when creating a GOBII monetdb dataset table. The GOBII IFL scripts require 3 files: a matrix of variants, a list of marker_ids, and a list of dnarun_ids. The matrix of variants is created from the hmp.txt file at the same time the intermediate files for the postgres marker and dnarun related tables are created (see MarkerDNARunFromHMP_IFLFilePlugin). Once the IFL scripts have been run to populate these postgres tables, the marker_id and dnarun_id values exist. This script pulls the id values from the postgres DB to create the final 2 intermediate files needed by the GOBII loadVariantMatrix.py script. See this link for details: http://cbsugobii05.tc.cornell.edu:6084/display/TD/MonetDB+IFL

The db config file should look like this:
host=cbsudc01.tc.cornell.edu
user=
password=
DB=gobii_maizeifltest (or other postgres db you want to query)

The outputDir field should also contain the prefix for the file. Previously this was the dataset.name, but GOBII names their .h5 file and monetdb table with DS_, and we want to do the same to be consistent. This will need to be queried from the db before running this plugin; if the dataset name is known, this is a simple query.

August 2016 UPDATE: The list of markers in the marker_id file, and of dnarun_ids in the dnarun_id file, must be in the proper order: the order the markers are stored in the monetdb table, i.e. the order they are stored in the variant file. To achieve this, the DB query orders the output by marker_idx (marker query) and dnarun_idx (dnarun query). These idx values were created sequentially when the marker and dnarun tables were created.
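The ordered id queries described in the August 2016 update can be sketched as below. Table and column names are taken from the description above; the exact SQL the plugin issues may differ.

```python
def id_queries(dataset_id):
    # The marker_id and dnarun_id files must list ids in the same order
    # as the variant matrix, so both queries order by the idx columns
    # that were assigned sequentially at load time.
    marker_sql = (
        "SELECT marker_id FROM dataset_marker "
        f"WHERE dataset_id = {int(dataset_id)} ORDER BY marker_idx"
    )
    dnarun_sql = (
        "SELECT dnarun_id FROM dataset_dnarun "
        f"WHERE dataset_id = {int(dataset_id)} ORDER BY dnarun_idx"
    )
    return marker_sql, dnarun_sql
```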
|
| PreProcessGOBIIMappingFilePlugin |
This plugin should be run prior to creating the intermediate files for marker and dnarun. The plugin has 3 purposes, using the mapping file created for the dataset:
1. Identify duplicate/missing germplasm/dnasample entries, create intermediate files for the germplasm and dnasample tables, and load any missing entries. Duplicates are skipped.
2. Identify duplicate libraryPrepIds and write the list of duplicates to a file.
3. Provide mapping data to load new marker/dnarun related tables: create intermediate files and load via the GOBII IFL scripts.

For the first 2 purposes, the database must be queried. Missing entries are defined as below:
- germplasm table: From the db, get the list of distinct MGIDs (they should all be distinct). Use this list to compare to MGIDs in the file. For any MGIDs that don't appear, create a line in the *.germplasm intermediate file used to add values.
- dnasample table: From the db, get a list of dnasample names. These names are strings comprised of the components GID:plate:well. From the input file, for each entry, create a concatenated string of GID:plate:well and compare it to the list from the db. For any names that don't appear, create a line in the *.dnasample intermediate file for loading. This file needs the "name" field to be the concatenation GID:plate:well, as this will be unique and the GOBII dnasample.dupmap looks at only the name field. Code can be MGID if we need that stored (which I think we do). It takes an "external code" column instead of germplasm_id, as that maps to the external_code field in the germplasm table when GOBII IFL looks to find the germplasm_id from the DB. This file also needs project_name, which comes from the mapping file.
- dnarun table: From the db, get a list of all dnasample.name fields. These should be distinct library prep ids. Compare them to the libraryPrepIds from the mapping file. If there are duplicates, write them to a file to show the biologist.
NOTES: GOBII uses dnasample.name and dnasample.num to determine duplicates. BL is not populating dnasample.num, so "num" has been removed from the dnasample.dupmap file when running this. For some reason, with it present but all values "null", the script believed the values were different, and I ended up duplicating all dnasamples when sending the file through the GOBII scripts. When I removed this line, the scripts only checked the "name" field and project id, and it worked.

For step 3: the intermediate files are created by MarkerDNARunMGID_fromHMPIFIFIlePLugin.java. Note the dnasample and germplasm entries must be loaded to the db before loading the marker/dnarun intermediate files, or the necessary db ids will not be found.
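The two database comparisons above (missing GID:plate:well names, duplicate libraryPrepIds) can be sketched as pure functions. Function names and the mapping-row keys are hypothetical, not the plugin's actual API.

```python
from collections import Counter

def missing_dnasamples(mapping_rows, db_names):
    # dnasample.name is the concatenation GID:plate:well; any name not
    # already in the db gets a line in the *.dnasample intermediate file.
    names = {f"{r['GID']}:{r['plate']}:{r['well']}" for r in mapping_rows}
    return sorted(names - set(db_names))

def duplicate_library_ids(library_ids):
    # Duplicate libraryPrepIds get written to a file for the biologist.
    return sorted(i for i, n in Counter(library_ids).items() if n > 1)
```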
|
| SplitFile_IFLFilePlugin | |
| UpdateMarkerAndDNA_idxes |
Once I have the datasets fixed, this class should not be needed. What it does: initially the marker_idx and dnarun_idx columns of the dataset_marker and dataset_dnarun tables, respectively, were not populated. They are now needed and are populated. Kevin Palis created a couple of scripts to handle populating these fields in tables where they were missing. These scripts live with the gobii_ifl scripts on CBSU, and are called update_marker_idx.py and update_dnarun_idx.py. For GOBII, they are in the repository at the same level as the gobii_ifl.py scripts. This class creates an intermediate file that will be worked on by the preprocess_ifile.py script. You can also run the gobii_ifl.py script instead if you uncomment the "return" statement that occurs after the preprocess_ifile.py script has been called.

Here is the order:
1. Run this class to create the needed files (DS_X.mh5i and DS_X.sh5i).
2. sftp these files to cbsudc01.tc.cornell into the /workdir/lcj34/postgresFiles/update_idxes_files dir.
3. Run the file through the gobii_ifl scripts (change the script to return after the preprocess_ifl.py step!):
   python gobii_ifl.py -c postgresql://lcj34:@localhost:5432/gobii_maize2 -i /workdir/lcj34/postgresFiles/update_idxes_files/DS_5.sh5i -o /tmp/ -v
4. Run the /tmp/ppd_* file created in step 3 through the update_dnarun_idx.py or update_marker_idx.py script:
   python update_dnarun_idx.py "postgresql://lcj34:@cbsudc01.tc.cornell.edu/gobii_maize2" /tmp/ppd_DS_5.sh5i 5
5. Verify the db has values for dataset_marker.marker_idx and dataset_dnarun.dnarun_idx for the specified dataset_id.
6. Change the gobii_ifl.py script to re-comment the "return" after the preprocess_ifl call.
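The backfill the update scripts perform amounts to assigning sequential positions to rows in their load order. The sketch below is illustrative only: the function name is hypothetical, and whether the real scripts start the idx values at 0 or 1 would need to be checked against update_marker_idx.py / update_dnarun_idx.py.

```python
def assign_idx(ids, start=0):
    # marker_idx / dnarun_idx are sequential positions reflecting the
    # order the rows were created; assumed 0-based here.
    return [(rec_id, idx) for idx, rec_id in enumerate(ids, start=start)]
```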
|