HGVbaseG2P::DataImport::Core - High-level logic for the data import pipeline
  # for dbGaP import of a file
  use HGVbaseG2P::DataImport::Core;

  my $validator = HGVbaseG2P::DataImport::Core->new({
      conf_file     => "/Users/robf/Projects/hgvbaseG2P/perl/conf/hgvbase.conf",
      argv          => \%ARGV,
      template_file => '../conf/import/dbGaP.conf',
  });
  $validator->process_file($ARGV{sourcefile});
  $logger->info("lines successful:\n" . $validator->marker_success);
  my $marker_stats = $validator->marker_failed;

=head1 DESCRIPTION
The DataImport::Core module ties together the logic of RetrieveMarker, Rule::*, Plugin::* and Export::* modules.
The separation of core logic makes it easier to test the module by use of regression/unit testing. It also provides the option to add additional Rule modules or a different Export module.
To demonstrate the core principles, an example data set (WTCCC) was used.
The import template contains the criteria for importing the data. An example template, with comments, is shown below:
  #WTCCC FlatFile import config   - comment line
  plugins=AffymetrixLookup        - specify plugin module names to include here (space-separated); this uses the AffymetrixLookup plugin
  study_id=HGVST2                 - study identifier, which can also be specified as a field
  experiment_id=HGVE2             - experiment identifier, which can also be specified as a field
  resultset_id=HGVRS2             - resultset identifier, which can also be specified as a field
  head=1                          - number of heading rows (default is 0)
  accession_db=dbSNP              - database to use for marker lookups
  comment_prefix=#                - character used to mark comment lines
  field_separator=\t              - delimiter used to split data (default is tab)
  gt_order=11 12 22               - order of genotypes in a combined field
  allele_order=1 2                - order of alleles in a combined field
  frequencies=1                   - indicates there are frequencies in this file
  associations=1                  - indicates there are associations in this file

  #the fields in the delimited file, in order of appearance
  #the logic looks for the presence of specific fields to determine what processing to do
  <Fields>
    1=affymetrixid
    2=accession
    3=position
    4=allele1
    5=allele2
    6=av_max_post_call
    7=genotype11_number:CONTROL
    8=genotype12_number:CONTROL
    9=genotype22_number:CONTROL
    10=genotype_none:CONTROL
    11=genotype11_number:CASE
    12=genotype12_number:CASE
    13=genotype22_number:CASE
    14=genotype_none:CASE
    15=pvalue
  </Fields>
The fields are used to build a hash with the field names as keys and the elements of the delimited line as values. The presence or absence of specific keys is then used by the logic to determine the processing performed. Any other keys remain in the data hash and can be used by a plugin to perform further processing on the data.
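As an illustration only (this is not the module's actual implementation), the sketch below shows how a tab-delimited line could be turned into such a hash using the <Fields> mapping from the template; the variable names and data are made up:

  use strict;
  use warnings;

  # Hypothetical mapping parsed from the <Fields> block above
  my %fields = (
      1 => 'affymetrixid',
      2 => 'accession',
      3 => 'position',
      7 => 'genotype11_number:CONTROL',
      # ... remaining columns omitted
  );

  # One data line from the input file (made-up values)
  my $line     = "SNP_A-0001\trs0000001\t123456";
  my @elements = split /\t/, $line;      # field_separator from the template

  # Build the data hash: field name => element from the delimited line
  my %data;
  for my $column ( keys %fields ) {
      $data{ $fields{$column} } = $elements[ $column - 1 ];
  }

  # Downstream logic then checks for keys such as 'accession' or
  # 'genotype11_number:CONTROL' to decide what processing to perform.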
Reserved fields include the following:
  accession            - the marker ID from the source DB (e.g. dbSNP)
  position             - the expected position of the marker
  panelname            - the panel name to look up in the study (e.g. CONTROL)
  allele1              - the label for allele 1 (e.g. A)
  allele2              - the label for allele 2 (e.g. C)
  genotype11           - the label for genotype 11 (e.g. AA)
  genotype12           - the label for genotype 12 (e.g. AC)
  genotype22           - the label for genotype 22 (e.g. CC)
  allele1_number       - the number of people with allele 1
  allele2_number       - the number of people with allele 2
  genotype11_number    - the number of people with genotype 11
  genotype12_number    - the number of people with genotype 12
  genotype22_number    - the number of people with genotype 22
  allele1_frequency    - the frequency of allele 1
  allele2_frequency    - the frequency of allele 2
  genotype11_frequency - the frequency of genotype 11
  genotype12_frequency - the frequency of genotype 12
  genotype22_frequency - the frequency of genotype 22
  allele_total         - the total number of alleles
  genotype_total       - the total number of genotypes
Frequencies, numbers and totals are calculated if they are not present, as long as the bare minimum of information is available.
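For example, genotype frequencies and the genotype total can be derived from genotype counts alone. A rough sketch of the calculation performed by the Calc* rules listed further below (the counts are made up):

  # Made-up genotype counts for a single panel
  my %data = (
      genotype11_number => 1420,
      genotype12_number => 480,
      genotype22_number => 100,
  );

  # Derive the total and the per-genotype frequencies
  my $total = $data{genotype11_number}
            + $data{genotype12_number}
            + $data{genotype22_number};

  $data{genotype_total} = $total;
  for my $gt (qw(genotype11 genotype12 genotype22)) {
      $data{"${gt}_frequency"} = $data{"${gt}_number"} / $total;
  }
  # genotype11_frequency=0.71, genotype12_frequency=0.24, genotype22_frequency=0.05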
The simple fields above only work if a line contains information for a single panel.
It is therefore possible to specify that allele or genotype data corresponds to a particular panel by appending a colon followed by the panel name, e.g. genotype11_frequency:CONTROL.
These fields are dealt with in the same way as the standard frequency/number fields (i.e. missing values are calculated as required).
For multiple alleles or genotypes represented in a single field, the import pipeline splits the data into its component fields, and a regex can be used to extract allele or genotype numbers once the split has been carried out. The separators and regexes are specified in the template file; these are examples for the CGEMS dataset:
  gt_freq_separator=\|
  gt_number_separator=\|
  gt_number_regex=(.*)\(.*\)
  allele_separator=\|
  allele_number_separator=\|
  allele_number_regex=(.*)\(.*\)
Also important for compound fields is the order of alleles or genotypes. This is specified using the following options:
  allele_order=1 2
  genotype_order=11 12 22
The compound fields recognised are:

  alleles              - both alleles in a single field (e.g. A|C); a rule is added to split these using 'allele_separator' as the delimiter, and they are put in 'allele_order'
  genotypes            - all genotypes in a single field (e.g. AA|AC|CC); a rule is added to split these using 'gt_separator' as the delimiter, and they are put in 'gt_order'
  allele_frequencies   - both allele frequencies in a single field (e.g. 0.25|0.75); a rule is added to split these using 'allele_freq_separator' as the delimiter, and they are put in 'allele_order'
  genotype_frequencies - all genotype frequencies in a single field (e.g. 0.3|0.6|0.2); a rule is added to split these using 'gt_freq_separator' as the delimiter, and they are put in 'gt_order'
  allele_numbers       - same as allele_frequencies, but for numbers
  genotype_numbers     - same as genotype_frequencies, but for numbers
And of course, for multiple panels on a line, :assayedpanelname is appended to the field, e.g. genotype_frequencies:CASE.
NOTE: if the Database exporter is used, an error will occur if panels cannot be found.
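As a rough illustration of what the generated split/regex rules do with a compound field (the values, and the use of the CGEMS-style settings above, are hypothetical):

  # Compound genotype-number field, e.g. "710(0.71)|480(0.24)|100(0.05)"
  my $compound = '710(0.71)|480(0.24)|100(0.05)';
  my @gt_order = qw(11 12 22);             # gt_order=11 12 22
  my @parts    = split /\|/, $compound;    # gt_number_separator=\|

  my %data;
  for my $i ( 0 .. $#gt_order ) {
      # gt_number_regex=(.*)\(.*\) keeps the count and drops the bracketed part
      my ($number) = $parts[$i] =~ /(.*)\(.*\)/;
      $data{ 'genotype' . $gt_order[$i] . '_number' } = $number;
  }
  # %data: genotype11_number=710, genotype12_number=480, genotype22_number=100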
Significance data is not always in the same file as the frequency data. However, the import pipeline can deal with this as long as an rsid is present or can be retrieved (e.g. through a plugin).
If there are separate frequency/association files, two different templates should be used. If there is a single combined file the significance results can be included in a combined template.
The association-based field layout is simpler than the frequency layout, because a single P value and/or odds ratio represents all panels involved. However, there is a need to allow the import of multiple resultsets from a single file.
To do this, use 'resultset_id' or 'resultset_name', and 'pvalue' and 'oddsratio', as column names. N.B. a resultset_id in the template will override the per-line resultset, so it must be removed from the template if the resultset is specified per line.
Set 'resultset_lookup' to identifier, label or name, and then use pvalue:info (e.g. for an identifier, pvalue:HGVRS1).
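A hypothetical fragment of an association template using per-resultset P value columns might therefore look like this (the identifiers are made up):

  #association import with P values for two resultsets (hypothetical example)
  associations=1
  resultset_lookup=identifier
  <Fields>
    1=accession
    2=pvalue:HGVRS1
    3=pvalue:HGVRS2
  </Fields>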
NOTE: if the Database exporter is used, an error will occur if resultsets cannot be found.
The template provides all of the info required by the pipeline to decipher the input file.
Initially, various helper classes are prepared. These include classes for database access, retrieval of validated markers, user-specified plugins, and the exporter, which deals with exporting the lines of data once they have been processed.
Rule modules are also prepared at this time. The logic for adding these modules depends on the presence/absence of fields in the template. At the moment the core logic uses the following rule modules (prefixed with HGVbaseG2P::DataImport::Rule::):
  PositionsMatch                     - checks that the chromosome/position retrieved from HGVbaseG2P matches the input line
  StrandFlip::Genome                 - deals with strand flips when rs2genome and rs2ss fields are present
  StrandFlip::Standard               - deals with standard strand flips
  AllelesMatch                       - checks that the alleles in HGVbaseG2P match those provided
  PopulateGenotypesFromAlleles       - creates genotypes from alleles (e.g. takes A and C and makes the AA, AC and CC genotypes)
  SplitField                         - generic rule for splitting a field using a delimiter and putting the parts into new fields specified by an order
  RegexFields                        - generic rule for applying a regex to several fields and putting the results back into the same fields
  CalcAlleleFreqsFromNumbers         - calculates allele frequencies using the number of individuals with specific alleles
  CalcAlleleFreqsFromGenotypeNumbers - calculates allele frequencies when only genotype numbers are present
  CalcGenotypeFreqsFromNumbers       - calculates genotype frequencies using the number of individuals with each genotype
  FreqFieldCheck                     - checkpoint that only runs once; indicates if essential fields are missing from the output
If required, plugins are 'setup' at this point.
The next stage is the processing of each line of the file (ignoring header lines and comment lines): the marker ID is validated and the chosen rules are applied to the line data. Each line is represented by an HGVbaseG2P::DataImport::DataElement, which contains 'line', 'status' and 'marker'.
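Conceptually, this per-line stage looks something like the sketch below; the helper and method names (other than the DataElement fields mentioned above) are illustrative rather than the module's actual API:

  # Conceptual outline only; helper/method names are illustrative
  while ( my $line = <$fh> ) {
      next if $. <= $head;                       # skip heading rows
      next if $line =~ /^\Q$comment_prefix\E/;   # skip comment lines

      my $element = HGVbaseG2P::DataImport::DataElement->new({
          line   => $line,
          status => 'ok',
      });

      # look up the validated marker via its accession (e.g. a dbSNP rsid)
      $element->marker( $self->retrieve_marker($element) );

      # apply each prepared rule in turn; a rule may alter the line data
      # or flag the element as failed
      $_->process($element) for @rules;

      push @batch, $element;                     # batched for the exporter
  }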
At the moment, output lines are batched, and the batched DataElements are then passed to the exporter module.
Plugins also operate at various points within this processing (e.g. before retrieving the marker, after retrieving the marker, after all rules, etc.).
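In outline, a plugin is a module with a 'setup' method, one or more hook methods and a 'teardown' method. The skeleton below is a sketch only: the package name is invented and the per-line hook names are placeholders, not necessarily the real Plugin base-class API:

  package HGVbaseG2P::DataImport::Plugin::ExampleLookup;   # hypothetical name
  use strict;
  use warnings;

  sub new { my ($class, %args) = @_; return bless {%args}, $class; }

  sub setup         { my ($self) = @_; $self->{lookup} = {}; return; }  # e.g. load a lookup table
  sub before_marker { return; }   # e.g. map a chip ID to an rsid (placeholder hook name)
  sub after_marker  { return; }   # e.g. adjust line data using the retrieved marker
  sub after_rules   { return; }   # e.g. post-process the completed line
  sub teardown      { my ($self) = @_; delete $self->{lookup}; return; }

  1;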
Finally, any de-initialisation is carried out, including database disconnection, plugin 'teardown', etc.
Important logic is performed by the following single helper classes:
As described above, there are also interchangeable helper classes, in particular the Rule and Export modules.
The default export module is Export::Database, which inserts frequency and association data straight into an HGVbaseG2P database, but it would be simple to develop a module which outputs the data to one or more files.
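For instance, a file-based exporter might look roughly like the sketch below; the 'export' and 'finish' method names, and the hash-based access to DataElement fields, are assumptions rather than the documented exporter interface (see Export::Database for the real contract):

  package HGVbaseG2P::DataImport::Export::FlatFile;   # hypothetical module
  use strict;
  use warnings;

  sub new {
      my ($class, %args) = @_;
      open my $fh, '>', $args{outfile}
          or die "Cannot open $args{outfile}: $!";
      return bless { fh => $fh }, $class;
  }

  # Receive a batch of DataElements and write one line per element
  sub export {
      my ($self, @elements) = @_;
      for my $element (@elements) {
          my $line = $element->{line};   # assumes hash-based access to the element
          chomp $line;
          print { $self->{fh} } join( "\t", $element->{status}, $line ), "\n";
      }
      return;
  }

  sub finish { my ($self) = @_; close $self->{fh}; return; }

  1;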
Although it is simple to add additional rules to the core logic, it is suggested that, where possible, Plugin modules are used instead. Plugins can simply be included in the 'plugins' line of the import template passed to the core module.
If you do wish to add additional rules, a Rule module must be created and the 'prepare_rule' method modified. Based on the presence/absence of fields in the input file, this method adds rules to deal with the data; this can involve simple comparison with the DB, or calculation of data such as frequencies.
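In outline, an addition to 'prepare_rule' would look something like this; the rule name, field name and constructor arguments are invented for illustration, and %fields and @rules stand for the template field map and the prepared rule list:

  # Illustrative only: add a hypothetical rule when a 'strand' field
  # appears among the template's <Fields> values
  if ( grep { $_ eq 'strand' } values %fields ) {
      push @rules, HGVbaseG2P::DataImport::Rule::MyStrandCheck->new({
          field => 'strand',
      });
  }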
  Usage     : $gtimport->process_file('/path/to/gtdatafile.xml')
  Purpose   : Main method for processing an input gt-data file from a particular source
  Returns   : undef if non-fatal errors are encountered during run, dies on fatal errors
  Arguments : Path to input file.
  Throws    :
  Status    : Public
  Comments  :