NAME

HGVbaseG2P::DataImport::Core - High-level logic for the data import pipeline


SYNOPSIS

        # Import a dbGaP file
        use HGVbaseG2P::DataImport::Core;
        my $validator = HGVbaseG2P::DataImport::Core->new({
            conf_file     => "/Users/robf/Projects/hgvbaseG2P/perl/conf/hgvbase.conf",
            argv          => \%ARGV,
            template_file => '../conf/import/dbGaP.conf',
        });

        $validator->process_file($ARGV{sourcefile});

        # $logger is assumed to be a logger object created elsewhere
        $logger->info("lines successful:\n" . $validator->marker_success);
        my $marker_stats = $validator->marker_failed;
          
DESCRIPTION

The DataImport::Core module ties together the logic of RetrieveMarker, Rule::*, Plugin::* and Export::* modules.

Keeping the core logic separate from these modules makes it easier to cover with regression and unit tests. It also makes it possible to add further Rule modules or to swap in a different Export module.

To demonstrate the core principles, an example data set (WTCCC) was used.


THE TEMPLATE FILE/HASH

This contains the criteria for import of data. An example template is shown below with comments:

        #WTCCC FlatFile import config - comment line
        plugins=AffymetrixLookup - use the AffymetrixLookup plugin - specify plugin module names to include here (space-separated)
        study_id=HGVST2 - study identifier which can also be specified as a field
        experiment_id=HGVE2 - experiment_identifier which can also be specified as a field 
        resultset_id=HGVRS2 - resultset identifier which can also be specified as a field
        head=1 - number of heading rows (default is 0)
        accession_db=dbSNP - database to use for marker lookups
        comment_prefix=# - character used to specify comment lines
        field_separator=\t - delimiter used to split data (default is tab delimited) 
        gt_order=11 12 22 - order of genotypes in a combined field
        allele_order=1 2 - order of alleles in a combined field
        frequencies=1 - indicates there are frequencies in this file
        associations=1 - indicates there are associations in this file
        
        #the fields in the delimited file in order of appearance
        #the logic looks for the presence of specific fields to determine which import steps to perform
        <Fields>
        1=affymetrixid
        2=accession
        3=position
        4=allele1
        5=allele2
        6=av_max_post_call
        7=genotype11_number:CONTROL
        8=genotype12_number:CONTROL
        9=genotype22_number:CONTROL
        10=genotype_none:CONTROL
        11=genotype11_number:CASE
        12=genotype12_number:CASE
        13=genotype22_number:CASE
        14=genotype_none:CASE
        15=pvalue
        </Fields>

Frequency fields

The fields are used to build a hash with the field names as keys and the corresponding elements of the delimited line as values. The presence or absence of specific keys then determines which import steps are performed. Any other keys are still present in the data hash, and can be used by a plugin to perform different processes on the data.
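As a rough illustration of this mapping, the sketch below turns one delimited line into a hash keyed by field name. The helper name line_to_hash, the shortened field map and the sample line are made up for the example; the module's real code may differ.

```perl
use strict;
use warnings;

# Hypothetical field map, as it might look after reading a <Fields> block
# (column number => field name). Shortened for the example.
my %fields = (
    1 => 'affymetrixid',
    2 => 'accession',
    3 => 'position',
    4 => 'allele1',
    5 => 'allele2',
);

# Turn one delimited line into a hash keyed by field name.
# line_to_hash is an illustrative helper, not part of the module.
sub line_to_hash {
    my ($line, $fields, $separator) = @_;
    chomp $line;
    my @values = split /$separator/, $line, -1;    # -1 keeps trailing empty columns
    my %data;
    for my $column (sort { $a <=> $b } keys %$fields) {
        $data{ $fields->{$column} } = $values[ $column - 1 ];
    }
    return \%data;
}

# Made-up sample line
my $data = line_to_hash("SNP_A-1\trs123\t1000\tA\tC\n", \%fields, "\t");

# Presence or absence of keys can then drive later decisions.
print "has alleles\n" if exists $data->{allele1} && exists $data->{allele2};
print "accession: $data->{accession}\n";
```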

Reserved fields include the following:

Simple fields

        accession - the marker ID from the source DB (eg. dbSNP)
        position - the expected position of the marker
        panelname - the panel name to lookup in the study (e.g. CONTROL)
        allele1 - the label for allele 1 (eg. A)
        allele2 - the label for allele 2 (eg. C)
        genotype11 - the label for genotype11 (eg. AA)
        genotype12 - the label for genotype12 (eg. AC)
        genotype22 - the label for genotype22 (eg. CC)
        allele1_number - the number of people with allele 1
        allele2_number - the number of people with allele 2
        genotype11_number - the number of people with genotype 11
        genotype12_number - the number of people with genotype 12
        genotype22_number - the number of people with genotype 22
        allele1_frequency - the frequency of allele1
        allele2_frequency - the frequency of allele2
        genotype11_frequency - the frequency of genotype 11
        genotype12_frequency - the frequency of genotype 12
        genotype22_frequency - the frequency of genotype 22
        allele_total - the total number of alleles
        genotype_total - the total number of genotypes

Frequencies/numbers and totals are calculated if not present, as long as the bare minimum information needed to derive them is present.
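The kind of calculation involved can be sketched as follows. This is standard allele counting for a marker with two alleles, assuming each individual carries two alleles; the function names are hypothetical and the module's Calc* rules may handle more cases than shown here.

```perl
use strict;
use warnings;

# Genotype frequencies from genotype counts.
sub genotype_freqs_from_numbers {
    my ($n11, $n12, $n22) = @_;
    my $total = $n11 + $n12 + $n22;
    return if $total == 0;                         # nothing to calculate
    return map { $_ / $total } ($n11, $n12, $n22);
}

# Allele frequencies from genotype counts
# (assumption: every individual carries two alleles).
sub allele_freqs_from_genotype_numbers {
    my ($n11, $n12, $n22) = @_;
    my $allele_total = 2 * ($n11 + $n12 + $n22);
    return if $allele_total == 0;
    my $freq1 = (2 * $n11 + $n12) / $allele_total;
    return ($freq1, 1 - $freq1);
}

# Made-up counts for genotypes 11, 12 and 22
my @gt_freqs     = genotype_freqs_from_numbers(25, 50, 25);
my @allele_freqs = allele_freqs_from_genotype_numbers(25, 50, 25);
printf "genotypes: %.2f %.2f %.2f\n", @gt_freqs;
printf "alleles:   %.2f %.2f\n", @allele_freqs;
```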

Multiple panels per line

The simple fields above only work if a line contains information for a single panel. To handle several panels on one line, append a colon followed by the panel name to an allele or genotype field, eg. genotype11_frequency:CONTROL

These fields are dealt with in the same way as standard frequencies/numbers (e.g. data is calculated as required).
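One way to picture the panel suffix is to group the keys of the data hash by panel name, as in the sketch below. The helper split_by_panel and the sample values are hypothetical; this only illustrates the naming convention, not the module's internals.

```perl
use strict;
use warnings;

# Group "field:PANEL" keys by panel name; keys without a suffix stay shared.
# split_by_panel is an illustrative helper, not part of the module.
sub split_by_panel {
    my ($data) = @_;
    my (%shared, %by_panel);
    for my $key (keys %$data) {
        if ($key =~ /^([^:]+):(.+)$/) {
            $by_panel{$2}{$1} = $data->{$key};
        }
        else {
            $shared{$key} = $data->{$key};
        }
    }
    return (\%shared, \%by_panel);
}

# Made-up line data with two panels
my ($shared, $by_panel) = split_by_panel({
    accession                   => 'rs123',
    'genotype11_number:CONTROL' => 10,
    'genotype11_number:CASE'    => 20,
});
print "CASE genotype11_number: $by_panel->{CASE}{genotype11_number}\n";
```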

Compound fields

For multiple alleles or genotypes represented in a single field, the DataImport pipeline splits the data into its component fields. A regex can then be used to extract allele or genotype numbers from each component. The separators and regexes are specified in the template file. These are examples for the CGEMS dataset:


        gt_freq_separator=\|
        gt_number_separator=\|
        gt_number_regex=(.*)\(.*\)
        allele_separator=\|
        allele_number_separator=\|
        allele_number_regex=(.*)\(.*\)

Also important for compound fields is the order of alleles or genotypes. This is specified using the following options:


        allele_order=1 2
        gt_order=11 12 22
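The split-then-regex idea can be sketched like this. The separator, regex and order values are taken from the template examples in this document; the helper split_compound and the input value '120(0.30)|200(0.50)|80(0.20)' are made up for illustration and may not match the real CGEMS layout.

```perl
use strict;
use warnings;

# Settings taken from the template examples in this document.
my %template = (
    gt_number_separator => '\|',
    gt_number_regex     => '(.*)\(.*\)',
    gt_order            => '11 12 22',
);

# Split a compound value and assign the parts to numbered fields in order.
# split_compound is an illustrative helper, not part of the module.
sub split_compound {
    my ($value, $separator, $order, $prefix, $suffix, $regex) = @_;
    my @parts  = split /$separator/, $value;
    my @labels = split ' ', $order;
    my %out;
    for my $i (0 .. $#labels) {
        my $part = $parts[$i];
        # Optionally keep only the first capture group of the regex.
        if (defined $regex && defined $part && $part =~ /$regex/) {
            $part = $1;
        }
        $out{ $prefix . $labels[$i] . $suffix } = $part;
    }
    return \%out;
}

my $numbers = split_compound(
    '120(0.30)|200(0.50)|80(0.20)',      # made-up compound value
    $template{gt_number_separator},
    $template{gt_order},
    'genotype', '_number',
    $template{gt_number_regex},
);
print "$_ = $numbers->{$_}\n" for sort keys %$numbers;
```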

Compound fields available

        alleles - both alleles in a single field (eg. A|C) - a rule is added to split these using 'allele_separator' as delimiter, and they are put in 'allele_order'
        genotypes - all genotypes in a single field (eg. AA|AC|CC) - a rule is added to split these using 'gt_separator' as delimiter, and they are put in 'gt_order'
        allele_frequencies - both allele frequencies in a single field (eg. 0.25|0.75) - a rule is added to split these using 'allele_freq_separator' as delimiter, and they are put in 'allele_order'
        genotype_frequencies - all genotype frequencies in a single field (eg. 0.3|0.5|0.2) - a rule is added to split these using 'gt_freq_separator' as delimiter, and they are put in 'gt_order'
        allele_numbers - same as allele_frequencies for numbers
        genotype_numbers - same as genotype_frequencies for numbers

As with simple fields, when a line holds several panels the panel name is appended to the compound field after a colon, eg. genotype_frequencies:CASE

NOTE: if the Database exporter is used an error will occur if panels cannot be found.

Association Data

Significance data is not always in the same file as the frequency data. However, the DataImport pipeline can deal with this as long as an rsid is present or can be retrieved (eg. through a plugin).

If there are separate frequency/association files, two different templates should be used. If there is a single combined file the significance results can be included in a combined template.

The association based field layout is simpler than with frequencies because a single P value and/or odds ratio represents all panels involved. However, there is the need to allow the import of multiple resultsets from a single file.

If the resultset is specified for each line of the file (as in CGEMS), use 'resultset_id' or 'resultset_name', together with 'pvalue' and 'oddsratio', as column names. N.B. A resultset_id set in the template overrides the resultset given on the line, so it must be removed from the template when the resultset is specified per line.
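A template for such a file might look like the fragment below. It is a hypothetical example built only from the options described in this document: the column order is made up, and the remaining settings depend on the dataset. Note that no resultset_id is set outside the Fields block.

```
#association-only import config with one resultset per line (hypothetical)
study_id=HGVST2
head=1
accession_db=dbSNP
field_separator=\t
associations=1

<Fields>
1=accession
2=resultset_name
3=pvalue
4=oddsratio
</Fields>
```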

TODO: If the significance data for multiple resultsets is included on a single line, set 'resultset_lookup' to identifier, label or name, and then use pvalue:info as the field name (eg. pvalue:HGVRS1 when looking up by identifier).

NOTE: if the Database exporter is used an error will occur if resultsets cannot be found.


MAIN LOGIC

The template provides all of the info required by the pipeline to decipher the input file.

setup

Initially, various helper classes are prepared. These include classes for database access, retrieval of validated markers, user-specified plugins, and the exporter which deals with exporting the lines of data once they have been processed.

The rule modules are also prepared at this time. The logic for adding these modules depends on the presence/absence of fields in the template. At the moment the core logic contains the following rule modules (prefix of HGVbaseG2P::DataImport::Rule::):

PositionsMatch - checks that the chromosome/position retrieved from HGVbaseG2P matches the input line
StrandFlip::Genome - deals with strand flips when rs2genome and rs2ss fields are present
StrandFlip::Standard - deals with standard strand flips
AllelesMatch - checks that the alleles in HGVbaseG2P match those provided
PopulateGenotypesFromAlleles - creates genotypes using alleles (eg. takes A and C and makes A, A+C and C genotypes)
SplitField - generic rule for splitting a field using a delimiter and putting the parts into new fields specified by an order
RegexFields - generic rule for performing a regex on several fields and putting the results in the same fields
CalcAlleleFreqsFromNumbers - calculates allele frequencies using the number of individuals with specific alleles
CalcAlleleFreqsFromGenotypeNumbers - calculates allele frequencies when only the numbers of individuals with each genotype are available
CalcGenotypeFreqsFromNumbers - calculates genotype frequencies using the number of individuals with each genotype
FreqFieldCheck - checkpoint that only runs once; indicates if essential fields are missing from the output
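The idea of choosing rules from the fields present can be sketched as below. The conditions are illustrative guesses at how field presence might map to rules; they are not the module's actual logic, and choose_rules is a hypothetical helper.

```perl
use strict;
use warnings;

# Decide which rules to queue from the field names in a template.
# The conditions are illustrative guesses, not the module's actual logic.
sub choose_rules {
    my (@field_names) = @_;

    # Drop any ":PANEL" suffix so per-panel fields count as well.
    my %has;
    for my $name (@field_names) {
        my ($base) = split /:/, $name;
        $has{$base} = 1;
    }

    my @rules;
    push @rules, 'PositionsMatch' if $has{position};
    push @rules, 'AllelesMatch'   if $has{allele1} && $has{allele2};
    push @rules, 'SplitField'     if $has{alleles} || $has{genotypes};
    push @rules, 'CalcGenotypeFreqsFromNumbers'
        if $has{genotype11_number} && !$has{genotype11_frequency};
    return @rules;
}

my @rules = choose_rules(qw(
    accession position allele1 allele2
    genotype11_number:CONTROL genotype12_number:CONTROL genotype22_number:CONTROL
));
print "HGVbaseG2P::DataImport::Rule::$_\n" for @rules;
```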

If required, plugins are 'setup' at this point.

process

The next stage is processing of each line of the file (ignoring header lines and comment lines) by validating the marker ID, and performing the chosen rules upon the line data. Each line is represented by an HGVbaseG2P::DataImport::DataElement which contains 'line', 'status' and 'marker'.

At the moment, the resulting DataElements are batched before they are passed to the exporter module.

Plugins will also work at various points within this processing (eg. before retrieving marker, after retrieving marker, after all rules etc).
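The overall shape of this stage can be sketched as a loop that skips heading and comment lines, wraps each remaining line in a small record, and hands batches to an exporter. Everything here is a stand-in: batch_size, export_batch, the 'ok' status and the hash used in place of a DataElement are assumptions made for the example, and marker validation, rules and plugins are omitted.

```perl
use strict;
use warnings;

# Hypothetical settings mirroring the template options head, comment_prefix
# and field_separator; batch_size is an invented name for illustration.
my %conf = (head => 1, comment_prefix => '#', field_separator => "\t", batch_size => 2);

# Made-up input lines
my @input = (
    "accession\tpvalue\n",    # heading row, skipped because head=1
    "# a comment line\n",
    "rs1\t0.01\n",
    "rs2\t0.20\n",
    "rs3\t0.05\n",
);

my (@batch, @exported);

# Stand-in for an exporter: here it only records each batch it receives.
sub export_batch {
    my ($elements) = @_;
    push @exported, [@$elements];
}

my $line_number = 0;
for my $line (@input) {
    $line_number++;
    next if $line_number <= $conf{head};              # heading rows
    next if $line =~ /^\Q$conf{comment_prefix}\E/;    # comment lines
    chomp(my $copy = $line);
    my @values = split /$conf{field_separator}/, $copy, -1;

    # Minimal stand-in for a DataElement: line, status and marker.
    push @batch, { line => \@values, status => 'ok', marker => $values[0] };

    if (@batch >= $conf{batch_size}) {
        export_batch(\@batch);
        @batch = ();
    }
}
export_batch(\@batch) if @batch;    # flush the final partial batch

printf "%d batches exported\n", scalar @exported;
```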

teardown

Any de-initialisation is carried out here, including database disconnection and plugin 'teardown'.


HELPER CLASSES

Important logic is performed by the following helper classes:

HGVbaseG2P::Database::Study - provides access to the study metadata and data
HGVbaseG2P::Database::Marker - provides access to the marker catalog
HGVbaseG2P::DataImport::RetrieveMarker - retrieves a marker using Database::Marker taking into account strand flips, and missing markers

As described above, there are also changeable helper classes, in particular the Rule and Export modules.

The default export module is Export::Database, which inserts frequency and association data straight into an HGVbaseG2P database, but it would be simple to develop a module which outputs the data to one or more files.

Although it is simple to add additional rules to the core logic, it is suggested that where possible Plugin modules are used instead. Plugins can be simply included in the 'plugins' line of the DataImport template passed to the core module.

If you do wish to add additional rules, a Rule module must be created and the 'prepare_rule' method modified. Based on the presence/absence of fields in the input file, this method adds the rules needed to deal with the data. This can involve simple comparison with the DB, or calculation of data such as frequencies.


SUBROUTINES/METHODS

process_file

  Usage      : $gtimport->process_file('/path/to/gtdatafile.xml')
  Purpose    : Main method for processing an input gt-data file from a particular source
  Returns    : undef if non-fatal errors are encountered during run, dies on fatal errors
  Arguments  : Path to input file.
  Throws     : 
  Status     : Public
  Comments   :