A Graph-Based Architecture For Efficient Genome Data Representation And Variant Transformation
Processing massive, complex genomic data sets is increasingly time-consuming. The scale and complexity of this data, which spans diverse file types and heterogeneous data structures, pose significant challenges for storage, processing, and retrieval, and traditional models struggle with the high dimensionality and iterative nature of genomic data analysis. This thesis introduces a graph-based data model to represent and process human genome variations efficiently, addressing key challenges in the current genomic data landscape, including data heterogeneity, volume, and structural complexity. By mapping both the reference genome and genome variant data (from VCF files) into a unified graph model, the proposed graph-based architecture enhances data accessibility, scalability, and speed in genome variant analysis. The research formalises a property graph model in which genomic variations, such as substitutions, insertions, and deletions, are mapped as nodes connected to the reference genome. A graph-based variant normalisation algorithm is presented that ensures consistent variant representation across different VCF data sources. A graph database is employed for fast data retrieval, with response times reduced from minutes to milliseconds. This approach provides a scalable and adaptable solution to genomic data processing, facilitating more efficient research and enabling new opportunities in personalised and precision medicine.
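The idea of mapping variants as nodes attached to a reference genome, with normalisation so that equivalent records from different VCF sources collapse to one node, can be illustrated with a minimal sketch. This is not the thesis's algorithm or schema: the names `Variant`, `PropertyGraph`, `trim`, the `NEXT` and `VARIANT_OF` edge labels, and the node identifier format are all hypothetical, and the normalisation shown is only allele trimming (a complete normaliser would also left-align indels against the reference).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


def trim(v: Variant) -> Variant:
    """Drop shared trailing, then leading, bases; keep at least one base per allele."""
    pos, ref, alt = v.pos, v.ref, v.alt
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return Variant(v.chrom, pos, ref, alt)


def kind(v: Variant) -> str:
    """Classify a trimmed variant by comparing allele lengths."""
    if len(v.ref) == len(v.alt):
        return "substitution"
    return "insertion" if len(v.ref) < len(v.alt) else "deletion"


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)  # node id -> property dict
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add_reference(self, chrom: str, sequence: str, start: int = 1) -> None:
        """Add one node per reference base, chained by NEXT edges."""
        prev = None
        for offset, base in enumerate(sequence):
            node_id = f"ref:{chrom}:{start + offset}"
            self.nodes[node_id] = {"label": "Reference", "chrom": chrom,
                                   "pos": start + offset, "base": base}
            if prev is not None:
                self.edges.append((prev, "NEXT", node_id))
            prev = node_id

    def add_variant(self, variant: Variant) -> str:
        """Trim the variant, then attach it to its reference position once."""
        v = trim(variant)
        node_id = f"var:{v.chrom}:{v.pos}:{v.ref}>{v.alt}"
        if node_id not in self.nodes:
            self.nodes[node_id] = {"label": "Variant", "type": kind(v),
                                   "ref": v.ref, "alt": v.alt}
            self.edges.append((node_id, "VARIANT_OF", f"ref:{v.chrom}:{v.pos}"))
        return node_id


graph = PropertyGraph()
graph.add_reference("chr1", "ACGTAC")
# Two sources describe the same T>C change with different padding.
first = graph.add_variant(Variant("chr1", 3, "GTA", "GCA"))
second = graph.add_variant(Variant("chr1", 4, "T", "C"))
print(first, second)
```

Because both records trim to the same position and alleles, they resolve to a single variant node with one edge to the reference, which is the property a normalisation step is meant to guarantee before the data is loaded into a graph database.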
History
Supervisor(s)
Ashiq Anjum; Lu Liu
Date of award
2024-11-04
Author affiliation
School of Computing and Mathematical Sciences
Awarding institution
University of Leicester
Qualification level
- Doctoral
Qualification name
- PhD