Text mining and integration of genetic association information
The volume of available biomedical data is continually increasing, yet significant barriers to accessing these data remain. While research studies are published online, they are often submitted without machine-readable versions, limiting how quickly and efficiently key information can be extracted and used by automated systems.
With an emphasis on the association between genotype and phenotype, this thesis details the approaches undertaken to extract information from genome-wide association study (GWAS) publications. Using natural language processing techniques, I developed “GWAS Miner”, which utilises ontology terms to annotate and extract data from the full text and tables of GWAS publications. This enables scalable data curation for database resources such as GWAS Central. Additionally, I developed “GWAS Tagger” for the automated annotation of a GWAS corpus, which can be used to train and test machine learning models for text mining.
GWAS Central is one of the largest sources of summary-level GWAS data, providing users with tools for both comparing and visualising GWAS data, along with phenotype ontologies. This thesis also describes how I extended the GWAS Central resource to integrate GWAS summary-level data with mouse disease model data from the International Mouse Phenotyping Consortium (IMPC). Combining model organism data with human GWAS data, accessible via novel web interfaces, enables researchers to compare mouse gene knockout experiment data with human GWAS data in order to identify genes of interest for follow-up research and to corroborate existing findings.
History
Supervisor(s)
- Tim Beck
Date of award
- 2024-06-17
Author affiliation
- Department of Genetics & Genome Biology
Awarding institution
- University of Leicester
Qualification level
- Doctoral
Qualification name
- PhD