Exploiting Public Human Genome NGS Datasets to Characterize Repetitive DNA and Recover Assembly Gaps

Ogeh, Denye Nathaniel

2018OGEHDNPhD.pdf (3.6 MB)

thesis

posted on 2018-07-03, 08:45 authored by Denye Nathaniel Ogeh

With the advent of Next Generation Sequencing (NGS), enormous volumes of short-read sequence data can be generated cheaply and on short time scales. Nevertheless, the quality of genome assemblies generated using NGS technologies has suffered compared with that of assemblies generated using Sanger DNA sequencing. This is largely due to the inability of short-read sequence data alone to scaffold repetitive structures, which creates gaps, inversions and rearrangements and ultimately results in assemblies that are, at best, drafts (by draft we mean a preliminary assembly that requires further work to become a more complete and accurate representation of the genome). Single molecule long-read sequencing (SMS) technologies, on the other hand, address this challenge by generating sequences with greatly increased read lengths, offering the prospect of better recovering these complex repetitive structures and thereby improving assembly quality. Following this development, we evaluate the ability of SMS data (specifically Pacific Biosciences SMRT data and Oxford Nanopore MinION data from human genomes) to recover poorly represented repetitive sequences (specifically, GC-rich human minisatellites), identify novel transposable element insertions and enable the closing of gapped regions. Our results show that, using single molecule long-read sequencing data, poorly represented repetitive sequences (specifically, minisatellites and L1s) and other missing elements in published human genome assemblies can be characterized with custom software developed for this purpose, which scales to the analysis of single molecule long reads (particularly those from Pacific Biosciences’ SMRT technology). The tool is cross-platform, giving computational and non-computational biologists a straightforward, less technical platform for local analysis of specific poorly characterized DNA sequences.

History

Supervisor(s)

Badge, Richard

Date of award

2018-05-11

Author affiliation

Department of Genetics

Awarding institution

University of Leicester

Qualification level

Doctoral

Qualification name

PhD

Language

en

Keywords

IR content
