University of Leicester

Text mining analysis on Pubmed data base

thesis
posted on 2023-02-22, 12:20 authored by Seyedeh Z. Rezaei Lalami

The volume of unstructured text data keeps growing, which makes information retrieval from large collections increasingly important. Many standard algorithms, such as classification and clustering, are difficult to apply to text because of its high dimensionality.

This study investigates a novel co-occurrence model of text data that helps reduce the dimension of the data set. We present a graph-based text mining approach for discovering similar documents in a scientific corpus and use it in a search engine built into an R Shiny web application. The Biological Scientific Corpus (BSC) is a collection of 764,213 PubMed-indexed English abstracts of research papers and proceedings papers, chosen to reflect the widest range of abstracts of scientific works published in 2012. Analysis of the co-occurrence matrix helps to understand how the words are interconnected. Applying a community detection method, we discovered hubs and strong communities in the co-occurrence network and used them to reduce the dimension of the network.
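As a rough illustration of the co-occurrence idea, the following minimal Python sketch counts how often two words appear in the same abstract and sums those counts per word as a simple hub score. It assumes document-level co-occurrence and whitespace tokenisation; the function names and the toy abstracts are hypothetical, it is not the thesis's R implementation, and community detection is not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for text in documents:
        # Distinct lowercase tokens, sorted so each pair has one canonical order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def weighted_degree(counts):
    """Sum the co-occurrence weights attached to each word (a simple hub score)."""
    degree = Counter()
    for (a, b), weight in counts.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Toy stand-in for abstracts; not data from the thesis.
toy_abstracts = [
    "protein binding assay",
    "protein binding kinetics",
    "gene expression assay",
]

pairs = cooccurrence_counts(toy_abstracts)
degree = weighted_degree(pairs)
```

Words with a high weighted degree are candidate hubs in the resulting network. On a real corpus the pair counts would normally be stored in a sparse matrix, since most word pairs never co-occur.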

After dimension reduction, we produced meaningful clusters of the data set. To check the quality of the clustering, we investigated how the authors of the papers are distributed over the clusters, and the results were satisfactory. Finally, we used a hierarchical approach to develop a search engine on the data set that accepts a query from a user and responds with a set of retrieved documents. The main advantage of this search engine is its ability to take long text, such as an abstract, as a query.
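One way to picture a search over clustered documents that accepts an abstract-length query is the two-level sketch below: the query is first matched to a cluster, then the documents inside that cluster are ranked. This is an assumed reading of "hierarchical", using plain bag-of-words cosine similarity. The `search` function, the cluster labels, and the toy documents are hypothetical and do not describe the thesis's actual method.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, clusters, top_k=2):
    """Pick the best-matching cluster, then rank the documents inside it."""
    q = term_vector(query)

    def cluster_score(name):
        # Level 1: compare the query with the cluster's pooled term counts.
        pooled = Counter()
        for doc in clusters[name]:
            pooled.update(term_vector(doc))
        return cosine(q, pooled)

    best = max(clusters, key=cluster_score)
    # Level 2: rank only the documents of the chosen cluster.
    ranked = sorted(
        clusters[best],
        key=lambda doc: cosine(q, term_vector(doc)),
        reverse=True,
    )
    return best, ranked[:top_k]


# Toy clusters; assumed labels and texts, not results from the thesis.
toy_clusters = {
    "proteins": ["protein binding assay results", "protein folding kinetics"],
    "genomics": ["gene expression in yeast", "genome sequencing of yeast strains"],
}
query = "we measured gene expression levels in several yeast strains"
best_cluster, hits = search(query, toy_clusters)
```

Because the query is reduced to a term vector like any other document, a full abstract can be used as the query without special handling. Restricting the second step to one cluster keeps the number of comparisons small.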

Another part of this work reproduces the well-known Elastic Map algorithm in R as an open resource for data visualization. We used the R Elastic Map package we developed to present a zoomable and rotatable visualization of a map fitted to clustered data in two- and three-dimensional space.

History

Supervisor(s)

Katrin Leschke; Paul King

Date of award

2022-12-09

Author affiliation

School of Computing and Mathematical Science

Awarding institution

University of Leicester

Qualification level

  • Doctoral

Qualification name

  • PhD

Language

en
