MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition

Yeung, Cheng S; Beck, Tim; Posma, Joram M

journal contribution

posted on 2022-07-13, 09:09 authored by Cheng S Yeung, Tim Beck, Joram M Posma

Reviewing the metabolomics literature is becoming increasingly difficult because of the rapid expansion of relevant journal literature. Text-mining technologies are therefore needed to facilitate more efficient literature reviews. Here we contribute a standardised corpus of full-text publications from metabolomics studies and describe the development of two metabolite named entity recognition (NER) methods. Both methods are based on Bidirectional Long Short-Term Memory (BiLSTM) networks, and each incorporates different transfer learning techniques (for tokenisation and word embedding). Our first model (MetaboListem) follows prior methodology using GloVe word embeddings. Our second model exploits BERT and BioBERT for embedding and is named TABoLiSTM (Transformer-Affixed BiLSTM). The methods are trained on a novel corpus annotated using rule-based methods, and evaluated on manually annotated metabolomics articles. MetaboListem (F1-score 0.890, precision 0.892, recall 0.888) and TABoLiSTM (BioBERT version: F1-score 0.909, precision 0.926, recall 0.893) achieve state-of-the-art performance on metabolite NER. A training corpus of sentences from >1000 full-text Open Access metabolomics publications, containing 105,335 annotated metabolites, was created, as well as a manually annotated test corpus (19,138 annotations). This work demonstrates that deep learning algorithms are capable of identifying metabolite names accurately and efficiently in text. The proposed corpus and NER algorithms can be used for metabolomics text-mining tasks such as information retrieval, document classification and literature-based discovery, and are available from the omicsNLP GitHub repository.
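
The F1-scores quoted in the abstract can be related to the quoted precision and recall with a short calculation. The sketch below is not taken from the paper or from the omicsNLP repository; the function name `f1_score` is hypothetical, and it assumes the reported F1 is the harmonic mean of the reported precision and recall (the abstract does not state how the scores were aggregated).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as quoted in the abstract.
reported = {
    "MetaboListem": (0.892, 0.888),
    "TABoLiSTM (BioBERT)": (0.926, 0.893),
}

for model, (precision, recall) in reported.items():
    # Round to three decimals to match the precision used in the abstract.
    print(f"{model}: F1 = {f1_score(precision, recall):.3f}")
```

Under that assumption, the rounded values agree with the F1-scores given in the abstract (0.890 and 0.909).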

History

Author affiliation

Department of Genetics and Genome Biology, University of Leicester

Version

VoR (Version of Record)

Published in

Metabolites

Volume

12

Issue

4

Publisher

MDPI

issn

2218-1989

eissn

2218-1989

Acceptance date

2022-03-17

Copyright date

2022

Available date

2022-07-13

Publisher DOI

https://doi.org/10.3390/metabo12040276

Spatial coverage

Switzerland

Language

English

Publisher version

https://doi.org/10.3390/metabo12040276

Keywords

Science & Technology; Life Sciences & Biomedicine; Biochemistry & Molecular Biology; deep learning; named entity recognition; natural language processing; EXTRACTION; HMDB

Licence

CC BY 4.0
