Engineering Lossless Sequence Prediction with Compact Data Structures

Ktistakis, Rafael

doi:10.25392/leicester.data.19096553.v1

thesis

posted on 2022-01-31, 14:19 authored by Rafael Ktistakis

Sequences of symbols can be used to represent data in many domains, such as text documents, activity logs, customer transactions and website clickstreams. Sequence prediction is a popular task that consists of predicting the next symbol of a sequence, given a set of training sequences over the same symbol set. Although numerous prediction models have been proposed, many have low accuracy because they are lossy models (they discard information from the training sequences to build the model), while lossless models are often more accurate but typically consume a large amount of memory. This thesis addresses the challenges of lossless sequence prediction by re-engineering an existing state-of-the-art lossless sequence prediction algorithm with space-efficient data structures. We also propose a novel lossless sequence prediction model that overcomes most of the challenges that usually hamper lossless approaches. We utilise succinct data structures to compactly represent and efficiently access training sequences for prediction. In our experimental evaluations, our lossless SCPT and SUBSEQ prediction algorithms achieve very low and consistent memory consumption while maintaining competitive execution performance. Moreover, with SUBSEQ, we demonstrate excellent accuracy when compared to eight state-of-the-art prediction algorithms on seven real-life datasets. Finally, we further examine the significance of lossless approaches in the sequence prediction domain, and we present a new ensemble approach that blends lossy and lossless sequence prediction models for much improved accuracy.
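
To make the prediction task in the abstract concrete, the following is a minimal sketch of a naive lossless predictor: it keeps every training sequence verbatim and predicts the symbol that most often follows the longest suffix of the query found in the training data. It is an illustration only, not the SCPT or SUBSEQ algorithm from the thesis, and it uses no succinct data structures; the function name `predict_next` and the toy data are hypothetical.

```python
from collections import Counter


def predict_next(training, query):
    """Naive lossless next-symbol prediction (illustrative assumption, not SCPT/SUBSEQ).

    `training` is a list of sequences (lists of symbols), stored in full.
    `query` is the sequence whose next symbol we want to predict.
    Returns the most frequent follower of the longest matching suffix, or None.
    """
    for start in range(len(query)):  # try the longest suffix first
        suffix = query[start:]
        k = len(suffix)
        followers = Counter()
        for seq in training:
            # Slide over every position where the suffix could occur
            # and still be followed by at least one symbol.
            for i in range(len(seq) - k):
                if seq[i:i + k] == suffix:
                    followers[seq[i + k]] += 1
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no suffix of the query occurs in the training data


training = [list("abcd"), list("abce"), list("xbcd")]
prediction = predict_next(training, list("zbc"))
```

Here the full query `z b c` never occurs in the training data, so the sketch falls back to the suffix `b c`, which is followed by `d` twice and `e` once. Storing and scanning all sequences like this is what makes a naive lossless model memory-hungry and slow, which is the cost the thesis aims to reduce with compact data structures.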

Supervisor(s)

Rajeev Raman

Date of award

2021-11-10

Author affiliation

Faculty of Informatics

Awarding institution

University of Leicester

Qualification level

Doctoral

Qualification name

PhD

Language

en

Keywords

Lossless Sequence Prediction, Compact Data Structures, Engineering, thesis
