
Learning and Generalisation for High-dimensional Data

thesis
posted on 2021-11-30, 22:42 authored by Muhammad H. Alkhudaydi
Modern data-driven Artificial Intelligence models are built on large datasets that have recently become available to practitioners. Significant effort has been invested in gathering data and information, and the volumes of our data assets grow with time, bringing us into the era of Big Data. In many relevant problems, however, we face one particular class of Big Data: high-dimensional data with few samples or limited annotation. Such datasets are characterised by many attributes per record, while the number of separate records is often small or lacks annotation. We refer to these as high-dimensional low-sample size data. They arise in many important fields, such as medical image analysis (for example, asthma detection and treatment), financial data analysis, and bioinformatics, where the data have more attributes than observations. Note that the volumes of unlabelled data in these areas may in fact be large; however, for reasons beyond the control of AI practitioners (e.g. privacy, data protection laws, costs of human assessment, intellectual property), annotated data may not be fully available to them.

This kind of data poses many challenges for machine learning algorithms. Over-fitting and high variance are among the major problems, and they are only some of the facets of the grand challenge of learning and generalisation in high dimensions. This thesis focuses on the analysis of this problem and presents theoretical approaches and results that address learning and generalisation in high dimensions. The results are based on Concentration of Measure phenomena and Stochastic Separation Theorems [1], [2], [3], [4], [5], [6], and exploit ideas proposed in [7], [8], [9] that explain some fascinating properties of the real brain, namely the phenomenon known as concept cells. The new approaches and results can be applied to develop new methods and processing architectures. One of these, named forward propagation or concept cells processing, is presented in this thesis and illustrated with a case study of asthma diagnosis from 2D CT scans.

The thesis consists of six chapters. In Chapter 1, we provide an overview of the problems of machine learning and highlight the challenges of applying existing theories and approaches to high-dimensional low-sample datasets. In Chapter 2, we review classical concepts in machine and deep learning. In Chapter 3, we present the application of existing conventional methods to a particular high-dimensional low-sample benchmark problem: detection of asthma from CT scans. In Chapter 4, we present our contribution to addressing some of the key questions around generalisation and learning in high-dimensional models from high-dimensional data. Chapter 5 shows how these results can be applied to the benchmark problem considered in Chapter 3, and Chapter 6 concludes the thesis and presents our vision for possible future developments.
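The stochastic separation phenomenon that these results build on can be illustrated numerically. The sketch below is not taken from the thesis; it assumes an i.i.d. standard Gaussian data model and a simple half-way linear threshold rule, and shows that, as the dimension grows, a randomly chosen point is separated from the rest of a random sample by such a simple linear functional in an increasing fraction of trials.

```python
# Minimal numerical sketch (illustrative only, not the thesis' own code) of
# stochastic separation: in high dimension, a single random point can typically
# be separated from a whole random sample by a simple linear functional.
# Assumptions: i.i.d. standard Gaussian data and the hyperplane
# {y : <y, x> = <x, x>/2} as the separator.
import numpy as np

rng = np.random.default_rng(0)

def separation_rate(dim, n_samples=500, n_trials=100):
    """Fraction of trials in which the point x = data[0] is separated from the
    remaining points by the hyperplane {y : <y, x> = <x, x>/2}."""
    successes = 0
    for _ in range(n_trials):
        data = rng.standard_normal((n_samples, dim))
        x = data[0]
        others = data[1:]
        threshold = 0.5 * np.dot(x, x)
        # x is separated if every other point lies strictly below the threshold
        if np.all(others @ x < threshold):
            successes += 1
    return successes / n_trials

for dim in (10, 50, 200, 1000):
    print(f"dim={dim:5d}  separation rate={separation_rate(dim):.2f}")
```

In low dimension the rate is close to zero, while for dimensions in the hundreds it approaches one, which is the qualitative behaviour described by the Stochastic Separation Theorems the abstract cites.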

History

Supervisor(s)

Ivan Tyukin

Date of award

2021-07-23

Author affiliation

School of Mathematics and Actuarial Science

Awarding institution

University of Leicester

Qualification level

  • Doctoral

Qualification name

  • PhD

Language

en
