University of Leicester

2023SmithHPHD.pdf (9.07 MB)

Statistical and Machine Learning Approaches to Risk Prediction

Posted on 2023-12-13, 09:59; authored by Hayley Smith

In medical research, it is essential that models accurately predict the probabilities of future events so that we can implement preventative measures, predict patient prognosis, and decide on effective treatment plans. Currently, conclusions differ about the comparative performance of statistical and machine learning approaches. In this thesis, I compared and evaluated the discrimination and calibration of these approaches, specifically: the Cox model; Flexible Parametric model (FP); Multivariable Fractional Polynomial model (MFP); Random Survival Forest (RSF); and two neural networks. First, I conducted a methodological review of simulation studies comparing statistical and machine learning methods for risk prediction. Multiple articles reported only discrimination measures, had poor reporting standards, and simulated from data-generating mechanisms that were biased toward machine learning. This review informed the simulation study design. I then developed a novel approach to simulating survival data, in which data are generated from each risk prediction method.

The MFP and RSF models were the most accurate, especially with complex data. Because simulation studies rely on simulated data and require methods to be automated, I then compared the methods using a dataset from VICORI and an iterative model-fitting workflow of the kind used in prognostic research. RSF had the best performance, though including covariate relationships identified in the literature improved the statistical models. Both the simulation study and the VICORI analysis highlighted that good discrimination does not necessarily imply good calibration. Lastly, methods must be implemented in software for researchers to use them. Python is a popular programming language, but many survival methods are not available in it. I developed a Python package (asurvivalpackage) that implements key survival methods, increasing their accessibility. This thesis shows that statistical models can perform equivalently to machine learning models, such as RSF, when model implementation is carefully considered. It emphasises that rigorous evaluation of risk prediction models is vital in prognostic research: evaluating both discrimination and calibration, and improving reporting standards, are essential.



Paul Lambert; Tim Lucas; Michael Sweeting; Michael Crowther

Date of award


Author affiliation

Department of Health Sciences

Awarding institution

University of Leicester

Qualification level

  • Doctoral

Qualification name

  • PhD


