Model Selection, Union And Assembling In Practical Data Analysis: Methods And Case Study

Muhammad, Awaz K.

thesis

posted on 2018-08-23, 15:04 authored by Awaz K. Muhammad

The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data, and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumption’. Real data from almost 2000 respondents were analysed. My role was in data analysis and risk assessment. The central problem is the search for, and validation of, psychological predictors of the consumption of different drugs. Eight data mining algorithms were used for user/nonuser classification: decision trees, random forests, k-nearest neighbours, linear discriminant analysis, Gaussian mixtures, probability density function estimation by radial basis functions, logistic regression, and naïve Bayes. Correlation analysis based on Pearson’s correlation coefficient and on relative information gain revealed groups of drugs with strongly correlated consumption. Three correlation pleiades were identified. Classifiers with sensitivity and specificity greater than 70% were obtained for almost all classification tasks.

Secondly, several new methods and approaches to feature selection were proposed and tested on the drug consumption database and on several other publicly available databases. These methods include ‘double Kaiser selection’ for selecting the main factors (principal components) and main attributes. Considering each attribute as a distribution on factors allowed us to apply any Kaiser rule to feature selection as well. We developed a methodology for the creation and utilisation of controllable multicollinearity. Multicollinearity can be useful because it allows us to correct mistakes in data and to estimate missing data. It is undesirable because many statistical tasks become ill-conditioned. The alternative attribute sets approach (AASA) can determine several sets of relevant attributes, each of which can be used separately to solve the original problem. We tested AASA on several classification problems and demonstrated that this methodology can be more accurate than the best traditional feature selection methods.
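The sensitivity and specificity figures quoted in the abstract use the standard definitions: sensitivity is the fraction of actual users classified as users, and specificity is the fraction of actual non-users classified as non-users. The following is a minimal sketch of that calculation, not code from the thesis. The helper name `sensitivity_specificity` is hypothetical, and the labels are made up for illustration (1 = user, 0 = non-user).

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels: 1 = user, 0 = non-user."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # users predicted as users
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # users predicted as non-users
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-users predicted as non-users
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-users predicted as users
    # Guard against an empty class so the function never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Illustrative labels only (assumption: not taken from the drug consumption database).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Reporting both numbers, rather than accuracy alone, matters when users and non-users are unbalanced, since a classifier can reach high accuracy by always predicting the larger class.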

History

Supervisor(s)

Gorban, Alexander; Mirkes, Evgeny

Date of award

2018-06-26

Author affiliation

Department of Mathematics

Awarding institution

University of Leicester

Qualification level

Doctoral

Qualification name

PhD

Language

en

Administrator link

https://leicester.figshare.com/account/articles/10233956

Keywords

IR content
