General Notices

MSc Statistics Presentation: Shardha Ramlal

Posted Friday, January 10, 2025


The Department of Mathematics and Statistics will host an MSc Statistics Presentation on Monday, January 20 from 10:30 a.m. to 11:00 a.m. Ms. Shardha Ramlal will present on the topic "The Comparison Of Binary Classification Methods On Microarray Cancer Datasets Using Statistical/Machine Learning Techniques (Supervised Learning)".

Interested persons are invited to attend virtually via Zoom (Meeting ID: 976 3650 9288 | Passcode: 880462).

Abstract:

Aim: The aim of the study was to select the most appropriate models for cancer classification using microarray data.

Objective: To explore supervised learning models and machine learning techniques to analyze microarray data.

Methods: Four microarray cancer datasets were examined: colon cancer, leukemia, breast cancer and ovarian cancer. Differentially expressed genes were selected using a univariate filter approach based on the Welch t-test, with the false discovery rate controlled at a threshold of q-values ≤ 0.05.
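The gene filtering step described above can be illustrated with a minimal sketch. It is not the presenter's code: the function names (`welch_t`, `bh_qvalues`, `select_genes`) are hypothetical, and the Benjamini-Hochberg adjustment is an assumption, since the abstract does not say how the q-values were computed. The sketch computes the Welch statistic and its Welch-Satterthwaite degrees of freedom for one gene, then adjusts a list of p-values and keeps the genes with q ≤ 0.05. Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom
    for one gene measured in two groups (unequal variances allowed)."""
    va = variance(a) / len(a)  # squared standard error, group a
    vb = variance(b) / len(b)  # squared standard error, group b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.
    Assumption: the abstract's q-values may come from another method."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def select_genes(pvalues, threshold=0.05):
    """Indices of genes whose adjusted p-value is at or below the threshold."""
    return [i for i, q in enumerate(bh_qvalues(pvalues)) if q <= threshold]
```

With the made-up p-values `[0.01, 0.04, 0.03, 0.20]`, only the first gene passes the 0.05 threshold after adjustment, even though three of the raw p-values are below 0.05.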

Nine statistical and machine learning models were compared under supervised learning: logistic regression (LR), elastic net logistic regression (EN), linear discriminant analysis (LDA), decision trees (DT), random forest (RF), support vector machines (SVM) with linear kernel, SVM with radial kernel, neural networks (NN), k-nearest neighbours (KNN) and naïve Bayes (NB). Performance was evaluated primarily using accuracy, area under the receiver operating characteristic curve (AUC-ROC), F1 score and normalized Matthews correlation coefficient (normMCC).
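As an illustration of the threshold-based metrics named above, the sketch below computes accuracy, F1 and the Matthews correlation coefficient from a binary confusion matrix. The abstract does not define normMCC; the rescaling (MCC + 1) / 2, which maps the range [-1, 1] onto [0, 1], is assumed here as a common convention. The function name `binary_metrics` and the example counts are hypothetical.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1, MCC and normalized MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return {
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
        # Assumed definition: rescale MCC from [-1, 1] to [0, 1].
        "norm_mcc": (mcc + 1) / 2,
    }


# Hypothetical counts for illustration only, not taken from the study.
example = binary_metrics(tp=40, fp=5, fn=10, tn=45)
```

Under this assumed definition, a normMCC of 50% corresponds to chance-level agreement (MCC of 0), so it is read on a different baseline from accuracy or F1.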

Results: Selection of differentially expressed genes yielded 233 genes (colon), 1571 genes (leukemia), 286 genes (breast) and 9117 genes (ovarian). The best-performing model for colon cancer was the neural network, with an accuracy of 90.3%, an AUC-ROC of 88.0%, an F1 score of 92.5% and a normMCC of 89.5%. The best models for leukemia were linear discriminant analysis, random forest and SVM with linear kernel, with an accuracy of 98.6%, an AUC-ROC of 100.0%, an F1 score of 98.5% and a normMCC of 97%. For breast cancer, KNN (5-NN) performed best, with an accuracy of 83.5%, an AUC-ROC of 88.0%, an F1 score of 83.3% and a normMCC of 83.7%. EN, LDA and SVM with linear kernel outperformed the other models on the ovarian cancer dataset, each achieving 100.0% for accuracy, AUC-ROC, F1 score and normMCC.

Conclusion: The classifiers performed differently across the datasets, and no single model was best on all four.