Abstract
Background/Aim: Machine learning (ML) models are often developed to predict cancer prognosis but rarely consider spatial factors in a region. Hence, this study explored machine learning algorithms utilising Local Government Areas (LGAs) in Queensland, Australia to spatially predict 3- and 5-year prognosis of oral cancer patients and to provide clinical interpretability of the outcome predicted by the ML model. Patients and Methods: Data from a total of 3,841 oral cancer patients were retrieved from the Queensland Cancer Registry (QCR). The synthetic minority oversampling technique combined with edited nearest neighbours (SMOTE-ENN) was used to pre-process unbalanced datasets. Five ML models were trained: logistic regression, random forest classifier, XGBoost, Gaussian Naïve Bayes and Voting Classifier. Predictive features were age, sex, LGA, tumour site and differentiation. Outcomes were 3- and 5-year overall survival of patients. Model performance on the test set was evaluated using area under the curve and F1 scores. The SHapley Additive exPlanations (SHAP) method was applied to the best performing model to interpret the predicted outcome. Results: The Voting Classifier was the best performing model with F1 scores of 0.58 and 0.64 for 3- and 5-year overall survival, respectively. Age was the most important feature in the Voting Classifier in 3- and 5-year prognosis prediction. LGA at diagnosis was among the top three predictive features for both 3- and 5-year models. Conclusion: The Voting Classifier demonstrated the best overall performance in classifying both 3- and 5-year overall survival of oral cancer patients in Queensland. The SHAP method provided clinical understanding of the predictive features of the Voting Classifier.
Squamous cell carcinoma involving the oral cavity is the most common malignancy arising from the head and neck region (1). Globally, 5-year overall survival rates post-diagnosis remain poor, and only around 50% of patients survive for more than five years (1). Despite advances in multimodality treatments of curative surgery, chemo-radiotherapy, immunotherapy and other targeted therapies, late diagnosis and aggressive primary tumours have hampered efforts to reduce cancer mortality and morbidity. Early diagnosis and better prognosis prediction are still the ultimate goals for better overall and disease-specific prognosis, especially for high-risk oral cancer patients (2, 3). Therefore, targeting this group of patients for timely treatment may improve efficacy and response towards oncological treatments (4, 5).
Whilst cancer may be influenced by lifestyle and genetic susceptibility, it may also be affected by environmental exposure (6). Spatially and temporally, disease mapping of cancer provides an explanation and prediction of disease outcome patterns within a well-defined geographic region. Our group has previously delineated at-risk regions and those at risk of oral cancer within the Hong Kong population using Bayesian disease mapping (7). To date, even though the use of machine learning modelling for oral cancer prognosis prediction has grown exponentially, most machine learning predictions are based on the AJCC TNM staging and clinicopathologic profiling (8, 9). Of note, spatial units within a region are rarely taken into account as input in prognosis prediction algorithms for cancer.
Queensland, which is located in the northeast of Australia, is divided into 78 spatial units gazetted as Local Government Areas (LGAs). Targeting at-risk hotspots, and in particular identifying high-risk oral cancer patients in Queensland, may assist the allocation of budget and resources for future targeted interventions and screening strategies. Hence, we aimed to explore machine learning algorithms utilising LGAs in Queensland to spatially predict 3-year and 5-year prognosis of oral cancer.
Importantly, there is a strong emphasis on the explainability and re-traceability of machine learning models to enable their proper implementation in the future (10, 11). It is crucial to understand and provide an interpretation of how a machine learning model reached its output in understandable, non-specialised terms (12). Hence, in this study, we utilised SHapley Additive exPlanation (SHAP) to facilitate interpretation of the best performing machine learning prediction model (13). In addition, it is beneficial to understand whether the machine learning model agrees with experts and clinicians on which predictors are important. Thus, understanding how the model proposes decisions may further allow adoption of the model in an oncology treatment setting.
Patients and Methods
Patient dataset and ethical approval. Approval to conduct this retrospective study was obtained from the James Cook University Human Research Ethics Committee (Ref. H8609) and further approval under the Public Health Act 2005 provided by Queensland Health. The Queensland Cancer Registry (QCR) was accessed for the period 1982 (when data were first compiled) to 2018 (most recent available data); the dataset was received as a de-identified, password protected spreadsheet and managed under the Australian Code for the Responsible Conduct of Research.
Data from patients diagnosed with oral squamous cell carcinoma (OSCC) between 1st January 1982 and 31st December 2018 were retrieved. Only patients aged 18 years or older at the time of diagnosis with a minimum follow-up duration of 5 years were included in the development and training of the machine learning models. After data cleaning, a total of 3,841 patients met the inclusion criteria. The parameters included were age at diagnosis, sex, primary tumour site and tumour differentiation as listed in Table I. Local Government Area (LGA), classified as cities, regions and shires, was used to spatially stratify patients accordingly. Key dates included the date of diagnosis and the date of death, if the patient had died. The censoring date was set as 31st December 2018. The primary outcomes were 3-year and 5-year overall survival. Overall survival was modelled as a binary classification, with patients classified as either alive or dead, where death included death from cancer as well as death from other causes.
Table I. Variables and input features fed into the machine learning models.
Machine learning prediction algorithms. Machine learning models focused on binary classification, matching our study outcomes (alive or dead at 3 years and 5 years post diagnosis), were developed to assess their performance in predicting 3-year and 5-year overall survival of OSCC patients in Queensland, Australia. Models evaluated included logistic regression, random forest classifier (14), XGBoost (15) and Gaussian Naïve Bayes (16). Random forest classifier and XGBoost were chosen for comparison as they are ensemble algorithms that combine weak learners and control over-fitting. Gaussian Naïve Bayes, on the other hand, has been applied in many medical classification problems. A voting ensemble classifier, combining multiple classifiers with the best performance in the training cohort, was fitted. Herein we implemented the hard voting ensemble, where the majority class, as voted by the three individual classifiers of logistic regression, random forest classifier and Gaussian Naïve Bayes, was chosen as the final predicted outcome.
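The hard voting rule can be illustrated with a minimal Python sketch. It assumes each base classifier has already produced a 0/1 label per patient; the function name `hard_vote` and the example labels are hypothetical and are not taken from the study data. scikit-learn offers the same rule through `VotingClassifier` with `voting="hard"`, although the exact configuration is not specified in the text.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label per patient.

    `predictions` holds one list of labels per base classifier,
    all of the same length (one entry per patient).
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical labels from three base classifiers for four patients
# (1 = died within the time window, 0 = alive; this coding is an assumption).
logistic_regression = [1, 0, 1, 0]
random_forest = [1, 1, 0, 0]
naive_bayes = [0, 1, 1, 0]

ensemble = hard_vote([logistic_regression, random_forest, naive_bayes])
```

With three voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.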
Model training and internal validation. The data were split into training and testing cohorts at a ratio of 70:30, in which the training cohort consisted of data from 2,688 patients while the testing cohort comprised data from 1,153 patients. The training cohort is required for training the machine learning model, whilst the testing cohort, unseen during model training, is used to validate the performance of the trained model. Class imbalance (i.e., there were many more patients who were still alive compared to those who had passed away) was handled by a combination of oversampling the minority class using the synthetic minority oversampling technique (SMOTE) and under-sampling the majority class using edited nearest neighbours (ENN) (17). SMOTE is an oversampling method that generates new synthetic examples of the minority class. Though the process improves accuracy, it may introduce noisy samples that the algorithm cannot interpret correctly. Hence, ENN is subsequently applied to remove observations whose outcome class differs from that of their nearest neighbours, in order to improve label purity.
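The two resampling steps can be sketched in plain Python on toy two-dimensional data. This is an illustration under stated assumptions, not the study's pipeline: the functions `k_nearest`, `smote` and `enn`, the toy points and the choice of k are all hypothetical, and the ENN step here uses a simple majority rule, whereas library implementations offer variants of that rule.

```python
import math
import random


def k_nearest(point, others, k):
    """Return indices of the k rows in `others` closest to `point` (Euclidean)."""
    order = sorted(range(len(others)), key=lambda i: math.dist(point, others[i]))
    return order[:k]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority rows by interpolating towards a neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        rest = minority[:i] + minority[i + 1:]
        neighbour = rest[rng.choice(k_nearest(base, rest, k))]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def enn(X, y, k=3):
    """Drop rows whose label disagrees with the majority of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        others = X[:i] + X[i + 1:]
        labels = y[:i] + y[i + 1:]
        votes = [labels[j] for j in k_nearest(X[i], others, k)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: label 0 = majority class, label 1 = minority class.
# The point (5.1, 5.0) is a majority-class row sitting inside the minority cluster.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.1, 5.0)]
minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

synthetic = smote(minority, n_new=3)
X = majority + minority + synthetic
y = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
X_clean, y_clean = enn(X, y)
```

In this toy case the oversampling step adds three interpolated minority rows, and the editing step then removes the majority-class row surrounded by minority neighbours.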
Hyperparameters are parameters whose values control the learning process. In this study, hyperparameter optimization was performed using the grid search method with 5-fold cross-validation on the training cohort to determine the parameters with the best performance. The internal validation cohort (testing cohort), unseen during training and cross-validation, was used to assess the performance of the machine learning algorithms. Performance measures generated from the internal validation dataset were the basis for comparison of the algorithms in this study.
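Grid search with 5-fold cross-validation can be sketched as follows. The example is deliberately small and rests on assumptions: the classifier is a hypothetical one-feature k-nearest-neighbour rule, the grid is over k only, and the ages and outcomes are invented. The hyperparameter grids and models actually tuned in this study are not reported; scikit-learn provides `GridSearchCV` for the same procedure.

```python
def kfold_indices(n, n_folds=5):
    """Split range(n) into n_folds contiguous validation folds."""
    sizes = [n // n_folds + (1 if i < n % n_folds else 0) for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train_x, train_y, x, k):
    """Predict 1 if more than half of the k nearest training labels are 1."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = [train_y[i] for i in nearest]
    return 1 if sum(votes) * 2 > len(votes) else 0


def cv_accuracy(x, y, k, n_folds=5):
    """Mean validation accuracy of the k-NN rule across the folds."""
    scores = []
    for val in kfold_indices(len(x), n_folds):
        held_out = set(val)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(x)) if i not in held_out]
        correct = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in val)
        scores.append(correct / len(val))
    return sum(scores) / len(scores)


def grid_search(x, y, grid):
    """Evaluate every candidate in the grid and return the best one with all scores."""
    results = {k: cv_accuracy(x, y, k) for k in grid}
    best = max(results, key=results.get)
    return best, results


# Invented toy data: age at diagnosis and a made-up binary outcome.
ages = [40, 75, 45, 80, 50, 70, 55, 85, 42, 78, 48, 82, 52, 72, 58, 88, 44, 76, 46, 90]
died = [1 if a >= 65 else 0 for a in ages]

best_k, scores = grid_search(ages, died, grid=[1, 3, 5])
```

The held-out testing cohort plays no part in this loop; it is used only once, after the best setting has been chosen on the training cohort.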
Model performance measures and model prediction explainability. To describe and evaluate the discriminative performance of the machine learning algorithms, area under the curve (AUC) and accuracy scores were calculated. Recall and F1 scores were also reported as they are more clinically relevant and applicable (18): recall measures how many of the positive cases the classifier correctly predicted (sensitivity), while F1 combines recall and precision. As it is crucial in an oncology setting to detect true positives and limit false negatives so that OSCC patients receive timely and effective treatment, the F1 score was considered the most relevant metric when assessing the 3-year and 5-year overall survival classification. For better clinical interpretability, SHapley Additive exPlanation (SHAP) values and SHAP interaction values were also presented (13). SHAP values represent how much weight is given to a variable in predicting the final outcome of a model, and provide a better understanding of the predictions made by the machine learning classifiers.
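For reference, precision, recall and F1 follow directly from the counts of true positives, false positives and false negatives. The sketch below computes them in plain Python; the function names and the example labels are hypothetical, and treating label 1 as the positive class is an assumption, since the text does not state which outcome was coded as positive.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Invented labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Because F1 is the harmonic mean of precision and recall, it is pulled towards the lower of the two, which is why it penalises a model that misses many positive cases even when its accuracy looks acceptable.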
Statistical analysis and computation. Descriptive statistics were computed using SPSS for Windows version 27.0 (IBM Corp., Armonk, NY, USA). All classification algorithms were developed and run in Python v3.10.4 using the scikit-learn (sklearn) (19) and XGBoost (15) packages. The SHAP library was applied to calculate SHAP values and generate interaction plots of the predictive variables in the machine learning models.
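The idea behind SHAP values can be illustrated by computing exact Shapley values for a very small model. The sketch below is an assumption-laden toy rather than the SHAP library's algorithm: the scoring function `toy_risk`, its coefficients, the patient and the baseline are all invented, and "removing" a feature is simplified to replacing it with a single baseline value instead of averaging over background data.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]  # switch feature j from baseline to the patient's value
            updated = predict(current)
            phi[j] += updated - previous
            previous = updated
    return [value / len(orderings) for value in phi]


def toy_risk(features):
    """Invented risk score with an age x differentiation interaction; sex has no effect."""
    age, poorly_differentiated, sex = features
    return 0.01 * age + 0.2 * poorly_differentiated + 0.004 * age * poorly_differentiated


patient = (70, 1, 1)    # hypothetical patient: age 70, poorly differentiated tumour
baseline = (60, 0, 0)   # hypothetical reference patient

phi = shapley_values(toy_risk, patient, baseline)
total_shift = toy_risk(patient) - toy_risk(baseline)
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that makes per-patient plots readable. Enumerating all orderings is feasible only for a handful of features; with the five predictors used here it would require 120 orderings per patient, and libraries rely on faster approximations.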
Results
Patient characteristics. The study data included 3,841 patients with squamous cell carcinoma arising from the oral cavity region who met the inclusion criteria, as presented in Table II. Of the included patients, 2,490 were male and 1,351 were female. Median age at diagnosis was 63 years (IQR=55-72 years). The number of patients was highest in Brisbane City, which comprised 972 patients (25.3%), followed by Gold Coast City with 373 patients (9.7%) and Whitsunday Regional with 309 patients (0.78%), diagnosed over the course of 36 years. Most primary lesions involved the anterior tongue (48.1%). Distribution according to tumour differentiation showed that 656 (17.1%) patients had well differentiated tumours, 2,454 (63.9%) patients had moderately differentiated tumours and 731 (19.0%) patients had poorly differentiated or undifferentiated tumours. At the time of censoring, 1,572 patients (40.9%) had died within three years of diagnosis. Five-year overall survival was 51% in this study cohort.
Table II. Demographics and clinical-pathologic characteristics.
Performance of machine learning models. Following division of the data into training and validation datasets, 2,688 patients (70%) and their corresponding data were used for training and 5-fold cross validation of the machine learning classifiers while 1,153 patient data (30%) which were previously unseen during training were used for internal validation of the algorithms. Addressing the potential of an imbalanced dataset, training cohorts were handled with SMOTE-ENN. As a result, a total of 824 and 538 unique patient data were input into the training cohort for 3-year and 5-year overall survival prediction, respectively.
The discriminative performance and other performance metrics of the models are presented in Table III. Overall, in the training phase, XGBoost demonstrated the highest accuracy of 0.89, while both XGBoost and Random Forest showed the highest AUC of 0.95 for the prediction of 3-year overall survival (Table III and Figure 1). The prediction model was then applied to the test cohort. An accuracy score of 0.61 was achieved using XGBoost and all models achieved an AUC of only 0.61. Similarly, XGBoost was the most robust during the training phase for 5-year overall survival prediction with an accuracy of 0.88 and AUC of 0.94. Overall, most models for 5-year overall survival prediction had an accuracy score of 0.62. All models achieved an AUC of at least 0.62 in the testing cohort (Table III).
Table III. Performance measures of machine learning models for prediction of 3-year and 5-year overall survival.
Figure 1. Area under the receiver operating characteristic curve based on machine learning models predicting 3-year and 5-year overall survival.
In the machine learning models predicting 3-year overall survival, models achieved at least an F1 score of 0.54. The highest score of 0.58 was achieved by the Voting Classifier. The F1 score for machine learning algorithms for 5-year overall survival increased, with Voting Classifier achieving a score of 0.64. When considering performance using the F1 score, the best overall performance was achieved with the hard Voting Classifier for 3-year and 5-year overall survival (Table III).
SHAP summary plot. Summary plots presenting SHAP values of the 5 predictive features for each machine learning model based on the test set are shown in Figure 2a and 2b. In summary, the x-axis denotes the SHAP values, while predictive variables are presented along the y-axis according to their weights. Each dot on the summary plot corresponds to an input from a single patient, and dots pile up vertically to show the density of those with the same SHAP value. The position of the dot on the x-axis shows the impact of the feature on that prediction of that particular machine learning model. The effect size and its distribution are reflected in the tails of the plot. For the 3-year and 5-year overall survival plots, red corresponds to a higher value of a variable while blue corresponds to a lower value.
Figure 2. Summary plots for SHAP values based on Voting Classifier predicting 3-year (a) and 5-year (b) overall survival.
Overall, age at diagnosis was observed to be the most important feature in the Voting Classifier predicting 3-year and 5-year overall survival. Increasing age was associated with a poorer prognosis, while those of younger age had better overall survival. For 3-year overall survival, the top three features were age at diagnosis, LGA at diagnosis and tumour differentiation. On the other hand, with regard to 5-year overall survival, the top three predictive features were age, tumour site and LGA at diagnosis. Sex was the least important variable contributing to outcome in both 3- and 5-year survival when using the Voting Classifier algorithm.
Discussion
Oral cancer is an aggressive disease that affects speaking, eating and swallowing. More often than not, patients want to be informed of their cancer diagnosis and prognosis, particularly how long they are expected to live. Hence, robust predictive tools can provide healthcare providers with the necessary information to guide treatment and patient discussion (20). This study utilised retrospective data collected from patients diagnosed with OSCC in Queensland, Australia over the past 36 years to build machine learning algorithms to predict 3-year and 5-year overall survival. A comparison of XGBoost, Random Forest, Logistic Regression, Gaussian Naïve Bayes and Voting Classifier was conducted to evaluate their performance using AUC and F1 score as metrics. As the Voting Classifier takes into account the predictions of three other machine learning models, the prediction of the Voting Classifier was also interpreted with SHAP summary plots.
We observed an evolving and growing number of machine learning models for clinical prediction, for diagnostic or prognostic purposes, particularly in oral cancer (9). To further the implementation of machine learning modelling for prognosis prediction in actual clinical settings, it is a prerequisite to make these models interpretable and understandable. Moreover, oncology care is currently undergoing a paradigm shift towards personalisation and precision. Making use of the myriad data types and subsequently interpreting individual patient prognosis requires significant time and expertise. Hence, SHAP values and summary plots can be utilised to interpret these predictions and demonstrate variable importance in a more time-efficient manner that can be better understood by personnel with little training in machine learning. SHAP summary plots provide a concise figure by visually demonstrating the range and distribution of importance of the features on the machine learning models’ output. Individualised plots can also be generated using SHAP force plots (Figure 3) to present an explanation based on the patient’s routine assessment. Healthcare providers and patients can review the information directly with the inputted data and better plan for further monitoring or treatment.
Figure 3. SHAP force plot displaying predictive features for an individual patient.
After understanding how predictive features or variables impact OSCC patients, healthcare providers can place more emphasis and resources on those at higher risk of more aggressive cancers or death. The most important predictive features were age at diagnosis, LGA and tumour differentiation for the prediction of 3-year overall survival in OSCC patients. In contrast, age at diagnosis, tumour site and LGA were more important for the prediction of 5-year overall survival in this Queensland dataset. Selecting patients for appropriate and timely treatment based on their respective SHAP values, together with the expertise of the healthcare providers, may lead to better treatment efficacy and resource allocation, thereby improving overall prognosis. This is especially useful when resources are limited or affected, particularly in rural areas. Moreover, since Queensland is spatially organized into LGAs, local government may also reference the SHAP values when allocating medical resources and healthcare providers according to LGA, as LGA at diagnosis was heavily weighted as an important predictive feature in the summary plots presented in Figure 2a and b for 3-year and 5-year overall survival, respectively. In short, the summary plots may be used as a guide for budget and resource allocation for oral cancer patients in high-risk areas in Queensland.
Before interpretation of the results presented in this study, several limitations should be taken into account. The performance of this Voting Classifier is yet to be externally validated. As LGA in Queensland is used as a predictive feature to stratify the risk of oral cancer patients geographically, prospective data can be collected to further validate and assess the real performance of the classifier. Moreover, this study was limited to the use of demographic data (age, sex and LGA) and clinical data (tumour site and differentiation). These are the most easily obtainable data types and are available at the time of diagnosis. However, oral cancer is an aggressive and heterogeneous cancer, and many factors at the clinicopathological, histological and molecular levels can affect overall survival of patients (21-27). Treatment information would also contribute to the prediction of the machine learning model (28). Retrieving more data from these patients and modifying these machine learning models to incorporate multiple factors that go beyond tumour differentiation and staging could potentially improve accuracy, precision and recall in predicting overall survival among patients with OSCC in Queensland. Moreover, lifestyle factors, such as smoking and drinking, should also be included as predictive features as these etiologic differences may affect overall survival of the patient. Reports have suggested that overall survival of OSCC patients might also be affected by existing co-morbidities that lead to other systemic diseases (29, 30).
Whilst LGA data are of considerable interest in identifying geographic regions displaying high disease incidence, they are inevitably influenced by large population numbers in urban regions and patient referrals to tertiary head and neck centres located in city centre hospitals, such as in Brisbane and the Gold Coast. Nonetheless, previous studies suggest OSCC incidence and mortality are likely to be higher in regional and remote regions where low socio-economic status, increased risk factor behaviour, limited access to healthcare services and large Indigenous population numbers influence disease status (31).
Future directions for machine learning modelling of oral cancer prognosis can focus on time-to-event algorithms to predict and stratify oral cancer patients temporally based on demographic, clinical, pathologic and treatment information. Even with advances in treatment and therapeutic strategies, risk of recurrence and survival probability remain poor in OSCC patients. The machine learning algorithms presented here are mostly static and do not take into account the dynamic and heterogeneous nature of the cancer and patient. Clinical prognostic prediction algorithms that can handle time-to-event data may be superior in providing risk estimates at appropriate timeframes for better oncological monitoring and timely treatment. When available, these models may assist healthcare providers in selecting patients for timely introduction of multimodality interventions.
Conclusion
This study utilized binary classification algorithms to model and predict 3- and 5-year overall survival of oral cancer patients diagnosed across LGAs in Queensland. The hard Voting Classifier demonstrated the best overall performance in classifying both 3- and 5-year overall survival. SHAP values indicated the importance of the respective features in the Voting Classifier. Age at diagnosis, LGA and tumour differentiation were important predictive features for prediction of 3-year overall survival in OSCC patients. For the prediction of 5-year overall survival, age at diagnosis, tumour site and LGA were more important predictive features. This study calls for further inclusion of clinicopathological information to improve the discriminative performance of the Voting Classifier before actual implementation in the clinical setting in Queensland.
Footnotes
Authors’ Contribution
Jia Yan Tan – Wrote the manuscript, conducted the machine learning. John Adeoye – Advised on manuscript and advised on the machine learning. Peter Thomson – Read and critiqued manuscript. Dileep Sharma – Acquired data from public database. Poornima Ramamurthy – Acquired data from public database. Siu-Wai Choi – Critiqued manuscript for intellectual content and re-wrote passages for readability.
Conflicts of Interest
The Authors report no conflicts of interest.
- Received August 15, 2022.
- Revision received August 30, 2022.
- Accepted October 5, 2022.
- Copyright © 2022 International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved.









