Abstract
Background/Aim: Many cancer patients face multiple primary cancers. It is challenging to find an anticancer therapy that covers both cancer types in such patients. In personalized medicine, drug response is predicted using genomic information, which makes it possible to choose the most effective therapy for these cancer patients. The aim of this study was to identify chemosensitive gene sets and compare the predictive accuracy of response of cancer cell lines to drug treatment, based on both the genomic features of cell lines and cancer types. Materials and Methods: In this study, we identified a gene set that is sensitive to a specific therapeutic drug, and compared the performance of several predictive models using the identified genes and cancer types through machine learning (ML). To this end, publicly available gene expression datasets and drug sensitivity datasets of gastric and pancreatic cancers were used. Five ML algorithms, including linear discriminant analysis, classification and regression tree, k-nearest neighbors, support vector machine and random forest, were implemented. Results: The predictive accuracies of the cancer type models were 0.729 to 0.763 on the training dataset and 0.731 to 0.765 on the testing dataset. The predictive accuracies of the genomic prediction models were 0.818 to 1.0 on the training dataset and 0.759 to 0.896 on the testing dataset. Conclusion: Performance of the specific gene models was much better than that of the cancer type models using the ML methods. Therefore, the most effective therapeutic drug can be chosen based on the expression of specific genes in patients with multiple primary cancers, regardless of cancer types.
- Multiple primary cancers
- chemosensitivity prediction
- gene expression
- cancer type
- machine learning
When more than one tumor in the same or a different organ is seen in a single patient, multiple primary tumors may be present (1-3). According to epidemiological studies, the frequency of multiple primary tumors is reported to be in the range of 2-17% (1, 4-7). In addition, when two active malignancies are diagnosed concurrently in the same patient, it is challenging to find an anticancer therapy that covers both cancer types (a) without increased toxicity or relevant pharmacological interactions and (b) without a negative impact on the overall outcome (1). Zhai et al. (2018) reported that the most common cancer pairs were digestive-digestive tumors among multiple primary malignant tumors (8). In particular, gastric cancer and pancreatic cancer are included in the WHO digestive cancer classification, and these two cancers are likely to be involved in cases of multiple primary cancers.
Gastric cancer is the fifth most commonly diagnosed cancer type and ranks as the third leading cause of cancer-related death worldwide (9-11). In general, the incidence rate is about two times higher in men than in women (9, 12). Around 1 million new cases of gastric cancer were recorded globally in 2018, with an estimated 783,000 deaths. This means that about one in every 12 cancer-related deaths worldwide is caused by gastric cancer. According to data from the World Health Organization, Asia ranks first in the world in the incidence, mortality, and 5-year prevalence rate of gastric cancer.
Pancreatic cancer is a lethal disease with poor early diagnosis and a limited number of therapeutic options, and is associated with a high number of cancer-related deaths. Despite improvements in survival for most cancer types in the last decade, pancreatic cancer is falling behind because there has been limited progress in diagnostic methods and effective targeted therapeutic interventions. Although various clinical trials showed that combination therapy is more efficacious than monotherapy in advanced pancreatic cancer, the side-effects of combination therapy are usually much more severe than those of monotherapy and many patients cannot tolerate the side effects of combination chemotherapy, such as FOLFIRINOX.
Most cancer-predisposing mutations confer susceptibility to cancer at multiple sites. Chemotherapy targets all rapidly growing cells, not only cancer cells, and is thus often associated with unpleasant side-effects. The side-effects of chemotherapy may be more severe in patients with multiple primary cancers. Therefore, an examination of chemosensitivity based on genotype is needed in order to reduce the incidence and severity of side-effects. A key goal of precision medicine is to predict the best drug therapy for a specific patient using genomic information. In oncology, cancers that appear pathologically similar can vary greatly in how they respond to the same drug.
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence and one of its important applications is in the field of drug discovery. Consequently, the total number of papers published in drug discovery fields with machine learning techniques is increasing every year.
The aim of this study was to compare the predictive accuracy of response of cancer cell lines to drug treatment, based on both the genomic features of the cell lines and cancer types. In order to fulfill the purpose of the study, chemosensitivity predictive models were implemented with publicly available drug sensitivity datasets and gene expression datasets of gastric and pancreatic cancers.
Materials and Methods
Data preparation. Three publicly available gene expression and drug sensitivity datasets were used in this study. The two expression datasets are accessible from a public microarray database [Gene Expression Omnibus (GEO), GSE64604 and GSE77850]. These datasets consist of 16 gastric cancer cell lines and 29 pancreatic cancer cell lines. Both datasets include 41,000 probes, which were summarized to 19,566 gene symbols for this study.
The chemosensitivity dataset consists of 144 cell lines and four chemosensitivity measurements, including the IC50 scores (half-maximal inhibitory concentration; the concentration of a drug required for 50% growth inhibition in vitro) of 22 components. The remaining chemosensitivity measures are as follows: (a) EC50 (half-maximal effective concentration), defined as the concentration required to obtain 50% of the maximal effect; (b) ActArea, representing the area between the drug-response curve and a fixed reference; and (c) Amax, which is the maximum activity value. The 144 cell lines were divided into seven types of cancers. Gastric cancer and pancreatic cancer cell lines were analyzed in detail because they represent digestive-digestive tumor pairs.
Machine learning methods. In recent decades, the rapid advancement of computational algorithms and the increased availability of big data have enabled artificial intelligence (AI) to improve the predictive performance of models in various research areas. In particular, machine learning (ML), a major branch of AI, has been widely applied to drug discovery and development in order to (a) predict treatment effects, (b) identify target genes as well as functional pathways, and (c) select potential biomarkers. The ML algorithms in this study are as follows.
Linear discriminant analysis. Linear discriminant analysis (LDA) is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or for dimensionality reduction before later classification. LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. LDA is a dimension-reduction technique which is commonly applied to supervised classification problems. It is used to model differences between groups, i.e., separating two or more groups from each other. It is also used to project features in higher dimensional space into lower dimensional space.
Classification and regression tree (CART). A classification and regression tree (CART) is a predictive model that explains how certain outcome variables can be predicted based on other values. A CART output is a decision tree where each fork is a split in a predictor variable and each end node contains a prediction for the outcome variable. CART is a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.
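The way a classification tree chooses a single fork can be illustrated with a minimal sketch. It assumes Gini impurity as the splitting criterion, which is a common choice for classification trees but is not stated in this paper, and it uses Python rather than the R code of the study. The expression values and response labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best single split on one predictor."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical expression values of one gene and made-up response labels
expression = [2.1, 2.4, 2.9, 5.0, 5.3, 6.1]
response = ["resistant"] * 3 + ["sensitive"] * 3
threshold, impurity = best_split(expression, response)
print(threshold, impurity)
```

A full tree repeats this search over all predictors and then recursively within each resulting subset until a stopping rule is met.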
k-Nearest neighbors (KNN) method. The k-nearest neighbors algorithm is one of the simplest techniques used in ML. It is used for both classification and regression. It is preferred by many in the industry because of its ease of use and low calculation time. When implementing KNN, the first step is to transform data points into feature vectors; the algorithm works by finding the distance between the vectors of these points. The most common way to find this distance is to use the Euclidean distance, as shown below.
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
where p and q are two arbitrary points in n-dimensional space, and the subscripts represent dimensions.
KNN uses this formula to compute the distance between the test data point and each point in the training data. It then selects the k training points closest to the test point and assigns the test point to the class that is most common among these neighbors.
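The distance calculation and neighbor vote described above can be sketched in a few lines. This is an illustrative Python example, not the R implementation used in the study; the function names, the value k=3 and the two-dimensional feature vectors are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-gene expression vectors with made-up response labels
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["sensitive"] * 3 + ["resistant"] * 3
predicted = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
print(predicted)
```

In practice k is usually chosen by cross-validation, and features are scaled beforehand so that no single gene dominates the distance.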
Support vector machine (SVM). A support vector machine (SVM) is a supervised learning model with associated learning algorithms that is used for both classification and regression analysis. It was developed at AT&T Bell Laboratories by Vapnik and colleagues. The objective of the SVM algorithm is to find a hyperplane in N-dimensional space (where N=the number of features) that distinctly classifies the data points. Support vectors are the data points closest to the hyperplane, and they determine its position and orientation. Using these support vectors, the margin of the classifier is maximized. Deleting the support vectors will change the position of the hyperplane.
Random forest. Random forest (RF) is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees. Because the RF algorithm can solve both regression and classification problems, it is a versatile model that is widely used by engineers. All machine learning models were implemented using the R programming language. All statistical analyses including principal component analysis (PCA) were conducted using R (version 4.0.1) with p<0.05 considered statistically significant.
In this study, we compared the predictive accuracies of the ML algorithms for drug sensitivity using both cancer type and specific gene expressions. The study proceeded as shown in Figure 1.
Results
Chemosensitivity profiling in cancer cell lines. The chemosensitivity dataset contains data from 144 cell lines on 4 chemosensitivity measures, including the IC50 scores, for 22 components. The 144 cell lines comprise 7 different types of cancers. Data on the gastric cancer and pancreatic cancer cell lines were used in this study. This dataset is summarized in Figure 2.
As shown in Figure 2, the smaller the values of IC50, EC50 and Amax, the greater the sensitivity (Figure 2A, B and D), and the larger the value of ActArea (Figure 2C), the greater the sensitivity. As shown in Figures 3 and 4, the cell lines from the 7 types of cancer were more sensitive to irinotecan, paclitaxel, panobinostat and topotecan than to all other drugs (components). The responses to all 22 drugs showed similar sensitivity patterns across the cell lines from the 7 types of cancer. This indicates that a drug to which one cancer type is sensitive may also be effective in other cancer types (Figure 2).
To compare drug sensitivity in the two models, we used two types of cancer cell lines, gastric cancer and pancreatic cancer, and one of the four drug components with the highest sensitivity, topotecan, as it had the fewest missing entries.
The publicly archived gene expression datasets contain data on 16 gastric cancer cell lines and 29 pancreatic cancer cell lines. Forty-one thousand probes were summarized to 19,566 unique gene symbols in this study. The cell lines in the two datasets are summarized in Table I.
Combination of the 4 drug sensitivity measures. The drug sensitivity datasets included 4 measures of drug sensitivity. In this study, PCA was used to identify a combined sensitivity metric using all 4 measures. PCA is a simple nonparametric statistical method that reduces data dimensionality into uncorrelated variables, which are named principal components. Each principal component is represented by linear combinations of the original variables. Therefore, a combined chemosensitivity value for each cell line can be calculated according to the following formula, which we call a combined chemosensitivity measure:
Combined chemosensitivity measure = w1S1 + w2S2 + w3S3 + w4S4
where S1, S2, S3, S4 are the values of the 4 sensitivity measures, and w1, w2, w3, w4 are the weights of the 4 sensitivity measures. The correlation coefficients of the relationships between the combined chemosensitivity measure and the 4 measures in the two types of cancer cell lines are shown in Table II.
The 4 individual measures were not strongly correlated with each other; therefore, the combined measure is meaningful as it reflects the characteristics of all 4 measures.
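A PCA-based weighting of this kind can be sketched as follows. The sketch makes several assumptions that the paper does not state: the four measures are standardized first, the weights are the loadings of the first principal component, and the sign is oriented so that a larger combined value means greater sensitivity. The six cell lines and their values are hypothetical, and the code is Python rather than the R used in the study.

```python
import math

def standardize(column):
    """Center a column to mean 0 and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_pc_weights(z_columns, iterations=500):
    """Leading eigenvector of the covariance matrix of already-centered columns,
    found by power iteration; returned with unit length."""
    m, n = len(z_columns), len(z_columns[0])
    cov = [[sum(a * b for a, b in zip(z_columns[i], z_columns[j])) / (n - 1)
            for j in range(m)] for i in range(m)]
    w = [1.0 / (i + 1) for i in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

# Hypothetical values of the 4 measures for 6 cell lines (S1..S4)
ic50 = [0.1, 0.3, 0.5, 2.0, 4.0, 8.0]
ec50 = [0.2, 0.25, 0.6, 1.5, 3.5, 7.0]
act_area = [5.1, 4.8, 4.0, 2.2, 1.5, 0.9]
amax = [-95.0, -90.0, -80.0, -60.0, -40.0, -30.0]

measures = [standardize(c) for c in (ic50, ec50, act_area, amax)]
weights = first_pc_weights(measures)  # w1..w4
# The sign of a principal component is arbitrary; orient it (an assumption)
# so that a larger ActArea weight is positive, i.e. larger = more sensitive.
if weights[2] < 0:
    weights = [-w for w in weights]
combined = [sum(w * col[i] for w, col in zip(weights, measures))
            for i in range(len(ic50))]
print([round(w, 3) for w in weights])
print([round(s, 3) for s in combined])
```

Because IC50, EC50 and Amax decrease with sensitivity while ActArea increases, their weights take opposite signs, which is why the orientation step is needed before the combined value can be interpreted.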
Classification of cell lines according to chemosensitivity. The gene expression datasets included information on sixteen gastric cancer cell lines and twenty-nine pancreatic cancer cell lines. Among these 45 cell lines, twenty-seven were treated with topotecan. By combining the gene expression and chemosensitivity datasets, the number of cell lines was reduced from 45 to 27. The combined topotecan sensitivities in the two types of cancer are shown in Figure 3.
As shown in Figure 3, the combined sensitivities were widely distributed even in the same cancer type. This may imply that the chemosensitivity depends on the characteristics of the individual cell lines, not the type of cancer.
Based on their combined sensitivity to topotecan, the 27 cell lines were divided into two groups, sensitive and resistant, by k-means clustering, as shown in Table III. The cell lines were first divided into 3 groups, but the second and third groups were combined into a single resistant group because of the small number of resistant cell lines.
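The grouping step can be illustrated with a one-dimensional k-means sketch. It assumes that a larger combined value means greater sensitivity and uses a deterministic initialization (minimum, median, maximum) for reproducibility; neither detail is given in the paper, the scores are hypothetical, and the code is Python rather than the R used in the study.

```python
import statistics

def kmeans_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm in one dimension; returns (assignments, centroids)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return assignments, centroids

# Hypothetical combined sensitivity scores for 9 cell lines
scores = [2.9, 2.5, 2.7, 0.3, 0.1, -0.2, -2.0, -2.4, -3.9]
init = [min(scores), statistics.median(scores), max(scores)]
assignments, centroids = kmeans_1d(scores, init)

# Merge the 3 clusters into 2 groups: the cluster with the highest centroid
# is called sensitive (an assumption), the other two are pooled as resistant.
top = max(range(3), key=lambda c: centroids[c])
groups = ["sensitive" if a == top else "resistant" for a in assignments]
print(groups)
```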
Identification of topotecan sensitivity genes. Using the gene expression data set, we identified 24 genes which are correlated with topotecan sensitivity based on the results of the Mann-Whitney U-test (Table IV).
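For a single gene, the comparison between groups can be sketched as below. The function computes the Mann-Whitney U statistic with midranks for ties and a two-sided p-value from the normal approximation without tie or continuity correction; for groups as small as those in this study an exact test would normally be preferred, so this is only an illustration of the calculation. The expression values are hypothetical and the code is Python, not the R used by the authors.

```python
import math

def mann_whitney_u(x, y):
    """Return (U, two-sided p) using midranks and the normal approximation."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    n1, n2 = len(x), len(y)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p

# Hypothetical expression of one gene in sensitive and resistant cell lines
sensitive = [7.1, 6.8, 7.4, 6.9, 7.3]
resistant = [5.2, 5.9, 6.1, 5.5, 5.0, 6.0]
u_stat, p_value = mann_whitney_u(sensitive, resistant)
print(u_stat, p_value)
```

Screening 19,566 genes in this way involves many tests, so some form of multiple-testing adjustment is usually considered when interpreting the resulting p-values.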
Figure 4 shows the gene expression patterns and correlation coefficients of expression of the identified genes. The expression pattern is mixed in both the sensitive and resistant groups (Figure 4A).
The expression patterns on the right side of the heatmap show that OBP2B and PIGP are up-regulated in sensitive cell lines, while ZNF227, WDPCP and TTLL5 are down-regulated in sensitive cell lines (Figure 4A). Figure 4B shows that the expression levels of PIGP and OBP2B are negatively correlated with expression of TTLL5, WDPCP, ZNF227 and CD300LD. The distribution of gene expression in the sensitive and resistant cell line groups is shown in Figure 5.
Figure 5 shows the gene expression distributions in the topotecan-sensitive and -resistant groups. Though the expression of most genes partially overlaps in the sensitive and resistant groups, different expression patterns can be discerned in the two groups.
Comparison of the performance of two types of chemosensitivity models. We compared the performance of two types of models using ML algorithms. One set of models predicted chemosensitivity based on cancer type, while the other set of models used the identified gene sets. For this experiment, the dataset was randomly split into training (70%) and testing (30%) datasets. This random split and model fitting was repeated 100 times, and the performance of each model was summarized using the mean values and standard deviations of the accuracies obtained over all 100 repetitions. The results are summarized in Table V.
Table V shows the prediction accuracies on the training and testing datasets. The table shows mean values and standard deviations, which were calculated from the 100 random splits. The performances of the models using the identified genes are superior to those of the models using cancer type (Table V). The predictive accuracies were 0.729 to 0.763 on the training dataset and 0.731 to 0.765 on the testing dataset for the cancer type models. For the models using gene expression, RF is the best model on the training dataset (accuracy=1.0), and SVM is the best model on the testing dataset (accuracy=0.896).
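The repeated 70%/30% evaluation can be sketched as follows. A nearest-class-mean rule on a single gene stands in for the five ML algorithms, the 27 expression values are simulated, and the seed and function names are assumptions; the sketch is in Python and is not the R code used in the study.

```python
import random
import statistics

def fit_centroids(x, y):
    """Stand-in classifier: store the mean expression of each class."""
    return {label: statistics.mean(v for v, lab in zip(x, y) if lab == label)
            for label in sorted(set(y))}

def predict(centroids, value):
    """Assign the class whose mean expression is closest to `value`."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

def accuracy(centroids, x, y):
    return sum(predict(centroids, v) == lab for v, lab in zip(x, y)) / len(y)

def repeated_holdout(x, y, repeats=100, train_fraction=0.7, seed=1):
    """Repeat a random train/test split and collect both accuracies."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(y))
    train_acc, test_acc = [], []
    for _ in range(repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit_centroids([x[i] for i in train], [y[i] for i in train])
        train_acc.append(accuracy(model, [x[i] for i in train], [y[i] for i in train]))
        test_acc.append(accuracy(model, [x[i] for i in test], [y[i] for i in test]))
    return train_acc, test_acc

# Simulated expression of one gene in 27 cell lines (hypothetical data)
data_rng = random.Random(0)
expression = ([data_rng.gauss(6, 1) for _ in range(9)]
              + [data_rng.gauss(4, 1) for _ in range(18)])
labels = ["sensitive"] * 9 + ["resistant"] * 18

train_acc, test_acc = repeated_holdout(expression, labels)
print("train: mean", statistics.mean(train_acc), "sd", statistics.stdev(train_acc))
print("test:  mean", statistics.mean(test_acc), "sd", statistics.stdev(test_acc))
```

With only 27 cell lines each test set holds about 8 samples, so individual test accuracies move in steps of 0.125; this is one reason the mean and standard deviation over many splits are more informative than a single split.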
SVM shows the best performance on the training and testing datasets, and performance seems to differ according to the predictive gene. For all genes, the specific gene models show better performance than the cancer type models. This suggests that a therapeutic drug should be selected on the basis of the expression of specific genes, not cancer type. It may also mean that patients with multiple primary cancers can be treated depending on their specific gene expression profiles, not cancer type.
Discussion
Some environmental factors are thought to be involved in the pathogenesis of multiple primary cancers, such as smoking exposure, alcohol consumption, hepatitis C virus infection, and human papillomavirus infection. Moreover, failure of the immune surveillance system, including decreased T-cell number and expression of human leukocyte antigen class I and CD3 zeta chain, may also contribute to development of multiple primary cancers. However, most cases of multiple primary cancers cannot be fully explained by immune dysfunction and environmental factors. Some genetic factors, including single-nucleotide polymorphisms, microsatellite instability, chromosomal instability, and epigenetic alterations, are considered risk factors for multiple primary cancers. For example, hereditary breast and ovarian cancer syndrome is caused by germline mutations in the genes BRCA1 and BRCA2, and TP53 mutations are associated with Li-Fraumeni syndrome, which is characterized by a high frequency of various types of malignancies such as soft tissue sarcomas, leukemia, and adrenocortical carcinomas.
A fundamental goal of precision medicine is to match drugs to the specific genomic profiles of patients in order to maximize the effectiveness of treatment for the individual. Cancer patients face the possibility of multiple primary cancers, and the occurrence of subsequent cancers in cancer patients is becoming increasingly frequent. Specifically, when the first cancer site is the pancreas, the standardized incidence ratio (SIR) of stomach cancer is 1.41, and the relationship between the two is significant. These two types of cancers are seen in Peutz–Jeghers syndrome due to STK11 gene mutation.
The treatment of patients with multiple primary cancers is challenging and often presents therapeutic dilemmas. In cases of advanced disease, anti-tumour therapy selection is often difficult and is generally not based on evidence from the literature and clinical trials. In these patients, the side effects of combination therapy are usually much more severe than those of monotherapy. Identification of potential biomarkers in specific patient groups may aid in selection of appropriate therapies and increase survival.
ML is a subset of artificial intelligence that is seeing more applications in drug discovery every year. In supervised ML algorithms, the outputs (class labels) of the training data are already known; the supervised algorithms used in this study were LDA, CART, KNN, SVM and RF.
From an exploration of chemosensitivity patterns in cell lines from 7 types of cancer, this paper showed that chemosensitivity depends on drug components, not cancer type (Figure 3). Also, sensitivity is widely distributed even in the same cancer cell lines (Figure 5). This can be interpreted to mean that the chemosensitivity depends on the characteristics of the cell lines, regardless of the type of cancer. Based on this insight, chemosensitivity models using specific gene expression patterns were implemented for patients with multiple primary cancers. By applying five ML models to the combined drug sensitivity and gene expression data, we confirmed that the use of specific gene expression patterns can improve the predictive accuracy for drug sensitivity. These findings imply that molecular markers are essential for personalized medicine in patients with multiple primary cancers, such as the gastric and pancreatic cancers investigated in this study.
Among the five ML models using the six identified genes, RF shows the best performance on the training dataset, but its performance is lower than that of the other models on the testing dataset. On the whole, the ML models based on gene expression were far superior to the models based on cancer type. This means that drugs should not be chosen according to type of cancer, but rather because of the expression of specific genes, especially in patients with multiple primary cancers.
While machine learning readily identifies trends and patterns in datasets, it requires large datasets for training. The first limitation of this study is the small sample size, which means that the dataset does not represent the entire population of patients with multiple primary cancers. A model trained on a random sample of a dataset may have poor generalizability and perform poorly outside of that sample. Indeed, the use of larger training and test sets results in more accurate and reliable predictions. The second limitation of this study is that we constructed ML models using the expression levels of single genes. When features are chosen based on single genes, the models tend to show poor stability and performance on independent datasets. Therefore, in future work, we will implement the predictive ML models using combined gene expression data from a larger dataset.
Acknowledgements
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1A2C1003028).
Footnotes
* These Authors contributed equally to the study.
Authors’ Contributions
Ki-Yeol Kim designed this study, analyzed the data, prepared the figures and wrote the original draft. Xianglan Zhang and Mi Jang oversaw the study and revised the article. All Authors reviewed the article.
Conflicts of Interest
Conflicts of interest relevant to this article were not reported.
- Received February 16, 2021.
- Revision received March 11, 2021.
- Accepted March 24, 2021.
- Copyright © 2021 International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved.