Abstract
Background: It is unclear whether radiomic phenotypes of brain metastases (BM) are related to radiation therapy prognosis. This study assessed whether a convolutional neural network (CNN)-based radiomics model, which learned computed tomography (CT) image features with minimal preprocessing, could predict early response of BM to radiosurgery. Materials and Methods: Responses of 110 BM within 3 months after stereotactic radiosurgery (SRS) were assessed (Response Evaluation Criteria in Solid Tumor, version 1.1), and the tumors were classified as responders (complete or partial response) or non-responders (stable or progressive disease). Each dataset comprised an axial planning CT image containing the tumor center and the corresponding tumor response. Datasets were randomly assigned to training, validation, or evaluation groups repeatedly, to create 50 dataset combinations that were classified into five groups of 10 different dataset combinations with the same evaluation datasets. The CNN was trained using training-group images and labels. Validation datasets were used to choose the model that best classified evaluation images as responders or non-responders. Results: Of 110 tumors, 57 were classified as responders, and 53 as non-responders. The area under the receiver operating characteristic curve (AUC) of each CNN model for 50 dataset combinations ranged from 0.602 [95% confidence interval (CI)=36.5-83.9%] to 0.826 (95% CI=64.3-100%). The AUC of ensemble models, which averaged prediction results of 10 individual models within the same group, ranged from 0.761 (95% CI=55.2-97.1%) to 0.856 (95% CI=68.2-100%). Conclusion: A CNN-based ensemble radiomics model accurately predicted SRS responses of unlearned BM images. Thus, CNN models are able to predict SRS prognoses from small datasets.
The classical use and role of tumor images in radiation therapy planning has been restricted to defining treatment-target boundaries. However, as the field of medical-image analysis matures, it is expected that tumor imaging will include tumor biological and physiological information, which could inform treatment strategies. Radiomics is an emerging field of imaging analysis that quantifies information hidden in tumor images (1, 2). This concept is based on the conversion of tumor phenotypes, as represented in digital tumor images, into mineable high-dimensional feature data (3). Several studies have applied radiomics to the prediction of cancer prognosis, and demonstrated the potential of radiomic features to improve decision-making (4-6). To predict clinical prognosis using radiomic feature data, machine learning algorithms are applied, which are trained on feature data to recognize patterns and construct classification models after feature extraction (7). Traditional radiomics approaches rely on manually pre-defined features and hand-crafted feature extractions (Figure 1); recent studies have indicated that such approaches can be effective (4-6). However, the radiomics workflow is relatively complicated and cumbersome, and it is impossible to uncover features beyond those that were pre-defined. These reasons may explain the low acceptability of radiomics in clinical investigations.
The convolutional neural network (CNN) classifier. An artificial neural network is a machine-learning technique that emulates an animal's neural structure for the classification of high-dimensional non-linear data or pattern recognition. Neural networks comprise several vertical layers, and each layer comprises numerous neural units. Each neural unit is connected to those of the next layer with constant weight and bias. The initial weight and bias have random values. In a general neural-network structure, the first layer is called the input layer, while the final layer is called the output layer, as it expresses the results of the task. Between the input and output layers, there are hidden layers that perform constant operations or data processing. The number and arrangement of the hidden layers may be varied, or a layer for special operations can be inserted depending on the features of the task. When training data are input to the neural network, the neural units of the input layer multiply the input data by the connection weights and relay the product to the neural units of the next layer. The neural units of the next layer then generate an activation value by adding the sums of the values generated from the previous layer to the bias and inputting the value to an activation function. The activation value in turn becomes the inputted value for the next layer. Through this method, the output layer of a neural network presents the final value for the input data of a neural network. If the final output value differs from the targeted value, the connection weight is adjusted in a process called learning. If novel arbitrary data are inputted to an adequately trained neural network, the neural network can generate a value close to the target value according to the learned pattern.
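As an illustration only, the layer-by-layer computation described above (weighted sum, added bias, activation function, relay to the next layer) can be sketched in a few lines of NumPy. The sigmoid activation, the layer sizes, and all variable names are assumptions made for this sketch and are not taken from the study.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function (an assumed choice), mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(x, weights, bias):
    """One layer: weighted sum of the inputs plus bias, passed through the activation."""
    return sigmoid(x @ weights + bias)

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
x = np.array([0.2, 0.7, 0.1])         # hypothetical 3-value input
w_hidden = rng.normal(size=(3, 4))    # random initial weights: 3 inputs -> 4 hidden units
b_hidden = rng.normal(size=4)         # random initial biases
w_out = rng.normal(size=(4, 2))       # 4 hidden units -> 2 output units
b_out = rng.normal(size=2)

hidden = dense_forward(x, w_hidden, b_hidden)   # activation values of the hidden layer
output = dense_forward(hidden, w_out, b_out)    # final values at the output layer
```

Learning would then consist of adjusting `w_hidden`, `b_hidden`, `w_out`, and `b_out` whenever `output` differs from the target value; that step is omitted here.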
A CNN is a type of feed-forward artificial neural network that is generally applied to image recognition. Categorizing medical images using conventional machine-learning algorithms requires a procedure involving pre-processing of the images, defining the inherent features, and extracting the output. In contrast, CNNs can automatically extract high-dimensional features from the original images, learn the patterns, and classify them (8). CNN was inspired by the neural structures of the animal visual cortex. The fundamental principle behind efficient learning of images by a CNN is the presence of a hidden layer that performs convolution and pooling operations, which is similar to the actual visual cortex (9).
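The convolution and pooling operations mentioned above can be sketched as follows. This is a minimal NumPy illustration with a toy image and a hypothetical 2×2 filter; it is not the configuration used in the study. As in most deep-learning frameworks, the "convolution" here is implemented as a sliding dot product without flipping the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block (extra edge rows/columns are dropped)."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # hypothetical 2x2 filter
features = conv2d_valid(image, kernel)             # feature map of shape (5, 5)
pooled = max_pool2d(features)                      # pooled map of shape (2, 2)
```

In a CNN the kernel values are not fixed in advance but are learned during training, which is why no hand-crafted feature detector is needed.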
CNN-based radiomics for brain metastases (BMs). CNNs have shown high performance in the classification of both natural and medical images (10-13). CNNs have the potential to improve the acceptability and applicability of radiomics because the approach requires minimal image preprocessing and no feature detector (Figure 1). To our knowledge, CNN-based radiomics has been used for tumor diagnosis (12) and predicting chemotherapy outcome for lung cancer (13), but not to predict radiation-therapy response. In this study, for the first time, we investigated the performance of CNN-based radiomics for predicting response after radiation therapy. We aimed to predict the response to stereotactic radiosurgery (SRS) for BMs, using CNN-based radiomics.
BMs, which may originate from any primary site, are a major cause of reduced quality of life, cognitive function, and life expectancy (14). As only a minority of patients with BMs are eligible for surgical resection (15), and SRS is a relatively short, convenient, and noninvasive treatment course, it has become a common treatment modality for BMs. Moreover, BMs generally have a round shape and clear tumor boundary; thus, they have an appropriate geometry for radiomics research. Although several studies have assessed factors that predict prognosis of SRS for BM, including enhancement patterns in CT images (16,17), as far as we are aware, no attempts have been made to use radiomics to predict the radio-response of BMs.
The objectives of this study were, firstly, to predict the SRS response of BMs using tumor images extracted from SRS planning CT, and secondly, to demonstrate the applicability of CNN-based radiomics combined with ensemble learning in the clinical radiation therapy field, using a small dataset.
Materials and Methods
Patients, target tumors, and treatment. Medical records of all patients who received SRS for BMs at the Korea Institute of Radiological and Medical Sciences between 2007 and 2015 were analyzed. Patients with no imaging follow-up within 3 months of receiving SRS were excluded. Additional exclusions were patients with primary brain cancer, receipt of SRS combined with whole-brain radiation therapy, multiple BMs (more than five), skull-infiltrative BMs, and tumors with a longest diameter larger than 3.5 cm or smaller than 1 cm. Finally, 89 patients and 110 tumors were included in this study. All tumors were diagnosed by magnetic resonance (MR) imaging. SRS was performed using CyberKnife (Accuray Inc., Sunnyvale, CA, USA) for 100 tumors, and RapidArc (Varian Medical Systems, Palo Alto, CA, USA) for 10 tumors. The gross tumor volume (GTV) was defined as the visible tumor extent on planning CT. The planning target volume (PTV) was defined as the GTV plus a 0-2 mm margin. Radiation doses were prescribed to the 80-95% isodose line of the maximum dose to cover the PTV. Tumor and treatment characteristics are summarized in Table I.
Images and dataset acquisition. For this study, two kinds of data were acquired from the treatment record for the 110 tumors. Firstly, one axial slice containing the central point of each tumor was extracted from the SRS planning CT images saved in the picture archiving and communication system, using the same CT window levels and widths for all tumors. Using the same scale (151×151 pixels; c. 40 mm × 40 mm), a square section of the image containing the tumor and a small portion of surrounding tissue was cropped. The cropped image contained the axial tumor centrally and the surrounding tissue marginally. The greatest tumor diameter enrolled in our study was less than 35 mm, so that each cropped image contained the entire axial tumor. No other image processing was carried out. Compared with MR images, CT images are acquired under consistent, controlled conditions, such that if CT images are extracted for model learning, bias can be minimized in comparisons of images. MR data may show radiomic variations between images due to differences in contrast time, equipment upgrades over time, and changes in contrast agent use.
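The cropping step can be illustrated with a short NumPy sketch. The 151×151-pixel patch size follows the text; the 512×512 slice size, the tumor-center coordinates, and the function name `crop_centered` are hypothetical and serve only as an example.

```python
import numpy as np

def crop_centered(slice_2d, center_row, center_col, size=151):
    """Return a size x size patch centered on (center_row, center_col)."""
    half = size // 2
    top, left = center_row - half, center_col - half
    if (top < 0 or left < 0
            or top + size > slice_2d.shape[0]
            or left + size > slice_2d.shape[1]):
        raise ValueError("crop window falls outside the image")
    return slice_2d[top:top + size, left:left + size]

ct_slice = np.zeros((512, 512))   # hypothetical axial slice (assumed matrix size)
ct_slice[256, 256] = 1.0          # mark the assumed tumor center
patch = crop_centered(ct_slice, 256, 256)
```

With an odd patch size, the marked center pixel lands exactly in the middle of the patch (row 75, column 75).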
Secondly, the treatment response of each tumor in the 3 months following SRS was obtained. As per the Response Evaluation Criteria in Solid Tumor version 1.1 (RECIST 1.1) (18), the treatment results for the 110 tumor images were labeled as either ‘responder’ for those showing complete (CR) or partial (PR) response, or as ‘non-responder’ for those showing stable (SD) or progressive (PD) disease. The cropped image of the tumor and the treatment results together defined a data point, resulting in a dataset of 110 such points.
Implementation and computation. In this study, CNNs featured an architecture comprising a hidden layer and two convolution/max-pooling layers (Figure 2). Image data were converted to a matrix of pixels to be introduced in the input layer. Training used the drop-out technique, which probabilistically selects and trains a subset of the neurons in the hidden layer. This technique reduces overfitting in the process of creating a multilayer-deep neural-network learning model, thereby boosting overall performance (19). Between the final pooling layer and output layer were two fully connected neural layers. These performed additional feature extractions, and did not incorporate the drop-out technique. The final output layer consisted of two neural units. Outputs from the two neural units were compared to classify the input data as ‘responder’ or ‘non-responder.’ A stochastic gradient-descent algorithm, which is generally used in deep artificial neural networks for processing large-scale data, was used to update the connection weights of the neurons during learning (20). This neural network was based on Google's machine-learning framework called TensorFlow (version 0.9) and configured in the Python language. Amazon's cloud GPU computing was used to enhance performance during learning operations.
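Two of the elements described above, drop-out and the comparison of the two output units, can be sketched as follows. This is not the study's TensorFlow code: the keep probability of 0.5, the rescaling of the surviving units ("inverted" drop-out, a common variant), and the ordering of the two labels are all assumptions made for illustration.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted drop-out: zero each unit with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

def classify(two_unit_output, labels=("responder", "non-responder")):
    """Compare the two output units and return the label of the larger one."""
    return labels[int(np.argmax(two_unit_output))]

rng = np.random.default_rng(0)
hidden = np.ones(1000)                               # hypothetical hidden-layer activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)    # about half the units are silenced
label = classify(np.array([0.3, 1.2]))               # second unit is larger -> "non-responder"
```

Drop-out is applied only during training; at prediction time all units are used.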
Training, validation, and evaluation. One hundred and ten datasets were randomly classified into training, validation, and evaluation groups. This random assignment was repeated 50 times to create 50 independent combinations of datasets. Note that the following conditions were met for the random assignment: Firstly, the 50 combinations were clustered into five groups (A to E) of 10 combinations. Each group was assigned the same evaluation datasets but different training and validation datasets. Secondly, the ratio of responder to non-responder datasets remained consistent across the five groups.
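The assignment scheme can be sketched as below: five groups, each with one fixed, class-balanced evaluation set and 10 different training/validation partitions of the remaining data. The set sizes (`n_eval_per_class`, `n_val`), the seed, and the independent drawing of evaluation sets per group are assumptions for this sketch; the text above does not state them.

```python
import numpy as np

def make_combinations(labels, n_groups=5, n_per_group=10,
                      n_eval_per_class=(12, 11), n_val=20, seed=0):
    """Build n_groups x n_per_group splits of indices into train/val/eval.

    Within a group the evaluation indices are fixed; only the
    training/validation partition of the remaining data is reshuffled.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    groups = []
    for _ in range(n_groups):
        # Stratified evaluation set: the same class counts in every group.
        eval_idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == cls), size=n, replace=False)
            for cls, n in zip((0, 1), n_eval_per_class)
        ])
        rest = np.setdiff1d(np.arange(labels.size), eval_idx)
        combos = []
        for _ in range(n_per_group):
            shuffled = rng.permutation(rest)
            combos.append({"train": shuffled[n_val:],
                           "val": shuffled[:n_val],
                           "eval": eval_idx})
        groups.append(combos)
    return groups

labels = np.array([0] * 57 + [1] * 53)   # 0 = responder, 1 = non-responder
groups = make_combinations(labels)       # 5 groups x 10 combinations = 50 in total
```

Fixing the evaluation set within a group is what later allows the outputs of the 10 models in that group to be averaged image by image.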
Next, the experiment for each dataset combination proceeded as follows. Training datasets were used to train the neural networks. The neural networks learned the radiomic features of responders and non-responders by learning the labels that matched the training images. However, learning performance can vary according to the initial connection weights, which were randomly determined. Hence, the validation datasets were used to validate performance during learning. Validation proceeded by periodically presenting a validation image at a fixed trial-interval to the neural network undergoing the learning process. The network then predicted the matched label. When the accuracy of prediction was low, learning was reinitialized and restarted. When the network demonstrated good performance or did not show improved performance, learning was stopped and the model was accepted and proceeded to the evaluation stage. That is, validation was used to discover the best model among the candidate models during training. Finally, an evaluation image was presented to the neural network that completed learning to generate a classification output. Each neural network was designed such that it generated an output of 1 for an image determined to be a non-responder and 0 for an image determined to be a responder.
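The validation logic described above (restart when accuracy is low; accept when performance is good or no longer improves) can be expressed as a small decision function. The thresholds `restart_below` and `accept_at`, the `patience` count, and the function name are hypothetical; the study does not report the values actually used.

```python
def validation_decision(val_accuracies, restart_below=0.5, accept_at=0.8, patience=3):
    """Walk through periodic validation accuracies and return (decision, check_index)."""
    best = float("-inf")
    checks_without_gain = 0
    for i, acc in enumerate(val_accuracies):
        if acc < restart_below:
            return "restart", i          # poor accuracy: reinitialize the weights
        if acc >= accept_at:
            return "accept", i           # good performance: stop and keep the model
        if acc > best:
            best = acc
            checks_without_gain = 0
        else:
            checks_without_gain += 1
            if checks_without_gain >= patience:
                return "accept", i       # no further improvement: stop and keep the model
    return "continue", len(val_accuracies) - 1

decision = validation_decision([0.6, 0.7, 0.85])   # -> ("accept", 2)
```

Because the initial connection weights are random, a rule of this kind would be applied afresh to each of the 50 training runs.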
Independently of the classification outputs from the 50 individual neural network models, the classification outputs generated by the neural networks of the same group (which shared the same evaluation datasets) were averaged to determine the final result. This model is defined as an ensemble model (Figure 3). Performance of the ensemble model was evaluated using the area under the receiver operating characteristic curve (AUC).
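A minimal sketch of the ensemble step and its evaluation follows, using toy data rather than study data. The AUC is computed from its rank interpretation (the probability that a non-responder receives a higher score than a responder, with ties counted as one half); the function names and the three toy models are assumptions for illustration.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-negative score pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

def ensemble_scores(model_outputs):
    """Average the outputs of all models in a group, one score per evaluation image."""
    return np.mean(np.asarray(model_outputs, dtype=float), axis=0)

y_true = np.array([1, 1, 0, 0])      # toy labels: 1 = non-responder, 0 = responder
model_outputs = np.array([           # rows = models, columns = evaluation images
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
])
avg = ensemble_scores(model_outputs)
single_aucs = [auc(y_true, m) for m in model_outputs]   # 0.75, 0.5, 0.75
ensemble_auc = auc(y_true, avg)                          # 1.0
```

In this toy case each model makes a different mistake, so the averaged score separates the two classes better than any single model does. This is the effect the ensemble is intended to exploit, although it is not guaranteed in general.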
Results
Tumor and treatment characteristics. Of the 110 tumors, 57 were labeled as responders and 53 as non-responders. The maximum tumor diameter was 1.0-3.3 cm in the responder group (median=2.0 cm) and 1.2-3.3 cm in the non-responder group (median=2.2 cm), with no significant differences between the two groups. With regard to the primary tumor site, the most frequent site in the responder group was breast (40%), followed by lung (35%), rectum (7%), and other sites. In the non-responder group, the most frequent site was lung (30%), followed by breast (25%), rectum (11%), and other sites. The difference in the distribution of primary tumors between the two groups approached statistical significance (p=0.051). There was no statistically significant difference between the responder and non-responder groups in terms of the tumor's intracranial position. Patient age ranged from 14 to 92 years (median=59 years) in the responder group and from 43 to 76 years (median=57 years) in the non-responder group; there was no significant difference in age between the two groups. The two groups also did not significantly differ in terms of patient sex and Karnofsky performance status scale. Furthermore, there was no difference with respect to whether the primary disease site was completely controlled by surgery or radiation therapy compared to cases without such control. However, there were significant differences with respect to extra-cranial metastases (other than the primary site) and receipt of systemic therapy for BMs (Table I). Systemic therapy refers to chemotherapy, hormone therapy, and targeted therapy, used alone or in combination depending on the type of primary tumor. Specific analysis of systemic therapy regimen was not performed.
The total dose of radiation ranged from 19-48 Gy (median 24 Gy) in 1-5 fractions in the responder group and from 16-36 Gy in 1-4 fractions in the non-responder group, with no significant differences between the two groups. The two groups also did not significantly differ in the biologically effective dose, or treatment platform (Table II).
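For reference, the biologically effective dose (BED) of a fractionated schedule under the standard linear-quadratic model is BED = n·d·(1 + d/(α/β)), where n is the number of fractions and d the dose per fraction. The sketch below applies this formula; the α/β ratio of 10 Gy is a conventional assumption for tumor tissue and is not stated in this report, and the example schedules are hypothetical.

```python
def biologically_effective_dose(total_dose_gy, n_fractions, alpha_beta_gy=10.0):
    """Linear-quadratic BED for n equal fractions; alpha_beta_gy is an assumed value."""
    dose_per_fraction = total_dose_gy / n_fractions
    return total_dose_gy * (1.0 + dose_per_fraction / alpha_beta_gy)

bed_single = biologically_effective_dose(24.0, 1)   # hypothetical 24 Gy in 1 fraction
bed_multi = biologically_effective_dose(24.0, 3)    # hypothetical 24 Gy in 3 fractions
```

The same total dose yields a lower BED when split into more fractions, which is why schedules with different fraction numbers are compared on the BED scale rather than by total dose.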
Prediction results. The independent treatment prediction results for groups A to E were as follows. The AUC of the predictions generated by each of the 10 neural networks in group A ranged from 0.693 [95% confidence interval (CI)=47.0-91.6%] to 0.826 (95% CI=70.3-100%) (median 0.778). The AUC of the ensemble model, which determined the result based on the average of the outputs generated by each of the 10 neural networks, was 0.856 (95% CI=68.2-100%). When the cut-off point was set to 0.5 for the ensemble model, the sensitivity and specificity for prediction of non-responders were 82% and 83%, respectively. The results for groups B to E are detailed in Table III.
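Sensitivity and specificity at a given cut-off can be computed as in the sketch below, with non-responders treated as the positive class, as in the text. The toy scores are hypothetical, and treating a score exactly equal to the cut-off as a predicted non-responder is an assumption.

```python
import numpy as np

def sensitivity_specificity(y_true, scores, cutoff=0.5):
    """Treat scores >= cutoff as predicted non-responders (the positive class)."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(scores, dtype=float) >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))   # non-responders correctly identified
    fn = np.sum((pred == 0) & (y_true == 1))   # non-responders missed
    tn = np.sum((pred == 0) & (y_true == 0))   # responders correctly identified
    fp = np.sum((pred == 1) & (y_true == 0))   # responders flagged as non-responders
    return tp / (tp + fn), tn / (tn + fp)

y_true = np.array([1, 1, 1, 0, 0])             # toy labels: 1 = non-responder
scores = np.array([0.9, 0.6, 0.2, 0.4, 0.7])   # toy ensemble-averaged outputs
sens, spec = sensitivity_specificity(y_true, scores)
```

For these toy values, two of three non-responders and one of two responders are classified correctly.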
In summary, the AUCs generated by the ensemble models for the evaluation data of groups A through E were 0.856 (95% CI=68.2-100%, p=0.004), 0.856 (95% CI=70.2-100%, p=0.004), 0.799 (95% CI=60.8-99.0%, p=0.015), 0.761 (95% CI=55.2-97.1%, p=0.034), and 0.826 (95% CI=63.7-100%, p=0.008), respectively, which were either comparable to or superior to the results generated from individual neural networks (Figure 4).
Discussion
The emergence of radiomics has provided an impetus to research assessing whether tumor images can predict radiation therapy prognosis. Mattonen et al. reported that radiomics analysis significantly predicts the probability of recurrence of lung cancer after radiation therapy (21). However, there has been no study that applied CNN-based radiomics to predict the prognosis of radiation therapy. Hence, this study analyzed radiation therapy planning CT images with CNNs to determine whether CNNs can predict early response to SRS.
We made two assumptions prior to beginning this study. The first was that microenvironments within tumors are determined by the pathophysiology of the tumor, and thereby are related to the tumor's response to treatment (22). The key concept of radiomics is that if the microenvironment of a tumor is exhibited as a characteristic phenotype in images, analyzing the images would enable the prediction of treatment response. George et al. (16) and Goodman et al. (17) reported that local control post-SRS is associated with the contrast patterns in BM and MRI, but these studies are limited in that they only qualitatively classified the contrast patterns. Evidence for the presence of a correlation between BM pathophysiology and their phenotypes in images is lacking in the literature. Therefore, we assumed that phenotypes of BMs shown in CT images reflect the tumor microenvironment, which determines the prognosis of radiation therapy.
The second assumption was that BMs are generally round with clear margins (23). Thus, we assumed that BMs display similar radiomic phenotypic patterns radially from a central point of the tumor. Based on this assumption, we would be able to identify the overall radiomic features of the tumor using a cross-sectional CT image containing the center of the tumor.
In this analysis, the CNN-based machine-learning model was taught pairs of tumor images and responses to SRS and then predicted SRS responses for unlearned images. Fifty independent neural networks used for the analysis showed a prediction AUC ranging from 0.602 to 0.826 (median=0.695). However, the AUC values of the ensemble models, each of which comprised 10 individual neural networks that assessed the same evaluation datasets, ranged from 0.761 (95% CI=55.2-97.1%) to 0.856 (95% CI=68.2-100%), thus showing better predictive performance than the individual neural networks. We cannot compare the performance of this model to that of another model, as no studies have used radiomics to predict BM responses to radiation therapy. However, the sensitivity and specificity of the ensemble models for recognizing non-responders were superior when the cut-off value was set to 0.5. As the ensemble technique increases the number of datasets used for learning (24,25), we speculate that this approach could compensate for the problem of small datasets in radiomics research.
Furthermore, considering the promising predictive performance found in this study, the assumption that radiological phenotypes are correlated with microenvironments that determine the tumor's treatment response may be supported. There were no statistically significant differences in other major parameters between the responder and non-responder groups, except for the presence of extra-cranial metastases and systemic therapy (Tables I and II). Thus, tumor radiological phenotypes may be more likely predictive of treatment response than other clinical factors. However, it seems that the presence of extra-cranial metastases and receipt of systemic therapy may affect intra-cranial tumor burden, and may partially relate to local control of BMs.
We only used one axial image, which contained the center of the tumor, for analysis instead of using the entire image data. When limited to BMs, it is reasonable to assume that one cross-sectional image containing the center of the tumor represents the overall radiomic features of the tumor. Furthermore, the use of one cross-sectional image extracted from the planning CT images for analysis, as in this study, may improve clinical applicability.
However, the design of this study has some limitations. Firstly, evidence for the effectiveness of single-fraction SRS for BMs has been established, but there is a lack of evidence for the effectiveness of multi-fraction SRS for BMs. Nevertheless, a linear-quadratic model was introduced to simplify the problem in order to compensate for differences in biological treatment effects between the two groups. However, multi-fraction SRS has a high fraction size, which makes it difficult to compare doses strictly between single- and multi-fraction SRS. Secondly, systemic therapy needs to be considered as an important factor affecting local control of tumors. However, analysis of local control taking into consideration the types of systemic therapies and the details of regimens was not performed in this study. Thirdly, we used CT images that can be obtained under consistent conditions and that most appropriately reflect the timing of treatment initiation in model analysis. However, if MR images were available with consistent conditions and imaging parameters, it might be possible to improve the predictive accuracy by developing MR-based machine learning models that incorporate more detailed radiomic information.
We also need to review data that failed to accurately predict treatment outcome. Although we did not perform dataset-specific analysis of prediction failures, the following reasons may be involved. Firstly, CNN is known to be capable of learning high-dimensional features of images, but a great deal of learning data is required to achieve good performance (26). However, our data only comprised 87 sets, which is a relatively small number, even considering the ensemble learning approach. The small data size hindered the ability of the predictive model to learn generalized boundaries of radiomic phenotypes that distinguished responders from non-responders. Secondly, tumor responses PR and SD were labeled as responder and non-responder, respectively, as per RECIST version 1.1. However, the demarcations for PR and SD are very thin, and it is unclear whether tumors that show PR and those that show SD can be viewed as having different radiomic properties. Thus, it is possible that the predictive models generated conflicting results for tumors that were between PR and SD. Thirdly, other predictors of tumor treatment may exist in addition to the clinical factors analyzed in this study. As we did not control for potential predictors, our results should be carefully interpreted. Fourthly, the findings of progression at the 3-month follow-up imaging of BMs did not exclude the possibility of pseudoprogression due to edema or hemorrhage. Indeed, making a clinical distinction between progression and pseudoprogression is not easy. Despite these limitations, this study demonstrated the possibility of predicting BM response to SRS through CNN-based radiomic analysis, even in a small dataset. Future studies should address these limitations and utilize multicenter data to improve the predictive performance of the model so as to produce data useful for clinical treatment decisions.
Conclusion
A CNN-based ensemble radiomics model trained on SRS planning CT images of BMs and their known early responses predicted the SRS responses of unlearned images of BMs with high accuracy. Hence, the radiomic phenotypes of BMs shown in CT images might be correlated with tumor response to SRS. This study is the first to suggest the use of CNN models to predict radiation therapy prognoses with small-scale data. However, additional studies are required to improve the performance of this model so that it may be utilized to help make clinical treatment decisions.
Acknowledgements
This study was supported by a grant from the Korea Institute of Radiological and Medical Sciences (KIRAMS), funded by the Ministry of Science and ICT, Republic of Korea (50543-2017).
Footnotes
Funding
None.
Conflicts of Interest
The Authors have no conflicts of interest that are relevant to this report.
- Received July 13, 2018.
- Revision received July 27, 2018.
- Accepted July 30, 2018.
- Copyright© 2018, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved