Abstract
Background/Aim: Early prediction of response to neoadjuvant chemotherapy (NAC) is essential for personalized treatment planning in breast carcinoma. Previous studies have relied on human-annotated regions of interest for digital pathology analysis rather than directly leveraging whole-slide images (WSIs). This study aimed to evaluate the predictive value of pretreatment core needle biopsy (CNB) WSIs for pathological complete response (pCR) to NAC using an artificial intelligence (AI)-based approach.
Patients and Methods: We analyzed 130 patients with invasive ductal carcinoma who underwent anthracycline- or taxane-based NAC followed by surgery. From each pretreatment CNB WSI, five regions with the highest cellular density were selected to extract image patches. A fusion-based classification model was developed, integrating image data with clinical metadata, including age, hormone receptor status, and Ki-67 labeling index.
Results: The model achieved an accuracy of 92.3%, comparable to that of models built on expert annotations. Omitting either the image data or the clinical metadata significantly reduced performance, underscoring their complementary roles. Optimal performance was achieved using five image patches of 1,000×1,000 pixels, balancing histological detail and computational efficiency.
Conclusion: Our AI-based model accurately predicted pCR to NAC in breast carcinoma using only a limited number of high-cellularity image patches and basic clinical metadata, without requiring expert annotation. This approach may facilitate earlier treatment decisions and improve preoperative outcome prediction.
Introduction
Neoadjuvant chemotherapy (NAC) has become the standard of care for locally advanced breast carcinoma, with the aim of downsizing the tumor before surgery (1, 2). The likelihood of successful and minimally invasive surgical resection increases in a subset of patients with breast carcinoma treated with NAC (3). NAC can reduce tumor size, prevent micrometastases preoperatively, allow for more conservative surgery, improve overall survival, and provide an in vivo evaluation of sensitivity to chemotherapy (1, 4). Unfortunately, less than 30% of patients with breast carcinoma who receive NAC achieve a pathological complete response (pCR), and approximately 5% experience disease progression while receiving NAC (5, 6). Moreover, the therapeutic effect remains unknown until the patient has received at least two cycles of NAC. For non-responders, the treatment regimen may be changed to a more effective option to avoid the side effects of ineffective treatment. Early evaluation of the treatment response also allows earlier timing of surgery if the tumor appears refractory to NAC (1). Thus, early prediction of NAC outcomes is of great importance in facilitating a personalized paradigm for breast carcinoma treatment (7, 8). An accurate NAC response prediction model would enable patients to avoid overtreatment with ineffective chemotherapy, and accurate assessment of the treatment response to NAC is crucial for establishing the most appropriate surgical approach.
Recently, with the advancement of artificial intelligence (AI), several researchers have reported predicting the effect of NAC using whole-slide images (WSIs), which are an indispensable part of visual slide examination in digital pathology (9-14). Some studies have employed convolutional neural network (CNN)-based models for binary classification of the treatment response to NAC (9, 10), whereas others have integrated machine learning methods, such as support vector machines and random forests, to predict treatment outcomes (11, 12, 15). Although previous studies have provided valuable insights into NAC response prediction, they have been limited by their reliance on human-annotated regions of interest (16), rather than using the entire WSI directly. This process not only requires significant time to annotate key regions of interest on the WSI but also introduces human bias during annotation, making it difficult to attribute the prediction solely to AI.
Therefore, in this study, we aimed to predict pCR to NAC in patients with breast carcinoma using AI-based, computer-aided digital pathology analysis of WSIs obtained from pretreatment core needle biopsy (CNB) specimens.
Patients and Methods
Data collection and labeling. The study protocol was approved by our Institutional Review Board (approval number: 2024-04-037; date of approval: April 23, 2024). The need to obtain signed informed consent from the patients or their guardians was waived by the Institutional Review Board owing to the retrospective nature of the study. All patients were anonymized using the Easy Namer software (Kinfolk Soft, Seoul, Republic of Korea). We enrolled patients who met the following inclusion criteria: histological confirmation of primary invasive ductal carcinoma in breast CNB specimens, receipt of preoperative anthracycline- or taxane-based NAC without prior therapy, and partial or total mastectomy after the completion of NAC at our institution. Patients with synchronous or metachronous double primary carcinomas were excluded from the study. A single board-certified breast pathologist (S.-I.D.) thoroughly examined all available hematoxylin and eosin- and immunostained slides to confirm the diagnosis. Between January 2020 and November 2023, a total of 130 patients were included in this study. The following clinicopathological information was collected from the electronic medical records: age at diagnosis, estrogen receptor expression status, progesterone receptor expression status, human epidermal growth factor receptor 2 (HER2) expression status, Ki-67 labeling index, and response to NAC. Estrogen receptor and progesterone receptor immunoreactivities were analyzed in pretreatment biopsy specimens using the Allred scoring system (17). HER2 immunoreactivity was assessed in pretreatment biopsy specimens according to the American Society of Clinical Oncology/College of American Pathologists guidelines (18). HER2 positivity was defined as an immunostaining score ≥3 or an immunostaining score ≥2 in conjunction with HER2 amplification confirmed by silver in situ hybridization (19). To estimate the Ki-67 labeling index, three 20× objective fields were digitally captured using the GenASis Capture and Analysis Platform (Applied Spectral Imaging, Carlsbad, CA, USA), and the index was measured using the GenASis HiPath system (Applied Spectral Imaging) (20-23). Treatment response was classified as pCR or non-pCR based on the final pathological report after mastectomy. A pCR was defined as the absence of residual invasive carcinoma in the breast or ipsilateral axilla (ypT0/Tis and ypN0, respectively) (24-26).
Pretreatment CNB specimen preparation. Breast biopsy cores were obtained from each patient before NAC. Formalin-fixed, paraffin-embedded blocks containing the biopsy specimens were cut into 4-μm sections. The sections were mounted onto glass slides and stained with hematoxylin and eosin. The slides were digitized into WSIs using an automated digital pathology slide scanner (Aperio GT 450 DX; Leica Biosystems, Seoul, Republic of Korea). All WSIs were reviewed to ensure image integrity prior to image processing and analysis. If any image was distorted, blurry, or contained occlusions, the associated slide was reimaged.
Data preprocessing. We preprocessed the image data to facilitate classification tasks. WSIs are inherently large and often exceed the practical computational limits for efficient processing. To address this, each WSI was downsampled by a factor of 16, reducing its dimensions while retaining essential tissue structures. This downsampling enabled more efficient handling and image analysis. Following this step, the red-green-blue images were converted to the hue-saturation-value color space, which is critical for improved segmentation. This conversion separates color information from intensity, making it easier to isolate regions of interest based on specific color characteristics. A color mask was then applied to the hue-saturation-value images to isolate regions within a predefined range of pinks and purples, as determined by the hue boundaries (130–170), saturation, and value. After masking, the labeled regions were identified in the binary mask using the labeling function, which assigns unique identifiers to connected components within the image. The regional properties were subsequently calculated. To focus on the most relevant areas, the largest regions by area were selected, and the top five regions were chosen for further analysis. This approach highlights the prominent features that may contain diagnostically significant tissue patterns. For each of the top five regions, image cropping was performed to extract a 1,000×1,000-pixel patch centered on the region’s centroid, ensuring that a high-resolution, detailed section of the WSI was retained while excluding unnecessary surrounding tissue. The cropped images were then used for subsequent classification tasks. A custom dataset class was defined to manage the five preprocessed images, along with their associated labels and metadata. This class extracted five cropped image patches, assigned appropriate labels, and stored additional relevant information for each case. The images were further transformed using a data augmentation process that included random horizontal flips, brightness and contrast adjustments, and grayscale conversion. Data augmentation was exclusively applied during the training and validation phases.
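For concreteness, the following is a minimal Python sketch of this preprocessing pipeline, assuming OpenCV, scikit-image, and torchvision. The function name, the saturation/value bounds of the color mask, and the augmentation probabilities are our illustrative assumptions; only the downsampling factor, the hue bounds (130–170), the patch size, and the named augmentations come from the description above.

```python
import numpy as np
import cv2
from skimage.measure import label, regionprops
from torchvision import transforms

def select_top_patches(wsi_rgb, n_patches=5, patch_size=1000, downsample=16):
    """Pick the n largest pink/purple tissue regions in a WSI and crop
    patches at their centroids. `wsi_rgb` is a full-resolution RGB uint8
    array (e.g. loaded via OpenSlide); hue bounds follow the text, while
    the saturation/value bounds are illustrative assumptions."""
    # Downsample by a factor of 16 to keep the mask computation tractable.
    small = cv2.resize(wsi_rgb, (wsi_rgb.shape[1] // downsample,
                                 wsi_rgb.shape[0] // downsample))
    # HSV separates color from intensity, simplifying color-based masking.
    hsv = cv2.cvtColor(small, cv2.COLOR_RGB2HSV)
    mask = cv2.inRange(hsv, np.array([130, 30, 30]), np.array([170, 255, 255]))
    # Label connected components in the binary mask and rank them by area.
    regions = sorted(regionprops(label(mask > 0)),
                     key=lambda r: r.area, reverse=True)
    patches, half = [], patch_size // 2
    for region in regions[:n_patches]:
        # Map the centroid back to full-resolution coordinates, clamped
        # so the crop stays inside the slide bounds.
        cy, cx = (int(c * downsample) for c in region.centroid)
        cy = min(max(cy, half), wsi_rgb.shape[0] - half)
        cx = min(max(cx, half), wsi_rgb.shape[1] - half)
        patches.append(wsi_rgb[cy - half:cy + half, cx - half:cx + half])
    return patches

# Augmentations named in the text; probabilities and jitter strengths are
# placeholders. Applied only during the training and validation phases.
augment = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])
```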
Model architecture. As shown in Figure 1, the model used for classification followed a fusion-based approach, integrating both image data and clinical metadata to enhance predictive accuracy. Its architecture comprised two main components: a CNN-based feature extractor for image data (27) and a fully connected multilayer perceptron (MLP) for clinical metadata (28). In the final stages of the model, these two components were merged to produce a classification output. For image feature extraction, the model employed DenseNet121, a well-established CNN pretrained on the ImageNet dataset (27). The pretrained classifier was replaced with an identity function, enabling the model to leverage the learned feature representations and adapt them for integration with clinical metadata. Features were extracted from the top five regions identified during preprocessing by passing the images through the DenseNet121 backbone (27). A max-pooling operation was applied across the five images to fuse the extracted features, effectively summarizing the visual information from multiple image patches. In parallel, clinical metadata – including age, hormone receptor expression status, and Ki-67 labeling index – were processed using an MLP consisting of three fully connected layers (28). Each layer was followed by a rectified linear unit activation function and dropout for regularization (29). The outputs from the DenseNet feature extractor (27) and MLP (28) were concatenated into a single feature vector. This vector was passed through a classifier consisting of fully connected layers with a sigmoid activation function in the final layer to produce the predicted probability for the binary classification task.
Figure 1. Illustration of the fusion-based model architecture for predicting pCR. Five selected histopathology images are individually processed through a DenseNet-based feature extractor, followed by a max-pooling operation to generate a representative image-level feature vector. In parallel, clinical metadata are input into a fully connected multilayer perceptron to extract structured data features. The outputs from the image and metadata branches are then fused and passed through a final classification layer to generate the pCR prediction. pCR: Pathological complete response.
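To make the architecture concrete, below is a minimal PyTorch sketch of the fusion model described above. The class name, hidden-layer width, and dropout rate are assumptions on our part; the DenseNet121 backbone with an identity classifier, the three-layer MLP with ReLU and dropout, the max-pooling over the five patch features, the concatenation step, and the final sigmoid follow the text.

```python
import torch
import torch.nn as nn
from torchvision import models

class FusionPCRModel(nn.Module):
    """DenseNet121 image branch + MLP metadata branch, fused by concatenation."""

    def __init__(self, n_meta_features=4, hidden=64, dropout=0.3):
        super().__init__()
        # ImageNet-pretrained DenseNet121; replacing its classifier with an
        # identity makes the backbone emit 1,024-dimensional feature vectors.
        self.backbone = models.densenet121(weights="IMAGENET1K_V1")
        img_dim = self.backbone.classifier.in_features
        self.backbone.classifier = nn.Identity()
        # Three fully connected layers for the clinical metadata, each
        # followed by ReLU activation and dropout, as described in the text.
        self.mlp = nn.Sequential(
            nn.Linear(n_meta_features, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
        )
        # Classifier over the concatenated image and metadata features,
        # ending in a sigmoid for the binary pCR prediction.
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, images, metadata):
        # images: (batch, 5, 3, H, W). Fold the five patches into the batch
        # axis, extract per-patch features, then max-pool across patches.
        b, n, c, h, w = images.shape
        feats = self.backbone(images.view(b * n, c, h, w)).view(b, n, -1)
        img_feat = feats.max(dim=1).values
        meta_feat = self.mlp(metadata)
        return self.classifier(torch.cat([img_feat, meta_feat], dim=1))
```

Folding the five patches into the batch axis allows a single backbone forward pass to process all patches, and the max-pooling step makes the fused image representation invariant to patch order.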
Experimental setup. The experimental setup for predicting pCR was designed to thoroughly evaluate the performance of the model. The dataset was divided into training and validation sets using an 8:2 stratified split to ensure a proportional distribution of pCR labels across both sets. Each set contained histopathological image data, corresponding clinical metadata, and their respective labels, with a batch size of 16. The model was trained using a supervised learning approach to optimize the classification accuracy with a cross-entropy loss function. The Adam optimizer was employed with a learning rate of 0.001, and a learning rate scheduler dynamically adjusted the learning rate based on validation performance: if no improvement was observed over two consecutive epochs, the learning rate was halved to sustain learning progress. The training loop was executed for 50 epochs, with each epoch including both forward and backward passes to update the weights of the model. After each epoch, the model was evaluated on the validation set using metrics such as validation loss, accuracy, and F1-score. The model with the highest F1-score was retained as the best-performing one.
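A condensed sketch of this training configuration is shown below, assuming the model and data loaders defined earlier. `BCELoss` stands in for the stated cross-entropy loss in this binary, sigmoid-output setting; keying the scheduler to the validation F1-score (rather than validation loss) and the checkpoint filename are likewise our assumptions.

```python
import torch
from sklearn.metrics import f1_score

criterion = torch.nn.BCELoss()  # binary cross-entropy over sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate after two epochs without validation improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

best_f1 = 0.0
for epoch in range(50):
    model.train()
    for images, metadata, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images, metadata).squeeze(1), labels.float())
        loss.backward()
        optimizer.step()

    # Validation pass: collect thresholded predictions and score with F1.
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for images, metadata, labels in val_loader:
            preds += (model(images, metadata).squeeze(1) > 0.5).int().tolist()
            targets += labels.int().tolist()
    f1 = f1_score(targets, preds)
    scheduler.step(f1)  # scheduler keyed to the validation metric (assumed F1)
    if f1 > best_f1:
        best_f1 = f1
        torch.save(model.state_dict(), "best_model.pt")  # keep best checkpoint
```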
Results
Overall performance. Our model achieved an accuracy of 0.9231, which is comparable to the performance reported in prior studies employing human annotation, with accuracies of 0.94 (11) and 0.89 (12). As illustrated in Figure 2A, the model attained an area under the curve (AUC) of 0.8308, further underscoring its predictive capability. For a more comprehensive evaluation, the detailed classification metrics are summarized in Table I, and the confusion matrix is presented in Figure 2B. The model exhibited balanced performance across key indicators, including sensitivity (0.8333), specificity (0.9500), and F1-score (0.8333), demonstrating reliable classification performance. The high negative predictive value (0.9500) further highlights the model’s effectiveness in identifying true negatives. The confusion matrix showed a strong distribution of true positives and true negatives, reinforcing the model’s robustness in both sensitivity and specificity. These findings suggest that the model can reliably predict response to NAC without the need for expert pathologist input. Its consistent performance supports potential integration into clinical workflows as an automated tool for predicting treatment efficacy; confirming its ability to generalize across diverse datasets would further strengthen its practical utility. Figure 3 shows visual results from four representative cases, in which clinical metadata, WSIs, and five cropped image patches were analyzed to generate pCR predictions. The model correctly predicted pCR status in cases 1 and 2, while misclassifications occurred in cases 3 and 4.
Figure 2. ROC curve and confusion matrix of the proposed model. (A) The ROC curve depicts the trade-off between the true positive rate (sensitivity) and false positive rate (1-specificity) across various classification thresholds. The model achieved an AUC of 0.8308, demonstrating strong discriminatory performance in distinguishing pCR from non-pCR cases. The diagonal dashed line represents the performance of a random classifier with an AUC of 0.5 and serves as a baseline reference. (B) The confusion matrix summarizes the number of true positives, true negatives, false positives, and false negatives based on the model’s predictions. A high number of correct predictions in both positive and negative classes demonstrates the model’s balanced performance. The top-left and bottom-right cells indicate correctly classified samples, while the off-diagonal cells represent misclassifications. AUC: Area under the curve; pCR: pathological complete response; ROC: receiver operating characteristic.
Table I. Performance metrics of our proposed model.
Figure 3. Summary of clinical metadata, image data, and prediction outcomes for four representative cases. Cases 1 and 2 show concordant results between the model prediction and the actual pCR status, whereas cases 3 and 4 demonstrate discordant outcomes. ER: Estrogen receptor; HER2: human epidermal growth factor receptor 2; IHC: immunohistochemical staining; pCR: pathological complete response; PR: progesterone receptor; SISH: silver in situ hybridization; WSI: whole-slide image.
Ablation study. The results of the ablation experiments are summarized in Table II. The full model achieved the highest performance, with an accuracy of 0.9231 and an F1-score of 0.8333. Excluding clinical metadata led to a marked reduction in performance, with accuracy decreasing to 0.8077 and F1-score to 0.5455. Similarly, omitting the five cropped image patches resulted in substantial performance degradation, with the F1-score dropping to 0.0000 and accuracy to 0.7308. These findings highlight the critical role of incorporating both clinical metadata and multiple image inputs to maintain predictive performance and robustness.
Table II. Performance results of our proposed model in various experimental settings.
Effect of cropped images. We analyzed the impact of varying the number of cropped image inputs on model performance. As shown in Table II, the number of patches significantly influenced accuracy and F1-score. Optimal performance was achieved with five patches, yielding an accuracy of 0.9231 and an F1-score of 0.8333, reflecting a balanced trade-off between precision and recall. Using either fewer or more than five patches resulted in decreased performance. These findings underscore the importance of optimizing the number of input images – in this case, five – to maximize predictive accuracy and maintain robust model sensitivity. We also evaluated the effect of image resolution on model performance. Images at a resolution of 1,000×1,000 pixels produced the best results (accuracy: 0.9231; F1-score: 0.8333). In contrast, both lower (500×500 pixels) and higher (2,000×2,000 pixels) resolutions resulted in performance declines, with the highest resolution yielding the poorest outcomes (accuracy: 0.6923; F1-score: 0.5000). These findings suggest that intermediate-resolution images offer the most effective balance between visual detail and computational efficiency for this predictive task.
Discussion
This study is unique in that we employed an AI-based approach to automate the entire predictive pipeline without requiring manual, expert-driven annotation of carcinoma cells – a process that is both labor-intensive and a potential source of inter-observer variability. This strategy holds clinical significance, as it reduces the time and labor required for prediction while maintaining high accuracy. Our model was designed to generate more accurate predictions by integrating both histological patterns and patient-specific clinical metadata. This multimodal fusion approach offered enhanced robustness compared with models relying solely on image or clinical inputs. By leveraging a fusion-based architecture that integrates features from both histopathological images and clinical information, our model achieved a high accuracy of 92.3%, comparable to prior studies (11, 12) that relied on manually annotated regions. By identifying and extracting five high-cellularity regions from each WSI, we eliminated the need for manual region-of-interest selection, significantly reducing human intervention and resource demands while preserving predictive performance. These results underscore the potential of an annotation-free approach to streamline the predictive workflow and enhance its practical utility in clinical settings.
Despite the overall strong performance of our model, we encountered challenges in two cases. In Case 3, the pretreatment CNB specimen contained a substantial amount of fibrous stroma (Figure 3), with two of the five cropped images comprising exclusively stromal components and no viable epithelial elements. The absence of tumor cells hindered accurate prediction of the NAC response. Moreover, the remaining images lacked well-defined ductal structures due to severe crush artifacts, further complicating interpretation. In Case 4, the CNB specimen was markedly compressed, resulting in distorted tumor morphology and extensive crush artifacts. In this case, a high degree of lymphocytic infiltration – greater than that observed in other cases – may also have interfered with image analysis and data processing.
The findings of this study should be interpreted in light of the dataset used for its development and validation. The model was trained on a relatively small cohort of 130 patients with invasive ductal carcinoma from a single institution. This narrow scope suggests that its high performance may, in part, reflect the learning of features specific to this patient population and our institution’s standardized protocols for tissue processing and slide scanning. As such, its generalizability to other histological subtypes of breast carcinoma or to datasets from institutions with different patient demographics and laboratory workflows remains unproven. Rigorous external validation will therefore be essential before considering its direct application in other clinical settings.
To further improve predictive accuracy, we recommend developing automated pipelines that can selectively crop tumor-rich regions, especially in small biopsies with substantial non-neoplastic content. Future studies involving larger and more diverse patient cohorts – including various histological subtypes and morphologies – will be instrumental in refining the model and enhancing its generalizability.
Conclusion
In summary, this study demonstrates the utility of deep neural networks in predicting pCR to NAC in patients with breast carcinoma. Our model achieved over 90% accuracy by integrating five high-cellularity image patches with patient clinical metadata, without requiring expert annotation. This automated, annotation-free approach holds promise for application in clinical settings, where it may support preoperative prognostication and facilitate ongoing assessment of NAC efficacy during treatment.
Acknowledgements
This work was supported by the National Research Foundation of Korea grant funded by the Korean Government (MSIT) (2023R1A2C2006223), the Songcheon Medical Research Fund, Sungkyunkwan University, 2024, the KBSMC-SKKU Future Clinical Convergence Academic Research Program, Kangbuk Samsung Hospital & Sungkyunkwan University, 2025, and a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) funded by the Ministry of Health & Welfare, Republic of Korea (HI21C1137).
Footnotes
Authors’ Contributions
All Authors made substantial contributions to the conception and design of this work; the acquisition, analysis, and interpretation of data; drafting, review, and critical revision of the manuscript for important intellectual content; and approval of the final version to be published.
Conflicts of Interest
The Authors declare no conflicts of interest or financial ties in relation to this report.
Artificial Intelligence (AI) Disclosure
During the preparation of this manuscript, a large language model (ChatGPT, OpenAI) was used solely for language editing and stylistic improvements in select paragraphs. No sections involving the generation, analysis, or interpretation of research data were produced by generative AI. All scientific content was created and verified by the authors. Furthermore, no figures or visual data were generated or modified using generative AI or machine learning–based image enhancement tools.
- Received July 25, 2025.
- Revision received August 16, 2025.
- Accepted August 25, 2025.
- Copyright © 2025 The Author(s). Published by the International Institute of Anticancer Research.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.