Two new feature selection methods based on learn-heuristic techniques for breast cancer prediction: A comprehensive analysis

Read original: arXiv:2407.14631 - Published 7/23/2024 by Kamyab Karimi, Ali Ghodratnama, Reza Tavakkoli-Moghaddam

Overview

Breast cancer cannot be prevented because its causes are unknown, but early diagnosis improves patients' chances of recovery.
Machine learning (ML) can enhance healthcare treatment outcomes while reducing costs and time.
This research proposes two novel feature selection (FS) methods based on the Imperialist Competitive Algorithm (ICA) and the Bat Algorithm (BA), and combines them with ML algorithms.
The study aims to improve the efficiency of diagnostic models and provide comprehensive analysis to help clinicians make more precise and reliable decisions.

Plain English Explanation

Breast cancer is a serious health issue, and because its exact causes are still unknown, it cannot be easily prevented. However, early diagnosis can significantly improve a patient's chances of recovery.

One promising approach to improving breast cancer diagnosis and treatment is the use of machine learning (ML) techniques. ML can help healthcare providers optimize treatment outcomes while also reducing the time and costs involved.

In this study, the researchers developed two new feature selection (FS) methods using the Imperialist Competitive Algorithm (ICA) and the Bat Algorithm (BA). These FS techniques are then combined with various ML algorithms, such as K-nearest neighbors, support vector machines, and random forests.

The goal of this research is to enhance the efficiency and accuracy of diagnostic models, ultimately helping clinical physicians make more precise and reliable decisions when it comes to breast cancer treatment.

Technical Explanation

The researchers employed nine ML algorithms: K-nearest neighbors, support vector machine, decision tree, Naive Bayes, AdaBoost, linear discriminant analysis, random forest, logistic regression, and artificial neural network. Each classifier was combined, separately, with two wrapper feature selection methods, one based on ICA (WFSIC) and one based on BA (WFSB), and assessed with several evaluation measures. In a wrapper method, each candidate feature subset is scored by training and evaluating the classifier itself on that subset.
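The wrapper idea can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the toy data, the leave-one-out 1-nearest-neighbour classifier standing in for the nine classifiers, the small penalty on subset size, and the simplified binary bat-style search with a sigmoid transfer function (the summary does not describe which BA variant or fitness function the paper uses).

```python
import math
import random

def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the selected columns."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = math.inf, None
        for k, xk in enumerate(X):
            if k == i:
                continue
            dist = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def fitness(X, y, mask, penalty=0.01):
    """Wrapper fitness: accuracy minus an assumed small penalty on subset size."""
    return loo_1nn_accuracy(X, y, mask) - penalty * sum(mask) / len(mask)

def binary_bat_search(X, y, n_bats=8, n_iter=20, f_min=0.0, f_max=2.0,
                      alpha=0.9, gamma=0.9, rate0=0.5, v_max=6.0, seed=0):
    """Simplified binary bat-style search over feature masks (illustrative only)."""
    rng = random.Random(seed)
    d = len(X[0])
    pos = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_bats)]
    vel = [[0.0] * d for _ in range(n_bats)]
    fit = [fitness(X, y, p) for p in pos]
    loud = [1.0] * n_bats   # loudness A_i, decays when a bat improves
    rate = [0.0] * n_bats   # pulse rate r_i, grows toward rate0
    b = max(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[b][:], fit[b]
    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            freq = f_min + (f_max - f_min) * rng.random()
            cand = []
            for j in range(d):
                # Velocity pulls each bit toward the current best mask.
                v = vel[i][j] + (best[j] - pos[i][j]) * freq
                vel[i][j] = max(-v_max, min(v_max, v))
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][j]))  # sigmoid transfer
                cand.append(1 if rng.random() < prob_one else 0)
            if rng.random() > rate[i]:
                # Local search: flip one random bit of the best mask.
                cand = best[:]
                cand[rng.randrange(d)] ^= 1
            cand_fit = fitness(X, y, cand)
            if cand_fit >= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = rate0 * (1 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

def make_toy_data(n=40, n_noise=4, seed=1):
    """Synthetic data: 2 informative columns followed by pure-noise columns."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = [rng.gauss(3.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y

X, y = make_toy_data()
mask, score = binary_bat_search(X, y)
print("selected mask:", mask, "fitness:", round(score, 3))
```

The key design point is that the search never inspects the features directly; it only sees the classifier's score for each mask, which is what distinguishes a wrapper method from a filter method.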

The team compared the performance of the two proposed FS approaches and also compared their best diagnostic model with previous works reported in the literature. Experiments were conducted using the Wisconsin Diagnostic Breast Cancer dataset.

The results show that the BA-based framework reaches an accuracy of 99.12%, outperforming the ICA-based framework and most previous works. In addition, the random forest (RF) classifier combined with BA-based FS emerges as the best model, outperforming the others on the evaluation criteria used.
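As a rough sanity check on the 99.12% figure, the arithmetic below shows what that accuracy would correspond to under one common protocol. The 80/20 train/test split is an assumption; this summary does not state how the paper evaluated its models. The sample count of 569 is the size of the public Wisconsin Diagnostic Breast Cancer dataset.

```python
import math
from fractions import Fraction

N_SAMPLES = 569                  # instances in the public WDBC dataset
TEST_FRACTION = Fraction(1, 5)   # ASSUMPTION: 80/20 split, not stated in the summary

# Round the test-set size up (113.8 -> 114).
n_test = math.ceil(N_SAMPLES * TEST_FRACTION)

# Accuracy if exactly one held-out sample were misclassified.
accuracy = (n_test - 1) / n_test
print(n_test, round(accuracy * 100, 2))
```

Under that assumed split, 99.12% is what a single misclassified test sample out of 114 would produce, which is a reminder of how few errors separate the top models on a dataset this small.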

The findings also indicate that the proposed FS techniques can reduce the dataset's dimensionality by up to 90% while raising the performance of diagnostic models to over 99%. The authors further report that there are critical features beyond the optimum subset found by the proposed FS approaches, and that most ML models selected these features.
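To make these two claims concrete, here is a small sketch. The WDBC dataset has 30 features, so a 90% reduction leaves 3. The per-classifier selections below are made-up placeholders, not results from the paper; they only show how one might tally which features a majority of models picked.

```python
from collections import Counter

N_FEATURES = 30  # feature count of the WDBC dataset

def reduction_pct(n_selected, n_total=N_FEATURES):
    """Percentage of features removed by a selection of size n_selected."""
    return 100.0 * (n_total - n_selected) / n_total

# HYPOTHETICAL selections (feature indices) per classifier; NOT results from the paper.
selected_by_model = {
    "KNN": {0, 7, 22},
    "SVM": {7, 22, 27},
    "RF": {7, 20, 22},
    "LR": {3, 7, 27},
}

# Count how many models selected each feature, then keep strict-majority picks.
counts = Counter(idx for chosen in selected_by_model.values() for idx in chosen)
majority = sorted(idx for idx, c in counts.items() if c > len(selected_by_model) / 2)

print(reduction_pct(3))  # 90.0
print(majority)          # [7, 22]
```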

Critical Analysis

The research presents a comprehensive approach to improving breast cancer diagnosis with ML. The proposed FS methods based on ICA and BA, combined with nine ML algorithms, outperform most previously reported results, with the BA-based framework reaching 99.12% accuracy.

However, the study is limited to a single dataset (the Wisconsin Diagnostic Breast Cancer dataset), and further validation on additional datasets would be necessary to assess the generalizability of the findings. Additionally, the paper does not provide much detail on the specific feature sets selected by the FS approaches or the interpretability of the final diagnostic models, which could be an important consideration for clinical adoption.

Future research could explore the explainability of the ML models, as well as investigate the potential for integration with other diagnostic tools or biomarkers to further enhance the clinical decision-making process.

Conclusion

This research proposes two feature selection methods based on ICA and BA which, when combined with various ML algorithms, improve the efficiency and accuracy of breast cancer diagnostic models. The results highlight the potential of ML-driven approaches to support healthcare decision-making and improve patient outcomes, though further research is needed to address the limitations and explore additional clinical applications.

Related Papers

Two new feature selection methods based on learn-heuristic techniques for breast cancer prediction: A comprehensive analysis

Kamyab Karimi, Ali Ghodratnama, Reza Tavakkoli-Moghaddam

Breast cancer is not preventable because of its unknown causes. However, its early diagnosis increases patients' recovery chances. Machine learning (ML) can be utilized to improve treatment outcomes in healthcare operations while diminishing costs and time. In this research, we suggest two novel feature selection (FS) methods based upon an imperialist competitive algorithm (ICA) and a bat algorithm (BA) and their combination with ML algorithms. This study aims to enhance diagnostic models' efficiency and present a comprehensive analysis to help clinical physicians make much more precise and reliable decisions than before. K-nearest neighbors, support vector machine, decision tree, Naive Bayes, AdaBoost, linear discriminant analysis, random forest, logistic regression, and artificial neural network are some of the methods employed. This paper applied a distinctive integration of evaluation measures and ML algorithms using the wrapper feature selection based on ICA (WFSIC) and BA (WFSB) separately. We compared two proposed approaches for the performance of the classifiers. Also, we compared our best diagnostic model with previous works reported in the literature survey. Experimentations were performed on the Wisconsin diagnostic breast cancer dataset. Results reveal that the proposed framework that uses the BA with an accuracy of 99.12%, surpasses the framework using the ICA and most previous works. Additionally, the RF classifier in the approach of FS based on BA emerges as the best model and outperforms others regarding its criteria. Besides, the results illustrate the role of our techniques in reducing the dataset dimensions up to 90% and increasing the performance of diagnostic models by over 99%. Moreover, the result demonstrates that there are more critical features than the optimum dataset obtained by proposed FS approaches that have been selected by most ML models.

7/23/2024

Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI

Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia

Breast cancer has rapidly increased in prevalence in recent years, making it one of the leading causes of mortality worldwide. Among all cancers, it is by far the most common. Diagnosing this illness manually requires significant time and expertise. Since detecting breast cancer is a time-consuming process, preventing its further spread can be aided by creating machine-based forecasts. Machine learning and Explainable AI are crucial in classification as they not only provide accurate predictions but also offer insights into how the model arrives at its decisions, aiding in the understanding and trustworthiness of the classification results. In this study, we evaluate and compare the classification accuracy, precision, recall, and F-1 scores of five different machine learning methods using a primary dataset (500 patients from Dhaka Medical College Hospital). Five different supervised machine learning techniques, including decision tree, random forest, logistic regression, naive bayes, and XGBoost, have been used to achieve optimal results on our dataset. Additionally, this study applied SHAP analysis to the XGBoost model to interpret the model's predictions and understand the impact of each feature on the model's output. We compared the accuracy with which several algorithms classified the data, as well as contrasted with other literature in this field. After final evaluation, this study found that XGBoost achieved the best model accuracy, which is 97%.

4/9/2024

Biomarker based Cancer Classification using an Ensemble with Pre-trained Models

Chongmin Lee, Jihie Kim

Certain cancer types, such as pancreatic cancer, are difficult to detect at an early stage, which makes it important to discover the causal relationship between biomarkers and cancer in order to identify cancer efficiently. By allowing for the detection and monitoring of specific biomarkers through a non-invasive method, liquid biopsies enhance the precision and efficacy of medical interventions, supporting the move towards personalized healthcare. Several machine learning algorithms, such as Random Forest and SVM, are used for classification, but they are inefficient because they require hyperparameter tuning. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness, especially on highly imbalanced datasets, compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining the pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while using only 500 PCA features, in contrast to previous studies that used more than 2,000 features for similar results.

6/17/2024

Machine Learning Driven Biomarker Selection for Medical Diagnosis

Divyagna Bavikadi, Ayushi Agarwal, Shashank Ganta, Yunro Chung, Lusheng Song, Ji Qiu, Paulo Shakarian

Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, Liver, and Gastric Cancer. However, the use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to potentially formed spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 4 different machine learning (ML) classifiers for identifying correlations, evaluating 16 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection proved to be the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted.

5/20/2024