Study the Effect of the Risk Factors in the Estimation of the Breast Cancer Risk Score Using Machine Learning

Khozama, Sam; Mayya, Ali Mahmoud

doi:10.31557/APJCP.2021.22.11.3543

Document Type: Research Article

Authors

Sam Khozama ¹
Ali Mahmoud Mayya ²

¹ Department of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary.

² Department of Computer Engineering, Tishreen University, Lattakia, Syria.

Abstract

Objective: Early prediction of breast cancer is one of the most important tasks in medicine. Many studies have introduced approaches that support early prediction and estimate future occurrence from periodic mammography tests. In this research, we introduce a novel machine learning tool for the early prediction of breast cancer.

Methods: Three resources are used to identify the most essential risk factors: the BCSC (Breast Cancer Surveillance Consortium) dataset, a medical questionnaire, and multiple international breast cancer reports. The BCSC dataset is normalized and balanced; the questionnaire and the medical reports are then analyzed to define the degree of importance and a potential weight for each risk factor. These weights are used to scale the risk factors, and an optimizable tree-based ML model is then trained on the balanced, weighted risk-factor datasets.

Results: Three balanced versions of the BCSC dataset are used: oversampled, down-sampled and mixed. Each risk factor is assigned a weight (1, 2 or 4) based on a mathematical model of the questionnaire and the international breast cancer reports. Experiments on the weighted and non-weighted versions of the dataset indicate that performance increases significantly when the weighted risk factors are used. The tests show that down-weighting the non-essential risk factors increases accuracy and reduces errors. The overall accuracy on the weighted balanced datasets reaches 100%, 95.8% and 95.9% for the down-sampled, oversampled and mixed datasets, respectively.

Conclusion: Weighting the risk factors of the BCSC dataset improves performance by increasing accuracy and reducing the false rejection and false discovery rates for all versions of the balanced datasets. The weighting approach can also be used to improve the estimation of the breast cancer risk score by scaling the individual scores of the risk factors.
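
The weighting and balancing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The risk-factor names, the assignment of the weights 1, 2 and 4 to particular factors, the multiplicative scaling rule, and the definition of the "mixed" dataset as resampling each class to the midpoint of the class sizes are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from collections import defaultdict

# Hypothetical risk-factor names and weights (assumption): the abstract only
# states that each risk factor receives a weight of 1, 2 or 4.
WEIGHTS = {"age_group": 4, "family_history": 2, "bmi_group": 1}


def scale_record(record, weights):
    """Multiply each risk-factor value by its weight (assumed scaling rule)."""
    return {name: value * weights[name] for name, value in record.items()}


def balance(records, labels, mode, seed=0):
    """Return a class-balanced copy of (records, labels).

    mode "down":  every class is trimmed to the smallest class size.
    mode "over":  every class is resampled with replacement up to the largest.
    mode "mixed": every class is resampled to the midpoint of the smallest
                  and largest class sizes (assumed meaning of "mixed").
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)

    sizes = [len(recs) for recs in by_class.values()]
    target = {
        "down": min(sizes),
        "over": max(sizes),
        "mixed": (min(sizes) + max(sizes)) // 2,
    }[mode]

    out_records, out_labels = [], []
    for lab, recs in by_class.items():
        if len(recs) >= target:
            # Enough samples: draw without replacement.
            chosen = rng.sample(recs, target)
        else:
            # Too few samples: keep all and duplicate random ones.
            chosen = recs + rng.choices(recs, k=target - len(recs))
        out_records.extend(chosen)
        out_labels.extend([lab] * target)
    return out_records, out_labels
```

In such a pipeline, each record would first be passed through `scale_record`, the scaled records would then be balanced with one of the three modes, and the result would be handed to a tree-based classifier for training.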
