Study the Effect of the Risk Factors in the Estimation of the Breast Cancer Risk Score Using Machine Learning

Objective: Early prediction of breast cancer is one of the most essential fields of medicine. Many studies have introduced prediction approaches to facilitate early prediction and to estimate future occurrence based on periodic mammography tests. In the current research, we introduce a novel machine learning tool for the early prediction of breast cancer. Methods: Three basic resources are used to identify the most essential risk factors: the BCSC (Breast Cancer Surveillance Consortium) dataset, a medical questionnaire, and multiple international breast cancer reports. The BCSC dataset is normalized and balanced; the questionnaire and the medical reports are then analyzed to define the degree of importance and a potential weight for each risk factor. These weights are used to scale the risk factors, and an optimizable tree-based ML model is then trained on the balanced, weighted risk factor datasets. Results: Three balanced versions of the BCSC dataset are used: oversampled, down-sampled and mixed. Each risk factor is assigned a weight (1, 2 or 4) based on a mathematical modelling of the questionnaire and the international breast cancer reports. The experiments are run on the weighted and non-weighted versions of the dataset, and they indicate that performance increases significantly with the weighted version of the risk factors. The tests also show that down-weighting the non-essential risk factors increases the accuracy and reduces errors. The overall accuracy of the weighted balanced datasets reaches 100%, 95.8% and 95.9% for the down-sampled, oversampled and mixed datasets, respectively. Conclusion: Weighting the risk factors of the BCSC dataset improves performance by increasing the accuracy and reducing the false rejection and false discovery rates for all versions of the balanced datasets. The weighting approach can also be used to improve the estimation score of breast cancer by scaling the individual scores of risk factors.


Introduction
therapy affect the exposure period of breast tissue to hormones that lead to cancer (American Cancer Society, 2019).
For breast cancer estimation research, many datasets have been used. The Breast Cancer Surveillance Consortium dataset (BCSC dataset, 2021), consisting of 280,660 records, has been used in many studies (Rajendran et al., 2020; Shieh et al., 2016; BCSC dataset, 2021; Williams et al., 2016). Another international dataset is the Breast Cancer Information Management System (BCIMS) dataset, which consists of 16,000 cases (Peng et al., 2016) and was used by many studies such as Hou et al. (2020) and Zhong et al. (2020). Some other researchers collected their datasets from specialized medical centers or hospitals (Ming et al., 2020; Barlow et al., 2006). Shieh et al. (2016) proposed a breast cancer prediction model using information on clinical and polygenic risks. Bayes estimation and conditional logistic regression models were used together to study the joint effect of ordinary and polygenic risk factors on the future risk of breast cancer. The researchers used 486 cases of the BCSC dataset and found that the prediction performance increased from AUC=0.62 to AUC=0.65 after adding the polygenic risk to the model. They concluded that 18% of the cases were classified as high-risk in the joint model, compared with only 7% in the ordinary risk factors model.
Li and Sundararajan (2018) applied several ML approaches to the prediction of breast cancer. They used only 10,000 cases and eight risk factors of the BCSC dataset. SVM and Bayes classifiers were used for the final risk estimation and achieved accuracies of 96.6% and 91.26%, respectively.
Rajendran et al. (2020) used supervised ML algorithms on imbalanced class data for the prediction of breast cancer on the BCSC dataset. To balance the data, they used three approaches: Synthetic Minority Oversampling, under-sampling, and a fusion of both techniques. They used a Bayes classifier, Bayes networks, Random Forests (RF), and random trees as classifiers. The best accuracy they obtained was 99.1%, at a False Positive (FP) rate of 21%. The problem with this research was that only 10,252 instances were used after applying the balancing techniques; in addition, the results showed a low sensitivity of 78.1%.
A new model for predicting breast cancer in Chinese women was introduced by Hou et al. (2020). They used 7,127 cases of the Breast Cancer Information Management System (BCIMS) dataset and chose specific risk factors on the basis that they must be known and collected by the same measurement techniques. Consequently, 10 risk factors were chosen and different prediction models were used, such as RF, deep neural networks (DNN) and XGBoost. They obtained an accuracy of 72.8% for both DNN and RF, while the XGBoost accuracy was 74.2%.
An evaluation of several ML classifiers for the prediction of breast cancer on incomplete datasets was presented by Teja et al. (2020). They evaluated RF, Logistic Regression (LR) and a custom Neural Network (NN). The Area Under Curve (AUC) was used for the performance evaluation on the BCSC dataset. The AUC reached 0.645, 0.634 and 0.649 for LR, RF and NN, respectively. Ming et al. (2020) collected a breast cancer prediction dataset from Geneva University Hospitals. Their dataset included 112,587 individuals and 14 variables. They applied different ML algorithms (such as the Markov mixed model, adaptive boosting and RF) and obtained accuracies between 84.3% and 88.9%. However, the dataset variables related not only to breast cancer but also to other tissues, so more risk factors needed to be included.
Most previous studies did not consider the nature of the breast cancer dataset they used. Each dataset has properties that must be understood in order to obtain an accurate estimation, as mentioned by the BCSC dataset (2021). For example, the BCSC dataset requires the "count" variable to be treated as a very important variable in order to achieve correct results. In addition, the BCSC dataset is unbalanced, so it needs a balancing step before any estimation model is applied. The aim of this research is to develop a new tool for predicting breast cancer based on the BCSC risk factors. We have taken the "count" variable into account for a good estimation, and balancing has been applied as a pre-processing step. The last new element is the weighting mechanism, in which a weight is assigned to each risk factor in order to enhance the performance. The following sections describe the materials used and the proposed approach in detail. Finally, the results and discussion sections are presented.

Dataset
In the current research, the BCSC dataset is used. It includes 280,660 records and 12 risk factors, which are described in Table 1. Besides these risk factors, the dataset includes a variable called "count", which holds the frequency of each record within the dataset, as mentioned in the BCSC dataset (2021).
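Because each record carries a "count", statistics such as class shares should be weighted by that frequency rather than computed over rows. The following is a minimal sketch of this idea, assuming a hypothetical list of (label, count) pairs; the layout and the values are made up for illustration and are not BCSC data.

```python
def class_shares(records):
    """Return the share of each class label, weighting every record by its count.

    `records` is an iterable of (label, count) pairs; this layout is a
    hypothetical simplification of the dataset, not its real schema.
    """
    totals = {}
    for label, count in records:
        totals[label] = totals.get(label, 0) + count
    grand_total = sum(totals.values())
    return {label: total / grand_total for label, total in totals.items()}


# Toy rows (illustrative values only): three distinct records, two classes.
toy_records = [(0, 90), (0, 6), (1, 4)]
shares = class_shares(toy_records)
```

Counting rows instead of summing counts would give the minor class a share of one third here, which is why the frequency variable matters.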

Proposed system
The proposed risk-estimation model of breast cancer is described in Figure 1. The BCSC dataset is obtained from http://www.bcsc-research.org/, and all risk factors are used. First, the dataset is normalized to ensure that all risk factors initially have the same effect on the final risk estimation. The normalization is done using Equation 1, where M is the number of risk factors. The normalization step makes the value of each risk factor range from 0 to 1.
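Equation 1 is not reproduced in this text. One common normalization that maps every risk factor to the range 0 to 1 is min-max scaling, so the sketch below assumes that form; the function name and the toy rows are illustrative and not taken from the source.

```python
def min_max_normalize(rows):
    """Scale each column of `rows` to [0, 1] with min-max scaling (assumed form).

    `rows` is a list of equal-length numeric lists, one value per risk factor.
    A constant column is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


# Toy example with M = 2 risk factors (illustrative category codes).
toy_rows = [[1, 0], [5, 2], [10, 1]]
scaled = min_max_normalize(toy_rows)
```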
The second step is balancing, in which the dataset is manipulated to balance the target categories. The original BCSC dataset has two target categories (0: no cancer, 1: cancer). The "1" category accounts for only 3.32% of all samples, while the "0" category accounts for 96.68%, which is why the BCSC dataset is extremely unbalanced. The rarity of the minor category in unbalanced datasets leads any classifier to generate inaccurate predictions because of inappropriate training (Somasundaram and Reddy, 2016). For the original BCSC dataset, a classifier can therefore produce more than 96% accuracy merely by recognizing all "0" class samples, even if all "1" class samples are incorrectly estimated as "0". To solve this problem, a balancing mechanism is applied. Three different balancing approaches were applied. The first is oversampling, in which the samples of the minor class are duplicated many times so that their percentage increases. Duplication will enhance the training significantly. The second approach is down-sampling, in which some of the majority-class samples are removed until their percentage decreases to the required value. In the last approach, the minor-class samples are duplicated and some of the majority-class samples are eliminated until the desired balance is achieved.
To perform the third step, the weighting algorithm is applied. To achieve an accurate weighting, two branches were followed. First, a questionnaire on the risk factors listed in the BCSC dataset was created. The aim of this questionnaire was to capture medical knowledge, so it was sent to 40 specialist physicians working in the field of cancer treatment and diagnosis. Afterward, the results of the questionnaire were analyzed to define the experts' opinion of the impact of each risk factor on the final cancer risk score. The second branch of this study investigated international medical reports, from which recent discussions of breast cancer risk factors were obtained to define the impact from another point of view.

Machine Learning Model selection
After getting the final impact (weight) of each risk factor, the final step is the selection of the ML model. Multiple ML prediction algorithms are available, but the optimizable tree model is chosen because of its ability to tune the hyperparameters, deal with missing or noisy data, and handle redundant attribute values (Apté and Weiss, 1997; Mantovani et al., 2018). The decision tree algorithm first considers all samples of the dataset as the root node. The basic challenge is selecting the best attribute for the root node and for each subsequent split: the algorithm evaluates a split on every attribute and selects the one with the best split performance. Decision trees compute the Information Gain (IG), as illustrated in Equation 2 (Kelleher, 2020), across all possible attributes and then choose the attribute with the highest IG. This means that the selected attribute is the one that separates the training samples the best.
$$IG(T, a) = H(T) - H(T \mid a) \qquad (2)$$
where H(T) is the entropy of the parent node of the tree T, H(T|a) is the entropy of the child node a (attribute a), k is the number of subsets generated by each split, pi is the percentage (probability) of class i in the node T, and pr(i|a) is the percentage of class i given that the split child (attribute) is a. For the optimizable tree classifier, three different parameters are tuned: the criterion (the attribute selection measure), the splitter (the split strategy) and the maximum depth of the tree.

Prediction Tool Design
After getting the final optimizable decision tree classifier, the tool is built based on the trained model. The tool is designed using MATLAB App Designer. Figure 2 shows the designed tool. After the values of all factors except "count" are entered, the tool searches the BCSC dataset for a record that matches the entered risk values. If a match is found, the corresponding count is used as the count of the test sample. Otherwise, the count is set to 1.

The preprocessing of the risk estimation dataset
Preprocessing of the risk estimation dataset includes two basic steps: normalization and balancing. Table 2 includes the results of the suggested balancing methods, where the majority class label is 0 (no cancer) and the minor class label is 1 (cancer risk).
For the oversampling approach, the "1" minor class has been duplicated five times until its percentage became 14.64%, while the majority-class percentage became 85.36%. For the down-sampling approach, we reduced the majority-class samples by a factor of 3.524 until the minor class reached 10.78% and the majority class 89.22%. For the last approach, we duplicated the minor-class samples and eliminated some of the majority-class samples until the minor and majority classes reached 17.1% and 82.9%, respectively.

Risk factor weighting results
In order to obtain the risk factor weights, the results of the questionnaire are analyzed and the questionnaire-based degree of importance (DOI Q i) of each risk factor is defined based on Equation 3, where H i and M i are the high-risk and medium-risk percentages of risk factor i shown in Table 3.
Based on the analysis of the questionnaire, the following results are inferred: The risk factors with the largest high-risk levels are the number of first-degree relatives with breast cancer (nrelbc) and hormone therapy.
Age, menopause, density and race are the risk factors with the largest medium-risk levels.
Hispanic, breast procedure (brstproc), and surgical menopause have the lowest risk levels.
Factors with a high DOI (more than 0.4) are nrelbc, age and hormone therapy. Some other risk factors, like age at first birth, menopause, density, Body Mass Index (BMI), last mammogram before the index mammogram (lastmamm) and race, have a medium DOI (between 0.3 and 0.4). Other risk factors with a low DOI (less than 0.3) are Hispanic, brstproc and surgical menopause.
The international medical reports, on the other hand, give a different view. We extracted the information about risk factors from these reports, then compiled and classified it according to the number of times each factor was mentioned in the list of essential risk factors (Ess_Numi) and in the list of secondary risk factors (Sec_Numi). The risk degree DOI R i was then calculated according to Equation 4, where n is the number of medical studies that have been analyzed and the denominator (4) is the maximum DOI. Based on the analysis of the previous studies and previous breast cancer medical reports, like Breast Cancer Facts & Figures (2019), Cancer Facts & Figures (2020), Breast cancer risk factors (2009) and Breast Cancer Risk and Prevention (2019), we assumed that 90% of the essential risk factor effect and 10% of the secondary risk factor effect are summed to constitute the final DOI R i value.
The final DOI (DOI F i ) is inferred from the medical questionnaire-based degree of importance (DOI Q i of Table  3) and the international medical reports-based degree of importance (DOI R i of Table 4) as Equation 5 suggests, while the suggested training weight (STW) in Table 5  ). The most significant risk factors, as Table 5 describes, are Age group, nrelbc and race while the medium significance risk factors are: Hormone therapy, agefirst, density, Menopause and BM. However, the least essential risk factors are Hispanic, brstproc, lastmamm and surgical menopause.
The effect of weighting the risk factors, compared with the non-weighted version of the dataset, is shown in Table 6. The results indicate that the performance increases by 6.9% after the weighting approach; similarly, the False Discovery Rate (FDR) is reduced by 22.6% and 3.2% for the minor and majority classes, respectively. The False Negative Rate (FNR) is likewise reduced for the majority and minor classes by 5% and 17.6%, respectively. FDR and FNR indicate the percentage of false positives and false negatives, respectively (Pawitan et al., 2005).
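For reference, both error rates can be read off a binary confusion matrix using the standard definitions FDR = FP / (FP + TP) and FNR = FN / (FN + TP), computed per class by treating that class as the positive one. The sketch below uses made-up counts, not the values of Table 6, and the function name is illustrative.

```python
def fdr_fnr(tp, fp, fn):
    """Return (FDR, FNR) for one class treated as the positive class.

    FDR = FP / (FP + TP): share of predicted positives that are wrong.
    FNR = FN / (FN + TP): share of actual positives that are missed.
    """
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fdr, fnr


# Illustrative confusion-matrix counts for the minor ("1") class.
tp, fp, fn, tn = 80, 20, 10, 890
minor_fdr, minor_fnr = fdr_fnr(tp, fp, fn)
# For the majority ("0") class the roles swap: TN acts as TP, FN as FP, FP as FN.
major_fdr, major_fnr = fdr_fnr(tn, fn, fp)
```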

Discussion
To check the results shown in Table 6, several test scenarios are constructed by removing one or more essential or non-essential risk factors; the accuracy and the classification errors of the optimizable tree-based classifier are then computed to check the validity of each scenario. Table 7 illustrates that the weighted version of the dataset performs better than the non-weighted one. Weighting the risk factors has increased the performance by 6.9%. The risk factors differ in their degree of importance (i.e., their effect in defining the final risk degree). Table 7 shows that the most influential risk factor is "Race", as the accuracy decreased by 4.3% after removing this factor. Removing other risk factors such as age at first birth (agefirst), age group, nrelbc, BMI and Hispanic also affects the performance significantly. When any one of the risk factors race, age group, agefirst, BMI or Hispanic is removed, the minor-class FNR increases. When combinations of risk factors such as (age and race) or (nrelbc, age and race) are removed, the performance degrades significantly, by 6.2% to 7.8%, and the minor-class FNR increases by 23% to 40%, which is a very large error rate (i.e., these factors are essential). However, some factors like menopause, surgical menopause (surgmeno) and hormone-therapy
From another point of view, missing the three risk factors (menopause, brstproc and surgmeno) decreases the accuracy by only 3.4%. So these factors have less impact than others on the final risk degree. To validate this conclusion, a down-weighting approach was applied in which each weak-impact risk factor is weighted by a factor smaller than 1 (0.2, 0.3, 0.5, etc.); the results are listed in Table 8. Scaling menopause by 0.5, for example, improves the validation accuracy by 0.1%. Scaling the other low-importance risk factors also improves the accuracy by 0.1% and reduces the FNR by 0.2%. In some cases, however, it increases the FNR of the minor class (because the minor class percentage is small), while at the same time reducing the FDR by 0.5% to 0.9%.
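The down-weighting step described above amounts to multiplying the normalized value of each weak-impact risk factor by its factor before training. The following is a minimal sketch under stated assumptions: the factor 0.5 for menopause is the example given in the text, while the other two factors, the function name and the sample record are illustrative.

```python
def down_weight(record, factors):
    """Return a copy of `record` with selected risk factors scaled by `factors`.

    Risk factors without an entry in `factors` are left unchanged.
    """
    return {name: value * factors.get(name, 1.0) for name, value in record.items()}


# 0.5 for menopause is the example from the text; the other two are illustrative.
weak_factors = {"menopause": 0.5, "brstproc": 0.2, "surgmeno": 0.2}

# Hypothetical normalized record (values in [0, 1], not real patient data).
sample = {"age": 0.5, "menopause": 1.0, "brstproc": 1.0, "surgmeno": 0.0}
weighted_sample = down_weight(sample, weak_factors)
```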
The same scaling technique used on the oversampled dataset is applied to the down-sampled and mixed ones. Figure 3 gives a detailed comparison of the performance of the scaling choice (age=4, race=3, agefirst=2, nrelbc=3, current hormone therapy (current_hor)=3, menopause=0.5, density=0.3, brstproc=0.2, lastmamm=0.3, surgmeno=0.2) over the three balanced datasets. Figure 3 shows that the down-sampled dataset has the highest accuracy (100%) and the lowest error rates (0%); however, the down-sampled dataset has only 27.15% of the volume of the over-sampled version. So although the down-sampled dataset has the best accuracy, the over-sampled and mixed versions are preferable, since they consist of a much larger number of samples, so new test samples are more likely to be classified correctly.
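The 27.15% figure can be cross-checked from the class shares and balancing factors reported for the dataset (3.32% minor class, five-fold duplication for oversampling, majority class reduced by a factor of 3.524). The sketch below is a back-of-the-envelope check, assuming the oversampled set keeps the full majority class and the down-sampled set keeps the minor class once; the function name is illustrative.

```python
# Class shares of the original dataset, in percent (from the text).
MINOR_PCT = 3.32
MAJOR_PCT = 96.68


def balanced_sizes(minor, major, minor_copies=1, major_reduction=1.0):
    """Return (minor_size, major_size) after duplication and/or reduction."""
    return minor * minor_copies, major / major_reduction


# Oversampling: minor class present five times, majority class untouched.
over_minor, over_major = balanced_sizes(MINOR_PCT, MAJOR_PCT, minor_copies=5)
# Down-sampling: majority class reduced by a factor of 3.524, minor class untouched.
down_minor, down_major = balanced_sizes(MINOR_PCT, MAJOR_PCT, major_reduction=3.524)

over_total = over_minor + over_major
down_total = down_minor + down_major

over_minor_share = over_minor / over_total   # close to the reported 14.64%
down_minor_share = down_minor / down_total   # close to the reported 10.78%
volume_ratio = down_total / over_total       # close to the reported 27.15%
```

The small differences from the reported percentages come from the rounding of the 3.32% and 96.68% inputs.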
In this research, the effect of weighting and selection of the risk factors has been studied. In addition, three versions of the balanced dataset were tested. The experiments proved that the weighting technique improved the accuracy and reduced the errors significantly. In future work, the weighting model will be used to generate a fuzzy risk factor score in the range (0-100) instead of a scalar risk score.

Author Contribution Statement
Khozama S. proposed the idea and design of the work. Mayya A. collaborated in editing the layout of the paper and coding some parts of the software. Both authors contributed to formulating the mathematical equations, writing the manuscript, and testing the software.