Breast Cancer Diagnostic Efficacy in a Developing South-East Asian Country

Background: Breast cancer is increasing in prevalence amongst South East (SE) Asian women, highlighting the need for high quality, early diagnoses. This study investigated radiologists’ detection efficacy in a developing country (DC) and a developed country (DDC) in SE Asia, compared with that of Australian radiologists. Methods: Using a test-set of 60 mammographic cases, 20 containing cancer, JAFROC figures of merit (FOM) and ROC areas under the curve (AUC) were calculated, as well as location sensitivity, sensitivity and specificity. The test set was examined by 35, 15, and 53 radiologists from a DC, a DDC and Australia, respectively. Results: DC radiologists, compared to both groups of counterparts, demonstrated significantly lower JAFROC FOM, ROC AUC and specificity scores. DC radiologists had a significantly lower location sensitivity than Australian radiologists. DC radiologists also demonstrated significantly lower values for age, hours of reading per week, and years of mammography experience when compared with other radiologists. Conclusion: Significant differences in breast cancer detection parameters can be attributed to the experience of DC radiologists. The development of inexpensive, innovative, interactive training programs is discussed. This non-uniform level of breast cancer detection between countries must be addressed to achieve the World Health Organisation goal of health equity.


Introduction
Given that mortality and morbidity are highly reliant on early detection of small lesions, investigations to ensure optimum diagnostic efficacy are required.
Cancer detection depends on an individual reader's interpretation, with perceptual errors accounting for 60% of all diagnostic errors in radiology. The minimisation of perceptual errors is usually dependent on the radiologist's characteristics, such as specialisation and experience (Rawashdeh et al., 2013). It is also well-established that the effect of incorrect diagnosis of normal images (false positives) can result in significant emotional trauma, even when normality is subsequently shown (Rawashdeh et al., 2013). The use of test-set mammograms designed to test diagnostic efficacy has previously been instrumental in providing a deeper knowledge of the factors that influence the performance of radiologists world-wide (Rawashdeh et al., 2013; Mello-Thoms et al., 2014; Suleiman et al., 2014; Soh et al., 2016). However, despite the increase in breast cancer prevalence and the aggressive nature of these cancers in younger women (Trieu et al., 2015), The aim of this study is therefore to develop an understanding of the efficacy of DC radiologists involved in breast cancer diagnosis by using a test set methodology. Performance of DC radiologists will be compared with that of Australian readers as well as those reporting in a DDC.

Materials and Methods
A test set containing 60 mammographic examinations, 40 of which were normal and 20 of which demonstrated cancers, was used in this study. Cancer cases were verified using biopsy and normal cases were identified after a two-year follow-up. Each examination consisted of a two-view mammogram (cranio-caudal (CC) and medio-lateral oblique (MLO) projections) of both breasts. All Breastscreen Reader Assessment Strategy (BREAST) test images were acquired using digital mammography and were de-identified of all health record data. Cases with visible post-biopsy markers or surgical scars were excluded from the study.
The same test set was examined by 35, 15 and 53 radiologists from a DC, a DDC and Australia respectively. Each radiologist, without being informed of the prevalence of disease within the case set, read all images in his/her own native country.
Reading conditions were standardised through the following measures: reporting rooms had ambient lighting of no greater than 20 lux; mammograms were displayed using two high fidelity workstations, each driving two 5MP reporting monitors calibrated to the DICOM Greyscale Standard Display Function (GSDF). Online software, developed by the University of Sydney, was used to present the test set images at full native resolution. Demographic information was collected from each participant at the start of each reading session via a questionnaire, which covered the reader's age, years of experience reading mammograms, number of mammograms read per week and number of hours of mammographic reading per week. This information did not include any participant identifiers.
Radiologists were asked to localise all detected lesions and give each marked location a score of 1-5 indicating their level of confidence that a lesion was present: a rating of 1 indicated complete confidence that the case was normal and 5 complete confidence that a cancer was present. Post-processing tools and unlimited time were provided to each radiologist. All performance data was anonymised and no link between the performance data and individual readers was made.
Institutional and ethics approvals were granted for the study and informed consent was obtained from each reader. The need for obtaining informed consent from patients whose mammograms were used was waived by the New South Wales Cancer Institute.
Data gained from each individual radiologist were used to calculate the jackknife alternative free-response receiver operating characteristic figure of merit (JAFROC FOM), the area under the receiver operating characteristic curve (ROC AUC), sensitivity, location sensitivity and specificity.
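The case-based and location-based measures listed above can be illustrated with a minimal sketch. It is not the BREAST software's implementation: the `Mark` structure, the `reader_metrics` function, the recall threshold of 3 and the simplification of one lesion per cancer case are all assumptions made for illustration. An unmarked case is treated as a rating of 1 (normal), in line with the rating scale described earlier.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Mark:
    """One mark placed by a reader (hypothetical structure)."""
    rating: int      # confidence rating on the 1-5 scale
    on_lesion: bool  # True if the mark falls on the biopsy-proven lesion


def case_rating(marks: List[Mark]) -> int:
    """Case-level rating: the highest mark rating, or 1 if nothing was marked."""
    return max((m.rating for m in marks), default=1)


def reader_metrics(cases: List[Tuple[bool, List[Mark]]],
                   threshold: int = 3) -> Dict[str, float]:
    """Sensitivity, specificity and location sensitivity for one reader.

    `cases` holds (has_cancer, marks) pairs. A case counts as recalled when
    its case-level rating is >= `threshold` (an assumed cut-off). Each cancer
    case is assumed to contain one lesion, and the set is assumed to contain
    at least one cancer case and one normal case.
    """
    cancer = [marks for has_cancer, marks in cases if has_cancer]
    normal = [marks for has_cancer, marks in cases if not has_cancer]

    # Cancer cases recalled, regardless of where the reader marked.
    true_pos = sum(case_rating(m) >= threshold for m in cancer)
    # Normal cases correctly left below the recall threshold.
    true_neg = sum(case_rating(m) < threshold for m in normal)
    # Cancer cases where a mark at or above threshold sits on the lesion.
    located = sum(
        any(mk.on_lesion and mk.rating >= threshold for mk in m)
        for m in cancer
    )
    return {
        "sensitivity": true_pos / len(cancer),
        "specificity": true_neg / len(normal),
        "location_sensitivity": located / len(cancer),
    }


# Small made-up example: four cancer cases and four normal cases.
example_cases = [
    (True, [Mark(4, True)]),                    # recalled and correctly located
    (True, [Mark(3, False)]),                   # recalled, but wrong location
    (True, []),                                 # missed
    (True, [Mark(5, True), Mark(2, False)]),    # recalled and correctly located
    (False, []),
    (False, [Mark(3, False)]),                  # false positive recall
    (False, [Mark(2, False)]),                  # low-confidence mark, not recalled
    (False, []),
]
metrics = reader_metrics(example_cases)
```

In this made-up example the second cancer case contributes to sensitivity but not to location sensitivity, which is how a reader can recall the right woman for the wrong reason. Lowering `threshold` can only keep or raise sensitivity and keep or lower specificity.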
The ROC AUC is based upon whether a mammogram does or does not have cancer. ROC curves plot the true positive rate against the false positive rate; the area under the curve therefore represents the overall accuracy of a test, with a score of 1.0 indicating a perfect test with high sensitivity and specificity. A ROC AUC score of 0.5 represents zero discrimination (Lalkhen and McCluskey, 2008). However, ROC does not take into account lesion location and therefore JAFROC is used as a measure of radiologists' performance in detecting lesion location (Chakraborty and Yoon, 2009). These performance values as well as demographic data were then compared across radiologists' groups using the non-parametric two-tailed Kruskal-Wallis test followed by Dunn's multiple comparisons test to compare pairs of data (e.g. DC vs Australian data). GraphPad© PRISM software was used for all statistical comparisons and a P-value of <0.05 was considered to be significant.
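As an illustration of the two calculations described above, the following sketch computes a non-parametric (Mann-Whitney) estimate of the ROC AUC from case-level ratings and a Kruskal-Wallis H statistic across reader groups. It is a sketch under stated assumptions, not the procedure used in the study: the analysis was run in GraphPad PRISM, the AUC estimator actually used is not stated in the text, and all function names here are hypothetical. The P-value helper is valid only for three groups and relies on the chi-square approximation with 2 degrees of freedom. Dunn's post-hoc test is not implemented.

```python
import math
from typing import Dict, List, Sequence


def roc_auc(cancer_ratings: Sequence[float],
            normal_ratings: Sequence[float]) -> float:
    """Mann-Whitney AUC: chance that a cancer case outranks a normal case.

    Ties count as half. 1.0 is perfect separation, 0.5 is no discrimination.
    """
    wins = 0.0
    for c in cancer_ratings:
        for n in normal_ratings:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_ratings) * len(normal_ratings))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Kruskal-Wallis H statistic.

    Assumes every group is non-empty and the pooled values are not all equal.
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Correction factor for tied observations.
    counts: Dict[float, int] = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / correction


def p_value_three_groups(h: float) -> float:
    """Approximate P-value for exactly three groups.

    With three groups H is referred to a chi-square distribution with
    2 degrees of freedom, whose upper tail is exp(-h / 2).
    """
    return math.exp(-h / 2)


# Made-up per-reader scores for three groups of three readers each.
example_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(example_groups)
p_val = p_value_three_groups(h_stat)
```

For the made-up groups above, H works out to 7.2, giving an approximate P-value of exp(-3.6), which is below the 0.05 significance level used in this study. With group sizes as small as in this toy example the chi-square approximation is rough, so the figure is illustrative only.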

Results
Results are shown in Tables 1 and 2, whilst the significant findings are summarised below.

Performance Metrics
As shown in Table 1, the JAFROC analysis demonstrated significantly lower scores for DC radiologists compared with their counterparts in both the DDC and Australia (P < 0.0001). The DC radiologists' ROC scores were significantly lower than those of the Australian (P = 0.0003) and DDC (P = 0.01) radiologists.
Whilst no significant differences were seen for the sensitivity values, the location sensitivity analysis yielded several statistically significant results: Australian radiologists demonstrated higher scores than those from both the DC (P < 0.0001) and the DDC (P = 0.01). Radiologists' scores in the DDC were higher than those of their DC counterparts (P = 0.0079).
With regards to specificity, readers from the DDC demonstrated significantly higher scores than those from the DC (P = 0.008). Table 2 shows that radiologists from the DC have lower values for age (P < 0.0001), hours of reading per week (P ≤ 0.0004) and years of mammography experience (P < 0.0001) when compared with their counterparts from the DDC and Australia. In addition, the readers from the DC read fewer mammographic cases per week (P < 0.0001) than the Australian readers.

Discussion
This work examined the performance of radiologists based in a developing country when asked to diagnose breast cancer using mammographic images. As a baseline we compared their diagnostic efficacy against that of radiologists in two countries with mature breast imaging training programs: a typical westernised country (Australia) and a developed country located in SE Asia.
The DC radiologists displayed a significantly lower location sensitivity than both their DDC and Australian counterparts. Location sensitivity refers to the ability to accurately locate lesions, which has previously been linked to radiologists' ability to detect smaller, more difficult lesions (Mello-Thoms et al., 2014). Importantly, this large difference in location sensitivity could have harmful implications for subsequent biopsy location and outcome actions for patients in developing countries. It may be tempting to highlight that this study showed no difference between the radiologist groupings for case-based sensitivity and that this metric is a key feature of any screening program; however, this result must be considered in light of the specificity results. Specificity for the DC radiologists was lower than for the other two groupings (although only significantly so when compared to the DDC radiologists), which would imply that DC radiologists are recalling more women than DDC radiologists. This means they are sliding up their ROC curve, thus potentially inflating the case-based sensitivity figures at the expense of accurately recognising the normal cases. The unintended harmful effects of unnecessary recalls include long-term impacts on the patient's psychological well-being when compared with those who were not recalled (Brodersen and Siersma, 2013). Differences in the location sensitivity and specificity of the DC radiologists and their counterparts, and the subsequent effects on JAFROC and ROC values, can be attributed to their experience in reading mammograms. Our study shows that the DC radiologists were significantly younger than their counterparts in both the other countries, in addition to having fewer years of experience reading mammograms. These results confirm the findings of other authors who have linked experience with performance: the most experienced radiologists had better location sensitivity scores than the least experienced radiologists (Rawashdeh et al., 2013; Suleiman et al., 2014).
Experience is a determining factor in the risk of false positives, with younger and less experienced radiologists more likely to have higher recall rates (Elmore et al., 1998; Reed et al., 2010; Alberdi et al., 2011; Hawley et al., 2016), and with direct relationships between reader volume (Rawashdeh et al., 2013; Reed et al., 2010) and radiologist performance. Clearly, therefore, ensuring that radiologists have experience specifically in breast reading is critically important. However, this cannot be achieved simply by allowing many young radiologists the freedom to continually report on many different types of images. Some level of specialism is required to develop and fine-tune the reader skills required to achieve good levels of diagnostic efficacy for this radiologic domain. The significant differences identified in this experiment also highlight the need for effective training and reading strategies to minimise the variance in radiologic diagnosis between DC radiologists and their Australian and DDC counterparts. In the situation where readers have low levels of experience and less dedicated time devoted to reading mammograms, one approach would be the development of innovative, interactive training programs that impose little expense and inconvenience. For example, the BREAST training programs (Suleiman et al., 2014), available to radiologists in several countries including Australia, New Zealand, Singapore and the Middle East, would allow clinicians wherever they are located to log in confidentially, in their own environment, and test their ability to diagnose mammograms using several available test sets. Immediate feedback is available with details on performance levels as well as clear, localised information provided on reader-specific errors for each image diagnosed. As shown to be the case elsewhere (Suleiman et al., 2014), this approach should increase the efficacy of the DC radiologists in demonstrated areas of weakness and has been shown to be a promising method in training radiologists and residents both by the current authors and others (Suleiman et al., 2014; Poot and Chetlen, 2016). Another potential solution would be to have radiologists performing double readings with consensus decisions on cases, as is the case in most non-US westernised countries. This approach has been shown to increase detection rates whilst minimising recall rates (Anttinen et al., 1993); however, the authors acknowledge that there are resource implications associated with this approach as well as a potential delay in diagnoses, so the feasibility of such an approach in a developing country would need to be fully evaluated. Alternatively, utilising computer assisted diagnosis (CAD) as a first reading may be more feasible in DCs. However, the use of CAD even with experienced radiologists has been shown to increase recall rates (Bargalló et al., 2014). Finally, establishing National Accreditation Standards around the minimum number of readings per year needed to be proficient at diagnosing breast cancer may also be a useful strategy. In countries such as Australia, it is clearly defined that breast reading radiologists should read a minimum of 2,000 cases per year (Reed et al., 2010), with larger numbers being stated elsewhere. To achieve such numbers, however, it may be necessary to reduce the number of individuals in DCs reading mammograms so that the minimum number of readings can be achieved per radiologist.
There are a number of limitations of this study that need to be acknowledged. It is noted, for example, that this study's results rely on a test set methodology to describe performance rather than clinical audit data; however, work elsewhere has shown a strong agreement in performance across these two environments (Soh et al., 2013). Also, reader performance may be attributed to DC radiologists interpreting mammograms from unfamiliar populations, since all the images came from Australian clinics where, for example, the mammographic density may be lower than that typically seen in their own country. It is important therefore that future studies develop test-sets based on the populations that radiologists most commonly work with. Furthermore, for statistical robustness, the BREAST test-set has a much higher rate of cancer than typically found in clinical situations and such higher prevalence may affect performance, although the impact of such prevalence has previously been shown to be minimal (Reed et al., 2010).
In summary, this work has shown that mammographic diagnostic efficacy in a DC may not be as high as levels demonstrated within developed countries. This discrepancy may be linked to experience levels and some corrective strategies have been suggested. The solution, however, requires a collaborative approach that embraces educational, professional and regulatory components.