A Systematic Approach of Data Collection and Analysis in Medical Imaging Research

Background: Systematically obtaining the right image dataset for medical imaging research is a tedious task. Anatomy segmentation is the key step before extracting radiomic features from these images. Objective: The purpose of the study was to segment the 3D colon from CT images and to measure the smaller polyps using image processing techniques. This requires a large number of samples for statistical analysis. Our objective was to systematically classify and arrange the dataset based on the parameters of interest so that empirical testing becomes easier in medical imaging research. Materials and Methods: This paper discusses a systematic approach to data collection and analysis before the data are used for empirical testing. In this research, the images were obtained from the National Cancer Institute (NCI). TCIA from NCI has a vast collection of diagnostic-quality images for the research community. These datasets were classified before empirical testing of the research objectives. The images in the TCIA collection were acquired as per the standard protocol defined by the American College of Radiology. Patients in the age group of 50-80 years were involved in various multicenter clinical trials. The dataset collection has more than 10 billion DICOM images of various anatomies. In this study, the number of samples considered for empirical testing was 300 (n), acquired in both supine and prone positions. The datasets were classified based on the parameters of interest, which makes dataset selection easier during empirical testing. The images were validated for data completeness as per the 2020b version of the DICOM standard. A case study of a CT Colonography dataset is discussed. Conclusion: With this systematic approach to data collection and classification, analysis becomes easier during empirical testing.


Introduction
, volume thresholding and gradient-based edge detection methods (Cai et al., 2013) and principal curvature (Lee et al., 2011). Effective colon segmentation at the mucous membrane still needs improved methods, because the base of the soft tissue structure is the key information for measuring a polyp's height and width (Lefere and Gryspeerdt, 2011). The dataset created in colon cancer screening has been archived by NCI (Smith et al., 2016; Johnson et al., 2008; Clark et al., 2013). There are other databases, such as BMIAXNAT (XNAT, 2020) and CERN's Zenodo repository (CERN, 2020).
The research objectives are colon segmentation, electronic cleansing of the tagged fecal matter, and measurement of the smaller polyps. Our objective was to systematically classify and arrange the dataset based on the parameters of interest so that empirical testing becomes easier in medical imaging research. The required images were downloaded from the TCIA CT Colonography collection on the National Cancer Institute website (NCI, 2020). This source is a vast collection with more than eight hundred patients scanned in a mass colon cancer screening (under the ACRIN 6664 protocol), and the ground truths are also available. This paper is organized in two sections. The first discusses the CTC datasets, the dataset selection methods, the curation of data, and validation against the DICOM standard; the second covers colon segmentation on different abdominal CT cases, along with the results.

A. Acquisition and Validation Methods
The CT Colonography data collection is made available by the Walter Reed Army Medical Center (WRAMC, Bethesda) in collaboration with NCI and NIH (courtesy: Dr. Richard Choi). It consists of anonymized images from a multicenter clinical trial that was part of colon cancer screening done at WRAMC and the Naval Medical Center, San Diego, USA. The American College of Radiology Imaging Network (ACRIN) and the American College of Radiology (ACR) jointly defined the protocol ACRIN 6664 (ACR, 2020; Johnson, 2016) for performing the CTC procedure. The protocol details are available in the article PMC2654614. The study included both symptomatic and asymptomatic male and female patients in the age group of 50-80 years. It excluded patients with symptoms of disease of the lower gastrointestinal tract and anemia, inflammatory bowel disease, familial polyposis syndrome, or colonoscopy within the previous five years. Scanning involved administering positive oral contrast (barium) to the patient for fecal tagging, with medium-dose and full-dose colon preparation; a breath-hold technique to avoid bowel peristalsis; and insufflation of air for colon distention. The abdomen was then CT scanned from the diaphragm to the pelvic region. Table 1 shows a summary of the image acquisition parameters from all the downloaded datasets.

B. Data Format and Usage Notes
The diagnostic-quality CTC images required for the research were downloaded from the National Cancer Institute (NCI, 2020). All images in the dataset are in DICOM format with .ima or .dcm as the file extension. On the website, the user selects the patient ID from the available list, and a manifest file with the .tcia extension is downloaded. This file opens in the NBIA Data Retriever tool, which is a Java-based application. The organization of the images in the dataset follows the sequence Patient->Study->Series->CTImage. Metadata sheets are also available; they contain abstracts of the radiology and optical colonoscopy reports describing polyp occurrence (of various sizes) in various segments of the colon. The confidence level of the radiologists who evaluated the polyps during colon cancer screening was recorded as least certain, intermediate, or most certain.
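The Patient->Study->Series->CTImage organization can be illustrated with a minimal Python sketch that nests image records by their identifiers. It assumes the header values have already been read into plain dictionaries; the keys PatientID, StudyInstanceUID, and SeriesInstanceUID are standard DICOM attribute keywords, while the record values, the "file" key, and the function name are hypothetical and used only for illustration.

```python
from collections import defaultdict


def group_by_hierarchy(headers):
    """Nest image records as patient -> study -> series -> list of file names."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for header in headers:
        patient = header["PatientID"]
        study = header["StudyInstanceUID"]
        series = header["SeriesInstanceUID"]
        tree[patient][study][series].append(header["file"])
    return tree


# Hypothetical header records; in practice these values come from each DICOM file.
headers = [
    {"PatientID": "P001", "StudyInstanceUID": "1.2.1",
     "SeriesInstanceUID": "1.2.1.1", "file": "supine_001.dcm"},
    {"PatientID": "P001", "StudyInstanceUID": "1.2.1",
     "SeriesInstanceUID": "1.2.1.1", "file": "supine_002.dcm"},
    {"PatientID": "P001", "StudyInstanceUID": "1.2.1",
     "SeriesInstanceUID": "1.2.1.2", "file": "prone_001.dcm"},
    {"PatientID": "P002", "StudyInstanceUID": "1.3.1",
     "SeriesInstanceUID": "1.3.1.1", "file": "supine_001.dcm"},
]

tree = group_by_hierarchy(headers)

# Walk the hierarchy and report how many images each series holds.
for patient, studies in tree.items():
    for study, series_map in studies.items():
        for series, files in series_map.items():
            print(patient, study, series, len(files))
```

Grouping at the series level keeps the supine and prone acquisitions of one patient apart, which is convenient when each position is processed as its own 3D volume.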

C. Data selection
An adequate number of samples is essential in empirical testing. Statisticians calculated the required number of samples for the research objectives (n=150). Systematic bias and sampling error are commonly reported problems of inappropriate sample design. Bias is a particular problem when working with retrospective data. Even though there was an option to select only datasets with polyps, bias was avoided by also selecting cases without polyps. The sampling error was kept relatively small by selecting a larger number of samples (n=180). Datasets were carefully selected based on the diagnostic quality of the image, optimal colonic distention, slice thickness (ST), kVp, pixel size, etc. There are more than 10,000 subjects (the population) across different anatomical sites available. The search was then narrowed to the CTC dataset only, which yielded the sample unit. Dataset selection used stratified sampling, in which the entire sample unit is divided into two homogeneous groups. Strata1 comprises patients with polyps and colon cancer, and strata2 those without either. From the population, six hundred samples (N=600) were collected, of which 540 patients had polyps and 60 did not. The required sample size was 150. With N=600, we got N1=540 and N2=60 (after Eq. 1). Thus the required sample sizes for strata1 and strata2 are 135 and 15, respectively, which is proportional to the sizes of the strata, viz. 540:60.
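The allocation of the 150 samples across the two strata can be checked with a short Python sketch. Eq. 1 is not reproduced in this text, so the sketch assumes the standard proportional allocation rule n_h = n * N_h / N, which reproduces the reported values of 135 and 15; the function and dictionary key names are hypothetical.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size.

    Assumed rule (not taken from the paper's Eq. 1): n_h = n * N_h / N.
    """
    total = sum(stratum_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in stratum_sizes.items()
    }


# Stratum sizes reported in the text: 540 patients with polyps, 60 without.
strata = {"strata1_with_polyps": 540, "strata2_without_polyps": 60}
allocation = proportional_allocation(strata, sample_size=150)
print(allocation)  # {'strata1_with_polyps': 135, 'strata2_without_polyps': 15}
```

Rounding each stratum independently is sufficient here because both products are whole numbers; with other stratum sizes the rounded values may need adjusting so that they still sum to n.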

D. Data collection
Datasets were collected for nearly five months through the questionnaire method. The key contact at NCI (help@cancerimagingarchive.net) clarified doubts over email. Search criteria (Table 2) included CT as the imaging modality, scans with and without oral contrast administration, ST of 1-3 mm, and availability of the dataset during 2002-2019. The images were carefully selected by looking into the CTC protocol followed (Johnson, 2016; Cash, 2010) and the data completeness. Three hundred datasets were downloaded based on the calculated sample size and the source of the polyp being the colon (discarding cases where the source was the rectum). As a first step, the authenticity of the data was checked: reliability (who collected the data, how, and when), suitability (less noise and good tissue contrast), and adequacy (completeness and compatibility of the DICOM images).

E. Data analysis and processing
E.1. Data analysis
CTC samples were manually checked for diagnostic quality (possible artifacts, as mentioned in Table 3). Eight of the 187 cases were discarded because their diagnostic quality was not up to the mark and such images are difficult to process, which reduced the samples to 179. Fifteen datasets were rejected due to metal artifacts (Figure 2c), motion artifacts, and quantum noise (Figure 2a), which reduced the dataset count to 164 (Figure 3). Editing of the DICOM files, for either the header details or the pixel details, was not encountered; it would also be unethical to modify the dataset. The images were contrast-corrected for underexposed and overexposed regions (which arise during CT image acquisition) without losing the soft tissue structure details on the CT images. Gamma correction was applied to convert the stored pixel values in DICOM to the native display system. Without this, the same image looks different on different display systems (Kagadis et al., 2013), which may lead to wrong interpretation. A prototype software application was developed that has the basic features of medical image processing applications (Manjunath et al., 2017). In addition to being tested in the prototype software, the images were checked in syngoFastView from Siemens (Siemens, 2019), DicomViewer from Philips (Philips, 2020), and the MITK software from DKFZ, Germany (GCRF, 2020). These applications provide basic features like windowing, MPR visualization, and surface rendering techniques.

E.2. Classification
After the total number of datasets was finalized, homogeneous groups of datasets were created based on essential parameters; this is called classification based on attributes.
With this classification, it becomes easy to select samples based on the parameters of interest when testing the developed image processing methods. An Excel sheet of the datasets and their image acquisition parameters was created for quick reference to the samples (Table 4). For example, to pick the datasets acquired at a specific kVp value, the kVp column is selected in the Excel sheet and filtered on that value. Selecting the parameters of interest in the Excel sheet in this way reduces the time otherwise spent searching the entire database.

F. Data Validation
DICOM image validation is a prerequisite step in any medical image processing research: before a dataset is used for empirical testing, the completeness and uniqueness of the header details must be checked. Even though the CTC images from NCI have passed data completeness verification (Smith et al., 2016), a DICOM validation framework was implemented (Figure 4) to safeguard against using data that are incomplete according to the latest standard (NEMA, 2020; Philips, 2018; Philips, 2013; Siemens, 2012). Each dataset was validated for type 1 and type 2 attributes. Type 3 validations were performed for only a few tags, as their values are not significant. Upon selection of a CT image series, the files are opened and their DICOM data elements are read using parallel processing. The slice location can be stored in two different tags: if the value is not available in the tag (0020, 1041), it is taken from the z component of the Image position tag (0020, 0032). After all images are read, they are sorted in ascending order of slice location (in the z direction) to check for missing slices. Then, across CT images, specific modules (Table 5)
Some of the manufacturer-specific private tags (Philips, 2018; Philips, 2013; Siemens, 2012) were also considered, as defined in the DICOM conformance statement, which was part of the DICOM standard 2015 prior release. Considering the private tags achieves backward compatibility across versions; old datasets might not work if these tags are ignored. The clinical trial and contrast bolus modules were not considered for validation, as they were removed when the data were anonymized. Seven datasets failed DICOM validation because their type 2 attributes were empty.

Results
In the tabulated data (
After dataset validation as per the DICOM standard, the 3D volume was reconstructed, the VOI (colon) was segmented, and the smaller polyps were measured. Exploratory research in polyp analysis is a potential application of this dataset, as is clinical validation. The datasets have pixel sizes along both the x and y axes in the range of 0.546875-0.9765625 mm and along the z axis in the range
(Figure 5d). Few studies (Song et al., 2014; Summers, 2010)
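The slice-ordering check described under Data Validation (fall back from tag (0020, 1041) to the z component of tag (0020, 0032), sort by slice location, then look for missing slices) can be sketched in Python as follows. This is an illustrative sketch, not the validation framework of this study: headers are assumed to be plain dictionaries keyed by the DICOM keywords SliceLocation and ImagePositionPatient, the function names are hypothetical, and the nominal slice spacing is assumed to be the smallest positive gap between neighbouring slices.

```python
def slice_location(header):
    """Return the slice location, falling back to the z of the image position."""
    value = header.get("SliceLocation")  # tag (0020, 1041)
    if value is None:
        # z component of Image Position (Patient), tag (0020, 0032)
        value = header["ImagePositionPatient"][2]
    return float(value)


def find_missing_slices(headers):
    """Sort slices along z and list the locations where slices appear to be missing."""
    locations = sorted(slice_location(h) for h in headers)
    gaps = [b - a for a, b in zip(locations, locations[1:])]
    positive_gaps = [g for g in gaps if g > 0]
    if not positive_gaps:
        return []
    spacing = min(positive_gaps)  # assumption: smallest gap is the nominal spacing
    missing = []
    for start, gap in zip(locations, gaps):
        steps = round(gap / spacing)
        # A gap spanning several nominal spacings implies (steps - 1) absent slices.
        for k in range(1, steps):
            missing.append(start + k * spacing)
    return missing


# Hypothetical series: the slice at z = 3.0 is absent and one header lacks SliceLocation.
series = [
    {"SliceLocation": 4.0, "ImagePositionPatient": [0.0, 0.0, 4.0]},
    {"SliceLocation": 0.0, "ImagePositionPatient": [0.0, 0.0, 0.0]},
    {"ImagePositionPatient": [0.0, 0.0, 2.0]},
    {"SliceLocation": 1.0, "ImagePositionPatient": [0.0, 0.0, 1.0]},
]
print(find_missing_slices(series))  # [3.0]
```

Using the smallest gap as the nominal spacing is a simplification; a series acquired with deliberately varying spacing would need the expected spacing supplied from the acquisition parameters instead.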

Discussion
Segmenting the VOI from the 3D volumetric data is an important step before polyp analysis. A new boundary-based semi-automatic colon segmentation method (Figure 5e-h) was developed, which works on the knowledge of colon distension grading (Manjunath et al., 2016). Figure 5 shows the results of segmentation. Figures 5e and 5f show the colon distribution on a DRR (artificial X-ray) image before and after segmentation, respectively. The segmentation results and the unsegmented volume are compared through DRR images. Figures 5g-5h illustrate the surface-rendered (with the marching cubes algorithm (Bourke, 2013)) and direct volume-rendered (with the Microsoft Volume Rendering Framework (Melancon et al., 2016)) images. Figures 5i-5l show endoluminal views of the colon interior: a smaller polyp (Figure 5i), a pedunculated polyp (Figure 5j), floating fecal matter (Figure 5k), and a polyp on the haustral fold (Figure 5l).
The work was implemented with Microsoft .NET Framework 4.7.2 and the C# programming language, using object-oriented design and multithreaded programming for parallel processing. The system workstation configuration is Intel Xeon® CPU E5-2620 2.0GHz, 64GB DDR3 RAM, NVidia 4GB GPU, Microsoft Visual Studio

Limitations of the study
Despite the vast dataset, the TCIA collection has limited samples of the CTC images acquired at different levels of kVp and with least slice thickness such as 0.625mm, 0.5 mm etc.. There were no images in the collection apart from 120kVP and 100kVp. Empirical testing of virtual colon cleansing required the images acquired with different kVp values. There are many datasets where the patient's body lies outside the scan field of view. It is a time-consuming task to process such images. Other image database from different University hospitals and government supported research centers are available freely for the research community. Few of these are shown in Table 6. NCI dataset is a source of inspiration for any researcher working in medical image processing. With this dataset, automated methods have been developed for DICOM data validation, colon segmentation, Electronic Cleansing, and smaller polyp measurement. As the dataset collection is too vast, the researcher should be careful in sample design and collection on which the statistical analysis of the results completely depends. Therefore it is essential to classify the dataset based on the attributes of interest and to prepare an index sheet that simplifies the empirical testing based on the parameters of interest. This approach even helps in continuing with machine learning of medical big data images. Further, the scope of the work is on other anatomical sites and other cancer types to develop decision-making systems and also on the brain tumor quantification using MRI dataset from TCIA-Glioblastoma collection. This study successfully researched the TCIA CT Colonography collection. It is good if the datasets with the least slice thickness images are also available.