Hybrid Modelling of Pulmonary Cancer Risk Prediction Using Classical Algorithms to Modern Machine Learning Techniques

Document Type : Research Articles

Authors

1 Symbiosis College of Nursing (SCON), Symbiosis International (Deemed University), Maharashtra, India.

2 B. Tech. in Artificial Intelligence and Machine Learning; Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, Maharashtra, India.

3 Tutor, Symbiosis College of Nursing (SCON), Symbiosis International (Deemed University), Maharashtra, India.

4 Department of Child Health Nursing, Bharati Vidyapeeth (Deemed to be University), Pune.

Abstract

Background: Despite significant advances in oncology, early diagnosis of pulmonary cancer remains a clinical challenge, which keeps the disease a leading cause of cancer-related mortality and a focal point for the development of data-driven prediction models. The objective of this study was to predict pulmonary cancer using hybrid machine learning models. Methods: This study presents a comprehensive review of machine learning (ML) algorithms to facilitate early prediction of pulmonary carcinoma using electronic medical record (EMR) data. The dataset, comprising 1000 patient records and 25 predictor variables, was subjected to rigorous pre-processing, including label correction, multicollinearity assessment, and dimensionality reduction. Eighteen statistically significant features, encompassing symptoms, lifestyle factors, and environmental exposures, were identified through variance inflation factor (VIF) analysis and chi-square testing. Multiple ML models, including Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), and Deep Learning (DL) classifiers, were trained and evaluated using precision, recall, F1 score, specificity, and AUC metrics. Results: The chi-square test revealed that age (χ²=44.187, p<0.001), passive smoking (χ²=752.960, p<0.001), obesity (χ²=712.088, p<0.001), smoking (χ²=671.006, p<0.001), and symptoms such as coughing blood (χ²=818.669, p<0.001) were significantly associated with pulmonary carcinoma. Most basic and ensemble models, including Decision Tree (DT), SVM, LR, k-Nearest Neighbors (KNN), AdaBoost, and RF, achieved perfect scores (accuracy, precision, recall, F1, AUC = 1.000), demonstrating optimal classification. DL and SVM Bagging showed 97% accuracy, while the neural network (NN) and multilayer perceptron (MLP) models performed well with accuracy above 96%, though slightly lower than the ensemble models. Conclusion: These findings highlight the potential of ML, especially SVM, for early prediction of pulmonary carcinoma using structured EMR data. They also support the integration of ML-based tools into clinical workflows, enabling data-driven, personalized cancer screening and decision-making in health care.
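The chi-square screening and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The function names `chi_square_statistic` and `binary_metrics`, the toy data, and the binary 0/1 coding of the outcome are all assumptions made for illustration. The abstract does not state how the outcome was coded. Only the test statistic and degrees of freedom are computed here. A p-value would additionally require the chi-square distribution, for example via `scipy.stats.chi2_contingency`.

```python
from collections import Counter


def chi_square_statistic(feature, outcome):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical sequences of equal length (hypothetical helper)."""
    n = len(feature)
    observed = Counter(zip(feature, outcome))  # cell counts of the contingency table
    row_totals = Counter(feature)
    col_totals = Counter(outcome)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def binary_metrics(y_true, y_pred):
    """Precision, recall, specificity, and F1 for labels coded 0/1
    (assumed coding; 1 is the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Toy data, invented for illustration only (not drawn from the study).
smoker = [1, 1, 1, 1, 0, 0, 0, 0]
cancer = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 0, 1, 1]

stat, dof = chi_square_statistic(smoker, cancer)
metrics = binary_metrics(cancer, predicted)
print(stat, dof)
print(metrics)
```

In a screening step of this kind, a feature would be retained when its statistic exceeds the critical value for the chosen significance level at the computed degrees of freedom.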
