Machine Learning Models for Disease Prediction: Leveraging Blood Test Data

Data Collection and Processing

The data used in this study were collected from inpatient records of the Neurology and Cardiology Departments, together with health check-up data from healthy individuals, at the First Affiliated Hospital of Xiamen University between 2018 and 2023. Sourced directly from the hospital information system, the dataset encompasses routine blood and biochemical tests. For patients, we selected the first blood test results after hospitalization as features for model development; for healthy individuals, we used the first blood test data from their annual physical examinations. To mitigate the impact of missing values on prediction accuracy, features with a missing-data rate exceeding 50% were excluded, leaving 22 features from routine blood tests and 28 from biochemical analyses (detailed in Supplementary Tables 1 and 2). Patient diagnoses were assigned according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10). To ensure adequate sample sizes for each circulatory system disease, conditions with fewer than 100 cases were removed, and samples missing more than 50% of feature values were likewise excluded. The final dataset comprised 25,794 healthy individuals and 32,822 patients diagnosed with circulatory system diseases, forming the basis for model construction (Fig. 1; Table 1). This dataset was randomly partitioned into a training set (70%) and a validation set (30%).
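The filtering and splitting steps described above can be sketched as follows. This is a minimal illustration on a synthetic DataFrame; the feature names and missingness pattern are hypothetical stand-ins for the hospital data, not the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for the dataset: rows are individuals, columns are
# blood-test features plus a binary label (0 = healthy, 1 = disease).
df = pd.DataFrame(rng.normal(size=(200, 10)),
                  columns=[f"feat_{i}" for i in range(10)])
# Make the first feature mostly missing to trigger the 50% filter.
df.iloc[:, 0] = np.where(rng.random(200) < 0.7, np.nan, df.iloc[:, 0])
df["label"] = rng.integers(0, 2, size=200)

# Exclude features with a missing-data rate above 50%.
feat_cols = [c for c in df.columns if c != "label"]
keep = [c for c in feat_cols if df[c].isna().mean() <= 0.5]
df = df[keep + ["label"]]

# Exclude samples missing more than 50% of the remaining features.
df = df[df[keep].isna().mean(axis=1) <= 0.5]

# Random 70/30 partition into training and validation sets.
train, valid = train_test_split(df, test_size=0.3, random_state=42)
```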

Table 1 Data distribution of diseases

Fig. 1 The flow chart of this study, outlining the data collection, processing, and analysis methodology used in the research.


Machine Learning Methodologies for Model Construction

In this study, we explored several supervised machine learning techniques to construct predictive models. These methodologies, including Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and Deep Neural Networks (DNN), are widely recognized for their efficacy in classification and prediction tasks. While unsupervised learning constructs models by identifying patterns in unlabeled data, our approach leverages supervised learning, which relies on labeled datasets to train models for specific prediction outcomes. Each of these algorithms offers unique strengths in feature optimization and model construction.

Logistic Regression, a generalized linear model, is frequently employed in data mining and disease diagnosis due to its ability to estimate the probability of an event. Random Forest, an ensemble of decision trees, excels in both classification and regression, demonstrating robustness to outliers and noise. Support Vector Machines, another powerful classifier, define decision boundaries by maximizing the margin between data categories. eXtreme Gradient Boosting (XGBoost), an optimized gradient boosting algorithm, is known for its efficiency and flexibility in various data science applications. Deep Neural Networks (DNNs), with their multiple hidden layers, can model complex nonlinear relationships, offering enhanced abstraction and learning capabilities compared to shallow networks.
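The five model families above can be compared on synthetic data as a minimal sketch. To keep the example self-contained in scikit-learn, `GradientBoostingClassifier` stands in for XGBoost and `MLPClassifier` for the DNN; these substitutes and all settings are illustrative, not the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),    # stand-in for XGBoost
    "DNN": MLPClassifier(max_iter=1000, random_state=0),  # stand-in for a DNN
}

# Fit each model and score it by validation-set AUC.
aucs = {name: roc_auc_score(y_va, m.fit(X_tr, y_tr).predict_proba(X_va)[:, 1])
        for name, m in models.items()}
```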

Feature optimization is inherent in each of these algorithms. LR utilizes regularization, RF reduces noise impact through ensemble averaging, SVM employs kernel functions, XGBoost optimizes feature usage via gradient boosting, and DNNs automatically learn and refine features through deep layers. To rigorously compare these methods, we selected LR, RF, SVM, XGBoost, and DNN for model construction [21,22,23,24,25].

Prior to model training, we standardized both training and validation datasets to eliminate feature scale effects. Hyperparameter optimization for each algorithm was performed using grid search cross-validation (CV) combined with manual fine-tuning. Specific parameters tuned for each model included: LR (C, max_iter, penalty, solver), RF (max_depth, min_samples_leaf, n_estimators), SVM (C, gamma, kernel), XGBoost (colsample_bytree, gamma, learning_rate, max_depth, n_estimators, subsample), and DNN (activation, number of layers, neurons per layer). Optimal parameter sets were determined within the training set, using 5-fold cross-validation and Area Under the Curve (AUC) as the primary performance metric (Supplementary Data 1).
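As a sketch of the tuning procedure for one of the models, the snippet below combines standardization and logistic regression in a pipeline and runs 5-fold grid search with AUC as the scoring metric. The grid values here are illustrative examples, not the parameter ranges used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Standardize features, then fit LR, inside one pipeline so the scaler is
# refit on each CV training fold (avoiding leakage into validation folds).
pipe = Pipeline([("scale", StandardScaler()),
                 ("lr", LogisticRegression(max_iter=1000))])
grid = {"lr__C": [0.01, 0.1, 1, 10], "lr__penalty": ["l2"]}

# 5-fold cross-validated grid search, scored by AUC.
search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
search.fit(X, y)
best_params, best_auc = search.best_params_, search.best_score_
```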

Models were implemented in Python using scikit-learn (version 1.3.0) for LR, RF, and SVM; the xgboost package (version 2.0.2) for XGBoost; and tensorflow (version 2.0.2) for the DNN.

Model Performance Evaluation Metrics

To rigorously assess the performance of each constructed model, we employed a suite of evaluation metrics on the validation set. These metrics included Sensitivity (Sn), Specificity (Sp), Positive Predictive Value (PPV), Negative Predictive Value (NPV), F1 score, Matthews Correlation Coefficient (MCC), and Accuracy (Acc). The mathematical formulations for these metrics are as follows [26,27,28]:

$$\text{Sn} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{Sp} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

$$\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

$$\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}$$

$$\text{Acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{TN} + \text{FP}}$$

$$\text{F1 score} = \frac{2\text{TP}}{2\text{TP} + \text{FN} + \text{FP}}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Furthermore, we utilized the Area Under the ROC curve (AUC) for a comprehensive evaluation of model performance. To ascertain model robustness, all performance metrics were calculated on the validation set using bootstrapping to derive 95% Confidence Intervals (CI) [29,30,31].
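The metrics above, and a percentile-bootstrap 95% CI, can be computed directly from the confusion-matrix counts. The labels below are randomly generated for illustration; the bootstrap size of 1,000 resamples is an assumption, not necessarily the study's setting.

```python
import numpy as np

def metrics(y_true, y_pred):
    """Compute Sn, Sp, PPV, NPV, Acc, F1, and MCC from binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Acc": (tp + tn) / (tp + fn + tn + fp),
        "F1": 2 * tp / (2 * tp + fn + fp),
        "MCC": (tp * tn - fp * fn) / np.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)  # ~80% correct

# Percentile bootstrap: resample validation cases with replacement and take
# the 2.5th and 97.5th percentiles of the metric as the 95% CI.
boots = []
for _ in range(1000):
    idx = rng.integers(0, 500, 500)
    boots.append(metrics(y_true[idx], y_pred[idx])["Acc"])
ci_lo, ci_hi = np.percentile(boots, [2.5, 97.5])
```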

Model Interpretation using SHAP Values

To address the inherent black-box nature of machine learning models and elucidate feature contributions, we integrated the SHAP (SHapley Additive exPlanations) algorithm. SHAP assigns a value to each feature, quantifying its impact on the model’s prediction [32]. SHAP values were computed using the shap python package (version 0.44.0). This interpretability enhances our understanding of how each feature influences disease prediction.
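In the study, SHAP values were computed with the shap package. Purely as a conceptual illustration of what a Shapley value is, the sketch below enumerates exact Shapley values for a toy linear model by averaging each feature's marginal contribution over all coalitions of the other features; this brute-force enumeration is only feasible for a handful of features, and the weights, background, and instance are hypothetical.

```python
import itertools
import math
import numpy as np

# Toy linear model and a single instance to explain.
w = np.array([2.0, -1.0, 0.5])
background = np.zeros(3)        # reference input (e.g. feature means)
x = np.array([1.0, 2.0, 3.0])   # instance being explained

def f(z):
    """Model prediction for input z."""
    return float(w @ z)

def shapley(i, n=3):
    """Exact Shapley value of feature i by enumerating coalitions."""
    others = [j for j in range(n) if j != i]
    val = 0.0
    for k in range(n):
        for S in itertools.combinations(others, k):
            z_with, z_without = background.copy(), background.copy()
            for j in S:                     # fix coalition features to x
                z_with[j] = x[j]
                z_without[j] = x[j]
            z_with[i] = x[i]                # add feature i on top
            weight = (math.factorial(k) * math.factorial(n - k - 1)
                      / math.factorial(n))
            val += weight * (f(z_with) - f(z_without))
    return val

phi = [shapley(i) for i in range(3)]
# For a linear model with zero background, phi_i = w_i * x_i, and the
# values sum to f(x) - f(background) (the SHAP additivity property).
```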

Feature Identification for Cardiovascular Diseases

To pinpoint specific hematological and metabolic features characteristic of different cardiovascular diseases (CVDs), we applied the SHAP algorithm across 69 models designed to differentiate between disease types. SHAP values were left unnormalized, preserving their raw impact magnitudes, and visualized in a heatmap. Hierarchical clustering was applied to both rows and columns of the heatmap, and row and column order was adjusted according to the clustering outcomes. The heatmap was generated in Python, providing a visual representation of feature importance across the different CVD models.
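The row/column reordering step can be sketched with scipy's hierarchical clustering: cluster rows and columns of the SHAP-value matrix separately, then permute the matrix by the resulting leaf orders. The matrix here is random stand-in data, and average linkage is an assumed choice.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(2)
# Stand-in for a (models x features) matrix of mean SHAP values.
M = rng.normal(size=(8, 6))

# Hierarchically cluster rows and columns, then reorder the matrix so
# similar models/features end up adjacent, as in a clustered heatmap.
row_order = leaves_list(linkage(M, method="average"))
col_order = leaves_list(linkage(M.T, method="average"))
M_ordered = M[np.ix_(row_order, col_order)]
```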

Furthermore, to explore universal features relevant across multiple diseases, we identified the top ten features from each of the 69 models and constructed a network graph linking these features to their corresponding diseases. Feature node size in the network was scaled according to its frequency of appearance within the top ten feature lists, highlighting potentially universal features for disease discrimination. Cytoscape (version 3.10.2) was used for network visualization [33]. This network-based approach aids in identifying key biomarkers for cardiovascular disease prediction.
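Counting how often each feature recurs across the per-model top-ten lists, which determines node size in the network, reduces to a frequency tally. The disease labels and feature abbreviations below are purely illustrative, not results from the study.

```python
from collections import Counter

# Hypothetical top-feature lists from three of the 69 per-disease models.
top_features = {
    "disease_A": ["MCV", "RDW", "HDL"],
    "disease_B": ["RDW", "LDL", "HDL"],
    "disease_C": ["RDW", "CRP", "MCV"],
}

# Node size in the network scales with how often a feature recurs
# across the models' top-feature lists.
freq = Counter(f for feats in top_features.values() for f in feats)
```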
