Predicting the Risk of Arterial Stiffness in Coal Miners Based on Different Machine Learning Models*
CHEN Qian Wei, HUANG Xue Zan, DING Yu, ZHU Feng Ren, WANG Jia, ZOU Yuan Jie, DU Yuan Zhen, ZHANG Ya Jun, HUI Zi Wen, ZHU Feng Lin#, and MU Min#
Coal is one of the world’s main energy resources, accounting for approximately 68% of China’s current total power generation. However, several studies have demonstrated that dust, exhaust fumes, and other harmful factors in coal mines increase the risk of cardiovascular disease (CVD) among miners[1]. Arterial stiffness (AS) is an independent risk factor for CVD, and epidemiological studies have shown that AS plays a vital role in assessing the risk of CVD[2]. Currently, pulse wave velocity (PWV) serves as the gold standard for assessing AS, and it is widely used in CVD screening and diagnosis[3]. Machine learning is an artificial intelligence technique that is widely used in disease diagnosis and prediction because it offers quick and accurate identification of risk factors and condition likelihoods[4]. Studies have shown that AS is associated with traditional CVD-related factors, such as blood pressure and lipids, as well as with coal dust and other harmful factors in coal mines[5]. Therefore, this study aimed to use these potential predictors to predict AS risk in coal miners using machine learning.
This study collected data from 1,535 coal miners employed by a major coal mining company in Shaanxi Province, China. After excluding individuals who did not meet the criteria or whose relevant information was incomplete, data on 1,443 coal miners were included in our study. The investigators used a unified standard questionnaire to collect respondents’ information. Data on height, weight, body mass index (BMI), blood pressure, and blood lipids were collected using standard conventional methods. PWV was measured using the Vascular Profiler BP-203RPEIII system (Omron, Japan).
R software version 4.2.2 was used for statistical analysis of the data and machine learning classification modeling. Count data were expressed as frequencies and percentages (%), and the χ2 test was used for comparison between groups; measurement data conforming to a normal distribution were expressed as mean ± standard deviation (x ± s) and compared by t-test, while data not conforming to a normal distribution were expressed as median and interquartile range, M (P25, P75), and compared by the rank sum test. P < 0.05 was considered statistically significant.
To ensure the quality of the data, the data were pre-processed before analysis by deleting duplicates and outliers, and the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the data. We included factors with significant differences or thought to be associated with AS in a LASSO regression analysis. The variables that were finally included in the prediction model were determined according to the optimal λ value obtained. The dataset was randomly divided into training (70%) and test (30%) datasets. Five different ML models were used to analyze the data: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Back Propagation (BP) neural network, and Classification and Regression Tree (CART). The predictive performances of the five ML models were evaluated by comparing their accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 value, and area under the receiver operating characteristic curve (AUC) on the dataset.
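The SMOTE balancing step mentioned above can be illustrated with a minimal sketch. The authors worked in R and did not publish their code, so the Python below is not their implementation; the function name `smote`, the parameter defaults, and the toy rows are all hypothetical. It shows only the core idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical toy rows of the form [age, SBP]; these are not data from the study.
minority = [[45.0, 130.0], [50.0, 135.0], [55.0, 142.0], [48.0, 128.0]]
new_rows = smote(minority, n_new=3, k=2)
```

Because every synthetic row is a convex combination of two real rows, its feature values always stay within the range observed in the minority class.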
A total of 1,443 eligible coal miners were included in this study. There were 651 cases (45.11%) in the non-AS group and 792 cases (54.89%) in the AS group. There were significant differences in age, BMI, pulse, systolic blood pressure (SBP), diastolic blood pressure (DBP), and other factors between the two groups (Supplementary Table S1), all at P < 0.05.
To identify suitable predictive variables, we employed LASSO regression with a cross-validated binomial deviance plot (Figure 1A) and a coefficient path plot (Figure 1B). The optimal λ value, corresponding to the lowest point on the loss function, was determined from Figure 1A. The variables with non-zero coefficients at the optimal λ value in Figure 1B were ultimately included as model variables. Consequently, the model’s predictive variables corresponding to the optimal λ value were determined to be age, pulse, SBP, whole blood high shear rate (HS), carbon dioxide-combining power (CO2CP), Cl, and TG.
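For reference, the selection step described above can be written in the standard form of L1-penalized logistic regression. The notation here is an assumption for illustration (n miners, binary AS outcome y_i, predictor vector x_i, and the common 1/n scaling of the log-likelihood); the source does not state which exact parameterization the authors' software used.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n}
    \left[ y_i \left( \beta_0 + x_i^{\top}\beta \right)
      - \log\!\left( 1 + e^{\beta_0 + x_i^{\top}\beta} \right) \right]
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\lambda_{\min} = \arg\min_{\lambda} \; \mathrm{CV\text{-}deviance}(\lambda)
```

As λ grows, the L1 penalty shrinks more coefficients exactly to zero; the predictors retained are those with non-zero coefficients at the λ that minimizes the cross-validated binomial deviance.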
Figure 1. LASSO regression screening of machine learning model predictors: (A) The process of selecting the most suitable λ in the LASSO model. (B) LASSO coefficient curve of the variable.
Numerous studies have established an association between these factors and CVD occurrence. The academic community has reached a consensus regarding the close relationship between age, hyperlipidemia, and CVD incidence, which is potentially attributed to vascular aging and AS[6]. The link between hypertension and CVD is well-documented. For example, Webb’s study demonstrated a relationship between blood pressure and AS, where higher DBP and SBP corresponded to a greater likelihood of AS occurrence[5]. Pulse has also been connected to CVD, with some researchers proposing its use as a predictor of such conditions[7]. Studies have also found associations between CO2CP and the occurrence and prognosis of CVD[8]. Moreover, HS serves as an important indicator of blood viscosity, and numerous studies have revealed that higher blood viscosity is associated with more severe AS and an increased likelihood of CVD[9]. Unfortunately, although we collected data on the exposure of coal miners to occupational hazards and tried to include them in our study, all occupational hazard variables were excluded when we used LASSO regression to screen predictor variables; therefore, the variables of the final predictive model were all physiological indicators. We speculate that there may be some mediating factors between occupational hazards and AS or CVD. We will attempt such an analysis in future studies to improve our research.
The five machine learning models were compared using various indicators to assess their predictive performance for the occurrence of AS in coal miners (Table 1). The results demonstrated that the RF model achieved the highest accuracy (83.6%), sensitivity (80.2%), specificity (86.3%), positive predictive value (82.2%), negative predictive value (84.7%), F1 value (0.812), and AUC (0.893) on both the training and test datasets.
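The indicators listed above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` is hypothetical and the counts are invented for illustration, not taken from the study's Table 1.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification indicators from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

# Invented counts for illustration only (AS = positive class).
m = classification_metrics(tp=80, fp=20, tn=90, fn=10)
```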
An AUC closer to 1 indicates a better predictive performance of the machine learning model. The AUCs of the five machine learning models on the dataset are shown in Figure 2. On the training dataset, the AUC of the RF model was significantly higher than those of the other models (Figure 2A). On the test dataset, the AUC value of the RF model was also the highest (0.893), indicating that the RF model has good prediction performance for AS (Figure 2B).
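To make the meaning of the AUC concrete: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The following sketch computes it directly from that definition; the function name `auc` and the labels and scores are hypothetical, not outputs of the study's models.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Invented example: 1 = AS, 0 = non-AS, scores = predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc_value = auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

This pairwise form is quadratic in the sample size, which is fine for illustration; practical implementations use a sort-based computation instead.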
The RF model is a classifier model based on decision trees that has been widely applied in the medical field. The bagging method it employs significantly enhances the accuracy of the model predictions. In our study, the RF model outperformed other machine learning models on the training dataset, exhibiting higher evaluation scores and superior performance on the test set. Therefore, based on our evaluation index analysis, we considered the RF model to be the most suitable machine learning model for predicting the risk of AS among coal miners.
The RF model has demonstrated exceptional performance in predicting various diseases or symptoms, likely owing to the following advantages: 1) ability to generate highly accurate classifiers, 2) capability to evaluate variable importance and build models accordingly, and 3) potential to estimate missing data and balance errors within a dataset[10].
In our study, we used simple physical examination data, such as age, pulse, blood pressure, and HS, to accurately predict the risk of AS among coal miners. Consequently, we were able to predict the risk of CVD among coal miners by asking a few questions and collecting a small number of blood samples, without relying on professional PWV detection equipment. To facilitate better use of our model, we provide the variable importance scores of the RF model in Supplementary Figure S1. In a decision tree model, the variable importance score can help readers better understand the value of these variables in predicting AS outcomes.
Nevertheless, our study had certain limitations. First, the outcome of our study was AS, which predicts CVD but does not directly indicate it. Second, the data used were obtained exclusively from a large coal mine in Shaanxi Province, which may affect the generalizability of our findings. Third, although we selected five machine learning models, there are numerous other widely used models, including Gaussian Naive Bayes (GNB), Multilayer Perceptron (MLP) neural network classification, and Complement Naive Bayes (CNB), which could be included in future studies.
Table 1. Efficacy results for the five ML models
Figure 2. AUCs of the five machine learning models: (A) training and (B) test dataset ROC curves.
This study represents the first attempt to employ various machine learning models to predict the risk of AS in coal miners. Among the five machine learning models examined, the RF model demonstrated the best predictive performance for AS in this population. Clinicians and public health practitioners can effectively utilize the RF model to assess the early-stage risk of AS among coal miners, enabling the implementation of appropriate preventive interventions. If readers are interested in our model, they can contact us via email, and we will provide the model and code.
There are no potential conflicts of interest to disclose.
MU Min and DING Yu conceived of and designed this study. ZHU Feng Ren contributed to the writing of the manuscript. DING Yu, WANG Jia, ZOU Yuan Jie, and DU Yuan Zhen contributed to the data retrieval and manuscript review. ZHU Feng Lin, ZHANG Ya Jun, and HUI Zi Wen contributed to data collection and collation. All authors made significant contributions to the research process of this manuscript and have read and approved the submitted manuscript.
&These authors contributed equally to this work.
#Correspondence should be addressed to ZHU Feng Lin, E-mail:, Tel: 19155445669; MU Min, E-mail:, Tel: 13655618753.
Biographical notes of the first authors: CHEN Qian Wei, male, born in 1995, Master, Teaching Assistant, majoring in occupational health; HUANG Xue Zan, male, born in 1997, Master Candidate, majoring in occupational epidemiology.
Received: July 10, 2023;
Accepted: October 20, 2023