Feature selection and machine learning approach for carotid atherosclerosis in asymptomatic adults

2022-11-03

Medical Data Mining, 2022, Issue 4

Tao Liang, Qiao-Li Wang, Xiao-Qin Liu, Zhen Zhou, Scott Lowe, Zi-Heng Chen, Chen-Yu Sun

1 Department of Gastroenterology, People's Hospital of Deyang City, Deyang 618000, China. 2 Department of Physical Examination Center, People's Hospital of Deyang City, Deyang 618000, China. 3 Menzies Institute for Medical Research, University of Tasmania, Hobart, TAS 7000, Australia. 4 College of Osteopathic Medicine, Kansas City University, Kansas City, MO 64106, USA. 5 Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA. 6 AMITA Health Saint Joseph Hospital Chicago, University of Illinois Chicago, Chicago 60657, USA.

Abstract Objective: The presence of carotid atherosclerosis reflects the overall atherosclerotic load and may predict cardiovascular and cerebrovascular accidents. Studies have reported risk factors for carotid atherosclerosis. However, few practical models have been established to predict carotid atherosclerosis risk. Thus, this study was conducted to investigate important features of carotid atherosclerosis and to propose a machine learning-based method for predicting carotid atherosclerosis in asymptomatic adults. Methods: A cross-sectional study was conducted using routine medical check-up data of individuals from January 2019 to January 2020. Pearson's correlation analysis was performed to correlate the features. Features were then selected with Python's Feature-selector library and analyzed with three algorithms. Three machine learning algorithms, Decision Tree, Random Forest, and Logistic Regression (LR), were used to predict the risk of carotid atherosclerotic plaques, and their precision, accuracy, recall, F1-score, and area under the curve were compared. Results: A total of 150 individuals were enrolled in this study, of whom 30 (20%) had carotid atherosclerotic plaques. Sex, age, body mass index, total cholesterol, systolic blood pressure (SBP), and carbohydrate antigen 724 (CA724) were independently correlated with carotid atherosclerosis. Pepsinogen I and pepsinogen II serum levels were not correlated with carotid intima-media thickness or pulse wave velocity. SBP, diastolic blood pressure, age, low-density lipoprotein, pepsinogen I, pepsinogen II, body mass index, waist, CA724, and uric acid together reached a cumulative importance of 0.9, and SBP was the most crucial feature for carotid atherosclerosis. The LR algorithm, with a precision of 0.92, recall of 0.91, F1 of 0.9, and area under the curve of 0.95, showed the optimal performance in predicting the presence or absence of carotid atherosclerosis in asymptomatic adults. The code for the analysis in this article was uploaded to GitHub (https://github.com/ganbingliangyi/machine-learning). Conclusions: SBP was the most crucial feature in the feature ranking, and the LR algorithm showed the optimal performance in predicting the presence or absence of carotid atherosclerosis in asymptomatic adults.

Keywords: machine learning; feature selection; gastric biomarkers; carotid atherosclerosis; asymptomatic adults

Background

Carotid atherosclerosis is a systemic disease whose pathology is mainly characterized by carotid intima thickening. It is a risk factor for stroke and can be diagnosed with imaging tests such as ultrasound [1-3]. However, carotid atherosclerosis is often diagnosed only after the patient presents with neurovascular symptoms, such as syncope, rather than at an early stage through routine screening. This is mainly because it remains uncertain which asymptomatic patients are at higher risk of poor outcomes and would benefit most from early screening [4].

In recent years, artificial intelligence (AI) has been a technological breakthrough and has contributed to the analysis of clinical data in biomedical fields such as cardiovascular medicine [5]. Machine learning (ML) is a subset of AI that can automate decision-making and predict outcomes based on patient data [6]. Fan et al. predicted the risk of carotid atherosclerosis using ML and found that logistic regression (LR) showed the optimal performance (area under the curve (AUC) 0.809, accuracy 74.7%, and F1-score 59.9%) in predicting carotid atherosclerosis [7].

Serum gastric biomarkers, such as pepsinogen I (PGI) and pepsinogen II (PGII), are essential parameters in clinical screening for atrophic gastritis and gastric cancer. Previous studies found a positive correlation between gastric biomarkers and atherosclerosis [8]. However, to our knowledge, no previous study has explored serum gastric biomarkers as features for predicting carotid atherosclerosis risk in asymptomatic adults using an ML approach. Therefore, this study was conducted to identify ML approaches for predicting carotid atherosclerosis risk in asymptomatic adults. This could provide a new theoretical foundation for future research on the screening and diagnosis of carotid atherosclerosis.

Material and Methods

Study participants

Our study included individuals who underwent annual health examinations at the People's Hospital of Deyang City between January 2019 and January 2020. Inclusion criteria: (1) age of 18-80 years; (2) individuals who underwent carotid duplex ultrasonography. Exclusion criteria: (1) mental illness or communication disorders; (2) patients with tumors. The Ethics Committee of the People's Hospital of Deyang City approved this study (No. 2022-04-135). De-identified retrospective data collected during the health screening process were used.

Methods

Demographic data acquisition. Data from 150 individuals were collected. These data included sex, age, lifestyle factors, waist, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting blood glucose, uric acid (UA), total cholesterol, triglycerides, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, pulse wave velocity (PWV), PGI, PGII, and CA724. Blood pressure, height, weight, and waist circumference were measured according to standard operating procedures.

A standard questionnaire was administered by trained staff to obtain data on lifestyle risk factors, including cigarette smoking (defined as smoking ≥1 cigarette/day during the past 30 days, having smoked ≥100 cigarettes in a lifetime, or still smoking during the study) and alcohol consumption (defined as drinking ≥500 g of alcohol/week for ≥1 year) [9,10].

Carotid ultrasonography and atherosclerotic tests. Senior physicians measured carotid intima-media thickness (CIMT) by B-mode ultrasound (Philips IU22, Philips Healthcare). Carotid atherosclerosis was defined as CIMT ≥ 1.3 mm, with or without atherosclerotic plaque [11,12].

Trained medical practitioners measured PWV with an automated device (Beijing Chioy Medical Technology, Model VBP-9T), with the subject lying supine at rest.

Machine learning-based diagnostic model

Data preprocessing. Data were preprocessed before the machine learning models were trained. Records with missing or highly heterogeneous values were removed, and binary variables were encoded as 0 or 1.
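The preprocessing step described above can be sketched as follows. This is only an illustration under stated assumptions: the column names, the toy records, and the choice to drop every incomplete row are hypothetical and are not taken from the study's dataset or code.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
raw = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "smoking": ["yes", "no", "no", None],
    "sbp": [132.0, 118.0, None, 125.0],
    "plaque": ["yes", "no", "no", "no"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and encode binary variables as 0/1."""
    clean = df.dropna().copy()
    # Assumed coding: female = 0, male = 1; no = 0, yes = 1.
    clean["sex"] = clean["sex"].map({"female": 0, "male": 1})
    for col in ("smoking", "plaque"):
        clean[col] = clean[col].map({"no": 0, "yes": 1})
    return clean.reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

Dropping incomplete rows is the simplest reading of "removed"; with a sample as small as 150, imputation would be an alternative, but the source does not say that it was used.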

Feature and model selection. Feature selection was performed using the Feature-selector library, a tool for ML datasets (https://github.com/WillKoehrsen/feature-selector). Model selection was performed using three ML algorithms: Decision Tree (DT), Random Forest (RF), and LR. Among the individuals enrolled, 85% were randomly selected as the training sample to develop the model, and the remaining 15% served as the testing sample to test the model. We set the model parameters to LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False), RandomForestClassifier(n_estimators=100, max_depth=5, oob_score=True, class_weight="balanced", random_state=1), and DecisionTreeClassifier(criterion='entropy', max_depth=5).
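A minimal sketch of the 85%/15% split and the three classifiers is given below, assuming scikit-learn. The data are synthetic stand-ins generated at random, and the split seed (random_state=0) is an assumption, since the source does not report one. Only the non-default parameters listed above are passed. The remaining listed LogisticRegression values are either scikit-learn defaults or are omitted on purpose, multi_class in particular, because it has been deprecated in recent scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 150 individuals, 16 features (NOT the study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 16))
y = (X[:, 0] + rng.normal(size=150) > 0.8).astype(int)

# 85% training sample, 15% testing sample; the seed is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

models = {
    "LR": LogisticRegression(C=1.0, solver="liblinear", max_iter=100, tol=1e-4),
    "RF": RandomForestClassifier(
        n_estimators=100, max_depth=5, oob_score=True,
        class_weight="balanced", random_state=1,
    ),
    "DT": DecisionTreeClassifier(criterion="entropy", max_depth=5),
}

# Fit each model on the training sample and score it on the testing sample.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

print(X_train.shape, X_test.shape)
print(test_accuracy)
```

With 150 individuals, a 15% testing sample holds only 23 records, so any test-set metric computed this way rests on a small number of cases.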

Predictive performance measurements. Several evaluation metrics were used to compare the three machine learning models: the receiver operating characteristic (ROC) curve and its AUC value, accuracy, precision, recall, and F1 value. The ROC curve and AUC evaluate the overall performance of classification and prediction [13].
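As an illustration of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function name, labels, and scores are hypothetical examples, not study outputs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count a win when the positive outranks the negative; a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = carotid atherosclerosis) and predicted scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of 9 pairs ranked correctly -> 0.889
```

This pairwise form is equivalent to the area under the ROC curve and makes clear that the AUC depends only on how the scores rank the cases, not on any particular decision threshold.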

Precision was the ratio of true positive samples to all samples predicted as positive. Recall was the ratio of true positive samples to all samples that were actually positive. F1 was the harmonic mean of precision and recall. Accuracy was the number of correctly predicted samples divided by the total number of samples [13].

All models were built in the Python environment (version 3.9.0) using the sklearn, numpy, pandas, matplotlib, seaborn, and scipy packages.
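These four metrics can be written out directly from the confusion-matrix counts, as in the sketch below. The helper name and the label vectors are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical example: 3 true cases, 5 non-cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example one case is missed and one non-case is flagged, so precision, recall, and F1 all equal 2/3 while accuracy is 0.75, which shows how accuracy can look better than the other metrics when positives are the minority class.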

Statistical analysis

Statistical analysis and ML algorithms were implemented in the Python programming language, version 3.9.0 (http://www.python.org). Categorical variables are presented as cases (percentage). Normally distributed variables (e.g., age, BMI) are presented as mean ± standard deviation. The correlations between characteristics were estimated using Pearson's correlation analysis and multiple linear regression analysis. Multivariate analysis was performed using a logistic regression model. P<0.05 was considered statistically significant.

Results

Baseline characteristics of included subjects

A total of 150 individuals aged 30-77 years met the inclusion criteria, with a mean age of 53.90 ± 8.84 years and a male-to-female ratio of 1.78:1. Among them, 37 (24.7%) had hypertension, and 30 (20%) had carotid atherosclerotic plaques (Table 1).

Correlation of features

The correlations between characteristics were estimated using Pearson's correlation analysis (Figure 1). Serum PGI level was not correlated with CIMT or PWV (P=0.296 and P=0.518, respectively), nor was serum PGII level correlated with CIMT or PWV (P=0.172 and P=0.466, respectively). CA724 level was positively correlated with CIMT (R=0.188, P=0.021) but was not correlated with PWV (R=0.037, P=0.651).
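As a worked check of how a Pearson coefficient maps to its P value, the sketch below converts R to a t statistic with n - 2 degrees of freedom and takes the two-sided tail probability. It assumes that all 150 participants contributed to each correlation, which the source does not state explicitly, and the helper name is hypothetical.

```python
from math import sqrt

from scipy import stats


def pearson_p_value(r: float, n: int) -> float:
    """Two-sided P value for a Pearson correlation r from n paired samples."""
    # Under the null hypothesis, t = r * sqrt(n - 2) / sqrt(1 - r^2)
    # follows a Student t distribution with n - 2 degrees of freedom.
    t = r * sqrt(n - 2) / sqrt(1 - r * r)
    return 2 * stats.t.sf(abs(t), df=n - 2)


p_cimt = pearson_p_value(0.188, 150)  # CA724 vs CIMT
p_pwv = pearson_p_value(0.037, 150)   # CA724 vs PWV
print(round(p_cimt, 3), round(p_pwv, 3))
```

Under that assumption the two results come out at roughly 0.021 and 0.65, in line with the reported values, which suggests the correlations were computed on the full sample.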

Ranking of influencing carotid atherosclerosis feature

The presence of carotid atherosclerosis was the target variable. The feature analysis library "Feature-selector" from GitHub was used to analyze the data and rank the features affecting carotid atherosclerosis from high to low importance. The feature ranking was then reviewed and screened from the perspective of clinical practice, and the final results can be regarded as risk factors for carotid atherosclerosis.

Table 1 Baseline characteristics of participants (N=150)

Using the 'Feature-selector' library, we further ranked the 16 features for carotid atherosclerosis (Figure 2). SBP, age, low-density lipoprotein (LDL), PGI, BMI, waist, CA724, UA, PGII, and DBP together reached a cumulative importance of 0.9 (Figure 3). SBP contributed the most to the carotid atherosclerosis outcome. The code for the analysis in this article was uploaded to GitHub (https://github.com/ganbingliangyi/machine-learning).
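The cumulative-importance cutoff can be illustrated with a short sketch: importances are normalized to sum to 1, sorted in descending order, and features are kept until the running total reaches the threshold. The importance values and the helper name below are hypothetical placeholders chosen for illustration. They are not the study's results and this is not the Feature-selector library's own code.

```python
def select_by_cumulative_importance(importances, threshold=0.9):
    """Return the top-ranked features whose normalized importances
    first reach the cumulative threshold, plus the total reached."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total
        if cumulative >= threshold:
            break
    return selected, cumulative


# Hypothetical raw importance scores (NOT the values behind Figures 2 and 3).
toy_importances = {
    "SBP": 40, "Age": 25, "LDL": 15, "BMI": 12, "Smoking": 5, "Alcohol": 3,
}
selected, reached = select_by_cumulative_importance(toy_importances, 0.9)
print(selected, round(reached, 2))  # ['SBP', 'Age', 'LDL', 'BMI'] 0.92
```

Features that fall below the cutoff add little to the total importance, which is why a cumulative threshold is a common way to shorten a feature list without choosing a fixed number of features in advance.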

Prediction model of machine learning

Table 2 evaluates and compares the results of the three models, DT, RF, and LR, in terms of accuracy, precision, recall, and F1 value. ROC curves were drawn for each model: DT in Figure 4, RF in Figure 5, and LR in Figure 6.

Among the three algorithms, the LR algorithm had the best performance, with a precision of 0.92, recall of 0.91, F1 of 0.9, and AUC of 0.95. The LR model was superior to the other algorithms in recall, F1, accuracy, and AUC, showing the best classification and prediction capabilities. After evaluating model performance on these metrics in combination with the clinical situation, the study chose the LR model to predict carotid atherosclerosis.

Discussion

In this study, SBP, age, LDL, PGI, BMI, waist, CA724, UA, PGII, and DBP were important features for carotid atherosclerosis in asymptomatic adults. These findings are similar to those of previous studies, in which sex, age, and SBP were associated with the risk of carotid atherosclerosis [14,15]. In contrast to traditional statistical analysis, the present study used a feature selection tool to obtain the importance of factors relevant to carotid atherosclerosis. In addition, our study ranked the features influencing carotid atherosclerosis, and SBP, age, LDL, PGI, and BMI were the five highest-weighted features. This information is essential for guiding the prevention of carotid atherosclerosis.

It has been proposed that gastrin, PGI, and PGII are positively correlated with carotid atherosclerosis in patients with H. pylori infection [16]. However, our study showed that serum PGI and PGII were not correlated with carotid atherosclerosis in H. pylori-negative individuals. We also found that serum CA724 was weakly correlated with CIMT.

With rapid advances in recent years, AI-based techniques have gained popularity and have been more widely applied in medicine, particularly in medical imaging and decision support systems [17-19]. Saba et al. reviewed the use of AI technology to assist in the diagnosis of atherosclerotic plaque [20]. In another study, features derived from CT images of the carotid arteries were used to train several ML algorithms. The support vector machine algorithm achieved an accuracy of 0.88, with a sensitivity of 0.90 and a specificity of 0.86 [21].

Figure 1 Pearson correlation heatmap of the characteristics. BMI, body mass index; SBP, systolic blood pressure; DBP, diastolic blood pressure; UA, uric acid; GLU, glucose; TC, total cholesterol; TG, triglycerides; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; CA724, carbohydrate antigen 724; PGI, pepsinogen I; PGII, pepsinogen II; PWV, pulse wave velocity; CIMT, carotid intima-media thickness.

Figure 2 Feature importance based on feature permutation for carotid atherosclerosis. BMI, body mass index; SBP, systolic blood pressure; DBP, diastolic blood pressure; UA, uric acid; GLU, glucose; TC, total cholesterol; TG, triglycerides; LDL, low-density lipoprotein; HDL, high-density lipoprotein; PGI, pepsinogen I; PGII, pepsinogen II.

Figure 3 Cumulative importance versus the number of features

Table 2 Comparison of the models in terms of precision, recall, F1 value, and AUC

Figure 4 Receiver operating characteristic curve for the Decision Tree (DT) model. AUC, area under the receiver operating characteristic curve; DT, Decision Tree.

Figure 5 Receiver operating characteristic curve for the Random Forest model. AUC, area under the receiver operating characteristic curve.

Figure 6 Receiver operating characteristic curve for the Logistic Regression model. AUC, area under the receiver operating characteristic curve.

Jian Yu et al. built ML models to diagnose carotid atherosclerosis using RF, DT, support vector machine, extreme gradient boosting, and multilayer perceptron with more than a dozen features. Among them, the multilayer perceptron, an artificial neural network, obtained the highest accuracy (0.748), F1 score (0.742), and AUC (0.766) [22].

In this study, carotid atherosclerosis was predicted using three ML models: DT, RF, and LR. The LR algorithm performed best in model evaluation, with a precision of 0.92, recall of 0.91, F1 score of 0.9, and AUC of 0.95. These indicators are better than those reported in previous studies [23].

Our study has several limitations. Firstly, it is a cross-sectional study rather than a randomized controlled trial. We used test sets to evaluate the models, but the randomized trial is the most widely accepted evaluation method in clinical research. Secondly, this study only included people who received annual physical examinations, and these participants are generally healthier than those who do not. Thus, our study cohort might not be representative of the general population. Thirdly, some input features could affect the model's accuracy, and the model's false-positive and false-negative predictions should be further analyzed in the future. Fourthly, the sample size of this study is small; larger, multicenter clinical studies are still needed.

Conclusion

Our results demonstrated that serum PGI and PGII are not correlated with CIMT or PWV. However, PGI showed value as a predictive feature for carotid atherosclerosis. Furthermore, SBP was the most crucial feature in the feature ranking. The LR algorithm, with a precision of 0.92, recall of 0.91, F1 of 0.9, and AUC of 0.95, showed the optimal performance in predicting carotid atherosclerosis in asymptomatic adults. Our study may offer an early warning system that allows non-imaging prediction of carotid atherosclerosis in asymptomatic adults.