APP下载

Heart Disease Diagnosis Using the Brute Force Algorithm and Machine Learning Techniques

2022-08-24JunaidRashidSaminaKanwalJungeunKimMuhammadWasifNisarUsmanNaseemandAmirHussain

Computers Materials&Continua 2022年8期

Junaid Rashid,Samina Kanwal,Jungeun Kim,*,Muhammad Wasif Nisar,Usman Naseem and Amir Hussain

1Department of Computer Science and Engineering,Kongju National University,Cheonan,31080,Korea

2Department of Computer Science,COMSATS University Islamabad,Wah Campus,47040,Pakistan

3School of Computer Science,University of Sydney,Sydney,NSW 2006,Australia

4Centre of AI and Data Science,Edinburgh Napier University,Edinburgh,EH11 4DY,UK

Abstract: Heart disease is one of the leading causes of death in the world today.Prediction of heart disease is a prominent topic in the clinical data processing.To increase patient survival rates,early diagnosis of heart disease is an important field of research in the medical field.There are many studies on the prediction of heart disease,but limited work is done on the selection of features.The selection of features is one of the best techniques for the diagnosis of heart diseases.In this research paper,we find optimal features using the brute-force algorithm,and machine learning techniques are used to improve the accuracy of heart disease prediction.For performance evaluation,accuracy,sensitivity,and specificity are used with split and cross-validation techniques.The results of the proposed technique are evaluated in three different heart disease datasets with a different number of records,and the proposed technique is found to have superior performance.The selection of optimized features generated by the brute force algorithm is used as input to machine learning algorithms such as Support Vector Machine(SVM),Random Forest(RF),K Nearest Neighbor(KNN),and Naive Bayes(NB).The proposed technique achieved 97% accuracy with Naive Bayes through split validation and 95%accuracy with Random Forest through cross-validation.Naive Bayes and Random Forest are found to outperform other classification approaches when accurately evaluated.The results of the proposed technique are compared with the results of the existing study,and the results of the proposed technique are found to be better than other state-of-the-art methods.Therefore,our proposed approach plays an important role in the selection of important features and the automatic detection of heart disease.

Keywords: Heart;disease;brute force;machine learning;feature selection

1 Introduction

Heart disease is a life-threatening disease that can lead to heart failure.Heart disease is an important disease that affects heart function and creates problems such as reduced blood vessel function and coronary artery infection[1,2].Heart disease is also known as a cardiovascular disease that causes death around the world.The heart is a muscular organ that pumps blood throughout the body.Heart disease refers to a variety of problems that affect the heart and blood arteries.Examples of various types of heart disease include coronary artery disease,angina,heart attack,and heart failure.Coronary heart disease(CHD)is one of the leading causes of illness and mortality in today’s society.The cost of treating coronary artery disease is a significant financial burden,making prevention of coronary artery disease a critical step in therapy.A heart attack occurs when a coronary artery suddenly becomes blocked,usually by a blood clot.The various types of cardiovascular disease are high blood pressure,coronary artery disease,heart valve disease,and stroke.According to the world health organization,17.7 million individuals passing brought about the cardiovascular disease[3].It is fundamental that people determine how to understand and control cardiovascular disease unintended factors,for example,a healthy diet,physical exercise,and a doctor’s medication for blood pressure,cholesterol,and weight.To anticipate heart disease in an early stage,a handful of deaths can be prevented.

The diagnosis of heart disease involves numerous factors that complicate the task of a physician.Basic properties utilized for heart sickness are age,gender,fasting blood sugar,chest pain type,resting ECG,huge fluoroscopic shad vessels,test blood pressure(hypertension),serum cholesterol(coronary disease hazard),thalach(greatest pulse),ST gloom,fasting glucose,exang(including angina),smoke,hypertension,Food inclination,weight,height and stiffness [4].Chest pain,arm pain,slow and lightheadedness,fatigue,and sweating are some of the early warning signs of a heart attack [5].For diagnosing heart conditions,data mining and ensemble techniques are used,whose finding is most credible and precise [6,7].Due to numerous risk factors,such as cholesterol,diabetes,high blood pressure,and many other factors,diagnosing the disease in patients is challenging.Patients with heart disease will not feel ill in the early stages of the disease,but they do in the later stages.Then it is too late to recover harm [8].There are some other methods which are used for human activity using LSTM(Long short-term memory) [9].Machine learning methods are also considered for the prediction of Coronavirus disease [10].So American health association indicated that expanses of health care related to heart disease are estimated to double by 2030[11].Every country has more hospitals and more patient records due to the growth of the population.Most hospitals maintain patient health information,but it is hardly used for decision making because sometimes the doctor does not have enough time to examine patient data from the large database.To extract relevant information from a huge database,data mining techniques are needed.

Data mining is referred to as information finding in the database since it incorporates several techniques in various fields such as machine learning,neural networks,and retrieval of information.To extract information,data mining is essential to find a hidden pattern,build analytical structures,perform clustering,classification,and regression with meditation.The application mining method should identify patterns at different levels of abstraction [12].In medical research,there is extensive raw data that needs to be processed and then used to predict heart disease.Data analysis may involve machine learning algorithms and other data mining methods.In[13],higher accuracy is expected to be achieved,but it is not easy.For this purpose,some techniques and future selection methods to help the clinical industry achieve high accuracy.The data mining methods that are utilized in the medical sector need less time to diagnose heart disease with accurate results[14].Seven machine learning algorithms are used with three feature selection methods to identify healthy and unhealthy patients.This analysis evaluated complete features,reduced feature set,and positive arc classifier under ARC computed for execution time and reduction in accuracy of the effect of features on classifiers[15].The application of machine learning brings a new dimension to medical diagnosis,drug discovery processes,and the field of radiation therapy and eliminates the training cost to predict the disease.

Predictions of heart disease approaches have been developed in previous research using the Internet of Things(IoT)platform.To predict heart failure,patient information is analyzed with a support vector machine(SVM).IoT devices are used to collect heart data,such as pulse,blood temperature,and blood pressure (BP).The method is used successfully for the detection of heart failure.The suggested solution recognizes cardiovascular disorders quickly;however,accuracy decreases when a significant amount of data is used[16].A fuzzy analytical hierarchy(FAHP)technique is developed to assess the risk of a heart attack.The authors estimate the weights on the basis of many parameters that influence cardiovascular growth.Only where a high risk of heart attack has been observed,the proposed system requires monitoring.The device successfully identifies the initial results for medical professionals before prescribing expensive clinical tests.This technology reduces the cost of goods and the amount of resources used[17].

Previous studies show that previous approaches for heart disease prediction are not sufficient.These studies are carried out to restrict the selection of features for algorithms.Therefore,the identification of significant features and machine learning methods are essential to improve the performance of heart disease.The objective of this research work is to find significant features and machine learning algorithms to diagnose and improve the accuracy of heart disease performance.The selection of these significant features is important to eliminate redundant and inconsistent data.After that,analysis of various data from clinical centers and machine learning algorithms are useful to improve heart disease performance.The following are the main contributions of this study paper:

• In this paper,significant features are selected to remove redundant and inconsistent data using the brute-force algorithm that increases the efficiency of the prediction model.

• We find a superior performance machine learning algorithm using cross-validation and split validation that will be applied to the diagnosis of heart disease.

• Heart disease is predicted using three commonly used state-of-the-art datasets,namely Statlog,Cleveland,and Hungarian.

• The experimental results showed that the proposed technique achieved 97% accuracy with Naive Bayes through split validation and 95% accuracy with Random Forest through crossvalidation and the results are better than previous state-of-the-art methods.

• This research study offers credible assistance to health and medical professionals with significant modifications in the healthcare industry,and immediate resolution is obtained during the diagnosis of disease.

The rest of the article is structured as follows.The proposed methodology is shown in Section 2.Section 3 describes the performance evaluation.Section 4 describes the experimental results,including a discussion and comparative analysis.Section 5 discusses the conclusion and future work.

2 Proposed Methodology

This section consists of four main phases,namely data preprocessing techniques,feature selection method,validation method,and machine learning algorithms.The implementation of the proposed model methodology is illustrated in Fig.1.

Figure 1:The framework of the proposed methodology

2.1 Preprocessing

Pre-processing is the first phase of the diagnostic method.It requires three steps:replace missing attributes and remove redundancy,convert multiclass to binary class,and separate.If most of the attribute values of a patient match,a value is replaced by a similar position.In datasets,some instances have blank entries that are handled by preprocessing.By eliminating redundant(irrelevant)attributes,eliminating redundancy reduces the size of the data.Silent data helps diagnose patterns associated with heart disease effectively.The noise reduction procedure improves the process of identifying heart disease.Data preprocessing is done by removing missing values,removing redundancy,and converting the multiclass class to binary to extract appropriate and usable data from the datasets.The Cleveland dataset contains 303 entries with 13 attributes with missing values and one attribute as a final class attribute.They require preprocessing,noise reduction,and replacement of missing data.After applying to preprocessing,297 records are used in the implementation.The meticulous value of the Cleveland dataset(0 for absence and 1,2,3,4 for occurrence)is converted to a binary class in a preprocessing step.The Statlog dataset consists of 270 records with 13 attributes with no missing values.The class attribute‘num’has two values 1 and 2 converted to 0 and 1 before implementation.The Hungarian dataset consists of 261 records with 11 attributes after missing values are removed.The class attribute‘num’has two values 0 and 1.

2.2 Feature Selection

To assess how each attribute participates in predicting performance,a feature selection technique is used.Feature selection is a search technique used to identify a subset of features.The method of reducing the feature to a minimal subset of features is called feature selection.It is a widely used optimization method and problem in machine learning[18].

Feature selection is an important method to use prior to model construction to minimize data complexity by excluding irrelevant and useless features from the data.Feature selection methods reduce the size of data,allowing to train and test the model in less time.Feature selection is advantageous,as it reduces data size,reducing processing time,space,and power use [19].In this research study,a brute-force algorithm extracts the features and selects the optimal feature that gives a better impact on accuracy than other feature selection techniques.Irrelevant features can cause overlap.Our goal,for example,is to conclude the connection between symptoms and their corresponding diagnosis in the field of medical diagnosis.The brute force feature selection process tests all possible combinations of the input functions and finds the best subset.The brute force algorithm attempts to solve a problem through a large number of patterns.It is a straightforward approach to solving a problem.Typically,this approach requires a direct calculation based on the problem statement.Select and compare the number with all other numbers.The method of choosing brute force comprehensively analyzes all possible variations of the input features and then finds the best subset.The computational cost of an exhaustive search is prohibitively high,with a significant risk of overfitting.The approach to choosing brute force features is to carefully evaluate all possible input combinations and then find the correct subset.

She didn t come on the 9:18 either, nor on the 9:40, and when the passengers from the 10:02 had all arrived and left, Harry was looking pretty desperate. Pretty soon he came close to my window so I called out and asked him what she looked like.918,940。1002,。,,。

In this research study,feature selection is done first using Algorithm 1.The optimized feature set is used as input for well-organized and established classifiers.Two loops are used in Algorithm 1.The first loop is used to find the subset of features and the second loop is used to evaluate the performance of the brute-force algorithm.This comprehensive feature selection strategy is a wrapper for the evaluation of brute-force feature subsets.The algorithm will choose each combination,compute its score,and then determine the optimal combination based on its score

Algorithm 1:A Brute Force Algorithm for optimal feature selection.1.Collect a training collection of data from the particular domain.2.Shuffle the dataset.3.Split it into P partitions.4.for each partition(j=0,1,...,P-1)a.Let ExternalTrainset(j)=all partitions except j.b.Let ExternalTestset(j)=j’th partition.c.Let InternalTrain(j)=70%of the externalTrainset(j)arbitrarily selected.d.Let InternalTest(j)=the remaining 30%of the externalTrainset(j).e.for(k=0,1,...,m)Search for the best feature set with k components,f sjk using leave-one-out on InnerTrain(j)(i)Let IntenalTestScorejk=RMS score of f sjk on InternalTest(j).End loop of(k).f.Choice the f sjk with the highest internal test ranking.g.Let ExternalScorej=RMS score of the selected feature set on ExternalTestset(j)End of a loop of(j).5.Return the mean External Score.

2.3 Validation Method

In this study,we used two validation techniques:K-fold cross-validation and split validation.Cross-validation of K-folds is a method of separating data into subsets for training and model testing.This method is often used for prediction models to determine how the built model implements unseen data.Typically,a model is trained in predictive analytics on an accessible dataset with the required class labels.When the model is designed,it is used to test a new dataset with unknown output labels to assess the achievement of the model[19].Cross-validation is a resampling method that tests the learning of machine learning models on a small-scale dataset.A single parameter called k is used to calculate the number of different classes in a given dataset.Split validation consists of two subprocesses:training and a test process.The training process is used to build a model.The learned model is then added to the test process.In this research,both cross-validation and split validation are used to train and test the model.

2.4 Machine Learning Algorithms

This section explains the machine learning algorithms.After the selection of features,the machine learning algorithms are used to find the best prediction performance for heart disease.

2.4.1 Random Forest

Random Forest is an ensemble-based learning method not only for classification but also for regression and many other decision-based tasks.The random forest deals with missing values and is effective for large datasets.Combine many decision trees in a single model.The overfitting is not faced if the tree is enough.In prediction and likelihood estimation,the RF algorithm has been used.RF has several decision-making trees.Any judgment tree gives a vote that shows the judgment on the class of the object.Three essential tuning parameters are available in the random forest(1)Number of nodes(tree n)(2)Minimum node size(3)Number of features used to split each node(4)Number of features used to split each node for each tree.

2.4.2 Support Vector Machine

SVM is a supervised learning technique and an associated learning algorithm that analyzes data for regression and classification techniques.Identify the optimal route.SVM separates the data into two classes by creating a separate line.It is a good and fast algorithm that is equally useful for both classification and regression.In a vast dimensional space,SVM generates a hyperplane,or a set of a hyperplane used for regression,classification,and various other tasks.

SVM is based on the concept of structural risk minimization(SRM),which states that a machine learning algorithm should attempt to reduce structural risk rather than empirical risks to achieve good generalization results[20].

Naive Bayes is a form of data extraction that reveals the effectiveness of classifying patients diagnosed with heart disease[21].It is known as the simplest Bayesian network model.By applying the Bayesian theorem with the strong independent assumption,the naive Bayesian is considered a simple probabilistic classifier.It is a high-bias,low-variance classifier.Building a good model is useful for relatively small datasets.Naive Bayesian areas are text categorization,spam detection,sentiment analysis,and recommended systems.It is suitable for both numerical and nominal data with multiple dimensions.

To find the most probable possible classifications,Naive Bayes relies on probability theory[22].Naive Bayes classifier assumes,in simple terms,that the presence or absence of a specific class attribute is independent of the presence or absence of some other class attribute.It is also used to determine the posterior possibilities of the findings and to make better decisions about probability[23].

2.4.4 K-Nearest Neighbor

In feature space,the K-NN classification depends on the closest training examples [24].K-NN is a comparison technique that takes place between a training example and an unknown example.Closeness based on distance formulas in an n-dimensional space,an attribute in the training dataset,defines it.Second,K-NN ranks based on a majority vote in the neighbors they meet.The most common Euclidean distances are used to calculate the distance of all neighbors.K-NN implies that identical cases exist side by side and closest neighbors contribute to the final class forecast[25].It is a simple method to implement in the shortest possible time.

3 Performance Evaluation

This section describes the datasets and evaluation metrics used in our experiments.

3.1 Dataset Description

The datasets used for the construction of predictive models from an online UCI (University of California-Irvine) repository [26].Various datasets are available in the UCI repository.Three heart disease datasets,such as the Cleveland,Statlog,and Hungary datasets,are downloaded from the University of California-Irvine ML (machine learning) repository.There are 76 attributes and 303 records in the Cleveland dataset,but the dataset contained in the repository only provides information for 14 subsets of attributes.In the“goal”field,the presence of heart disease in the patient is indicated by a number that can range from 0 (no presence) to 4.Experiments in the Cleveland database have focused on distinguishing the existence of disease (values 1-4) from non-existence (value 0) [27].In Cleveland,the dataset for the target attributes “num”resulted in 160 heart disease absence records and 137 heart disease presence records.The statlog dataset contains 270 records with 13 attributes with no missing values.In the statlog dataset for the target“num”attributes,150 heart disease absence records and 120 heart disease presence records are obtained.The Hungarian dataset consists of 261 records with 11 attributes after removing missing values.A total of 34 database samples are deleted due to missing values,leaving 261 samples.Heart disease is absent in 62.5%of the class and prevails in 37.5%of the class[28].Each dataset has different records and information from numerous patients.Tab.1 explains a brief overview of the datasets and Tab.2 describes the features used in the datasets.All features are used in the Cleveland and Statlog dataset.However,in Hungarian datasets,excluding ca,all slope attributes are used with different numbers of records.We chose these datasets for heart disease diagnostics because they are state-of-the-art,easily accessible,and publicly available datasets.Most of the researchers used these datasets for heart disease prediction.Some of the more recent studies that used the same datasets are[29-34].

Table 1:Dataset description

Table 2:Datasets feature information

3.2 Evaluation Metrics

It is essential to evaluate the performance of the algorithm before its application in the medical industry.For the evaluation of the comparison results,a confusion matrix is used.In this study,the results of the prediction model are compared to three standard performance measures,namely precision,specificity,and sensitivity.Accuracy is the simplest and most widely used performance measure.It is calculated by adding true positive and true negative values and dividing them by all observed values.Eq.(1)shows how accuracy will be calculated with the formula.A true positive result is one in which the model accurately predicts the positive class.A true negative,on the other hand,is a result in which the model properly predicts the negative class.A false positive is an outcome in which the model forecasts the positive class inaccurately.A false negative is an outcome in which the model forecasts the negative class inaccurately.

Sensitivity measures the proportion of true negatives that are correctly identified.It are used when the data is unbalanced.The formula used to measure sensitivity is shown in Eq.(2).

Specificity measures the proportion of true positives that are correctly identified.It is calculated by Eq.(3).

4 Experimental Results and Discussion

This section presents the results achieved by the proposed model.The effectiveness of the classifiers is demonstrated using three standard measures:accuracy,sensitivity,and specificity.Four classification algorithms are also utilized for every dataset to compare its prediction performance with the proposed model.Forecast models are built using thirteen features and accuracy is measured using performance evaluation methods.To investigate the influence of the algorithm that affects the most,the results with cross-validation and split validation are compared and show how the proposed feature optimization method works better than other techniques.In the k-fold cross-validation,the dataset is split into k of equivalent size,the classifier is divided into K-1 groups,and the performance at every step is estimated utilizing the remaining section.The validation cycle is repeated k times.The performance of the classifier is calculated from the k results.Cross-validation uses a variety of k-values.In our experiment,k=10 is selected because it works well.The technique is repeated ten times for each step of the process,and prior to acquiring and testing a new set for the next cycle,the instances of the training and estimation group are randomly distributed throughout the data collection [35,36].Finally,at the end of the 10-time cycle,all output metrics are calculated.In split validation,the data are distributed into training and testing phases.We divide 70% of the data into training and 30% into testing.Tabs.4-6 describe the accuracy of the three datasets through crossvalidation achieved by the machine learning technique with the combination of features applied in the model.The best 87%accuracy is presented in Tab.3 with four attributes of the SVM algorithm.The accuracy of the Statlog dataset is listed in Tab.4,and the results of the Hungarian dataset are shown in Tab.5.According to Tab.4,the highest accuracy is 93% achieved by K-NN.In Tab.5,RF gave the best result with eight optimal features.The accuracy of the proposed model for three datasets with split validation is shown in Tabs.6-8.In Tab.6,NB and SVM gave the best accuracy of 91%.Tab.7 shows,with ten optimal features,that NB achieved the highest accuracy.Tab.8 also summarizes the accuracy achieved by SVM.The performance of the proposed method is trained and evaluated with three datasets;it is found that the proposed model chooses the optimal features and gave better accuracy.In the case of cross-validation of general datasets,the random forest gave the best 95%accuracy,with eight optimal features:thalach,vessel,sex,chp,thal,slope,opk/oldpeak,fbs,and exian.naive bayes achieved 97% accuracy following the features cp,oldpeak,ca,slope,exang,thal,chol,sex,thalch,fbs.In terms of accuracy,sensitivity,and specificity,split validation outperforms cross-validation.The results obtained with both validation methods are better than existing studies.Tab.9 shows the specificity and sensitivity of all datasets using cross-validation and split validation.The highest 100%sensitivity is achieved by a random forest from the Cleveland and Hungary datasets using cross-validation.From the Statlog dataset,100%sensitivity is achieved with a random forest.In case of split validation,100%specificity is achieved from the Cleveland dataset using four classifiers.By using all four classifiers,100%sensitivity is achieved for the Statlog dataset.Fig.2 shows the line graph for the sensitivity of all datasets using cross-validation and split validation.Fig.3 shows the line graph for the specificity of all datasets using cross-validation and split validation.

Table 3:Results achieved by using cross-validation(Cleveland)

Table 4:Results achieved by using cross-validation(statlog)

Table 5:Results achieved using cross-validation(Hungarian)

Table 6:Results achieved by using spilt validation(Cleveland)

Table 7:Results achieved by using spilt validation(Statlog)

Table 8:Results achieved by using spilt validation(Hungarian)

Table 9:Sensitivity and specificity for three datasets using both cross and split validation

Figure 2:Sensitivity of cross-validation(a)and split validation(b)

Figure 3:Specificity of cross validation(a)and split validation(b)

We compared the proposed model with previous state-of-the-art approaches.To evaluate the performance of the model with the performance of other models,benchmarking is useful.This approach has been used to determine whether the proposed model has attained reasonable accuracy compared to the accuracy obtained by other tests.Some research has been done on heart disease,but the results are not consistent.The work is contrasted with the available state-of-the-art that is similar to this work in Tab.10.In Tab.10 we compare our proposed method first with studies[11,37-39]using split validation and the same dataset on heart disease and secondly against these studies [17,40-42]on the same datasets,also an additional dataset[43]using cross-validation.To improve heart disease performance,the SVM provided 85%accuracy[9].The study uses hybrid machine learning methods to predict heart disease,in which the Random Forest with the linear model is 88%accurate[37].Another study predicted heart disease performance based on feature optimization.They obtained 83%accuracy using a modified differential evolution algorithm[38].In[39],the deep learning model to predict heart disease attained 95%accuracy.Cross-validation and machine learning algorithms are used to predict heart disease when the relief algorithm achieves high accuracy[17].Tab.11 shows the features selected for the construction of predictive models by twelve different research using the UCI heart disease dataset.Based on Tabs.10 and 11 related to the current research,we can see that the proposed model worked better.The performance accuracy of the proposed method with naive Bayes through split validation is 97%.The cross-validation performance of the proposed method is 95% with random forest.The overall results indicate that the accuracy of the proposed model is better as compared to previous approaches.In the prediction of heart disease,the selection of features is very critical.This study shows that convenient features increased efficiency with distinctive machine learning and feature selection algorithms.In our proposed method,the highest 100% sensitivity and specificity achieved with NB in case of spilt validation.In case of cross-validation,the highest sensitivity and specificity achieved with Random Forest.

Table 11:Features selection of the proposed model with other studies of UCI

5 Conclusion and Future Work

Identifying the raw health data collection of cardiac information would help save human lives in the long term and early diagnosis of heart disease failure.Machine learning methods have been used to process raw data and offer innovative perspectives on heart disease.Predicting heart disease is challenging and very significant in the medical field.However,once the disease is detected,the mortality rate is significantly regulated.In the early stages,preventive measures will be implemented as soon as possible.In this paper,experiments are carried out using three mostly used UCI datasets.Optimal features are selected using a brute-force algorithm.The experimental results indicated that the proposed technique achieved 97% accuracy with Naive Bayes through split validation and 95%accuracy with the random forest through cross-validation and the results are better than previous methods.The proposed model achieved better performance of accuracy,specificity,and sensitivity.Overall,the proposed model is better for the diagnosis of heart disease compared to previous approaches.The potential direction of this research is to use various combinations of machine learning techniques to improve prediction techniques.It is definitely desirable to extend this work to direct the findings to real-world datasets rather than just theoretical techniques and simulations.In the future,different machine learning algorithms and deep learning methods such as LSTM will be used for largescale data sets.Furthermore,new feature selection approaches are used to obtain an overall perspective on key features to increase prediction accuracy.

Funding Statement:This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No.2020R1I1A3069700).

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.