
Cloud-Based Diabetes Decision Support System Using Machine Learning Fusion

2021-12-14 Shabib Aftab, Saad Alanazi, Munir Ahmad, Muhammad Adnan Khan, Areej Fatima and Nouh Sabri Elmitwally

Computers, Materials & Continua, 2021, Issue 7

Shabib Aftab,Saad Alanazi,Munir Ahmad,Muhammad Adnan Khan,Areej Fatima and Nouh Sabri Elmitwally,6

1 School of Computer Science, National College of Business Administration & Economics, Lahore, 54000, Pakistan

2 Department of Computer Science, Virtual University of Pakistan, Lahore, 54000, Pakistan

3 College of Computer and Information Sciences, Jouf University, Sakaka, 72341, Saudi Arabia

4 Riphah School of Computing & Innovation, Riphah International University, Lahore Campus, Lahore, 54000, Pakistan

5 Department of Computer Science, Lahore Garrison University, Lahore, 54000, Pakistan

6 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Cairo University, 12613, Egypt

Abstract: Diabetes mellitus, generally known as diabetes, is one of the most common diseases worldwide. It is a metabolic disease characterized by insulin deficiency, or by glucose (blood sugar) levels that exceed 200 mg/dL (11.1 mmol/L) for prolonged periods, and it may lead to death if left uncontrolled by medication or insulin injections. Diabetes is categorized into two main types, type 1 and type 2, both of which feature glucose levels above “normal,” defined as 140 mg/dL. Diabetes is triggered by malfunction of the pancreas, which releases insulin, a natural hormone responsible for controlling glucose levels in the blood. Diagnosis and comprehensive analysis of this potentially fatal disease necessitate the application of techniques with minimal rates of error. The primary purpose of this research study is to assess the potential role of machine learning in predicting a person’s risk of developing diabetes. Historically, research has supported the use of various machine-learning algorithms, such as naïve Bayes, decision trees, and artificial neural networks, for early diagnosis of diabetes. However, to achieve maximum accuracy and minimal error in diagnostic predictions, there remains an immense need for further research and innovation to improve the machine-learning tools and techniques available to healthcare professionals. Therefore, in this paper, we propose a novel cloud-based machine-learning fusion technique involving synthesis of three machine-learning algorithms and use of fuzzy systems for collective generation of highly accurate final decisions regarding early diagnosis of diabetes.

Keywords: Machine learning fusion; artificial neural network; decision trees; naïve Bayes; diabetes prediction

1 Introduction

Diabetes mellitus, widely known as diabetes, is an increasingly common physiological health issue. A patient with diabetes, or a diabetic, suffers from a critical shortage of insulin, resulting in an inability to adequately process glucose (sugar) [1]. Diabetes is generally classified into two types: type 1 and type 2. Type-1 diabetes is characterized by insulin dependency, while type-2 diabetes is characterized by insulin deficiency. Insulin is one of the vital hormones produced by the pancreas, the organ responsible for regulating glucose (blood sugar) levels in the human body. The primary underlying causes of diabetes are an imbalanced diet (i.e., one high in sugary foods), obesity, and genetic inheritance. Recent industrial and technological advancements have significantly affected the average human lifestyle, leading to the higher standard of living and accompanying decrease in physical activity commonly observed in developed countries. Accordingly, rates of diabetes have increased, and clinical analysis and effective diagnosis of diabetes have become key subjects of healthcare studies. Traditionally, diabetes has been diagnosed via clinical tests of glucose tolerance levels in patients [2]. Like many other metabolic diseases, diabetes is associated with severe complications such as heart failure, kidney problems, and eyesight issues including complete blindness [3]. An alarming report issued by the Diabetes Research Centre stated that the prevalence of diabetes has increased at a rate of 7% annually and doubled globally during the last decade, with more than 200 million people now diagnosed. Research studies have indicated that 8% of the population aged 25–65 suffer from ailments linked to pancreatic dysfunction, and in a sample of 2.2 million such patients, 17% were adults; most of these patients have a high risk of developing diabetes in the near future [4]. Diabetes can be fatal and can otherwise lead to severe, often irreparable damage to multiple organs. There is an immense need for tools and technologies enabling efficient, accurate investigation and diagnosis to support the decision making of health experts in managing this disease.

Recent studies indicate that accurate and timely diagnosis may prevent 80% of complications in patients with type-2 diabetes. Accurate and timely diagnosis provides a solid basis for effective treatment, helping to minimize the cost of treatment and other difficulties for patients [5]. These are the key success factors for prevention of diabetes complications and development of effective treatment strategies. Healthcare professionals can implement such strategies to reduce long-term damage caused by this disease. Due to its significant advantages, early detection has become a top priority among healthcare personnel. Notably, detection of type-2 diabetes requires a higher level of medical expertise, as this disease is more complex than type-1 diabetes. One of the most promising new methods for accurate early diagnosis is the use of an artificial neural network (ANN). ANN is one of a number of recently developed machine-learning methods being implemented to predict disease earlier and more accurately. According to M. S. Shanker in his research paper “Using neural networks to predict the onset of diabetes mellitus” [6], ANN is considered a more suitable approach to early diagnosis than other machine-learning methods, particularly when one considers the factor of network topology. However, parameter optimization presents a major issue when utilizing ANN. The multi-layer perceptron (MLP), a subset of deep neural networks (DNN), has offered effective resolutions to this problem. DNN are increasingly recommended to support diagnostic processes for diverse diseases [7], as DNN facilitate disease identification and diagnosis while minimizing human error [8]. When utilizing neural networks for diagnosis, it is vital to attain a high level of accuracy, which is achieved via sufficient training and testing on patient datasets. DNN have shown particular promise for achieving maximum accuracy and minimal error through training and testing on datasets.

Machine-learning models are commonly used for diabetes prognostication and have produced good results. Among machine-learning models, one of the most widely used methods for classification is the decision tree (DT). In machine-learning methods for disease diagnosis, the results of multiple DTs can be synthesized to generate a random forest (RF) that yields a single collective final result, that is, a final diagnostic decision. In one such study, the authors used RF in parallel with principal component analysis (PCA), and RF obtained approximately 80% accuracy. Historically, the primary objective of diabetes diagnosis was simply to help control the development of the disease. With support from machine learning, early diagnosis has become possible. High-risk individuals may now take precautionary measures to avoid consequences of the disease for as long as possible. Successful early diagnosis largely depends on accurate selection of classifiers and related features. Researchers have been experimenting with various machine-learning methods, testing different algorithms with the aim of achieving superior rates of prediction accuracy. Previously explored algorithms include support-vector machines (SVM), J48, naïve Bayes, and DT; studies of these algorithms have reported that machine-learning methods achieve superior diagnostic results [9]. The real strength of these algorithms lies in their flexibility to integrate data from varying sources [10].

In this study, we propose a new DNN approach for generating highly accurate predictions of type-2 diabetes. Our approach utilizes a cloud-based decision support system for early identification of diabetic patients. The proposed system uses real-time patient data as input to predict whether a particular patient has diabetes. We apply three popular machine-learning algorithms and a fuzzy system to achieve final diagnostic results with accuracy rates higher than those achieved in similar past studies.

2 Related Research

Researchers in [11] presented a hybrid framework for detection of type-2 diabetes that uses two techniques: K-means and C4.5. They used the clustering algorithm to identify class labels and C4.5 for classification. Their experiment on the Pima Indians diabetes dataset (PIDD) yielded a 92.38% accuracy rate. Researchers in [12] proposed a model using fuzzy C-means clustering techniques to diagnose type-2 diabetes. They used 768 records with nine features in their experiment, achieving 94.3% accuracy. In [13], researchers performed a comparative analysis of various classification and clustering techniques for diabetes diagnosis. They conducted tests to evaluate the performance of applied data-mining techniques. Their results indicated that the J48 classifier outperformed all other techniques in Weka with an accuracy rate of 81.33%. Researchers in [14] proposed a framework to diagnose diabetes using DT along with a fuzzy decision boundary system. The proposed framework achieved an accuracy of 75.8%. Researchers in [15] presented a system to detect diabetes using generalized discriminant analysis and least-squares SVM. Their proposed system demonstrated 82.50% accuracy. Researchers in [16] presented a diabetes detection system using a modified artificial bee colony (ABC) optimization technique with fuzzy rules. Their proposed system showed an accuracy rate of 82.68%. Researchers in [17] proposed a model for diabetes detection that integrated ANN and SVM using a stacked ensemble technique. They applied their model to the PIDD and achieved an accuracy rate of 88.04%. In [18], researchers presented an ensemble classification model based on data streams. The proposed model was able to perform classification tasks in a data-streaming environment. Researchers in [19] also presented an ensemble classification model; theirs was designed to detect diabetic retinopathy. They used fuzzy RF and applied Dominance-based Rough Sets Theory. Their experiment used the SRJUH dataset and showed an accuracy rate of 77%. Researchers in [20] presented a heterogeneous ensemble classification model that included a fuzzy rule inference engine to tackle the issue of uncertainty in the results of base classifiers.

3 Materials and Methods

Early diagnosis of type-2 diabetes can offer patients the opportunity to improve their lifestyles and dietary habits. Moreover, early detection can guide patients to start taking proper medication before the disease worsens. In our study, we present a method for early detection of diabetes that uses a cloud-based intelligent framework empowered by supervised machine-learning techniques and fuzzy systems, as shown in Fig. 1. Our framework consists of two layers: training and testing. Each layer further consists of multiple stages.

Figure 1: CBD-DSS-FM using machine-learning fusion

The training layer begins with the selection of a proper dataset. In the present study, we selected a pre-labeled dataset of diabetes patients [21] for the implementation of our proposed framework. This dataset consists of 15,000 instances and a total of 10 features, of which nine features are independent and one, the output class, is dependent. The pre-processing layer of our proposed framework involves two stages: 1) data cleaning and normalization and 2) data splitting. Data cleaning fills in missing values using the mean imputation method, while normalization brings the values of all features into a common range. Both activities help the classification process achieve higher accuracy. After data cleaning and normalization, the dataset is divided into training data and test data at a ratio of 70:30, with the split performed per class.
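The pre-processing stages described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, min-max scaling to [0, 1] is assumed as the normalization (the text only says values are brought into a range), and the toy labels stand in for the real dataset.

```python
import random
from collections import defaultdict

def impute_mean(rows):
    """Replace None entries in each column with that column's mean."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not None]
        means.append(sum(present) / len(present))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]

def min_max_normalize(rows):
    """Rescale every column to [0, 1] (assumed scheme); constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(r, lows, highs)]
            for r in rows]

def stratified_split(labels, train_ratio=0.7, seed=0):
    """Split row indices class by class so both parts keep the class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_ratio)
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

# Toy example: one missing value, then a 70:30 split of 30 labelled rows.
clean = impute_mean([[1.0, None], [3.0, 4.0], [5.0, 8.0]])
scaled = min_max_normalize(clean)
train_idx, test_idx = stratified_split([0, 0, 1] * 10)
print(scaled, len(train_idx), len(test_idx))
```

Splitting per class keeps the ratio of diabetic to non-diabetic records the same in the training and test parts, which is what a 70:30 split "on the basis of class" implies.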

Pre-processing is followed by the classification process, which consists of training three widely used supervised classification techniques: ANN, DT, and naïve Bayes (NB). This layer receives the training set and test set from the pre-processing stage and provides three prediction results for the next stage. All three classification algorithms must be optimized to achieve maximum accuracy. During ANN configuration, we used one hidden layer with 10 neurons and the backpropagation technique to tune the weights. We used a multi-layer perceptron with at least one hidden layer besides the input and output layers. The steps involved in backpropagation are as follows: initialization of weights, feed forward, backpropagation of error, and updating of weights and biases. Every neuron present in the hidden layer has an activation function such as f(x) = Sigmoid(x). The sigmoid function for the input and the hidden layer of the proposed BPNN can be written as

Input derived from the output layer is

The output layer activation function is

Backpropagation error is represented by the above equation, where τ_k and pp_k represent the desired output and estimated output, respectively. In Eq. (6), the rate of change in weight for the output layer is written as

After applying the chain rule method, the above equation can be stated as

By substituting the values in Eq. (7), the weight change can be obtained as presented in Eq. (8).

where,

Then, we apply the chain rule method for the updating of weights between input and hidden layers:

where ε represents the constant:

After simplifcation, the above equation can be stated as

where

Eq. (10) is used for updating the weights between the hidden layer and the output layer.

Eq. (11) is used for updating the weights between the input and hidden layer.
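For orientation, the steps named above (sigmoid activation, squared-error backpropagation, chain rule, and the two weight updates) take the following standard textbook form. The notation here is assumed for illustration and is not taken from the paper's own numbered equations: $x_i$ are inputs, $o_j$ hidden outputs, $o_k$ estimated outputs, $\tau_k$ desired outputs, and $\epsilon$ the learning-rate constant.

```latex
% Standard one-hidden-layer backpropagation (assumed notation, see lead-in).
\begin{align*}
net_j &= b_j + \sum_i w_{ij}\, x_i, & o_j &= \frac{1}{1 + e^{-net_j}} \\
net_k &= b_k + \sum_j w_{jk}\, o_j, & o_k &= \frac{1}{1 + e^{-net_k}} \\
E &= \tfrac{1}{2} \sum_k (\tau_k - o_k)^2 \\
\frac{\partial E}{\partial w_{jk}}
  &= \frac{\partial E}{\partial o_k}\,
     \frac{\partial o_k}{\partial net_k}\,
     \frac{\partial net_k}{\partial w_{jk}}
   = -(\tau_k - o_k)\, o_k (1 - o_k)\, o_j \\
\Delta w_{jk} &= -\epsilon\, \frac{\partial E}{\partial w_{jk}}
   = \epsilon\, \xi_k\, o_j,
   \qquad \xi_k = (\tau_k - o_k)\, o_k (1 - o_k) \\
\Delta w_{ij} &= -\epsilon\, \frac{\partial E}{\partial w_{ij}}
   = \epsilon\, \xi_j\, x_i,
   \qquad \xi_j = o_j (1 - o_j) \sum_k \xi_k\, w_{jk} \\
w_{jk} &\leftarrow w_{jk} + \Delta w_{jk},
   \qquad w_{ij} \leftarrow w_{ij} + \Delta w_{ij}
\end{align*}
```

The factor $o(1 - o)$ in both error terms is the derivative of the sigmoid, which is why the sigmoid is a convenient activation for this derivation.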

In DT, we tried three optimizers one by one: random search, Bayesian optimization, and grid search. Bayesian optimization performed best and was hence selected for this framework.

The Gini index is

and information gain is

In machine learning, information gain is used to define the preferred order in which attributes are examined so that the state of S is narrowed down as rapidly as possible. A DT depicts how each stage depends on the outcome of the analysis of the previous attribute; applied in the area of machine learning, this is known as decision-tree learning. An attribute with high mutual information should be preferred to other attributes.
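The two split criteria can be illustrated with their usual definitions: Gini impurity $1 - \sum_c p_c^2$ and information gain as parent entropy minus the size-weighted entropy of the child groups. The sketch below uses those standard formulas; the function names and the toy labels are illustrative only.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

parent = [0, 0, 1, 1]
print(gini(parent))                                  # balanced two-class node
print(information_gain(parent, [[0, 0], [1, 1]]))    # a split that separates the classes
print(information_gain(parent, [[0, 1], [0, 1]]))    # a split that separates nothing
```

An attribute whose split yields the larger information gain is examined first, which matches the preference for attributes with high mutual information.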

Here, f(z) is the objective to be minimized, such as the error rate or root mean squared error (RMSE), assessed on the validation set. z can take on any value from the domain Z, and z* is the set of hyper-parameters that yields the lowest value of the score. In simple terms, we aimed to find the model hyper-parameters that would deliver the best score on the validation set metric. The probability model built for the objective function is known as a “surrogate” and is represented as p(z|n):

We intended to optimize expected improvement with respect to the proposed set of hyper-parameters n. Here, z* is a threshold value of the objective function, whereas z denotes the actual value of the function using hyper-parameters n, and p(z|n) is the surrogate probability model stating the probability of z given n. This suggests the best hyper-parameters under the function p(z|n).

The hyper-parameters are not expected to produce any improvement if p(z|n) is zero everywhere that z < z*.

The p(n|z) function is expressed as

where l(n) is the distribution of the hyper-parameters when the score is lower than the threshold z*, and g(n) is the distribution when the score is higher than z*.

z* is the minimum observed true objective function score, whereas z stands for new scores. To maximize the expected improvement under the Gaussian process model, the new score z must be less than the current minimum score (z < z*).

Our rationale for this equation is that we have two different distributions for the hyper-parameters: the first, l(n), represents where the value of the objective function is less than the threshold, and the other, g(n), where the value of the objective function is greater than the threshold.

To increase expected improvement, points with high probability under l(n) and low probability under g(n) should be chosen as the next hyper-parameters.
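The description above matches the usual formulation of expected improvement with two densities over the hyper-parameters, as in the Tree-structured Parzen Estimator. A sketch of that standard formulation follows, reusing the symbols $z$, $z^*$, $n$, $l(n)$ and $g(n)$ from the text; the quantile $\gamma = p(z < z^*)$ is an added symbol, and the identification with this particular estimator is an assumption.

```latex
% Expected improvement with a two-density surrogate (standard form, assumed here).
\begin{align*}
z^* &= \text{threshold score}, \qquad \gamma = p(z < z^*) \\
EI_{z^*}(n) &= \int_{-\infty}^{z^*} (z^* - z)\, p(z \mid n)\, dz \\
p(n \mid z) &=
  \begin{cases}
    l(n), & z < z^* \\
    g(n), & z \ge z^*
  \end{cases} \\
p(z \mid n) &= \frac{p(n \mid z)\, p(z)}{p(n)},
  \qquad p(n) = \gamma\, l(n) + (1 - \gamma)\, g(n) \\
EI_{z^*}(n) &\propto \left( \gamma + \frac{g(n)}{l(n)}\,(1 - \gamma) \right)^{-1}
\end{align*}
```

The last line makes the selection rule explicit: expected improvement grows as the ratio $g(n)/l(n)$ shrinks, so candidates that are likely under $l(n)$ and unlikely under $g(n)$ are preferred.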

In NB, three kernel types are used: Box, Gaussian, and Triangle.

Posterior probability: P(Outcome | Evidence) = P(Evidence | Outcome) × P(Outcome) / P(Evidence)

The traditional NB classifier estimates probabilities by approximating the data with a function, such as a Gaussian distribution:

where μ_t represents the mean of the values of attribute S_t averaged over training points with class label z, and σ_z represents the standard deviation. The one-parameter Box–Cox transformations are defined as

and the two-parameter Box–Cox transformations as
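The Gaussian class-conditional density and the Box–Cox transformations have well-known closed forms, sketched below. The function names are illustrative; the one-parameter transform is $(y^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\ln y$ for $\lambda = 0$, and the two-parameter variant first shifts $y$ by $\lambda_2$.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density used as the class-conditional likelihood of one attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def box_cox(y, lam1, lam2=0.0):
    """Two-parameter Box-Cox transform; lam2 = 0 gives the one-parameter form.

    Requires y + lam2 > 0.
    """
    shifted = y + lam2
    if shifted <= 0:
        raise ValueError("Box-Cox needs y + lam2 > 0")
    if lam1 == 0:
        return math.log(shifted)
    return (shifted ** lam1 - 1.0) / lam1

# Likelihood of a normalized attribute value under a class with mean 0.5, std 0.1.
print(gaussian_pdf(0.45, mean=0.5, std=0.1))
# One-parameter (square-root-like) and two-parameter transforms of the same value.
print(box_cox(4.0, 0.5), box_cox(4.0, 0.5, lam2=1.0))
```

In a naïve Bayes classifier, the per-attribute likelihoods are multiplied together with the class prior, and the class with the larger product is predicted.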

After each model has been optimized, it is stored in the cloud. The next stage of the training layer in our proposed framework deals with the creation and implementation of fuzzy logic on the results of the optimized classification algorithms, as shown in Fig. 2. This layer receives the results of ANN, DT, and NB and generates the output using fuzzy rules, as shown in Figs. 3 and 4; this output is again stored in the cloud.

Conditional (if-then) statements are used to build the fuzzy logic. On the basis of these statements, fuzzy rules are constructed as follows:

IF (NeuralNetwork is yes and NaïveBayes is yes and DecisionTree is yes) THEN (Diabetes is yes).

IF (NeuralNetwork is yes and NaïveBayes is yes and DecisionTree is no) THEN (Diabetes is yes).

IF (NeuralNetwork is yes and NaïveBayes is no and DecisionTree is yes) THEN (Diabetes is yes).

IF (NeuralNetwork is no and NaïveBayes is yes and DecisionTree is yes) THEN (Diabetes is yes).

IF (NeuralNetwork is no and NaïveBayes is no and DecisionTree is also no) THEN (Diabetes is no).

IF (NeuralNetwork is yes and NaïveBayes is no and DecisionTree is no) THEN (Diabetes is no).

IF (NeuralNetwork is no and NaïveBayes is no and DecisionTree is yes) THEN (Diabetes is no).

IF (NeuralNetwork is no and NaïveBayes is yes and DecisionTree is no) THEN (Diabetes is no).

The rules make it evident that if at least two of the three supervised classification techniques are true, then diabetes is true; otherwise, diabetes is false.

Figure 2: Proposed fused ML rule surface

Figure 3: Proposed fused ML result with diabetes (yes)

Figure 4: Proposed fused ML result with diabetes (no)
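With crisp yes/no inputs, the eight rules listed above reduce to a two-out-of-three majority vote. The sketch below shows only that crisp reduction; it does not reproduce the membership functions or defuzzification of the fuzzy inference system, and the function name `fuse` is hypothetical.

```python
from itertools import product

def fuse(neural_network: bool, naive_bayes: bool, decision_tree: bool) -> bool:
    """Return True (diabetes) when at least two of the three classifiers say yes."""
    return (neural_network + naive_bayes + decision_tree) >= 2

# Enumerate all eight input combinations, mirroring the eight rules.
for nn, nb, dt in product([True, False], repeat=3):
    verdict = "yes" if fuse(nn, nb, dt) else "no"
    print(f"NeuralNetwork={nn}, NaiveBayes={nb}, DecisionTree={dt} -> Diabetes is {verdict}")
```

Because three voters can never tie, the fused output is always defined, and a single dissenting classifier cannot overturn the decision of the other two.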

Fig. 2 shows the proposed fused ML rule surface of diabetes with respect to the neural network and naïve Bayes results. If both the neural network and naïve Bayes predict no diabetes, then the resultant fused ML also predicts no diabetes; otherwise, the fused ML predicts diabetes.

Fig. 3 shows that if the neural network diagnoses no diabetes and the remaining algorithms (naïve Bayes and decision tree) both diagnose diabetes, then the fused ML diagnoses the patient with diabetes.

Fig. 4 shows that if all three algorithms (neural network, naïve Bayes, and decision tree) diagnose no diabetes, then the fused ML also diagnoses no diabetes.

The second layer of the proposed framework deals with the real-time classification of diabetic patients. Real-time patient data can be given as input to the proposed fused machine-learning model, and appointments can be made on the basis of the results. If a patient is predicted to be diabetic, then he or she is assigned an early slot on an emergency basis; if the patient is predicted to be non-diabetic, then he or she can be given an appointment following the regular schedule.

4 Results and Discussion

To implement the proposed framework, we used a dataset [21] consisting of 10 features and 15,000 instances, as shown in Tab. 1. The first nine features were independent features used as inputs to calculate and predict the tenth feature, the output class indicating whether the particular patient is suffering from diabetes or not. If the value of this feature is 1, the patient is diabetic, and if the value is 0, the patient is non-diabetic.

Table 1: Dataset parameters

We divided the dataset into two parts, 70% training data (10,500) and 30% test data (4,500). We performed the pre-processing activities of cleaning and normalization on the dataset prior to classification. For classification of the dataset, we used three machine-learning algorithms: ANN, DT, and NB. We optimized these techniques iteratively until we achieved maximum performance. We applied various statistical measures to assess the performance of the classification techniques, as shown below.

where RO_0, RO_1, EO_0 and EO_1 represent the predicted positive output, predicted negative output, expected positive output, and expected negative output, respectively.

First, we used ANN to classify the dataset. We used one hidden layer consisting of nine neurons while designing the structure of the neural network. We used 70% of the dataset, consisting of 10,500 records, for training the model and the remaining 30% of the dataset, consisting of 4,500 records, for testing. Of the 10,500 records reserved for training, 7,000 were negative and 3,500 were positive. During the training process with ANN, 6,801 records were classified as negative and 3,273 were classified as positive. After comparing the expected results with the output results shown in Tab. 2, we achieved 96% accuracy with a 4% miss rate. In testing with ANN, 2,831 records were classified as negative and 1,285 were classified as positive (Tab. 2). The accuracy rate of ANN in the testing stage was 91.5% and the miss rate was 8.5%.

Table 2: Artificial neural network (ANN)
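The reported accuracy and miss rate can be recomputed from the counts given above. The sketch assumes that the "classified as negative" and "classified as positive" counts are the correctly classified records of each class, which is the reading under which the ANN figures add up; the function names and the 3,000/1,500 expected class sizes for the test set (stated for the NB experiment) are used here for illustration.

```python
def accuracy(correct_negative, correct_positive, total):
    """Share of all records whose predicted class equals the expected class."""
    return (correct_negative + correct_positive) / total

def miss_rate(correct_negative, correct_positive, total):
    """Share of all records that were misclassified (1 - accuracy)."""
    return 1.0 - accuracy(correct_negative, correct_positive, total)

def sensitivity(correct_positive, expected_positive):
    """Share of expected positives that were predicted positive."""
    return correct_positive / expected_positive

def specificity(correct_negative, expected_negative):
    """Share of expected negatives that were predicted negative."""
    return correct_negative / expected_negative

# ANN on the test set: 2,831 negative and 1,285 positive out of 4,500 records.
acc = accuracy(2831, 1285, 4500)
print(f"accuracy  = {acc * 100:.1f}%")
print(f"miss rate = {miss_rate(2831, 1285, 4500) * 100:.1f}%")
print(f"sensitivity = {sensitivity(1285, 1500):.3f}")
print(f"specificity = {specificity(2831, 3000):.3f}")
```

The same two functions applied to the ANN training counts (6,801 and 3,273 out of 10,500) give roughly 96% accuracy, in line with the figure quoted in the text.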

During the training process with DT, 6,801 records were classified as negative and 3,273 were classified as positive. After comparison of the expected negative and positive records with the output results of the training process with DT (Tab. 3), we achieved an accuracy rate of 95.9% and a miss rate of 4.1%. During the testing process with DT, 2,898 records were classified as negative while 1,404 were classified as positive (Tab. 3). On comparing the expected output with the output of the testing process with DT, we achieved an accuracy rate of 94.9% and a miss rate of 5.1%.

Table 3: Decision tree (DT)

During training with NB, 6,647 records were classified as negative and 3,109 were classified as positive. After comparing the achieved output of NB in the training stage with the expected output (Tab. 4), we achieved 92.91% accuracy and a miss rate of 7.09%. During the testing process, we used 4,500 records (30% of the dataset) for validation. Of these records, 3,000 were negative and 1,500 were positive. NB classified 2,828 records as negative and 1,348 as positive. After comparison with the expected output (Tab. 4), the model achieved an accuracy rate of 92.8% and a miss rate of 7.2%.

Table 4: Naïve Bayes (NB)

Finally, we fed all of the test records into the fuzzy system, along with the output class, for the final decision. The fuzzy system classified 2,903 records as negative and 1,380 as positive (Tab. 5). On comparing the expected output with the fuzzy system output, we achieved 95.2% accuracy with a miss rate of 4.8%.

Table 5: FM proposed (testing)

Table 6: Detailed results of proposed decision support system

Table 7: Performance analysis of proposed decision support system

Tab. 6 presents detailed results of the three classification techniques along with those of our proposed model (FM). In testing, the fuzzy model outperformed the other algorithms in all applied accuracy measures.

Tab. 7 reflects the detailed results of our proposed fused model along with input and output. We can observe that the real-time input parameters of the patients were given to the decision support system, where the three classifiers individually predicted the diabetes diagnosis and the fuzzy inference system then formulated the final result.

Tab. 8 displays the accuracy and error rates achieved by our proposed framework in comparison with other algorithms previously applied in diabetes diagnosis. The results obtained from the fused model in the proposed framework are compared with backpropagation [9], Bayesian regulation [22], ANN [23], GRNN [24], PNN [25], DELM [26], NB [1], J48 [1], and RBF [1]. The data indicate that our proposed FM framework significantly outperformed the algorithms used in previous research.

Table 8: Accuracy comparison of decision support systems

5 Conclusion

Early diagnosis of diabetes using machine-learning techniques is a challenging task. In this paper, we proposed a novel cloud-based decision-support system for diabetes prediction using a fused machine-learning technique. Our proposed system integrates the classification results of three supervised machine-learning techniques (ANN, NB, and DT) with a fuzzy inference system to generate accurate predictions. Our system consists of two layers: training and testing. The training layer begins with data pre-processing activities (data cleaning and normalization) and is followed by data splitting for classification. In our study, we divided the dataset for training and testing at a ratio of 70:30 to optimize the classification techniques and yield more accurate results on the validation data. After pre-processing, we executed the classification process, which involved training of the three classification techniques (ANN, NB, and DT) followed by validation on our selected dataset. We optimized these techniques until maximum accuracy was achieved. Finally, using a fuzzy system, we synthesized the three prediction results from the three classification techniques to generate the final prediction output. In our study, our proposed system achieved an accuracy rate of 95.2%, outperforming previously applied machine-learning techniques for diabetes diagnosis.

Acknowledgement: The authors thank their families and colleagues for their continued support.

Funding Statement: The author(s) received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.