• Open access
  • Published: 20 December 2021

Machine learning and deep learning predictive models for type 2 diabetes: a systematic review

  • Luis Fregoso-Aparicio   ORCID: orcid.org/0000-0003-4986-5745 1 ,
  • Julieta Noguez   ORCID: orcid.org/0000-0002-6000-3452 2 ,
  • Luis Montesinos   ORCID: orcid.org/0000-0003-3976-4190 2 &
  • José A. García-García   ORCID: orcid.org/0000-0001-6876-4558 3  

Diabetology & Metabolic Syndrome volume  13 , Article number:  148 ( 2021 ) Cite this article


Diabetes Mellitus is a severe, chronic disease that occurs when blood glucose levels rise above certain limits. Over the last years, machine and deep learning techniques have been used to predict diabetes and its complications. However, researchers and developers still face two main challenges when building type 2 diabetes predictive models. First, there is considerable heterogeneity in previous studies regarding the techniques used, making it challenging to identify the optimal one. Second, there is a lack of transparency about the features used in the models, which reduces their interpretability. This systematic review aimed to provide answers to the above challenges. The review primarily followed the PRISMA methodology, enriched with the one proposed by Keele and Durham Universities. Ninety studies were included, and the type of model, complementary techniques, dataset, and performance parameters reported were extracted. Eighteen different types of models were compared, with tree-based algorithms showing top performances. Deep Neural Networks proved suboptimal, despite their ability to deal with big and dirty data. Data balancing and feature selection techniques proved helpful in increasing model efficiency. Models trained on tidy datasets achieved almost perfect performance.

Introduction

Diabetes mellitus is a group of metabolic diseases characterized by hyperglycemia resulting from defects in insulin secretion, insulin action, or both [ 1 ]. In particular, type 2 diabetes is associated with insulin resistance (insulin action defect), i.e., where cells respond poorly to insulin, affecting their glucose intake [ 2 ]. The diagnostic criteria established by the American Diabetes Association are: (1) a level of glycated hemoglobin (HbA1c) greater or equal to 6.5%; (2) basal fasting blood glucose level greater than 126 mg/dL, and; (3) blood glucose level greater or equal to 200 mg/dL 2 h after an oral glucose tolerance test with 75 g of glucose [ 1 ].
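The three criteria can be expressed as a simple screening check. The following is a minimal sketch: the thresholds come directly from the text above, while the function name and argument units are illustrative, and an actual diagnosis requires clinical confirmation.

```python
def meets_ada_criteria(hba1c_pct, fasting_mg_dl, ogtt_2h_mg_dl):
    """True if any ADA diagnostic criterion for diabetes is met."""
    return (
        hba1c_pct >= 6.5           # (1) glycated hemoglobin (HbA1c) >= 6.5%
        or fasting_mg_dl > 126     # (2) basal fasting blood glucose > 126 mg/dL
        or ogtt_2h_mg_dl >= 200    # (3) glucose >= 200 mg/dL 2 h after 75 g OGTT
    )
```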

Diabetes mellitus is a global public health issue. In 2019, the International Diabetes Federation estimated the number of people living with diabetes worldwide at 463 million and the expected growth at 51% by the year 2045. Moreover, it is estimated that for every person diagnosed with diabetes, there is one who remains undiagnosed [ 2 ].

The early diagnosis and treatment of type 2 diabetes are among the most relevant actions to prevent further development and complications like diabetic retinopathy [ 3 ]. According to the ADDITION-Europe Simulation Model Study, an early diagnosis reduces the absolute and relative risk of suffering cardiovascular events and mortality [ 4 ]. A sensitivity analysis on USA data showed a 25% relative reduction in diabetes-related complication rates for a 2-year earlier diagnosis.

Consequently, many researchers have endeavored to develop predictive models of type 2 diabetes. The first models were based on classic statistical learning techniques, e.g., linear regression. Recently, a wide variety of machine learning techniques has been added to the toolbox. Those techniques allow predicting new cases based on patterns identified in training data from previous cases. For example, Kälsch et al. [ 5 ] identified associations between liver injury markers and diabetes and used random forests to predict diabetes based on serum variables. Moreover, different techniques are sometimes combined, creating ensemble models to surpass the single model’s predictive performance.

The number of studies developed in the field creates two main challenges for researchers and developers aiming to build type 2 diabetes predictive models. First, there is considerable heterogeneity in previous studies regarding the machine learning techniques used, making it challenging to identify the optimal one. Second, there is a lack of transparency about the features used to train the models, which reduces their interpretability, a property highly relevant to clinicians.

This review aims to inform the selection of machine learning techniques and features to create novel type 2 diabetes predictive models. The paper is organized as follows. “ Background ” section provides a brief background on the techniques used to create predictive models. “ Methods ” section presents the methods used to design and conduct the review. “ Results ” section summarizes the results, followed by their discussion in “ Discussion ” section, where a summary of findings, the opportunity areas, and the limitations of this review are presented. Finally, “ Conclusions ” section presents the conclusions and future work.

Machine learning and deep learning

Over the last years, humanity has achieved technological breakthroughs in computer science, material science, biotechnology, genomics, and proteomics [ 6 ]. These disruptive technologies are shifting the paradigm of medical practice. In particular, artificial intelligence and big data are reshaping disease and patient management, shifting to personalized diagnosis and treatment. This shift enables public health to become predictive and preventive [ 6 ].

Machine learning is a subset of artificial intelligence that aims to create computer systems that discover patterns in training data to perform classification and prediction tasks on new data [ 7 ]. Machine learning puts together tools from statistics, data mining, and optimization to generate models.

Representational learning, a subarea of machine learning, focuses on automatically finding an accurate representation of the knowledge extracted from the data [ 7 ]. When this representation comprises many layers (i.e., a multi-level representation), we are dealing with deep learning.

In deep learning models, every layer represents a level of learned knowledge. The layers nearest to the input represent low-level details of the data, while those closest to the output represent a higher level of discrimination, with more abstract concepts.

The studies included in this review used 18 different types of models:

Deep Neural Network (DNN): DNNs are loosely inspired by the biological nervous system. Artificial neurons are simple functions depicted as nodes compartmentalized in layers, and synapses are the links between them [ 8 ]. A DNN is a data-driven, self-adaptive learning technique that produces non-linear models capable of modeling real-world problems.
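As an illustration of the layered structure just described, here is a minimal forward pass through a fully connected network in plain Python. This is a sketch only; the ReLU/sigmoid activation choices, layer shapes, and function name are assumptions for illustration, not taken from any reviewed study.

```python
import math

def mlp_forward(x, layers):
    """Forward pass through a fully connected network.

    `layers` is a list of (weights, biases) pairs; weights[i][j] connects
    input j to neuron i. Hidden layers use ReLU; the final layer applies a
    sigmoid, so the output can be read as a class probability.
    """
    for idx, (W, b) in enumerate(layers):
        # Weighted sum plus bias for every neuron in this layer
        z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if idx == len(layers) - 1:
            x = [1.0 / (1.0 + math.exp(-v)) for v in z]  # output: sigmoid
        else:
            x = [max(0.0, v) for v in z]                 # hidden: ReLU
    return x
```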

Support Vector Machines (SVM): SVM is a non-parametric algorithm capable of solving regression and classification problems using linear and non-linear functions. These functions assign vectors of input features to an n-dimensional space called a feature space [ 9 ].

k-Nearest Neighbors (KNN): KNN is a supervised, non-parametric algorithm based on the “things that look alike” idea. KNN can be applied to regression and classification tasks. The algorithm computes the closeness or similarity of new observations in the feature space to k training observations to produce their corresponding output value or class [ 9 ].
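The "things that look alike" idea can be sketched in a few lines. This is a toy implementation using Euclidean distance and majority voting; the function name and data layout are illustrative.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; closeness is
    measured with Euclidean distance in the feature space.
    """
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```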

Decision Tree (DT): DTs use a tree structure built by selecting thresholds for the input features [ 8 ]. This classifier aims to create a set of decision rules to predict the target class or value.

Random Forest (RF): RFs combine several decision trees trained via bagging, producing the final result by a voting strategy [ 9 ].

Gradient Boosting Tree (GBT) and Gradient Boost Machine (GBM): GBTs and GBMs join sequential tree models in an additive way to predict the results [ 9 ].

J48 Decision Tree (J48): J48 develops a mapping tree to include attribute nodes linked by two or more sub-trees, leaves, or other decision nodes [ 10 ].

Logistic and Stepwise Regression (LR): LR is a linear regression technique suitable for tasks where the dependent variable is binary [ 8 ]. The logistic model is used to estimate the probability of the response based on one or more predictors.
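A fitted logistic model estimates the response probability by passing a linear combination of the predictors through the logistic (sigmoid) function. The sketch below shows that computation; the coefficients in the test are hypothetical placeholders, not values from any reviewed study.

```python
import math

def logistic_prob(x, coef, intercept):
    """P(y = 1 | x) under a fitted logistic regression model.

    `x` is the feature vector, `coef` the per-feature coefficients.
    """
    z = intercept + sum(c * xi for c, xi in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-z))
```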

Linear and Quadratic Discriminant Analysis (LDA): LDA partitions an n-dimensional feature space into two or more regions separated by a hyperplane [ 8 ]. The aim is to find the discriminant function for every class. This function is defined on the vectors that maximize the between-group variance and minimize the within-group variance.

Cox Hazard Regression (CHR): CHR, or proportional hazards regression, analyzes the effect of the features on the time to occurrence of a specific event [ 11 ]. The method is partially non-parametric, since it only assumes that the effects of the predictor variables on the event are constant over time and additive on a scale.

Least-Squares Regression (LSR): LSR is used to estimate the parameters of a linear regression model [ 12 ]. LSR estimators minimize the sum of the squared errors (the differences between observed and predicted values).

Multiple Instance Learning boosting (MIL): The boosting algorithm sequentially trains several weak classifiers and additively combines them by weighting each of them to make a strong classifier [ 13 ]. In MIL, the classifier is logistic regression.

Bayesian Network (BN): BNs are graphs made up of nodes and directed line segments that prohibit cycles [ 14 ]. Each node represents a random variable and its probability distribution in each state. Each directed line segment represents a conditional dependence between nodes, with probabilities calculated using Bayes’ theorem.

Latent Growth Mixture (LGM): LGM groups patients into an optimal number of growth trajectory clusters. Maximum likelihood is used to estimate missing data [ 15 ].

Penalized Likelihood Methods: Penalizing is an approach to avoid problems in the stability of the estimated parameters when the probability is relatively flat, which makes it difficult to determine the maximum likelihood estimate using simple methods. Penalizing is also known as shrinkage [ 16 ]. Least absolute shrinkage and selection operator (LASSO), smoothed clipped absolute deviation (SCAD), and minimax concave penalized likelihood (MCP) are methods using this approach.

Alternating Cluster and Classification (ACC): ACC assumes that the data have multiple hidden clusters in the positive class, while the negative class is drawn from a single distribution. For different clusters of the positive class, the discriminatory dimensions must be different and sparse relative to the negative class [ 17 ]. Clusters are like “local opponents” to the complete negative set, and therefore the “local limit” (classifier) has a smaller dimensional subspace than the feature vector.

Some studies used a combination of multiple machine learning techniques and are subsequently labeled as machine learning-based method (MLB).

Systematic literature review methodologies

This review follows two methodologies for conducting systematic literature reviews: the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [ 18 ] and the Guidelines for performing Systematic Literature Reviews in Software Engineering [ 19 ]. Although these methodologies hold many similarities, there is a substantial difference between them. While the former was tailored for medical literature, the latter was adapted for reviews in computer science. Hence, since this review focuses on computer methods applied to medicine, both strategies were combined and implemented. The PRISMA statement is the standard for conducting reviews in the medical sciences and was the principal strategy for this review. It contains 27 items for evaluating included studies, out of which 23 are used in this review. The second methodology is an adaptation by Keele and Durham Universities to conduct systematic literature reviews in software engineering. The authors provide a list of guidelines to conduct the review. Two elements were adopted from this methodology. First, the protocol’s organization in three stages (planning, conducting, and reporting). Secondly, the quality assessment strategy to select studies based on the information retrieved by the search.

Related works

Previous reviews have explored machine learning techniques in diabetes, yet with a substantially different focus. Sambyal et al. conducted a review on microvascular complications in diabetes (retinopathy, neuropathy, nephropathy) [ 20 ]. This review included 31 studies classified into three groups according to the methods used: statistical techniques, machine learning, and deep learning. The authors concluded that machine learning and deep learning models are more suited for big data scenarios. Also, they observed that the combination of models (ensemble models) produced improved performance.

Islam et al. conducted a review with meta-analysis on deep learning models to detect diabetic retinopathy (DR) in retinal fundus images [ 21 ]. This review included 23 studies, out of which 20 were also included for meta-analysis. For each study, the authors identified the model, the dataset, and the performance metrics and concluded that automated tools could perform DR screening.

Chaki et al. reviewed machine learning models in diabetes detection [ 22 ]. The review included 107 studies and classified them according to the model or classifier, the dataset, the feature selection (with four possible kinds of features), and their performance. The authors found that text, shape, and texture features produced better outcomes. Also, they found that DNNs and SVMs delivered better classification outcomes, followed by RFs.

Finally, Silva et al. [ 23 ] reviewed 27 studies, including 40 predictive models for diabetes. They extracted the technique used, the temporality of prediction, the risk of bias, and validation metrics. The objective was to prove whether machine learning exhibited discrimination ability to predict and diagnose type 2 diabetes. Although this ability was confirmed, the authors did not report which machine learning model produced the best results.

This review aims to find areas of opportunity and recommendations in the prediction of diabetes based on machine learning models. It also explores the optimal performance metrics, the datasets used to build the models, and the complementary techniques used to improve the model’s performance.

Objective of the review

This systematic review aims to identify and report the areas of opportunity for improving the prediction of diabetes type 2 using machine learning techniques.

Research questions

Research Question 1 (RQ1): What kind of features make up the database to create the model?

Research Question 2 (RQ2): What machine learning technique is optimal to create a predictive model for type 2 diabetes?

Research Question 3 (RQ3): What are the optimal validation metrics to compare the models’ performance?

Information sources

Two search engines were selected:

PubMed, given the relationship between a medical problem such as diabetes and a possible computer science solution.

Web of Science, given its extraordinary ability to select articles with high affinity with the search string.

These search engines were also considered because they search many specialized databases (IEEE Xplore, Science Direct, Springer Link, PubMed Central, Plos One, among others) and allow searching with keywords combined with boolean operators. Likewise, the databases should contain articles with different approaches to predictive models, not only those specialized in clinical aspects. Finally, the number of articles included in the systematic review should be sufficient to identify areas of opportunity for improving the development of models to predict diabetes.

Search strategy

Three main keywords were selected from the research questions. These keywords were combined in strings as required by each database in its advanced search tool. In other words, these strings were adapted to meet the criteria of each database (Table  1 ).

Eligibility criteria

Retrieved records from the initial search were screened to check their compliance with eligibility criteria.

Firstly, only papers published from 2017 to 2021 were considered. Then, two rounds of screening were conducted. The first round focused mainly on the scope of the reported study. Articles were excluded if the study used genetic data to train the models, as this was not a type of data of interest in this review. Also, articles were excluded if the full text was not available. Finally, review articles were also excluded.

In the second round of screening, articles were excluded when machine learning techniques were not used to predict type 2 diabetes but other types of diabetes, treatments, or diseases associated with diabetes (complications and related diseases associated with metabolic syndrome). Also, studies using unsupervised learning were excluded as they cannot be validated using the same performance metrics as supervised learning models, preventing comparison.

Quality assessment

After retrieving the selected articles, three quality assessment parameters were defined, one per research question. For each parameter, articles were assigned to one of three subgroups according to the extent to which they satisfied it.

QA1: The dataset contains sociodemographic and lifestyle data, clinical diagnosis, and laboratory test results as attributes for the model.

QA1.1: The dataset contains only one kind of attribute.

QA1.2: The dataset contains similar kinds of attributes.

QA1.3: The dataset uses EHRs with multiple kinds of attributes.

QA2: The article presents a model with a machine learning technique to predict type 2 diabetes.

QA2.1: Machine learning methods are not used at all.

QA2.2: The prediction method is used only as part of the preprocessing of the data for data mining.

QA2.3: The model used a machine learning technique to predict type 2 diabetes.

QA3: The authors use supervised learning with validation metrics to contrast their results with previous work.

QA3.1: The authors used unsupervised methods.

QA3.2: The authors used a supervised method with one validation metric, or several methods mixing supervised and unsupervised learning.

QA3.3: The authors used supervised learning with more than one metric to validate the model (accuracy, specificity, sensitivity, area under the ROC, F1-score).

Data extraction

After assessing the papers for quality, articles in subgroup QA2.3 that also fell into any of QA1.1, QA1.2, or QA1.3 and into QA3.2 or QA3.3 were processed as follows.

First, the selected articles were grouped in two possible ways according to the data type (glucose forecasting or electronic health records). The first group contains models that screen the control levels of blood glucose, while the second group contains models that predict diabetes based on electronic health records.

The second classification was more detailed, applying the criteria below to each group.

The data extraction criteria are:

Machine learning model (which machine learning method was used)

Validation parameter (accuracy, sensitivity, specificity, F1-score, AUC (ROC))

Complementary techniques (complementary statistics and machine learning techniques used for the models)

Data sampling (cross-validation, training-test set, complete data)

Description of the population (age, balanced or imbalance, population cohort size).

Risk of bias analyses

Risk of bias in individual studies.

The risk of bias in individual studies (i.e., within-study bias) was assessed based on the characteristics of the sample included in the study and the dataset used to train and test the models. One of the most common risks of bias is when the data is imbalanced. When the dataset has significantly more observations for one label, the probability of selecting that label increases, leading to misclassification.

The second parameter that causes a risk of bias is the age of participants. In most cases, diabetes onset occurs in older people, so many datasets bound the age range between 40 and 80 years. In other cases, onset occurs at an early age, producing datasets with a range from 21 to 80 years.

A third parameter, strongly related to age, is early-age onset. The longer a patient lives with the disease, the more complications appear, and the earlier they appear, making it harder to develop a model for diabetes alone without correlation with its complications.

Finally, as the fourth risk of bias: according to Forbes [ 24 ], data scientists spend 80% of their time on data preparation, and 60% of that on data cleaning and organization. A well-structured dataset is essential for good model performance. This can be seen in the data extraction results: models trained on datasets that are already clean and well organized, like the PIMA dataset, achieved a recall of 1 [ 25 ] and, in another model, an accuracy of 0.97 [ 26 ]. Dirty data cannot achieve values as good as clean data.

Risk of bias across studies

The items considered to assess the risk of bias across the studies (i.e., between-study bias) were the reported validation parameters and the dataset and complementary techniques used.

Validation metrics were chosen as they are used to compare the performance of the model. The studies must be compared using the same metrics to avoid bias from the validation methods.

The complementary techniques are essential since they can be combined with the primary approach to create a better-performing model. This introduces a bias because it is impossible to discern whether the good performance comes from the combination of the complementary and machine learning techniques, or whether the machine learning technique per se is superior to others.

Search results and reduction

The initial search generated 1327 records, 925 from PubMed and 402 from Web of Science. Only 130 records were excluded when filtering by publication year (2017–2021). Therefore, further searches were conducted using fine-tuned search strings and options for both databases to narrow down the results. The new search was carried out using the original keywords but restricting the word ‘diabetes’ to be in the title, which generated 517 records from both databases. Fifty-one duplicates were discarded. Therefore, 336 records were selected for further screening.

Further selection was conducted by applying the exclusion criteria to the 336 records above. Thirty-seven records were excluded since the reported study used non-omittable genetic attributes as model inputs, something out of this review’s scope. Thirty-eight records were excluded as they were review papers. All in all, 261 articles that fulfilled the criteria were included in the quality assessment.

Figure  1 shows the flow diagram summarizing this process.

Figure 1. Flow diagram indicating the results of the systematic review with inclusions and exclusions

The 261 articles above were assessed for quality and classified into their corresponding subgroup for each quality question (Fig.  2 ).

Figure 2. Percentage of each subgroup in the quality assessment. The criteria do not apply to two results for Quality Assessment Questions 1 and 3

The first question classified the studies by the type of database used for building the models. The third subgroup represents the most desirable scenario. It includes studies where models were trained using features from Electronic Health Records or a mix of datasets including lifestyle, socio-demographic, and health diagnosis features. There were 22, 85, and 154 articles in subgroups one to three, respectively.

The second question classified the studies by the type of model used. Again, the third subgroup represents the most suitable subgroup as it contains studies where a machine learning model was used to predict diabetes onset. There were 46 studies in subgroup one, 66 in subgroup two, and 147 in subgroup three. Two studies were omitted from these subgroups: one used a cancer-related model; the other used a model of no interest to this review.

The third question clustered the studies based on their validation metrics. There were 25 studies in subgroup one (semi-supervised learning), 68 in subgroup two (only one validation metric), and 166 in subgroup three (more than one validation metric). The criteria were not applied to two studies, as they used special error metrics, making it impossible to compare their models with the rest.

Data extraction excluded 101 articles from the quantitative synthesis: twelve studies used unsupervised learning; nineteen focused on diabetes treatments; 33 on other types of diabetes (eighteen type 1 and fifteen gestational); and 37 on associated diseases.

Furthermore, 70 articles were left out of this review as they focused on the prediction of diabetes complications (59) or on forecasting glucose levels (11), not onset. Therefore, 90 articles were chosen for the next steps.

Table  2 summarizes the results of the data extraction. The table is divided into two main groups, each corresponding to a type of data.

For the risk of bias in the studies: unbalanced data means that the number of observations per class is not equally distributed. Some studies applied complementary techniques (e.g., SMOTE) to prevent the bias produced by imbalance in the data. These techniques undersample the predominant class or oversample the minority class to produce a balanced dataset.
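As a rough illustration of the balancing step, the sketch below randomly duplicates minority-class rows until all classes are equal in size. Note this is plain random oversampling, not SMOTE itself, which instead synthesizes new minority samples by interpolating between nearest neighbors; the function and variable names are illustrative.

```python
import random

def random_oversample(rows, label_key=lambda r: r[-1], seed=0):
    """Duplicate minority-class rows at random until classes are balanced.

    `rows` is any sequence of records; `label_key` extracts the class label
    (by default, the last element of each record).
    """
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label_key(r), []).append(r)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Top up smaller classes with randomly repeated rows
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced
```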

Other studies used different strategies to deal with other risks for bias. For instance, they might exclude specific age groups or cases presenting a second disease that could interfere with the model’s development to deal with the heterogeneity in some cohorts’ age.

For the risk of bias across the studies: the comparison between models was performed on those reporting the most frequently used validation metrics, i.e., accuracy and AUC (ROC). Accuracy was estimated, to homogenize the comparison criteria, when other metrics from the confusion matrix were reported or the population information was known. The confusion matrix is a two-by-two matrix containing four counts: true positives, true negatives, false positives, and false negatives. Validation metrics such as precision, recall, accuracy, and F1-score are computed from this matrix.
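These metrics can be computed directly from the four counts, as in this small sketch (the function name and the example counts in the test are illustrative):

```python
def metrics_from_confusion(tp, tn, fp, fn):
    """Standard validation metrics from the 2x2 confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # a.k.a. recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}
```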

Two kinds of complementary techniques were found. Firstly, techniques for balancing the data, including oversampling and undersampling methods. Secondly, feature selection techniques such as logistic regression, principal component analysis, and statistical testing. A comparison can still be performed between the studies using them, albeit with the bias caused by the improvement of the model.

Discussion

This section discusses the findings for each of the research questions driving this review.

RQ1: What kind of features make up the database to create the model?

Our findings suggest no agreement on the specific features to create a predictive model for type 2 diabetes. The number of features also differs between studies: while some used a few features, others used more than 70 features. The number and choice of features largely depended on the machine learning technique and the model’s complexity.

However, our findings suggest that some data types produce better models, such as lifestyle, socioeconomic, and diagnostic data. These data are available in most, but not all, Electronic Health Records. Also, retinal fundus images were used in many of the top models, as they capture the eye vessel damage derived from diabetes. Unfortunately, this type of image is not available in primary care data.

RQ2: What machine learning technique is optimal to create a predictive model for type 2 diabetes?

Figure  3 shows a scatter plot of studies that reported accuracy and AUC (ROC) values (x and y axes, respectively). The color of the dots represents thirteen of the eighteen types of model listed in the background. Dot labels represent the reference number of the study. A total of 30 studies are included in the plot. The studies closer to the top-right corner are the best ones, as they obtained high values for both validation metrics.

Figure 3. Scatterplot of AUC (ROC) vs. accuracy for included studies. Numbers correspond to reference numbers, and dot colors to the type of model; the ideal model has x = 1 and y = 1

Figures  4 and 5 show the average accuracy and AUC (ROC) by model. Not all models from the background appear in both graphs since not all studies reported both metrics. Notably, most values represent a single study or the average of two studies. The exception is the average values for SVMs, RFs, GBTs, and DNNs, calculated with the results reported by four studies or more. These were the most popular machine learning techniques in the included studies.

Figure 4. Average accuracy by model. For papers with more than one model, the best-scoring model was selected for the graph. A better model has a higher value

Figure 5. Average AUC (ROC) by model. For papers with more than one model, the best-scoring model was selected for the graph. A better model has a higher value

RQ3: What are the optimal validation metrics to compare the models’ performance?

Considerable heterogeneity was found in this regard, making it harder to compare the performance between the models. Most studies reported some metrics computed from the confusion matrix. However, studies focused on statistical learning models reported hazard ratios and the c-statistic.

This heterogeneity remains an area of opportunity for further studies. To deal with it, we propose reporting at least three metrics from the confusion matrix (i.e., accuracy, sensitivity, and specificity), which would allow computing the rest. Additionally, the AUC (ROC) should be reported as it is a robust performance metric. Ideally, other metrics such as the F1-score, precision, or the MCC score should be reported. Reporting more metrics would enable benchmarking studies and models.

Summary of the findings

Concerning the datasets, this review could not identify an exact list of features, given the heterogeneity mentioned above. However, there are some findings to report. First, the model’s performance is significantly affected by the dataset: accuracy decreased significantly when the dataset became big and complex. Clean and well-structured datasets with a small number of samples and features produce better models. However, a low number of attributes may not reflect the real complexity of multi-factorial diseases.

The top-performing models were the decision tree and the random forest, both with an accuracy of 0.99 and an AUC (ROC) equal to one. On average, the best models by accuracy were Swarm Optimization and Random Forest, with a value of one in both cases; by AUC (ROC), it was the decision tree, with 0.98.

The most frequently used methods were Deep Neural Networks, tree-type models (Gradient Boosting and Random Forest), and support vector machines. Deep Neural Networks have the advantage of dealing well with big data, a solid reason for their frequent use [ 27 , 28 ]. Studies using these models used datasets containing more than 70,000 observations. Also, these models deal well with dirty data.

Some studies used complementary techniques to improve their model’s performance. First, resampling techniques were applied to otherwise unbalanced datasets. Second, feature selection techniques were used to identify the most relevant features for prediction. The latter include principal component analysis and logistic regression.

Deep Neural Networks perform well but can still be improved. As shown in Figure  4 , their average accuracy is not top, yet some individual models achieved 0.9. Hence, they represent a technique worth further exploration in type 2 diabetes. They also have the advantage of dealing with large datasets: as shown in Table  2 , many of the datasets used for DNN models had around 70,000 or more samples. Also, DNN models do not require complementary techniques for feature selection.

Finally, model performance comparison was challenging due to the heterogeneity in the metrics reported.

Conclusions

This systematic review analyzed 90 studies to find the main opportunity areas in diabetes prediction using machine learning techniques.

The review finds that the structure of the dataset is relevant to the accuracy of the models, regardless of the selected features, which are heterogeneous between studies. Concerning the models, tree-type models performed best. However, even though they achieve the best accuracy, they require complementary techniques to balance the data and to reduce dimensionality by selecting the optimal features. For this reason, k-nearest neighbors and support vector machines are frequently preferred for prediction. On the other hand, Deep Neural Networks have the advantage of dealing well with big data, although they are best applied to datasets with more than 70,000 observations. At least three metrics and the AUC (ROC) should be reported in the results so that the remaining metrics can be estimated, reducing heterogeneity in performance comparisons. The resulting areas of opportunity are listed below.

Areas of opportunity

First, a well-structured, balanced dataset containing different types of features, such as lifestyle, socioeconomic, and diagnostic data, can be created to obtain a good model. Otherwise, complementary techniques can help to clean and balance the data.

Second, the choice of model depends on the characteristics of the dataset. When the dataset contains few observations, classical machine learning techniques perform better; when there are more than 70,000 observations, Deep Learning performs well.

Third, to reduce heterogeneity in the validation parameters, studies should report at least three parameters derived from the confusion matrix plus the AUC (ROC). Ideally, five or more parameters (accuracy, sensitivity, specificity, precision, and F1-score) should be reported to ease comparison. If one is missing, it can be estimated from the others.
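The claim that missing parameters can be derived from the others follows directly from the confusion-matrix definitions. This short sketch computes all five metrics from illustrative counts (not taken from any reviewed study); given any two of precision, sensitivity, and F1-score, the third can likewise be recovered.

```python
# Derive the five common metrics from confusion-matrix counts.
# The TP/TN/FP/FN values below are illustrative, not taken from any study.
tp, tn, fp, fn = 80, 90, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)       # 0.85
sensitivity = tp / (tp + fn)                     # recall: 0.80
specificity = tn / (tn + fp)                     # 0.90
precision = tp / (tp + fp)                       # ~0.89
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.84

print(accuracy, sensitivity, specificity, precision, round(f1, 4))
```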

Limitations of the study

The main limitation of this study is the heterogeneity between the models, which makes them difficult to compare. This heterogeneity is present in many aspects, chiefly the populations and the number of samples used in each model. Another significant limitation arises when a model predicts diabetes complications rather than diabetes itself.

Availability of data and materials

All data generated or analysed during this study are included in this published article and its references.

Abbreviations

Deep Neural Network

Random forest

Support Vector Machine

k-Nearest Neighbors

Decision tree

Gradient Boosting Tree

Gradient Boost Machine

J48 decision tree

Logistic regression and stepwise regression

Linear and quadratic discriminant analysis

Multiple Instance Learning boosting

Bayesian Network

Latent growth mixture

Cox Hazard Regression

Least-Square Regression

Least absolute shrinkage and selection operator

Smoothed clipped absolute deviation

Minimax concave penalized likelihood

Alternating Cluster and Classification

Machine learning-based method

Synthetic minority oversampling technique

Area under curve (receiver operating characteristic)

Diabetic retinopathy

Gaussian mixture

Naive Bayes

Average weighted objective distance

Swarm Optimization

Newton’s Divide Difference Method

Root-mean-square error

American Diabetes Association. Classification and diagnosis of diabetes: standards of medical care in diabetes-2020. Diabetes Care. 2019. https://doi.org/10.2337/dc20-S002 .


International Diabetes Federation. Diabetes. Brussels: International Diabetes Federation; 2019.


Gregg EW, Sattar N, Ali MK. The changing face of diabetes complications. Lancet Diabetes Endocrinol. 2016;4(6):537–47. https://doi.org/10.1016/s2213-8587(16)30010-9 .


Herman WH, Ye W, Griffin SJ, Simmons RK, Davies MJ, Khunti K, Rutten GEhm, Sandbaek A, Lauritzen T, Borch-Johnsen K, et al. Early detection and treatment of type 2 diabetes reduce cardiovascular morbidity and mortality: a simulation of the results of the Anglo-Danish-Dutch study of intensive treatment in people with screen-detected diabetes in primary care (addition-Europe). Diabetes Care. 2015;38(8):1449–55. https://doi.org/10.2337/dc14-2459 .


Kälsch J, Bechmann LP, Heider D, Best J, Manka P, Kälsch H, Sowa J-P, Moebus S, Slomiany U, Jöckel K-H, et al. Normal liver enzymes are correlated with severity of metabolic syndrome in a large population based cohort. Sci Rep. 2015;5(1):1–9. https://doi.org/10.1038/srep13058 .


Sanal MG, Paul K, Kumar S, Ganguly NK. Artificial intelligence and deep learning: the future of medicine and medical practice. J Assoc Physicians India. 2019;67(4):71–3.


Zhang A, Lipton ZC, Li M, Smola AJ. Dive into deep learning. 2020. https://d2l.ai .

Maniruzzaman M, Kumar N, Abedin MM, Islam MS, Suri HS, El-Baz AS, Suri JS. Comparative approaches for classification of diabetes mellitus data: machine learning paradigm. Comput Methods Programs Biomed. 2017;152:23–34. https://doi.org/10.1016/j.cmpb.2017.09.004 .

Muhammad LJ, Algehyne EA, Usman SS. Predictive supervised machine learning models for diabetes mellitus. SN Comput Sci. 2020;1(5):1–10. https://doi.org/10.1007/s42979-020-00250-8 .

Alghamdi M, Al-Mallah M, Keteyian S, Brawner C, Ehrman J, Sakr S. Predicting diabetes mellitus using smote and ensemble machine learning approach: the henry ford exercise testing (fit) project. PLoS ONE. 2017;12(7):e0179805. https://doi.org/10.1371/journal.pone.0179805 .


Mokarram R, Emadi M. Classification in non-linear survival models using cox regression and decision tree. Ann Data Sci. 2017;4(3):329–40. https://doi.org/10.1007/s40745-017-0105-4 .

Ivanova MT, Radoukova TI, Dospatliev LK, Lacheva MN. Ordinary least squared linear regression model for estimation of zinc in wild edible mushroom ( Suillus luteus (L.) roussel). Bulg J Agric Sci. 2020;26(4):863–9.

Bernardini M, Morettini M, Romeo L, Frontoni E, Burattini L. Early temporal prediction of type 2 diabetes risk condition from a general practitioner electronic health record: a multiple instance boosting approach. Artif Intell Med. 2020;105:101847. https://doi.org/10.1016/j.artmed.2020.101847 .

Xie J, Liu Y, Zeng X, Zhang W, Mei Z. A Bayesian network model for predicting type 2 diabetes risk based on electronic health records. Modern Phys Lett B. 2017;31(19–21):1740055. https://doi.org/10.1142/s0217984917400553 .

Hertroijs DFL, Elissen AMJ, Brouwers MCGJ, Schaper NC, Köhler S, Popa MC, Asteriadis S, Hendriks SH, Bilo HJ, Ruwaard D, et al. A risk score including body mass index, glycated haemoglobin and triglycerides predicts future glycaemic control in people with type 2 diabetes. Diabetes Obes Metab. 2017;20(3):681–8. https://doi.org/10.1111/dom.13148 .

Cole SR, Chu H, Greenland S. Maximum likelihood, profile likelihood, and penalized likelihood: a primer. Am J Epidemiol. 2013;179(2):252–60. https://doi.org/10.1093/aje/kwt245 .

Brisimi TS, Xu T, Wang T, Dai W, Paschalidis IC. Predicting diabetes-related hospitalizations based on electronic health records. Stat Methods Med Res. 2018;28(12):3667–82. https://doi.org/10.1177/0962280218810911 .

Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the prisma statement. PLoS Med. 2009;6(7):e1000097. https://doi.org/10.1371/journal.pmed.1000097 .

Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering—a systematic literature review. Inf Softw Technol. 2009;51(1):7–15. https://doi.org/10.1016/j.infsof.2008.09.009 .

Sambyal N, Saini P, Syal R. Microvascular complications in type-2 diabetes: a review of statistical techniques and machine learning models. Wirel Pers Commun. 2020;115(1):1–26. https://doi.org/10.1007/s11277-020-07552-3 .

Islam MM, Yang H-C, Poly TN, Jian W-S, Li Y-CJ. Deep learning algorithms for detection of diabetic retinopathy in retinal fundus photographs: a systematic review and meta-analysis. Comput Methods Programs Biomed. 2020;191:105320. https://doi.org/10.1016/j.cmpb.2020.105320 .

Chaki J, Ganesh ST, Cidham SK, Theertan SA. Machine learning and artificial intelligence based diabetes mellitus detection and self-management: a systematic review. J King Saud Univ Comput Inf Sci. 2020. https://doi.org/10.1016/j.jksuci.2020.06.013 .

Silva KD, Lee WK, Forbes A, Demmer RT, Barton C, Enticott J. Use and performance of machine learning models for type 2 diabetes prediction in community settings: a systematic review and meta-analysis. Int J Med Inform. 2020;143:104268. https://doi.org/10.1016/j.ijmedinf.2020.104268 .

Press G. Cleaning big data: most time-consuming, least enjoyable data science task, survey says. Forbes; 2016.

Prabhu P, Selvabharathi S. Deep belief neural network model for prediction of diabetes mellitus. In: 2019 3rd international conference on imaging, signal processing and communication (ICISPC). 2019. https://doi.org/10.1109/icispc.2019.8935838 .

Albahli S. Type 2 machine learning: an effective hybrid prediction model for early type 2 diabetes detection. J Med Imaging Health Inform. 2020;10(5):1069–75. https://doi.org/10.1166/jmihi.2020.3000 .

Maxwell A, Li R, Yang B, Weng H, Ou A, Hong H, Zhou Z, Gong P, Zhang C. Deep learning architectures for multi-label classification of intelligent health risk prediction. BMC Bioinform. 2017;18(S14):121–31. https://doi.org/10.1186/s12859-017-1898-z .

Nguyen BP, Pham HN, Tran H, Nghiem N, Nguyen QH, Do TT, Tran CT, Simpson CR. Predicting the onset of type 2 diabetes using wide and deep learning with electronic health records. Comput Methods Programs Biomed. 2019;182:105055. https://doi.org/10.1016/j.cmpb.2019.105055 .

Arellano-Campos O, Gómez-Velasco DV, Bello-Chavolla OY, Cruz-Bautista I, Melgarejo-Hernandez MA, Muñoz-Hernandez L, Guillén LE, Garduño-Garcia JDJ, Alvirde U, Ono-Yoshikawa Y, et al. Development and validation of a predictive model for incident type 2 diabetes in middle-aged Mexican adults: the metabolic syndrome cohort. BMC Endocr Disord. 2019;19(1):1–10. https://doi.org/10.1186/s12902-019-0361-8 .

You Y, Doubova SV, Pinto-Masis D, Pérez-Cuevas R, Borja-Aburto VH, Hubbard A. Application of machine learning methodology to assess the performance of DIABETIMSS program for patients with type 2 diabetes in family medicine clinics in Mexico. BMC Med Inform Decis Mak. 2019;19(1):1–15. https://doi.org/10.1186/s12911-019-0950-5 .

Pham T, Tran T, Phung D, Venkatesh S. Predicting healthcare trajectories from medical records: a deep learning approach. J Biomed Inform. 2017;69:218–29. https://doi.org/10.1016/j.jbi.2017.04.001 .

Spänig S, Emberger-Klein A, Sowa J-P, Canbay A, Menrad K, Heider D. The virtual doctor: an interactive clinical-decision-support system based on deep learning for non-invasive prediction of diabetes. Artif Intell Med. 2019;100:101706. https://doi.org/10.1016/j.artmed.2019.101706 .

Wang T, Xuan P, Liu Z, Zhang T. Assistant diagnosis with Chinese electronic medical records based on CNN and BILSTM with phrase-level and word-level attentions. BMC Bioinform. 2020;21(1):1–16. https://doi.org/10.1186/s12859-020-03554-x .

Kim YD, Noh KJ, Byun SJ, Lee S, Kim T, Sunwoo L, Lee KJ, Kang S-H, Park KH, Park SJ, et al. Effects of hypertension, diabetes, and smoking on age and sex prediction from retinal fundus images. Sci Rep. 2020;10(1):1–14. https://doi.org/10.1038/s41598-020-61519-9 .

Bernardini M, Romeo L, Misericordia P, Frontoni E. Discovering the type 2 diabetes in electronic health records using the sparse balanced support vector machine. IEEE J Biomed Health Inform. 2020;24(1):235–46. https://doi.org/10.1109/JBHI.2019.2899218 .

Mei J, Zhao S, Jin F, Zhang L, Liu H, Li X, Xie G, Li X, Xu M. Deep diabetologist: learning to prescribe hypoglycemic medications with recurrent neural networks. Stud Health Technol Inform. 2017;245:1277. https://doi.org/10.3233/978-1-61499-830-3-1277 .

Solares JRA, Canoy D, Raimondi FED, Zhu Y, Hassaine A, Salimi-Khorshidi G, Tran J, Copland E, Zottoli M, Pinho-Gomes A, et al. Long-term exposure to elevated systolic blood pressure in predicting incident cardiovascular disease: evidence from large-scale routine electronic health records. J Am Heart Assoc. 2019;8(12):e012129. https://doi.org/10.1161/jaha.119.012129 .

Kumar PS, Pranavi S. Performance analysis of machine learning algorithms on diabetes dataset using big data analytics. In: 2017 international conference on infocom technologies and unmanned systems (trends and future directions) (ICTUS). 2017. https://doi.org/10.1109/ictus.2017.8286062 .

Olivera AR, Roesler V, Iochpe C, Schmidt MI, Vigo A, Barreto SM, Duncan BB. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes-ELSA-Brasil: accuracy study. Sao Paulo Med J. 2017;135(3):234–46. https://doi.org/10.1590/1516-3180.2016.0309010217 .

Peddinti G, Cobb J, Yengo L, Froguel P, Kravić J, Balkau B, Tuomi T, Aittokallio T, Groop L. Early metabolic markers identify potential targets for the prevention of type 2 diabetes. Diabetologia. 2017;60(9):1740–50. https://doi.org/10.1007/s00125-017-4325-0 .

Dutta D, Paul D, Ghosh P. Analysing feature importances for diabetes prediction using machine learning. In: 2018 IEEE 9th annual information technology, electronics and mobile communication conference (IEMCON). 2018. https://doi.org/10.1109/iemcon.2018.8614871 .

Alhassan Z, Mcgough AS, Alshammari R, Daghstani T, Budgen D, Moubayed NA. Type-2 diabetes mellitus diagnosis from time series clinical data using deep learning models. In: artificial neural networks and machine learning—ICANN 2018 lecture notes in computer science. 2018. p. 468–78. https://doi.org/10.1007/978-3-030-01424-7_46 .

Kuo K-M, Talley P, Kao Y, Huang CH. A multi-class classification model for supporting the diagnosis of type II diabetes mellitus. PeerJ. 2020;8:e9920. https://doi.org/10.7717/peerj.992 .

Pimentel A, Carreiro AV, Ribeiro RT, Gamboa H. Screening diabetes mellitus 2 based on electronic health records using temporal features. Health Inform J. 2018;24(2):194–205. https://doi.org/10.1177/1460458216663023 .

Talaei-Khoei A, Wilson JM. Identifying people at risk of developing type 2 diabetes: a comparison of predictive analytics techniques and predictor variables. Int J Med Inform. 2018;119:22–38. https://doi.org/10.1016/j.ijmedinf.2018.08.008 .

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2019;7:1365–75. https://doi.org/10.1109/access.2018.2884249 .

Yuvaraj N, Sripreethaa KR. Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster. Cluster Comput. 2017;22(S1):1–9. https://doi.org/10.1007/s10586-017-1532-x .

Deo R, Panigrahi S. Performance assessment of machine learning based models for diabetes prediction. In: 2019 IEEE healthcare innovations and point of care technologies, (HI-POCT). 2019. https://doi.org/10.1109/hi-poct45284.2019.8962811 .

Jakka A, Jakka VR. Performance evaluation of machine learning models for diabetes prediction. Int J Innov Technol Explor Eng Regular Issue. 2019;8(11):1976–80. https://doi.org/10.35940/ijitee.K2155.0981119 .

Radja M, Emanuel AWR. Performance evaluation of supervised machine learning algorithms using different data set sizes for diabetes prediction. In: 2019 5th international conference on science in information technology (ICSITech). 2019. https://doi.org/10.1109/icsitech46713.2019.8987479 .

Choi BG, Rha S-W, Kim SW, Kang JH, Park JY, Noh Y-K. Machine learning for the prediction of new-onset diabetes mellitus during 5-year follow-up in non-diabetic patients with cardiovascular risks. Yonsei Med J. 2019;60(2):191. https://doi.org/10.3349/ymj.2019.60.2.191 .

Akula R, Nguyen N, Garibay I. Supervised machine learning based ensemble model for accurate prediction of type 2 diabetes. In: 2019 SoutheastCon. 2019. https://doi.org/10.1109/southeastcon42311.2019.9020358 .

Xie Z, Nikolayeva O, Luo J, Li D. Building risk prediction models for type 2 diabetes using machine learning techniques. Prev Chronic Dis. 2019. https://doi.org/10.5888/pcd16.190109 .

Lai H, Huang H, Keshavjee K, Guergachi A, Gao X. Predictive models for diabetes mellitus using machine learning techniques. BMC Endocr Disord. 2019;19(1):1–9. https://doi.org/10.1186/s12902-019-0436-6 .

Abbas H, Alic L, Erraguntla M, Ji J, Abdul-Ghani M, Abbasi Q, Qaraqe M. Predicting long-term type 2 diabetes with support vector machine using oral glucose tolerance test. bioRxiv. 2019. https://doi.org/10.1371/journal.pone.0219636 .

Sarker I, Faruque M, Alqahtani H, Kalim A. K-nearest neighbor learning based diabetes mellitus prediction and analysis for ehealth services. EAI Endorsed Trans Scalable Inf Syst. 2020. https://doi.org/10.4108/eai.13-7-2018.162737 .

Cahn A, Shoshan A, Sagiv T, Yesharim R, Goshen R, Shalev V, Raz I. Prediction of progression from pre-diabetes to diabetes: development and validation of a machine learning model. Diabetes Metab Res Rev. 2020;36(2):e3252. https://doi.org/10.1002/dmrr.3252 .

Garcia-Carretero R, Vigil-Medina L, Mora-Jimenez I, Soguero-Ruiz C, Barquero-Perez O, Ramos-Lopez J. Use of a k-nearest neighbors model to predict the development of type 2 diabetes within 2 years in an obese, hypertensive population. Med Biol Eng Comput. 2020;58(5):991–1002. https://doi.org/10.1007/s11517-020-02132-w .

Zhang L, Wang Y, Niu M, Wang C, Wang Z. Machine learning for characterizing risk of type 2 diabetes mellitus in a rural Chinese population: the Henan rural cohort study. Sci Rep. 2020;10(1):1–10. https://doi.org/10.1038/s41598-020-61123-x .

Haq AU, Li JP, Khan J, Memon MH, Nazir S, Ahmad S, Khan GA, Ali A. Intelligent machine learning approach for effective recognition of diabetes in e-healthcare using clinical data. Sensors. 2020;20(9):2649. https://doi.org/10.3390/s20092649 .


Yang T, Zhang L, Yi L, Feng H, Li S, Chen H, Zhu J, Zhao J, Zeng Y, Liu H, et al. Ensemble learning models based on noninvasive features for type 2 diabetes screening: model development and validation. JMIR Med Inform. 2020;8(6):e15431. https://doi.org/10.2196/15431 .

Ahn H-S, Kim JH, Jeong H, Yu J, Yeom J, Song SH, Kim SS, Kim IJ, Kim K. Differential urinary proteome analysis for predicting prognosis in type 2 diabetes patients with and without renal dysfunction. Int J Mol Sci. 2020;21(12):4236. https://doi.org/10.3390/ijms21124236 .


Sarwar MA, Kamal N, Hamid W, Shah MA. Prediction of diabetes using machine learning algorithms in healthcare. In: 2018 24th international conference on automation and computing (ICAC). 2018. https://doi.org/10.23919/iconac.2018.8748992 .

Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting diabetes mellitus with machine learning techniques. Front Genet. 2018;9:515. https://doi.org/10.3389/fgene.2018.00515 .

Farran B, AlWotayan R, Alkandari H, Al-Abdulrazzaq D, Channanath A, Thanaraj TA. Use of non-invasive parameters and machine-learning algorithms for predicting future risk of type 2 diabetes: a retrospective cohort study of health data from Kuwait. Front Endocrinol. 2019;10:624. https://doi.org/10.3389/fendo.2019.00624 .

Xiong X-L, Zhang R-X, Bi Y, Zhou W-H, Yu Y, Zhu D-L. Machine learning models in type 2 diabetes risk prediction: results from a cross-sectional retrospective study in Chinese adults. Curr Med Sci. 2019;39(4):582–8. https://doi.org/10.1007/s11596-019-2077-4 .

Dinh A, Miertschin S, Young A, Mohanty SD. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med Inform Decis Mak. 2019;19(1):1–15. https://doi.org/10.1186/s12911-019-0918-5 .

Liu Y, Ye S, Xiao X, Sun C, Wang G, Wang G, Zhang B. Machine learning for tuning, selection, and ensemble of multiple risk scores for predicting type 2 diabetes. Risk Manag Healthc Policy. 2019;12:189–98. https://doi.org/10.2147/rmhp.s225762 .

Tang Y, Gao R, Lee HH, Wells QS, Spann A, Terry JG, Carr JJ, Huo Y, Bao S, Landman BA, et al. Prediction of type II diabetes onset with computed tomography and electronic medical records. In: Multimodal learning for clinical decision support and clinical image-based procedures. Cham: Springer; 2020. p. 13–23. https://doi.org/10.1007/978-3-030-60946-7_2 .


Maniruzzaman M, Rahman MJ, Ahammed B, Abedin MM. Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst. 2020;8(1):1–14. https://doi.org/10.1007/s13755-019-0095-z .

Boutilier JJ, Chan TCY, Ranjan M, Deo S. Risk stratification for early detection of diabetes and hypertension in resource-limited settings: machine learning analysis. J Med Internet Res. 2021;23(1):20123. https://doi.org/10.2196/20123 .

Li J, Chen Q, Hu X, Yuan P, Cui L, Tu L, Cui J, Huang J, Jiang T, Ma X, Yao X, Zhou C, Lu H, Xu J. Establishment of noninvasive diabetes risk prediction model based on tongue features and machine learning techniques. Int J Med Inform. 2021;149:104429. https://doi.org/10.1016/j.ijmedinf.2021.10442 .

Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P. Using wearable activity trackers to predict type 2 diabetes: machine learning-based cross-sectional study of the UK biobank accelerometer cohort. JMIR Diabetes. 2021;6(1):23364. https://doi.org/10.2196/23364 .

Deberneh HM, Kim I. Prediction of Type 2 diabetes based on machine learning algorithm. Int J Environ Res Public Health. 2021;18(6):3317. https://doi.org/10.3390/ijerph1806331 .

He Y, Lakhani CM, Rasooly D, Manrai AK, Tzoulaki I, Patel CJ. Comparisons of polyexposure, polygenic, and clinical risk scores in risk prediction of type 2 diabetes. Diabetes Care. 2021;44(4):935–43. https://doi.org/10.2337/dc20-2049 .

García-Ordás MT, Benavides C, Benítez-Andrades JA, Alaiz-Moretón H, García-Rodríguez I. Diabetes detection using deep learning techniques with oversampling and feature augmentation. Comput Methods Programs Biomed. 2021;202:105968. https://doi.org/10.1016/j.cmpb.2021.105968 .

Kanimozhi N, Singaravel G. Hybrid artificial fish particle swarm optimizer and kernel extreme learning machine for type-II diabetes predictive model. Med Biol Eng Comput. 2021;59(4):841–67. https://doi.org/10.1007/s11517-021-02333-x .


Ravaut M, Sadeghi H, Leung KK, Volkovs M, Kornas K, Harish V, Watson T, Lewis GF, Weisman A, Poutanen T, et al. Predicting adverse outcomes due to diabetes complications with machine learning using administrative health data. NPJ Digit Med. 2021;4(1):1–12. https://doi.org/10.1038/s41746-021-00394-8 .

De Silva K, Lim S, Mousa A, Teede H, Forbes A, Demmer RT, Jonsson D, Enticott J. Nutritional markers of undiagnosed type 2 diabetes in adults: findings of a machine learning analysis with external validation and benchmarking. PLoS ONE. 2021;16(5):e0250832. https://doi.org/10.1371/journal.pone.025083 .

Kim H, Lim DH, Kim Y. Classification and prediction on the effects of nutritional intake on overweight/obesity, dyslipidemia, hypertension and type 2 diabetes mellitus using deep learning model: 4–7th Korea national health and nutrition examination survey. Int J Environ Res Public Health. 2021;18(11):5597. https://doi.org/10.3390/ijerph18115597 .

Vangeepuram N, Liu B, Chiu P-H, Wang L, Pandey G. Predicting youth diabetes risk using NHANES data and machine learning. Sci Rep. 2021;11(1):1. https://doi.org/10.1038/s41598-021-90406- .

Recenti M, Ricciardi C, Edmunds KJ, Gislason MK, Sigurdsson S, Carraro U, Gargiulo P. Healthy aging within an image: using muscle radiodensitometry and lifestyle factors to predict diabetes and hypertension. IEEE J Biomed Health Inform. 2021;25(6):2103–12. https://doi.org/10.1109/JBHI.2020.304415 .

Ramesh J, Aburukba R, Sagahyroon A. A remote healthcare monitoring framework for diabetes prediction using machine learning. Healthc Technol Lett. 2021;8(3):45–57. https://doi.org/10.1049/htl2.12010 .

Lama L, Wilhelmsson O, Norlander E, Gustafsson L, Lager A, Tynelius P, Wärvik L, Östenson C-G. Machine learning for prediction of diabetes risk in middle-aged Swedish people. Heliyon. 2021;7(7):e07419. https://doi.org/10.1016/j.heliyon.2021.e07419 .

Shashikant R, Chaskar U, Phadke L, Patil C. Gaussian process-based kernel as a diagnostic model for prediction of type 2 diabetes mellitus risk using non-linear heart rate variability features. Biomed Eng Lett. 2021;11(3):273–86. https://doi.org/10.1007/s13534-021-00196-7 .

Kalagotla SK, Gangashetty SV, Giridhar K. A novel stacking technique for prediction of diabetes. Comput Biol Med. 2021;135:104554. https://doi.org/10.1016/j.compbiomed.2021.104554 .

Moon S, Jang J-Y, Kim Y, Oh C-M. Development and validation of a new diabetes index for the risk classification of present and new-onset diabetes: multicohort study. Sci Rep. 2021;11(1):1–10. https://doi.org/10.1038/s41598-021-95341-8 .

Ihnaini B, Khan MA, Khan TA, Abbas S, Daoud MS, Ahmad M, Khan MA. A smart healthcare recommendation system for multidisciplinary diabetes patients with data fusion based on deep ensemble learning. Comput Intell Neurosci. 2021;2021:1–11. https://doi.org/10.1155/2021/4243700 .

Rufo DD, Debelee TG, Ibenthal A, Negera WG. Diagnosis of diabetes mellitus using gradient boosting machine (LightGBM). Diagnostics. 2021;11(9):1714. https://doi.org/10.3390/diagnostics11091714 .

Haneef R, Fuentes S, Fosse-Edorh S, Hrzic R, Kab S, Cosson E, Gallay A. Use of artificial intelligence for public health surveillance: a case study to develop a machine learning-algorithm to estimate the incidence of diabetes mellitus in France. Arch Public Health. 2021. https://doi.org/10.21203/rs.3.rs-139421/v1 .

Wei H, Sun J, Shan W, Xiao W, Wang B, Ma X, Hu W, Wang X, Xia Y. Environmental chemical exposure dynamics and machine learning-based prediction of diabetes mellitus. Sci Tot Environ. 2022;806:150674. https://doi.org/10.1016/j.scitotenv.2021.150674 .

Leerojanaprapa K, Sirikasemsuk K. Comparison of Bayesian networks for diabetes prediction. In: International conference on computer, communication and computational sciences (IC4S), Bangkok, Thailand, Oct 20–21, 2018. 2019;924:425–434. https://doi.org/10.1007/978-981-13-6861-5_37 .

Subbaiah S, Kavitha M. Random forest algorithm for predicting chronic diabetes disease. Int J Life Sci Pharma Res. 2020;8:4–8.

Thenappan S, Rajkumar MV, Manoharan PS. Predicting diabetes mellitus using modified support vector machine with cloud security. IETE J Res. 2020. https://doi.org/10.1080/03772063.2020.178278 .

Sneha N, Gangil T. Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data. 2019;6(1):1–19. https://doi.org/10.1186/s40537-019-0175-6 .

Jain S. A supervised model for diabetes divination. Biosci Biotechnol Res Commun. 2020;13(14, SI):315–8. https://doi.org/10.21786/bbrc/13.14/7 .

Syed AH, Khan T. Machine learning-based application for predicting risk of type 2 diabetes mellitus (T2DM) in Saudi Arabia: a retrospective cross-sectional study. IEEE Access. 2020;8:199539–61. https://doi.org/10.1109/ACCESS.2020.303502 .

Nuankaew P, Chaising S, Temdee P. Average weighted objective distance-based method for type 2 diabetes prediction. IEEE Access. 2021;9:137015–28. https://doi.org/10.1109/ACCESS.2021.311726 .

Samreen S. Memory-efficient, accurate and early diagnosis of diabetes through a machine learning pipeline employing crow search-based feature engineering and a stacking ensemble. IEEE Access. 2021;9:134335–54. https://doi.org/10.1109/ACCESS.2021.311638 .

Fazakis N, Kocsis O, Dritsas E, Alexiou S, Fakotakis N, Moustakas K. Machine learning tools for long-term type 2 diabetes risk prediction. IEEE Access. 2021;9:103737–57. https://doi.org/10.1109/ACCESS.2021.309869 .

Omana J, Moorthi M. Predictive analysis and prognostic approach of diabetes prediction with machine learning techniques. Wirel Pers Commun. 2021. https://doi.org/10.1007/s11277-021-08274-w .

Ravaut M, Harish V, Sadeghi H, Leung KK, Volkovs M, Kornas K, Watson T, Poutanen T, Rosella LC. Development and validation of a machine learning model using administrative health data to predict onset of type 2 diabetes. JAMA Netw Open. 2021;4(5):2111315. https://doi.org/10.1001/jamanetworkopen.2021.11315 .

Lang L-Y, Gao Z, Wang X-G, Zhao H, Zhang Y-P, Sun S-J, Zhang Y-J, Austria RS. Diabetes prediction model based on deep belief network. J Comput Methods Sci Eng. 2021;21(4):817–28. https://doi.org/10.3233/JCM-20465 .

Gupta H, Varshney H, Sharma TK, Pachauri N, Verma OP. Comparative performance analysis of quantum machine learning with deep learning for diabetes prediction. Complex Intell Syst. 2021. https://doi.org/10.1007/s40747-021-00398-7 .

Roy K, Ahmad M, Waqar K, Priyaah K, Nebhen J, Alshamrani SS, Raza MA, Ali I. An enhanced machine learning framework for type 2 diabetes classification using imbalanced data with missing values. Complexity. 2021. https://doi.org/10.1155/2021/995331 .

Zhang L, Wang Y, Niu M, Wang C, Wang Z. Nonlaboratory-based risk assessment model for type 2 diabetes mellitus screening in Chinese rural population: a joint bagging-boosting model. IEEE J Biomed Health Inform. 2021;25(10):4005–16. https://doi.org/10.1109/JBHI.2021.307711 .

Turnea M, Ilea M. Predictive simulation for type II diabetes using data mining strategies applied to Big Data. In: 14th international scientific conference on eLearning and software for education—eLearning challenges and new horizons, Bucharest, Romania, Apr 19–20, 2018. 2018. p. 481–6. https://doi.org/10.12753/2066-026X-18-213 .

Vettoretti M, Di Camillo B. A variable ranking method for machine learning models with correlated features: in-silico validation and application for diabetes prediction. Appl Sci. 2021;11(16):7740. https://doi.org/10.3390/app11167740 .


Acknowledgements

We would like to thank Vicerrectoría de Investigación y Posgrado, the Research Group of Product Innovation, the Cyber Learning and Data Science Laboratory, and the School of Engineering and Sciences of Tecnologico de Monterrey.

This study was funded by Vicerrectoría de Investigación y Posgrado and the Research Group of Product Innovation of Tecnologico de Monterrey, by a scholarship provided by Tecnologico de Monterrey to graduate student A01339273 Luis Fregoso-Aparicio, and a national scholarship granted by the Consejo Nacional de Ciencia y Tecnologia (CONACYT) to study graduate programs in institutions enrolled in the Padron Nacional de Posgrados de Calidad (PNPC) to CVU 962778 - Luis Fregoso-Aparicio.

Author information

Authors and affiliations

School of Engineering and Sciences, Tecnologico de Monterrey, Av Lago de Guadalupe KM 3.5, Margarita Maza de Juarez, 52926, Cd Lopez Mateos, Mexico

Luis Fregoso-Aparicio

School of Engineering and Sciences, Tecnologico de Monterrey, Ave. Eugenio Garza Sada 2501, 64849, Monterrey, Nuevo Leon, Mexico

Julieta Noguez & Luis Montesinos

Hospital General de Mexico Dr. Eduardo Liceaga, Dr. Balmis 148, Doctores, Cuauhtemoc, 06720, Mexico City, Mexico

José A. García-García


Contributions

Individual contributions are the following; conceptualization, methodology, and investigation: LF-A and JN; validation: LM and JAG-G; writing—original draft preparation and visualization: LF-A; writing—review and editing: LM and JN; supervision: JAG-G; project administration: JN; and funding acquisition: LF-A and JN. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Julieta Noguez .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Fregoso-Aparicio, L., Noguez, J., Montesinos, L. et al. Machine learning and deep learning predictive models for type 2 diabetes: a systematic review. Diabetol Metab Syndr 13 , 148 (2021). https://doi.org/10.1186/s13098-021-00767-9


Received : 06 July 2021

Accepted : 07 December 2021

Published : 20 December 2021

DOI : https://doi.org/10.1186/s13098-021-00767-9


  • Machine learning
  • Deep learning
  • Electronic health records

Diabetology & Metabolic Syndrome

ISSN: 1758-5996


Deep learning approach for diabetes prediction using PIMA Indian dataset

  • Research article
  • Published: 14 April 2020
  • Volume 19 , pages 391–403, ( 2020 )


  • Huma Naz 1 &
  • Sachin Ahuja 1  

3663 Accesses

147 Citations


The International Diabetes Federation (IDF) reports that 382 million people are living with diabetes worldwide. Over the last few years, the impact of diabetes has increased drastically, making it a global threat. Diabetes is now consistently ranked among the leading causes of death, and the number of affected people is projected to reach 629 million by 2045, a 48% increase. However, diabetes is largely preventable and can be avoided by making lifestyle changes. These changes can also lower the chances of developing heart disease and cancer. There is therefore a dire need for a prognosis tool that can help doctors detect the disease early and recommend the lifestyle changes required to stop its progression.

If left untreated, diabetes can be fatal, and it directly or indirectly invites many other diseases, such as heart attack, heart failure, and stroke. Early detection of diabetes is therefore very significant, so that timely action can be taken and the progression of the disease prevented, avoiding further complications. Healthcare organizations accumulate huge amounts of data, including electronic health records, images, omics data, and text, but gaining knowledge and insight from these data remains a key challenge. The latest advances in machine learning can be applied to uncover hidden patterns, which may diagnose diabetes at an early phase. This research paper presents a methodology for diabetes prediction using diverse machine learning algorithms on the PIMA dataset.

The accuracies achieved by the Artificial Neural Network (ANN), Naive Bayes (NB), Decision Tree (DT), and Deep Learning (DL) classifiers lie within the range of 90–98%. Among the four, DL provides the best results for diabetes onset, with an accuracy of 98.07% on the PIMA dataset. Hence, the proposed system provides an effective prognostic tool for healthcare officials, and the results obtained can be used to develop a novel automatic prognosis tool that can help in early detection of the disease.

The outcome of the study confirms that DL provides the best results with the most promising extracted features, achieving an accuracy of 98.07% that can underpin further development of the automatic prognosis tool. The accuracy of the DL approach could be enhanced further by including omics data in predicting the onset of the disease.
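As an illustration only (this is not the authors' code), the four classifier families compared in the abstract can be sketched with scikit-learn, using a shallow multilayer perceptron for the ANN and a deeper one as a stand-in for the DL model, on synthetic data shaped like the eight-feature PIMA dataset:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the PIMA data (768 records, 8 attributes)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "DL": MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500, random_state=0),
}

# 5-fold cross-validated accuracy for each model family
results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

The reported 90–98% accuracies come from the real PIMA data and the authors' preprocessing; the synthetic scores here will differ.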



Author information

Authors and Affiliations

Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India

Huma Naz & Sachin Ahuja


Corresponding author

Correspondence to Huma Naz .

Ethics declarations

Conflict of interest.

The authors declare that they have no conflicts of interest.

Research involving human participants and/or animals

There is no direct human participation in the manuscript.

Informed consent

Informed consent was obtained from all individual participants involved in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Naz, H., Ahuja, S. Deep learning approach for diabetes prediction using PIMA Indian dataset. J Diabetes Metab Disord 19 , 391–403 (2020). https://doi.org/10.1007/s40200-020-00520-5


Received : 06 November 2019

Accepted : 20 March 2020

Published : 14 April 2020

Issue Date : June 2020

DOI : https://doi.org/10.1007/s40200-020-00520-5


  • Diabetes prediction
  • Deep learning
  • Data mining algorithms
  • Neural network
  • PIMA Indian dataset

Healthcare (Basel)

A Machine Learning Approach to Predicting Diabetes Complications

Associated data.

The data presented in this study is private.

Diabetes mellitus (DM) is a chronic disease that is considered to be life-threatening. It can affect any part of the body over time, resulting in serious complications such as nephropathy, neuropathy, and retinopathy. In this work, several supervised classification algorithms were applied to build models that predict and classify eight diabetes complications: metabolic syndrome, dyslipidemia, neuropathy, nephropathy, diabetic foot, hypertension, obesity, and retinopathy. For this study, a dataset collected by the Rashid Center for Diabetes and Research (RCDR) located in Ajman, UAE, was utilized. The dataset consists of 884 records with 79 features. Essential preprocessing steps were applied to handle the missing-value and unbalanced-data problems. Furthermore, feature selection was performed to select the top five and top ten features for each complication. The final number of records used to train and build the binary classifier for each complication was as follows: 428—metabolic syndrome, 836—dyslipidemia, 223—neuropathy, 233—nephropathy, 240—diabetic foot, 586—hypertension, 498—obesity, 228—retinopathy. Repeated stratified k-fold cross-validation (with k = 10 and a total of 10 repetitions) was employed for a better estimate of performance. Accuracy and F1-score were used to evaluate the models’ performance, reaching maxima of 97.8% (accuracy) and 97.7% (F1-score). Moreover, by comparing the performance achieved using different attribute sets, it was found that adequate classifiers can still be built using a reduced number of selected features.

1. Introduction

Diabetes mellitus, or diabetes for short, is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces [ 1 ]. Diabetes has two main types, called type 1 and type 2. In type 1 diabetes (also known as insulin-dependent or childhood-onset), there is an insulin production deficiency in the body, which requires daily administration of insulin, whereas in type 2 diabetes (formerly known as non-insulin-dependent or adult-onset), the body cannot use insulin effectively. According to the World Health Organization (WHO), the number of people with diabetes in 2014 was 422 million. Moreover, in 2016, diabetes was the direct cause of 1.6 million deaths [ 1 ].

There are different causes for diabetes. For instance, type 1 diabetes mellitus (T1DM) can develop due to an autoimmune reaction that destroys the insulin-producing cells in the pancreas, called beta cells [ 2 ], whereas type 2 diabetes is mainly associated with risk factors such as age, family history of diabetes, high blood pressure, high levels of triglycerides, heart disease, and stroke [ 3 ]. Early detection of diabetes can be of great benefit, especially because the rate of progression from prediabetes to type 2 diabetes is quite high. According to the CDC [ 4 ], diabetes can affect any part of the body over time, leading to different types of complications. The most common types are divided into micro- and macrovascular disorders. The former are long-term complications that affect small blood vessels, including retinopathy, nephropathy, and neuropathy. Macrovascular disorders, however, include ischemic heart disease, peripheral vascular disease, and cerebrovascular disease [ 5 ].

Due to high diabetes mortality and morbidity along with its possible complications, it is very important to understand how to deal with diabetes and how to prevent such possible complications.

To reduce the possibility of developing some serious complications related to diabetes, machine learning and data mining techniques can be applied to diabetes-related datasets. Machine learning is a branch of artificial intelligence and computer science that focuses on the use of data and algorithms to imitate the way that humans learn. Machine learning itself can be divided into two main categories, namely, supervised and unsupervised learning [ 6 ]. The main goal in both cases is to make use of a given dataset to enhance our understanding of the data and discover useful knowledge. Supervised machine learning is characterized by the use of labeled data to train its algorithms and can be utilized for classification or regression tasks. The goal of classification is to assign each unknown instance to one of several possible classes or categories for prediction or diagnosis purposes.

The proposed work implements several supervised machine learning techniques and algorithms to predict different complications related to diabetes. Unlike typical diabetes datasets, this dataset covers a varied set of complications: metabolic syndrome, dyslipidemia, neuropathy, nephropathy, diabetic foot, hypertension, obesity, and retinopathy. Furthermore, logistic regression (LR), support vector machine (SVM), decision tree (DT CART), random forest (RF), AdaBoost, and XGBoost were utilized to build and evaluate the resulting classifiers. The contributions of this work are as follows:

  • Implementation and evaluation of traditional and ensemble machine learning models to predict eight complications in diabetic patients by utilizing a comprehensive UAE-based dataset.
  • Identification of the dominant characteristics that may lead to diabetic complications using feature selection methods.
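For orientation, comparing these six model families can be sketched as follows. This is a minimal illustration on synthetic data, not the RCDR dataset or the authors' code, and scikit-learn's GradientBoostingClassifier stands in for XGBoost:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DT (CART)": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "XGBoost stand-in": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated accuracy per model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```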

2. Literature Review

Data mining can be utilized in different sectors such as education, healthcare, business, and many other fields. The applications of data mining in healthcare enable disease diagnosis, prognosis, and a deep understanding of medical data [ 7 , 8 ]. For instance, it may provide a better understanding of the correlation between different chronic diseases [ 6 ], such as diabetes mellitus (DM), which is a serious health problem and a cause of death.

Diagnosis and prognosis of DM have received a lot of attention. Hasan et al. [ 9 ] proposed a new approach for diabetes prediction using the PIMA Indians Diabetes (PIDD) dataset. The dataset consists of 768 female patients, specifically 268 diabetic (positive) and 500 non-diabetic (negative), with eight attributes: pregnancies, glucose, blood pressure, triceps skinfold, insulin, BMI, diabetes pedigree, and age. As the authors note, preprocessing is at the heart of achieving state-of-the-art results; their pipeline consists of outlier rejection, mean substitution for missing values, data standardization, feature selection, and k-fold cross-validation (fivefold in this case). Decision trees, k-NN, AdaBoost, random forest, naïve Bayes, and XGBoost were all tested in this study. The authors also used an ensemble technique that aimed to boost performance using a group of classifiers; in ensemble methods, aggregating the outputs of different models can improve prediction performance. The best combination was AdaBoost together with XGBoost. The area under the curve (AUC) was chosen as the performance metric, and the study achieved an AUC of 0.95, outperforming other studies.
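A rough sketch of such a pipeline (mean imputation, standardization, and a soft-voting ensemble evaluated by AUC under 5-fold cross-validation) might look as follows. This is not the authors' code: the data are synthetic, and scikit-learn's GradientBoostingClassifier again stands in for XGBoost:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X[::25, 0] = np.nan  # simulate missing values in one attribute

# Soft-voting ensemble of the two boosting models
ensemble = VotingClassifier(
    estimators=[("ada", AdaBoostClassifier(random_state=1)),
                ("gb", GradientBoostingClassifier(random_state=1))],
    voting="soft",
)

# Imputation and scaling happen inside each CV fold via the pipeline
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), ensemble)
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(f"mean AUC: {auc:.3f}")
```

Placing the imputer and scaler inside the pipeline ensures their statistics are fitted only on each fold's training split, avoiding leakage into the test split.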

Sisodia et al. [ 10 ] aimed to predict the likelihood of diabetes in patients with maximum accuracy, using the same PIDD dataset as the previous study [ 9 ]. A decision tree, SVM, and naïve Bayes were used to detect diabetes at an early stage, and accuracy, precision, recall, and F-score were used to compare model performance. As reported in the paper, naïve Bayes achieved the best results, with a maximum accuracy of 76.3%.

In [ 11 ], a performance comparison between three data mining models for predicting diabetes or prediabetes was presented: logistic regression (LR), artificial neural networks (ANN), and decision trees (DT). The balanced dataset used consists of 735 patients and 752 normal controls. The 12 attributes used in building the models were gender, age, marital status, educational level, family history of diabetes, BMI, coffee drinking, physical activity, sleep duration, work stress, consumption of fish, and preference for salty foods, all gathered by means of a questionnaire. The authors concluded that the C5.0 decision tree achieved the best classification accuracy.

Abdulhadi et al. [ 12 ] constructed several machine learning models to predict the presence of diabetes in women using the PIDD dataset. The authors addressed the missing-value problem using mean substitution and rescaled all the attributes using standardization. LR, linear discriminant analysis (LDA), SVM (linear and polynomial), and RF were used to build the models. According to the paper, a maximum accuracy of 82% was achieved by the RF model.

In addition to predicting the presence of diabetes in patients, a few existing studies have reported the use of machine learning to develop prediction models of diabetes complications. For instance, in [ 13 ], a model was built to predict some chronic diabetes complications, specifically eye disease, kidney disease, coronary heart disease, and hyperlipidemia. The authors started with a dataset of 455 records, which shrank through data selection and cleaning; the final numbers of records and features used to build the model were not mentioned in the paper. The authors used the Iterative Dichotomiser 3 (ID3) decision tree algorithm to build the model [ 14 ]. To evaluate its performance, 10-fold cross-validation was used, yielding an accuracy of 92.35%. It is worth mentioning that a high accuracy score alone is not sufficient to indicate the performance of the model, especially in the case of unbalanced data. This is mainly because a model can ignore the minority class, predict every instance as the majority class, and still achieve a good accuracy score.
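The accuracy trap on unbalanced data described above is easy to demonstrate: a "classifier" that always predicts the majority class scores high accuracy yet never detects the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 95% majority class, 5% minority
y_pred = np.zeros_like(y_true)          # always predict the majority class

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(acc, f1)  # 0.95 0.0 -> high accuracy, but the minority class is never found
```

This is why the F1-score (or AUC) is reported alongside accuracy throughout this work.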

In [ 15 ], HbA1c regression models were developed. As mentioned in the paper, HbA1c reflects the average amount of glucose accumulated in the blood over the previous 2–3 months and is directly related to diabetes and the future risk of complications. The dataset used in this study was collected from the Diabetes Research in Children Network (DirecNet) trials on 170 subjects with type 1 diabetes mellitus aged 4 to <10 years. The missing-data problem was addressed using mean substitution, while any attribute with more than 20% missing values was discarded. Moreover, several feature extraction and selection methods were applied to the dataset. According to the paper, the final ML model, which combined two ensemble methods, RF and extreme gradient boosting (XGB), achieved a low mean absolute error (MAE) of 3.39 mmol/mol and a high coefficient of determination (R-squared) of 0.81.
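Evaluating such regression models with MAE and R-squared can be sketched as below; the data are synthetic (not DirecNet's), and scikit-learn's GradientBoostingRegressor stands in for XGB:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

results = {}
for model in (RandomForestRegressor(random_state=2),
              GradientBoostingRegressor(random_state=2)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[type(model).__name__] = (mean_absolute_error(y_te, pred),
                                     r2_score(y_te, pred))

for name, (mae, r2) in results.items():
    print(f"{name}: MAE={mae:.2f} R2={r2:.2f}")
```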

Dagliati et al. [ 16 ] focused on predicting the onset of retinopathy, neuropathy, and nephropathy in T2DM patients in different time scenarios: at 3, 5, and 7 years from the first hospital visit, which provided the patient’s baseline health status. Patients were selected according to the following criteria: a follow-up time longer than the corresponding temporal threshold (3, 5, or 7 years); development of the complication after the first visit; and a registered complication onset date. The dataset was collected by Istituto Clinico Scientifico Maugeri (ICSM), Hospital of Pavia, Italy, over more than 10 years. It contains 943 records with the following features: gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. The classification models used were LR, NB, SVM, and RF. Missing data were handled using missForest [ 17 ], whereas the class imbalance problem was solved by oversampling the minority class. According to the paper, the maximum accuracy, 77.7%, was reached by LR.
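Oversampling the minority class, in its simplest random form, amounts to duplicating minority rows until the classes are balanced. A minimal sketch (illustrative only; not the exact procedure used in [ 16 ]):

```python
import numpy as np

def oversample(X, y, seed=0):
    """Duplicate rows of minority classes until all classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
                          for c in classes])
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # 8 vs. 2: unbalanced
X_bal, y_bal = oversample(X, y)
print(np.bincount(y_bal))         # [8 8]
```

In practice, oversampling should be applied only to the training split, never before cross-validation, to avoid leaking duplicated rows into the test folds.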

In [ 18 ], the authors focused on a single complication, sarcopenia, a geriatric syndrome closely related to the prevalence of type 2 diabetes mellitus (T2DM). The goal of that paper was to make the risk assessment of sarcopenia easier by building ML models using SVM and RF. The dataset used is limited in size, with only 132 records of patients aged over 65 and diagnosed with T2DM. It contains several attributes per patient, such as age, duration of diabetes, history of hypertension, and smoking and drinking habits, as well as some medical records such as serum albumin and 25-OH vitamin D3. The missing-value problem was solved using a k-NN classifier with k set to 10. As mentioned in the paper, the area under the receiver operating characteristic (ROC) curve (AUC) was over 0.7, and the mean AUC of the SVM models was higher than that of RF.
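k-NN-based imputation as described above (here via scikit-learn's KNNImputer rather than the k-NN classifier the study used, with n_neighbors=10 mirroring their k = 10) can be sketched as:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X[5, 2] = np.nan    # introduce missing entries
X[17, 0] = np.nan

# Each gap is filled from the feature values of the 10 nearest rows
X_filled = KNNImputer(n_neighbors=10).fit_transform(X)
print(np.isnan(X_filled).any())  # False
```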

Alam et al. [ 19 ] studied diabetes-induced nephropathy and cardiovascular disease by building several machine learning models. The dataset used in this paper resulted from a study conducted at the Tokyo Women’s Medical University Hospital and 69 collaborating institutions in Japan, and consists of 779 type 2 diabetes mellitus (T2DM) patients. SMOTE was used to mitigate the class imbalance problem. Logistic regression, SVM, naïve Bayes, decision tree, and random forest were used in a supervised setting; RF produced the best results for predicting nephropathy, with an AUC of 0.87.
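Unlike random oversampling, SMOTE creates *synthetic* minority samples by interpolating between a minority point and one of its nearest minority neighbors. The following is a deliberately simplified sketch of that idea; real studies typically use the imbalanced-learn library's SMOTE implementation:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=4):
    """Interpolate new minority samples between existing points and their neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]   # k nearest minority neighbors (excluding self)
        j = rng.choice(nn)
        lam = rng.random()            # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(0).normal(size=(10, 2))  # minority-class points
X_syn = smote_like(X_min, n_new=15)
print(X_syn.shape)  # (15, 2)
```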

From the previous literature, it can be noticed that the general research trend is to predict the presence of type 2 diabetes in patients, whereas predicting diabetes complications has received less attention. The number of complications studied in most of the available literature is very limited, rarely exceeding two or three. There is also a clear limitation in the number and nature of the features used in each study; for instance, the number of available medical tests in [ 18 ] is very limited.

Accordingly, one of the objectives of this research was to achieve reliable and improved results in predicting complications in diabetic patients using various state-of-the-art machine learning algorithms on a comprehensive UAE-based dataset. An extensive set of experiments was conducted, testing several data imputation methods and balancing techniques, as well as studying the effect of applying feature selection to the dataset.
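Selecting the top five and top ten features per complication (as stated in the abstract) can be illustrated with a univariate filter such as scikit-learn's SelectKBest; this sketch uses synthetic data with 79 features to mirror the RCDR attribute count, not the study's actual selection method:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 79 features mirrors the RCDR attribute count; the data here are synthetic
X, y = make_classification(n_samples=300, n_features=79, n_informative=10,
                           random_state=5)

shapes = {}
for k in (5, 10):
    shapes[k] = SelectKBest(f_classif, k=k).fit_transform(X, y).shape
print(shapes)  # {5: (300, 5), 10: (300, 10)}
```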

3. Materials and Methods

This section elaborates the methodology followed in this work ( https://github.com/Yazanjian/Diabetes-Complications-Prediction , accessed on 23 November 2021). Several essential preprocessing steps are discussed along with the machine learning algorithms used. Next, the training process is discussed in detail. Finally, this section presents the evaluation metrics used to assess the learned models’ performance. Figure 1 depicts the workflow of this study.


The developed workflow for diabetes complications prediction.

3.1. The Dataset

Utilizing an adequate dataset plays a significant role in any ML problem. In this research, the dataset was collected from the Rashid Centre for Diabetes and Research (RCDR) located in Ajman, UAE [ 20 ]. The selection criteria were as follows: all the patients included in this study had already been diagnosed with diabetes and any of its complications under study. Moreover, the dataset mainly consists of medical records reported by RCDR.

The dataset consists of 884 patients with 79 input attributes and eight output classes (complications). The input attributes are distributed as follows: 73 numerical attributes and six nominal attributes. The 73 numerical attributes comprise 64 medical tests plus attributes including age, gender, BMI, HbA1c, vitamin D, blood pressure, and diabetes type. The output (target) attributes are the eight main complications, i.e., metabolic syndrome, dyslipidemia, neuropathy, nephropathy, diabetic foot, hypertension, obesity, and retinopathy.

A brief description of these complications is provided below.

Hypertension: according to WHO [ 21 ], hypertension—or elevated blood pressure—is a serious medical condition that significantly increases the risks for heart, brain, kidneys and of other diseases. Hypertension occurs when blood pressure is too high.

Obesity: overweight and obesity are defined as abnormal or excessive fat accumulation that may impair health [ 22 ]. For adults, WHO defines obesity as having a BMI greater than or equal to 30.

Dyslipidemia is defined as having a high plasma triglyceride concentration, low high-density lipoprotein cholesterol (HDL-C) concentration, and increased concentration of low-density lipoprotein cholesterol (LDL-C) [ 23 ].

Metabolic syndrome is a cluster of metabolic disorders. For example, high blood pressure alone is a serious condition, but when a patient has high blood pressure along with high fasting glucose levels and abdominal obesity, this patient may be diagnosed with metabolic syndrome [ 24 ].

Diabetic foot is defined as the foot of diabetic patients with ulceration, infection, and/or destruction of the deep tissues, associated with neurological abnormalities and various degrees of peripheral vascular disease in the lower limb [ 25 ].

Neuropathy: nerve damage from diabetes is called diabetic neuropathy. According to CDC [ 26 ], high blood sugar can lead to this nerve damage.

Nephropathy is a disease of the kidneys caused by damage to small blood vessels or to the units in the kidneys that clean the blood. People who have had diabetes for a long time may develop nephropathy [ 27 ].

Retinopathy is any damage to the retina of the eyes, which may cause vision impairment. Diabetic retinopathy (DR) occurs when high blood sugar damages the blood vessels below the retina [ 28 ].

3.2. Preprocessing

The given dataset presents issues that require several preprocessing steps that are critical to properly train the machine learning models and fine-tune their performance.

3.2.1. Data Cleaning

The first step in processing the dataset is cleaning it by removing unnecessary records and attributes following a systematic procedure. Firstly, the dataset contains several identifying attributes that had to be deleted for confidentiality purposes, i.e., hospital number, episode date, and episode description. Furthermore, the diabetes type is missing for some patients, which is critical information in this research since we studied diabetes complications in diabetic patients. Therefore, all 26 instances suffering from this problem were removed.

Another necessary step in this study is checking the total number of missing values per record (or patient). By testing different thresholds with all the classifiers, it was found that removing all records with more than 60% missing values achieved better performance than experiments in which this problem was ignored.

Following the approach in [ 19 ], the missing values were also investigated per attribute. Based on several experiments, a threshold of 40% was set for this step, meaning that any attribute with 40% or more missing values should be dropped from the dataset. Since this dataset has a large number of numerical attributes, it was found that 16 numerical attributes had more than 40% missing values; more precisely, most of these attributes had more than 90% missing values. This threshold was selected experimentally and influenced by the literature [ 19 ].
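The two cleaning thresholds above (drop records with &gt;60% missing values, then attributes with ≥40% missing) can be sketched with pandas; the synthetic frame and column names below are illustrative assumptions, not the RCDR data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(100, 8)),
                  columns=[f"test_{i}" for i in range(8)])
df.iloc[:95, 0] = np.nan   # one attribute that is >90% missing
df.iloc[0, 1:] = np.nan    # one patient with almost all values missing

# Drop records with more than 60% missing values ...
row_missing = df.isna().mean(axis=1)
df = df[row_missing <= 0.60]

# ... then drop attributes with 40% or more missing values.
col_missing = df.isna().mean(axis=0)
df = df.loc[:, col_missing < 0.40]
```

Applying the record filter first matters: a mostly-empty patient record inflates the per-attribute missing rates, so removing it before the column pass avoids dropping attributes unnecessarily.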

3.2.2. Data Imputation

Handling missing values is essential in training classifiers since most of the available machine learning algorithms cannot be utilized with missing data. For the categorical values available in our dataset, such issues occur only with the nationality attribute. The most frequent value in that column (United Arab Emirates) was thus used to fill the missing values.

On the other hand, three different methods were extensively tested and evaluated to solve the missing value problem in numerical attributes. The first is mean substitution [ 9 ], a statistical approach that fills any missing value in an attribute (feature) with the average of the observed values for that attribute across the other records (patients). One possible drawback of mean substitution is that it may lead to biased results that do not reflect reality.

Another way to fill the missing values is to use a k -NN model [ 29 ]. The goal of the k -NN imputer is to find the nearest neighbors of the record with the missing value based on a predefined distance metric. Each missing feature is then imputed by uniformly averaging the values of that feature over the k nearest neighbors that have it. If a sample has more than one feature missing, the neighbors for that sample can differ depending on the particular feature being imputed [ 30 ]. Following the approach in [ 18 ], the number of neighbors selected for this model is k = 10.
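The k-NN imputation step can be sketched with scikit-learn's `KNNImputer`; the paper does not name its implementation, so the library, the toy matrix, and k = 2 (the study used k = 10 on a much larger dataset) are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries (np.nan); rows are patients, columns tests.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the uniform average of that feature
# over the k nearest neighbours that have a value for it.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
```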

The third and last method used in this research was MissForest [ 17 ], which imputes missing values using random forests in an iterative fashion. The algorithm first selects the attribute with the fewest missing values (the candidate column) and fills its missing values with the column mean. The candidate column is then treated as the output of a random forest model, with the other columns in the dataset as inputs. After training, the missing rows of the candidate column are imputed using the predictions of the fitted random forest. These steps are repeated to cover all the remaining columns in the dataset.
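A MissForest-style imputation can be approximated with scikit-learn's `IterativeImputer` driven by a random forest regressor; this is a sketch of the idea rather than the exact MissForest implementation the paper used, and the synthetic data below is an assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
X[rng.rand(50, 4) < 0.1] = np.nan  # knock out ~10% of the entries

# Iteratively model each column with missing values as the output of a
# random forest fitted on the remaining columns, as MissForest does.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
```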

To evaluate and compare the performance of the three methods, RMSE was calculated as per Equation (1) as follows. The first step was to simulate the missing value problem by choosing a complete subset of the dataset with no missing values; this subset contained 217 records. The missing value percentage in the original dataset was then calculated and used to drop random values from each column of the complete subset. More precisely, the percentage found was 4.4%, resulting in dropping nine values per column. After building this artificial dataset, the three methods were used to impute the missing values. As shown in Table 1 , MissForest [ 17 ] yielded the minimum RMSE, followed by the k -NN [ 29 ] and mean methods. Note that Table 1 reports the RMSE for some randomly selected attributes as well as the total RMSE over all columns.

RMSE results for each imputation method.
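The evaluation procedure above, masking known values and scoring the imputer on them, can be sketched as follows; the synthetic "complete subset" and the use of mean substitution as the scored imputer are illustrative assumptions.

```python
import numpy as np

def imputation_rmse(X_true, X_imputed, mask):
    """RMSE restricted to the artificially removed entries (mask == True)."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.RandomState(1)
X_true = rng.normal(loc=10.0, scale=2.0, size=(217, 5))

# Drop ~4.4% of the entries at random, mirroring the paper's procedure.
mask = rng.rand(*X_true.shape) < 0.044
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Mean substitution as the simplest baseline imputer to score.
col_means = np.nanmean(X_missing, axis=0)
X_mean = np.where(np.isnan(X_missing), col_means, X_missing)

rmse_mean = imputation_rmse(X_true, X_mean, mask)
```

The same `imputation_rmse` call would score the k-NN and MissForest outputs, allowing a like-for-like comparison as in Table 1.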

In addition to calculating the RMSE values, a visual inspection was performed on the dataset imputed by MissForest. Figure 2 shows an example of the generated values for the albumin test using MissForest. The values in blue represent the values found originally in the dataset, where the orange points show the calculated missing values. It can be noticed that the generated new values are reasonable since they seem to follow the same trend found in the data.


Original and generated albumin test values using MissForest.

3.2.3. Categorical Encoding

Another necessary step is encoding the categorical features in the dataset: gender, nationality name, and diabetes type. Encoding is required because most ML algorithms expect numerical data and cannot handle categorical values directly. For this purpose, one-hot encoding [ 31 ] was used, which creates a “dummy” variable for each possible category of each nonnumeric feature. Table 2 shows some categorical attributes after applying one-hot encoding.

Categorical data after applying one-hot encoding.
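One-hot encoding of the three categorical attributes can be sketched with pandas; the toy values below are illustrative, not rows from the RCDR dataset.

```python
import pandas as pd

# Toy frame with the three categorical attributes named in the text.
df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "nationality_name": ["United Arab Emirates", "Egypt",
                         "United Arab Emirates"],
    "diabetes_type": ["Type 2", "Type 1", "Type 2"],
})

# One dummy (0/1) column per category of each nonnumeric feature.
encoded = pd.get_dummies(
    df, columns=["gender", "nationality_name", "diabetes_type"])
```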

3.2.4. Data Balancing

One of the common challenges when building and training machine learning and data mining models is dealing with imbalanced datasets, and this problem exists in our dataset. Figure 3 depicts the class distributions for each complication. Most of the complications are imbalanced; more precisely, the neuropathy, nephropathy, retinopathy, and diabetic foot attributes all have severely imbalanced distributions. For instance, diabetic foot occurs in only 2.5% of the records. This issue needs to be addressed with an effective balancing method: one solution is to reduce the number of instances in the majority class (undersampling) [ 32 ]; another is to increase the number of instances in the minority class (oversampling) [ 33 ].


Class distributions for each complication.

Several strategies can be followed to perform undersampling on a dataset, each with advantages and disadvantages. The first approach tested in this research was randomly reducing the number of instances of the majority class, i.e., removing samples from the most frequent class based on a given percentage. Despite its simplicity, removing random samples can delete valuable information preserved in the majority class. To overcome this limitation, the cluster centroids method was applied [ 32 ]. It undersamples the majority class by replacing clusters of majority samples with the cluster centroids of a k-means algorithm: it keeps N majority samples by fitting k-means with N clusters to the majority class and using the coordinates of the N cluster centroids as the new majority samples. In addition to experimenting with both methods, a visual inspection of the cluster centroids method was conducted. Figure 4 shows an example of how information in the majority class is preserved: most of the dropped datapoints belong to clusters that still have other instances after undersampling.


Applying undersampling using cluster centroids. ( a ) Represents the datapoints before applying cluster centroids whereas ( b ) represents the final result of performing undersampling on the dataset.

Similarly, several methods can be applied to achieve oversampling. One option is to duplicate a specific percentage of the minority class; despite its simplicity, duplicates generally do not help the model learn new information. A better approach is the synthetic minority oversampling technique (SMOTE) [ 33 ]. SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. A synthetic instance is then created by choosing one of those neighbors b at random and connecting a and b to form a line segment in the feature space; the synthetic instances are generated as convex combinations of the two chosen instances a and b.

After experimenting with all the previously mentioned balancing methods, a combination of both SMOTE and cluster centroids was used for the final output. Figure 5 shows the final class distributions for all the complications. Since the severity of the imbalance problem varies between the complications, we treated each complication independently.


Class distributions for each complication after handling the imbalance problem.

3.2.5. Data Normalization

As mentioned earlier, most of the attributes available in our dataset are numerical. Moreover, some of these features were recorded with different measurement units. Dealing with such features without any normalization could affect the performance of the models. Therefore, normalization is necessary to rescale all numeric attributes into a range between 0 and 1. Equation (2) describes the normalization formula, where Value is the value needed to be normalized, Max is the maximum value in the column, and Min is the minimum value in the column.
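The min-max rescaling of Equation (2) can be sketched directly; the sample values below are arbitrary.

```python
import numpy as np

def min_max_normalize(column):
    """Equation (2): (Value - Min) / (Max - Min), rescaling to [0, 1]."""
    col = np.asarray(column, dtype=float)
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo)

scaled = min_max_normalize([126.0, 200.0, 88.0, 154.0])
```

Note that the minimum and maximum must be computed on the training folds only and reused on the test fold, otherwise information leaks across the cross-validation split.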

3.3. Machine Learning Models

Several ML models were trained to classify the eight complications, namely, logistic regression, SVM, decision tree (CART), random forest, AdaBoost, and XGBoost. These algorithms were selected considering multiple factors, such as the simplicity of logistic regression classifiers. LR sometimes performs surprisingly better than more complicated algorithms, which makes it attractive to apply to this dataset. Equation (3) gives the general formula of LR, where p ( X ) is the dependent variable, X is the independent variable, β 0 is the intercept, and β 1 is the slope coefficient. The algorithm calculates the probability of the target class using a simple yet effective linear equation: from an intercept and slope coefficients for the features in the dataset, the probability is computed [ 19 ]. Although the assumption of linearity between the dependent and independent variables may not hold in all cases, the simplicity and proven effectiveness of logistic regression make it attractive to test in this study [ 11 , 16 ].
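The probability in Equation (3) can be computed directly from the intercept and slope; the numeric values below are arbitrary illustrations.

```python
import math

def logistic_probability(x, beta0, beta1):
    """Equation (3): p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

p = logistic_probability(0.0, 0.0, 1.0)  # z = 0 gives a probability of 0.5
```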

The second algorithm used is SVM. It is a supervised algorithm used for both regression and classification problems. The objective of the SVM classifier is to find a hyperplane in an N-dimensional space (N—the number of features) that distinctly classifies the datapoints by deciding on which side they fall around the plane [ 34 ].

The third algorithm used is the CART decision tree, where classification relies on nodes and branches leading from the top (root node) down to the leaves (decisions). The root of the tree represents the feature used to split the dataset first. This model can enhance the understanding of each diabetes complication, since visualizing the tree gives clear and easy-to-follow information. However, decision trees can be prone to overfitting and instability, since adding a new attribute may result in a totally different tree (variance). These challenges can be addressed by tuning hyperparameters such as the depth of the tree or the number of samples allowed per branch [ 13 ]. The criterion for selecting the attribute to split the data at each node depends on two measurements: entropy and information gain. Entropy is a measure of disorder or uncertainty, and the goal of machine learning models in general is to reduce uncertainty. Information gain, in turn, is calculated by comparing the entropy of the dataset before and after a split [ 14 ]. Equations (4) and (5) can be used to calculate entropy and information gain, respectively.

where S —the current dataset for which entropy is being calculated, I —the set of classes in S , p i —the proportion of the number of elements in class i to the number of elements in set S .

where E ( S ) —entropy of set S, T —the subsets created from splitting set S by attribute A such that S   = ∪ t ∈ T t , p t —the proportion of the number of elements in t to the number of elements in set S , E ( t ) —entropy of subset t .
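Equations (4) and (5) can be sketched as two small functions; the toy labels and split below are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Equation (4): E(S) = -sum_i p_i * log2(p_i) over the classes in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Equation (5): IG(S, A) = E(S) - sum_t (|t| / |S|) * E(t)."""
    n = len(labels)
    return entropy(labels) - sum((len(t) / n) * entropy(t) for t in subsets)

# Splitting [1,1,0,0] into pure halves removes all uncertainty: 1 bit gained.
gain = information_gain([1, 1, 0, 0], [[1, 1], [0, 0]])
```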

In addition to the three algorithms explained above, three ensemble algorithms were trained and evaluated in this study. Ensemble approaches which use multiple learning algorithms such as decision trees have proven to be an effective way of improving classification accuracy [ 35 ]. On the one hand, bagging methods such as the random forest (RF) algorithm [ 36 ] apply the principle of majority voting to the results from several decision trees. On the other hand, boosting algorithms such as AdaBoost [ 37 ] and XGBoost [ 38 ] are built sequentially by minimizing the errors from previous models while increasing or boosting the influence of high-performance models [ 35 ].

3.4. Model Training

After processing the dataset and selecting the machine learning algorithms to be used, the next step was to build the actual models by training each algorithm using the processed dataset. Extensive experiments were conducted both to train and fine-tune the models. The rest of this section will discuss the detailed steps.

3.4.1. Cross-Validation

The k-fold cross-validation (KCV) technique is one of the most widely used approaches to select a classifier and evaluate its performance [ 39 ]. Figure 6 gives a detailed pictorial presentation of the data splitting used in this technique (with tenfold cross-validation). The dataset was split into K folds. The K − 1 folds were used to train and fine-tune the hyperparameters in the inner loop, where the grid search algorithm [ 40 ] was employed. In the outer loop, the best hyperparameters and the test data were used to evaluate the model. Since the dataset contains imbalanced records, stratified KCV [ 41 ] was used to preserve the class proportions of the original dataset in every fold. Moreover, for a better evaluation, this process was repeated 10 times. The final performance metric was estimated using Equation (6), where M is the final performance metric for the classifier and P n ∈ R, n = 1, 2, …, K, is the performance metric for each fold.
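The nested loop above, grid search inside, stratified KCV outside, can be sketched with scikit-learn; the synthetic dataset, the random forest estimator, and the small parameter grid are illustrative assumptions (the study tuned each of its six algorithms this way, with 10 repetitions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Inner loop: grid search over hyperparameters; outer loop: stratified KCV
# that preserves the class proportions in every fold.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=inner,
)
scores = cross_val_score(search, X, y, cv=outer)  # one score per outer fold
mean_score = scores.mean()                        # Equation (6): M = mean(P_n)
```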


The use of KCV for both hyperparameters tuning and training [ 9 ].

3.4.2. Feature Selection

After training the models using the best hyperparameters’ combinations found by grid search, feature selection techniques were applied to the dataset to select the top N features for each model. Feature selection played a significant role in this study because the dataset had more than 70 attributes and it was essential to reduce their number and improve the overall performance of the learning models. To this aim, each model built using the complete attribute set was utilized to calculate and select the top five and ten attributes that contributed most to the results. A performance comparison was then conducted to study the effect of utilizing all the features as well as utilizing the selected ones to build several ML classifiers.

In this research, the selection of the top features relied on utilizing the parameters and equations built in each model. As mentioned before, this study used two types of models. The first one was linear models, such as logistic regression and linear SVM, which rely on a linear equation to calculate the final output (or decision). For such estimators, the coefficients of the equations were used to determine the top features to select. The second type of models was tree-based models. For this type, feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node [ 42 ]. The node probability can be calculated by the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
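Selecting the top-N features from a tree-based model's impurity-based importances can be sketched as follows; the synthetic data and attribute names are illustrative assumptions (for LR or linear SVM, the absolute coefficients would be ranked instead).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
names = [f"attr_{i}" for i in range(X.shape[1])]

# Tree-based models expose impurity-based feature importances: the mean
# decrease in node impurity, weighted by the probability of reaching the node.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(model.feature_importances_)[::-1]
top5 = [names[i] for i in order[:5]]
```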

3.5. Evaluation Metrics

To test the performance of the built models, several evaluation metrics were utilized. The first metric used is classification accuracy, which is defined as the percentage of instances classified as their true class labels [ 16 ]. Although it is one of the most used evaluation metrics, it does not accurately describe the model performance in case of unbalanced datasets. Hence, it is important to use other techniques in this case. The accuracy of a classifier can be computed using Equation (7), where TP is true positive, TN is true negative, FP is false positive, FN is false negative.

Other interesting metrics to use are precision and recall. Precision is the percentage of instances that were classified as X and are actually X, whereas recall is defined as the percentage of instances that are actually X and were predicted as X by the classifier [ 16 ]. Equations (8) and (9) can be utilized to calculate the precision and recall, respectively.

The fourth metric is F1-score, which is the harmonic mean of precision and recall. Hence, F1-score is maximum at the value of 1 and minimum at the value of 0 [ 43 ]. Equation (10) can be used to calculate F1-score.

According to what was mentioned earlier, accuracy and F1-score are reported for all the conducted experiments.
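Equations (7)–(10) map directly onto the standard scikit-learn metric functions; the toy labels below are illustrative.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # Eq. (7): (TP+TN)/(TP+TN+FP+FN)
prec = precision_score(y_true, y_pred)  # Eq. (8): TP/(TP+FP)
rec = recall_score(y_true, y_pred)      # Eq. (9): TP/(TP+FN)
f1 = f1_score(y_true, y_pred)           # Eq. (10): harmonic mean of prec, rec
```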

Since the dataset has eight complications and a patient can suffer from multiple complications at the same time, we decided to build binary classifiers for each complication using all the algorithms mentioned before. Moreover, to assess the performance of the trained algorithms, baselines were constructed for comparison: simple classifiers whose job is to predict every instance as the majority class. The accuracy and F1-score were then calculated for these basic estimators. The performance of the baseline classifiers is reported in Table 3 and discussed further in the next section.

Baseline model performance.
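A majority-class baseline like the one described above can be sketched with scikit-learn's `DummyClassifier`; the 95/5 toy labels are an assumption chosen to resemble a severely imbalanced complication.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 95% negative toy labels, similar in spirit to the diabetic foot class.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to a majority-class baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)            # high purely from the imbalance
f1 = f1_score(y, y_pred, zero_division=0)  # 0: no positives ever predicted
```

The contrast between the two scores is why accuracy alone is misleading on imbalanced data and why F1-score is reported alongside it.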

Table 4 shows the extensive experiments conducted. Since we used k-fold cross-validation (k = 5) for hyperparameters tuning as well as repeated k-fold cross-validation for model training (with k = 10) and a total of 10 repetitions, we conducted 10 × 5 × 10 = 500 experiments for each single model. For each complication, three main experiments were conducted applying all ML algorithms mentioned before. The first group of experiments evaluated the models using all the attributes available in the dataset, whereas the second and third experiments utilized only the top ten and top five attributes, respectively.

Summary of all experiments for the selection of the best-performing classifier for each diabetes complication.

1 Numbers in bold highlight the best classifiers.

5. Discussion

The reason for establishing baseline predictors was that the dataset in hand was used for the first time in this research, and there were no prior performance scores to compare against. Comparing the results in Table 3 with the best results achieved by the complication models shows that the final trained models clearly outperformed the basic classifiers.

Moreover, by comparing our results with the reported accuracy scores in [ 16 ], we can notice that our models achieved more than 10% improvement for predicting retinopathy, nephropathy, as well as neuropathy. Table 5 shows further comparisons between our proposed method and other available studies. The accuracy score was used for the comparison since it is the most utilized evaluation metric in the literature.

A comparison of recent works developed for predicting diabetes complications using machine learning.

From the results reported in Table 4 , it can also be observed that RF, AdaBoost, and XGBoost mainly achieved the best performance. This observation reinforces the value of tree-based ensemble algorithms for such problems; moreover, the use of “weak” classifiers to build up the final models helps boost the final performance. Linear models, especially logistic regression, also performed well for some complications, which indicates that the linearity assumption is indeed correct in some cases. Another observation from Table 4 is that, in most cases, the best results, whether using all attributes or only the top ten or top five, are produced by the same algorithm.

Looking at the performance of the best models in Table 4 , we can notice that acceptable results can still be achieved using only a small subset of the attributes. The performance achieved with the selected feature sets and with the full feature set was compared by calculating the mean and standard deviation of the difference between the accuracy scores. For example, the difference between the accuracy achieved using all the attributes and using only the top 10 attributes was 0.0332 ± 0.021, whereas using the top five attributes resulted in a difference of 0.06 ± 0.032. The performance degradation caused by feature selection was thus very small in most cases, which emphasizes its positive effect on this dataset. Furthermore, reducing the number of attributes by more than 60 features had the positive effect of reducing the training and prediction time needed.

In addition, we used the best model found for each complication to identify the dominant features that affect it. Based on this step, we found that total cholesterol, diabetes age, gender, BMI, and blood pressure are the most useful features to predict the complications; T2DM, weight, low-density lipoprotein (LDL), high-density lipoprotein (HDL), and microalbumin creatinine ratio were also found to be useful. This observation can help build more sophisticated models by giving more attention and weight to such features, and physicians can also benefit from this information by investigating possible relations between these features.

It is also important to study and compare the performance reached for each complication. For instance, one observation relates to the distribution of the output class: the distribution itself plays a significant role and can affect the overall performance. For a better investigation, Figure 7 shows the correlation matrix of all the target values in the dataset, where a value of 1 indicates high correlation and a value of 0 indicates no correlation at all. The qualitative and quantitative analysis in Figure 7 demonstrates the correlation among the first four targets (metabolic syndrome, dyslipidemia, hypertension, and obesity) as well as among the last four targets (neuropathy, nephropathy, diabetic foot, and retinopathy). The correlation within both sets helps explain the findings in Table 4 , where the evaluation metrics for each group using all the attributes are indeed adjacent.


The correlation matrix of the targets.

For a better understanding of the conducted tests, we measured the time for each experiment in Table 4 . Figure 8 shows the average total time needed to train a model using all the attributes, per algorithm. Although the training time for each algorithm varies slightly across the complications, the general observation is that the ensemble methods consume the most training time. This is because ensemble algorithms rely on building N smaller, weaker classifiers to produce the final output. Since the size of the data was relatively small, we neglected the time difference when selecting the best models.


The average time needed to train a model.

6. Conclusions

In this paper, data mining and machine learning algorithms were used to classify and predict eight different diabetes complications. The complications’ set consists of metabolic syndrome, dyslipidemia, hypertension, obesity, diabetic foot, neuropathy, nephropathy, and retinopathy. Furthermore, the dataset used consists of 884 records and 79 attributes. After cleaning the dataset, multiple experiments were conducted to solve the missing value problem. For that, simple mean imputation, k -NN as well as MissForest were all tested and evaluated. It was found that MissForest achieved the minimum RMSE score. As a result, it was utilized throughout the rest of this research. After handling all the missing values, one-hot encoding was applied to the categorical attributes such as nationality name, gender, and diabetes type.

Since the dataset on hand suffered from data imbalance issues, different balancing methods were examined, and a combination of SMOTE for oversampling the minority class and cluster centroids for undersampling the majority class was used. The set of algorithms constructed for this study included logistic regression, SVM, decision tree (CART), random forest, AdaBoost, and XGBoost. Extensive experiments were carried out for model tuning and training. Grid search with cross-validation was employed to select the best hyperparameters for each model. Moreover, k -fold cross-validation (KCV) with k = 10 was used to split the data into training and testing sets. Since the data had imbalanced classes, stratified cross-validation was applied, and to ensure reliable results, the CV process was repeated 10 times.

Along with using all the attributes to build the models, feature selection was applied to the dataset to select the top ten and five features. The models built using the reduced datasets achieved a comparable performance with the models that utilized all the attributes. Moreover, we utilized this step further for a better understanding of the most dominant features that affect the models’ predictions. Based on our analysis, we observed that total cholesterol, diabetes age, gender, BMI, and blood pressure are the most useful features to predict the complications. Moreover, T2DM, weight, low-density lipoprotein (LDL), high-density lipoprotein (HDL), and microalbumin creatinine ratio were also found to be useful.

Acknowledgments

The work in this paper was supported, in part, by the Open Access Program from the American University of Sharjah. This paper represents the opinions of the authors and does not mean to represent the position or opinions of the American University of Sharjah.

Abbreviations

The following abbreviations are used in this manuscript:

Author Contributions

Conceptualization, M.P., A.S. and F.A.; data curation, Y.J.; investigation, Y.J. and A.S.; methodology, Y.J.; project administration, M.P., A.S. and F.A.; resources, Y.J.; software, Y.J.; supervision, M.P., A.S. and F.A.; validation, Y.J.; writing—original draft preparation, Y.J.; writing—review and editing, M.P., A.S. and F.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
