SmartstatXL offers various types of regression analyses to model the relationship between independent and dependent variables. One such analysis that can be performed with SmartstatXL is Logit and Probit Regression.
Logit regression, better known as logistic regression, and the closely related probit regression are suitable methods when the dependent variable is dichotomous or binary. Logistic regression allows us to predict and explain the relationship between a binary dependent variable (one with two possible outcomes) and one or more independent variables, whether nominal, ordinal, interval, or ratio; probit regression differs only in the link function it uses (the standard normal CDF instead of the logistic CDF). Some examples of binary dependent variables include: Yes or No, Pass or Fail, Spam or Not, and 0 or 1.
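As a rough sketch of the difference between the two link functions (illustrative Python, not SmartstatXL's internal code):

```python
import math

def logistic_cdf(z):
    """Logit link: P(Y = 1) is the logistic CDF of the linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Probit link: P(Y = 1) is the standard normal CDF of the linear predictor z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both links map any linear predictor to a probability in (0, 1) and agree
# at z = 0; they differ mainly in how quickly the tails approach 0 and 1.
```

For example, both links give 0.5 at z = 0, but at z = 2 the logit link gives about 0.88 while the probit link gives about 0.98. This is why probit coefficients fitted to the same data are smaller in magnitude than logit coefficients (roughly by a factor of 1.6 to 1.8), even though the two models usually classify cases almost identically.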
Key features of Logit and Probit regression analysis with SmartstatXL include:
- Regression Diagnostics:
- Outlier data information.
- Ability to encode outcomes in numerical form (0, 1) or text (Yes, No; Y, N; Success, Fail, etc.).
- Event response adjustment for outcomes.
- Output includes:
- Regression Equation.
- Regression Statistics/Goodness-of-Fit: R², Cox-Snell R², Nagelkerke R², AIC, AICc, BIC, Log Likelihood.
- Regression Coefficient Estimates: Coefficient Value, Standard Error, Wald Stat, p-value, Lower/Upper Confidence Limits, VIF.
- Deviance Analysis Table.
- Confusion Matrix (Classification Table and Metrics).
- Graphs:
- Logistic/Probit Regression Curve.
- ROC Curve.
- Performance Metrics.
- Outcome vs. Prediction.
Case Example
Pima Indians Diabetes Database
This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of this dataset is to predictively diagnose whether a patient has diabetes or not, based on several diagnostic measurements included in the dataset. Some limitations were placed on the selection of these samples from a larger database. All patients here are women who are at least 21 years old and of Pima heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome.
- Pregnancies: Number of times pregnant
- Glucose: 2-hour plasma glucose concentration in an oral glucose tolerance test
- Blood Pressure: Diastolic blood pressure (mm Hg)
- Skin Thickness: Triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (μU/mL)
- BMI: Body Mass Index (weight in kg/(height in m)^2)
- Diabetes Pedigree Function: A score of diabetes likelihood based on family history
- Age: Age (years)
- Outcome: Class variable (0 or 1). 1 means the person is diabetic and 0 means the person is not.

Source: Pima Indians Diabetes Database
Steps for Logit and Probit Regression Analysis
- Activate the worksheet (Sheet) to be analyzed.
- Place the cursor on the dataset (for creating a dataset, see Data Preparation methods).
- If the active cell is not on the dataset, SmartstatXL will automatically attempt to identify the dataset.
- Activate the SmartstatXL Tab
- Click the Menu Regression > Logistic/Probit Regression.

- SmartstatXL will display a dialog box to confirm whether the dataset is correct or not (usually, the dataset is automatically selected correctly).

- If it's correct, click the Next Button
- A Regression Analysis Dialog Box will appear. Select the Predictor Variable(s) (Independent) and one or more Response Variables (Dependent).

- Press the "Next" button
- Select the regression output as shown in the following display:

The category that serves as the reference can be either the first category (0) or the last category (1), since there are only two levels. The reference category can also be selected directly from the outcome levels. In this example, the Response Event is set to 1.
- Press the OK button to generate the output in the Output Sheet.
Analysis Results
Logit Regression
Analysis Information: type of regression used, regression method, response, and predictors

Regression Equation
In the logistic regression analysis conducted, the response variable used is "Outcome," which indicates whether someone has diabetes or not. Meanwhile, there are eight predictor variables used in the model: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age.
From the analysis results, the generated regression equation (on the log-odds scale) is:
logit(p) = ln(p / (1 − p)) = -8.4047 + 0.1232 × Pregnancies + 0.0352 × Glucose - 0.0133 × Blood Pressure + 0.0006 × Skin Thickness - 0.0012 × Insulin + 0.0897 × BMI + 0.9452 × Diabetes Pedigree Function + 0.0149 × Age
From the equation above, we can draw several interpretations:
- Intercept (-8.4047): This is the log-odds of the Outcome when all predictor variables are zero. In practical contexts, the interpretation of the intercept is often not relevant because it's unlikely that all predictor variables would be zero.
- Pregnancies (0.1232): For each one-unit increase in the number of pregnancies, the log-odds of the Outcome (having diabetes) will increase by 0.1232, holding other variables constant.
- Glucose (0.0352): Each one-unit increase in glucose concentration will increase the log-odds of the Outcome by 0.0352, holding other variables constant.
- ... and so on for the other variables.
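The fitted equation above can be turned into a predicted probability by applying the inverse logit. The sketch below uses the reported coefficients with a purely hypothetical patient profile (the patient values are invented for illustration only):

```python
import math

# Coefficients from the fitted equation above
coef = {
    "Intercept": -8.4047, "Pregnancies": 0.1232, "Glucose": 0.0352,
    "BloodPressure": -0.0133, "SkinThickness": 0.0006, "Insulin": -0.0012,
    "BMI": 0.0897, "DiabetesPedigreeFunction": 0.9452, "Age": 0.0149,
}

def predict_probability(x):
    """Convert the linear predictor (log-odds) to P(Outcome = 1)."""
    z = coef["Intercept"] + sum(coef[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical patient, for illustration only
patient = {
    "Pregnancies": 2, "Glucose": 140, "BloodPressure": 70, "SkinThickness": 20,
    "Insulin": 80, "BMI": 32, "DiabetesPedigreeFunction": 0.5, "Age": 35,
}
p = predict_probability(patient)

# One extra pregnancy multiplies the odds of diabetes by exp(0.1232), about 1.13
odds_ratio_pregnancies = math.exp(coef["Pregnancies"])
```

Exponentiating a coefficient converts it from a change in log-odds to an odds ratio, which is often easier to communicate: exp(0.1232) ≈ 1.13 means each additional pregnancy multiplies the odds of diabetes by about 1.13, holding the other variables constant.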
An R² value of 0.272 (McFadden's pseudo-R²) indicates that about 27.2% of the variability in the Outcome can be explained by the predictor variables in this model. While this may seem low, pseudo-R² values in logistic regression are typically smaller than the R² values seen in linear regression and should not be judged by the same standards.
A Chi-Squared value of 270.039 with a significance of 0.00 indicates that the overall model is significant, meaning at least one of the predictors has a significant effect on the Outcome.
In conclusion, this logistic regression model shows a significant relationship between the predictor variables and the Outcome (having or not having diabetes). However, further analysis should be conducted to determine the significance of each predictor variable and to consider other factors that may affect the model.
Model Goodness of Fit
Regression goodness-of-fit statistics and coefficient estimates

Interpretation and Discussion on Regression Goodness of Fit:
- R² (0.2718): The R² value reported here is McFadden's pseudo-R², which depicts how well the independent variables in the model predict the dependent variable. A value of 0.2718 indicates that approximately 27.18% of the variability in the Outcome can be explained by the predictor variables in the model.
- Cox-Snell R² (0.2964): Cox-Snell R² is one of several versions of R² designed for logistic regression. This value indicates that around 29.64% of the variability in the Outcome can be explained by the model.
- Nagelkerke R² (0.4085): Nagelkerke R² is a normalized modification of Cox-Snell R² that ranges between 0 and 1. A Nagelkerke R² of 0.4085 indicates that around 40.85% of the variability in the Outcome can be explained by the model. This value is often considered a more interpretable measure of goodness of fit than Cox-Snell R².
- AIC (741.4454) and AICc (741.6828): AIC (Akaike Information Criterion) and AICc (Corrected Akaike Information Criterion) are goodness-of-fit measures that consider model complexity. Lower values indicate a better model. When comparing multiple models, the model with the lowest AIC or AICc is preferred.
- BIC (783.2395): BIC (Bayesian Information Criterion) is also a goodness-of-fit measure that considers the number of parameters in the model and the sample size. Like AIC, a lower BIC value indicates a better model.
- Log Likelihood (-361.7227): This measures the likelihood of the model, i.e., how well it predicts the observed data. When comparing models, the one with the higher log likelihood (closer to zero when negative) is considered better.
Conclusion: When assessing the goodness of fit of a logistic regression model, it's important to consider various statistical measures. In this case, although the R² might seem low, the Nagelkerke R² indicates relatively better goodness of fit. Additionally, AIC, AICc, and BIC can be used to compare this model with other alternative models to determine which is most appropriate for the data. Log likelihood also provides additional information on how well the model predicts the given data.
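For readers who want to verify the figures, all of the goodness-of-fit statistics above can be reproduced from the log likelihood alone. The sketch below assumes k = 9 estimated parameters (the intercept plus eight slopes) and derives the null log likelihood from the reported chi-squared value of 270.039:

```python
import math

# Values reported in the output above
ll_model = -361.7227   # log likelihood of the fitted model
n, k = 768, 9          # observations; estimated parameters (intercept + 8 slopes)
chi2 = 270.039         # regression deviance = 2 * (ll_model - ll_null)

ll_null = ll_model - chi2 / 2.0   # log likelihood of the intercept-only model

mcfadden   = 1.0 - ll_model / ll_null
cox_snell  = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
nagelkerke = cox_snell / (1.0 - math.exp((2.0 / n) * ll_null))

aic  = -2.0 * ll_model + 2.0 * k
aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)
bic  = -2.0 * ll_model + k * math.log(n)
```

Running these formulas reproduces the reported values (0.2718, 0.2964, 0.4085, 741.4454, 741.6828, 783.2395) to rounding precision, which also confirms that the R² in the output is McFadden's pseudo-R².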
Regression Coefficient Estimates

Interpretation and Discussion on Regression Coefficient Estimates:
- Intercept (-8.405):
- Coefficient: The intercept of the model is -8.405. This is the log-odds of the Outcome when all predictor variables are zero. Although the interpretation of the intercept may not be relevant in many cases, in this context it indicates the baseline log-odds before considering other variables.
- Wald Statistic: A Wald Statistic value of 137.546 with a p-value of 0.000 indicates that this intercept is significant at the 1% significance level.
- Pregnancies (0.123):
- Coefficient: For each one-unit increase in the number of pregnancies, the log-odds of the Outcome (having diabetes) will increase by 0.123, holding other variables constant.
- Wald Statistic: With a Wald Statistic value of 14.747 and a p-value of 0.000, this variable is significant at the 1% significance level.
- 95% Confidence Interval: The coefficient for pregnancies ranges between 0.060 and 0.186.
- VIF (1.431): The VIF (Variance Inflation Factor) value indicates no multicollinearity issues, as it is far below the general threshold of 10.
- Glucose (0.035):
- Coefficient: Each one-unit increase in glucose concentration will increase the log-odds of the Outcome by 0.035.
- Wald Statistic: With a Wald Statistic value of 89.897 and a p-value of 0.000, this variable is significant at the 1% significance level.
- 95% Confidence Interval: The coefficient for glucose ranges between 0.028 and 0.042.
- VIF (1.299): No indication of multicollinearity.
- ... and so on for the other variables.
Conclusion:
Based on the analysis results, several variables such as Pregnancies, Glucose, Blood Pressure, BMI, and Diabetes Pedigree Function show statistical significance towards the Outcome. On the other hand, variables like Skin Thickness, Insulin, and Age are not significant in this model.
Variables with two asterisks (**) are significant at the 1% level, those with a single asterisk (*) at the 5% level, and variables marked "ns" (not significant) have no significant effect in the model.
It's important to note that the VIF for all variables is below the common threshold of 10, indicating that there are no significant multicollinearity issues in the model.
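As a quick consistency check, the Wald statistic for a coefficient can be approximated from its reported 95% confidence interval, assuming a Wald-type interval of coefficient ± 1.96 × SE. For Pregnancies this recovers roughly the reported 14.747 (the small gap comes from rounding in the table):

```python
coef = 0.1232                      # Pregnancies coefficient from the table
ci_low, ci_high = 0.060, 0.186     # reported 95% confidence interval

# Invert coef +/- 1.96 * SE to recover the standard error
se = (ci_high - ci_low) / (2.0 * 1.96)
z = coef / se                      # Wald z-statistic
wald = z ** 2                      # Wald chi-square with 1 df
```

Since z ≈ 3.8 exceeds the 1% two-sided critical value of 2.576, this confirms the two-asterisk significance mark for Pregnancies.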
Deviance Analysis

Deviance analysis is used to assess the fit of the logistic regression model. This is similar to Analysis of Variance (ANOVA) in linear regression, but for logistic regression, deviance is used as a measure of the variance explained by the model.
- Regression:
- DF (Degrees of Freedom): There are 8 degrees of freedom, corresponding to the number of predictor variables in the model.
- Deviance: The deviance value for the regression component is 270.039. This measures how well the model with predictor variables predicts the data compared to a model without predictor variables (intercept-only).
- P-value: A P-value of 0.000 indicates that the model with predictor variables provides a significant explanation for the variability in the Outcome compared to a model without predictor variables. The predictor variables are collectively significant in explaining the Outcome.
- Chi.05 and Chi.01: The regression deviance value exceeds the Chi-squared threshold at both the 5% and 1% significance levels. This further confirms that the predictor variables are collectively significant.
- Error:
- DF (Degrees of Freedom): There are 759 degrees of freedom.
- Deviance: The deviance value for the error is 723.4454. This measures how far off the model's predictions are from the actual data when using predictor variables.
- Total:
- DF (Degrees of Freedom): There are 767 degrees of freedom.
- Deviance: The total deviance is 993.4839, which is the sum of the regression and error deviance.
Conclusion:
From the deviance analysis, it can be concluded that the predictor variables in the model provide a significant explanation for the variability in the Outcome. The deviance value for the regression, which is significant at the 1% level, indicates that the model with predictor variables is better at predicting the data compared to a model without predictor variables.
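The deviance partition above can be verified directly from the reported log likelihood. The chi-square critical values below are taken from standard tables for 8 degrees of freedom:

```python
ll_model = -361.7227                       # log likelihood of the fitted model

deviance_error = -2.0 * ll_model           # residual deviance, 759 df
deviance_total = 993.4839                  # intercept-only (total) deviance, 767 df
deviance_regression = deviance_total - deviance_error   # LR chi-square, 8 df

# Critical chi-square values for 8 df (standard tables)
chi2_crit_05, chi2_crit_01 = 15.507, 20.090
significant_at_1pct = deviance_regression > chi2_crit_01
```

The regression deviance of about 270 far exceeds both critical values, which is why the p-value is reported as 0.000.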
Classification Table (Confusion Matrix)

The classification table, also known as the confusion matrix, is a tool used to assess the performance of a classification model. In this context, the table shows how well the logistic regression model predicts the Outcome (i.e., having diabetes or not).
- Actual vs. Prediction:
- Actual 0, Prediction 0: A total of 445 cases were actually non-diabetic (Actual 0), and the model also predicted them as non-diabetic (Prediction 0).
- Actual 0, Prediction 1: A total of 55 cases were actually non-diabetic (Actual 0) but the model incorrectly predicted them as diabetic (Prediction 1).
- Actual 1, Prediction 0: A total of 112 cases were actually diabetic (Actual 1) but the model incorrectly predicted them as non-diabetic (Prediction 0).
- Actual 1, Prediction 1: A total of 156 cases were actually diabetic (Actual 1), and the model also predicted them as diabetic (Prediction 1).
- Classification Accuracy:
- For Actual 0: Out of a total of 500 cases that were actually non-diabetic, the model correctly predicted 445 cases, with an accuracy of 89.00%.
- For Actual 1: Out of a total of 268 cases that were actually diabetic, the model correctly predicted 156 cases, with an accuracy of 58.21%.
- Total: Overall, the model correctly predicted 78.26% of all cases.
Conclusion:
This logistic regression model has a fairly good level of accuracy, with 78.26% of all cases being correctly classified. However, there is a difference in accuracy between the group that actually has diabetes (58.21%) and the group that does not (89.00%). This suggests that the model may be more likely to correctly predict individuals who are non-diabetic compared to those who are diabetic. It is important to consider other metrics such as sensitivity, specificity, positive predictive value, and negative predictive value to gain a more comprehensive understanding of the model's performance.
Performance Evaluation Metrics for Diabetes Prediction Model

Interpretation and Discussion of Other Classification Metrics:
- Accuracy (0.783): This is the proportion of correct predictions out of the total cases. In this case, the model correctly predicts 78.3% of the cases.
- Precision (0.739): This is the proportion of true positive predictions out of the total positive predictions. In other words, of all patients the model predicts to have diabetes, 73.9% actually have diabetes.
- Sensitivity (Recall) (0.582): Also known as the True Positive Rate. Of all the patients who actually have diabetes, the model correctly identifies 58.2% of them.
- F1-score (0.651): The F1-score is the harmonic mean of Precision and Recall. This value attempts to find a balance between Precision and Recall. A higher F1 score indicates a better model.
- Specificity (0.890): This is the True Negative Rate. Of all patients who actually do not have diabetes, the model correctly identifies 89% of them.
- Prevalence (0.349): This shows the proportion of positive cases in the dataset. In this case, 34.9% of the sample have diabetes.
- NPV (0.799): Negative Predictive Value (NPV) shows the proportion of true negative predictions out of the total negative predictions. Of all patients the model predicts to not have diabetes, 79.9% actually do not have diabetes.
- FPR (0.110): False Positive Rate is the complement of Specificity (FPR = 1 − Specificity). Of all patients who actually do not have diabetes, the model incorrectly identifies 11% of them as having diabetes.
- FNR (0.418): False Negative Rate is the complement of Sensitivity (FNR = 1 − Sensitivity). Of all patients who actually have diabetes, the model incorrectly identifies 41.8% of them as not having diabetes.
- LR+ (5.292): Positive Likelihood Ratio (Sensitivity / (1 − Specificity)) measures how much more likely a positive test result is in someone who has the condition than in someone who does not. Values well above 1 indicate a useful positive result.
- LR- (0.470): Negative Likelihood Ratio ((1 − Sensitivity) / Specificity) measures how much more likely a negative test result is in someone who has the condition than in someone who does not. Values closer to 0 indicate a more useful negative result.
Conclusion:
Based on the given classification metrics, the model performs quite well in identifying patients who do not have diabetes (as indicated by the high Specificity), but its performance is lacking in identifying patients who actually have diabetes (as indicated by the relatively lower Sensitivity).
Nonetheless, considering other metrics like Precision, F1-score, and Likelihood Ratios, this model offers a fairly good balance between predicting patients who have and do not have diabetes. However, it's always crucial to consider the clinical context and the consequences of prediction errors when evaluating the model's performance in practice.
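Every metric in this section can be recomputed from the four cells of the classification table. A sketch, using the counts reported earlier:

```python
# Cells of the classification table (confusion matrix)
tn, fp, fn, tp = 445, 55, 112, 156

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)          # true negative rate
f1          = 2 * precision * sensitivity / (precision + sensitivity)
prevalence  = (tp + fn) / (tp + tn + fp + fn)
npv         = tn / (tn + fn)          # negative predictive value
fpr         = 1 - specificity         # false positive rate
fnr         = 1 - sensitivity         # false negative rate
lr_pos      = sensitivity / fpr       # positive likelihood ratio
lr_neg      = fnr / specificity       # negative likelihood ratio
```

Running these formulas reproduces all of the reported values (0.783, 0.739, 0.582, 0.651, 0.890, 0.349, 0.799, 0.110, 0.418, 5.292, 0.470) to rounding precision.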
Receiver Operating Characteristic Curve (ROC)

ROC Curve:
The ROC (Receiver Operating Characteristic) Curve is used to assess the performance of a classification model across all possible classification thresholds.
In the ROC plot:
- The X-axis measures the False Positive Rate (1-Specificity), while the Y-axis measures the True Positive Rate (Sensitivity). For reference, the diagonal line from point (0,0) to point (1,1) represents the performance of a random classification model. If the model performs like this line, then it has no discriminative ability.
- Series for the ROC Curve: This is the plot of Sensitivity against 1-Specificity for various classification thresholds. The farther this curve bows above and to the left of the diagonal line, the better the model performs.
Area Under the ROC Curve (AUC):
- AUC (0.839): The AUC value ranges between 0 and 1. An AUC of 0.839 indicates that the model has very good discriminative ability: there is an 83.9% chance that the model assigns a higher predicted probability to a randomly chosen diabetic case than to a randomly chosen non-diabetic case.
- Standard Error (0.016): This indicates how stable the AUC estimate is. The smaller the standard error, the more accurate the AUC estimate.
- z-stat (20.764): This is the test statistic for the null hypothesis that the actual AUC is 0.5 (no discriminative ability). A high z-statistic indicates that the model's AUC is significantly different from 0.5.
- Significance Level P (0.000): The p-value indicates the probability of obtaining the observed AUC or more extreme if the null hypothesis is true (i.e., the model has no discriminative ability). A very low p-value (0.000) confirms that the model has significant discriminative ability.
Conclusion:
From the ROC graph and AUC value, it can be concluded that the classification model has excellent discriminative ability in distinguishing between individuals with and without diabetes. The high AUC indicates good model quality, which is reinforced by the high z-statistic and extremely low p-value.
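The z-statistic above is simply the distance of the AUC from 0.5 in standard-error units. Recomputing it from the rounded AUC and standard error gives approximately 21.2 rather than the reported 20.764, with the gap due to rounding of the two inputs:

```python
auc, se = 0.839, 0.016             # reported AUC and its standard error

# Test H0: AUC = 0.5 (no discriminative ability)
z = (auc - 0.5) / se
discriminates = z > 2.576          # 1% two-sided critical value
```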
Performance Metrics at Various Probability Values
Model Performance Chart at Various Threshold Values: A Comparison between Sensitivity, Specificity, and Accuracy

Interpretation and Discussion of the "Performance Metrics at Various Probability Values" Chart in the Context of Diabetes:
- X-Axis (Cut-off Prob.):
- This shows the various threshold probability values used for classification. By increasing or decreasing this threshold, we can change how often the model predicts positive or negative outcomes.
- In the context of diabetes, this chart displays various threshold probabilities used to predict the presence of diabetes in individuals. By adjusting this threshold, we can alter how often patients are predicted to have or not have diabetes.
- Y-Axis (Performance Metric (%)):
- The chart shows how the accuracy of diabetes diagnosis (Sensitivity, Specificity, and Accuracy) changes according to the chosen probability threshold.
- Series - Sensitivity, Specificity, and Accuracy:
- Sensitivity: There is a noticeable downward trend in Sensitivity as the probability threshold increases. In the context of diabetes, this means the higher the threshold we set, the fewer patients who actually have diabetes we detect. This could be a problem if we want to ensure no diabetic patients are missed.
- Specificity: As the probability threshold rises, Specificity also increases. In medical practice, this means we become more confident that the patients we diagnose as not having diabetes truly do not have the condition. However, the risk is increasing the likelihood of missing patients who actually have diabetes.
- Accuracy: In the context of diabetes, accuracy indicates how often our model is correct in diagnosing the presence or absence of diabetes based on the set probability threshold.
Conclusion:
This chart is crucial in the medical context, particularly in diagnosing diabetes. By adjusting the probability threshold, doctors and medical professionals can determine the desired balance between Sensitivity and Specificity.
For example, in a scenario where it is crucial to ensure all patients with diabetes are identified (e.g., in initial screenings in areas with high diabetes prevalence), doctors might choose a lower threshold to maximize Sensitivity. However, this may result in some actually healthy patients receiving a diabetes diagnosis (False Positives).
Conversely, in settings where the consequences of False Positives are very high (e.g., in the decision to administer invasive treatments based on diagnostic outcomes), doctors might choose a higher threshold to maximize Specificity, albeit at the risk of missing some patients with diabetes (False Negatives).
Therefore, it is important to understand the impact of each probability threshold in medical practice and to adjust it according to specific clinical needs and context.
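The trade-off described above is easy to demonstrate with a small sketch. The predicted probabilities below are purely hypothetical and serve only to show that sensitivity falls and specificity rises as the threshold increases:

```python
def rates(pos_probs, neg_probs, threshold):
    """Sensitivity and specificity when predicting 'diabetic' for p >= threshold."""
    tp = sum(p >= threshold for p in pos_probs)
    tn = sum(p < threshold for p in neg_probs)
    return tp / len(pos_probs), tn / len(neg_probs)

# Hypothetical predicted probabilities, for illustration only
pos = [0.9, 0.8, 0.7, 0.4, 0.3]    # patients who actually have diabetes
neg = [0.1, 0.2, 0.3, 0.6]         # patients who do not

sens, specs = [], []
for t in (0.25, 0.50, 0.75):
    s, sp = rates(pos, neg, t)
    sens.append(s)
    specs.append(sp)
```

Sweeping the threshold from 0.25 to 0.75 drives sensitivity down (1.0 → 0.6 → 0.4) while specificity climbs (0.5 → 0.75 → 1.0), which is exactly the pattern the performance chart shows.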
Distribution of Outcome vs Prediction

Interpretation and Discussion of the "Distribution of Outcome Versus Prediction" Chart:
- Cut-Off Probability (X-Axis):
- This represents the probability threshold used for classifying predictions. For example, if we use a threshold of 0.3, then all predictions with probabilities above 0.3 will be classified as "YES," and those below as "NO."
- Frequency (Y-Axis):
- This indicates the number of predictions that fall into either the "YES" or "NO" category at each probability threshold.
- YES and NO Series:
- At lower thresholds (e.g., 0 to 0.2), we see that the majority of predictions are classified as "NO."
- As the threshold increases, the number of "NO" predictions decreases, while the "YES" predictions increase.
- There is a peak for the "YES" category around the 0.7 threshold, where "YES" predictions reach their highest frequency.
- The frequency of "NO" predictions reaches its lowest value at around the same threshold (0.7).
Interpretation and Relevance to the Pima Tribe Diabetes Database:
This dataset focuses on diagnosing diabetes in females from the Pima tribe by considering various health parameters. The objective is to predict the likelihood of a patient having diabetes based on predictor variables.
From the "Distribution of Outcome Versus Prediction" chart, we can see how the distribution of diabetes diagnosis predictions changes based on various probability thresholds:
- Probability Threshold and Clinical Decisions:
- Determining the probability threshold is crucial in a medical context as it can influence clinical decisions. A lower threshold means more patients will be labeled as having diabetes (increasing Sensitivity). However, this also raises the risk of misdiagnosis (reducing Specificity).
- Conversely, a higher threshold means fewer patients will be labeled as having diabetes, reducing the risk of misdiagnosis but also increasing the risk of missing a diagnosis in patients who actually have diabetes.
- Relevance in Clinical Practice:
- Given the serious consequences of diabetes, choosing the right threshold is critical. If the consequences of missing a diabetes diagnosis (False Negative) are considered more serious than misdiagnosing someone with diabetes (False Positive), then a lower threshold may be chosen to increase sensitivity.
- By understanding the distribution of predictions at various thresholds, clinicians can make more accurate and evidence-based decisions on how to diagnose patients.
- Chart Conclusion:
- The chart shows how the distribution of predictions changes with various probability thresholds for diabetes diagnosis.
- At lower thresholds, more patients are labeled as having diabetes; conversely, at higher thresholds, fewer patients are labeled as having diabetes.
- This illustrates how we can adjust the model's sensitivity and specificity based on the chosen threshold, which in turn can influence clinical decisions.
Taking all this information into account, it becomes clear that the proper selection of a threshold is crucial in the diagnosis and treatment of diabetes among women from the Pima tribe. This decision must be made while considering the clinical consequences of prediction errors and specific patient needs.
Residual Table and Outlier Data Check

Residual analysis is a critical step in assessing the quality of a logistic regression model (or any other regression model). In the context of a logistic regression model to predict diabetes, residuals measure how well the model's predictions align with the actual data. There are several types of residuals and other statistics used to evaluate model fit, including:
- Residual: This is the difference between the actual observed outcome and the outcome predicted by the model.
- Pearson Residual: This is a type of normalized residual frequently used in logistic regression.
- Deviance Residual: This measures how well the model predicts a particular outcome.
- Leverage: This measures how far a data point's predictor values lie from the average of the predictors. Data points with high leverage can have a significant impact on model estimates.
- Studentized Pearson Residual: This is the Pearson Residual divided by its standard deviation.
- Studentized Deviance Residual: Similar to deviance residuals, but it has been normalized.
- Likelihood Residual: This measures how well the model predicts an outcome, similar to the deviance residual.
- Cook's Distance: A measure that identifies data points that might have a significant impact on model estimates.
- DFITS: Another statistic that measures the influence of a data point on the model.
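For a single binary observation, the first two residual types above reduce to simple formulas. A sketch (illustrative, not SmartstatXL's exact implementation):

```python
import math

def pearson_residual(y, p):
    """(observed - fitted) scaled by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of this observation's contribution to the deviance."""
    contrib = -2.0 * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return math.copysign(math.sqrt(contrib), y - p)
```

For example, an observed diabetic case (y = 1) with a fitted probability of 0.8 has a Pearson residual of 0.5; the same fitted probability for an observed non-diabetic case yields a large negative deviance residual, flagging a potential outlier.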
From the data above, the "Diagnostic" column provides additional information about specific data points:
- Outlier: This data point has very high or very low residuals, indicating that the model's prediction for this point is far from the actual observation.
- Extreme: This indicates that the data point has a significant impact on model estimates. This could be due to a combination of rare features or values far from the mean.
In practice, it is crucial to check and understand outliers and extreme data points as they can have a significant impact on the model and the interpretation of its results.
Diagnostic Interpretation:
- Outlier: An outlier refers to a response value (in this case, the likelihood of having diabetes or 'Y') that does not align with the model's prediction. This means that for that data point, our model predicts significantly differently from what actually happened. In other words, an outlier is a case where the difference between the actual observed outcome and the outcome predicted by the model is very significant. This may be due to specific characteristics of the patient that are not well-covered by the model or data errors.
- Extreme: While outliers pertain to response values, extremes focus more on predictor values (in this case, variables like 'Pregnancies', 'Glucose', 'BloodPressure', etc.). An extreme refers to a data point where one or more predictor variables have very high or low values compared to the rest of the sample. This may indicate that the patient has conditions or characteristics rarely found in the dataset. Extremes can have a significant impact on the model, especially if the model heavily relies on those predictor variables for making predictions. In the table, extremes are indicated by high leverage values, which show how far a data point is from the data mean in terms of predictor variables.
Conclusion:
When evaluating a model, it's crucial to differentiate between outliers and extremes. Outliers point out where our model may not be making accurate predictions, whereas extremes signify data points with rare characteristics that could potentially affect the model's performance. Both types of data points should be carefully scrutinized as they offer insights into the weaknesses of the model and areas that may require improvement or further data review.
Probit Regression
The interpretation of the results from regression analysis using Probit regression is largely analogous to that of Logistic regression.
Information on the Type of Regression Used, Regression Method, Response and Predictors, and Probit Regression Equation
Statistical Values for Model Accuracy and Regression Coefficient Estimates
Deviance Analysis and Classification Table (Confusion Matrix) and Other Metrics
Receiver Operating Characteristic (ROC) Curve and Statistics for the Area Under the ROC Curve
Performance Metrics at Various Probability Levels (Sensitivity, Specificity, Accuracy)
Outcome Distribution vs Prediction
Residual Table and Outlier Data Examination
Conclusion
Based on the analysis conducted on diabetes data, several key points can be concluded:
- Descriptive Analysis: We found that variables such as Glucose, Blood Pressure, and BMI have significant variations in their distribution, emphasizing their importance in understanding the risk of diabetes. This corroborates previous literature associating these factors with diabetes.
- Predictive Model: Analysis of performance metrics at various probability thresholds indicates that by altering the probability threshold, we can modify the sensitivity and specificity of the model. This is crucial in a medical context, where we might prioritize identifying all positive cases (high sensitivity) over avoiding false positives (high specificity).
- Residual Analysis: Several data points were identified as outliers or extremes. Outliers highlight where the model may not be making accurate predictions, while extremes mark data points with rare characteristics that may influence model performance.
Scientific Report Writing
Writing Results and Discussions in Scientific Research
Descriptive Analysis
Based on the descriptive analysis, variables like Glucose, Blood Pressure, and BMI exhibit significant variations. This confirms that these factors play a crucial role in diabetes risk, in line with prior literature.
Predictive Model
In performance metric analysis, it was found that by altering the probability threshold, we can adjust the model's sensitivity and specificity. This demonstrates the model's flexibility in catering to diverse clinical needs. For instance, in the context of diabetes prevention, we might want to prioritize identifying all positive cases rather than avoiding false positives.
Residual Analysis
The residual analysis reveals the presence of outliers and extremes. Outliers indicate where our model may not be predicting accurately, whereas extremes signify patients with rarely encountered characteristics. Both types of data points highlight areas where the model may need improvement or where the data may require a re-examination.
CONCLUSION
Based on the analysis conducted, it was found that variables like Glucose, Blood Pressure, and BMI play a vital role in diabetes risk. The developed predictive model exhibits flexibility in meeting different clinical needs; however, the residual analysis indicates that there are some areas where the model may require enhancement. As a next step, researchers may consider revisiting the data or developing a model using a different approach to improve prediction accuracy and reliability.