Blog

  • Predicting Diabetes: Comparing Logistic Regression, SVMs, and Random Forests

    While exploring Kaggle’s data science resources, I came across an interesting diabetes dataset and decided to build a predictive model with it. The dataset structure is simple: 8 independent variables and a target variable called Outcome, which indicates the presence or absence of diabetes. My objective is to build a robust model that accurately predicts diabetes from these variables.

    Dataset Overview

    This comprehensive medical dataset contains diagnostic measurements specifically collected for diabetes prediction based on various health indicators. It encompasses 768 female patient records, with each record containing 8 distinct health parameters. The Outcome variable serves as the binary classifier, indicating diabetes presence (1) or absence (0). This dataset serves as an excellent resource for training and evaluating machine learning classification models in the context of diabetes prediction.

    • Pregnancies (Integer): Total pregnancy count for each patient.
    • Glucose (Integer): Plasma glucose concentration two hours into an oral glucose tolerance test (mg/dL).
    • BloodPressure (Integer): Measured diastolic blood pressure (mm Hg).
    • SkinThickness (Integer): Measured triceps skin fold thickness (mm).
    • Insulin (Integer): Measured 2-hour serum insulin levels (mu U/ml).
    • BMI (Float): Calculated body mass index using weight(kg)/height(m)^2.
    • DiabetesPedigreeFunction (Float): Calculated genetic diabetes predisposition score based on family history.
    • Age (Integer): Patient’s age in years.
    • Outcome (Binary): Target variable indicating diabetes (1) or no diabetes (0).

    This valuable dataset, adapted from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), is frequently utilized in data science research focusing on healthcare analytics and medical diagnostics.

    Exploratory Data Analysis

    Initial examination of the data reveals a significant concern: several variables, including BloodPressure, Glucose, SkinThickness, Insulin, and BMI, contain zero values. These almost certainly represent missing measurements rather than real readings, so the affected rows must be removed for an accurate analysis.
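
    The code in this post relies on a handful of R packages (dplyr for the data wrangling, MASS for boxcox, caret for preProcess, corrplot, caTools for sample.split, and e1071 for the SVM); the original write-up does not show the library calls, so a setup along these lines is assumed:

    library(dplyr)        #filter, select, %>%
    library(MASS)         #boxcox
    library(caret)        #preProcess
    library(corrplot)     #correlation plot
    library(caTools)      #sample.split
    library(e1071)        #svm, tune
    library(randomForest) #used in the random forest sketch at the end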

    diabetes <- read.csv('diabetes_dataset.csv') #Read Data
    summary(diabetes)
    Diabetes Summary

    After removing the zero values, the dataset shrinks to 392 observations. Though smaller, this sample remains adequate for building a reliable predictive model.

    diabetes <- diabetes %>%
      filter(BloodPressure != 0) %>%
      filter(Glucose != 0) %>%
      filter(SkinThickness != 0) %>%
      filter(Insulin != 0) %>%
      filter(BMI != 0)
    hist(diabetes$Age)
    hist(diabetes$DiabetesPedigreeFunction)
    hist(diabetes$BMI)
    hist(diabetes$Insulin)
    hist(diabetes$SkinThickness)
    hist(diabetes$BloodPressure)
    hist(diabetes$Glucose)
    hist(diabetes$Pregnancies)
    Histograms

    Next, we examine the distributions of all the independent variables.

    The histograms show that most variables are roughly normally distributed, with the exception of Pregnancies, Age, and Insulin, which are noticeably skewed. The skewed distribution of the pregnancy counts aligns with logical expectations.

    Data Preprocessing

    We apply the following transformations:

    • Insulin: Box-Cox transform
    • Age: Box-Cox transform
    • Pregnancies: Square-root transform
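
    For reference, the Box-Cox transform with parameter lambda maps x to (x^lambda - 1)/lambda, reducing to log(x) when lambda = 0; the lambda values used in the code below, -1.4 for Age and 0.05 for Insulin, are presumably the ones suggested by the boxcox() likelihood plots.
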
    #PreProcess Data to Get everything into Normal Distribution
    
    #Applying Box-Cox Transform on Age
    
    boxcox(diabetes$Age ~ 1)
    diabetes$boxcoxAge <- (diabetes$Age^-1.4 - 1)/-1.4
    hist(diabetes$boxcoxAge)
    
    #Applying Box-Cox Transform on Insulin
    
    boxcox(diabetes$Insulin ~ 1)
    diabetes$boxcoxInsulin <- (diabetes$Insulin^0.05 -1)/0.05
    hist(diabetes$boxcoxInsulin)
    
    #Applying Square Root Transform on Pregnancies
    diabetes$sqrtPregnancies <- sqrt(diabetes$Pregnancies)
    hist(diabetes$sqrtPregnancies)

    Finally, we apply min-max scaling to bring all values into the range 0 to 1, so that no variable dominates the models simply because of its larger scale.

    #Storing relevant variables in a new dataframe and scaling the data
    
    diabetes.clean <- diabetes %>%
      dplyr::select(
        Outcome,
        DiabetesPedigreeFunction,
        sqrtPregnancies,
        SkinThickness,
        boxcoxInsulin,
        boxcoxAge,
        BMI,
        Glucose,
        BloodPressure
      )
      
    
    preproc <- preProcess(diabetes.clean, method = "range")
    scaled.diabetes.clean <- predict(preproc, diabetes.clean)
    
    head(scaled.diabetes.clean)
    str(scaled.diabetes.clean)
    scaled.diabetes.clean$Outcome <- as.factor(scaled.diabetes.clean$Outcome)
    

    Looking for correlations in the variables

    The correlation matrix reveals no strong correlations among the variables, so we can reasonably treat them as independent predictors in our models.

    # Looking for Correlations within the Data
    
    num.cols <- sapply(scaled.diabetes.clean, is.numeric)
    cor.data <- cor(scaled.diabetes.clean[,num.cols])
    cor.data
    corrplot(cor.data, method = 'color')
    Corrplot

    Splitting the data into training and testing set

    For model development and evaluation, we implement a standard data partitioning strategy, allocating 70% of the observations to the training dataset and reserving the remaining 30% for testing purposes.

    #Splitting Data into train and test sets
    
    sample <- sample.split(scaled.diabetes.clean$Outcome, SplitRatio = 0.7)
    train <- subset(scaled.diabetes.clean, sample == TRUE)
    test <- subset(scaled.diabetes.clean, sample == FALSE)

    Building a Model

    Our predictive modeling approach incorporates three distinct machine learning techniques: Logistic Regression, Support Vector Machines, and Random Forests.

    Logistic Regression

    The logistic regression implementation yields a respectable accuracy of 77.12%. The model identifies Age, BMI, and Glucose as significant predictors of diabetes, with the diabetes pedigree function showing moderate influence. This suggests that while genetic predisposition plays a role, lifestyle factors remain crucial in diabetes prevention.

    Support Vector Machines

    Despite parameter tuning efforts, the Support Vector Machines algorithm demonstrates slightly lower performance, achieving 74.9% accuracy compared to the logistic regression model.

    Random Forest

    Random Forests emerge as the superior performer among the three approaches, delivering the highest accuracy at 79.56%.

    A critical observation across all models is the notably lower proportion of Type I errors (false positives). In this medical context, however, false negatives pose the greater risk, since a missed diagnosis is costlier than a false alarm, so the error rates on the positive class deserve particular attention.

    Comparing the Models

    A comparative analysis of model performance metrics reveals Random Forests as the top performer, though there remains room for improvement. It’s worth noting that the necessity to exclude numerous observations due to measurement inconsistencies may have impacted model performance. While this model shows promise as a preliminary diabetes screening tool with reasonable accuracy, developing a more precise predictive model would require additional data points and refined measurements.

    Code

    ### LOGISTIC REGRESSION ###
    
    log.model <- glm(formula=Outcome ~ . , family = binomial(link='logit'),data = train)
    summary(log.model)
    fitted.probabilities <- predict(log.model,newdata=test,type='response')
    fitted.results <- ifelse(fitted.probabilities > 0.5,1,0)
    misClasificError <- mean(fitted.results != test$Outcome)
    print(paste('Accuracy',1-misClasificError))
    table(test$Outcome, fitted.probabilities > 0.5)
    
    ### SVM ####
    
    svm.model <- svm(Outcome ~., data = train)
    summary(svm.model)
    predicted.svm.Outcome <- predict(svm.model, test)
    table(predicted.svm.Outcome, test[,1])
    tune.results <- tune(svm, 
                         train.x = train[2:9], 
                         train.y = train[,1], 
                         kernel = 'radial',
                         ranges = list(cost=c(1.25, 1.5, 1.75), gamma = c(0.25, 0.3, 0.35)))
    summary(tune.results)
    tuned.svm.model <- svm(Outcome ~., 
                           data = train, 
                           kernel = "radial",
                           cost = 1.25,
                           gamma = 0.25,
                           probability = TRUE)
    summary(tuned.svm.model)
    print(svm.model)
    tuned.predicted.svm.Outcome <- predict(tuned.svm.model, test)
    table(tuned.predicted.svm.Outcome, test[,1])
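
    The random forest results discussed above are not included in the code section; a minimal sketch of how that model might be fit with the randomForest package is given below (the seed and ntree values are assumptions, not the author’s settings).

    ### RANDOM FOREST ###
    
    set.seed(101)                               #arbitrary seed so the forest is reproducible
    rf.model <- randomForest(Outcome ~ .,       #Outcome is a factor, so randomForest fits a classifier
                             data = train,
                             ntree = 500,       #assumed tree count, not tuned here
                             importance = TRUE)
    print(rf.model)
    varImpPlot(rf.model)                        #shows which predictors the forest leans on most
    
    predicted.rf.Outcome <- predict(rf.model, test)
    table(predicted.rf.Outcome, test$Outcome)   #confusion matrix on the held-out test set
    print(paste('Accuracy', mean(predicted.rf.Outcome == test$Outcome)))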
    Table

  • Are US Police Trigger Happy?

    Recently, I stumbled upon a Washington Post article discussing the statistics of police-involved shootings and fatalities over recent years. The article referenced a comprehensive dataset, which I managed to download before encountering the paywall. This dataset documented all fatal police shootings spanning roughly a decade. While the data extended into 2024, I’ve excluded those entries from my analysis since the year is still ongoing.

    The dataset contained several key parameters:

    1. Date
    2. Name
    3. Age
    4. Gender
    5. Armed
    6. Race
    7. City
    8. State
    9. Flee
    10. Body Camera
    11. Signs of Mental Illness
    12. Police Departments Involved

    The dataset revealed a staggering 9,893 cases – an alarmingly high number of individuals who lost their lives without due process, regardless of their alleged criminal activities. Each number represents a person denied their constitutional right to a fair trial.

    After examining data quality and addressing missing values across the columns, I had to exclude roughly 400 entries, leaving 9,509 cases for analysis. This is still a large enough sample to draw meaningful conclusions about the patterns in the overall dataset.

    Demographics of the Victims

    My initial analysis focused on examining the age distribution of police shooting victims. The data showed a concentration in the 25-60 age range, which aligns with general crime statistics. This age group typically shows higher involvement in criminal activities or presence in high-crime areas.

    Age distribution

    Further investigation revealed interesting patterns when analyzing racial demographics.

    The data initially appears to reflect expected proportions, given that White Americans comprise roughly 65-70% of the total population, explaining their higher representation among police shooting victims. However, a deeper analysis reveals concerning trends: Black and Hispanic victims show a notably skewed age distribution toward younger ages, with victims predominantly in their late teens and early twenties. In contrast, White victims follow a more normal distribution pattern, typically falling in their late twenties or thirties. This raises questions about whether social changes over the past few decades have led to increased police interactions with younger people of color. While this observation warrants further investigation, additional data would be needed to draw definitive conclusions about these demographic disparities.

    Age Distribution Based On Race

    While raw numbers provide one perspective, examining the percentage of population affected by police interactions offers deeper insights. I analyzed the Washington Post dataset in conjunction with US demographic data (sourced from here) to calculate these proportional impacts.
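
    The exact figures depend on the census table used, but the calculation itself is just a normalization: divide each group’s victim count by that group’s population. A minimal R sketch of the idea follows; the data frame name shootings and the population shares are placeholders for illustration, not the values used in this analysis.

    #Sketch of the per-capita calculation (illustrative only)
    library(dplyr)
    
    us.population <- 330e6                     #rough US total, for illustration
    pop.share <- data.frame(
      Race  = c("White", "Black", "Hispanic", "Asian", "Native American"),
      share = c(0.60, 0.13, 0.19, 0.06, 0.01)  #placeholder proportions, not the census values used here
    )
    
    victims.by.race <- shootings %>%           #'shootings' = the cleaned Washington Post data (hypothetical name)
      count(Race, name = "victims") %>%
      inner_join(pop.share, by = "Race") %>%
      mutate(rate.per.million = victims / (share * us.population) * 1e6)
    victims.by.race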

    Percent of victims by race

    In my analysis, I excluded the “Unknown” race category, since the census offers no comparable group to weigh it against; these entries most likely stem from incomplete police documentation. It’s worth noting that approximately 10% of victims in the original dataset had no recorded race.

    The proportional analysis reveals striking disparities: Native American and Black populations face double the likelihood of fatal police encounters compared to white populations. Hispanic individuals experience similar rates of fatal police interactions as white populations, while Asian Americans show half the likelihood of such encounters. The “Multiple Races” category contained insufficient data points for meaningful analysis, possibly due to inconsistent reporting in police records.

    One potential explanation for Asian Americans’ lower representation in police shooting statistics could be their generally reduced frequency of police interactions. While Asian Americans are widely recognized as one of America’s most successful immigrant groups, their family structure, as documented in Pew Research findings, might be a contributing factor. However, this remains a preliminary hypothesis requiring further investigation for definitive conclusions.

    Mental Health of the Victims

    My analysis then shifted to examining the mental health status of victims.

    The findings are concerning: approximately 2,000 victims over the past decade exhibited signs of mental illness. This suggests that redirecting resources toward mental health professionals and social workers might be more effective than relying solely on law enforcement.

    Mental health

    Breaking down the mental health data by race reveals another pattern: among white victims, roughly three are classified as not mentally ill for every one showing signs of mental illness, whereas for Black, Hispanic, and Native American victims the ratio is closer to five to one.

    Mental illness by race

    It’s crucial to note that these mental health classifications are based on behavioral signs observed during police encounters, rather than professional diagnoses or established medical histories.

    Circumstantial Trends

    I focused on two key situational factors:

    1. Were the victims trying to flee?
    2. Were the victims armed?

    Were the Victims trying to Flee?

    Fleeing Behavior

    Analysis across racial demographics indicates that approximately half of the victims were not attempting to escape during their encounters with law enforcement. This suggests that many victims were likely complying with police directives, though definitive conclusions cannot be drawn solely from this dataset.

    Fleeing mode

    When examining the intersection of mental health status and escape attempts, a notable pattern emerges: the majority of mentally ill victims were not attempting to flee. This observation raises significant concerns about the necessity of lethal force in situations where alternative intervention methods might have been viable.

    The behavioral patterns of victims, when analyzed across different racial groups, demonstrate remarkable consistency. Where sufficient data exists, the distribution of victim responses appears uniform across racial categories, suggesting that behavioral responses to police encounters transcend racial boundaries.

    Fleeing behavior by race

    This consistency prompts a critical inquiry: In cases where victims showed no intention to escape, what circumstances prevented successful arrests without resorting to lethal force?

    Were the Victims armed?

    Analysis of weapon possession among victims reveals firearms as the predominant type of armament. However, a distinct pattern emerges among mentally ill victims within Native American and Hispanic communities, where knife possession was notably more prevalent.

    Were victims armed

    This finding underscores the potential benefits of enhanced firearm regulation in protecting law enforcement officers – a measure that has faced consistent opposition from the National Rifle Association.

    How have Police Shootings trended over time?

    The past decade has witnessed a concerning upward trajectory in police-involved shootings and resultant fatalities. While a ten-year span might seem relatively brief in historical context, the data reveals a disturbing average of approximately 1,000 victims annually.

    Trend over time

    The implementation of body-worn cameras appears to have limited impact, though it’s important to acknowledge potential delays between policy implementation and observable outcomes.

    Body camera

    Particularly concerning is the fact that body cameras were present in only one-third of documented cases.

    Body camera usage by race

    Breaking body camera usage down by racial category shows that, while overall utilization is trending upward, incidents recorded on body cameras are less likely to have the victim’s race documented – a concerning pattern in itself.

    Conclusion

    While numerous aspects of this issue warrant further investigation, certain data points remain unavailable – notably, comprehensive information about all police interactions, as this dataset exclusively covers fatal encounters.

    Nevertheless, the loss of nearly 10,000 lives to police shootings over a decade is an alarming figure, particularly considering that around 20% of the victims displayed signs of mental illness and roughly half were not attempting to flee.

    This analysis, while revealing, highlights the need for more comprehensive research and complete datasets to fully understand and address these critical issues.

  • Spreadsheets: Common man’s programming tool

    #include <stdio.h>
    
    int main() {
        printf("Hello, World!\n");
        return 0;
    }

    I remember sitting in my computer science class about two decades ago while my teacher taught us how to print “Hello World”. I never became a computer scientist – nor did I become a professional programmer. But I did come to appreciate how useful programming is for most professions.

    As an experimental Materials Scientist, I use programming constantly to manipulate data, to analyze it, and to predict the best set of experiments to run – and all the while I wonder why the average student is taught the dry “Hello World” introduction that comes with C, C++, Python, or any other programming language, rather than being introduced to the power of spreadsheets. Don’t get me wrong, I don’t dismiss the value of full programming languages, but in my mind =SUM(A1:A45) has more value than printf("Hello, World!\n"); because it offers a more practical entry point. Spreadsheets may not be sexy, but for most people they are the perfect tool: they reduce errors and increase automation, saving time.

    Here are a few good reasons why I feel spreadsheets are quite important:

    1. Low barrier to Entry
    2. Democratization of data
    3. WYSIWYG
    4. Teaching the fundamentals of programming

    And once someone graduates past basic spreadsheet use, Microsoft Excel offers VBA and Google Sheets offers Google Apps Script – built-in programming languages that enable more complex functionality.

    Spreadsheets offer a powerful and versatile toolset for anyone who works with data. Their low barrier to entry makes them accessible, while features like formulas and conditional formatting automate tasks, saving time and reducing errors. But spreadsheets hold a hidden gem: VBA in Excel and Apps Script in Google Sheets. These built-in programming languages unlock a whole new level of automation and functionality. Imagine automating complex data analysis, generating reports with a single click, or creating custom functions tailored to your specific needs.

    The next time you find yourself drowning in data, don’t underestimate the power of your spreadsheet. With a little exploration and the help of readily available online resources, you can unlock the hidden potential of VBA or Apps Script and transform your workflow. So, ditch the “Hello World” and dive into the world of spreadsheet programming – the possibilities are endless!