5  Data Science Workflow

5.1 Introduction to Data Science Workflow

In a typical data science workflow, we move through five primary stages: data collection, data cleaning, exploration, modeling, and communication. These steps collectively transform raw data into actionable insights driven by research questions while also enabling efficient and reproducible data science practices.

  1. Data Collection: This is the first step, where data is gathered from various sources, such as databases, APIs, or web scraping. In R, packages like readr, readxl, and httr facilitate efficient data import from structured files (e.g., CSV, Excel) and web sources.

  2. Data Cleaning: Data cleaning involves preparing raw data for analysis, handling missing values, correcting data types, and dealing with outliers. Tools like dplyr and tidyr are often used in R to perform these operations, enabling tasks like removing duplicates, imputing missing data, and restructuring data into a “tidy” format suitable for analysis.

  3. Data Exploration: Exploratory Data Analysis (EDA) is the phase where we examine the data’s characteristics, uncover patterns, and form hypotheses. Visualizations (using pairs() or a wide range of plot types available in Base R and ggplot2) and summary statistics (summary(), skimr) help in understanding the data distribution, relationships between variables, and identifying any anomalies.

  4. Modeling: At this stage, statistical or machine learning models are developed to predict or explain outcomes. In R, packages like caret and tidymodels streamline the modelling process, from splitting data and selecting models to tuning hyperparameters and evaluating performance.

  5. Communication: The final stage focuses on presenting findings clearly, often through reports, dashboards, or interactive applications. Using R Markdown for reports or Shiny for interactive applications enables data scientists to effectively communicate insights to stakeholders.

Quick example in R: hypothetical local dataset “house_prices.csv”

Let’s consider a simple example of a data science project where we predict house prices based on available features:

  • Data Collection: Load a dataset such as house_prices.csv using read_csv().
  • Data Cleaning: Use dplyr and tidyr to handle missing values (e.g., mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE)))) to impute numeric columns with their medians) and convert categorical variables to factors (see the combined sketch after this list).
  • Data Exploration: Visualize relationships between features like square footage and price using ggplot(data = houses) + geom_point(aes(x = sqft, y = price)).
  • Modeling: Create a linear model with lm(price ~ sqft + num_bedrooms, data = houses).
  • Communication: Report results in an R Markdown document, showing model coefficients, predictions, and visualizations to explain findings.
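Putting the five steps above together, here is a minimal sketch of the full pipeline. It assumes a hypothetical house_prices.csv in the working directory with (at least) the columns price, sqft, and num_bedrooms; adjust names and paths to your own data.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# 1. Data collection: import the hypothetical CSV file
houses <- read_csv("house_prices.csv")

# 2. Data cleaning: impute numeric columns with their medians and
#    convert character columns to factors
houses <- houses %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE)))) %>%
  mutate(across(where(is.character), as.factor))

# 3. Data exploration: price against square footage
ggplot(houses, aes(x = sqft, y = price)) +
  geom_point()

# 4. Modeling: simple linear model (columns assumed to exist)
price_model <- lm(price ~ sqft + num_bedrooms, data = houses)
summary(price_model)

# 5. Communication: these outputs would typically be embedded in an
#    R Markdown report or a Shiny application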

Quick example in R: canonical dataset USArrests

Let’s consider another simple example of a data science project. Suppose we aim to explore the main factors associated with violent crime arrests using the canonical dataset USArrests.

  1. Data Collection: Load the dataset directly from R’s built-in datasets.
data("USArrests")
  2. Data Cleaning: Check for any missing values or potential outliers that might represent data entry errors.
# Artificially introduce a missing value at row 1, column 1 (USArrests has none by default)
USArrests[1, 1] <- NA

# Check for missing values
any(is.na(USArrests))
[1] TRUE
# remove missing value
USArrests_clean <- na.omit(USArrests)
  3. Data Exploration: Visualize and summarize data to understand patterns. For instance, we might be interested in the value distribution of murder arrests.
# Quick summary of the data distribution
summary(USArrests_clean)
     Murder          Assault         UrbanPop          Rape      
 Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
 1st Qu.: 4.000   1st Qu.:109.0   1st Qu.:54.00   1st Qu.:14.90  
 Median : 7.200   Median :159.0   Median :66.00   Median :20.00  
 Mean   : 7.678   Mean   :169.4   Mean   :65.69   Mean   :21.23  
 3rd Qu.:11.100   3rd Qu.:249.0   3rd Qu.:78.00   3rd Qu.:26.20  
 Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00  
# Visualize the distribution of murder arrests with a combined histogram and box plot
layout(matrix(1:2, nrow = 2), heights = c(2, 1.5))
par(mar = c(0, 4, 4, 2) + 0.1)
hist(USArrests_clean$Murder, main="Distribution of counts of murder arrests in 'USArrests'", xaxt = "n")
par(mar = c(2, 4, 0, 2) + 0.1)
boxplot(USArrests_clean$Murder, horizontal = TRUE)

The combined histogram and box plot give a quick view of the central tendency and spread of murder arrests across the dataset. For example, we can already observe that the mean and median of the distribution (7.677551 and 7.2, respectively) lie in the lower half of the range (0.8 to 17.4), which implies a longer right tail (i.e., the difference between the maximum and the mean is greater than the difference between the mean and the minimum).

  4. Modeling: Create a simple model to analyse relationships between variables. For instance, let’s assume we’re exploring how UrbanPop (population percentage in urban areas) might relate to Murder.
model <- lm(Murder ~ UrbanPop, data = USArrests_clean)
summary(model)

Call:
lm(formula = Murder ~ UrbanPop, data = USArrests_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3323 -3.7628 -0.7344  2.9121  9.8656 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  6.02650    2.90210   2.077   0.0433 *
UrbanPop     0.02513    0.04315   0.582   0.5630  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.359 on 47 degrees of freedom
Multiple R-squared:  0.007167,  Adjusted R-squared:  -0.01396 
F-statistic: 0.3393 on 1 and 47 DF,  p-value: 0.563

The results of a simple linear regression provide a basic understanding of how one variable might be related to another, here framed as a dependent, presumably causal relationship. In this case, there is no statistically significant evidence that urban population percentage is a relevant factor in murder arrests in our dataset.

  5. Communication: Present findings in an R Markdown document, incorporating both visualizations and model summaries. You can interpret the results to suggest whether murder arrests are more common in more urbanised states.

5.2 Data Science Workflow in R: Data Import and Preparation

5.2.1 Importing Data

A critical step in any data science workflow is importing data from external sources. R provides robust tools for importing data from a variety of formats:

  • CSV files: Use the read.csv() function from base R.
  • Excel files: Leverage the readxl package with read_excel().
  • Databases: Employ the DBI package to connect to relational databases and fetch data using SQL queries.

Examples using base R, readxl and DBI+RSQLite

# Load necessary libraries
library(readxl)
library(DBI)
library(RSQLite)

# Read a CSV file
csv_data <- read.csv("data.csv")

# Read an Excel file
excel_data <- readxl::read_excel("data.xlsx")

# Connect to a SQLite database and fetch data
con <- DBI::dbConnect(RSQLite::SQLite(), "data.db")
db_data <- DBI::dbGetQuery(con, "SELECT * FROM table_name")
DBI::dbDisconnect(con)

5.2.2 Data Cleaning

Before analysis, raw data often requires cleaning to address issues like missing values, duplicates, and inconsistencies.

Example using Base R

  • Removing Missing Values
    Use na.omit() to remove rows with missing values or is.na() to identify them.
# Example data
data <- data.frame(A = c(1, 2, NA, 4), B = c("x", NA, "y", "z"))
print(data)
   A    B
1  1    x
2  2 <NA>
3 NA    y
4  4    z
# Remove rows with NA
clean_data <- na.omit(data)
print(clean_data)
  A B
1 1 x
4 4 z
  • Handling Duplicates
    Use duplicated() to identify duplicate rows or unique() to retain only unique rows.
data <- data.frame(A = c(1, 2, 2, 4), B = c("x", "y", "y", "z"))
print(data)
  A B
1 1 x
2 2 y
3 2 y
4 4 z
# Remove duplicate rows
data_unique <- data[!duplicated(data), ]
print(data_unique)
  A B
1 1 x
2 2 y
4 4 z
  • Replacing Values
    Replace specific values with ifelse() or direct indexing.
data <- data.frame(A = c(1, 2, 999, 4), B = c("x", "y", "z", "999"))
print(data)
    A   B
1   1   x
2   2   y
3 999   z
4   4 999
# Replace 999 with NA
data[data == 999] <- NA
print(data)
   A    B
1  1    x
2  2    y
3 NA    z
4  4 <NA>

Example using tidyverse

Key functions include:
- tidyr: Tools like fill() (fill missing values) and drop_na() (remove rows with NAs).
- dplyr: Functions like distinct() to remove duplicates and mutate() to fix inconsistencies.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)

# Example dataset
data <- data.frame(id = c(1, 2, 2, 3, NA), value = c(NA, "A", "A", "B", "C"))
print(data)
  id value
1  1  <NA>
2  2     A
3  2     A
4  3     B
5 NA     C
# Clean the data
cleaned_data <- data %>%
  drop_na(id) %>% # Remove rows with missing IDs
  distinct() %>%  # Remove duplicates
  fill(value, .direction = "down") # Fill missing values downward
print(cleaned_data)
  id value
1  1  <NA>
2  2     A
3  3     B

5.2.3 Data Transformation

Transforming data is essential for reshaping and preparing it for analysis.

Example using Base R

Base R provides versatile and efficient tools for cleaning and transforming data.

A few example operations common in data science workflows are:

  • Filtering Rows
    Subset data using logical conditions.
data <- data.frame(A = 1:5, B = letters[1:5])
print(data)
  A B
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
# Filter rows where A > 3
filtered_data <- data[data$A > 3, ]
print(filtered_data)
  A B
4 4 d
5 5 e
  • Selecting Columns
    Use indexing to select specific columns.
# Select column A
selected_columns <- data[, "A", drop = FALSE]
print(selected_columns)
  A
1 1
2 2
3 3
4 4
5 5

NOTE: the argument drop = FALSE ensures that the original data frame structure is not lost in the process (run data[, "A"] to compare).

  • Adding or Modifying Columns
    Use the $ operator or indexing to create or modify columns.
data$new_col <- data$A * 2
print(data)
  A B new_col
1 1 a       2
2 2 b       4
3 3 c       6
4 4 d       8
5 5 e      10
  • Reshaping Data
    Use reshape() to pivot data between wide and long formats. In the wide format each variable corresponds to a column (table format), while the long format stores variable-value pairs in separate rows, keeping one or more identifier columns from the wide format (a sketch of the reverse conversion follows the example below).
# Example wide format
data <- data.frame(id = 1:2, Q1 = c(10, 20), Q2 = c(30, 40))

# Convert to long format
long_data <- reshape(data, direction = "long", varying = list(c("Q1", "Q2")), v.names = "value", timevar = "quarter")
print(long_data)
    id quarter value
1.1  1       1    10
2.1  2       1    20
1.2  1       2    30
2.2  2       2    40
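For completeness, a minimal sketch of the reverse conversion (long back to wide), reusing the long_data object created above:

# Convert back to wide format: one value column per quarter
# (resulting columns: id, value.1, value.2)
wide_again <- reshape(long_data, direction = "wide",
                      idvar = "id", timevar = "quarter", v.names = "value")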

Example using tidyverse

Use dplyr for:
* Filtering and selecting rows/columns: filter(), select().
* Creating new variables: mutate().
* Reshaping: pivot_longer() and pivot_wider().

library(dplyr)
library(tidyr)

# Example dataset
data <- data.frame(
  id = 1:3,
  Q1 = c(10, 20, 30),
  Q2 = c(15, 25, 35)
)
print(data)
  id Q1 Q2
1  1 10 15
2  2 20 25
3  3 30 35
# Perform all operations sequentially in a single step 
transformed_data <- data %>%
  pivot_longer(cols = starts_with("Q"), names_to = "quarter", values_to = "value") %>%
  filter(value > 15) %>%  # Filter rows where value > 15
  mutate(value_scaled = value / max(value))  # Add a new scaled column
print(transformed_data)
# A tibble: 4 × 4
     id quarter value value_scaled
  <int> <chr>   <dbl>        <dbl>
1     2 Q1         20        0.571
2     2 Q2         25        0.714
3     3 Q1         30        0.857
4     3 Q2         35        1    

By mastering these data preparation steps, you ensure a clean and well-structured dataset, setting the stage for effective analysis and visualization.

5.3 Exploratory Data Analysis

5.3.1 Univariate Statistics

Numeric variables

This section equips you to explore univariate distributions of numeric variables, uncovering insights from centrality to variability with both statistical and visual techniques.

  • Histograms: Exploring a single variable involves visualizing its distribution to identify patterns such as central tendency, spread, and outliers. Histograms are one of the most effective tools for this.

Example: Using ggplot2 for histograms

library(ggplot2)

# Example dataset - variable with normal distribution
data <- data.frame(value = rnorm(1000, mean = 50, sd = 10))

# Create a histogram
ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Values", x = "Value", y = "Frequency")

  • Range: The range of a variable provides a simple measure of the spread of the data, given by its minimum and maximum values; their difference can be reported as a single number with diff(range(x)). Base R’s range() function returns the minimum and maximum:
cat("Min:", min(data$value), "Max:", max(data$value))
Min: 16.85981 Max: 78.74143
range_val <- range(data$value)

cat("Range (min, max):", range_val)
Range (min, max): 16.85981 78.74143
  • Central tendency measures: the mean, median, and mode describe the centre of the distribution (for a continuous variable, the mode is usually estimated rather than counted; see the sketch after the code below).
  • Dispersion measures: variance and standard deviation describe the spread of the data.
mean_val <- mean(data$value)
median_val <- median(data$value)
variance <- var(data$value)
std_dev <- sd(data$value)

# Print results
cat("Mean:", mean_val, "Median:", median_val, "Variance:", variance, "SD:", std_dev)
Mean: 49.93906 Median: 49.37172 Variance: 105.7793 SD: 10.28491
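The code above covers the mean, median, variance, and standard deviation. Base R has no built-in function for the mode of a set of values (mode() reports an object’s storage mode), so here is a minimal sketch estimating the mode of a continuous variable as the peak of a kernel density estimate:

# Estimate the mode as the location of the density peak
dens <- density(data$value)
mode_est <- dens$x[which.max(dens$y)]
cat("Estimated mode:", mode_est)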

Example: Plotting a Histogram and Marking the Mean

Overlay the mean on a histogram to visualize its position relative to the distribution.

ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  geom_vline(aes(xintercept = mean_val), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Histogram with Mean Marked", x = "Value", y = "Frequency")

  • Box plots: A box plot of a single variable is useful for visualising central tendency and dispersion measures at the same time. It conveys much of the same information as a histogram, though in a more analytical form: by default it shows the median (thick line), the first and third quartiles (the box, spanning the interquartile range), whiskers extending to the most extreme values within 1.5 times the interquartile range, and outliers (points beyond the whiskers).

Example: Using ggplot2 for a univariate box plot

library(ggplot2)

# Example dataset - variable with normal distribution
data <- data.frame(value = rnorm(1000, mean = 50, sd = 20))

# Create a box plot
ggplot(data, aes(x = value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Values", x = "Value")

Non-numeric variables

Univariate statistics for non-numeric (categorical) variables focus on summarizing and visualizing the distribution of categories, combining numeric summaries and visual insights. Here’s a breakdown with examples:

  • Frequency Tables: A frequency table lists the counts of each category, helping to understand the distribution.
# Example Dataset
data <- data.frame(category = sample(c("A", "B", "C"), size = 100, replace = TRUE))

# Frequency Table
table(data$category)

 A  B  C 
31 36 33 
  • Bar Plots: Bar plots visually represent the frequency distribution of categories.
library(ggplot2)

# Bar Plot
ggplot(data, aes(x = category)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Category Distribution", x = "Category", y = "Count")

  • Proportion Visualization: Proportions provide relative frequencies, useful for comparing categorical data.
# Proportion Table
prop_table <- prop.table(table(data$category))

# Pie Chart
ggplot(data, aes(x = "", fill = category)) +
  geom_bar(width = 1) +
  coord_polar("y") +
  labs(title = "Category Proportions")

  • Mode: The mode is the most frequently occurring category.
# Mode Calculation
mode_category <- names(which.max(table(data$category)))
cat("Mode:", mode_category)
Mode: B

5.3.2 Bivariate Statistics

Loading DartPoints dataset from archdata:

library(archdata)
data(DartPoints)
  • Scatter Plots: Visualize relationships between two numerical variables.
ggplot(DartPoints, aes(x = H.Length, y = Weight)) +
   geom_point() +
   labs(x = "Haft element length (mm)", y = "Weight (gm)")

  • Box Plots: Compare a numerical variable across the categories of a categorical variable. Ideal for comparing central tendency and dispersion measures between groups or categories.
ggplot(DartPoints, aes(x = Haft.Sh, y = H.Length)) +
   geom_boxplot() +
   scale_x_discrete(labels = c("Angular", "Excurvate", "Incurvate", "Recurvate", "Straight")) +
   labs(x = "Shape lateral haft element", y = "Haft element length (mm)")

  • Bar Plots with two variables (stacked): Compare counts or proportions of categorical variables.
ggplot(DartPoints, aes(x = Haft.Sh, fill = Should.Sh)) +
   geom_bar() +
   scale_x_discrete(labels = c("Angular", "Excurvate", "Incurvate", "Recurvate", "Straight")) +
   scale_fill_discrete(labels = c("Excurvate", "Incurvate", "Straight", "None")) +
   labs(x = "Shape lateral haft element", y = "Count", fill = "Shoulder shape")

  • Contingency Tables: Quick assessment of the distribution of counts among two categorical variables.
table(DartPoints$Haft.Sh, DartPoints$Should.Sh)
   
     E  I  S  X
  A  0  2  0  0
  E  0  3  6  0
  I  0 16  4  2
  R  0  0  1  0
  S  3 16 35  1
  • Mosaic Plots: A hybrid between contingency tables and stacked bar plots, mosaic plots are especially useful when the counts in the cross-category groups (i.e., the cells of the contingency table) are greater than zero. A mosaic plot represents the conditional relative frequency of a cell in the contingency table as the area of a rectangular tile. With shading, it can also visualise the deviation of each cell from its expected frequency (the residual) under a Pearson chi-squared or likelihood ratio G² test.
mosaicplot(table(DartPoints$Haft.Sh, DartPoints$Should.Sh),
           xlab = "Shape lateral haft element",
           ylab = "Shoulder shape",
           main = "",
           shade = TRUE)

  • Correlation: Measure the direction and strength of linear relationships between numerical variables. In base R, the function cor() will return the Pearson correlation coefficient by default.
cor(DartPoints$H.Length, DartPoints$Weight)
[1] 0.486397
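To also assess whether this correlation is statistically distinguishable from zero, a quick sketch using cor.test() (Pearson by default) on the same pair of variables:

# Correlation test: coefficient, confidence interval, and p-value
cor.test(DartPoints$H.Length, DartPoints$Weight)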
  • Simple Linear Regression: Estimate the parameters (intercept and slope) of the linear model relating two numerical variables by minimising the (squared) distances to the data points. Geometrically, such a model corresponds to a line in a two-dimensional plane.

With Base R:

model <- lm(Weight ~ H.Length, data = DartPoints)
summary(model)

Call:
lm(formula = Weight ~ H.Length, data = DartPoints)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9356 -2.3685 -0.7377  2.1173 19.6316 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.80129    1.35913   0.590    0.557    
H.Length     0.51019    0.09715   5.252 1.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.696 on 89 degrees of freedom
Multiple R-squared:  0.2366,    Adjusted R-squared:  0.228 
F-statistic: 27.58 on 1 and 89 DF,  p-value: 1.018e-06
plot(DartPoints$H.Length, DartPoints$Weight,
     xlab = "Haft element length (mm)", 
     ylab = "Weight (gm)")
abline(model, col = "red", lwd = 5)
# or
abline(a = model$coefficients["(Intercept)"], 
       b = model$coefficients["H.Length"],
       col = "blue", lty = 3, lwd = 5)

With ggplot2, a linear model can be added directly to a plot with geom_smooth(method = "lm"):

ggplot(DartPoints, aes(x = H.Length, y = Weight)) +
   geom_point() +
   geom_smooth(method = "lm", color = "red")
`geom_smooth()` using formula = 'y ~ x'

The function geom_smooth() adds by default a shaded area around the line, representing the confidence interval (see the arguments se and level in ?geom_smooth()).
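For instance, a minimal variation widening the confidence band from the default 95% to 99%:

ggplot(DartPoints, aes(x = H.Length, y = Weight)) +
   geom_point() +
   geom_smooth(method = "lm", color = "red", level = 0.99)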

  • Visualizing multiple correlation pairs

Quick visualisation of a correlation matrix using cor()

cor(DartPoints[, c("Length", "Width", "Thickness")])
             Length     Width Thickness
Length    1.0000000 0.7689932 0.5890989
Width     0.7689932 1.0000000 0.5459291
Thickness 0.5890989 0.5459291 1.0000000

Build a larger correlation matrix (only numerical variables and excluding cases with missing values) and plot it using corrplot() from the corrplot package:

library(corrplot)
corrplot 0.95 loaded
selected_variables <- c("Length", "Width", "Thickness", "B.Width", "J.Width", "H.Length", "Weight")

corr_matrix <- cor(DartPoints[, selected_variables],
                   use = "complete.obs")

corrplot(corr_matrix, method = "circle")

  • Hypothesis Testing:

t-Test: Compare the means of a numerical variable across two categories. Conventionally, a p-value < 0.05 is taken as sufficient evidence to reject the null hypothesis (no difference between the group means).

# consider only cases in the blade shape categories "E" (excurvate) and "S" (straight)
DartPoints_EandS <- subset(DartPoints, Blade.Sh == "E" | Blade.Sh == "S")

# apply the test for Weight between the two blade shape categories
t.test(Weight ~ Blade.Sh, data = DartPoints_EandS)

    Welch Two Sample t-test

data:  Weight by Blade.Sh
t = 1.6009, df = 72.799, p-value = 0.1137
alternative hypothesis: true difference in means between group E and group S is not equal to 0
95 percent confidence interval:
 -0.3652556  3.3471603
sample estimates:
mean in group E mean in group S 
       8.530952        7.040000 

In this case, the evidence is insufficient to demonstrate a consistent difference in weight between dart points with excurvate and straight blades.

Chi-Square Test: Test the independence of two categorical variables. Conventionally, a p-value < 0.05 is taken as sufficient evidence to reject the null hypothesis (the variables are independent).

chisq.test(table(DartPoints$Haft.Sh, DartPoints$Haft.Or))
Warning in chisq.test(table(DartPoints$Haft.Sh, DartPoints$Haft.Or)):
Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  table(DartPoints$Haft.Sh, DartPoints$Haft.Or)
X-squared = 101.85, df = 16, p-value = 1.556e-14

In this case, the evidence is sufficient, at the conventional 5% significance level, to conclude that haft shape and haft orientation are not independent (although note the warning: with several sparsely populated cells, the chi-squared approximation may be unreliable).

  • Quasi-multivariate approaches:

Visualise multiple subsets of a bivariate relationship by splitting plots by a categorical variable.

“Faceting” scatter plots with ggplot2:

ggplot(DartPoints, aes(x = Thickness, y = Weight)) +
   geom_point() +
   facet_wrap(~ Blade.Sh)

Visualise multiple pairwise bivariate relationships with pairs() (only numeric variables):

selected_variables <- c("Length", "Width", "Thickness", "B.Width", "J.Width", "H.Length", "Weight")

pairs(DartPoints[, selected_variables])

Example of further customisation:

reg <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x))
}

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
 usr <- par("usr"); on.exit(par(usr))
 par(usr = c(0, 1, 0, 1))
 r <- abs(cor(x, y, use = "complete.obs"))
 txt <- format(c(r, 0.123456789), digits = digits)[1]
 txt <- paste0(prefix, txt)
 text(0.5, 0.5, txt, cex = 1.1, font = 4)
}

pairs(DartPoints[, selected_variables], 
      upper.panel = reg, 
      lower.panel = panel.cor,
      cex = 1.5, pch = 19, col = adjustcolor(4, .4))

  • Logistic Regression: a statistical method used for binary classification problems. It estimates the probability of an observation belonging to one of two categories (a binary variable, or a categorical variable with two possible values) based on one or more independent (explanatory) variables. As with linear regression, the analysis involves estimating the parameters of an equation corresponding to a geometric object, in this case a sigmoid or logistic curve.
# consider only cases in the blade shape categories "E" (excurvate) and "S" (straight)
DartPoints_EandS <- subset(DartPoints, Blade.Sh == "E" | Blade.Sh == "S")

model <- glm(Blade.Sh ~ Length + Width + J.Width, data = DartPoints_EandS, family = "binomial")
summary(model)

Call:
glm(formula = Blade.Sh ~ Length + Width + J.Width, family = "binomial", 
    data = DartPoints_EandS)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.71823    1.49160   1.822   0.0684 .
Length      -0.08122    0.03421  -2.374   0.0176 *
Width        0.23553    0.09825   2.397   0.0165 *
J.Width     -0.25868    0.13064  -1.980   0.0477 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 113.63  on 81  degrees of freedom
Residual deviance: 104.02  on 78  degrees of freedom
AIC: 112.02

Number of Fisher Scoring iterations: 4

A logistic regression model with significant coefficients (p-values < 0.05) can be considered a reasonable classifier and could help us predict or infer the binary classification for new combinations of the explanatory variables. However, a truly good predictor will normally require a larger number of cases when building the model.

To visualize a logistic regression model in R, you can use ggplot2 to plot the predicted probabilities in relation to one explanatory variable (e.g., Length), holding the other explanatory variables fixed.

# Prepare data for visualization
# Generate a sequence of Length values and keep Width and J.Width fixed at their mean
new_data <- data.frame(
  Length = seq(min(DartPoints_EandS$Length), max(DartPoints_EandS$Length), length.out = 100),
  Width = mean(DartPoints_EandS$Width, na.rm = TRUE),
  J.Width = mean(DartPoints_EandS$J.Width, na.rm = TRUE)
)

# Add predicted probabilities: with a factor response, glm models the probability
# of the levels other than the first ("E"), i.e. here the probability of "S"
new_data$predicted_prob <- predict(model, newdata = new_data, type = "response")

# Plot the predicted probabilities
ggplot(new_data, aes(x = Length, y = predicted_prob)) +
   geom_line(color = "blue") +
   labs(
      title = "Predicted Probability of Blade Shape 'E' by Length",
      x = "Length",
      y = "Predicted Probability"
   ) +
   theme_minimal()

5.4 (EXTRA) Basic Machine Learning Concepts

  • Introduction to Machine Learning
    • Overview of supervised and unsupervised learning.
    • Example: Differentiating between regression and classification tasks.
  • K-Nearest Neighbors (KNN)
    • Understanding KNN for classification.
    • Example: Implementing KNN using class package.
  • Clustering
    • Introduction to clustering techniques (e.g., k-means clustering).
    • Example: Performing k-means clustering with kmeans and visualizing clusters (a combined KNN and k-means sketch follows this list).
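As a minimal sketch of both techniques listed above, assuming the built-in iris dataset as a stand-in example (KNN with the class package, clustering with base R’s kmeans()):

library(class)   # provides knn()

# K-Nearest Neighbors: classify iris species from the four measurements
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]

knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
table(knn_pred, iris$Species[-train_idx])  # confusion table on the test cases

# K-means clustering (unsupervised): group the same measurements into 3 clusters
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
     xlab = "Petal length", ylab = "Petal width")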

5.5 (EXTRA) Model Evaluation

  • Train/Test Split
    • Splitting data into training and testing sets.
    • Example: Using caret package to split data and train models.
  • Model Performance Metrics
    • Evaluating model performance: accuracy, confusion matrix, ROC curve.
    • Example: Calculating and interpreting metrics using caret and pROC (a combined sketch follows this list).
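A minimal sketch of these evaluation steps, assuming the caret and pROC packages are installed, re-creating the two-category DartPoints subset from Section 5.3.2 and re-fitting the logistic regression on a training split:

library(caret)
library(pROC)

# Two-category subset (excurvate vs straight blades), complete cases only
dp <- subset(DartPoints, Blade.Sh %in% c("E", "S"),
             select = c(Blade.Sh, Length, Width, J.Width))
dp <- na.omit(droplevels(dp))

# Train/test split (80/20), stratified by the outcome
set.seed(123)
in_train <- createDataPartition(dp$Blade.Sh, p = 0.8, list = FALSE)
train_set <- dp[in_train, ]
test_set  <- dp[-in_train, ]

# Fit a logistic regression on the training set only
fit <- glm(Blade.Sh ~ Length + Width + J.Width, data = train_set, family = "binomial")

# Predicted probabilities and predicted classes on the test set
probs <- predict(fit, newdata = test_set, type = "response")
pred_class <- factor(ifelse(probs > 0.5, "S", "E"), levels = levels(dp$Blade.Sh))

# Accuracy and confusion matrix
confusionMatrix(pred_class, test_set$Blade.Sh)

# ROC curve and area under the curve
roc_obj <- roc(response = test_set$Blade.Sh, predictor = probs)
plot(roc_obj)
auc(roc_obj)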
Hands-on Practice
  • Building a Simple Data Science Workflow: data import, cleaning, exploration, and basic regression and hypothesis testing.
    1. Import a dataset of your choosing. Consider the ones included in the archdata package (install and load the package, then consult ?archdata).
    2. Check if the given dataset has any missing values, normally appearing as NA in R. Notice that some databases might use specific conventions, such as coding missing values as an odd extreme number (e.g., -999) or a specific text (e.g., indeterminate).
    3. Get a general overview of the data using one or more options shown above, depending on the dataset structure and variable types. For example, using pairs() with numeric variables or applying table() to categorical variables.
    4. Consider visualising and testing the relationships between the most interesting variables. For example, does the distribution of the numerical variable A vary across the categories of the categorical variable B? Use a combination of plots, pair-wise simple linear regression models, and t-tests or Chi-Square tests (a possible starting point is sketched after this list).
  • Q&A and Troubleshooting
    • Addressing challenges in implementing basic data science methods in R.
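A possible starting point for the hands-on practice, assuming the DartPoints dataset from archdata (any other dataset in the package works equally well):

library(archdata)
data(DartPoints)

# 1. Import: here a built-in dataset; see ?DartPoints for variable descriptions
str(DartPoints)

# 2. Missing values: count NAs per column
colSums(is.na(DartPoints))

# 3. General overview: numeric variables pairwise, categorical variables as counts
pairs(DartPoints[, c("Length", "Width", "Thickness", "Weight")])
table(DartPoints$Blade.Sh)

# 4. Relationships: e.g., does Weight vary across blade shapes?
boxplot(Weight ~ Blade.Sh, data = DartPoints)
summary(lm(Weight ~ Length, data = DartPoints))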