data("USArrests")
5 Data Science Workflow
5.1 Introduction to Data Science Workflow
In a typical data science workflow, we move through five primary stages: data collection, data cleaning, exploration, modeling, and communication. These steps collectively transform raw data into actionable insights driven by research questions while also enabling efficient and reproducible data science practices.
Data Collection: This is the first step, where data is gathered from various sources, such as databases, APIs, or web scraping. In R, packages like readr and httr facilitate efficient data import from structured files (e.g., CSV, Excel) and web sources.
Data Cleaning: Data cleaning involves preparing raw data for analysis, handling missing values, correcting data types, and dealing with outliers. Tools like dplyr and tidyr are often used in R to perform these operations, enabling tasks like removing duplicates, imputing missing data, and restructuring data into a “tidy” format suitable for analysis.
Data Exploration: Exploratory Data Analysis (EDA) is the phase where we examine the data’s characteristics, uncover patterns, and form hypotheses. Visualizations (using pairs() or a wide range of plot types available in Base R and ggplot2) and summary statistics (summary(), skimr) help in understanding the data distribution, relationships between variables, and identifying any anomalies.
Modeling: At this stage, statistical or machine learning models are developed to predict or explain outcomes. In R, packages like caret and tidymodels streamline the modelling process, from splitting data and selecting models to tuning hyperparameters and evaluating performance.
Communication: The final stage focuses on presenting findings clearly, often through reports, dashboards, or interactive applications. Using R Markdown for reports or Shiny for interactive applications enables data scientists to effectively communicate insights to stakeholders.
Quick example in R: hypothetical local dataset “house_prices.csv”
Let’s consider a simple example of a data science project where we predict house prices based on available features:
- Data Collection: Load a dataset such as house_prices.csv using read_csv().
- Data Cleaning: Use dplyr to handle missing values (e.g., mutate(across(where(is.numeric), function(x) replace(x, is.na(x), median(x, na.rm = TRUE)))) to replace them with median values) and convert categorical variables to factors.
- Data Exploration: Visualize relationships between features like square footage and price using ggplot(data = houses) + geom_point(aes(x = sqft, y = price)).
- Modeling: Create a linear model with lm(price ~ sqft + num_bedrooms, data = houses).
- Communication: Report results in an R Markdown document, showing model coefficients, predictions, and visualizations to explain findings. A minimal end-to-end sketch of these steps follows below.
Quick example in R: canonical dataset USArrests
Let’s consider a simple example of a data science project. Suppose we aim to determine the main factors in criminality using the canonical dataset USArrests.
- Data Collection: Load the dataset directly from R’s built-in datasets.
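For a built-in dataset, this is a single call:

# Load the built-in dataset
data("USArrests")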
- Data Cleaning: Check for any missing values or potential outliers that might represent data entry errors.
# Adding a missing value at row 1, column 1
USArrests[1, 1] <- NA

# Check for missing values
any(is.na(USArrests))
[1] TRUE

# remove missing value
USArrests_clean <- na.omit(USArrests)
- Data Exploration: Visualize and summarize data to understand patterns. For instance, we might be interested in the value distribution of murder arrests.
# Quick summary of the data distribution
summary(USArrests_clean)
Murder Assault UrbanPop Rape
Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
1st Qu.: 4.000 1st Qu.:109.0 1st Qu.:54.00 1st Qu.:14.90
Median : 7.200 Median :159.0 Median :66.00 Median :20.00
Mean : 7.678 Mean :169.4 Mean :65.69 Mean :21.23
3rd Qu.:11.100 3rd Qu.:249.0 3rd Qu.:78.00 3rd Qu.:26.20
Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
# Visualize the distribution of murder arrest rates with a histogram and a box plot
layout(matrix(1:2, nrow = 2), heights = c(2, 1.5))
par(mar = c(0, 4, 4, 2) + 0.1)
hist(USArrests_clean$Murder, main="Distribution of counts of murder arrests in 'USArrests'", xaxt = "n")
par(mar = c(2, 4, 0, 2) + 0.1)
boxplot(USArrests_clean$Murder, horizontal = TRUE)
This combined histogram and box plot gives a quick view of the central tendency and spread of murder arrests across the dataset. For example, we can already observe that the mean and median of the distribution (7.677551 and 7.2, respectively) lie in the lower half of the range (0.8, 17.4), which also implies a longer right tail of the distribution (i.e., the difference between the maximum and the mean is greater than the difference between the mean and the minimum).
- Modelling: Create a simple model to analyse relationships between variables. For instance, let’s assume we’re exploring how UrbanPop (population percentage in urban areas) might relate to Murder.
model <- lm(Murder ~ UrbanPop, data = USArrests_clean)
summary(model)
Call:
lm(formula = Murder ~ UrbanPop, data = USArrests_clean)
Residuals:
Min 1Q Median 3Q Max
-6.3323 -3.7628 -0.7344 2.9121 9.8656
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.02650 2.90210 2.077 0.0433 *
UrbanPop 0.02513 0.04315 0.582 0.5630
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.359 on 47 degrees of freedom
Multiple R-squared: 0.007167, Adjusted R-squared: -0.01396
F-statistic: 0.3393 on 1 and 47 DF, p-value: 0.563
The results of a simple linear regression analysis provide a basic understanding of how one variable might be related to another, often interpreted as a dependent, presumably causal relationship. In this case, there is no statistically significant evidence that urban population percentage is a relevant factor in murder arrests in our dataset.
- Communication: Present findings in an R Markdown document, incorporating both visualizations and model summaries. You can interpret results to suggest whether murder arrests are more common in highly populated regions.
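As an illustration, a minimal hypothetical report file (here called usarrests_report.Rmd; the file name and chunk contents are assumptions) could look like the sketch below and be rendered with rmarkdown::render("usarrests_report.Rmd"):

---
title: "Urban population and murder arrests in USArrests"
output: html_document
---

```{r model}
data("USArrests")
model <- lm(Murder ~ UrbanPop, data = USArrests)
summary(model)
```

```{r plot}
plot(USArrests$UrbanPop, USArrests$Murder,
     xlab = "Urban population (%)", ylab = "Murder arrests (per 100,000)")
abline(model, col = "red")
```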
5.2 Data Science Workflow in R: Data Import and Preparation
5.2.1 Importing Data
A critical step in any data science workflow is importing data from external sources. R provides robust tools for importing data from a variety of formats:
- CSV files: Use the read.csv() function from base R.
- Excel files: Leverage the readxl package with read_excel().
- Databases: Employ the DBI package to connect to relational databases and fetch data using SQL queries.
Examples using base R, readxl, and DBI + RSQLite
# Load necessary libraries
library(readxl)
library(DBI)
library(RSQLite)
# Read a CSV file
csv_data <- read.csv("data.csv")

# Read an Excel file
excel_data <- readxl::read_excel("data.xlsx")

# Connect to a SQLite database and fetch data
con <- DBI::dbConnect(RSQLite::SQLite(), "data.db")
db_data <- DBI::dbGetQuery(con, "SELECT * FROM table_name")
DBI::dbDisconnect(con)
5.2.2 Data Cleaning
Before analysis, raw data often requires cleaning to address issues like missing values, duplicates, and inconsistencies.
Example using Base R
- Removing Missing Values
Use na.omit() to remove rows with missing values or is.na() to identify them.
# Example data
data <- data.frame(A = c(1, 2, NA, 4), B = c("x", NA, "y", "z"))
print(data)
A B
1 1 x
2 2 <NA>
3 NA y
4 4 z
# Remove rows with NA
clean_data <- na.omit(data)
print(clean_data)
A B
1 1 x
4 4 z
- Handling Duplicates
Use duplicated() to identify duplicate rows or unique() to retain only unique rows.
data <- data.frame(A = c(1, 2, 2, 4), B = c("x", "y", "y", "z"))
print(data)
A B
1 1 x
2 2 y
3 2 y
4 4 z
# Remove duplicate rows
data_unique <- data[!duplicated(data), ]
print(data_unique)
A B
1 1 x
2 2 y
4 4 z
- Replacing Values
Replace specific values with ifelse() or direct indexing.
data <- data.frame(A = c(1, 2, 999, 4), B = c("x", "y", "z", "999"))
print(data)
A B
1 1 x
2 2 y
3 999 z
4 4 999
# Replace 999 with NA
data[data == 999] <- NA
print(data)
A B
1 1 x
2 2 y
3 NA z
4 4 <NA>
Example using tidyverse
Key functions include:
- tidyr: Tools like fill() (fill missing values) and drop_na() (remove rows with NAs).
- dplyr: Functions like distinct() to remove duplicates and mutate() to fix inconsistencies.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)
# Example dataset
data <- data.frame(id = c(1, 2, 2, 3, NA), value = c(NA, "A", "A", "B", "C"))
print(data)
id value
1 1 <NA>
2 2 A
3 2 A
4 3 B
5 NA C
# Clean the data
cleaned_data <- data %>%
  drop_na(id) %>%                   # Remove rows with missing IDs
  distinct() %>%                    # Remove duplicates
  fill(value, .direction = "down")  # Fill missing values downward
print(cleaned_data)
id value
1 1 <NA>
2 2 A
3 3 B
5.2.3 Data Transformation
Transforming data is essential for reshaping and preparing it for analysis.
Example using Base R
Base R provides versatile and efficient tools for cleaning and transforming data.
A few example operations common in data science workflows are:
- Filtering Rows
Subset data using logical conditions.
data <- data.frame(A = 1:5, B = letters[1:5])
print(data)
A B
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
# Filter rows where A > 3
filtered_data <- data[data$A > 3, ]
print(filtered_data)
A B
4 4 d
5 5 e
- Selecting Columns
Use indexing to select specific columns.
# Select column A
<- data[, "A", drop = FALSE]
selected_columns print(selected_columns)
A
1 1
2 2
3 3
4 4
5 5
NOTE: the argument drop = FALSE ensures that the original data frame structure is not lost in the process (run data[, "A"] to compare).
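For comparison, without drop = FALSE the single selected column is simplified to a plain vector:

# With and without drop = FALSE
class(data[, "A"])                # "integer": simplified to a vector
class(data[, "A", drop = FALSE])  # "data.frame": structure preserved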
- Adding or Modifying Columns
Use the $ operator or indexing to create or modify columns.
data$new_col <- data$A * 2
print(data)
A B new_col
1 1 a 2
2 2 b 4
3 3 c 6
4 4 d 8
5 5 e 10
- Reshaping Data
Use reshape() to pivot data between wide and long formats. In the wide format, each variable corresponds to a column (table format), while the long format assigns variable-value pairs to different rows, retaining one or more identifier variables from the wide format.
# Example wide format
data <- data.frame(id = 1:2, Q1 = c(10, 20), Q2 = c(30, 40))
# Convert to long format
long_data <- reshape(data, direction = "long", varying = list(c("Q1", "Q2")), v.names = "value", timevar = "quarter")
print(long_data)
id quarter value
1.1 1 1 10
2.1 2 1 20
1.2 1 2 30
2.2 2 2 40
Example using tidyverse
Use dplyr and tidyr for:
* Filtering and selecting rows/columns: filter(), select().
* Creating new variables: mutate().
* Reshaping: pivot_longer() and pivot_wider().
library(dplyr)
library(tidyr)
# Example dataset
data <- data.frame(
  id = 1:3,
  Q1 = c(10, 20, 30),
  Q2 = c(15, 25, 35)
)
print(data)
id Q1 Q2
1 1 10 15
2 2 20 25
3 3 30 35
# Perform all operations sequentially in a single step
transformed_data <- data %>%
  pivot_longer(cols = starts_with("Q"), names_to = "quarter", values_to = "value") %>%
filter(value > 15) %>% # Filter rows where value > 15
mutate(value_scaled = value / max(value)) # Add a new scaled column
print(transformed_data)
# A tibble: 4 × 4
id quarter value value_scaled
<int> <chr> <dbl> <dbl>
1 2 Q1 20 0.571
2 2 Q2 25 0.714
3 3 Q1 30 0.857
4 3 Q2 35 1
By mastering these data preparation steps, you ensure a clean and well-structured dataset, setting the stage for effective analysis and visualization.
5.3 Exploratory Data Analysis
5.3.1 Univariate Statistics
Numeric variables
This section equips you to explore univariate distributions of numeric variables, uncovering insights from centrality to variability with both statistical and visual techniques.
- Histograms: Exploring a single variable involves visualizing its distribution to identify patterns such as central tendency, spread, and outliers. Histograms are one of the most effective tools for this.
Example: Using ggplot2 for histograms
library(ggplot2)
# Example dataset - variable with normal distribution
data <- data.frame(value = rnorm(1000, mean = 50, sd = 10))
# Create a histogram
ggplot(data, aes(x = value)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Values", x = "Value", y = "Frequency")
- Range: The range of a variable provides a simple measure of the spread of the data. It is calculated as the difference between the maximum and minimum values. Base R already has a function named range(), which returns the minimum and maximum values from which that difference can be computed:
cat("Min:", min(data$value), "Max:", max(data$value))
Min: 16.85981 Max: 78.74143
range_val <- range(data$value)

cat("Range (Min, Max):", range_val)
Range (Min, Max): 16.85981 78.74143
cat("Difference (Max - Min):", diff(range_val))
Difference (Max - Min): 61.88162
- Central tendency measures: the mean, median, and mode describe the centre of the distribution.
- Dispersion measures: variance and standard deviation describe the spread of the data.
mean_val <- mean(data$value)
median_val <- median(data$value)
variance <- var(data$value)
std_dev <- sd(data$value)
# Print results
cat("Mean:", mean_val, "Median:", median_val, "Variance:", variance, "SD:", std_dev)
Mean: 49.93906 Median: 49.37172 Variance: 105.7793 SD: 10.28491
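Base R has no dedicated function for the mode of a numeric variable; a minimal sketch, assuming we approximate it by the most frequent value after rounding (one of several possible definitions):

# Approximate the mode as the most frequent rounded value
rounded <- round(data$value)
mode_val <- as.numeric(names(which.max(table(rounded))))
cat("Approximate mode:", mode_val)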
Example: Plotting a Histogram and Marking the Mean
Overlay the mean on a histogram to visualize its position relative to the distribution.
ggplot(data, aes(x = value)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
geom_vline(aes(xintercept = mean_val), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Histogram with Mean Marked", x = "Value", y = "Frequency")
- Box plots: A box plot of a single variable can be useful to visualise central tendency and dispersion measures at the same time. It offers the same information as a histogram, though in a more analytical form, by default plotting the median (thick line), the 1st and 3rd quartiles or interquartile range (box), whiskers extending up to 1.5 times the interquartile range (lines), and outliers (points beyond the whiskers).
Example: Using ggplot2 for a univariate box plot
library(ggplot2)
# Example dataset - variable with normal distribution
data <- data.frame(value = rnorm(1000, mean = 50, sd = 20))

# Create a box plot
ggplot(data, aes(x = value)) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(title = "Distribution of Values", x = "Value", y = "Frequency")
Non-numeric variables
Univariate statistics for non-numeric (categorical) variables focus on summarizing and visualizing the distribution of categories, emphasizing both numeric summaries and visual insights. Here’s a breakdown with examples:
- Frequency Tables: A frequency table lists the counts of each category, helping to understand the distribution.
# Example Dataset
data <- data.frame(category = sample(c("A", "B", "C"), size = 100, replace = TRUE))
# Frequency Table
table(data$category)
A B C
31 36 33
- Bar Plots: Bar plots visually represent the frequency distribution of categories.
library(ggplot2)
# Bar Plot
ggplot(data, aes(x = category)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Category Distribution", x = "Category", y = "Count")
- Proportion Visualization: Proportions provide relative frequencies, useful for comparing categorical data.
# Proportion Table
prop_table <- prop.table(table(data$category))
# Pie Chart
ggplot(data, aes(x = "", fill = category)) +
geom_bar(width = 1) +
coord_polar("y") +
labs(title = "Category Proportions")
- Mode: The mode is the most frequently occurring category.
# Mode Calculation
mode_category <- names(which.max(table(data$category)))
cat("Mode:", mode_category)
Mode: B
5.3.2 Bivariate statistics
Loading the DartPoints dataset from the archdata package:
library(archdata)
data(DartPoints)
- Scatter Plots: Visualize relationships between two numerical variables.
ggplot(DartPoints, aes(x = H.Length, y = Weight)) +
geom_point() +
labs(x = "Haft element length (mm)", y = "Weight (gm)")
- Box Plots: Compare a numerical variable across categories of a categorical variable. Ideal for comparing central tendency and dispersion measures between groups or categories.
ggplot(DartPoints, aes(x = Haft.Sh, y = H.Length)) +
geom_boxplot() +
scale_x_discrete(labels = c("Angular", "Excurvate", "Incurvate", "Recurvate", "Straight")) +
labs(x = "Shape lateral haft element", y = "Haft element length (mm)")
- Bar Plots with two variables (stacked): Compare counts or proportions of categorical variables.
ggplot(DartPoints, aes(x = Haft.Sh, fill = Should.Sh)) +
geom_bar() +
scale_x_discrete(labels = c("Angular", "Excurvate", "Incurvate", "Recurvate", "Straight")) +
scale_fill_discrete(labels = c("Excurvate", "Incurvate", "Straight", "None")) +
labs(x = "Shape lateral haft element", y = "Count", fill = "Shoulder shape")
- Contingency Tables: Quick assessment of the distribution of counts among two categorical variables.
table(DartPoints$Haft.Sh, DartPoints$Should.Sh)
E I S X
A 0 2 0 0
E 0 3 6 0
I 0 16 4 2
R 0 0 1 0
S 3 16 35 1
- Mosaic Plots: A hybrid between contingency tables and stacked bar plots, mosaic plots are especially useful when counts in cross-category groups (i.e., cells in the contingency table) are greater than 0. A mosaic plot represents the conditional relative frequency of a cell in the contingency table as the area of a rectangular tile. With shading, it can also be used to visualise the deviation from the expected frequency (residuals) of a Pearson Chi-squared or Likelihood Ratio G2 test.
mosaicplot(table(DartPoints$Haft.Sh, DartPoints$Should.Sh),
xlab = "Shape lateral haft element",
ylab = "Shoulder shape",
main = "",
shade = TRUE)
- Correlation: Measure the direction and strength of linear relationships between numerical variables. In base R, the function cor() will return the Pearson correlation coefficient by default.
cor(DartPoints$H.Length, DartPoints$Weight)
[1] 0.486397
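If a significance test is also needed, the base R function cor.test() returns the same coefficient together with a confidence interval and p-value:

cor.test(DartPoints$H.Length, DartPoints$Weight)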
- Simple linear Regression: Calculate the parameters (intercept, slope) of the linear model that minimises the distance to the data points of two numerical variables. Geometrically, such a model is equivalent to a line in a two-dimensional plane.
With Base R:
model <- lm(Weight ~ H.Length, data = DartPoints)
summary(model)
Call:
lm(formula = Weight ~ H.Length, data = DartPoints)
Residuals:
Min 1Q Median 3Q Max
-7.9356 -2.3685 -0.7377 2.1173 19.6316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.80129 1.35913 0.590 0.557
H.Length 0.51019 0.09715 5.252 1.02e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.696 on 89 degrees of freedom
Multiple R-squared: 0.2366, Adjusted R-squared: 0.228
F-statistic: 27.58 on 1 and 89 DF, p-value: 1.018e-06
plot(DartPoints$H.Length, DartPoints$Weight,
xlab = "Haft element length (mm)",
ylab = "Weight (gm)")
abline(model, col = "red", lwd = 5)
# or
abline(a = model$coefficients["(Intercept)"],
b = model$coefficients["H.Length"],
col = "blue", lty = 3, lwd = 5)
With ggplot2, a linear model can be added directly to a plot with geom_smooth(method = "lm"):
ggplot(DartPoints, aes(x = H.Length, y = Weight)) +
geom_point() +
geom_smooth(method = "lm", color = "red")
`geom_smooth()` using formula = 'y ~ x'
The function geom_smooth() will by default add a shaded area around the line, representing the confidence interval (see arguments se and level in ?geom_smooth).
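For instance, a minimal sketch turning the confidence band off:

ggplot(DartPoints, aes(x = H.Length, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE)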
- Visualizing multiple correlation pairs
Quick computation of a correlation matrix using cor():
cor(DartPoints[, c("Length", "Width", "Thickness")])
Length Width Thickness
Length 1.0000000 0.7689932 0.5890989
Width 0.7689932 1.0000000 0.5459291
Thickness 0.5890989 0.5459291 1.0000000
Build a larger correlation matrix (only numerical variables and excluding cases with missing values) and plot it using corrplot() from the corrplot package:
library(corrplot)
corrplot 0.95 loaded
<- c("Length", "Width", "Thickness", "B.Width", "J.Width", "H.Length", "Weight")
selected_variables
<- cor(DartPoints[, selected_variables],
corr_matrix use = "complete.obs")
corrplot(corr_matrix, method = "circle")
- Hypothesis Testing:
t-Test: Compare the means of a numerical variable across two categories. Conventionally, a p-value < 0.05 is taken as evidence against the null hypothesis (that there is no difference between the group means).
# consider only cases in blade shape categories "E" (excurvate) and "S" (straight)
DartPoints_IandS <- subset(DartPoints, Blade.Sh == "E" | Blade.Sh == "S")

# apply test for Weight between the two blade shape categories
t.test(Weight ~ Blade.Sh, data = DartPoints_IandS)
Welch Two Sample t-test
data: Weight by Blade.Sh
t = 1.6009, df = 72.799, p-value = 0.1137
alternative hypothesis: true difference in means between group E and group S is not equal to 0
95 percent confidence interval:
-0.3652556 3.3471603
sample estimates:
mean in group E mean in group S
8.530952 7.040000
In this case, the evidence is insufficient for demonstrating that there is a consistent difference in weight between dart points with excurvate and straight blades.
Chi-Square Test: Test independence between categorical variables. Conventionally, a p-value < 0.05 is taken as evidence against the null hypothesis (that the variables are independent).
chisq.test(table(DartPoints$Haft.Sh, DartPoints$Haft.Or))
Warning in chisq.test(table(DartPoints$Haft.Sh, DartPoints$Haft.Or)):
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: table(DartPoints$Haft.Sh, DartPoints$Haft.Or)
X-squared = 101.85, df = 16, p-value = 1.556e-14
In this case, the evidence indicates, at the conventional 0.05 significance level, that haft shape and orientation are not independent (note, however, the warning that the chi-squared approximation may be unreliable given the small counts in some cells).
- Quasi-multivariate approaches:
Visualise multiple subsets of a bivariate relationship by splitting plots by a categorical variable.
“Faceting” scatter plots with ggplot2:
ggplot(DartPoints, aes(x = Thickness, y = Weight)) +
geom_point() +
facet_wrap(~ Blade.Sh)
Visualise multiple pairwise bivariate relationships with pairs() (only numeric variables):
<- c("Length", "Width", "Thickness", "B.Width", "J.Width", "H.Length", "Weight")
selected_variables
pairs(DartPoints[, selected_variables])
Example of further customisation:
reg <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x))
}

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "complete.obs"))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  text(0.5, 0.5, txt, cex = 1.1, font = 4)
}
pairs(DartPoints[, selected_variables],
upper.panel = reg,
lower.panel = panel.cor,
cex = 1.5, pch = 19, col = adjustcolor(4, .4))
- Logistic Regression: a statistical method used for binary classification problems. It estimates the probability of an observation belonging to one of two categories (a binary variable, or a categorical variable with two possible values) based on one or more independent (explanatory) variables. As with linear regression, the analysis involves calculating the parameters of an equation corresponding to a geometric object, in this case a sigmoid or logistic curve.
# consider only cases in blade shape categories "E" (excurvate) and "S" (straight)
DartPoints_IandS <- subset(DartPoints, Blade.Sh == "E" | Blade.Sh == "S")

model <- glm(Blade.Sh ~ Length + Width + J.Width, data = DartPoints_IandS, family = "binomial")
summary(model)
Call:
glm(formula = Blade.Sh ~ Length + Width + J.Width, family = "binomial",
data = DartPoints_IandS)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.71823 1.49160 1.822 0.0684 .
Length -0.08122 0.03421 -2.374 0.0176 *
Width 0.23553 0.09825 2.397 0.0165 *
J.Width -0.25868 0.13064 -1.980 0.0477 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 113.63 on 81 degrees of freedom
Residual deviance: 104.02 on 78 degrees of freedom
AIC: 112.02
Number of Fisher Scoring iterations: 4
A logistic regression model with significant coefficients (p-values < 0.05) can be considered a reasonable classifier, and could help us predict or infer the binary classification for additional combinations of explanatory variables. However, a truly good predictor will normally require a larger number of cases when building the model.
To visualize a logistic regression model in R, you can use ggplot2 to create a curve showing the predicted probabilities alongside the observed binary outcomes in relation to one explanatory variable (e.g., Length).
# Prepare data for visualization
# Generate a sequence of Length values and keep Width and J.Width fixed at their mean
new_data <- data.frame(
  Length = seq(min(DartPoints_IandS$Length), max(DartPoints_IandS$Length), length.out = 100),
Width = mean(DartPoints_IandS$Width, na.rm = TRUE),
J.Width = mean(DartPoints_IandS$J.Width, na.rm = TRUE)
)
# Add predicted probabilities
new_data$predicted_prob <- predict(model, newdata = new_data, type = "response")
# Plot the predicted probabilities
ggplot(new_data, aes(x = Length, y = predicted_prob)) +
geom_line(color = "blue") +
labs(
title = "Predicted Probability of Blade Shape 'E' by Length",
x = "Length",
y = "Predicted Probability"
  ) +
  theme_minimal()
5.4 (EXTRA) Basic Machine Learning Concepts
- Introduction to Machine Learning
  - Overview of supervised and unsupervised learning.
  - Example: Differentiating between regression and classification tasks.
- K-Nearest Neighbors (KNN)
  - Understanding KNN for classification.
  - Example: Implementing KNN using the class package (see the sketch after this list).
- Clustering
  - Introduction to clustering techniques (e.g., k-means clustering).
  - Example: Performing k-means clustering with kmeans() and visualizing clusters (see the sketch after this list).
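A minimal, hedged sketch of the KNN and k-means examples mentioned above, using the built-in iris dataset as stand-in data (the dataset and parameter choices are assumptions, not part of the course materials):

library(class)  # provides knn()

# K-Nearest Neighbors: classify iris species from the four measurements
set.seed(123)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, 1:4]
test <- iris[-train_idx, 1:4]
knn_pred <- knn(train, test, cl = iris$Species[train_idx], k = 3)
table(predicted = knn_pred, observed = iris$Species[-train_idx])

# K-means clustering on the same measurements, asking for 3 clusters
km <- kmeans(iris[, 1:4], centers = 3)
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
     xlab = "Petal length", ylab = "Petal width")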
5.5 (EXTRA) Model Evaluation
- Train/Test Split
  - Splitting data into training and testing sets.
  - Example: Using the caret package to split data and train models (see the sketch after this list).
- Model Performance Metrics
  - Evaluating model performance: accuracy, confusion matrix, ROC curve.
  - Example: Calculating and interpreting metrics using caret and pROC (see the sketch after this list).
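A minimal, hedged sketch of these examples, using a binary subset of iris as stand-in data (the dataset, model formula, and parameter choices are assumptions, not part of the course materials):

library(caret)
library(pROC)

# Binary classification problem: versicolor vs virginica
set.seed(123)
iris2 <- droplevels(subset(iris, Species != "setosa"))

# Train/test split with caret
idx <- createDataPartition(iris2$Species, p = 0.7, list = FALSE)
train_data <- iris2[idx, ]
test_data <- iris2[-idx, ]

# Train a simple classifier (logistic regression) through caret
fit <- train(Species ~ Petal.Length + Petal.Width, data = train_data,
             method = "glm", family = "binomial")

# Accuracy and confusion matrix on the test set
pred_class <- predict(fit, newdata = test_data)
confusionMatrix(pred_class, test_data$Species)

# ROC curve and AUC with pROC, using predicted probabilities
pred_prob <- predict(fit, newdata = test_data, type = "prob")[, "virginica"]
roc_obj <- roc(test_data$Species, pred_prob)
plot(roc_obj)
auc(roc_obj)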