4 Best practices in programming
4.1 Code organisation
4.1.1 Modular programming
Importance of modularity
Breaking down your code into functions and modules enhances readability, maintainability, and reusability. This approach helps isolate specific tasks and allows for easier debugging and testing.
The Single-responsibility Principle (SRP) is a key concept in software design that ensures each function or module in your code has one, and only one, reason to change. For students learning R, this means breaking down your code into smaller, well-defined functions where each function does one specific task. This makes your code easier to understand, test, and maintain (“Single-Responsibility Principle - Wikipedia” n.d.).
For example, instead of writing one long script that loads data, cleans it, and plots it, you can write separate functions for each task:
- load_data() handles loading the data.
- clean_data() takes care of data cleaning.
- plot_data() is responsible for plotting.
By adhering to SRP, you reduce the chance of introducing bugs when modifying or extending your code. If the way data is loaded changes, you only need to adjust load_data() without worrying about unintended side effects on clean_data() or plot_data().
Example: Creating in-script custom functions
# Custom function to calculate mean
calculate_mean <- function(data) {
  mean(data, na.rm = TRUE)
}

# Usage
numbers <- c(1, 2, 3, NA, 5)
calculate_mean(numbers)
Example: Wrapping messy code into clear functions
Messy Code Example:
# Messy and hard to follow
data <- read.csv("data.csv")
data$cleaned <- na.omit(data$column)
data$scaled <- (data$cleaned - mean(data$cleaned)) / sd(data$cleaned)
hist(data$scaled)
Wrapped in clearly defined functions:
# Function to load data
load_data <- function(filepath) {
  read.csv(filepath)
}

# Function to clean data
clean_data <- function(data, column) {
  na.omit(data[[column]])
}

# Function to scale data
scale_data <- function(data) {
  (data - mean(data)) / sd(data)
}

# Function to plot histogram
plot_histogram <- function(data) {
  hist(data)
}

# Main execution sequence
main <- function() {
  raw_data <- load_data("data.csv")
  cleaned_data <- clean_data(raw_data, "column")
  scaled_data <- scale_data(cleaned_data)
  plot_histogram(scaled_data)
}

# Run the main function
main()
By organizing the operations into clear, single-responsibility functions, the code becomes more readable, maintainable, and easier to debug.
Example: Creating and importing custom R scripts
Creating a script: Save the following function in a file named utility_functions.R:
# utility_functions.R
calculate_sum <- function(data) {
  sum(data, na.rm = TRUE)
}
Importing the script:
source("utility_functions.R") # loads all functions defined inside the script <- c(1, 2, 3, NA, 5) numbers calculate_sum(numbers)
4.1.2 Code structuring
Structuring a data science project
A well-organized project structure separates code, data, and outputs, which facilitates efficient project management.
Example: Setting up a basic project structure in R and RStudio
Folder Structure:
my_project/
├── data/
│   └── raw_data.csv
├── scripts/
│   ├── 01_load_data.R
│   ├── 02_analyse_data.R
│   └── 03_visualise_data.R
├── outputs/
│   └── analysis_results.csv
├── workflow.R
└── my_project.Rproj
Script Example:
01_load_data.R
load_data <- function(file_name, dir = "data", save_rds = FALSE) {
  file_path <- paste(dir, file_name, sep = "/")
  # load raw data
  raw_data <- read.csv(paste(file_path, "csv", sep = "."))
  if (save_rds) {
    # save a copy as an R dataset (.rds)
    saveRDS(raw_data, file = paste(file_path, "rds", sep = "."))
  }
  # return the loaded data so it can be assigned in workflow.R
  raw_data
}
02_analyse_data.R
analyse_data <- function(raw_data) {
  summary_stats <- summary(raw_data)
  print(summary_stats)
  # Save the results to an output file
  write.csv(summary_stats, "outputs/analysis_results.csv")
}
03_visualise_data.R
visualise_data <- function(dataset, plot_name, dir = "outputs", width = 480, height = 480) {
  file_path <- paste(dir, plot_name, sep = "/")
  file_name <- paste(file_path, "png", sep = ".")
  png(file_name, width = width, height = height)
  pairs(dataset) # matrix of scatterplots
  dev.off()
}
workflow.R
source("01_load_data.R") source("02_analyse_data.R") source("03_visualise_data.R") <- load_data("raw_data") dt analyze_data(dt) visualise_data(dt, "Variables overview", width = 560, height = 560)
This structure ensures clarity, with each component of the project clearly demarcated, promoting better workflow and collaboration.
4.2 Writing clean and readable code
Code is effectively written for a computer to “understand” and execute (i.e., machine readable). However, humans revise, learn from, and expand code (geek humans, but humans all the same!). Therefore, when writing code, you should keep in mind a few practices that will make it easier for humans to read and understand (i.e., more human-readable).
4.2.1 Naming conventions
Bad practices:
Non-descriptive variable names
Using vague names makes it hard to understand the purpose of a variable.
# Bad naming
a <- 22.5
b <- function(x) {
  mean(x)
}
Inconsistent naming styles
Mixing different styles (e.g., camelCase, snake_case, and inconsistent capitalization) leads to confusion.
# Inconsistent naming
calcMean <- function(X) {
  mean(X)
}
calculate_mean <- function(x) {
  mean(x)
}
Best practices:
Using descriptive and consistent names
Choose clear and consistent names to convey purpose.
# Good naming conventions
average_temperature <- 22.5
calculate_mean <- function(values) {
  mean(values)
}
For more guidance, refer to the tidyverse style guide (“Tidyverse Style Guide” n.d.).
4.2.2 Commenting and documentation
Bad practices:
Lack of comments
Skipping comments leads to difficulty in understanding the purpose or logic.
# No comments
data <- read.csv("data.csv")
result <- data[data$score > 10, ]
Over-commenting obvious code
Commenting on trivial lines unnecessarily clutters the code.
# Adding 1 to x
x <- x + 1
Best practices:
Meaningful comments and using roxygen2
Document non-obvious logic and function purpose clearly.
#' Filter data by score
#'
#' This function filters data for scores greater than 10.
#' @param data A dataframe with a score column
#' @return Filtered dataframe
filter_high_scores <- function(data) {
  data[data$score > 10, ]
}
For more about the roxygen2 package, see its “vignette” (“Learn More” n.d.).
4.2.3 Avoiding magic numbers and hardcoding
Magic numbers are numeric values embedded directly in the code without context or explanation. These numbers often appear arbitrary and can make the code difficult to understand and maintain. They are problematic because:
1. Lack of clarity: Without meaningful names, the purpose of the number is unclear to others (or to you in the future).
2. Hard to update: If the same number appears in multiple places, updating it becomes error-prone and tedious.
3. Reduced readability: Magic numbers obscure the intent of the code, making it harder to follow and debug.
Bad practices:
Using magic numbers
Hardcoding values without explanation can confuse users about their significance.
# Hardcoded magic number
for (i in 1:7) {
  print(i)
}
Best practices:
Defining constants
Clearly define constants for better clarity and easy updates.
# Using constants
DAYS_IN_WEEK <- 7
for (i in 1:DAYS_IN_WEEK) {
  print(i)
}
4.2.4 Writing efficient and scalable code
Efficient and scalable R code is crucial when dealing with large datasets or computationally intensive tasks. Scalability refers to the code’s ability to handle increasing amounts of data or complexity without significant performance degradation.
Vectorization
Vectorization involves replacing explicit loops with vectorized operations, which are more efficient and concise. This approach leverages R’s optimized internal functions to operate on entire vectors or matrices in one go, reducing execution time.
Examples:
Base R:
# Loop approach
result <- numeric(1000)
for (i in 1:1000) {
  result[i] <- i^2
}

# Vectorized approach
result <- (1:1000)^2
Using dplyr:
library(dplyr)
data <- tibble(x = 1:1000)

# Vectorized operation with dplyr
data <- data %>%
  mutate(square = x^2)
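To see the performance difference for yourself, you can time both approaches with system.time(). This is a minimal sketch: the vector length n and the resulting timings are illustrative and will vary from machine to machine.
n <- 1e6

# Loop approach, timed
loop_time <- system.time({
  result <- numeric(n)
  for (i in 1:n) {
    result[i] <- i^2
  }
})

# Vectorized approach, timed
vec_time <- system.time({
  result <- (1:n)^2
})

loop_time["elapsed"]
vec_time["elapsed"]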
Memory Management
Efficient memory management is essential when working with large datasets to prevent memory exhaustion and improve performance. The data.table package is specifically designed for handling large datasets efficiently by minimizing memory usage and speeding up data operations.
Example:
Using data.table:
library(data.table)

# Creating a large data.table
dt <- data.table(x = rnorm(1e7), y = rnorm(1e7))

# Efficiently computing on large datasets
dt[, mean_x := mean(x)]
data.table optimizes memory usage by modifying data in place and avoiding unnecessary copies, making it ideal for large-scale data processing.
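A quick way to see this in-place behaviour is tracemem(), which reports whenever R copies an object. This is a small sketch, assuming an R build with memory profiling enabled (the standard CRAN binaries have it):
library(data.table)

df <- data.frame(x = 1:5)
tracemem(df)       # ask R to report any copies of df
df$y <- df$x * 2   # base R assignment duplicates the data frame (a copy is reported)

dt <- data.table(x = 1:5)
tracemem(dt)
dt[, y := x * 2]   # := adds the column by reference, so no copy is reported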
4.3 Refactoring
Refactoring refers to the process of improving existing code without changing its external behaviour. The goal is to make the code cleaner, more efficient, and easier to maintain. Refactoring enhances readability, reduces complexity, and simplifies future modifications.
4.3.1 Why Refactor?
- Improves code readability: Well-structured code is easier to understand.
- Enhances maintainability: Simplified codebase reduces the time required for future updates or debugging.
- Promotes reusability: Modularized and clean code is more reusable in different contexts.
4.3.2 Examples of Refactoring in R
1. Removing Redundant Code
Before:
data <- read.csv("data.csv")
data_cleaned <- na.omit(data)
data_cleaned <- data_cleaned[data_cleaned$score > 10, ]
After:
data <- read.csv("data.csv") |>
  na.omit() |>
  subset(score > 10)

# or using dplyr
library(dplyr)
data <- read.csv("data.csv") %>%
  na.omit() %>%
  filter(score > 10)
Using pipelines (base R's |> or dplyr's %>%) makes the code concise and easier to read.
2. Breaking Down Long Functions
Before:
calculate_metrics <- function(data) {
  mean_value <- mean(data$score)
  sd_value <- sd(data$score)
  return(list(mean = mean_value, sd = sd_value))
}
After:
calculate_mean <- function(data) {
  mean(data$score)
}

calculate_sd <- function(data) {
  sd(data$score)
}

calculate_metrics <- function(data) {
  list(mean = calculate_mean(data), sd = calculate_sd(data))
}
Breaking down long functions into smaller ones improves readability and reusability.
3. Replacing Hardcoded Values with Constants
Before:
if (length(data) > 1000) {
  print("Large dataset")
}
After:
THRESHOLD <- 1000
if (length(data) > THRESHOLD) {
  print("Large dataset")
}
Using constants improves clarity and ease of updating values.
4.4 EXTRA: Testing and validation
4.4.1 Writing Unit Tests
Unit tests are essential for ensuring code correctness by verifying that each function behaves as expected. They help catch errors early and make code more robust by facilitating easy modifications and refactoring.
Example: Writing basic unit tests in R (testthat)
Install the testthat package:
install.packages("testthat")
Create a test script (test_calculate_mean.R):
library(testthat)

calculate_mean <- function(x) {
  if (!is.numeric(x)) stop("Input must be numeric")
  mean(x, na.rm = TRUE)
}

test_that("calculate_mean works correctly", {
  expect_equal(calculate_mean(c(1, 2, 3)), 2)
  expect_equal(calculate_mean(c(NA, 2, 3)), 2.5)
  expect_error(calculate_mean("string"))
})
Run the tests:
test_file("test_calculate_mean.R")
4.4.2 Data Validation
Validating data inputs and outputs is crucial for maintaining data integrity, especially in data processing pipelines. It ensures that the data conforms to expected formats and values, avoiding downstream errors.
Example: Implementing data validation checks in data processing scripts
Given a dataset:
data <- data.frame(
  age = c(25, -5, 30),
  salary = c(50000, 0, 60000),
  name = c("Alice", NA, "Bob")
)
Check whether each variable complies with certain rules:
* age is non-negative,
* salary is positive,
* name is not missing.
stopifnot(data$age >= 0, data$salary > 0, !is.na(data$name))
# or with custom error messages, supplied as argument names:
stopifnot("Age values must be non-negative" = data$age >= 0)
stopifnot("Salary values must be positive" = data$salary > 0)
stopifnot("Names must not be NA" = !is.na(data$name))
For handling larger rule sets and getting more structured feedback, consider using the validate package:
library(validate)

rules <- validator(
  age >= 0,
  salary > 0,
  !is.na(name)
)

# Validate the data
check <- confront(data, rules)
summary(check)
name items passes fails nNA error warning expression
1 V1 3 2 1 0 FALSE FALSE age - 0 >= -1e-08
2 V2 3 2 1 0 FALSE FALSE salary > 0
3 V3 3 2 1 0 FALSE FALSE !is.na(name)
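If you also need the offending rows themselves rather than just the summary counts, the validate package provides helpers for this; as a brief sketch, violating() should return the records that fail at least one rule:
# Rows that fail at least one rule
violating(data, check)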
4.5 EXTRA: Error handling and debugging
Error handling and debugging are essential for building robust R code that gracefully manages unexpected issues. Effective error handling prevents crashes and improves user feedback, while debugging tools help identify and resolve problems during code development.
4.5.1 Error Handling Techniques
Using tryCatch in R
tryCatch is a versatile function for handling errors, warnings, and messages in R. It allows you to specify actions for different types of conditions, ensuring that code can handle unexpected situations gracefully.
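Before the data-processing example below, here is a minimal sketch of the general pattern: separate handlers for errors and warnings, plus a finally clause that always runs. The safe_log() function is purely illustrative.
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      cat("Warning caught:", conditionMessage(w), "\n")
      NA
    },
    error = function(e) {
      cat("Error caught:", conditionMessage(e), "\n")
      NA
    },
    finally = cat("safe_log() finished\n")
  )
}

safe_log(10)   # works normally and returns log(10)
safe_log(-1)   # log(-1) raises a warning, handled by the warning handler
safe_log("a")  # non-numeric input raises an error, handled by the error handler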
Writing Meaningful Error Messages
Meaningful error messages help both developers and users understand what went wrong and guide them toward solutions. Avoid generic messages like "Error occurred" and instead specify the exact problem.
Example: Implementing error handling in a data processing script
process_data <- function(data) {
  tryCatch({
    # check
    if (!"score" %in% names(data)) stop("Missing 'score' column in data")
    # filter
    data <- data[data$score > 10, ] # Process data
    return(data)
  }, error = function(e) {
    cat("Error in processing data:", e$message, "\n")
    return(NULL)
  })
}

# Usage
result <- process_data(iris)
Error in processing data: Missing 'score' column in data
In this example, tryCatch captures the error raised when the 'score' column is missing and reports it with a custom message instead of stopping the script.
4.5.2 Debugging tools
Introduction to Debugging Tools: browser in R
The browser() function in R allows you to pause code execution at a specific point, enabling step-by-step inspection of variables and function behaviour. This is especially useful for identifying unexpected values or conditions.
Example: Walkthrough of a debugging session in R
calculate_mean <- function(data) {
  browser() # Debugging breakpoint
  mean(data$score, na.rm = TRUE)
}

# Sample data
data <- data.frame(score = c(10, NA, 20, 15))
calculate_mean(data)
When browser() is called, R pauses execution and opens an interactive environment. Here, you can inspect data$score, check conditions, and proceed line by line, allowing you to diagnose issues effectively.
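At the Browse[1]> prompt you can type ordinary R expressions as well as the debugger's own commands; for example:
Browse[1]> ls()            # list the objects visible inside the function
Browse[1]> data$score      # inspect the argument that was passed in
Browse[1]> n               # execute the next line of the function
Browse[1]> c               # continue running until the function returns
Browse[1]> Q               # quit the browser and abort the call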
4.6 EXTRA: Creating and releasing packages
Reusable and shareable code helps in building efficient workflows, saving time, and reducing redundancy across projects. Reusable code can be encapsulated in functions, libraries, or R packages to simplify future use and sharing.
4.6.1 Creating packages
Writing functions and libraries that can be reused in multiple projects helps standardize tasks and reduces the risk of introducing errors by duplicating code. R packages are an ideal way to organize reusable functions, data, and documentation, making it easy to use them across different projects.
Example: Creating a Simple R Package
1. Set up the package
Use the usethis package to initiate a new package:
install.packages("usethis")
usethis::create_package("path/to/myPackage")
2. Add a function
Create a function file in the R/ folder, for example R/hello.R:
hello <- function(name = "world") {
  paste("Hello,", name)
}
3. Document the function
Use roxygen2 to add documentation:
#' Say Hello
#'
#' @param name A character string. Default is "world".
#' @return A greeting message.
#' @export
hello <- function(name = "world") {
  paste("Hello,", name)
}
4. Build and test
Load and test your package:
devtools::document()
devtools::load_all()
hello("R user")
4.6.2 Releasing/sharing packages
Once your code is reusable as a package, share it to help others benefit from your work. Publishing packages on CRAN or GitHub makes them accessible to other users and developers, while sharing R Markdown notebooks helps others reproduce and understand your analyses.
GitHub: Initialize a Git repository in the package directory (or create a new RStudio project set up as an R package), connect it to GitHub, and add the package files. Push changes to GitHub and release with a new tag.
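One possible way to do this from the R console uses helpers from the usethis package; this is a sketch, assuming Git is installed and you have a GitHub account with a configured personal access token:
# run from inside the package project
usethis::use_git()      # initialise a Git repository and make the first commit
usethis::use_github()   # create a matching GitHub repository and push the package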
CRAN: Follow CRAN submission guidelines.
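Before submitting, you typically run the standard package checks locally; devtools offers helpers for this (a sketch, to be complemented by the full CRAN policies):
devtools::check()     # run R CMD check and fix any errors, warnings, and notes
devtools::release()   # interactive helper that walks you through a CRAN submission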
Sharing packages or code through these platforms allows the broader R community to contribute, use, and improve your work.