4  Best practices in programming

4.1 Code organisation

4.1.1 Modular programming

Importance of modularity
Breaking down your code into functions and modules enhances readability, maintainability, and reusability. This approach helps isolate specific tasks and allows for easier debugging and testing.

Single-responsibility Principle

The Single-responsibility Principle (SRP) is a key concept in software design stating that each function or module in your code should have one, and only one, reason to change. For students learning R, this means breaking your code down into smaller, well-defined functions, each doing one specific task. This makes your code easier to understand, test, and maintain (“Single-Responsibility Principle - Wikipedia” n.d.).

For example, instead of writing one long script that loads data, cleans it, and plots it, you can write separate functions for each task:

  • load_data() handles loading the data.
  • clean_data() takes care of data cleaning.
  • plot_data() is responsible for plotting.

By adhering to SRP, you reduce the chance of introducing bugs when modifying or extending your code. If the way data is loaded changes, you only need to adjust load_data() without worrying about unintended side effects on clean_data() or plot_data().

Example: Creating in-script custom functions

# Custom function to calculate mean
calculate_mean <- function(data) {
  mean(data, na.rm = TRUE)
}

# Usage
numbers <- c(1, 2, 3, NA, 5)
calculate_mean(numbers)

Example: Wrapping messy code into clear functions

Messy Code Example:

# Messy and hard to follow
data <- read.csv("data.csv")
cleaned <- na.omit(data$column)
scaled <- (cleaned - mean(cleaned)) / sd(cleaned)
hist(scaled)

Wrapped in functions clearly defined:

# Function to load data
load_data <- function(filepath) {
  read.csv(filepath)
}

# Function to clean data
clean_data <- function(data, column) {
  na.omit(data[[column]])
}

# Function to scale data
scale_data <- function(data) {
  (data - mean(data)) / sd(data)
}

# Function to plot histogram
plot_histogram <- function(data) {
  hist(data)
}

# Main execution sequence
main <- function() {
  raw_data <- load_data("data.csv")
  cleaned_data <- clean_data(raw_data, "column")
  scaled_data <- scale_data(cleaned_data)
  plot_histogram(scaled_data)
}

# Run the main function
main()

By organizing the operations into clear, single-responsibility functions, the code becomes more readable, maintainable, and easier to debug.

Example: Creating and importing custom R scripts

  1. Creating a script: Save the following function in a file named utility_functions.R.

    # utility_functions.R
    calculate_sum <- function(data) {
      sum(data, na.rm = TRUE)
    }
  2. Importing the script:

    source("utility_functions.R") # loads all functions defined inside the script
    numbers <- c(1, 2, 3, NA, 5)
    calculate_sum(numbers)

4.1.2 Code structuring

Structuring a data science project
A well-organized project structure separates code, data, and outputs, which facilitates efficient project management.

Example: Setting up a basic project structure in R and RStudio

  1. Folder Structure:

    my_project/
    ├── data/
    │   └── raw_data.csv
    ├── scripts/
    │   ├── 01_load_data.R
    │   ├── 02_analyse_data.R
    │   └── 03_visualise_data.R
    ├── outputs/
    │   └── analysis_results.csv
    ├── workflow.R
    └── my_project.Rproj
  2. Script Example:

    • 01_load_data.R

      load_data <- function(file_name, dir = "data", save_rds = FALSE) {
        file_path <- paste(dir, file_name, sep = "/")
        # load raw data
        raw_data <- read.csv(paste(file_path, "csv", sep = "."))
        if (save_rds) {
          # save a copy as an R dataset (.rds)
          saveRDS(raw_data, file = paste(file_path, "rds", sep = "."))
        }
        raw_data # return the loaded data
      }
    • 02_analyse_data.R

      analyse_data <- function(dataset) {
        summary_stats <- summary(dataset)
        print(summary_stats)
        # Save the results to an output file
        write.csv(summary_stats, "outputs/analysis_results.csv")
      }
    • 03_visualise_data.R

      visualise_data <- function(dataset, plot_name, dir = "outputs", width = 480, height = 480) {
        file_path <- paste(dir, plot_name, sep = "/")
        file_name <- paste(file_path, "png", sep = ".")
        png(file_name, width = width, height = height)
        pairs(dataset) # matrix of scatterplots
        dev.off()
      }
    • workflow.R

      source("scripts/01_load_data.R")
      source("scripts/02_analyse_data.R")
      source("scripts/03_visualise_data.R")
      
      dt <- load_data("raw_data")
      
      analyse_data(dt)
      
      visualise_data(dt, "Variables overview", width = 560, height = 560)

This structure ensures clarity, with each component of the project clearly demarcated, promoting better workflow and collaboration.

  • Grolemund (n.d.)
  • Gillespie and Lovelace (n.d.)
  • “Welcome · Advanced R.” (n.d.)

4.2 Writing clean and readable code

Code is effectively written for a computer to “understand” and execute (i.e., machine-readable). However, humans revise, learn from, and expand code (geek humans, but humans all the same!). Therefore, when writing code, you should keep in mind a few practices that make it easier for humans to read and understand (i.e., more human-readable).

4.2.1 Naming conventions

Bad practices:

  1. Non-descriptive variable names
    Using vague names makes it hard to understand the purpose of a variable.

    # Bad naming
    a <- 22.5
    b <- function(x) {
      mean(x)
    }
  2. Inconsistent naming styles
    Mixing different styles (e.g., camelCase, snake_case, and inconsistent capitalization) leads to confusion.

    # Inconsistent naming
    calcMean <- function(X) {
      mean(X)
    }
    calculate_mean <- function(x) {
      mean(x)
    }

Best practices:

  1. Using descriptive and consistent names
    Choose clear and consistent names to convey purpose.

    # Good naming conventions
    average_temperature <- 22.5
    calculate_mean <- function(values) {
      mean(values)
    }

    For more guidance, refer to the tidyverse style guide (“Tidyverse Style Guide” n.d.).

4.2.2 Commenting and documentation

Bad practices:

  1. Lack of comments
    Skipping comments leads to difficulty in understanding the purpose or logic.

    # No comments
    data <- read.csv("data.csv")
    result <- data[data$score > 10, ]
  2. Over-commenting obvious code
    Commenting on trivial lines unnecessarily clutters the code.

    # Adding 1 to x
    x <- x + 1

Best practices:

  1. Meaningful comments and using roxygen2
    Document non-obvious logic and function purpose clearly.

    #' Filter data by score
    #'
    #' This function filters data for scores greater than 10.
    #' @param data A dataframe with a score column
    #' @return Filtered dataframe
    filter_high_scores <- function(data) {
      data[data$score > 10, ]
    }

    For more about the roxygen2 package, see its “vignette” (“Learn More” n.d.).

4.2.3 Avoiding magic numbers and hardcoding

Magic numbers in programming

Magic numbers refer to unique numeric values embedded directly in the code without context or explanation. These numbers often appear arbitrary and can make the code difficult to understand and maintain. They are problematic because:
1. Lack of clarity: Without meaningful names, the purpose of the number is unclear to others (or to you in the future).
2. Hard to update: If the same number appears in multiple places, updating it becomes error-prone and tedious.
3. Reduced readability: Magic numbers obscure the intent of the code, making it harder to follow and debug.

Bad practices:

  1. Using magic numbers
    Hardcoding values without explanation can confuse users about their significance.

    # Hardcoded magic number
    for (i in 1:7) {
      print(i)
    }

Best practices:

  1. Defining constants
    Clearly define constants for better clarity and easy updates.

    # Using constants
    DAYS_IN_WEEK <- 7
    for (i in 1:DAYS_IN_WEEK) {
      print(i)
    }
  • “Tidyverse Style Guide” (n.d.)
  • Alboukadel (2020)
  • “R Variables and Constants (With Examples)” (n.d.)
  • “Style Guide · Advanced R.” (n.d.)
  • “Google’s R Style Guide” (n.d.)

4.2.4 Writing efficient and scalable code

Efficient and scalable R code is crucial when dealing with large datasets or computationally intensive tasks. Scalability refers to the code’s ability to handle increasing amounts of data or complexity without significant performance degradation.

Vectorization

Vectorization involves replacing explicit loops with vectorized operations, which are more efficient and concise. This approach leverages R’s optimized internal functions to operate on entire vectors or matrices in one go, reducing execution time. A rough timing comparison follows the examples below.

Examples:

  • Base R:

    # Loop approach
    result <- numeric(1000)
    for (i in 1:1000) {
      result[i] <- i^2
    }
    
    # Vectorized approach
    result <- (1:1000)^2
  • Using dplyr:

    library(dplyr)
    data <- tibble(x = 1:1000)
    
    # Vectorized operation with dplyr
    data <- data %>%
      mutate(square = x^2)
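
As a rough illustration, system.time() can be used to compare the two approaches (a minimal sketch, not part of the original examples; timings vary by machine):

# Informal timing comparison: loop vs. vectorized
n <- 1e6

system.time({
  result_loop <- numeric(n)
  for (i in 1:n) {
    result_loop[i] <- i^2
  }
})

system.time({
  result_vec <- (1:n)^2
})

identical(result_loop, result_vec) # same values, far less time for the vectorized version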

Memory Management

Efficient memory management is essential when working with large datasets to prevent memory exhaustion and improve performance. data.table is a package specifically designed for handling large datasets efficiently by minimizing memory usage and speeding up data operations.

Example:

  • Using data.table:

    library(data.table)
    
    # Creating a large data.table
    dt <- data.table(x = rnorm(1e7), y = rnorm(1e7))
    
    # Efficiently computing on large datasets
    dt[, mean_x := mean(x)]

data.table optimizes memory usage by modifying data in place and avoiding unnecessary copies, making it ideal for large-scale data processing.
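
A minimal sketch of this difference, using base R’s tracemem() to report copies (illustrative only; output depends on your session):

library(data.table)

dt <- data.table(x = 1:5)
tracemem(dt)          # ask R to report whenever dt is copied
dt[, y := x * 2]      # := adds the column by reference: no copy is reported

df <- data.frame(x = 1:5)
tracemem(df)
df$y <- df$x * 2      # base R copies the data frame before modifying it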

  • Anderson (n.d.)
  • “R Dtplyr: How to Efficiently Process Huge Datasets with a Data.table Backend” (n.d.)
  • Lipkin (2022)

4.3 Refactoring

Refactoring refers to the process of improving existing code without changing its external behaviour. The goal is to make the code cleaner, more efficient, and easier to maintain. Refactoring enhances readability, reduces complexity, and simplifies future modifications.

4.3.1 Why Refactor?

  • Improves code readability: Well-structured code is easier to understand.
  • Enhances maintainability: Simplified codebase reduces the time required for future updates or debugging.
  • Promotes reusability: Modularized and clean code is more reusable in different contexts.

4.3.2 Examples of Refactoring in R

1. Removing Redundant Code

Before:

data <- read.csv("data.csv")
data_cleaned <- na.omit(data)
data_cleaned <- data_cleaned[data_cleaned$score > 10, ]

After:

data <- read.csv("data.csv") |>
  na.omit() |>
  subset(score > 10)
# or using dplyr
library(dplyr)
data <- read.csv("data.csv") %>%
  na.omit() %>%
  filter(score > 10)

Using pipelines (base R’s |> or dplyr’s %>%) makes the code concise and easier to read.

2. Breaking Down Long Functions

Before:

calculate_metrics <- function(data) {
  mean_value <- mean(data$score)
  sd_value <- sd(data$score)
  return(list(mean = mean_value, sd = sd_value))
}

After:

calculate_mean <- function(data) {
  mean(data$score)
}

calculate_sd <- function(data) {
  sd(data$score)
}

calculate_metrics <- function(data) {
  list(mean = calculate_mean(data), sd = calculate_sd(data))
}

Breaking down long functions into smaller ones improves readability and reusability.

3. Replacing Hardcoded Values with Constants

Before:

if (length(data) > 1000) {
  print("Large dataset")
}

After:

THRESHOLD <- 1000
if (length(data) > THRESHOLD) {
  print("Large dataset")
}

Using constants improves clarity and ease of updating values.

  • “Introduction to R - Refactoring the Workshop” (n.d.)
  • “Programming with R: Best Practices for Writing R Code” (n.d.)

4.4 EXTRA: Testing and validation

4.4.1 Writing Unit Tests

Unit tests are essential for ensuring code correctness by verifying that each function behaves as expected. They help catch errors early and make code more robust by facilitating easy modifications and refactoring.

Example: Writing basic unit tests in R (testthat)

  1. Install testthat package:

    install.packages("testthat")
  2. Create a test script (test_calculate_mean.R):

    library(testthat)
    
    calculate_mean <- function(x) {
      if (!is.numeric(x)) stop("Input must be numeric")
      mean(x, na.rm = TRUE)
    }
    
    test_that("calculate_mean works correctly", {
      expect_equal(calculate_mean(c(1, 2, 3)), 2)
      expect_equal(calculate_mean(c(NA, 2, 3)), 2.5)
      expect_error(calculate_mean("string"))
    })
  3. Run the tests:

    test_file("test_calculate_mean.R")
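
When a project contains several test files (for example in a tests/ folder), they can all be run at once; a minimal sketch, assuming the test scripts live in a tests/ directory:

library(testthat)
test_dir("tests/") # runs every file matching test*.R in that directory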

4.4.2 Data Validation

Validating data inputs and outputs is crucial for maintaining data integrity, especially in data processing pipelines. It ensures that the data conforms to expected formats and values, avoiding downstream errors.

Example: Implementing data validation checks in data processing scripts

Given a dataset:

data <- data.frame(
  age = c(25, -5, 30),
  salary = c(50000, 0, 60000),
  name = c("Alice", NA, "Bob")
)

Check that each variable complies with certain rules:

  • age is non-negative,
  • salary is positive,
  • name is not missing.

stopifnot(data$age >= 0, data$salary > 0, !is.na(data$name))

# or with custom error messages (the message is supplied as the argument name):
stopifnot(
  "Age values must be non-negative" = data$age >= 0,
  "Salary values must be positive" = data$salary > 0,
  "Names must not be NA" = !is.na(data$name)
)

For handling larger rule sets and getting more structured feedback, consider using the validate package:

library(validate)

rules <- validator(
  age >= 0,
  salary > 0,
  !is.na(name)
)

# Validate the data
check <- confront(data, rules)
summary(check)
  name items passes fails nNA error warning        expression
1   V1     3      2     1   0 FALSE   FALSE age - 0 >= -1e-08
2   V2     3      2     1   0 FALSE   FALSE        salary > 0
3   V3     3      2     1   0 FALSE   FALSE      !is.na(name)
  • “Prerequisites” (n.d.)
  • “Best Practices for Durable R Code” (n.d.)
  • “The Validation Set Approach in R Programming - GeeksforGeeks” (n.d.)
  • Kennedy (2019)

4.5 EXTRA: Error handling and debugging

Error handling and debugging are essential for building robust R code that gracefully manages unexpected issues. Effective error handling prevents crashes and improves user feedback, while debugging tools help identify and resolve problems during code development.

4.5.1 Error Handling Techniques

Using tryCatch in R
tryCatch is a versatile function for handling errors, warnings, and messages in R. It allows you to specify actions for different types of errors, ensuring that code can handle unexpected conditions gracefully.
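
As a minimal sketch (not from the original example; safe_log() is a hypothetical helper), tryCatch can route warnings, errors, and a finally clause to separate handlers:

safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      message("Warning caught: ", conditionMessage(w))
      NA
    },
    error = function(e) {
      message("Error caught: ", conditionMessage(e))
      NA
    },
    finally = message("safe_log() finished")
  )
}

safe_log(10)   # works: returns log(10)
safe_log(-1)   # log() warns (NaN produced): the warning handler returns NA
safe_log("a")  # non-numeric input errors: the error handler returns NA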

Writing Meaningful Error Messages
Meaningful error messages help both developers and users understand what went wrong and guide them toward solutions. Avoid generic messages like "Error occurred" and instead specify the exact problem.

Example: Implementing error handling in a data processing script

process_data <- function(data) {
  tryCatch({
    # check
    if (!"score" %in% names(data)) stop("Missing 'score' column in data")
    # filter
    data <- data[data$score > 10, ]  # Process data
    return(data)
  },
  error = function(e) {
    cat("Error in processing data:", e$message, "\n")
    return(NULL)
  })
}

# Usage
result <- process_data(iris)
Error in processing data: Missing 'score' column in data 

In this example, tryCatch captures the error raised when the 'score' column is missing, prints a custom message, and returns NULL instead of stopping execution.

4.5.2 Debugging tools

Introduction to Debugging Tools: browser in R
The browser function in R allows you to pause code execution at a specific point, enabling step-by-step inspection of variables and function behaviour. This is especially useful for identifying unexpected values or conditions.

Example: Walkthrough of a debugging session in R

calculate_mean <- function(data) {
  browser()  # Debugging breakpoint
  mean(data$score, na.rm = TRUE)
}

# Sample data
data <- data.frame(score = c(10, NA, 20, 15))
calculate_mean(data)

When browser() is called, R will pause execution and open an interactive environment. Here, you can inspect data$score, check conditions, and proceed line by line, allowing you to diagnose issues effectively.
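
At the Browse[1]> prompt, a few single-letter commands control execution:

Browse[1]> n       # run the next line
Browse[1]> s       # step into a function call
Browse[1]> c       # continue execution until the function exits
Browse[1]> where   # print the call stack
Browse[1]> Q       # quit the browser and abort execution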

  • “Exceptions and Debugging · Advanced R.” (n.d.)
  • “22 Debugging | Advanced R” (n.d.)
  • “License (GPL-3)” (n.d.)
  • “4  Debugging Pipelines – The {Targets} R Package User Manual” (n.d.)
  • Albert Rapp (2024)

4.6 EXTRA: Creating and releasing packages

Reusable and shareable code helps in building efficient workflows, saving time, and reducing redundancy across projects. Reusable code can be encapsulated in functions, libraries, or R packages to simplify future use and sharing.

4.6.1 Creating packages

Writing functions and libraries that can be reused in multiple projects helps standardize tasks and reduces the risk of introducing errors by duplicating code. R packages are an ideal way to organize reusable functions, data, and documentation, making it easy to use them across different projects.

Example: Creating a simple R package

  1. Set up the package
    Use the usethis package to initiate a new package:

    install.packages("usethis")
    usethis::create_package("path/to/myPackage")

  2. Add a function
    Create a function file in the R/ folder, for example R/hello.R:

    hello <- function(name = "world") {
      paste("Hello,", name)
    }
  3. Document the function
    Use roxygen2 to add documentation:

    #' Say Hello
    #'
    #' @param name A character string. Default is "world".
    #' @return A greeting message.
    #' @export
    hello <- function(name = "world") {
      paste("Hello,", name)
    }
  4. Build and test
    Load and test your package:

    devtools::document()
    devtools::load_all()
    hello("R user")

4.6.2 Releasing/sharing packages

Once your code is reusable as a package, share it so that others can benefit from your work. Publishing packages on CRAN or GitHub makes them accessible to other users and developers, while sharing R Markdown notebooks helps others reproduce and understand your analyses.

GitHub: Initialize a Git repository in the package directory (or create a new RStudio project set up as an R package), connect it to GitHub, and add the package files. Push changes to GitHub and release with a new tag. A sketch of these steps using the usethis package follows below.
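
As a minimal sketch (assuming a configured GitHub account; "your-github-account/myPackage" is a placeholder), the usethis package can perform these steps from within R:

# Run from inside the package project
usethis::use_git()    # initialise a Git repository and commit the package files
usethis::use_github() # create the GitHub repository and push

# Others can then install directly from GitHub:
# devtools::install_github("your-github-account/myPackage")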

CRAN: Follow CRAN submission guidelines.
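
Before submission, packages are usually checked locally; a minimal sketch using devtools:

devtools::check()   # run R CMD check and review errors, warnings, and notes
devtools::release() # interactive helper that walks through the CRAN submission steps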

Sharing packages or code through these platforms allows the broader R community to contribute, use, and improve your work.

  • “Making Your First R Package” (n.d.)
  • “Writing R Packages in Rstudio” (n.d.)
  • StatistikinDD (2021)

Hands-on Practice
  • Refactoring Code (30min)
    • First attempt: Refactoring a sample script to follow best practices (clean code, modularity, documentation).
    • Second attempt: Using a Large Language Model (LLM) to refactor.
  • Collaborative Exercise (40min)
    • Simulating a collaborative workflow with Git: making and reviewing pull requests.
    • Groups of two or three
    • Re-use one of the repositories created on GitHub (Session 2) and populated with R code and output files (Session 3).
    • Mutual reviews and change suggestions.
    • Discussion, accepting/rejecting changes, and merge decision
  • Open discussion (10min)
    • Addressing common challenges in applying best practices to real-world projects.