Good Coding Practices for Data Analysts

Heather Turner
Research Software Engineering Fellow
University of Warwick

@HeathrTurnr

16 November 2022

heatherturner.net/talks/NHS-R2022


Goals

In theory, writing scripts for data analysis makes our work

Transparent
Reproducible/reusable
Maintainable

In practice, we need to adopt good coding and software engineering practices!


Background: experience in research and industry, both writing my own scripts and reviewing other people's. What am I looking for?

Note there are other ways of defining "good", e.g. efficiency, which are not covered here!

Useful whether you work on your own or as part of a team


Transparency


Organize code as you would like to find it!


Project Organization

Organize your project as you would like to find it!

Organize files by type (data, code, etc) to make it easy to navigate.
Name files to reflect the content/function.

example_project
│
└─── data
│   │   patient_outcomes.csv
│
└─── outputs
│   │   summarized_outcomes.csv
│   
└─── reports
│   │   study_report.Rmd
│   │   study_report.docx
│
└─── scripts
│   │   functions.R
│   │   analysis.R


Documentation

Put a README at the top level of your project folder
Comment your code to describe its purpose

library(dplyr)

# Patient exposure and event rate
patient_summary <- patient_outcomes |>
    group_by(STUDYID, COUNTRY, CENTRE, PT) |>
    summarise(d_exposure = max(d_exposure, na.rm = TRUE),
              exposure = d_exposure / 30.4, # convert days to months
              event_count = sum(!is.na(EVENT)),
              event_rate = event_count / exposure)


In RStudio, use Ctrl/⌘ + Shift + R to insert a section

# Pre-processing ----------------------------------------------------------


What would you tell a colleague if you were passing on the project and sat next to them?

Commenting, README

could add slide on sectioning with picture of RStudio outline, if time/space
could add comments on .Rmd vs .R (chunk names/main text can replace some comments; markdown sectioning replaces comment sectioning)


Readable code

Use meaningful names
Keep line length <80 characters and use white space around operators
Use one chunk of code per objective
Prefer readability over maximum efficiency


Efficient but complex

df$lag_value <- c(NA, df$value[-nrow(df)])
df$lag_value[which(!duplicated(df$group))] <- NA
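A more readable alternative to the two lines above could use a grouped lag. This is only a sketch and not necessarily the version shown in the talk: it assumes the dplyr package (used in the earlier examples), and the toy `df` is invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration: two groups with ordered values
df <- data.frame(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 2, 3, 4, 5))

# Previous value within each group; the first row of a group has no
# previous value, so it gets NA
df_lagged <- df |>
    group_by(group) |>
    mutate(lag_value = lag(value)) |>
    ungroup()
```

The grouped version states the intent (lag within group) directly, at the cost of a package dependency and possibly some speed.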

Going further on transparency

Style guides
- Naming conventions, e.g. snake_case vs camelCase
- Indentation
- See e.g. The Tidyverse Style Guide
Code review
Pair programming
Function documentation using the docstring package


docstring: Create help files from special roxygen2 comments.
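As a rough illustration of the docstring approach, roxygen2-style comments can be placed at the top of a function body. `percent_of` is a hypothetical function made up for this sketch; the docstring calls are left as comments because they need the package installed and an interactive session.

```r
# Hypothetical function documented with roxygen2-style comments
percent_of <- function(count, total) {
    #' Convert counts to percentages
    #'
    #' @param count numeric vector of counts
    #' @param total single number, the total to divide by
    #' @return numeric vector of percentages
    count / total * 100
}

# With the docstring package installed, view the help page interactively:
# library(docstring)
# ?percent_of
```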


Reproducibility/Reusability


Project-oriented workflow

In addition to organizing files within a project directory...

Set the working directory to the project root
- Use RStudio Projects
- Use here::set_here() to tag the project root with a .here file
Use file paths relative to the project root, to make your project portable
- The here package makes this easy, e.g.
```
ggsave(here("figs", "mpg_hp.png"))
```
- If you need to use paths from outside the project, set these once at the start
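One way to set external paths once at the start is sketched below in base R. The environment variable `RAW_DATA_DIR` and the fallback folder are assumptions made up for this example, not part of the talk.

```r
# Set external locations once, at the top of the script.
# RAW_DATA_DIR is a hypothetical environment variable; the fallback
# folder is also made up for illustration.
raw_data_dir <- Sys.getenv("RAW_DATA_DIR",
                           unset = file.path("~", "shared", "raw_data"))

# Build every external path from that single setting
outcomes_file <- file.path(raw_data_dir, "patient_outcomes.csv")
```

Anyone moving the project to another machine then only needs to change one setting, rather than hunting for hard-coded paths throughout the scripts.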


General issue of hard-coding

Can use askpass::askpass() to prompt for a password rather than hard-coding it


Parameterized R Markdown/Quarto

R Markdown:

---
title: "`r params$data` Dataset"
output: html_document
params:
  data: sleep
---

Summary of the `r params$data` dataset:

```{r summary-data, echo = FALSE}
report_data <- get(params$data)
summary(report_data)
```

Quarto:

---
title: "`r params$data` Dataset"
format: html
params:
  data: sleep
---

Summary of the `r params$data` dataset:

```{r}
#| label: summary-data
#| echo: false
report_data <- get(params$data)
summary(report_data)
```


Data analyst's reusable component


Render with custom parameters

rmarkdown::render("rmarkdown.Rmd", 
  params = list(data = "sleep"))

quarto::quarto_render("quarto.qmd", 
  execute_params = list(data = "women"))
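The same call can be repeated for several parameter values. The sketch below wraps `rmarkdown::render()` in a hypothetical helper, `render_reports()`; the output file naming scheme is an assumption, and the function is only defined here, not run, since it needs the rmarkdown package and the `rmarkdown.Rmd` file.

```r
# Datasets to report on (both are built-in R datasets)
datasets <- c("sleep", "women")

# One output file per dataset; the naming scheme is an assumption
output_files <- paste0(datasets, "_report.html")

# Render the R Markdown document once per dataset.
# Requires the rmarkdown package and the "rmarkdown.Rmd" file.
render_reports <- function(input = "rmarkdown.Rmd") {
    for (i in seq_along(datasets)) {
        rmarkdown::render(input,
                          params = list(data = datasets[i]),
                          output_file = output_files[i])
    }
    invisible(output_files)
}
```

Calling `render_reports()` from the project root would then produce one HTML report per dataset.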


Defensive programming

Validate inputs, e.g.

# check an Excel file exists at the given path
xlsx <- normalizePath(xlsx, winslash = "/", mustWork = TRUE)
# check a threshold is valid
stopifnot(is.numeric(threshold) && threshold >= 0)

The assertthat and validate packages can be useful here.
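Without extra packages, base R can also give more helpful messages: since R 4.0.0, `stopifnot()` accepts named conditions, and the name is used as the error message. `check_threshold()` is a hypothetical helper written for this sketch.

```r
# Hypothetical helper: stop early with a clear message if threshold is invalid
check_threshold <- function(threshold) {
    stopifnot(
        "threshold must be a single number" =
            is.numeric(threshold) && length(threshold) == 1,
        "threshold must be non-negative" =
            !is.na(threshold) && threshold >= 0
    )
    invisible(threshold)
}

check_threshold(0.5)       # passes silently
# check_threshold(-1)      # fails: threshold must be non-negative
# check_threshold("high")  # fails: threshold must be a single number
```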


Check results of filters and joins

tab1 <- patient_outcomes |>
    filter(as.Date(DATE) == report_date & PT == patient)
if (!nrow(tab1))
    warning("No records for ", patient, " on ", report_date)
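A similar check can be made after a join. The sketch below uses base R `merge()` on toy data invented for illustration; it assumes the join is meant to be one-to-one, so the row count should not change.

```r
# Toy data, invented for illustration
patients <- data.frame(PT = c("P01", "P02", "P03"),
                       CENTRE = c("A", "A", "B"))
outcomes <- data.frame(PT = c("P01", "P02", "P04"),
                       EVENT = c("fall", NA, "fall"))

# Left join: keep every patient, add outcomes where available
joined <- merge(patients, outcomes, by = "PT", all.x = TRUE)

# A one-to-one join should neither drop nor duplicate patients
stopifnot(nrow(joined) == nrow(patients))

# Flag outcome records that did not match any patient
unmatched <- setdiff(outcomes$PT, patients$PT)
if (length(unmatched) > 0)
    warning("Outcomes with no matching patient: ",
            paste(unmatched, collapse = ", "))
```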


assertthat: extension of stopifnot with more helpful error messages
validate: for validating input data


Package management

Most basic:

Add a requirements.txt at the root of the project.
Put library() calls at the top of .R and .Rmd files.

More advanced tools to specify and restore working environment:

One-off analysis: use groundhog to specify R, packages & dependencies by a date.
Repeated analysis: use automagic to install package versions specified in deps.yaml.
Production code: use renv to specify the versions of R, packages & dependencies.
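For the most basic option, a requirements file could be generated from R itself. This is a rough sketch under stated assumptions: R has no standard requirements.txt format, so the "package version" layout is made up here, and `stats` and `utils` stand in for whatever packages the project actually loads.

```r
# Packages the project's scripts load; "stats" and "utils" are placeholders
pkgs <- c("stats", "utils")

# Look up the installed version of each package
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)),
                   character(1))

# One "package version" line per package; this layout is an assumption,
# as R has no standard requirements.txt format
requirements <- paste(pkgs, versions)

# Written to a temporary folder here; in a project this would be the root
writeLines(requirements, file.path(tempdir(), "requirements.txt"))
```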


one-off report: groundhog
reusable scripts: automagic
production code: renv


Maintainability


Choose dependencies carefully

Using a (non-base) package is always a trade-off:

For (e.g.)	Against
Better readability	Package update can break code
Faster implementation	Dependent on maintainer to fix bugs
Better error handling	More setup to reproduce analysis


How much of the functionality are you using?
How mature/well-maintained is the package?
Are you using it across multiple projects?


Not minimize (conflicts with transparency)
Select often-changing packages with care
Avoid trivial dependencies


Don't Repeat Yourself

Copy-pasting is error-prone and leads to over-complex code.

Use custom functions instead, e.g.

# convert counts to percentages in 2-way table with row/column totals
make_perc_tab <- function(tab){
    nr <- nrow(tab)
    nc <- ncol(tab)
    tab/tab[nr, nc] * 100
}


Makes it easier to re-use or iterate, e.g.

tab_list <- list(tab1, tab2, tab3)
out <- lapply(tab_list, make_perc_tab)
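To make the input format concrete, here is a small worked example. The choice of the built-in `mtcars` data and of `addmargins()` to add the row/column totals are assumptions for illustration; the function itself is repeated from above.

```r
# Function from above: divide by the grand total in the bottom-right cell
make_perc_tab <- function(tab){
    nr <- nrow(tab)
    nc <- ncol(tab)
    tab/tab[nr, nc] * 100
}

# 2-way table of counts from the built-in mtcars data, with totals added
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)
tab1 <- addmargins(counts)

perc1 <- make_perc_tab(tab1)

# The grand total cell is 100% by construction
stopifnot(isTRUE(all.equal(perc1[nrow(perc1), ncol(perc1)], 100)))
```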


https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply

make_perc_tab <- function(tab){ tab/sum(tab) * 100 }


Version control

Version control systems (e.g. git) allow us to record changes made to files in a directory.

Screenshot of commit history of a git repository, showing three commits: "added README file to start off", "added data for KHK project", "pre-processed KHK data"


Avoid saving multiple variants or commenting out old code
Commits can be restored temporarily or permanently
Syncing with a remote repository (e.g. on GitHub) provides a backup


Acts like a log, with comments on changes made
Facilitates merging work from collaborators


Testing

Tests can be used to check that custom functions act as expected, e.g.

log_2 <- function(x) log(x, 2)

library(testthat)
test_that("log_2 returns log to base 2", {
  expect_equal(log_2(2^3), 3)
})

## Test passed 🌈


Can create a test suite and run as test_file("tests.R").

Helps to detect issues introduced by changes to the code.
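A test file might look something like the sketch below. The extra test cases are made up for illustration, and `log_2` is defined inline to keep the example self-contained; in a real project it would more likely be loaded from a script such as `scripts/functions.R`.

```r
# Contents of a hypothetical tests.R
library(testthat)

log_2 <- function(x) log(x, 2)

test_that("log_2 returns log to base 2", {
    expect_equal(log_2(2^3), 3)
    expect_equal(log_2(1), 0)
})

test_that("log_2 is vectorised and propagates missing values", {
    expect_equal(log_2(c(2, 4, 8)), c(1, 2, 3))
    expect_true(is.na(log_2(NA)))
})
```

Re-running the file after each change to the functions gives a quick check that existing behaviour has not been broken.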


Pipelines/package development

Can also check Rmd output, e.g. by comparing hashes (MD5 checksums) of HTML files, possibly also testthat::expect_snapshot (untested!)


Going further on maintainability

Package development
- Functions, documentation and tests in a shareable format
- Easier to use across projects
Using a repository host, e.g. GitHub
- Use issues: note and discuss changes to make
- Teamwork: work asynchronously and merge changes
- Publish your code
- Encourage external contribution


Resources

Good enough practices in scientific computing, Wilson et al., PLOS Comput. Biol., 2017.
The Turing Way: A Handbook for Reproducible Data Science, Arnold et al., 2022.
What They Forgot to Teach You About R, Bryan and Hester, 2021.
Why should I use the here package when I'm already using projects?, Barrett, 2018.
How to use Quarto for Parameterized Reporting, Mahoney, 2022.
Managing R script dependencies: automagic and renv, Cámara-Menoyo, 2022.
How to Use Git/GitHub with R, Keyes, 2021.
Happy Git and GitHub for the useR, Bryan et al., 2022.
