Background: experience in research and industry, both writing my own scripts and reviewing other people's - what am I looking for?

Note other ways of defining "good", e.g. efficiency - not covered here!

Useful whether you work on your own or as part of a team

Good Coding Practices for Data Analysts

Heather Turner
Research Software Engineering Fellow
University of Warwick

heatherturner.net/talks/NHS-R2022   @HeathrTurnr

Goals

In theory, writing scripts for data analysis makes our work

  • Transparent
  • Reproducible/reusable
  • Maintainable

In practice, we need to adopt good coding and software engineering practices!

Organize code as you would like to find it!

Project Organization

Organize your project as you would like to find it!

  • Organize files by type (data, code, etc) to make it easy to navigate.
  • Name files to reflect the content/function.
example_project
│
└─── data
│   │   patient_outcomes.csv
│
└─── outputs
│   │   summarized_outcomes.csv
│
└─── reports
│   │   study_report.Rmd
│   │   study_report.docx
│
└─── scripts
│   │   functions.R
│   │   analysis.R
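
A layout like this can be set up from R itself. The following is a minimal sketch using base R only; the folder and file names follow the tree above, and tempdir() is an assumption made so that the example can be run safely anywhere.

```r
# Create the example project skeleton (folder names follow the tree above).
# tempdir() is used so the sketch can be run safely; in practice this would
# be the directory where the project should live.
project_root <- file.path(tempdir(), "example_project")
subfolders <- c("data", "outputs", "reports", "scripts")

for (folder in subfolders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Add empty placeholder scripts
file.create(file.path(project_root, "scripts", c("functions.R", "analysis.R")))

# List what was created
list.files(project_root, recursive = TRUE)
```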

Documentation

  • Put a README at the top level of your project folder
  • Comment your code to describe its purpose
# Patient exposure and event rate
patient_summary <- patient_outcomes |>
  group_by(STUDYID, COUNTRY, CENTRE, PT) |>
  summarise(d_exposure = max(d_exposure, na.rm = TRUE),
            # convert exposure from days to months (~30.4 days per month)
            exposure = d_exposure / 30.4,
            event_count = sum(!is.na(EVENT)),
            event_rate = event_count / exposure)
  • In RStudio, use Ctrl/⌘ + Shift + R to insert a section
# Pre-processing ----------------------------------------------------------

What would you tell a colleague if you were passing on the project and sat next to them?

Commenting, README

  • could add slide on sectioning with picture of RStudio outline, if time/space
  • could add comments on .Rmd vs .R (chunk names/main text can replace some comments; markdown sectioning replaces comment sectioning)

Readable code

  • Use meaningful names
  • Keep line length <80 characters and use white space around operators
  • Use one chunk of code per objective
  • Prefer readability over maximum efficiency

Efficient but complex

df$lag_value <- c(NA, df$value[-nrow(df)])
df$lag_value[which(!duplicated(df$group))] <- NA

More readable, slightly less efficient

df <- df |>
  group_by(group) |>
  mutate(lag_value = dplyr::lag(value))
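
To see why the base R version needs more effort to read, it helps to step through it on a small example. The data frame below is made up for illustration; note that this approach relies on the rows already being sorted by group.

```r
# Toy data: already sorted by group, as the base R version requires
df <- data.frame(group = c("a", "a", "b", "b", "b"),
                 value = 1:5)

# Shift all values down by one row ...
df$lag_value <- c(NA, df$value[-nrow(df)])
# ... then blank out the first row of each group, where the "previous"
# value would otherwise come from a different group
df$lag_value[which(!duplicated(df$group))] <- NA

df$lag_value
# NA  1 NA  3  4
```

The grouped dplyr version states the same intent directly: lag the value within each group.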

Naming, code style, one chunk of code per objective

Readable vs max efficiency: favour readability over maximum efficiency.

Ideally, code should be understandable without a comment.

Going further on transparency

  • Style guides
  • Code review
  • Pair programming
  • Function documentation using the docstring package
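
As a rough sketch of the last point, the docstring package reads roxygen2-style comments placed at the start of a function body. The function days_to_months() is a hypothetical helper invented for this example, and the docstring calls are left commented out so that the block runs without the package installed.

```r
# Hypothetical helper, documented with roxygen2-style comments placed
# inside the function body, which is where docstring looks for them
days_to_months <- function(days) {
  #' Convert days to months
  #'
  #' @param days Numeric vector of durations in days.
  #' @return Numeric vector of durations in months, assuming an average
  #'   month of 30.4 days.
  days / 30.4
}

# With the docstring package installed, the help page can then be viewed:
# library(docstring)
# docstring(days_to_months)

days_to_months(60.8)
```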

docstring: Create help files from special roxygen2 comments.

Reproducibility/Reusability

Project-oriented workflow

In addition to organizing files within a project directory...

  1. Set the working directory to the project root
    • Use RStudio Projects
    • Use here::set_here() to tag the project root with a .here file
  2. Use file paths relative to the project root, to make your project portable
    • The here package makes this easy, e.g.
      ggsave(here("figs", "mpg_hp.png"))
    • If you need to use paths from outside the project, set these once at the start
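
Put together, the top of a script might look something like the sketch below. The file names under data and figs come from the examples in this talk; the external location "//shared-drive/raw_data" and the raw file name are made up for illustration.

```r
library(here)

# Paths inside the project are built relative to the project root
outcomes_path <- here("data", "patient_outcomes.csv")
figure_path <- here("figs", "mpg_hp.png")

# A path from outside the project, set once at the start of the script.
# "//shared-drive/raw_data" is a made-up location for illustration.
raw_data_dir <- "//shared-drive/raw_data"
raw_outcomes_path <- file.path(raw_data_dir, "patient_outcomes_raw.csv")
```

If the external location changes, only raw_data_dir needs updating.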

General issue of hard-coding

Can use askpass::askpass() to prompt for a password instead of hard-coding it

Parameterized R Markdown/Quarto

R Markdown:

---
title: "`r params$data` Dataset"
output: html_document
params:
  data: sleep
---
Summary of the `r params$data` dataset:
```{r summary-data, echo = FALSE}
report_data <- get(params$data)
summary(report_data)
```

Quarto:

---
title: "`r params$data` Dataset"
format: html
params:
  data: sleep
---
Summary of the `r params$data` dataset:
```{r}
#| label: summary-data
#| echo: false
report_data <- get(params$data)
summary(report_data)
```

Data analyst's reusable component

Render with custom parameters

rmarkdown::render("rmarkdown.Rmd",
                  params = list(data = "sleep"))
quarto::quarto_render("quarto.qmd",
                      execute_params = list(data = "women"))
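
The same call can be repeated to produce one report per dataset. This is a sketch only: it assumes the parameterized file rmarkdown.Rmd from the previous slide is in the working directory, and the output file names are invented for the example. The guard means nothing is rendered if the file or the rmarkdown package is missing.

```r
datasets <- c("sleep", "women")

# Hypothetical naming scheme: one HTML report per dataset
output_files <- paste0("report_", datasets, ".html")

if (file.exists("rmarkdown.Rmd") &&
    requireNamespace("rmarkdown", quietly = TRUE)) {
  for (i in seq_along(datasets)) {
    rmarkdown::render("rmarkdown.Rmd",
                      params = list(data = datasets[i]),
                      output_file = output_files[i])
  }
}
```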

Defensive programming

Validate inputs, e.g.

# check an Excel file exists at the given path
xlsx <- normalizePath(xlsx, winslash = "/", mustWork = TRUE)
# check a threshold is valid
stopifnot(is.numeric(threshold), threshold >= 0)

The assertthat and validate packages can be useful here.
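
Where a more informative message is wanted without adding a dependency, the check can be wrapped in a small function. The helper check_threshold() below is hypothetical, written in base R as one possible approach.

```r
# Hypothetical helper that validates a threshold before it is used
check_threshold <- function(threshold) {
  if (!is.numeric(threshold) || length(threshold) != 1) {
    stop("`threshold` must be a single number", call. = FALSE)
  }
  if (is.na(threshold) || threshold < 0) {
    stop("`threshold` must be non-negative, not ", threshold, call. = FALSE)
  }
  invisible(threshold)
}

check_threshold(0.5)                 # passes silently
result <- try(check_threshold(-1), silent = TRUE)
inherits(result, "try-error")        # TRUE: invalid input was rejected
```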

Check results of filters and joins

tab1 <- patient_outcomes |>
  filter(as.Date(DATE) == report_date & PT == patient)
if (nrow(tab1) == 0) {
  warning("No records for ", patient, " on ", report_date)
}
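
Joins can be checked in a similar spirit. The sketch below uses base R's merge() on two small made-up tables (the column names PT, CENTRE and EVENT follow the earlier examples) and checks that a left join has not added or dropped rows.

```r
patients <- data.frame(PT = c("P01", "P02", "P03"),
                       CENTRE = c("A", "A", "B"))
outcomes <- data.frame(PT = c("P01", "P02", "P04"),
                       EVENT = c("fall", NA, "fall"))

# Left join: keep every patient, adding outcome columns where available
joined <- merge(patients, outcomes, by = "PT", all.x = TRUE)

# A left join on a unique key should not change the number of rows
stopifnot(nrow(joined) == nrow(patients))

# Flag patients with no matching outcome record
unmatched <- setdiff(patients$PT, outcomes$PT)
if (length(unmatched) > 0) {
  warning("No outcome records for: ", paste(unmatched, collapse = ", "))
}
```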

assertthat: extension of stopifnot() with more helpful error messages

validate: for validating input data

Package management

Most basic:

  1. Add a requirements.txt at the root of the project.
  2. Put library() calls at the top of .R and .Rmd files.

More advanced tools to specify and restore working environment:

  1. One-off analysis: use groundhog to specify R, packages & dependencies by a date.
  2. Repeated analysis: use automagic to install package versions specified in deps.yaml.
  3. Production code: use renv to specify the versions of R, packages & dependencies.
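
For the most basic option, a requirements file can be written by hand or generated from R. The sketch below is one possible way to generate it; the "package version" line format is an assumption, as no format is specified here, and base packages are listed only so that the example runs anywhere.

```r
# Packages used by the project; base packages are listed here only so the
# sketch runs anywhere. Replace with the packages your scripts load.
packages <- c("stats", "utils")

versions <- vapply(packages,
                   function(pkg) as.character(packageVersion(pkg)),
                   character(1))

# One "package version" pair per line (an assumed format)
requirements <- paste(packages, versions)

# Written to a temporary folder here; in a project, use the project root
requirements_file <- file.path(tempdir(), "requirements.txt")
writeLines(requirements, requirements_file)
readLines(requirements_file)
```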

one-off report: groundhog; reusable scripts: automagic; production code: renv

Choose dependencies carefully

Using a (non-base) package is always a trade-off:

For (e.g.)              Against
Better readability      Package update can break code
Faster implementation   Dependent on maintainer to fix bugs
Better error handling   More setup to reproduce analysis
  • How much of the functionality are you using?
  • How mature/well-maintained is the package?
  • Are you using it across multiple projects?

The aim is not to minimize dependencies (that conflicts with transparency)

Select often-changing packages with care

Avoid trivial dependencies

Don't Repeat Yourself

Copy-pasting is error-prone and leads to over-complex code.

Use custom functions instead, e.g.

# convert counts to percentages in 2-way table with row/column totals
make_perc_tab <- function(tab) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  tab / tab[nr, nc] * 100
}

Makes it easier to re-use or iterate, e.g.

tab_list <- list(tab1, tab2, tab3)
out <- lapply(tab_list, make_perc_tab)
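
As a quick worked example, the function can be tried on a table built from the built-in mtcars data. The choice of dataset and variables is arbitrary; addmargins() adds the row and column totals that the function expects, so the grand total lands in the bottom-right cell.

```r
make_perc_tab <- function(tab) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  tab / tab[nr, nc] * 100
}

# Two-way table of counts from the built-in mtcars data, with row and
# column totals added so the grand total sits in the bottom-right cell
counts <- addmargins(table(cyl = mtcars$cyl, am = mtcars$am))

perc <- make_perc_tab(counts)
perc["Sum", "Sum"]
# 100
```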

https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply

make_perc_tab <- function(tab){ tab/sum(tab) * 100 }

Version control

Version control systems (e.g. git) allow us to record changes made to files in a directory.

Screenshot of commit history of a git repository, showing three commits: "added README file to start off", "added data for KHK project", "pre-processed KHK data"

  • Avoid saving multiple variants or commenting out old code
  • Commits can be restored temporarily or permanently
  • Syncing with a remote repository (e.g. on GitHub) provides a backup

Acts like a log, with comments on changes made

Facilitates merging work from collaborators

Testing

Tests can be used to check that custom functions act as expected, e.g.

log_2 <- function(x) log(x, 2)
library(testthat)
test_that("log_2 returns log to base 2", {
  expect_equal(log_2(2^3), 3)
})
## Test passed 🌈

You can create a test suite and run it with test_file("tests.R").

Helps to detect issues introduced by changes to the code.
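
A test file might cover a few typical inputs and an edge case. The sketch below shows what a tests.R could contain for log_2(); the particular cases are illustrative choices. In a real project the function would normally be defined in its own script (e.g. scripts/functions.R) and loaded with source() rather than redefined in the test file.

```r
library(testthat)

# Function under test, repeated here so the example is self-contained
log_2 <- function(x) log(x, 2)

test_that("log_2 handles typical inputs", {
  expect_equal(log_2(1), 0)
  expect_equal(log_2(c(2, 4, 8)), c(1, 2, 3))
})

test_that("log_2 warns about negative inputs", {
  expect_warning(log_2(-1))
})
```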

Pipelines/package development

Can also check Rmd output, e.g. by comparing hashes (MD5 checksums) of HTML files, possibly also testthat::expect_snapshot (untested!)

Going further on maintainability

  • Package development
    • Functions, documentation and tests in a shareable format
    • Easier to use across projects
  • Using a repository host, e.g. GitHub
    • Use issues: note and discuss changes to make
    • Teamwork: work asynchronously and merge changes
    • Publish your code
    • Encourage external contribution

https://mgimond.github.io/rug_2019_12/Index.html
