Background: experience in research and industry, both writing my own scripts and reviewing other people's - what am I looking for?
Note there are other ways of defining "good", e.g. efficiency - not covered here!
Useful whether you work on your own or as part of a team
heatherturner.net/talks/NHS-R2022 @HeathrTurnr
In theory, writing scripts for data analysis makes our work
In practice, we need to adopt good coding and software engineering practices!
Organize code as you would like to find it!
Organize your project as you would like to find it!

    example_project
    ├── data
    │   └── patient_outcomes.csv
    ├── outputs
    │   └── summarized_outcomes.csv
    ├── reports
    │   ├── study_report.Rmd
    │   └── study_report.docx
    └── scripts
        ├── functions.R
        └── analysis.R


    # Patient exposure and event rate
    patient_summary <- patient_outcomes |>
      group_by(STUDYID, COUNTRY, CENTRE, PT) |>
      summarise(d_exposure = max(d_exposure, na.rm = TRUE),
                exposure = d_exposure / 30.4, # convert exposure from days to months
                event_count = sum(!is.na(EVENT)),
                event_rate = event_count / exposure)


    # Pre-processing ----------------------------------------------------------

What would you tell a colleague if you were passing on the project and sat next to them?
Commenting, README
Efficient but complex

    df$lag_value <- c(NA, df$value[-nrow(df)])
    df$lag_value[which(!duplicated(df$group))] <- NA

More readable, slightly less efficient

    df |>
      group_by(group) |>
      mutate(lag_value = dplyr::lag(value))

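The two versions can be compared on a toy data frame. This is a minimal sketch, assuming a hypothetical `df` with columns `group` and `value` whose rows are already sorted by group (the indexing version relies on that ordering). Base R's `ave()` stands in for the dplyr pipeline so that the example needs no extra packages.

```r
# Toy data (hypothetical): two groups, rows already sorted by group
df <- data.frame(group = c("a", "a", "b", "b", "b"), value = 1:5)

# Efficient but complex: shift the whole column,
# then blank out the first row of each group
lag_indexed <- c(NA, df$value[-nrow(df)])
lag_indexed[which(!duplicated(df$group))] <- NA

# Grouped alternative in base R: lag the values within each group
lag_grouped <- ave(df$value, df$group, FUN = function(x) c(NA, head(x, -1)))

# Both approaches agree on sorted data
all.equal(lag_indexed, lag_grouped)
```

The grouped form states the intent (lag within group) directly, which is the readability argument made above.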
Naming, code style, one chunk of code per objective
Readable vs max efficiency
Favour readability over maximum efficiency.
Ideally, code should be understandable without a comment
snake_case vs camelCase
docstring: Create help files from special roxygen2 comments.
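As an illustration of roxygen2-style comments, here is a minimal sketch. The function `days_to_months()` and its arguments are hypothetical, invented for this example; the 30.4 days-per-month default mirrors the conversion used in the exposure summary. The `#'` lines are ordinary comments when the script runs, and roxygen2 can turn them into help files when the function lives in a package.

```r
#' Convert exposure from days to months
#'
#' @param days Numeric vector of exposure in days.
#' @param days_per_month Average month length in days (assumed to be 30.4).
#' @return Numeric vector of exposure in months.
#' @examples
#' days_to_months(c(30.4, 60.8))
days_to_months <- function(days, days_per_month = 30.4) {
  stopifnot(is.numeric(days))
  days / days_per_month
}

days_to_months(c(30.4, 60.8))
```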
In addition to organizing files within a project directory...
here::set_here() to tag the project root with a .here file

    ggsave(here("figs", "mpg_hp.png"))

General issue of hard-coding
Can use askpass::askpass()
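One way to avoid hard-coding a secret is to read it from an environment variable. The sketch below uses only base R; the helper `get_secret()` and the variable name `DB_PASSWORD` are hypothetical names chosen for illustration, not part of any package.

```r
# Hypothetical helper: read a secret from an environment variable
# instead of hard-coding it in the script
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable '", name, "' is not set", call. = FALSE)
  }
  value
}

# Usage: set the variable outside the script (e.g. in .Renviron), then
# db_password <- get_secret("DB_PASSWORD")
```

Failing early with a clear message is a deliberate choice here: an empty password would otherwise only surface later as a harder-to-read connection error.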
R Markdown (rmarkdown.Rmd):

    ---
    title: "`r params$data` Dataset"
    output: html_document
    params:
      data: sleep
    ---

    Summary of the `r params$data` dataset:

    ```{r summary-data, echo = FALSE}
    report_data <- get(params$data)
    summary(report_data)
    ```

Quarto (quarto.qmd):

    ---
    title: "`r params$data` Dataset"
    format: html
    params:
      data: sleep
    ---

    Summary of the `r params$data` dataset:

    ```{r}
    #| label: summary-data
    #| echo: false
    report_data <- get(params$data)
    summary(report_data)
    ```

Data analyst's reusable component
rmarkdown::render("rmarkdown.Rmd", params = list(data = "sleep"))
quarto::quarto_render("quarto.qmd", execute_params = list(data = "women"))
Validate inputs, e.g.

    # check an Excel file exists at given path
    xlsx <- normalizePath(xlsx, winslash = "/", mustWork = TRUE)
    # check a threshold is valid
    stopifnot(is.numeric(threshold) && threshold >= 0)

The assertthat and validate packages can be useful here.
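Base R can also give more informative messages without extra packages. The sketch below assumes R 4.0.0 or later, where `stopifnot()` uses argument names as error messages; `check_threshold()` is a hypothetical helper written for this example.

```r
# Hypothetical validator using named conditions in stopifnot() (R >= 4.0.0),
# so the error message says which check failed
check_threshold <- function(threshold) {
  stopifnot(
    "threshold must be a single number" =
      is.numeric(threshold) && length(threshold) == 1,
    "threshold must be non-negative" = threshold >= 0
  )
  invisible(threshold)
}

check_threshold(0.5)  # passes silently

# Capture the message produced by an invalid value
result <- tryCatch(check_threshold(-1), error = function(e) conditionMessage(e))
result
```

Conditions are checked in order, so the range check only runs once the type check has passed.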
Check results of filters and joins

    tab1 <- patient_outcomes |>
      filter(as.Date(DATE) == report_date & PT == patient)
    if (!nrow(tab1)) warning("No records for ", patient, " on ", report_date)

assertthat: extension of stopifnot with more helpful error messages
validate: for validating input data
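The same idea applies to joins. This is a minimal base R sketch on made-up data: the data frames `patients` and `outcomes` and their values are hypothetical, and `merge()` stands in for whichever join function the analysis uses.

```r
# Toy data (hypothetical): one outcome row has no matching patient record
patients <- data.frame(PT = c(1, 2, 3), COUNTRY = c("UK", "UK", "FR"))
outcomes <- data.frame(PT = c(1, 2, 4), EVENT = c("fall", NA, "fall"))

# Inner join on the patient identifier
joined <- merge(outcomes, patients, by = "PT")

# Check for rows dropped by the join and warn rather than fail silently
unmatched <- setdiff(outcomes$PT, patients$PT)
if (length(unmatched) > 0) {
  warning("No patient record for PT: ", paste(unmatched, collapse = ", "))
}

# Check the join did not duplicate rows
stopifnot(nrow(joined) <= nrow(outcomes))
```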
Most basic:
requirements.txt at the root of the project.
library() calls at the top of .R and .Rmd files.

More advanced tools to specify and restore working environment:
one-off report: groundhog
reusable scripts: automagic
production code: renv
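Short of adopting one of these tools, a script can at least record the package versions it ran with. The sketch below is a base R stand-in, not what groundhog, automagic or renv do: it records versions but does not restore them. The file names and the choice of packages are assumptions made for illustration.

```r
# Packages the script depends on
# (base R packages are used here so the sketch runs anywhere)
pkgs <- c("stats", "utils")

# Look up the installed version of each package
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))

# Write a simple "package version" record alongside the analysis outputs
record_file <- file.path(tempdir(), "package_versions.txt")
writeLines(paste(names(versions), versions), record_file)

# Full session details can be saved in the same way
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```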
Using a (non-base) package is always a trade-off:
| For (e.g.) | Against |
|---|---|
| Better readability | Package update can break code |
| Faster implementation | Dependent on maintainer to fix bugs |
| Better error handling | More setup to reproduce analysis |
Don't aim to minimize dependencies (conflicts with transparency)
Select often-changing packages with care
Avoid trivial dependencies
Copy-pasting is error-prone and leads to over-complex code.
Use custom functions instead, e.g.

    # convert counts to percentages in 2-way table with row/column totals
    make_perc_tab <- function(tab){
      nr <- nrow(tab)
      nc <- ncol(tab)
      tab/tab[nr, nc] * 100
    }

Makes it easier to re-use or iterate, e.g.

    tab_list <- list(tab1, tab2, tab3)
    out <- lapply(tab_list, make_perc_tab)

https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply
make_perc_tab <- function(tab){ tab/sum(tab) * 100 }
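A worked example of the first version of `make_perc_tab()` may help. The input is an assumption made for illustration: a cylinders-by-gears table from the built-in `mtcars` data, with row and column totals added by `addmargins()`.

```r
# Function from the slide: percentages relative to the grand total,
# for a table whose last row and column hold the totals
make_perc_tab <- function(tab) {
  nr <- nrow(tab)
  nc <- ncol(tab)
  tab / tab[nr, nc] * 100
}

# Example input (assumption): cylinders by gears from mtcars,
# with row/column totals added by addmargins()
tab <- addmargins(table(mtcars$cyl, mtcars$gear))
perc <- make_perc_tab(tab)

# The grand-total cell is 100 by construction
perc[nrow(perc), ncol(perc)]
```

The `sum(tab)` variant applies instead when the table has no totals row or column.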
Version control systems (e.g. git) allow us to record changes made to files in a directory.

Acts like a log, with comments on changes made
Facilitates merging work from collaborators
Tests can be used to check that custom functions act as expected, e.g.

    log_2 <- function(x) log(x, 2)

    library(testthat)
    test_that("log_2 returns log to base 2", {
      expect_equal(log_2(2^3), 3)
    })
    ## Test passed

Can create a test suite and run it with test_file("tests.R").
Helps to detect issues introduced by changes to the code.
Pipelines/package development
Can also check Rmd output, e.g. by comparing hashes (MD5 checksums) of HTML files, possibly also testthat::expect_snapshot (untested!)
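Comparing checksums can be sketched with `tools::md5sum()` from base R. The files below are small hypothetical stand-ins written to a temporary directory. Real rendered HTML may contain content that varies between runs, so treat this as an illustration of the mechanism only.

```r
# Hypothetical report outputs: two with identical content, one that differs
old_report <- tempfile(fileext = ".html")
new_report <- tempfile(fileext = ".html")
changed_report <- tempfile(fileext = ".html")
writeLines("<p>Event rate: 0.12</p>", old_report)
writeLines("<p>Event rate: 0.12</p>", new_report)
writeLines("<p>Event rate: 0.15</p>", changed_report)

# tools::md5sum() returns one checksum per file
hashes <- unname(tools::md5sum(c(old_report, new_report, changed_report)))

hashes[1] == hashes[2]  # identical content gives identical checksums
hashes[1] == hashes[3]  # changed content gives a different checksum
```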
Good enough practices in scientific computing, Wilson et al., PLOS Comput. Biol., 2017.
The Turing Way: A Handbook for Reproducible Data Science, Arnold et al., 2022.
What They Forgot to Teach You About R, Bryan and Hester, 2021.
Why should I use the here package when I'm already using projects?, Barrett, 2018.
How to use Quarto for Parameterized Reporting, Mahoney, 2022.
Managing R script dependencies: automagic and renv, Cámara-Menoyo, 2022.
How to Use Git/GitHub with R, Keyes, 2021.
Happy Git and GitHub for the useR, Bryan et al., 2022.