Zero to Hero

Forwards Teaching Team

@R_Forwards
https://forwards.github.io/

updated: 2020-03-10

1 / 66

What can R do?2 / 66

Data Management

Data Import

Reshaping

Data Analysis

Explaratory Data Analysis

Statistical Modelling

Advanced Statistics

3 / 66

Reporting

Dashboard

Markdown

Programming

Connect to other languages

Packages

4 / 66

The R Ecosystem

Base

base

create R objects
summaries
math functions

recommended

statistics
graphics
example data

Contributed Packages

CRAN

cran.r-project.org

main repos
~13000 pkgs

bioconductor.org

bioinformatics
>1500 pkgs

GitHub

github.com

devel pkgs
GitHub-only pkgs

5 / 66

Using R6 / 66

R Demo

We can type commands directly into the R console

3 + 4
?"+" #look up help for "+"
x <- 3 + 4 ; y <- log(x) 
ls() # list of objects in the current workspace
rm(x)
data() # find out what standard data sets there are
plot(iris) # plot Fisher's iris data

7 / 66

Using RStudio8 / 66

RStudio IDE

9 / 66

R Studio Features

Features provided by RStudio include:

syntax highlighting, code completion, smart indentation
interactively send code chunks from editor to R
organise multiple scripts, help files, plots
search code and help files

10 / 66

RStudio Shortcuts from the R Console

RStudio provides a few shortcuts to help write code in the R console

↑/↓ go back/forward through history one command at a time
Ctrl/⌘ + ↑ review recent history and select command
Tab view possible completions for part-written expression

Code completion (using Tab) is also provided in the source editor

11 / 66

R Scripts

Text files saved with a .R suffix are recognised as R code.

Code can be sent directly from the source editor as follows

Ctrl/⌘ + ↵ or run current line
- run multiple lines by selecting first
Ctrl/⌘ + Shift + ↵ or . Run the script from start to finish.

12 / 66

RStudio Demo

View(iris)
# showing code completion, running code, indentation
sum(3, 4)
summary(iris, maxsum = 2, 
        digits = 2)
pbirthday(40, classes = 365, coincident = 3)

13 / 66

Show how to change background so not white

Why R Scipts

Writing an R script for an analysis has several advantages over a graphical user interface (GUI)

it provides a record of the exact approach used in an analysis
it enables the analysis to be easily reproduced and modified
it allows greater control

14 / 66

Good Practices for R Scripts

Organising your R script well will help you and others understand and use it.

Add comment or two at start to describe purpose of script
Load required data and packages at the start
Avoid hard-coding: define variables such as file paths early on
Give functions and variable meaningful names
use ### or #--- to separate sections (in RStudio Code > Insert Section)

15 / 66

Your Turn!

In RStudio, open a new R Script.

Try typing some of the code from the R/R Studio demo and using some of the shortcuts to complete your commands and run them.

16 / 66

Working with Data in R17 / 66

Data Structures

Data structures are the building blocks of code. In R there are four main types of structure:

vectors and factors
matrices and arrays
lists
data frames

The first and the last are sufficient to get started.

18 / 66

Vectors

A single number is a special case of a numeric vector. Vectors of length greater than one can be created using the concatenate function, c.

x <- c(1, 3, 6)

The elements of the vector must be of the same type: common types are numeric, character and logical

x <- 1:3
x

# [1] 1 2 3

y <- c("red", "yellow", "green")
y

# [1] "red"    "yellow" "green"

z <- c(TRUE, FALSE)

Missing values (of any type) are represented by the symbol NA.

19 / 66

Data Frames

Data sets are stored in R as data frames. These are structured as a list of objects, typically vectors, of the same length

str(iris)

# 'data.frame':    150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Here Species is a factor, a special data structure for categorial variables.

20 / 66

Creating a Data Frame

x <- 1:3
y <- c("red", "yellow", "green")
mydata <- data.frame(x, y)
mydata

#   x      y
# 1 1    red
# 2 2 yellow
# 3 3  green

21 / 66

Install Packages

Most day-to-day work will require at least one contributed package.

CRAN packages can be installed from the Packages tab

Alternatively, they can be installed directly via install.packages()

install.packages("survival", dependencies = TRUE)

22 / 66

Using Packages

To use an installed package in your code, you must first load it from your package library

library(survival)

Sometimes an RStudio feature will require a contributed package. A pop-up will ask permission to install the package the first time, after that RStudio will load it automatically.

23 / 66

Data Input via Import Dataset

Using the Import Dataset dialog in RStudio

we can import files stored locally or online in the following formats:

.txt/.csv via read_delim/read_csv from readr.
.xlsx via read_excel from readxl.
.sav/.por , .sas7bdat and .dta via read_spss, read_sas and read_stata respectively from haven.

Most of these functions also allow files to be compressed, e.g. as .zip.

24 / 66

Your Turn

Use the Import Dataset button, import a data set from the data sets folder of the workshop materials. RStudio will ask to install packages as required.

Try changing some of the import options to see how that changes the preview of the data and the import code.

Install the skimr package and use the skim function to summarise the data set.

25 / 66

Tibbles

The functions used by Import Dataset return data frames of class "tbl_df", aka tibbles. The main differences are

	data.frame	tibble
Printing (default)	Whole table	10 rows; columns to fit Prints column type
Subsetting	`dat[, 1]`, `dat$X1`, `dat[[1]]` all return vector	`dat[,1]` returns tibble `dat$X1`, `dat[[1]]` return vector
Strings	Converted to factor (default)	Left as character
Variable names	Made syntactically valid e.g. `Full name` -> `Full.name`	Left as is use e.g. dat$`Full name `

26 / 66

Data Input via Code

The rio package provides a common interface to the functions used by Import Dataset as well as many others.

The data format is automatically recognised from the file extension. To read the data in as a tibble, we use the setclass argument.

library(rio)
compsci <- import("compsci.csv", setclass = "tibble")
cyclist <- import("cyclist.xlsx", setclass = "tibble")

See ?rio for the underlying functions used for each format and the corresponding optional arguments, e.g. the skip argument to read_excel to skip a certain number of rows.

27 / 66

Workflow28 / 66

R Studio Projects

An Rstudio project is a context for work on a specific project

automatically sets working directory to project root
has separate workspace and command history
works well with version control via git or svn

Create a project from a new/existing directory via the File Menu or the New Project button

Switch project, or open a different project in a new RStudio instance via the Project menu.

29 / 66

Project Structure

An R project structure helps you kickstart the project you undertake as fast as possible, using a standardized form.

A typical project can be organised into folders such as data, scripts, results and reports.

The scripts should be named to reflect the content/function:

requirements.R = includes required packages
functions.R = includes custom functions
custom_theme.R = includes a custom theme to create graphs
analysis.R = main analysis script

A README file can be used to describe the project and the files.

30 / 66

R Markdown Documents

R markdown documents (.Rmd) intersperse code chunks (R, Python, Julia, C++, SQL) with markdown text

YAML header

---
title: "Report"
output: html_document
---

Markdown text

## First section 
This report summarises the `cars` dataset.

R code chunk

```{r summary-cars}
summary(cars)
```

31 / 66

Rendering

The .Rmd file can be rendered to produce a document (HTML, PDF, docx) integrating the code output.

32 / 66

Demo with standard .Rmd template (File > New R markdown)

Markdown Reference

You can use markdown to format text, create bulleted lists, tables and more.

RStudio provides a quick reference:

33 / 66

Your Turn

Copy the infant directory from the workshop materials to your laptop. Open infant.Rproj, which is an RStudio project file.

From the Files tab, open the infant.Rmd R markdown file.

In the setup chunk, write some code to load the rio package.

In the chunk labelled import-data, write some code that will import the infant.xlsx file and create a tibble named infant. Hint use the import function from the rio package.

Run the code in the import-data chunk (the setup chunk is run automatically). Use View() in the console to inspect the result.

34 / 66

Data Wrangling35 / 66

Data Pre-processing

The imported data are a subset of records from a US Child Health and Development Study, corresponding to live births of a single male foetus.

The data requires a lot of pre-processing

converting numeric variables to factors
converting coded values to missing values
filtering rows and selecting columns

The dplyr package can be used to help with many of these steps

36 / 66

dplyr

The dplyr package provides the following key functions to operate on data frames

filter()
arrange()
select() (and rename())
distinct()
mutate() (and transmute())
summarise()

37 / 66

`filter()`

filter() selects rows of data by criteria

library(dplyr)
infant <- filter(infant, smoke != 9 & age > 18)

Building block	R code
Binary comparisons	`>`, `<`, `==`, `<=`, `>=`, `!=`
Logical operators	or `\|` and `&`, not `!`
Value matching	e.g. `x %in% 6:9`
Missing indicator	e.g. `is.na(x)`

38 / 66

`select()`

select() selects variables from the data frame.

Select named columns:

select(infant, gestation, sex)

Select a sequence of columns, along with an individual column

select(infant, pluralty:gestation, parity)

Select all columns except specified columns

select(infant, -(id:outcome), -sex)

39 / 66

`mutate()`

mutate() computes new columns based on existing columns. Re-using an existing name replaces the old variable. For example we can convert the mother's weight from pounds to kilograms

mutate(infant, 
       wt = ifelse(wt == 999, NA, wt),
       wt = wt * 0.4536)

To only keep the computed columns, use transmute() in place of mutate().

40 / 66

Your Turn

In the select-variables chunk, use select() to remove the redundant variables pluralty, outcome and sex, as well as characteristics of the father drace, ... , dwt. Over-write the infant data frame.

The variable gestation gives the length of the pregnancy in days. In the filter-variables chunk, filter the data to exclude extremely premature babies (gestation less than 28 weeks) and extremely late babies (gestation more than 52 weeks).

The value 999 has been used to code an unknown value in the ht variable. In the mutate-variables chunk, use mutate to update ht, replacing 999 with NA.

41 / 66

Chaining

We can use %>% to pipe the data from one step to the next

infant <- infant %>% 
    filter(smoke != 9 & age > 18) %>%
    select(-(id:outcome), -sex) %>%
    mutate(wt = ifelse(wt == 999, NA, wt),
           wt = wt * 0.4536)

Any function with data as the first argument can be added to the data pipeline.

42 / 66

`summarise()`

summarise() function is for computing single number summaries of variables, so typically comes at the end of a pipeline or in a separate step

stats <- summarise(infant,
                   `Mean mother's weight (kg)` = mean(wt, na.rm = TRUE),
                   `Sample size` = sum(!is.na(wt)),
                   `Total sample size` = n())
stats

# # A tibble: 1 x 3
#   `Mean mother's weight (kg)` `Sample size` `Total sample size`
#                         <dbl>         <int>               <int>
# 1                        58.4          1167                1203

In an R markdown document we can convert the result to a markdown table using kable from the knitr package

kable(stats, align = "ccc")

43 / 66

Grouped Operations

group_by() sets grouping on a data frame. This is most useful for summarise()

infant <- infant %>% filter(!race %in% c(10, 99)) %>%
    mutate(race = recode(race, `6` = "Latino", `7` = "Black", `8` = "Asian",
                         `9` = "Mixed", .default = "White"))
res <- infant %>% group_by(`Ethnic group` = race) %>%
    summarise(`Mean weight (kg)` = mean(wt, na.rm = TRUE))
kable(res, align = "lc")

Ethnic group	Mean weight (kg)
Asian	49.84
Black	62.63
Latino	54.12
Mixed	58.46
White	57.86

44 / 66

Your Turn

cut() (in the base package) can be used to create a factor from a continuous variable:

cut(c(0.12, 1.37, 0.4, 2.3), breaks = c(0, 1, 2, 3))

# [1] (0,1] (1,2] (0,1] (2,3]
# Levels: (0,1] (1,2] (2,3]

where (0, 1] is the set of numbers > 0 and ≤ 1, etc.

The infant birth weight bwt is given in ounces. In the table-bwt chunk use mutate() to create a new factor called Birth weight (g), by converting bwt to grams through multiplication by 28.35 and then converting the result to a factor with the following categories: (1500, 2000], (2000, 2500], (2500, 3000], (3000, 3500], and (3500, 5000].

An infant is categorised as low weight if its birth weight is ≤ 2500 grams, regardless of gestation. Use group_by() to group the data by the new weight factor, then pipe to summarise() to count the number of infants in each weight category.

45 / 66

Plotting in R46 / 66

Plots

In RStudio, graphs are displayed in the Plots window. The plot is sized to fit the window and will be rescaled if the size of the window is changed.

Back and forward arrows allow you to navigate through graphs that have been plotted.

Graphs can be saved in various formats using the Export drop down menu, which also has an option to copy to the clipboard.

First we consider "no-frills" plots, for quick exploratory plots.

47 / 66

Boxplots

boxplot(infant$bwt)
with(infant, boxplot(bwt ~ race))

Left: boxplot of of birth weight. Right: boxplots of birth weight for each ethnic group.

48 / 66

Histogram/Density

hist(infant$bwt)
plot(density(infant$bwt))

Left: histogram of of birth weight. Right: density curve of birth weight.

49 / 66

Scatterplots

infant <- infant %>%
    mutate(gestation = ifelse(gestation == 999, NA, gestation))
plot(bwt ~ gestation, data = infant)

# plot(infant$gestation, infant$bwt); plot(x = infant$gestation, y = infant$bwt)

The two observations with very low gestation look suspect, so we will exclude them from the rest of the analysis

infant <- infant %>%
    filter(gestation > 200)

50 / 66

ggplot2

ggplot2 is a package that produces plots based on the grammar of graphics (Leland Wilkinson), the idea that you can build every graph from the same components

a data set
a co-ordinate system
and geoms visual marks that represent data points

To display values, you map variables in the data to visual properties of the geom (aesthetics), like size, colour, x-axis and y-axis.

It provides a unified system for graphics, simplifies tasks such as adding legends, and is fully customisable.

51 / 66

ggplot2 boxplots

library(ggplot2)
ggplot(infant, aes(y = bwt * 28.35, x = race)) +
    geom_boxplot() + 
    ylab("Birth Weight (g)")

52 / 66

ggplot2 with Colour

infant <- mutate(infant, smoke = recode_factor(smoke, 
                                               `1` = "currently", `2`= "until pregnancy", 
                                               `3` = "used to", `0` = "never"))
p <- ggplot(infant, aes(y = bwt * 28.35, x = gestation, color = smoke)) +
    geom_point() + labs(x = "Gestation", y = "Birth Weight (g)", color = "Smoking status")
p

53 / 66

ggplot2 Facetting

p + facet_wrap(~ smoke) + guides(color = FALSE)

54 / 66

Plots in R Markdown

The chunk options enable you to arrange plots (caption used as alt text in HTML slides)

```{r, fig.show = "hold", fig.align = "center", fig.width = 4, fig.height = 4,
 ¬ fig.cap = "Left: ggplot histogram. Right: ggplot density curve", out.width = "30%"}
ggplot(infant, aes(x = bwt * 28.35)) + geom_histogram(bins = 20)
ggplot(infant, aes(x = bwt * 28.35)) + geom_density()
```

Left: ggplot histogram. Right: ggplot density curve

55 / 66

Your Turn

Inspect the association between birth weight (bwt) and mother’s age (age) with a ggplot. Use geom_smooth(method = "lm") to add a linear smooth with confidence interval.

Does the association depend on smoking status? (Hint: assign one or more aesthetics to smoke).

56 / 66

Linear Modelling57 / 66

Simple Linear Modelling

The lm() function can be used to fit linear models, e.g. a simple linear model of bwt vs gestation

model1 <- lm(bwt ~ gestation, data = infant)
model1

# 
# Call:
# lm(formula = bwt ~ gestation, data = infant)
# 
# Coefficients:
# (Intercept)    gestation  
#     -22.171        0.507

Diagnostic plots can be created by plotting the model object

plot(model1)

(result on next slide).

58 / 66

Top left: residuals vs fitted scatterplot. Top right: Square root of standardized residuals vs fitted values. Bottom left: Q-Q plot. Bottom right: standardised residuals vs leverage.

Top left: residuals vs fitted scatterplot. Top right: Square root of standardized residuals vs fitted values. Bottom left: Q-Q plot. Bottom right: standardised residuals vs leverage.

59 / 66

Standard summary

summary(model1)

# 
# Call:
# lm(formula = bwt ~ gestation, data = infant)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -49.42 -10.98   0.09  10.05  53.58 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -22.1707     8.8608    -2.5    0.012 *  
# gestation     0.5074     0.0316    16.0   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 16.6 on 1173 degrees of freedom
# Multiple R-squared:  0.18,    Adjusted R-squared:  0.179 
# F-statistic:  257 on 1 and 1173 DF,  p-value: <2e-16

60 / 66

broom package

tidy() and glance() from broom extract coefficient and model statistics

library(broom)
tidy(model1)

# # A tibble: 2 x 5
#   term        estimate std.error statistic  p.value
#   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 (Intercept)  -22.2      8.86       -2.50 1.25e- 2
# 2 gestation      0.507    0.0316     16.0  1.80e-52

glance(model1)

# # A tibble: 1 x 11
#   r.squared adj.r.squared sigma statistic  p.value    df logLik
#       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl>
# 1     0.180         0.179  16.6      257. 1.80e-52     2 -4965.
# # … with 4 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
# #   df.residual <int>

augment() can be used to augment the original data with fitted values etc.

61 / 66

Multiple Regression

Additional terms can be added to the model via +

y ~ a + b + c

We can use update to update a fitted model, changing any aspect of the model call. We can update the formula by adding or subtracting terms.

update(fit, . ~ . + d - c)

All linear models have an intercept by default, it can be removed with -1.

anova() can applied to a single model to assess the significance of each term or nested models to assess the significance of additional terms.

62 / 66

Your Turn

Use lm() to model birth weight against both gestation and the key maternal characteristics weight (wt), height (ht), parity and race. Use anova() to inspect the results.

Now add the smoking variables: smoking status (smoke), time since quitting (time) and number of cigarettes smoked per day (number). Inspect the results - are all the smoking variables significant?

Use anova(fit1, fit2) to produce compare the two models.

63 / 66

Learning more/getting support

RStudio cheatsheets (rmarkdown, dplyr, ggplot2) https://www.rstudio.com/resources/cheatsheets/

R for Data Science (data handling, basic programming and modelling, R markdown) http://r4ds.had.co.nz

RStudio Community (friendly forum focused on "tidyverse" packages, including dplyr, ggplot2) https://community.rstudio.com/

64 / 66

Going further

Quick-R (basic "how to"s from data input to advanced statistics) http://www.statmethods.net/

Task views http://cran.r-project.org/web/views/ provide an overview of R’s support for different topics, e.g. Bayesian inference, survival analysis, ...

http://www.rseek.org/ – search the web, online manuals, task views, mailing lists and functions.

Package vignettes, many books, e.g. on https://bookdown.org/.

65 / 66

License

Package Development Workshop by Forwards is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Zero to Hero

Forwards Teaching Team

@R_Forwards https://forwards.github.io/

updated: 2020-03-10

What can R do?

The R Ecosystem

Using R

R Demo

Using RStudio

RStudio IDE

R Studio Features

RStudio Shortcuts from the R Console

R Scripts

RStudio Demo

Why R Scipts

Good Practices for R Scripts

Your Turn!

Working with Data in R

Data Structures

Vectors

Data Frames

Creating a Data Frame

Install Packages

Using Packages

Data Input via Import Dataset

Your Turn

Tibbles

Data Input via Code

Workflow

R Studio Projects

Project Structure

R Markdown Documents

Rendering

Markdown Reference

Your Turn

Data Wrangling

Data Pre-processing

dplyr

filter()

select()

mutate()

Your Turn

Chaining

summarise()

Grouped Operations

Your Turn

Plotting in R

Plots

Boxplots

Histogram/Density

Scatterplots

ggplot2

ggplot2 boxplots

ggplot2 with Colour

ggplot2 Facetting

Plots in R Markdown

Your Turn

Linear Modelling

Simple Linear Modelling

Standard summary

broom package

Multiple Regression

Your Turn

Learning more/getting support

Going further

License

What can R do?

Help

@R_Forwards
https://forwards.github.io/

`filter()`

`select()`

`mutate()`

`summarise()`