Advanced Rmarkdown YAML headers

Rnotebooks are scientific notebooks for R, somewhat like jupyter for anyone coming from python
R
Author

Richard J. Acton

Published

April 24, 2019

Rnotebooks - What and Why?

For anyone unfamiliar with Rnotebooks here is a quick overview of why you might want to use them more experienced users can skip ahead. Rnotebooks are scientific notebooks for R, somewhat like jupyter for anyone coming from python but baked right into the Rstudio IDE which offers some benefits over the browser based interface of jupyter. It permits you to organise your code, notes, reasoning and references in one place. Combining Rnotebooks with a version management system such as git gives a robustness similar paper lab book records when it comes to seeing what you did and when coupled with dynamism, portability, share-ability and ease of backup of electronic working. Rnotebooks use a simple flavour of markdown with options to render output to HTML and PDF (via LaTeX) formats. Rnotebooks also have big pluses for reproducibility, creating an Rnotebook that does, explains and references your analysis makes it very easy to give to another at least somewhat competent R user and have them re-run your analysis - potentially with their own variants. Reproducibility and verifiability are substantial issues in scientific computing, including my own field of biology. A recent article in PeerJ provides a nice discussion of these issues and a look at what the future of scientific computing notebooks might resemble.

Basic Structure

Raw Rmarkdown looks like this:

---
title: "Example Rnotebook" # a yaml header with document properties and options
---

# Introduction

Mardown formatted text, with __Bold__ and *exciting!* Scientific claims
about $in-line math^2$ at least according to @Smith2007.


::: {.cell}

```{.r .cell-code}
print("some actual R code in chunks")
```
:::


saved with a .Rmd file extension

Rstudio of course adds nice syntax highlighting, and various bells and whistles.

The acctual YAML header stuff

Inline R

You can use inline R in the YAML header of an Rnotebook to produce dynamic content. This takes the general form:

option: "<some R>"

For example you can include the current date with:

date: "2024-03-16"

I’m partial to the YYYY-MM-DD format due to it’s unambiguousness and nice sorting behaviour but you can of course employ format() to render the date in other ways.

Params

The params option allows you to add arguments to your Rnotebook. The params you add to your header are accessible from within the notebook from the immutable params list. Rstudio makes the contents of this list available in interactive sessions so you can use them whilst working on your code not just when you build the notebook. Note that you can reference params in other options (see).

---
params:
  includeThing: TRUE # set the default to TRUE
---
::: {.cell}

```{.r .cell-code}
print(
    paste0(
        "Some R that is only evaluated if and included in the notebook if",
        " params$includeThing is true"
    )
)
```

::: {.cell-output .cell-output-stdout}

```
[1] "Some R that is only evaluated if and included in the notebook if params$includeThing is true"
```


:::
:::

Document Format Options

For an HTML output these are a few of my favourite options. There are numerous additional options described in the outputs section of the manual, setting the depth of the table of contents for example.

---
output:
  html_document:
    df_print: paged       # print paged tables - like the default 'html_notebook' format
    fig_caption: yes
    number_sections: yes  # prepend x.y style numbering to you sections
    toc: yes              # Add a table of contents
    toc_float: yes        # have to TOC float at the side of your HTML page so you do have to keep scrolling to the top
---

For a PDF output pdf_document can be used instead of html_document though my preferred table format for PDF is df_print: kable. More advanced LaTeX customisations can also be used in conjunction with PDF outputs.

Bibliograghy and Citation YAML options

Placing a bibliography option in your Rnotebook’s header and pointing it to a bibtex file containing your citation information permits you to create citations in Rnotebooks using the following syntax: @Smith2016 for an in-line citation e.g. ‘work by Smith et al. 2016 showed that cheese…’ or [@Smith2016] for a reference like this: ‘assertion (Smith et al. 2016)’, or even lists of citations to be contracted where possible given the citation style e.g. [@Smith2016; @Jones2018], (note the semi-colon list separator) yielding something like this: ‘assertion [1-2]’

I frequently use a header that contains code like this:

---
bibliography: "bib.bib"
params:
  bib: "~/Documents/bibtex/library.bib"
---

The reason I do this is my bibliography has the same path relative to my home directory on my laptop, desktop and computing clusters but the absolute paths differ and these headers seem to prefer absolute paths. Thus, if I compose a notebook on one system it won’t execute on another unless I change the path or use a set-up like this to do so dynamically when building the notebook.

I also frequently set the path to my working directory as a parameter to my Rnotebooks and use relative paths to any files I want to load/write in the body of the Rnotebook so as to achieve similar portability between the different system’s I work on as I get with my bibliography files.

The following chunk sets the working directory for when you ‘knit’ your Rnotebook into the desired format in the first line and for interactive sessions in the second.

::: {.cell}

```{.r .cell-code}
knitr::opts_knit$set(root.dir = normalizePath(params$pwd))
setwd(params$pwd)
```
:::

A note on generating your bibtex file(s). I currently use Mendeley as my refernce manager and it has a nice bibtext output option which is automatically updated whenever you sync (On balance I would probably recomend Zotero to someone starting out afresh with reference management but its bibtex output is not quite as convenient as Mendeley’s)

If you have multiple bibliography files this can be done:

bibliography: [multiple.bib, dotbib.bib, files.bib]

Including a csl option allows you to specify a citation style using the .csl format. The specific citations styles of numerous journals in .csl format can be found here. Including the link-citations: yes option will create hyperlinks from the in-text references to the full citations at the end of the document.

By default the bibliography is placed at the very end of your document, so simply placing a # References header at the end of your document helps to separate your bibliography from the body of your text and puts an entry for it in the table of contents. If however you have some appendices to add after your references placing this HTML snippet in your Rnotebook should set the position at which the references will be rendered: <div id="refs"></div>. Helpfully this will set the postion in both HTML and PDF outputs. (This may not work with older versions of pandoc).

Executing an Rnotebook with params

Whilst you can render your Rnotebook with a one line R command from your terminal if you have a lot of params it can get unwieldy, you may also want to be able to reproduce your render at a later time or even submit it as a job to a batch computing manager. To do this you can create simple bash scripts like the one below to render your Rnotebook.

#!/bin/bash
R --no-save --no-restore <<EOF
rmarkdown::render(
  'notebook.Rmd',
  output_file = 'notebook.html',
  params = list(
      bib = "path/to/some/bib.bib"
  )
)
EOF

The --no-save option prevents R from saving your notebook’s R session, and the --no-restore option prevents your Rnotebook from loading whatever random previous R session files you have lying around in your working directory into it’s session.

Full Length Example YAML header

---
title: "A thing I'm Working on - Ideally with a more descriptive title"
author: "Richard J. Acton"
date: "2024-03-16"
output: # Specifying multiple outputs appears to favour the first
  pdf_document:
    toc: yes
    fig_caption: yes
    df_print: kable
  html_document:
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_float: yes
    df_print: paged
  html_notebook: # This determines the RStuido preview format
    fig_caption: yes
    number_sections: yes
    toc: yes
    toc_float: yes
bibliography: "/root/Documents/bibtex/library.bib"
csl: "/root/Documents/bibtex/genomebiology.csl"
link-citations: yes # make citations hyperlinks
linkcolor: blue
---

Resources

Feedback is always welcome, especially if you spot any mistakes.

:::