I was recently introduced to the R package {targets} as part of learning how to develop software. As I continue to use it, I’ve been blown away by how much easier my workflow has become, even outside of software development, and I regret not having known of it sooner. It would have been nice to use this workflow for my Master’s thesis, which involved a large amount of complex and repeated analyses. So here are my notes on the targets workflow so far: why use it, how to use it (especially if you’ve never used it before), and the resources I found most useful.
Why use a targets workflow?
Most of us doing statistical analysis work on complex projects with moving parts. What I mean by this is: say I have a workflow that involves loading the data, doing some exploratory data analysis, fitting a model to the data, generating a plot of the data and the fitted model, and presenting the model coefficients in a table. It’s nothing too complicated, but it is complex: if one part changes, the rest of the project changes too.
For example, perhaps the data is updated and we want to re-run the whole analysis from start to finish. Or perhaps we decide that the model isn’t a good fit, so we fit a few other models and see which one is better; in this instance, only the model, plot, and table would change, but the rest of the project stays the same.
All this is fine if you have well-annotated code and know which parts of the script to re-run. But what if you accidentally run the wrong parts of the script, parts that are computationally intensive and time-consuming? Or what if your table takes a while to re-render because you’ve included so many variables, when you didn’t need it at all because you only wanted the plot? Or even if you’re just tweaking a few variables and then making a plot: if you miss renaming a variable earlier in your script, your plot might not be generated correctly because it depends on that variable.
It’s in situations like these that {targets} can help you. Using a targets workflow means you’ll have it all laid out in front of you instead of needing to hunt through your script for specific bits of code and their dependencies. It takes the headache out of figuring out which parts of the analysis depend on which variables and functions. Then when you change a variable name (or any part of the project, really), you’ll know exactly which parts of the analysis need to be re-run.
To summarise, I find a targets workflow especially useful in the following scenarios:
- if you’re doing something complex that has a lot of moving parts;
- if said moving parts will change as your project progresses; and/or
- if you’d like to re-run your analysis with different inputs (e.g. a new dataset), or to create different outputs (e.g. plots and tables)—especially if you’re re-running your analysis a lot!
How to use targets
When you’re first starting to program in R—especially if you’re mainly doing statistical analysis—a targets workflow might seem an unnecessarily complex way of going about things. When I first came across the targets workflow, I felt overwhelmed and didn’t understand how it could be useful for my workflow. So the way I’d recommend going about it is to take an existing project, and to convert the scripts in that project into a targets workflow.
The quick walkthrough in the user manual is an excellent example of how to use the targets workflow; I highly recommend reading it and going through the worked example. It does a much better job than I can of giving a reproducible example of how to use targets. Then I recommend converting your own project to a targets workflow using that walkthrough, or this quick tutorial.
I’ll walk you through a conceptual overview of how I use targets. However, I highly recommend you first work through the two tutorials above.
An example workflow
So you can follow along, I’ve created a repo that demos a simple targets workflow here.
First, I make sure that I understand my workflow from beginning to end. This overview is the main advantage of using a targets workflow: you get a bird’s-eye view of your analysis and can see where changes are needed. In this example, I’ve broken down my analysis into the following steps:
- read in a CSV file of data;
- summarise data and generate a table;
- fit models; and finally,
- generate a plot of the data and fitted model.
Once I’ve done that and would like to start creating a targets workflow, I’ll run the following in my console:

library(targets)
tar_script()

This creates a file in your project directory called _targets.R. This is the high-level overview of the project, so I open that file and write out something like the following:
# _targets.R file

# Set-up ------------------------------
library(targets)
library(tarchetypes)

# Load packages needed for this project, and save a workspace
# for debugging whenever a target errors
tar_option_set(
  packages = c("tidyverse", "purrr"),
  workspace_on_error = TRUE
)

# Load functions to be used in our project
source("R/functions.R")

# Target objects ------------------------------
tar_plan(

  # Track the raw data file, so downstream targets re-run when it changes
  tar_file(
    path_to_data, "./raw/data.csv"
  ),

  # Load data
  tar_target(
    data, read_csv(path_to_data)
  ),

  # Create a summary table
  tar_target(
    data_summary_table, summarise_data(data)
  ),

  # Fit a model to the data
  tar_target(
    model, model_data(data)
  ),

  # Create a plot of the data and fitted model
  # (the plot takes the model as an input so {targets} knows it depends on it)
  tar_target(
    plot, plot_data_with_model(data, model)
  ),

  # Create a table of model coefficients
  tar_target(
    table_of_model, tabulate_model(model)
  )
)
Where summarise_data(), model_data(), plot_data_with_model(), and tabulate_model() are functions I’ve written up myself and saved in a separate script called functions.R.
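For reference, here’s a minimal sketch of what R/functions.R might look like. The function bodies are placeholders I’ve made up, assuming a simple linear model of y against x; the real definitions are in the demo repo:

# R/functions.R (a sketch; the function bodies are placeholders)

# Summarise each numeric column of the data
summarise_data <- function(data) {
  data |>
    summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
}

# Fit a model; a simple linear model stands in here
model_data <- function(data) {
  lm(y ~ x, data = data)
}

# Plot the data with the fitted model overlaid
plot_data_with_model <- function(data, model) {
  data$.fitted <- predict(model, newdata = data)
  ggplot(data, aes(x = x, y = y)) +
    geom_point() +
    geom_line(aes(y = .fitted))
}

# Tabulate the model coefficients (base R, no extra packages)
tabulate_model <- function(model) {
  as.data.frame(coef(summary(model)))
}

Note that these rely on the tidyverse, which the tar_option_set() call above loads for every target.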
The advantage of this workflow is that you can change the one part that needs changing, and the {targets} package will work out the rest for you: it knows which downstream targets are now out of date, and re-runs only those. I can’t emphasise enough how life-changing this was for me.
Here’s an example. If your data changes, you just need to change the first target, path_to_data, to point to the new data file. You can then re-run the whole analysis using tar_make(), and your summary table, fitted model, plot, and table of model coefficients will all be re-created with the data you’ve pointed to. This is so much easier than having a script and changing data to new_data at every instance it’s referred to, or re-running the whole script yourself. How convenient is that?
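Concretely, the change and the re-run might look like this (new_data.csv is just a stand-in name for your updated file):

# In _targets.R, repoint the file target at the new data:
tar_file(
  path_to_data, "./raw/new_data.csv"
),

# Then, back in the console:
tar_make()                    # rebuilds everything downstream of the data
tar_read(data_summary_table)  # retrieve any finished target from the store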
You can also get an overview of your workflow and see what needs to be updated by running tar_visnetwork().
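A related helper is tar_outdated(), which lists the names of the targets that tar_make() would rebuild. Both are run from the console:

tar_visnetwork()  # interactive graph of targets, functions, and dependencies
tar_outdated()    # character vector of targets that are out of date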
Final thoughts
Using a targets workflow requires you to change the way you approach programming. If you’re trained as a statistical programmer, I assume you’re used to writing your code in scripts. A targets workflow requires you to write your code as functions instead (otherwise known as “functional programming”).
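To make the shift concrete, here’s a toy before-and-after (the lm() call is a placeholder, not the demo’s actual model):

# Script style: top-level statements create objects as side effects
data <- read_csv("./raw/data.csv")
model <- lm(y ~ x, data = data)

# Function style: each step is a function with explicit inputs and
# outputs, which is exactly what lets {targets} track dependencies
model_data <- function(data) {
  lm(y ~ x, data = data)
}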
It is a steep learning curve and it did take me a while to get my head around why and how I should use a targets workflow. It’s a different way of thinking about how to program.
Despite this, I do highly recommend using a targets workflow instead of having multiple R scripts, for all the reasons I’ve stated above. Utilising this workflow is a step towards reproducible research, which is always great.
Resources for learning targets
I think the best way to understand why a targets workflow is useful is to actually do a project within a targets workflow. So here are some resources I found helpful:
- The targets R package user manual with a quick walkthrough to get you started. I highly recommend this one if you’re trying to set up a targets workflow by yourself.
- For a more hands-on tutorial, I recommend this Carpentries workshop by Joel Nitta. I really liked that this gave you a bare-bones overview of what a targets workflow looks like. I used this the first time I tried to set up a targets workflow, but I also had the benefit of having someone walk me through using targets.