In this workshop, we aim to solve common issues in data science like software installation, dependency management, and performance limitations of local machines. We will explore how to create reproducible and user-friendly research environments using development containers from inside a remote computing environment powered by Indiana University's Jetstream2.
I. Downloading Visual Studio Code
II. Accessing a Remote Computing Instance
III. Creating and Managing Projects
IV. Remote Computing and Resource Management
V. Distributing Research
Before we begin, we will need to install Visual Studio Code (VS Code for short!), a powerful text editor that can be extended into an integrated development environment for many languages. Once VS Code is installed, you can add extensions from the Extensions panel (ctrl/⌘ + shift + x).

We have set up some computing instances for you to use on the Jetstream2 cluster run by Indiana University. In order to connect to your Jetstream2 compute instance, we will first need to make sure that you can access your Secure Shell (SSH) tools, set up your SSH keys, and then connect to the remote compute instance.
In order to even get started with connecting to a remote server, we first need to make sure that the tools necessary to do so are up and running. Namely, we need to enable the SSH agent, which handles authentication for remote connections. Once we do this, you'll be able to add the SSH key you've been given in order to connect to your computing instance!
There are 2 different sets of instructions to follow depending on your operating system, but the end result will be the same!
On Windows, first create the `.ssh` folder, which will exist at the location found by running `echo $HOME`. In PowerShell:

```
new-item $HOME\.ssh -ItemType Directory
```
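If a later `ssh-add` step complains that it cannot connect to the agent, the Windows ssh-agent service may not be running. A quick sketch to enable it, assuming the built-in OpenSSH client that ships with Windows 10/11 (run from an elevated PowerShell):

```
# Enable and start the Windows OpenSSH authentication agent service
Get-Service ssh-agent | Set-Service -StartupType Automatic
Start-Service ssh-agent
```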
Next, download the `container_workshop.pub` public key and the `container_workshop` private key that were sent to you before the workshop, and save them in your "Downloads" folder. Once you've downloaded both keys, move them into the `.ssh` folder using the following commands in PowerShell:

```
Move-Item -Path $HOME\Downloads\container_workshop.pub -Destination $HOME\.ssh\
Move-Item -Path $HOME\Downloads\container_workshop -Destination $HOME\.ssh\
```

Then add the private key to the SSH agent:

```
ssh-add $HOME\.ssh\container_workshop
```
On macOS and Linux, open a terminal (macOS: ⌘ + space, then search "terminal"; Linux: ctrl + alt + t) and start the SSH agent:

```
eval "$(ssh-agent -s)"
```
Next, download the `container_workshop.pub` public key and the `container_workshop` private key that were sent to you before the workshop, and save them in your "Downloads" folder. Once you've downloaded both keys, move them into the `.ssh` folder using the following commands in the terminal:

```
mv ~/Downloads/container_workshop.pub ~/.ssh/
mv ~/Downloads/container_workshop ~/.ssh/
```

Then add the private key to the SSH agent:

```
ssh-add ~/.ssh/container_workshop
```
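To confirm the key was picked up by the agent, you can list the loaded identities (this works the same in PowerShell and in the macOS/Linux terminal):

```
ssh-add -l
```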
Now that we have set up your SSH key and agent, it's time to connect to your Jetstream2 instance!
To connect, add a `Host` entry to your SSH configuration file (typically `~/.ssh/config`) and set `IdentityFile` to the private key we saved (`~/.ssh/container_workshop`). Your final configuration should look similar to this:

```
Host my-awesome-jetstream2-instance
    HostName container-workshop.mth230010.projects.jetstream-cloud.org
    User exouser
    IdentityFile ~/.ssh/container_workshop
```
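With that entry saved, one way to test the connection (using the example host alias above) is from a terminal; inside VS Code, the "Remote-SSH: Connect to Host..." command in the Command Palette offers the same hosts:

```
ssh my-awesome-jetstream2-instance
```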
Now that you are connected to a server, we can set up the development container!
In a terminal on the remote instance, run the following `copier` command, replacing `<project-name>` with the name of the project/directory you wish to create:

```
copier copy gh:UCSB-PSTAT/devcontainer-template <project-name>
```

Copier will ask a series of questions; today we will answer them like this:

```
🎤 What is the name of your project? (Must be unique and use lowercase, dashes -, underscores _ ONLY) my-awesome-project
🎤 What language(s) will you use in this project? R
🎤 Do you want to install Visual Studio Code extensions for Jupyter notebooks using R? Yes
🎤 Install RStudio Server? This is optional if using VS Code and R extensions for development. Yes
🎤 Install Quarto? Quarto is optional publishing system compatible with R. Yes
🎤 Do you want to include example files? Yes

Copying from template version 1.4.1
    create .
    create .devcontainer
    create .devcontainer/Dockerfile
    create .devcontainer/devcontainer.json
    create README.md
    create example.Rmd
    create .copier-answers.yml
```
You can inspect the generated files using `tree` (again replacing `<project-name>` with the name of the project/directory you created):

```
tree -a <project-name>
```

```
<project-name>
├── .copier-answers.yml
├── .devcontainer
│   ├── devcontainer.json
│   └── Dockerfile
├── example.Rmd
├── example-R.qmd
└── README.md
```
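If the container does not start building on its own, a common way to kick it off (assuming the Dev Containers extension is installed) is to open the project folder in VS Code and run this command from the Command Palette:

```
Dev Containers: Reopen in Container
```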
So while the container builds, let's take a step back and break this down...
In essence, we use the VS Code text editor for the following 3 things:
Today, we are working inside a remote Jetstream2 instance, and you can currently watch your container being built there. Containers isolate the system packages, programming-language packages, and tools that your project requires from the rest of your system. This means that if your project has specific versioning requirements, these can be baked into your container via the container configuration files. But what are those files? They are...

On their own, containers are usually managed from the terminal using the command-line interface of the containerization software in question. However, with the Dev Containers standard, we can easily delegate running and connecting to containers to the VS Code UI. We will discuss how to make adjustments to these files in a little bit.

Taking a look at the 2 files, you'll notice they can be non-trivial to put together. This is why we created the devcontainer template that we used today to generate the project files: it produces an easily extendable configuration with well-documented container files that can be modified to your liking as the complexity of your project grows, which we will talk about next.
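For orientation, a heavily pared-down `devcontainer.json` might look roughly like this (an illustrative sketch only; the template generates a more complete, documented version, and the extension ID shown is just the commonly used R extension):

```
{
    // Name shown by VS Code for the dev container
    "name": "my-awesome-project",
    // Build the environment from the Dockerfile next to this file
    "build": { "dockerfile": "Dockerfile" },
    // Editor extensions to install inside the container
    "customizations": {
        "vscode": { "extensions": ["REditorSupport.r"] }
    }
}
```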
By now, your VS Code instances should look a little something like this:
Click on the "+" icon next to the "Dev Containers" dialogue to open up a new terminal instance which will have a "Jupyter Token" pop-up once launched:
We can do most of our editing in VS Code with extensions for Python, R, and other languages; however, our container comes with additional tools that can be better suited to data analysis, such as JupyterLab and RStudio. Here, we will show how to access these container tools.
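Exactly how these are exposed depends on this container's configuration, but as a rough guide, forwarded ports show up in VS Code's Ports panel and can be opened in the browser. The conventional default ports (which may differ here) are:

```
http://localhost:8888/lab   # JupyterLab (conventional default port)
http://localhost:8787/      # RStudio Server (conventional default port)
```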
To show off the use of development containers for reproducibility, we will do a small sentiment analysis on the famous "To be or not to be" speech from William Shakespeare's Hamlet. Containers give us a lot of flexibility in the packages and tools we can install using commands we are already familiar with (`pip install ...`, `mamba install ...`, `install.packages(...)`). However, to create a reproducible project that can be shared and easily set up and built, we need to make changes to our actual Dockerfile, the file that defines the entire computational environment.
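As a hypothetical illustration of what that looks like (the actual line for this project comes a bit later; the base image and versions below are placeholders, not the template's), a Dockerfile can pin exact package versions at build time:

```
# Illustrative sketch only -- placeholder base image and versions
FROM quay.io/jupyter/r-notebook:latest

# Pinning exact versions at build time makes rebuilds reproducible
RUN pip install --no-cache-dir "numpy==1.26.4" "pandas==2.2.2"
```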
Let's open the `example.Rmd` file, which has some starter code and functions for us to use. We can do this by selecting "Files" in the bottom-right pane and selecting `example.Rmd`.

At the top of the file is the setup chunk, `{r setup ...}`. Let's add a few packages to this chunk, alongside the knitr options:

```{r setup, include=FALSE}
library(ggplot2)
library(syuzhet)
knitr::opts_chunk$set(echo = TRUE)
```
Our container already comes with `ggplot2`, but not `syuzhet`. We will need to add this package to our container files so that in the future, when this project is shared with others, it can be built and run without additional tweaking. More than that, we will do so in a manner that specifies the exact package version we wish to install. By default, R installs the latest version of packages, and there are times when doing so can break an installation due to specific version requirements, dependency issues, or upstream changes in other packages.

Use ctrl + f to find syuzhet and click on the package name. Then add the following line to the Dockerfile (the trailing `&& \` keeps it inside the existing RUN chain):

```
R -q -e 'remotes::install_version("syuzhet", version="1.0.7", repos="https://cloud.r-project.org")' && \
```
Save the file (ctrl + s). You may get a pop-up saying that your configuration files have changed and that you need to rebuild your container. Either click "Rebuild" in that dialog OR click the bottom-left green remote button and select "Rebuild Container".

Now that we have included the syuzhet package, we can finish up this little project! Below is the famous "To be or not to be" speech made by Hamlet. We will analyze and plot the sentiment of each line of this speech.
To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take Arms against a Sea of troubles, And by opposing end them: to die, to sleep No more; and by a sleep, to say we end The heart-ache, and the thousand natural shocks That Flesh is heir to? 'Tis a consummation Devoutly to be wished. To die, to sleep, To sleep, perchance to Dream; aye, there's the rub, For in that sleep of death, what dreams may come, When we have shuffled off this mortal coil, Must give us pause. There's the respect That makes Calamity of so long life: For who would bear the Whips and Scorns of time, The Oppressor's wrong, the proud man's Contumely, The pangs of despised Love, the Law's delay, The insolence of Office, and the spurns That patient merit of th'unworthy takes, When he himself might his Quietus make With a bare Bodkin? Who would Fardels bear, To grunt and sweat under a weary life, But that the dread of something after death, The undiscovered country, from whose bourn No traveller returns, puzzles the will, And makes us rather bear those ills we have, Than fly to others that we know not of? Thus conscience does make cowards of us all, And thus the native hue of Resolution Is sicklied o'er, with the pale cast of Thought, And enterprises of great pitch and moment, With this regard their Currents turn awry, And lose the name of Action. Soft you now, The fair Ophelia? Nymph, in thy Orisons Be all my sins remember'd.
First, create a new, empty code chunk below the speech:

```{r text_processing}
```

Inside it, save the speech as a single string (the full text is abbreviated here with `...`):

```
hamlet <- ("To be, or not to be, ...
...
Be all my sins remember'd.")
```

Next, we split the string on the newline character (`\n`). We also need to "unlist" the output since it gets processed as a list of lists:

```
hamlet_processed <- strsplit(hamlet, "\n", perl=TRUE)
hamlet_processed <- unlist(hamlet_processed)
hamlet_processed
```

Now we can score each line using the `get_sentiment` function from syuzhet:

```
sentiment <- get_sentiment(hamlet_processed)
```

Finally, we collect the scores into a data frame and plot the sentiment across the speech:

```
df <- data.frame(lineno=1:length(sentiment), sentiment=sentiment)
ggplot(df) +
  geom_line(aes(x=lineno, y=sentiment)) +
  labs(x="Line Number", y="Syuzhet Sentiment")
```
Let's talk a bit about the remote computing we were using today and how you could get access to it. Compute time on these instances is made available through the National Science Foundation's Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (NSF ACCESS) program which exists "...to help researchers and educators, with or without supporting grants, to utilize the nation's advanced computing systems and services – at no cost."
While NSF ACCESS provides time in the form of credits, the actual compute instances we are using come from Indiana University's Jetstream2 supercomputing system. Jetstream2 aims to make research computing easy by providing access to instances, remote desktop, and resource management all through the browser. NSF ACCESS is not limited to Jetstream2, as there is a variety of resource providers to choose from. That said, if you want direct support, UCSB PSTAT supports the Jetstream2 development container images that we used today!
To get started, visit the ACCESS website and then:
The allocation sizes you can apply for can be summarized as follows:

- For limited-scale projects (dissertations, papers, general grad student work)
- For larger-scale projects (research labs, classroom work, heavy compute)

Regardless of your initial application, you can always apply for a higher tier later!
Below is a table of various Jetstream2 instance sizes and how long each can run continuously, without shutting down, on the 400K credits that graduate students can apply for. Today we were using the Large CPU instance:
| System Type | Resources | Days of continuous compute (@ 400K credits) |
|---|---|---|
| L CPU | 16 CPUs, 60 GB RAM | 1040 days (16 credits/hour) |
| XL CPU | 32 CPUs, 125 GB RAM | 520 days (32 credits/hour) |
| XL GPU | 32 CPUs, 125 GB RAM, 40 GB GPU | 130 days (128 credits/hour) |
| XL RAM | 128 CPUs, 1000 GB RAM | 65 days (256 credits/hour) |
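As a sanity check on these numbers, the Large CPU row works out to 400,000 credits ÷ 16 credits/hour = 25,000 hours ≈ 1,041 days of continuous runtime, which the table rounds to 1040.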
For a more thorough breakdown of the available instances and information on credits, check out the Jetstream2 documentation.
As the final part of the workshop, we want to draw your attention to some helpful resources for maintaining and publishing your research code. In the digital age, distributing research effectively and efficiently is paramount for ensuring reproducibility, collaboration, and accessibility. This section will discuss how you can leverage GitHub and GitHub Codespaces for code management and execution, along with Zenodo for comprehensive research archiving.
GitHub is a powerful platform for version control and collaboration, essential for managing research code. By storing your research code in a GitHub repository, you benefit from features such as issue tracking, pull requests, and continuous integration. These tools enable you to manage contributions from multiple collaborators seamlessly and ensure that changes are tracked meticulously.
GitHub Codespaces takes collaboration a step further by providing a full development environment in the cloud. This allows researchers to work on their projects from anywhere, without the need to set up local development environments. The key to making this work efficiently is the use of `.devcontainer` configuration files. This eliminates the "it works on my machine" problem, significantly enhancing reproducibility.
Here is a demo repository from which you can launch a GitHub Codespaces instance if you have a GitHub account. Simply click on "Code", then "Codespaces", and lastly "Create codespace on main":
It should be noted that there are 2 caveats to this:
The `devcontainer.json` has to be modified slightly to function correctly with Codespaces; namely, we need to adjust the following 3 pieces of the configuration:

```
"build": {
    "dockerfile": "Dockerfile",
    "options": ["--format=docker"]  // remove for Codespaces (or Docker)
},
...
// change `type=bind,z` to `type=bind` for Codespaces (or Docker)
"workspaceMount": "source=${localWorkspaceFolder},target=/home/jovyan/work,type=bind,z",
...
"runArgs": [
    ...
    "--userns=keep-id:uid=1000,gid=100",  // remove for Codespaces (or Docker)
    ...
]
```
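After making those three adjustments, the corresponding portion of `devcontainer.json` would look roughly like this:

```
"build": {
    "dockerfile": "Dockerfile"
},
...
"workspaceMount": "source=${localWorkspaceFolder},target=/home/jovyan/work,type=bind",
...
"runArgs": [
    ...
]
```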
While GitHub is excellent for code management, it is equally important to have a robust system for archiving the entirety of your research output. This is where Zenodo comes into play. Zenodo is a research repository managed by CERN that provides a secure and reliable platform for storing a variety of research outputs.
Lastly, by using Zenodo, you can generate DOI links for your research outputs, which enhances their visibility and citability. This is particularly important for ensuring that your work is easily discoverable and can be referenced by other researchers in the field.
The combination of GitHub and Zenodo provides a powerful ecosystem for distributing research:
This integrated approach not only enhances the reproducibility of your research but also ensures that your work is accessible and can be built upon by the wider research community. By leveraging these tools, you contribute to a more open and collaborative research environment, ultimately advancing scientific discovery.