In this workshop, we aim to solve common issues in data science like software installation, dependency management, and performance limitations of local machines. We will explore how to create reproducible and user-friendly research environments using development containers from inside a remote computing environment powered by Indiana University's Jetstream2.
I. Downloading Visual Studio Code
II. Accessing a Remote Computing Instance
III. Creating and Managing Projects
IV. Remote Computing and Resource Management
V. Distributing Research
Before we begin, we will need to install Visual Studio Code (VS Code for short!), a powerful text editor that can be extended into an integrated development environment for many languages. Once VS Code is installed, you can add extensions from the Extensions panel (ctrl/⌘ + shift + x).

We have set up some computing instances for you to use on the Jetstream2 cluster run by Indiana University. In order to connect to your Jetstream2 compute instance, we will first need to make sure that you can access your Secure Shell (SSH) tools, set up your SSH keys, and then connect to the remote compute instance.
In order to even get started with connecting to a remote server, we first need to make sure that the tools necessary to do so are up and running. Namely, we need to enable the SSH agent, which handles authentication for remote connections. Once we do this, you'll be able to add the SSH key you've been given in order to connect to your computing instance!
There are 2 different sets of instructions to follow depending on your operating system, but the end result will be the same!
On Windows, first create the `.ssh` folder, which will exist at the location found by running `echo $HOME`. In PowerShell:

```
new-item $HOME\.ssh -ItemType Directory
```
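If a later `ssh-add` step complains that it cannot connect to the agent, the Windows ssh-agent service may not be running. A quick sketch to enable it, assuming the built-in OpenSSH client that ships with Windows 10/11 (run from an elevated PowerShell):

```
# Enable and start the Windows OpenSSH authentication agent service
Get-Service ssh-agent | Set-Service -StartupType Automatic
Start-Service ssh-agent
```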
Next, download the `container_workshop.pub` public key and the `container_workshop` private key that were sent to you before the workshop, and save them in your "Downloads" folder. Once you've downloaded both keys, move them into the `.ssh` folder using the following commands in PowerShell:

```
Move-Item -Path $HOME\Downloads\container_workshop.pub -Destination $HOME\.ssh\
Move-Item -Path $HOME\Downloads\container_workshop -Destination $HOME\.ssh\
```

Then add the private key to the SSH agent:

```
ssh-add $HOME\.ssh\container_workshop
```
On macOS and Linux, open a terminal (macOS: ⌘ + space, then search "terminal"; Linux: ctrl + alt + t) and start the SSH agent:

```
eval "$(ssh-agent -s)"
```
Next, download the `container_workshop.pub` public key and the `container_workshop` private key that were sent to you before the workshop, and save them in your "Downloads" folder. Once you've downloaded both keys, move them into the `.ssh` folder using the following commands in the terminal:

```
mv ~/Downloads/container_workshop.pub ~/.ssh/
mv ~/Downloads/container_workshop ~/.ssh/
```

Then add the private key to the SSH agent:

```
ssh-add ~/.ssh/container_workshop
```
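To confirm the key was picked up by the agent, you can list the loaded identities (this works the same in PowerShell and in the macOS/Linux terminal):

```
ssh-add -l
```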
Now that we have set up your SSH key and agent, it's time to connect to your Jetstream2 instance!
To connect, add a `Host` entry to your SSH configuration file (typically `~/.ssh/config`) and set `IdentityFile` to the private key we saved (`~/.ssh/container_workshop`). Your final configuration should look similar to this:

```
Host my-awesome-jetstream2-instance
    HostName container-workshop.mth230010.projects.jetstream-cloud.org
    User exouser
    IdentityFile ~/.ssh/container_workshop
```
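With that entry saved, one way to test the connection (using the example host alias above) is from a terminal; inside VS Code, the "Remote-SSH: Connect to Host..." command in the Command Palette offers the same hosts:

```
ssh my-awesome-jetstream2-instance
```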
Now that you are connected to a server, we can set up the development container!
In a terminal on the remote instance, run the following `copier` command, replacing `<project-name>` with the name of the project/directory you wish to create:

```
copier copy gh:UCSB-PSTAT/devcontainer-template <project-name>
```

Copier will ask a series of questions; today we will answer them like this:

```
🎤 What is the name of your project? (Must be unique and use lowercase, dashes -, underscores _ ONLY) my-awesome-project
🎤 What language(s) will you use in this project? R
🎤 Do you want to install Visual Studio Code extensions for Jupyter notebooks using R? Yes
🎤 Install RStudio Server? This is optional if using VS Code and R extensions for development. Yes
🎤 Install Quarto? Quarto is optional publishing system compatible with R. Yes
🎤 Do you want to include example files? Yes

Copying from template version 1.4.1
    create .
    create .devcontainer
    create .devcontainer/Dockerfile
    create .devcontainer/devcontainer.json
    create README.md
    create example.Rmd
    create .copier-answers.yml
```
You can inspect the generated files using `tree` (again replacing `<project-name>` with the name of the project/directory you created):

```
tree -a <project-name>
```

```
<project-name>
├── .copier-answers.yml
├── .devcontainer
│   ├── devcontainer.json
│   └── Dockerfile
├── example.Rmd
├── example-R.qmd
└── README.md
```
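If the container does not start building on its own, a common way to kick it off (assuming the Dev Containers extension is installed) is to open the project folder in VS Code and run this command from the Command Palette:

```
Dev Containers: Reopen in Container
```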
So while the container builds, let's take a step back and break this down...
In essence, we use the VS Code text editor for the following 3 things:
Today, we are working inside a remote Jetstream2 instance, and you can currently watch your container being built there. Containers isolate the system packages, programming-language packages, and tools that your project requires from the rest of your system. This means that if your project has specific versioning requirements, these can be baked into your container via the container configuration files. But what are those files? They are...

On their own, containers are usually managed from the terminal using the command-line interface of the containerization software in question. However, with the Dev Containers standard, we can easily delegate running and connecting to containers to the VS Code UI. We will discuss how to make adjustments to these files in a little bit.

Taking a look at the 2 files, you'll notice they can be non-trivial to put together. This is why we created the devcontainer template that we used today to generate the project files: it produces an easily extendable configuration with well-documented container files that can be modified to your liking as the complexity of your project grows, which we will talk about next.
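For orientation, a heavily pared-down `devcontainer.json` might look roughly like this (an illustrative sketch only; the template generates a more complete, documented version, and the extension ID shown is just the commonly used R extension):

```
{
    // Name shown by VS Code for the dev container
    "name": "my-awesome-project",
    // Build the environment from the Dockerfile next to this file
    "build": { "dockerfile": "Dockerfile" },
    // Editor extensions to install inside the container
    "customizations": {
        "vscode": { "extensions": ["REditorSupport.r"] }
    }
}
```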
By now, your VS Code instances should look a little something like this:
Click on the "+" icon next to the "Dev Containers" dialogue to open up a new terminal instance which will have a "Jupyter Token" pop-up once launched:
We can do most of our editing in VS Code with extensions for Python, R, and other languages; however, our container comes with additional tools that can be better suited to data analysis, such as JupyterLab and RStudio. Here, we will show how to access these container tools.
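Exactly how these are exposed depends on this container's configuration, but as a rough guide, forwarded ports show up in VS Code's Ports panel and can be opened in the browser. The conventional default ports (which may differ here) are:

```
http://localhost:8888/lab   # JupyterLab (conventional default port)
http://localhost:8787/      # RStudio Server (conventional default port)
```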
To show off the use of development containers for reproducibility, we will do a small sentiment analysis on the famous "To be or not to be" speech from William Shakespeare's Hamlet. Containers give us a lot of flexibility in the packages and tools we can install using commands we are already familiar with (`pip install ...`, `mamba install ...`, `install.packages(...)`). However, to create a reproducible project that can be shared and easily set up and built, we need to make changes to our actual Dockerfile, the file that defines the entire computational environment.
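As a hypothetical illustration of what that looks like (the actual line for this project comes a bit later; the base image and versions below are placeholders, not the template's), a Dockerfile can pin exact package versions at build time:

```
# Illustrative sketch only -- placeholder base image and versions
FROM quay.io/jupyter/r-notebook:latest

# Pinning exact versions at build time makes rebuilds reproducible
RUN pip install --no-cache-dir "numpy==1.26.4" "pandas==2.2.2"
```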
Let's open the `example.Rmd` file, which has some starter code and functions for us to use. We can do this by selecting "Files" in the bottom-right pane and selecting `example.Rmd`.

At the top of the file is the setup chunk, `{r setup ...}`. Let's add a few packages to this chunk, alongside the knitr options:

```{r setup, include=FALSE}
library(ggplot2)
library(syuzhet)
knitr::opts_chunk$set(echo = TRUE)
```
Our container already comes with `ggplot2`, but not `syuzhet`. We will need to add this package to our container files so that in the future, when this project is shared with others, it can be built and run without additional tweaking. More than that, we will do so in a manner that specifies the exact package version we wish to install. By default, R installs the latest version of packages, and there are times when doing so can break an installation due to specific version requirements, dependency issues, or upstream changes in other packages.

Use ctrl + f to find syuzhet and click on the package name. Then add the following line to the Dockerfile (the trailing `&& \` keeps it inside the existing RUN chain):

```
R -q -e 'remotes::install_version("syuzhet", version="1.0.7", repos="https://cloud.r-project.org")' && \
```
Save the file (ctrl + s). You may get a pop-up saying that your configuration files have changed and that you need to rebuild your container. Either click "Rebuild" in that dialog OR click the bottom-left green remote button and select "Rebuild Container".

Now that we have included the syuzhet package, we can finish up this little project! Below is the famous "To be or not to be" speech made by Hamlet. We will analyze and plot the sentiment of each line of this speech.
To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take Arms against a Sea of troubles, And by opposing end them: to die, to sleep No more; and by a sleep, to say we end The heart-ache, and the thousand natural shocks That Flesh is heir to? 'Tis a consummation Devoutly to be wished. To die, to sleep, To sleep, perchance to Dream; aye, there's the rub, For in that sleep of death, what dreams may come, When we have shuffled off this mortal coil, Must give us pause. There's the respect That makes Calamity of so long life: For who would bear the Whips and Scorns of time, The Oppressor's wrong, the proud man's Contumely, The pangs of despised Love, the Law's delay, The insolence of Office, and the spurns That patient merit of th'unworthy takes, When he himself might his Quietus make With a bare Bodkin? Who would Fardels bear, To grunt and sweat under a weary life, But that the dread of something after death, The undiscovered country, from whose bourn No traveller returns, puzzles the will, And makes us rather bear those ills we have, Than fly to others that we know not of? Thus conscience does make cowards of us all, And thus the native hue of Resolution Is sicklied o'er, with the pale cast of Thought, And enterprises of great pitch and moment, With this regard their Currents turn awry, And lose the name of Action. Soft you now, The fair Ophelia? Nymph, in thy Orisons Be all my sins remember'd.
First, create a new, empty code chunk below the speech:

```{r text_processing}
```

Inside it, save the speech as a single string (the full text is abbreviated here with `...`):

```
hamlet <- ("To be, or not to be, ...
...
Be all my sins remember'd.")
```

Next, we split the string on the newline character (`\n`). We also need to "unlist" the output since it gets processed as a list of lists:

```
hamlet_processed <- strsplit(hamlet, "\n", perl=TRUE)
hamlet_processed <- unlist(hamlet_processed)
hamlet_processed
```

Now we can score each line using the `get_sentiment` function from syuzhet:

```
sentiment <- get_sentiment(hamlet_processed)
```

Finally, we collect the scores into a data frame and plot the sentiment across the speech:

```
df <- data.frame(lineno=1:length(sentiment), sentiment=sentiment)
ggplot(df) +
  geom_line(aes(x=lineno, y=sentiment)) +
  labs(x="Line Number", y="Syuzhet Sentiment")
```
Let's talk a bit about the remote computing we were using today and how you could get access to it. Compute time on these instances is made available through the National Science Foundation's Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (NSF ACCESS) program which exists "...to help researchers and educators, with or without supporting grants, to utilize the nation's advanced computing systems and services – at no cost."
While NSF ACCESS provides time in the form of credits, the actual compute instances we are using come from Indiana University's Jetstream2 supercomputing system. Jetstream2 aims to make research computing easy by providing access to instances, remote desktop, and resource management all through the browser. NSF ACCESS is not limited to Jetstream2, as there is a variety of resource providers to choose from. That said, if you want direct support, UCSB PSTAT supports the Jetstream2 development container images that we used today!
To get started, visit the ACCESS website and then:
The allocation sizes you can apply for can be summarized as follows:

- For limited-scale projects (dissertations, papers, general grad student work)
- For larger-scale projects (research labs, classroom work, heavy compute)

Regardless of your initial application, you can always apply for a higher tier later!
Below is a table of various Jetstream2 instance sizes and how long each can run continuously, without shutting down, on the 400K credits that graduate students can apply for. Today we were using the Large CPU instance:
| System Type | Resources | Days of continuous compute (@ 400K credits) |
|---|---|---|
| L CPU | 16 CPUs, 60 GB RAM | 1040 days (16 credits/hour) |
| XL CPU | 32 CPUs, 125 GB RAM | 520 days (32 credits/hour) |
| XL GPU | 32 CPUs, 125 GB RAM, 40 GB GPU | 130 days (128 credits/hour) |
| XL RAM | 128 CPUs, 1000 GB RAM | 65 days (256 credits/hour) |
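As a sanity check on these numbers, the Large CPU row works out to 400,000 credits ÷ 16 credits/hour = 25,000 hours ≈ 1,041 days of continuous runtime, which the table rounds to 1040.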
For a more thorough breakdown of the available instances and information on credits, check out the Jetstream2 documentation.
As the final part of the workshop, we want to draw your attention to some helpful resources for maintaining and publishing your research code. In the digital age, distributing research effectively and efficiently is paramount for ensuring reproducibility, collaboration, and accessibility. This section will discuss how you can leverage GitHub and GitHub Codespaces for code management and execution, along with Zenodo for comprehensive research archiving.
GitHub is a powerful platform for version control and collaboration, essential for managing research code. By storing your research code in a GitHub repository, you benefit from features such as issue tracking, pull requests, and continuous integration. These tools enable you to manage contributions from multiple collaborators seamlessly and ensure that changes are tracked meticulously.
GitHub Codespaces takes collaboration a step further by providing a full development environment in the cloud. This allows researchers to work on their projects from anywhere, without the need to set up local development environments. The key to making this work efficiently is the use of `.devcontainer` configuration files. This eliminates the "it works on my machine" problem, significantly enhancing reproducibility.
Here is a demo repository from which you can launch a GitHub Codespaces instance if you have a GitHub account. Simply click on "Code", then "Codespaces", and lastly "Create codespace on main":
It should be noted that there are 2 caveats to this:
The `devcontainer.json` has to be modified slightly to function correctly with Codespaces; namely, we need to adjust the following 3 pieces of the configuration:

```
"build": {
    "dockerfile": "Dockerfile",
    "options": ["--format=docker"]  // remove for Codespaces (or Docker)
},
...
// change `type=bind,z` to `type=bind` for Codespaces (or Docker)
"workspaceMount": "source=${localWorkspaceFolder},target=/home/jovyan/work,type=bind,z",
...
"runArgs": [
    ...
    "--userns=keep-id:uid=1000,gid=100",  // remove for Codespaces (or Docker)
    ...
]
```
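After making those three adjustments, the corresponding portion of `devcontainer.json` would look roughly like this:

```
"build": {
    "dockerfile": "Dockerfile"
},
...
"workspaceMount": "source=${localWorkspaceFolder},target=/home/jovyan/work,type=bind",
...
"runArgs": [
    ...
]
```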
While GitHub is excellent for code management, it is equally important to have a robust system for archiving the entirety of your research output. This is where Zenodo comes into play. Zenodo is a research repository managed by CERN that provides a secure and reliable platform for storing a variety of research outputs.
Lastly, by using Zenodo, you can generate DOI links for your research outputs, which enhances their visibility and citability. This is particularly important for ensuring that your work is easily discoverable and can be referenced by other researchers in the field.
The combination of GitHub and Zenodo provides a powerful ecosystem for distributing research:
This integrated approach not only enhances the reproducibility of your research but also ensures that your work is accessible and can be built upon by the wider research community. By leveraging these tools, you contribute to a more open and collaborative research environment, ultimately advancing scientific discovery.