This repository serves as a walk-through for beginning a data analysis project on the GenomeDK cluster, be that a PiB, BSc, MSc, or Ph.D. project. For more seasoned users, it can serve as a showcase of best practices. Some basic experience executing commands in a terminal is assumed.
The cluster is a very large collection of computers with a shared file system. It does not have a screen and a keyboard you can go use. However, by connecting to the cluster from your computer, you can create and edit files much like if they were on your machine. The goal here is to take you through all the steps required to make this possible.
NB: This is not a leisure read. Make sure you follow it to the letter, doing each step in order.
Before you can begin, you need access to the cluster. The cluster is called GenomeDK and has its own website with lots of information and documentation. To get an account on the cluster, you request one here. Below, username
will represent your user name.
On the cluster, you have a home folder that only you can access. That is where you end up when you log in. Collaborative projects, and projects that use or generate a lot of data, belong in project folders. If you do a project, we will set up a dedicated project folder for it.
Your project is a research project. Even if it is meant for training, think of it as training in reproducibility too. Your efforts, experiences, results, and developed expertise are valuable to everyone who comes after you. So do not make your project a dead end; make it a shoulder to stand on.
Reproducibility basically means that anyone should be able to do exactly what you did to exactly reproduce your results. They should be able to get the same results in terms of numbers, plots, etc. This requires care and attention to detail, but it is not difficult. This page will help you get set up to do reproducible research.
Before we get to the cluster, we need to get you properly set up on your own machine.
If you have not done so already, you should install a distribution of Python called Anaconda. Anaconda is not only an easy way of installing Python on Windows, Mac, and Linux; it also comes with the conda package management system (see below). To install Anaconda, head to this page. When the download has completed, follow the default installation.
Most of the programs we will use in this course are command-line applications. I.e., programs that are executed by writing their name and any arguments in a terminal rather than clicking on an icon and using a graphical user interface. There are many different programs that can serve as a terminal. If you have a Windows machine, you must use the Anaconda Powershell Prompt (not the Anaconda Prompt and not the CMD
). You installed Anaconda Powershell Prompt along with Anaconda Python. If you have a Mac, the terminal you will use is called Terminal. The Terminal application is pre-installed on Mac. So from now on, whenever I refer to the terminal, I mean Anaconda Powershell Prompt on Windows and Terminal on Mac. We will assume some familiarity with using a terminal and with executing commands on the command line. If you have not used a terminal before, or if you are a bit rusty, you should run through this primer before you go on.
You need to install packages and programs for use in your analyses and pipelines. Sometimes, however, the versions of packages you need for one project conflict with the versions you need for other projects that you work on in parallel. Such conflicts seem like an unsolvable problem. Would it not be fantastic if you could create a small world insulated from the rest of your Anaconda installation? That small world would then only contain the packages you need for a single project. If each project had its own isolated world, there would be no such conflicts. Fortunately, there is a tool that lets you do just that, and its name is Conda. The small worlds that Conda creates are called "environments," and you can create as many as you like. You can then switch between them as you switch between your bioinformatics projects. Conda also downloads and installs the packages for you, and it makes sure that the packages you install in each environment are compatible. It even makes sure that packages needed by packages (dependencies) are installed too. By creating an environment for each project, the libraries installed for one project do not interfere with those of another.
When you install Anaconda or Miniconda, Conda makes a single base environment for you. It is called "base," and this is why it says "(base)" at your terminal prompt. You need a conda environment for your project on both your local machine and on the cluster. Let us call both of them 'bircproject' (you can call it anything you like).
The environment on your local machine does not need a lot of packages since it mainly serves to let you connect to the cluster. This creates the environment and installs slurm-jupyter
from my conda channel:
conda create -n bircproject -c kaspermunch slurm-jupyter
Say yes (press Enter) when asked to install packages.
To be able to connect to the cluster, you need to be on the AU internal network. You can do that by either physically being on campus or by connecting to the AU network using VPN. To install the VPN, use the instructions on this page. Before you can use the VPN, you also need to enable two-step verification; you can see how to do that on the same page. If you are not physically on campus, you need to activate your VPN before you can log in to the cluster. Your password for the VPN is the same one you use to access Blackboard.
You connect to the cluster from the terminal by executing this command (replace username with your cluster user name):

ssh [email protected]

When you do, you are prompted for the password for your cluster username. Enter that and press Enter. You are now in your home folder on the cluster. Your terminal looks the same as before, but it will print:
_____ ______ _ __
| __ \ | _ \ | / /
| | \/ ___ _ __ ___ _ __ ___ ___| | | | |/ /
| | __ / _ \ '_ \ / _ \| '_ ` _ \ / _ \ | | | \
| |_\ \ __/ | | | (_) | | | | | | __/ |/ /| |\ \
\____/\___|_| |_|\___/|_| |_| |_|\___|___/ \_| \_/
If you run the hostname
command, you can see that you are on fe1.genomedk.net
. Now log out of the cluster again. You do that using the exit
command or by pressing Ctrl-d
. Now you are back on your own machine. Try hostname
again and see what your own machine is called.
You will need to log in to the cluster many many times, so you should set up your ssh
connection to the cluster so you can connect securely without typing the password every time. You do not need to know how this works, but if you are interested, here is roughly how:
Firstly, you have to understand what public/private encryption keys are. A private key is a very long, random sequence of bits that is kept secret and never leaves your own machine. A public key is another string of bits that is derived from the private key. You can generate a unique public key from the private key, but you cannot get the private key from a public key: it is a one-way process. Using the public key, anyone can encrypt a message that can only be decrypted using the private key. In other words, anyone with your public key can send you encrypted messages that only you will be able to read. So, if the cluster has your public key saved, it can authenticate you like this: the cluster sends your machine a message that is encrypted using your public key. Your machine then decrypts the message using its private key and sends it back. If the cluster sees that it was decrypted correctly, it logs you in.
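The challenge-response idea described above can be illustrated with a toy RSA-style example in Python. Everything here is drastically simplified for illustration only: the numbers are tiny and insecure, and real ssh keys use thousands of bits and a more involved protocol.

```python
# Toy illustration of public-key challenge-response (NOT real cryptography:
# the numbers are tiny and insecure, chosen only to show the one-way idea).
p, q = 61, 53                # two secret primes (never shared)
n = p * q                    # public modulus
e = 17                       # public exponent; (n, e) acts as the "public key"
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)          # private exponent; (n, d) acts as the "private key"

challenge = 42                     # the server picks a message...
encrypted = pow(challenge, e, n)   # ...and encrypts it with your public key
decrypted = pow(encrypted, d, n)   # your machine decrypts with the private key

print(decrypted == challenge)      # the server checks the round trip; prints True
```

Anyone who knows (n, e) can encrypt, but only the holder of d can decrypt, which is why proving you can decrypt the challenge proves your identity.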
First, check if you have these two authentication files on your local machine (you can do so by running ls -a ~/.ssh
in the terminal):
~/.ssh/id_rsa
~/.ssh/id_rsa.pub
You most likely do not. If you do not, generate a pair of authentication keys with the command below. Just press Enter when asked "Enter file in which to save the key", and do not enter a passphrase when prompted - just press Enter:
ssh-keygen -t rsa
Now use ssh
to create a directory ~/.ssh
on the cluster (assuming your username on the cluster is username
). You will be prompted for your password.
ssh [email protected] mkdir -p .ssh
Finally, append the public ssh
key on your local machine to the file .ssh/authorized_keys
on the cluster and enter your password (replace username
with your cluster user name):
cat ~/.ssh/id_rsa.pub | ssh [email protected] 'cat >> .ssh/authorized_keys'
From now on, you can log into the cluster from your local machine without being prompted for a password.
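As an optional convenience, you can also add a host alias to the file ~/.ssh/config on your local machine, so a short name works instead of the full address. A minimal sketch (the alias name cluster is just an example):

```
# ~/.ssh/config on your local machine (sketch; "cluster" is an arbitrary alias)
Host cluster
    HostName login.genome.au.dk
    User username
```

With this in place, ssh cluster does the same as ssh [email protected].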
Now log in to the cluster:

ssh [email protected]

(see, no password).
You need to install miniconda (a minimal Anaconda version) in your cluster home dir. Log in to the cluster and run this command to download the miniconda install script:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Then run the install script:
bash Miniconda3-latest-Linux-x86_64.sh
Follow the default installation, and say yes when it asks you if it should run conda init
for you.
NB: Now log out of the cluster and log back in. This is needed to make the conda
command available to you.
Log in to the cluster and run this command to create a conda environment for your project on the cluster. This environment should contain the packages that you need for your project. Such packages may include:
- Grid workflow: gwf
- Jupyter: jupyter jupyterlab
- Vectors and data frames: numpy pandas
- Plotting: matplotlib seaborn ipympl
- Maps: cartopy fiona
- Statistics: scipy statsmodels
- Machine learning: scikit-learn
- Trees: ete3
- Misc bioinformatics: scikit-bio biopython
- Storage and indexing: pyfaidx samtools h5py
- Graphs: networkx
- Gene annotation: mygene
- Pathways and enrichment: pypathway
- Simulation: msprime
- VCF files: scikit-allel vcftools
The command below should install what you need for a project in population genetics:
conda create --name bircproject -c gwforg -c conda-forge -c bioconda -c kaspermunch slurm-jupyter gwf pip jupyter jupyterlab ipyparallel numpy pandas matplotlib seaborn ipympl biopython pyfaidx scikit-allel pylint vcftools tabix samtools
Should you end up needing more packages than you initially included, you can easily install them later. E.g., to see how to install cartopy
using conda
, just google "conda cartopy". The top link instructs you to install it like this: conda install -c conda-forge cartopy
.
Important: Whenever you log into the cluster to work on your project, you should activate your bircproject
environment like this:
conda activate bircproject
When your environment is active, it says (bircproject)
on the command prompt instead of (base)
.
Jupyter is a notebook environment where you can easily combine text, code, and plots. Using the slurm-jupyter tool, you can run a jupyter notebook on the cluster but see it in the browser on your own machine. That way, your analysis runs on the cluster file system where your data is but the notebook interface is sent to your browser window.
Before you can connect to a jupyter session on the cluster, you need to do a bit of configuration of the jupyter settings on the cluster. slurm-jupyter comes with a script that automates that procedure. Just log into the cluster, activate your environment, and run:
config-slurm-jupyter.sh
The script will ask for a lot of information. You can just press Enter
for all of them except when prompted for what password you want to use: Then, you must type your cluster password.
Your files on the cluster are not backed up! If you want to back up files, you need to put them in a folder called BACKUP. However, a better way is to use git and GitHub (see below).
The page you are reading now is a GitHub repository. In addition to the documentation you are currently reading, it serves as a template for your project.
Start by creating your own GitHub account if you do not have one. Then follow the instructions on this page to add ssh keys to GitHub, so you do not need to type your password every time you sync with GitHub.
Now log in to your GitHub account on the web. Go to the birc-project repository. In addition to this page, the repository contains a folder structure and config files, which serve as a good starting point for a project repository. You must now "fork" the repository. Forking creates your own copy of the birc-project
repository under your own GitHub account. To fork the repository, click the button in the top right corner that says Fork
.
Once your own birc-project
repository is ready, go to the repository front page. There you will find your own copy of this page. Read on from there.
Git is a version control tool. Learn basic Git; these tutorials are good. Once you have some idea of how it works, there is a cheat sheet here, and some good visual guides here and here.
Any files you push to GitHub are backed up. Make a habit of pushing all changes you make.
The project folder is a folder that is set up on the cluster to hold your project. I call it projectfolder
here, but it will be called something sensible like baboonadmixture
.
It is accessible to only you and anyone else you collaborate with (such as your supervisor). The project folder is in your home directory and should hold the following subfolders:
projectfolder/
  data/
  people/
    username/
    supervisor_username/
The projectfolder/people/username folder is your domain. This is where you keep all the files that relate to your project.
The name of the project folder is also the name of the account that you should bill any work on the cluster to. When you run srun, sbatch, or slurm-jupyter, you must specify that project name using the -A or --account option (see below for more details).
Go to your project folder and clone your repository:
git clone [email protected]:username/birc-project.git
(replace username
with your GitHub username).
If you cd into birc-project and run ls, you will see a number of folders.
- data: stores small (tens of megabases) data files you want to keep.
- docs: stores text files explaining how you set up your project and analyses.
- binder: stores an environment.yml file documenting the conda environment used in the project.
- sandbox: stores experiments that are not yet part of your project workflow.
- scripts: stores Python scripts that produce intermediate and final results.
- steps: stores intermediary files ("steps" on the way to final results).
- notebooks: stores Jupyter notebooks with code, documentation, and results.
- figures: stores result plots and figures you make.
- results: stores the small result files of your project (tens of megabases).
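Should you ever need to recreate this folder layout outside the cloned repository, it can be sketched in a few lines of Python. The make_layout helper and the my-project base directory below are illustrative names, not part of the template:

```python
# Sketch: recreate the template's folder layout in a chosen base directory.
# "make_layout" and "my-project" are illustrative, not part of the template.
from pathlib import Path

def make_layout(base):
    folders = ["data", "docs", "binder", "sandbox", "scripts",
               "steps", "notebooks", "figures", "results"]
    for name in folders:
        # parents=True also creates the base directory if it is missing
        (Path(base) / name).mkdir(parents=True, exist_ok=True)

make_layout("my-project")
```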
Files in all those folders are under Git control, except files in the steps
folder. Those files are not backed up in any way, but should instead be reproducible using the code and information in your other folders.
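One common way to make sure git never tracks the contents of steps is a .gitignore file in the repository root. The template repository may already ship one; a minimal sketch of what such a file could contain:

```
# .gitignore (sketch) - keep large regenerable files out of version control
steps/
.ipynb_checkpoints/
__pycache__/
```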
If you did not do so when you installed Anaconda, you should download and install Visual Studio Code. VS Code is great for developing scripts and editing text files. Once you have installed VS Code, you should install the "Remote Development" extension. You do that by clicking the funny squares in the left bar and searching for "Remote Development". Once it is installed, you can click the small green square in the lower-left corner to connect to the cluster. Select "Connect current window to host", then "Add new SSH host", then type <your_cluster_username>@login.genome.au.dk, then select the config file .ssh/config. From then on, you can click the small green square in the lower-left corner and connect to the cluster by selecting login.genome.au.dk. It may take a bit, but once it is done installing a remote server, you will have access to the files in your home folder on the cluster.
Jupyter runs best in the Chrome browser. For the best experience, install that before you go on. It does not need to be your default browser; slurm-jupyter will use it anyway. Now make sure you are on your own machine and that your bircproject environment is activated. Then run this command to start a jupyter notebook on the cluster and send the display to your browser:

slurm-jupyter -u username -A projectfolder -e bircproject --run notebook

(replace username with your cluster user name, projectfolder with your project folder name, and bircproject with whatever you called your conda environment on the cluster).
Watch the terminal to see what is going on. After a while, a jupyter notebook should show up in your browser window. The first time you do this, your browser may refuse to show jupyter because the connection is unsafe. In Safari, you can click through the warning to allow it. In Chrome, you can simply type the characters "thisisunsafe" while in the Chrome window.
Once ready, jupyter may ask for your cluster password. To close the jupyter notebook, press Ctrl-c
in the terminal. Closing the browser window does not close down the jupyter on the cluster. You can read this tutorial to learn how to use a jupyter notebook.
When you log into the cluster, you land on the "front-end" of the cluster. Think of it as the lobby of a giant hotel. If you execute the hostname
command, you will get fe1.genomedk.net
. fe1
is the name of the front-end machine. The front-end is a single machine shared by everyone who logs in, so you cannot run resource-intensive jobs there; quick commands that finish in less than ten seconds are fine. In the exercises for this course, you will run software that takes much longer to finish, so you need one of the computing machines on the cluster to work on instead. You ask for a computing machine by running this command:
srun --mem-per-cpu=1g --time=3:00:00 --account=populationgenomics --pty bash
That says that you need at most one gigabyte of memory, that you need it for at most three hours (the duration of the exercise), and that the computing expenses should be billed to the project populationgenomics
(which is our course). When you execute the command, your terminal will say "srun: job 40924828 queued and waiting for resources". That means that you are waiting for a machine. Once it prints "srun: job 40924828 has been allocated resources", you have been logged into a computing node. If you execute the hostname
command, you will get something like s05n20.genomedk.net
. s05n20
is a computing machine. Now you can execute any command you like without causing trouble for anyone.
Now try to log out of the compute node by executing the exit
command or by pressing Ctrl-d
. If you execute the hostname
command again, you will get fe1.genomedk.net
showing that you are back at the front-end machine.
For non-interactive work, it is better to submit your command as a job to the cluster. When you do that, the job gets queued along with many other jobs, and as soon as the requested resources are available on the cluster, the job will start on one of the many, many machines. To submit a job, you must first create a file (a "batch script") that contains both the requested computer resources and the command you want to run.
Create a file called myscript.sh
with exactly this content:
#!/bin/bash
#SBATCH --mem=1gb
#SBATCH --time=01:00:00
#SBATCH --account=projectfolder
#SBATCH --job-name=firstjob
echo "I can submit cluster jobs now!" > success.txt
(replace projectfolder with your project folder name)
The first line says this is a bash script. The following three lines say that your job needs at most one gigabyte of memory, will run for at most one hour, and that the expenses should be billed to your project. The last line gives the name of the job. Here we have called it firstjob, but you should name it something sensible.
You submit the job using the sbatch
command:
sbatch myscript.sh
Now your job is queued. Use the mj
command to see what jobs you have queued or running. That will show something like this:
Alloc
Job ID Username Queue Jobname SessID NDS S Elap Time nodes
---------------- -------- -------- ---------- ------ --- - ---------- -----
34745986 kmt normal firstjob -- 1 R 0-00:19:27 s03n56
If you want to cancel this job before it finishes, you can use the scancel
command:
scancel 34745986
Once your job finishes, it will have created the file success.txt and written "I can submit cluster jobs now!" to it. To see that, you can use the cat command:
cat success.txt
When you run a program or script on the command line, it usually also prints some information in the terminal. When you run a job on the cluster, there is no terminal to print to. Instead, this output is written to two files that you can read when the job finishes. In this case, the files are called firstjob.stdout
and firstjob.stderr
. To see what is in them, you can use the cat
command:
cat firstjob.stdout
and
cat firstjob.stderr
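If your output files end up with different names, or you want to choose the names yourself, Slurm's --output and --error options control this. A sketch you could add to the batch script, using the %x filename pattern that Slurm expands to the job name:

```
#SBATCH --output=%x.stdout    # %x expands to the job name, e.g. firstjob.stdout
#SBATCH --error=%x.stderr
```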
That is basically it.
You may need to transfer files back and forth between your own machine and the cluster. To copy a file called file
in a directory called dir
on the cluster to the current folder on your own machine, you can use the scp
command:
scp [email protected]:dir/file .
To copy a file called file
in the current folder on your own machine to a folder called dir
on the cluster, you do this:
scp ./file [email protected]:dir/
This is the last part, but the most important.
More detail on what that means...
In the world of data projects, there are three kinds of data files.
- Those representing the input to your project (sequencing reads, raw data, etc.)
- Those representing the output from your project (numbers, notebooks, plots, tables, etc.)
- Those representing intermediary steps to get from files of type 1 to files of type 2.
Type 1 files are usually hard or expensive to reproduce, so they are saved indefinitely on the cluster. Type 2 files are generally very small and inexpensive to save indefinitely on the cluster. Type 3 files can be large and are easily regenerated if your project is reproducible, so they are not saved indefinitely. In fact, type 3 files should be deleted as soon as the project is finished. Toward the end of this tutorial, I will help you get set up to distinguish these three types of files.
When you have created your conda environment (and whenever you add new packages to your environment), you should update the environment.yml
file in the binder directory. That file allows anyone to create the same conda environment, ensuring that they have all the packages needed to run your code. You do that like this:
conda env export --from-history -f ./binder/environment.yml
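For reference, the exported file might look roughly like this. The exact contents depend on what you installed; everything below is illustrative:

```yaml
# binder/environment.yml (illustrative sketch; yours will differ)
name: bircproject
channels:
  - gwforg
  - conda-forge
  - bioconda
dependencies:
  - python=3.9
  - numpy
  - pandas
  - jupyterlab
```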
Supervisor note: this also makes it easy to create a binder for playing with the student's notebooks without interfering with the project. Go to Binder and fill in the https address of the student repository, e.g. https://github.com/username/birc-project.