This tutorial describes how to create a cluster on OpenStack suitable for running Hail.
The command-line tool osdataproc is used to create an OpenStack cluster with Apache Spark and Apache Hadoop configured. The cluster is provisioned with JupyterLab for interactive computing, Hail for genomic analysis, and Netdata for monitoring.
This part of the work is done on your local machine.
Note
The osdataproc
utility requires Python 3.9. Before proceeding, check that you are using the correct Python version:
python --version
The easiest way to install the correct Python version is using Conda.
conda create -n py39 python=3.9
conda activate py39
Alternatively, install the correct Python version locally:
wget https://www.python.org/ftp/python/3.9.8/Python-3.9.8.tgz
tar -zxvf Python-3.9.8.tgz
cd Python-3.9.8/
mkdir ~/.localpython
# Configure the build to install into ~/.localpython
./configure --prefix=$HOME/.localpython
# Build Python
make
# Install Python
make install
Running the code above installs Python 3.9.8 at ~/.localpython/bin/python3.9
Set up a virtual environment with the newly installed Python.
virtualenv osdataproc_env -p path/to/python3.9
# activate
source osdataproc_env/bin/activate
# deactivate with simply `deactivate`
osdataproc
requires Terraform version 1.4 or higher. Download the latest Terraform release from https://releases.hashicorp.com/terraform/ and unzip it into a location on your PATH. The easiest option is the bin
folder of the environment created above. Download the amd64 archive; it also works on M1/M3 MacBooks.
cd osdataproc_env/bin
wget https://releases.hashicorp.com/terraform/1.9.3/terraform_1.9.3_linux_amd64.zip
# on macOS, download terraform_1.9.3_darwin_amd64.zip instead
unzip terraform_1.9.3_linux_amd64.zip
cd ../..
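You can optionally confirm that the Terraform binary on your PATH is the one just installed (with the virtual environment still activated):
which terraform    # should point into osdataproc_env/bin
terraform version  # should report the version you downloaded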
Clone the osdataproc
git repository into a separate folder on your machine and install it via pip
.
git clone https://github.com/wtsi-hgi/osdataproc.git
cd osdataproc
pip install -e .
cd ..
Check that osdataproc
installed successfully:
osdataproc create --help
Warning
Terraform stores all cluster configuration data in the ./terraform/terraform.tfstate.d
folder inside the osdataproc
folder. Don't remove it. Otherwise, you lose access to the cluster configuration and won't be able to suspend/resume/destroy created clusters.
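If you want an extra safety net, you can back up the state directory from time to time; a minimal sketch (the backup file name and location are just examples):
cd osdataproc
tar -czf ~/tfstate-backup-$(date +%F).tar.gz terraform/terraform.tfstate.d
cd ..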
Tip
If the wheel fails to build when installing osdataproc
, try updating the wheel
and setuptools
packages.
pip install wheel -U       # wheel 0.43.0 at the time of writing
pip install setuptools -U  # setuptools 71.1.0 at the time of writing
We recommend prefixing your cluster and volume (nfs-volume) names with your username.
Example: gz3-hail
You will be prompted for a password; it is used to access Jupyter and the Spark master node. Since you may need to hand clusters over to or share them with other people, this password should be one you are happy to share.
Example: to create a cluster with 50 m2.medium
workers
and a new volume called gz3-hail
, first load your SSH key into an agent and run osdataproc create. The available options are listed below; a concrete invocation is shown after the list.
eval `ssh-agent -s`
ssh-add path/to/private/key
osdataproc create [--num-workers] <Number of desired worker nodes>
[--public-key] <Path to public key file>
[--flavour] <OpenStack flavour to use>
[--network-name] <OpenStack network to use>
[--lustre-network] <OpenStack Lustre provider network to use>
[--image-name] <OpenStack image to use - Ubuntu images only>
[--nfs-volume] <Name/ID of volume to attach or create as NFS shared volume>
[--volume-size] <Size of OpenStack volume to create>
[--device-name] <Device mountpoint name of volume>
[--floating-ip] <OpenStack floating IP to associate to master node - will automatically create one if not specified>
<cluster_name>
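For the example above (50 m2.medium workers and a new volume gz3-hail), the invocation could look like the following; the public key path and volume size are placeholders to adapt to your setup:
osdataproc create --num-workers 50 \
                  --public-key ~/.ssh/id_rsa.pub \
                  --flavour m2.medium \
                  --nfs-volume gz3-hail \
                  --volume-size 1000 \
                  gz3-hail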
Wait until Terraform completes the cluster creation. This can take some time, and your laptop needs to stay connected to the network throughout.
Important
The IP address of your new cluster will be printed to STDOUT. Save it somewhere.
When the cluster creation finishes, inspect the log. There should be no failed tasks.
There are many reasons why cluster creation can fail. In most cases, it is easier to delete and re-create the cluster.
To delete the cluster, run the following command:
osdataproc destroy "${cluster_name}"
To re-create the cluster, run the same osdataproc create command without specifying --volume-size
. The new cluster will reuse the existing volume. In most cases, volumes are created correctly, so you do not need to delete and re-create them.
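Continuing the same example, the re-creation command simply omits --volume-size so the existing gz3-hail volume is attached (the public key path is again a placeholder):
osdataproc create --num-workers 50 \
                  --public-key ~/.ssh/id_rsa.pub \
                  --flavour m2.medium \
                  --nfs-volume gz3-hail \
                  gz3-hail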
Log in to your new cluster using the IP address reported when the cluster was created. For example:
ssh ubuntu@<public_ip>
The WES-QC code requires the Python gnomAD package. To install it, run the following:
sudo apt install postgresql python3.9-dev libpq-dev
pip install gnomad
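To confirm that the package is importable (activate the environment you installed it into first, for example /home/ubuntu/venv):
python -c "import gnomad; print(gnomad.__file__)"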
To check the cluster status, use the Spark web interface: <http://<public_ip>/spark/>
Hail scripts are submitted to the worker nodes using Spark as follows:
export PYSPARK_DRIVER_PYTHON=/home/ubuntu/venv/bin/python
spark-submit /path/to/hail_script.py
Alternatively, you can activate the environment and use Spark commands directly:
source /home/ubuntu/venv/bin/activate
spark-submit /path/to/hail_script.py
It is highly advisable to open a tmux
or screen
session before submitting Spark jobs (see the sketch below).
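A minimal tmux workflow for a Spark submission might look like this (the script path is a placeholder):
tmux new -s hail                                          # start a named session
export PYSPARK_DRIVER_PYTHON=/home/ubuntu/venv/bin/python
spark-submit /path/to/hail_script.py
# detach with Ctrl-b d; reattach later with: tmux attach -t hail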
Jupyter can be accessed via a web browser (using the IP address reported when the cluster was created): <https://<public_ip>/jupyter/lab>
You will be prompted for the password that you used when creating the cluster.
Warning
All Hail tasks (both command-line and interactive via Jupyter) occupy all worker nodes. You can't run a Jupyter notebook and a command-line script simultaneously.
To kill a Jupyter job, shut down all Jupyter kernels, or kill the Hail
process via the Spark master web interface.