
Commit d32abe5

Adds infrastructure for the dry-run of the 2025 HPDC tutorial at PAVE (#13)
* Makes several infrastructure tweaks to account for dry-run testing
* Adds initial work to properly set resources
* Adds c7i.12xlarge to system.py
* Adds a test infrastructure for trying to get HTTPS working
* Updates Flux TOML config based on feedback from the Flux team
* Adds source code for Caliper tutorial apps
* Updates Caliper submodule
* Fixes some paths for the Flux config file

Co-authored-by: stephanielam3211 <[email protected]>
1 parent 2df93d1 commit d32abe5


46 files changed (+4823 / -20 lines)

2025-HPDC/docker/Dockerfile.spawn

Lines changed: 1 addition & 0 deletions
Lines changed: 1 addition & 0 deletions

```diff
@@ -83,6 +83,7 @@ RUN cmake \
 # rm -rf /tmp/build-xsbench

 COPY ./tutorial-code/caliper-tutorial/tutorial ${HOME}/caliper-tutorial/
+COPY ./tutorial-code/caliper-tutorial/apps ${HOME}/caliper-tutorial/apps
 COPY ./tutorial-code/thicket-tutorial/data/lassen ${HOME}/thicket-tutorial/data/lassen
 COPY ./tutorial-code/thicket-tutorial/data/quartz ${HOME}/thicket-tutorial/data/quartz
 COPY ./tutorial-code/thicket-tutorial/notebooks/01_thicket_tutorial.ipynb ${HOME}/thicket-tutorial/notebooks/01_thicket_tutorial.ipynb
```

2025-HPDC/docker/spawn-entrypoint.sh

Lines changed: 16 additions & 1 deletion
```diff
@@ -11,4 +11,19 @@
 # /usr/bin/mpiexec.hydra -n $num_brokers -bind-to core:$num_cores_per_node /usr/bin/flux start /opt/global_py_venv/bin/jupyterhub-singleuser

 # NOTE: use this if we only want a single "node"
-/usr/bin/flux start /opt/global_py_venv/bin/jupyterhub-singleuser
+if [[ $# -ne 1 ]]; then
+  /usr/bin/flux start /opt/global_py_venv/bin/jupyterhub-singleuser
+else
+  last_core_id=$(( $1 - 1 ))
+  mkdir -p ${HOME}/.flux
+  cat > ${HOME}/.flux/resource.toml <<EOF
+[resource]
+noverify = true
+
+[[resource.config]]
+hosts = "$(hostname)"
+cores = "0-${last_core_id}"
+EOF
+  /usr/bin/flux start -c ${HOME}/.flux/resource.toml \
+    /opt/global_py_venv/bin/jupyterhub-singleuser
+fi
```
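The spawner entrypoint now takes an optional core count: with a single argument it writes a minimal Flux `resource.toml` confining the broker to that many cores; with no argument it keeps the old single-"node" behavior. A hedged usage sketch (the argument value and instance size below are illustrative, not part of the commit):

```bash
# Hypothetical invocations of the entrypoint (values are examples only):

# Confine Flux to 48 cores, e.g. the 48 vCPUs of the c7i.12xlarge that this
# commit adds to system.py; the script writes cores = "0-47" into
# ${HOME}/.flux/resource.toml and starts the single-user server under Flux.
./spawn-entrypoint.sh 48

# No argument: fall back to the original behavior and let Flux discover
# resources on its own.
./spawn-entrypoint.sh
```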

2025-HPDC/docker/spawn-local-entrypoint.sh

Lines changed: 16 additions & 1 deletion
```diff
@@ -7,4 +7,19 @@
 # /usr/bin/mpiexec.hydra -n $num_brokers -bind-to core:$num_cores_per_node /usr/bin/flux start /opt/global_py_venv/bin/jupyter-lab --ip=0.0.0.0

 # NOTE: do this if we only want a single "node"
-/usr/bin/flux start /opt/global_py_venv/bin/jupyter-lab --ip=0.0.0.0
+if [[ $# -ne 1 ]]; then
+  /usr/bin/flux start /opt/global_py_venv/bin/jupyter-lab --ip=0.0.0.0
+else
+  last_core_id=$(( $1 - 1 ))
+  mkdir -p ${HOME}/.flux
+  cat > ${HOME}/.flux/resource.toml <<EOF
+[resource]
+noverify = true
+
+[[resource.config]]
+hosts = "$(hostname)"
+cores = "0-${last_core_id}"
+EOF
+  /usr/bin/flux start -c ${HOME}/.flux/resource.toml \
+    /opt/global_py_venv/bin/jupyter-lab --ip=0.0.0.0
+fi
```
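For illustration, if this local entrypoint were invoked with `8`, the generated `${HOME}/.flux/resource.toml` would look roughly like the following (the hostname is whatever the container reports; the core count is just an example):

```toml
[resource]
noverify = true

[[resource.config]]
hosts = "local-test-host"   # example value; the script fills in $(hostname)
cores = "0-7"               # 8 cores requested, so the last core id is 7
```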
Lines changed: 114 additions & 0 deletions
# Deploy hpdc-2025-pave-dry-run to AWS Elastic Kubernetes Service (EKS)

These config files and scripts can be used to deploy the hpdc-2025-pave-dry-run tutorial to EKS.

The sections below walk you through the steps for deploying your cluster. All commands in these sections should be run from the same directory as this README.
## Step 1: Create EKS cluster

To create an EKS cluster with your configured settings, run the following:

```bash
$ ./create_cluster.sh
```

Be aware that this step can take upwards of 15-30 minutes to complete.
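`create_cluster.sh` itself is not shown in this diff view; as a hedged guess at what it wraps, based on the `eksctl-config.yaml` that Step 4 and the teardown script reference, it likely boils down to something like:

```bash
# Hypothetical core of create_cluster.sh (not the actual script):
# create the EKS cluster described by the checked-in eksctl config.
eksctl create cluster --config-file ./eksctl-config.yaml
```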
## Step 2: Configure Kubernetes within the EKS cluster

After creating the cluster, we need to configure Kubernetes and its addons. In particular, we need to set up the Kubernetes autoscaler, which allows our tutorial to scale to as many users as our cluster's resources can handle.

To configure Kubernetes and the autoscaler, run the following:

```bash
$ ./configure_kubernetes.sh
```
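`configure_kubernetes.sh` is also not shown here; a hedged sketch of the kind of commands it presumably runs, assuming it applies the `cluster-autoscaler.yaml` and `storage-class.yaml` manifests that Step 4 mentions:

```bash
# Hypothetical sketch only; the real configure_kubernetes.sh may differ.
kubectl apply -f ./cluster-autoscaler.yaml   # enable the cluster autoscaler
kubectl apply -f ./storage-class.yaml        # storage class for user volumes
```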
## Step 3: Deploy JupyterHub to the EKS cluster

With the cluster created and configured, we can now deploy JupyterHub to the cluster to manage everything else about our tutorial.

To deploy JupyterHub, run the following:

```bash
$ ./deploy_jupyterhub.sh
```
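The deployment script is likewise not in this excerpt. A JupyterHub deployment like this is typically a Helm install of the Zero to JupyterHub chart driven by `helm-config.yaml`; the release name below comes from the teardown script in this commit, while the chart repository and namespace are assumptions:

```bash
# Hypothetical sketch of deploy_jupyterhub.sh:
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
helm upgrade --install hpdc-2025-pave-dry-run-jupyter jupyterhub/jupyterhub \
  --namespace default \
  --values ./helm-config.yaml
```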
## Step 4: Verify that everything is working

After deploying JupyterHub, we need to make sure that all the necessary components are working properly.

To check this, run the following:

```bash
$ ./check_jupyterhub_status.sh
```

If everything worked properly, you should see output like this:

```
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-2gqrw     1/1     Running   0          30s
continuous-image-puller-gb7mj     1/1     Running   0          30s
hub-8446c9d589-vgjlw              1/1     Running   0          30s
proxy-7d98df9f7-s5gft             1/1     Running   0          30s
user-scheduler-668ff95ccf-fw6wv   1/1     Running   0          30s
user-scheduler-668ff95ccf-wq5xp   1/1     Running   0          30s
```

Be aware that the hub pod (i.e., hub-8446c9d589-vgjlw above) may take a minute or so to start.

If something went wrong, you will have to edit the config YAML files to get things working. Before trying to work things out yourself, check the FAQ to see whether your issue has already been addressed.

Depending on which file you edited, you may have to run different commands to update the EKS cluster and the JupyterHub deployment. Follow the steps below to update:
1. If you only edited `helm-config.yaml`, try updating just the JupyterHub deployment by running `./update_jupyterhub_deployment.sh` (a hedged sketch of what such an update amounts to follows this list)
2. If step 1 failed, fully tear down the JupyterHub deployment with `./tear_down_jupyterhub.sh` and then re-deploy it with `./deploy_jupyterhub.sh`
3. If you edited `cluster-autoscaler.yaml` or `storage-class.yaml`, tear down the JupyterHub deployment with `./tear_down_jupyterhub.sh`. Then reconfigure Kubernetes with `./configure_kubernetes.sh` and re-deploy JupyterHub with `./deploy_jupyterhub.sh`
4. If you edited `eksctl-config.yaml`, fully tear down the cluster with `cleanup.sh` and restart from the top of this README
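As referenced in item 1 above, a hedged sketch of what a helm-config-only update typically amounts to (the actual `update_jupyterhub_deployment.sh` may differ):

```bash
# Hypothetical: re-apply helm-config.yaml to the existing Helm release.
helm upgrade hpdc-2025-pave-dry-run-jupyter jupyterhub/jupyterhub \
  --namespace default \
  --values ./helm-config.yaml
```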
## Step 5: Get the public cluster URL

Now that everything is ready to go, we need to get the public URL for the cluster.

To do this, run the following:

```bash
$ ./get_jupyterhub_url.sh
```

Note that it can take several minutes after the URL becomes available for it to actually redirect to JupyterHub.
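`get_jupyterhub_url.sh` is not included in this view. With the Zero to JupyterHub chart, the public endpoint is normally the external address of the `proxy-public` LoadBalancer service, so a hedged equivalent is:

```bash
# Hypothetical equivalent of get_jupyterhub_url.sh: print the external
# hostname of JupyterHub's public proxy service.
kubectl --namespace=default get svc proxy-public \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```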
## Step 6: Distribute URL and password to attendees

Now that we have our public URL, we can give the attendees everything they need to join the tutorial.

To access JupyterHub, attendees simply need to enter the public URL (from Step 5) in their browser of choice. This will take them to a login page. The login credentials are as follows:
* Username: anything the attendee wants (note: this should be unique for every user; otherwise, users will share pods)
* Password: the password specified towards the top of `helm-config.yaml`

Once the attendees log in with these credentials, the Kubernetes autoscaler will spin up a pod for them (and grab new resources, if needed). This pod will contain a JupyterLab instance with the tutorial materials and environment already prepared for them.

At this point, you can start presenting your interactive tutorial!
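If you need to remind yourself of that shared password right before the session, a quick check (assuming the password really is stored as a plain value near the top of `helm-config.yaml`, as described above) is:

```bash
# Print the first line of helm-config.yaml that mentions a password.
grep -i -m 1 password ./helm-config.yaml
```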
## Step 7: Clean up everything

Once you are done with your tutorial, you should clean everything up so that there are no continuing, unnecessary expenses to your AWS account. To do this, simply run the following:

```bash
$ ./cleanup.sh
```

After the cleanup finishes, you can verify that everything has been removed by going to the AWS web console and ensuring nothing from your tutorial still exists in CloudFormation or EKS.
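For a command-line counterpart to that console check (standard `eksctl` and AWS CLI calls, not scripts from this commit):

```bash
# The tutorial cluster should no longer be listed:
eksctl get cluster

# Its CloudFormation stacks should only show up, if at all, as deleted:
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE
```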
Lines changed: 13 additions & 0 deletions
A new helper script (filename not shown in this view) for grabbing the JupyterHub hub pod's logs:

```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

hub_pod_id=$(kubectl get pods -n default --no-headers=true | awk '/hub/{print $1}')
kubectl logs $hub_pod_id
```
Lines changed: 17 additions & 0 deletions
A new helper script for checking the logs of a pod's `init-tutorial-service` init container (its usage string gives its name as `check_init_container_log.sh`):

```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

if [ $# -ne 1 ]; then
    echo "Usage: ./check_init_container_log.sh <pod_name>"
    exit 1
fi

kubectl logs $1 -c init-tutorial-service
```
Lines changed: 15 additions & 0 deletions
A new status-check script (filename not shown in this view) that lists the pods in the default namespace:

```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

kubectl --namespace=default get pods

echo "If there are issues with any pods, you can get more details with:"
echo "  $ kubectl --namespace=default describe pod <pod-name>"
```
Lines changed: 42 additions & 0 deletions
A new teardown script (filename not shown in this view) that uninstalls the JupyterHub Helm release, force-deletes all pods, and deletes the EKS cluster:

```bash
#!/usr/bin/env bash

set -e

if ! command -v kubectl >/dev/null 2>&1; then
    echo "ERROR: 'kubectl' is required to configure a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://kubernetes.io/docs/tasks/tools/#kubectl"
    exit 1
fi

if ! command -v eksctl >/dev/null 2>&1; then
    echo "ERROR: 'eksctl' is required to create a Kubernetes cluster on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://eksctl.io/installation/"
    exit 1
fi

if ! command -v helm >/dev/null 2>&1; then
    echo "ERROR: 'helm' is required to configure and launch JupyterHub on AWS with this script!"
    echo "       Installation instructions can be found here:"
    echo "       https://helm.sh/docs/intro/install/"
    exit 1
fi

# Temporarily allow errors in the script so that the script won't fail
# if the JupyterHub deployment failed or was previously torn down
set +e
echo "Tearing down JupyterHub and uninstalling everything related to Helm:"
helm uninstall hpdc-2025-pave-dry-run-jupyter
set -e

echo ""
echo "Deleting all pods from the EKS cluster:"
kubectl delete pod --all-namespaces --all --force

echo ""
echo "Deleting the EKS cluster:"
eksctl delete cluster --config-file ./eksctl-config.yaml --wait

echo ""
echo "Everything is now cleaned up!"
```
