
Open OnDemand on AWS with Parallel Cluster

This reference architecture provides a set of templates for deploying Open OnDemand (OOD) with AWS CloudFormation and integration points for AWS ParallelCluster.

The main branch is for Open OnDemand v4.0.1.

Architecture

(Architecture diagram)

The primary components of the solution are:

  1. An Application Load Balancer (ALB) as the entry point to your OOD portal
  2. An Auto Scaling group for the OOD portal
  3. An AWS Managed Microsoft AD directory
  4. A Network Load Balancer (NLB) to provide a single point of connectivity to the Managed Microsoft AD
  5. An Amazon Elastic File System (EFS) share for user home directories
  6. An Amazon Aurora MySQL database to store Slurm accounting data
  7. Automation via Amazon EventBridge to automatically register and deregister ParallelCluster HPC clusters with OOD

Prerequisites

This solution was tested with AWS ParallelCluster version 3.13.0.
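To confirm which version you have installed (assuming the ParallelCluster CLI is already set up):

pcluster version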

Deployment πŸš€

All-in-one Deployment

A single deployment that includes both the infrastructure and Open OnDemand:

  1. Run deploy-assets.sh to deploy the CloudFormation assets to an S3 bucket in the respective AWS account
  2. Deploy all-in-one stack ood_full.yml

Note: DeploymentAssetBucketName is the output from step 1 (deploy assets)
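As a sketch, the two steps with the AWS CLI; the stack name and any additional template parameters are assumptions to adapt for your account:

./deploy-assets.sh

aws cloudformation deploy \
  --stack-name ood \
  --template-file ood_full.yml \
  --parameter-overrides DeploymentAssetBucketName=<bucket-from-step-1> \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM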

Individual Component Deployment

Deploy the stacks individually:

  1. Deploy Infrastructure (Networking and Managed Active Directory): infra.yml
  2. Deploy Slurm Accounting Database: slurm_accounting_db.yml
  3. Deploy Open OnDemand: ood.yml
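The same CLI pattern applies per stack, for example (stack names here are placeholders; check each template for its required parameters):

aws cloudformation deploy --stack-name ood-infra --template-file infra.yml --capabilities CAPABILITY_IAM
aws cloudformation deploy --stack-name ood-slurm-db --template-file slurm_accounting_db.yml --capabilities CAPABILITY_IAM
aws cloudformation deploy --stack-name ood --template-file ood.yml --capabilities CAPABILITY_IAM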

Post Deployment Steps

Once deployed, you should be able to navigate to the URL you set as a CloudFormation parameter and log into your Open OnDemand portal. Use the username Admin and retrieve the default password from Secrets Manager. The correct secret is identified in the Open OnDemand CloudFormation stack outputs by the entry with the key ADAdministratorSecretArn.
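A sketch of retrieving that password with the AWS CLI, assuming your stack is named ood:

SECRET_ARN=$(aws cloudformation describe-stacks --stack-name ood \
  --query "Stacks[0].Outputs[?OutputKey=='ADAdministratorSecretArn'].OutputValue" \
  --output text)
aws secretsmanager get-secret-value --secret-id "$SECRET_ARN" --query SecretString --output text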

Deploying an integrated ParallelCluster HPC cluster

The OOD solution is built to integrate with AWS ParallelCluster: HPC clusters can be created and automatically registered with the portal. To deploy a cluster, refer to Setting up AWS ParallelCluster to get started.

Automatically generate ParallelCluster configuration

The scripts/create_sample_pcluster_config.sh script can be used to automatically build a ParallelCluster configuration file.

Example: create a pcluster-config.yml file:

./scripts/create_sample_pcluster_config.sh ood

Usage

Usage: ./scripts/create_sample_pcluster_config.sh <stack-name> [region] [domain1] [domain2]
  stack-name: The name of the stack you deployed
  region: The region of the stack you deployed
  domain1: The first domain name to use for the cluster
  domain2: The second domain name to use for the cluster

Manual Cluster Configuration

To create the ParallelCluster configuration file refer to the following information:

  1. HeadNode:
    1. SubnetId: PrivateSubnets from OOD Stack Output
    2. AdditionalSecurityGroups: HeadNodeSecurityGroup from CloudFormation Outputs
    3. AdditionalIAMPolicies: HeadNodeIAMPolicyArn from CloudFormation Outputs
    4. OnNodeConfigured
      1. Script: CloudFormation Output for the ClusterConfigBucket; in the format s3://$ClusterConfigBucket/pcluster_head_node.sh
      2. Args: Open OnDemand CloudFormation stack name
  2. SlurmQueues:
    1. SubnetId: PrivateSubnets from OOD Stack Output
    2. AdditionalSecurityGroups: ComputeNodeSecurityGroup from CloudFormation Outputs
    3. AdditionalIAMPolicies: ComputeNodeIAMPolicyArn from CloudFormation Outputs
    4. OnNodeConfigured
      1. Script: CloudFormation Output for the ClusterConfigBucket; in the format s3://$ClusterConfigBucket/pcluster_worker_node.sh
      2. Args: Open OnDemand CloudFormation stack name
  3. LoginNodes:
    1. OnNodeConfigured
      1. Script: CloudFormation Output for the ClusterConfigBucket; in the format s3://$ClusterConfigBucket/configure_login_nodes.sh
      2. Args: Open OnDemand CloudFormation stack name
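As a sketch, the HeadNode portion of such a config might look like the following; the subnet, security group, policy ARN, and bucket values are placeholders for your stack's outputs, and the SlurmQueues section follows the same pattern:

  HeadNode:
    Networking:
      SubnetId: subnet-0123456789abcdef0        # PrivateSubnets output
      AdditionalSecurityGroups:
        - sg-0123456789abcdef0                  # HeadNodeSecurityGroup output
    Iam:
      AdditionalIamPolicies:
        - Policy: <HeadNodeIAMPolicyArn output>
    CustomActions:
      OnNodeConfigured:
        Script: s3://<ClusterConfigBucket>/pcluster_head_node.sh
        Args:
          - ood                                 # Open OnDemand stack name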

Optional - Enable the pam_slurm_adopt module for ParallelCluster compute nodes

The pam_slurm_adopt module can be enabled on compute nodes in ParallelCluster to prevent users from SSHing into nodes on which they do not have a running job.

In your ParallelCluster config, make the following updates:

1/ Enable job containment.

Add the CustomSlurmSetting PrologFlags: "contain" in the Scheduling section; this creates the "extern" step that SSH-launched processes are adopted into. Refer to the Slurm configuration documentation for more details on this setting.

example

  SlurmSettings:  
    CustomSlurmSettings:
      - PrologFlags: "contain"

2/ Ensure compute nodes are exclusively allocated to users.

Add the CustomSlurmSetting ExclusiveUser: "YES" in the SlurmQueues section. Refer to the Slurm partition configuration documentation for more details.

example

  CustomSlurmSettings:
    ExclusiveUser: "YES"

3/ Add configure_pam_slurm_adopt.sh to OnNodeConfigured in the CustomActions section.

example

    CustomActions:
      OnNodeConfigured:
        Sequence:
          - Script: s3://$ClusterConfigBucket/pcluster_worker_node.sh
            Args:
              - Open OnDemand CloudFormation stack name
          - Script: s3://$ClusterConfigBucket/configure_pam_slurm_adopt.sh

Enabling Interactive Desktops

You can enable interactive desktops on the Portal server. To do so, create a compute queue in ParallelCluster that uses pcluster_worker_node_desktop.sh as its OnNodeConfigured script.

Snippet from ParallelCluster config

      CustomActions:
        OnNodeConfigured:
          Script: >-
            s3://{{ClusterConfigBucket}}/pcluster_worker_node_desktop.sh
          Args:
            - {{OOD_STACK_NAME}}
  • OOD_STACK_NAME is the name of your Open OnDemand CloudFormation stack (e.g. ood)
  • ClusterConfigBucket is the ClusterConfigBucket Output from CloudFormation

Slurm Configuration Management

Slurm configuration can be maintained outside of the Open OnDemand deployment.

The ClusterConfigBucket S3 bucket (found in the CloudFormation outputs) can hold Slurm configuration files under the /slurm prefix. Any file that belongs in the /etc/slurm directory can be added to this prefix and will be automatically deployed to the Open OnDemand server by way of an EventBridge rule.

The following configurations are stored by default:

How to update slurm configuration

To update the Slurm configuration on the Open OnDemand server, copy the configuration file(s) to the ClusterConfigBucket S3 bucket.

e.g. Pushing a slurm.conf configuration update.
Note: Replace OOD_STACK with the name of your OOD CloudFormation stack.

OOD_STACK="<insert ood stack name here>"
CLUSTER_CONFIG_BUCKET=$(aws cloudformation describe-stacks --stack-name "$OOD_STACK" \
  | jq -r '.Stacks[].Outputs[] | select(.OutputKey=="ClusterConfigBucket") | .OutputValue')
aws s3 cp slurm.conf s3://$CLUSTER_CONFIG_BUCKET/slurm/

Troubleshooting

Issue submitting jobs after adding a ParallelCluster

There can be errors submitting jobs after integrating OOD with ParallelCluster if the cluster has not yet been registered in the Slurm accounting database. Review the logs found in /var/log/sbatch.log and check for errors related to available clusters.

sample log entry

sbatch: error: No cluster 'sandbox-cluster' known by database.
sbatch: error: 'sandbox-cluster' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.

If this occurs, restart both the slurmctld and slurmdbd services:

systemctl restart slurmctld
systemctl restart slurmdbd

Once restarted, list the available clusters to verify the cluster is registered:

sacctmgr list clusters

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.