AWS ParallelCluster v3.2.0
We're excited to announce the release of AWS ParallelCluster Cookbook 3.2.0.
This release is associated with AWS ParallelCluster v3.2.0.
ENHANCEMENTS
- Add support for multiple Elastic File Systems.
- Add support for multiple FSx File Systems.
- Add support for attaching existing FSx for ONTAP and FSx for OpenZFS File Systems.
- Install NVIDIA GDRCopy 2.3 to enable low-latency GPU memory copy on supported instance types.
- During cluster update, set the state of Slurm nodes according to the strategy set through the configuration parameter `Scheduling/SchedulerSettings/QueueUpdateStrategy`.
- Add support for memory-based scheduling in Slurm.
  - Configure `RealMemory` on compute nodes by default as 95% of the EC2 memory.
  - Move `SelectTypeParameters` to the `slurm_parallelcluster.conf` include file.
  - Move `ConstrainRAMSpace` to the `slurm_parallelcluster_cgroup.conf` include file.
  - Add support for the new configuration parameter `Scheduling/SlurmSettings/EnableMemoryBasedScheduling` to configure memory-based scheduling in Slurm.
  - Add support for the new configuration parameter `Scheduling/SlurmQueues/ComputeResources/SchedulableMemory` to override the default value of the memory seen by the scheduler on compute nodes.
- Add support for rebooting compute nodes via Slurm.
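
The memory-based scheduling defaults listed above can be illustrated with a minimal sketch. It is not the cookbook's implementation: the function names are hypothetical, and the choice of MiB as the unit and of rounding down to a whole number are assumptions made for illustration. It only shows how a 95% default and an optional `SchedulableMemory` override might combine into the value the scheduler sees.

```python
from typing import Optional

# From the release note: RealMemory defaults to 95% of the EC2 memory.
REAL_MEMORY_RATIO_PERCENT = 95


def default_real_memory_mib(ec2_memory_mib: int) -> int:
    """Return 95% of the instance memory, in MiB.

    Rounding down to a whole MiB is an assumption of this sketch.
    """
    if ec2_memory_mib <= 0:
        raise ValueError("ec2_memory_mib must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    return ec2_memory_mib * REAL_MEMORY_RATIO_PERCENT // 100


def effective_real_memory_mib(
    ec2_memory_mib: int, schedulable_memory_mib: Optional[int] = None
) -> int:
    """Return the override when one is given, otherwise the 95% default.

    `schedulable_memory_mib` stands in for the SchedulableMemory parameter.
    """
    if schedulable_memory_mib is not None:
        return schedulable_memory_mib
    return default_real_memory_mib(ec2_memory_mib)
```

For example, under these assumptions an instance with 16384 MiB of memory would be advertised with `default_real_memory_mib(16384)`, unless an explicit override is passed as the second argument of `effective_real_memory_mib`.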
CHANGES
- Restart `clustermgtd` and `slurmctld` daemons at cluster update time only when `Scheduling` parameters are updated in the cluster configuration.
- Update slurmctld and slurmd systemd service files.
- Upgrade NICE DCV to version 2022.0-12760.
- Upgrade NVIDIA driver to version 470.129.06.
- Upgrade NVIDIA Fabric Manager to version 470.129.06.
- Upgrade EFA installer to version 1.17.2.
  - EFA driver: `efa-1.16.0-1`
  - EFA configuration: `efa-config-1.10-1`
  - EFA profile: `efa-profile-1.5-1`
  - Libfabric: `libfabric-aws-1.16.0~amzn2.0-1`
  - RDMA core: `rdma-core-41.0-2`
  - Open MPI: `openmpi40-aws-4.1.4-2`
- Restrict IPv6 access to IMDS to root and cluster admin users only, when configuration parameter `HeadNode/Imds/Secured` is enabled.
- Set Slurm configuration `AuthInfo=cred_expire=70` to reduce the time requeued jobs must wait before starting again when nodes are not available.
- Move `SelectTypeParameters` and `ConstrainRAMSpace` to the `parallelcluster_slurm*.conf` include files.
- Upgrade third-party cookbook dependencies:
  - apt-7.4.2 (from apt-7.4.0)
  - line-4.5.2 (from line-4.0.1)
  - openssh-2.10.3 (from openssh-2.9.1)
  - pyenv-3.5.1 (from pyenv-3.4.2)
  - selinux-6.0.4 (from selinux-3.1.1)
  - yum-7.4.0 (from yum-6.1.1)
  - yum-epel-4.5.0 (from yum-epel-4.1.2)
- Disable `aws-ubuntu-eni-helper` service, available in Deep Learning AMIs, to avoid conflicts with `configure_nw_interface.sh` when configuring instances with multiple network cards.
- Set MTU to 9001 for all the network interfaces when configuring instances with multiple network cards.
- Remove the trailing dot when configuring the compute node FQDN.
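
As an illustration of the last item, a fully qualified name written in absolute DNS form ends with a dot, and removing it leaves the same name in the form most tools expect. The sketch below is an assumption-laden example, not the cookbook's code: the function name and the sample hostname are hypothetical.

```python
def strip_trailing_dot(fqdn: str) -> str:
    """Remove a single trailing dot from an FQDN, if one is present."""
    return fqdn[:-1] if fqdn.endswith(".") else fqdn


# Hypothetical compute node name, used only to show the effect.
print(strip_trailing_dot("queue1-dy-compute-1.mycluster.pcluster."))
```

Names that already lack the trailing dot are returned unchanged, so the function is safe to apply more than once.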