
Commit 9478e7c

demartinofra authored and Sean Smith committed
Changelog v2.5.0
Signed-off-by: Francesco De Martino <[email protected]>
1 parent c5e864b commit 9478e7c

File tree: 1 file changed, 27 additions, 1 deletion

CHANGELOG.md

```diff
@@ -3,12 +3,38 @@ aws-parallelcluster-node CHANGELOG
 
 This file is used to list changes made in each version of the aws-parallelcluster-node package.
 
-2.x.x
+2.5.0
 -----
 
+**ENHANCEMENTS**
+- Slurm:
+  - Add support for scheduling with GPU options. Currently supports the following GPU-related options: `-G/--gpus,
+    --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu`.
+  - Add gres.conf and slurm_parallelcluster_gres.conf in order to enable GPU options. slurm_parallelcluster_gres.conf
+    is automatically generated by the node daemon and contains GPU information from compute instances. If you need
+    to specify additional GRES options manually, please modify gres.conf and avoid changing
+    slurm_parallelcluster_gres.conf when possible.
+  - Integrated GPU requirements into scaling logic: the cluster will scale automatically to satisfy GPU/CPU
+    requirements for pending jobs. When submitting GPU jobs, CPU/node/task information is not required but preferred
+    in order to avoid ambiguity. If only GPU requirements are specified, the cluster will scale up to the minimum
+    number of nodes required to satisfy all GPU requirements.
+  - Slurm daemons will now keep running when the cluster is stopped, for better stability. However, it is not
+    recommended to submit jobs while the cluster is stopped.
+  - Change jobwatcher logic to consider both GPU and CPU when making scaling decisions for Slurm jobs. In general,
+    the cluster will scale up to the minimum number of nodes needed to satisfy all GPU/CPU requirements.
+- Reduce the number of calls to ASG in nodewatcher to avoid throttling, especially at cluster scale-down.
+
+**CHANGES**
+- Increase the max number of SQS messages that can be processed by sqswatcher in a single batch from 50 to 200. This
+  improves the scaling time, especially with increased ASG launch rates.
+- Increase the faulty node termination timeout from 1 minute to 5 minutes in order to give the scheduler some
+  additional time to recover when under heavy load.
+
 **BUG FIXES**
 - Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had been removed
   already from the ASG Desired count. This was causing, in rare circumstances, a cluster overscaling.
+- Better handling of errors occurring when adding/removing nodes from the scheduler config.
+- Fix bug that was causing failures in sqswatcher when an ADD and a REMOVE event for the same host are fetched
+  together.
 
 
 2.4.1
```
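For context on the GRES mechanism referenced in the changelog: manually maintained entries in gres.conf follow standard Slurm GRES syntax. The fragment below is an illustrative sketch only, not content from this commit; the GPU type and device path are assumptions for the example.

```
# Illustrative gres.conf fragment (assumed values, not from this commit).
# Auto-detected GPU entries are generated in slurm_parallelcluster_gres.conf
# by the node daemon; manual GRES overrides belong in gres.conf instead.
Name=gpu Type=tesla File=/dev/nvidia0
```

With such an entry in place, jobs can request the resource via the newly supported options (e.g. `--gres=gpu:1`).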
