
Commit 9478e7c

demartinofra authored and Sean Smith committed
Changelog v2.5.0
Signed-off-by: Francesco De Martino <[email protected]>
1 parent c5e864b commit 9478e7c

File tree: 1 file changed, 27 additions, 1 deletion

CHANGELOG.md

```diff
@@ -3,12 +3,38 @@ aws-parallelcluster-node CHANGELOG
 
 This file is used to list changes made in each version of the aws-parallelcluster-node package.
 
-2.x.x
+2.5.0
 -----
 
+**ENHANCEMENTS**
+- Slurm:
+  - Add support for scheduling with GPU options. Currently supports the following GPU-related options: `-G/--gpus,
+    --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu`.
+  - Add gres.conf and slurm_parallelcluster_gres.conf in order to enable GPU options. slurm_parallelcluster_gres.conf
+    is automatically generated by the node daemon and contains GPU information from compute instances. If you need
+    to specify additional GRES options manually, please modify gres.conf and avoid changing
+    slurm_parallelcluster_gres.conf when possible.
+  - Integrated GPU requirements into scaling logic: the cluster will scale automatically to satisfy GPU/CPU
+    requirements for pending jobs. When submitting GPU jobs, CPU/node/task information is not required but preferred
+    in order to avoid ambiguity. If only GPU requirements are specified, the cluster will scale up to the minimum
+    number of nodes required to satisfy all GPU requirements.
+  - Slurm daemons will now keep running when the cluster is stopped, for better stability. However, it is not
+    recommended to submit jobs while the cluster is stopped.
+  - Change jobwatcher logic to consider both GPU and CPU when making scaling decisions for Slurm jobs. In general,
+    the cluster will scale up to the minimum number of nodes needed to satisfy all GPU/CPU requirements.
+- Reduce the number of calls to ASG in nodewatcher to avoid throttling, especially at cluster scale-down.
+
+**CHANGES**
+- Increase the max number of SQS messages that can be processed by sqswatcher in a single batch from 50 to 200. This
+  improves the scaling time, especially with increased ASG launch rates.
+- Increase the faulty node termination timeout from 1 minute to 5 minutes in order to give the scheduler some
+  additional time to recover when under heavy load.
+
 **BUG FIXES**
 - Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had been removed
   already from the ASG Desired count. This was causing, in rare circumstances, a cluster overscaling.
+- Better handling of errors occurring when adding/removing nodes from the scheduler config.
+- Fix bug that was causing failures in sqswatcher when an ADD and a REMOVE event for the same host are fetched
+  together.
 
 
 2.4.1
```
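For context on the GRES mechanism referenced in the changelog: manually maintained entries in gres.conf follow standard Slurm GRES syntax. The fragment below is an illustrative sketch only, not content from this commit; the GPU type and device path are assumptions for the example.

```
# Illustrative gres.conf fragment (assumed values, not from this commit).
# Auto-detected GPU entries are generated in slurm_parallelcluster_gres.conf
# by the node daemon; manual GRES overrides belong in gres.conf instead.
Name=gpu Type=tesla File=/dev/nvidia0
```

With such an entry in place, jobs can request the resource via the newly supported options (e.g. `--gres=gpu:1`).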
