This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.
* Changes in Slurm 19.05.5
==========================
-- Fix both socket-[un]constrained GRES issues that would lead to incorrect
GRES allocations and GRES underflow errors at deallocation time.
-- Reject unrunnable jobs submitted to reservations.
-- Fix misleading error returned for immediate allocation requests when defer
is set in SchedulerParameters, by decoupling defer from the "too fragmented"
logic.
-- Fix printf format string error on FreeBSD.
-- Fix parsing of delay_boot in controller when additional arguments follow it.
-- Fix --ntasks-per-node in cons_tres.
-- Fix array tasks getting same reject reason.
-- Ignore DOWN/DRAIN partitions in reduce_completing_frag logic.
-- Fix alloc_node validation when updating a job.
-- Fix for requesting specific nodes when using cons_tres topology.
-- Ensure x11 is set up before launching a job step.
-- Fix incorrect SLURM_CLUSTER_NAME env var in batch step.
-- Perl API - Fix undefined symbol for slurmdbd_pack_fini_msg.
-- Install slurmdbd.conf.example with 0600 permissions to encourage secure
use. CVE-2019-19727.
-- srun - do not continue with job launch if --uid fails. CVE-2019-19728.
* Changes in Slurm 19.05.4
==========================
-- Don't allow empty string as a reservation name; generate a name if empty
string is provided.
-- Fix salloc segfault when using --no-shell option.
-- Fix divide by zero when normalizing partition priorities.
-- Restore ability to set JobPriorityFactor to 0 on a partition.
-- Fix multi-partition non-normalized job priorities.
-- Adjust precedence between --mem-per-cpu and --mem-per-node to enforce
them as mutually exclusive. Specifying either on the command line will
now explicitly override any value inherited through the environment.
-- Always print node's version, if it exists, in scontrol show nodes.
-- sbatch - ensure SLURM_NTASKS_PER_NODE is exported when --ntasks-per-node
is set.
-- slurmctld - fix memory leak when using DebugFlags=Reservation.
-- Reset --mem and --mem-per-cpu options correctly when using --mem-per-gpu.
-- Use correct function signature for step_set_env() in gres plugin interface.
-- Restore pre-19.05 hostname handling behavior for AllocNodes by always
truncating to just the host portion and dropping any domain name portion
returned by gethostbyaddr().
-- Fix abort initializing a configuration without acct_gather.conf.
-- Fix GRES binding and CLOUD nodes GRES setup regressions.
-- Make sview work with glib2 v2.62.
-- Fix slurmctld abort when in developer mode and submitting to multiple
partitions with a bad QOS and not enforcing QOS.
-- Enforce PART_NODES if only PartitionName is specified.
-- Fix slurmd -G functionality.
-- Fix build on 32-bit systems.
-- Remove duplicate log entry on update job.
-- sched/backfill - fix the estimated sched_nodes for multi-part jobs.
-- slurm.spec - fix pmix_version global context macro.
-- Fix cons_tres topology logic incorrectly evaluating insufficient resources.
-- Fix job "--switches=count@time" option handling in cons_tres topology.
-- scontrol - allow changes to the WorkDir for pending jobs.
-- Enable coordinators to delete users if they only belong to accounts that
the coordinator is over.
-- Fix regression on update from older versions with DefMemPerCPU.
-- Fix issues with --gpu-bind while using cgroups.
-- Suspend nodes after being down for SuspendTime.
-- Fix rebooting nodes from skipping nextstate states on boot.
-- Fix regression in reservation creation logic from 19.05.3 which would
incorrectly deny certain valid reservations from being created.
-- slurmdbd - process sacct/sacctmgr job queries from older clients correctly.
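The AllocNodes entry in this release describes truncating a looked-up name to
just its host portion. A minimal Python sketch of that truncation follows. It
assumes a plain split on the first dot; short_hostname is a hypothetical helper
for illustration, not a Slurm function, and it does not special-case IP
addresses.

```python
def short_hostname(name: str) -> str:
    """Return the host portion of a name, dropping any domain portion.

    Hypothetical helper: everything after the first dot is discarded.
    """
    return name.split(".", 1)[0]


if __name__ == "__main__":
    # A fully qualified name and an already-short name map to the same value.
    print(short_hostname("login1.cluster.example.org"))
    print(short_hostname("login1"))
```

With this rule, an AllocNodes list written with short names matches regardless
of whether the resolver returns a fully qualified name.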
* Changes in Slurm 19.05.3-2
============================
-- Fix missing include for Cray Aries systems.
* Changes in Slurm 19.05.3
==========================
-- Fix missing check from conversion of cray -> cray_aries.
-- Improve job state reason string when required nodes are not available by
not including those that don't belong to the job partition.
-- Set a more appropriate ESLURM_RESERVATION_MAINT job state reason for jobs
requesting feature(s) and required nodes are in a maintenance reservation.
-- Fix logic to better handle maintenance reservations.
-- Add spank options to cache in remote callback.
-- Enforce the use of spank_option_getopt().
-- Fix select plugins' will-run test under-allocating node usage for
completing jobs.
-- Nodes in COMPLETING state treated as being currently available for job
will-run test.
-- Cray - fix contribs slurm.conf.j2 with updated cray_aries plugin names.
-- job_submit/lua - fix problem where nil was expected for min_mem_per_cpu.
-- Fix extra, unaccounted TRESRunMins usage created by heterogeneous jobs when
running with the priority/multifactor plugin.
-- Detach threads once they are done to avoid having to join them
in track scripts code.
-- Handle situation where a slurmctld tries to communicate with slurmdbd more
than once at the same time.
-- Fix XOR/XAND features like cpu&fastio&[knl|westmere] to be resolved
correctly.
-- Don't update [min|max]_exit_code on job array task requeue.
-- Don't assume the first node of a job is the batch host when testing if the
job's allocated nodes are booted/ready.
-- Make --batch=<feature> requests wait for all nodes to be booted so that it
can choose the batch host after the nodes have been booted -- possibly with
different features.
-- Fix talking to batch host on its protocol version when using --batch.
-- gres/mic plugin - add missing fini() function to clean up plugin state.
-- Move _validate_node_choice() before prolog/epilog check.
-- Look forward one week while creating a new reservation.
-- Set missing resv_desc.flags before calling _select_nodes().
-- Use correct start_time for TIME_FLOAT reservation in _job_overlap().
-- Properly enforce a job's mem-per-cpu option when allocating the node
exclusively to that job.
-- sched/backfill - clear estimated sched_nodes as done for start_time.
-- Have safe_[read|write] handle EAGAIN and EINTR.
-- Fix checking for flag with logical AND.
-- Correct "extern" definition of variable if compiling with __APPLE__.
-- Deprecate FastSchedule. FastSchedule will be removed in 20.02.
The FastSchedule=2 functionality (used for testing and development) has
been retained as the new SlurmdParameters=config_overrides option.
-- Fix preemption issue when picking nodes for a feature job request.
-- Fix race condition preventing held array job from getting a db_index.
-- Fix select/cons_tres gres code infinite loop leaving slurmctld unresponsive.
-- Remove redefinition of global variable in gres.c.
-- Fix issue where GPU devices are denied access when MPS is enabled.
-- Fix uninitialized errors when compiling with CFLAGS="--coverage".
-- Fix scancel --full for proctrack/cgroups.
-- Fix sdiag backfill last and mean queue length stats.
-- Do not remove batch host when resizing/shrinking a batch job.
-- nss_slurm - fix file descriptor leaks.
-- Fix preemption for jobs using complex feature requests
(e.g. -C "[rack1*2&rack2*4]").
-- Fix memory leaks in preemption when jobs request multiple features.
-- Allow Operator users to show/fix runaways.
-- Disallow coordinators to show/fix runaways.
-- mpi/pmi2 - increase array len to avoid buffer size exceeded error.
-- Preserve rebooting node's nextstate when updating state with scontrol.
-- Fully merge slurm.conf and gres.conf before node_config_load().
-- Remove FastSchedule dependence from gres.conf's AutoDetect=nvml.
-- Forbid mix of typed and untyped GRES of same name in slurm.conf.
-- cons_tres: Prevent creating a job without CPUs.
-- Prevent underflow when filtering cores with gres.
-- proctrack/cray_aries: use current pid instead of thread if we're in a fork.
-- Fix missing check for prolog launch credential creation failure that can
lead to segfaults.
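The safe_[read|write] entry in this release concerns retrying I/O after EAGAIN
and EINTR. The Python sketch below illustrates the general retry pattern only.
It is not the Slurm macro: read_exactly and RETRYABLE are hypothetical names,
and a production version would normally wait with poll() instead of spinning
on EAGAIN.

```python
import errno
import os

# Error codes treated as transient (assumption taken from the entry above).
RETRYABLE = (errno.EAGAIN, errno.EINTR)


def read_exactly(fd: int, size: int) -> bytes:
    """Read exactly `size` bytes from fd, retrying on EAGAIN and EINTR."""
    buf = bytearray()
    while len(buf) < size:
        try:
            chunk = os.read(fd, size - len(buf))
        except OSError as exc:
            if exc.errno in RETRYABLE:
                continue  # transient condition: try the read again
            raise  # any other error is reported to the caller
        if not chunk:
            # End of file before the requested amount arrived.
            raise EOFError(f"expected {size} bytes, got {len(buf)}")
        buf += chunk
    return bytes(buf)


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"slurm")
    os.close(w)
    print(read_exactly(r, 5))
    os.close(r)
```

Short reads are handled by the same loop, so callers either get the full
amount or an exception.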
* Changes in Slurm 19.05.2
==========================
-- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
-- Allow account coordinators to add users who don't already have an
association with any account.
-- If only allowing particular alloc nodes in a partition, deny any request
coming from an alloc node of NULL.
-- Prevent partial-load of plugins which can leave certain interfaces in
an inconsistent state.
-- Remove stray __USE_GNU macro definitions from source.
-- Fix loading fed state by backup on subsequent takeovers.
-- Add missing job read lock when loading fed job state.
-- Add missing fed_job_info jobs if fed state is lost.
-- Do not build cgroup plugins on FreeBSD or NetBSD, and use proctrack/pgid
by default instead.
-- Do not build switch/cray_aries plugin on FreeBSD, NetBSD, or macOS.
-- Fix build on FreeBSD.
-- Fix race condition in route/topology plugin.
-- In munge decode set the alloc_node field to the text representation of an
IP address if the reverse lookup fails.
-- Fix infinite loop in slurmstepd handling for nss_slurm REQUEST_GETGR RPC.
-- Fix slurmstepd early assertion fail which prevented batch job launch or
tasks launch on non-Linux systems.
-- Fix regression with SLURM_STEP_GPUS env var being renamed SLURM_STEP_GRES.
-- Fix pmix v3 linking if no rpath is allowed on build.
-- Fix sacctmgr error handling when removing associations and users.
-- Allow sacctmgr to add users to WCKeys without having TrackWCKey set in the
slurm.conf.
-- Allow sacctmgr to delete WCKeys from users.
-- Change GRES type set by gpu/gpu_nvml plugin to be more specific - based
on device name instead of brand name.
-- cli_filter - fix logic error with option lookup functions.
-- Fix bad testing of NodeFeatures debug flag in contribs/cray.
-- Cleanup track_script code to avoid race conditions and invalid memory
access.
-- Fix jobs being killed after being requeued by preemption.
-- Make register nodes verify correctly when using cons_tres.
-- Fix srun --mem-per-cpu being ignored.
-- Fix segfault in _update_job() under certain conditions.
-- job_submit/lua - restore slurm.FAILURE as a synonym for slurm.ERROR.
* Changes in Slurm 19.05.1-2
============================
-- Fix mistake in QOS time limit calculations for UsageFactor != 0 with any
combination of flags set.
* Changes in Slurm 19.05.1
==========================
-- accounting_storage/mysql - fix incorrect function names in error messages.
-- accounting_storage/slurmdbd - trigger an fsync() on the dbd.messages state
file to ensure it is committed to disk properly.
-- Prevent JobHeldUser state reason from being updated at allocation time.
-- Fix dump/load of rejected heterogeneous jobs.
-- For heterogeneous jobs, do not count each component against the QOS or
association job limit multiple times.
-- Comment out documentation for the incomplete and currently unusable
burst_buffer/generic plugin.
-- Add new error ESLURM_INVALID_TIME_MIN_LIMIT to make note when a time_min
limit is invalid based on timelimit.
-- Correct slurmdb cluster record pack with NULL pointer input.
-- Clearer error message for ESLURM_INVALID_TIME_MIN_LIMIT.
-- Fix SchedulerParameter bf_min_prio_reserve error when not the last parameter.
-- When fixing runaway jobs, change to reroll from earliest submit time, and
never reroll from Unix epoch.
-- Display submit time when running sacctmgr show runawayjobs and add format
option to display eligible time.
-- jobcomp/elasticsearch - fix minor race related to JobCompLoc setup.
-- For HetJobs, ensure SLURM_PACK_JOB_ID is set regardless of whether
PrologFlags=Alloc is enabled.
-- Fix PriorityFlags regression with the mutation of FAIR_TREE to NO_FAIR_TREE.
-- select/cons_res - fix debug flag SelectType handling in select_p_job_test.
-- Fix sacctmgr archive dump commit confirmation.
-- Prevent extra resources from being allocated when combining certain flags.
-- Cray - fix template generator with updated cray_aries plugin names.
-- accounting_storage/slurmdbd - provide additional detail in several error
messages.
-- Backfill - If a job has a time_limit, guess the end time of the job better
if OverTimeLimit is Unlimited.
-- Remove premature call to get system gpus before querying fake gpus that
should override the real ones.
-- Fix segfault in epilog_set_env() when gres_devices is NULL.
-- Fix (un)supported states in sacct.
-- Adjust build system to no longer use the AC_FUNC_MALLOC autoconf macro.
-- srun - restore the --cpu_bind option to srun.
-- Add UsageFactorSafe QOS flag to control applying UsageFactor at
submission/scheduling time.
-- Create missing reservations on DBD_MODIFY_RESV.
-- Add error message when attempting to update association manager and object
doesn't exist.
-- Fix security issue in accounting_storage/mysql plugin on archive file loads
by always escaping strings within the slurmdbd. CVE-2019-12838.
* Changes in Slurm 19.05.0
==========================
-- Fix deprecated group by clause to use order by.
-- NVML - Get rid of unneeded * when passing nvmlDevice_t to functions.
-- NVML - Fix clang warning about unneeded variable initialization.
-- NVML - remove unneeded {}.
-- Add timers to new site_factor plugin APIs to warn of slow-running plugins,
which can lead to issues with throughput and responsiveness.
-- X11 forwarding - ignore screen value for local DISPLAY.
-- Add missing locks protecting slurmctld_config.server_thread_count access.
-- Fix jobs stuck from FedJobLock when requeueing in a federation.
-- Fix requeueing job in a federation of clusters with differing associations.
-- sacctmgr - free memory before exiting in 'sacctmgr show runaway'.
-- Fix seff showing memory overflow when steps tres mem usage is 0.
-- Fix memory leaks in 'sacctmgr show runawayjobs'.
-- Fix potential deadlock in nss_slurm.
-- Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor.
-- Alter reservation flags column in slurmdbd to use uint64_t instead of
uint16_t to ensure all current flags are saved correctly. Older releases
unfortunately could not store details for newer flags (using bits 17-32)
due to this field being silently truncated.
-- Modify task layout with --overcommit option plus a heterogeneous job
allocation so that a cyclic task distribution can start happening before
all CPUs on all nodes are fully allocated. The number of tasks per node
will be unchanged from the previous algorithm, but tasks will be distributed
in a cyclic fashion first and then extra tasks placed on nodes with more
CPUs. Previously all CPUs would be fully allocated in a cyclic fashion,
then excess tasks distributed evenly across all allocated nodes.
-- In select/cons_tres: Only allocate 1 CPU per node with the --overcommit
option.
-- In select/cons_res: Only allocate 1 CPU per node with the --overcommit and
--nodelist options.
-- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests.
-- Fix wrongly setting start_time to 0 for multi-part jobs.
-- Upon archive file name collision, create new archive file instead of
overwriting the old one to prevent lost records.
-- Limit archive files to 50000 records per file so that archiving large
databases will succeed.
-- Remove stray newlines in SPANK plugin error messages.
-- Fix archive loading events.
-- Fix main scheduler from potentially not running through whole queue.
-- Fix variable initialization to avoid slurmctld abort.
-- In partition preemption, sort preemptor jobs only if they overlap a
preemptable partition.
-- cons_tres/dist_tasks - fix variable usage in cyclic distribution.
-- cons_res/job_test - prevent a job from overallocating a node's memory.
-- cons_res/job_test - fix to consider a node's current allocated memory when
testing a job's memory request.
-- Fix issue where multi-node job steps on cloud nodes wouldn't finish cleaning
up until the end of the job (rather than the end of the step).
-- Fix packing pack_jobid in an sbcast.
-- Fix GCC 9 compiler warnings.
-- Add new job bit_flags of JOB_DEPENDENT.
-- Make it so dependent jobs reset the AccrueTime and do not count against any
AccrueTime limits.
-- Fix sacctmgr --parsable2 output for reservations and tres.
-- In multi-node systems make sure GRES are found on node when not bound to
specific sockets.
-- Fix gres-per-task logic for gres not bound to sockets.
-- Fix issue when --gpus plus --cpus-per-gres was forcing socket binding
unnecessarily.
-- Change event table's state column to handle 32bits.
-- Prevent slurmctld from potential segfault after job_start_data() called
for completing job.
-- Fix jobs getting on nodes with "scontrol reboot asap".
-- Record node reboot events to database.
-- Fix node reboot failure message getting to event table.
-- Don't write "(null)" to event table when no event reason exists.
-- Fix invalid memory read in cons_tres.
-- Fix minor memory leak when clearing runaway jobs.
-- Avoid flooding slurmctld and logging when prolog complete RPC errors occur.
-- Fix slurmctld node_scheduler's feature_bitmap memory leak.
-- Fatal when reading config if Alloc flag configured on FrontEnd mode.
-- Modifications needed to run Federations with clusters running
different select/switch plugins.
-- Fix Clang errors for zero initializing struct with nested arrays.
-- Fix minor memory leak in pmi2.
-- MySQL - Fix minor memory leak when querying suspended jobs fails.
-- Fix seff human readable memory string for values below a megabyte.
-- Avoid slurmctld abort if GRES defined in gres.conf, but not in the node
configuration of slurm.conf.
-- Calculate task count for job with --gpus-per-task option, but no explicit
task count.
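The archive entry in this release caps each archive file at 50000 records. A
minimal Python sketch of that kind of chunking follows. chunk_records and
MAX_RECORDS_PER_FILE are hypothetical names, and the records are plain
integers standing in for database rows; how slurmdbd actually writes the files
is not shown here.

```python
from typing import Iterator, Sequence

# Per-file record limit stated in the archive entry above.
MAX_RECORDS_PER_FILE = 50000


def chunk_records(records: Sequence,
                  limit: int = MAX_RECORDS_PER_FILE) -> Iterator[Sequence]:
    """Yield consecutive slices of `records`, each at most `limit` long."""
    for start in range(0, len(records), limit):
        yield records[start:start + limit]


if __name__ == "__main__":
    rows = list(range(120000))  # stand-in for 120000 archived rows
    sizes = [len(chunk) for chunk in chunk_records(rows)]
    print(sizes)  # [50000, 50000, 20000]
```

Each chunk would then go to its own archive file, which keeps any single file
bounded in size however large the database is.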
* Changes in Slurm 19.05.0rc1
=============================
-- Set CUDA_VISIBLE_DEVICES environment variable in Prolog and Epilog for jobs
requesting gres/gpu.
-- Remove '-U' argument - which was deprecated when '-A' was made the single
character option before the Slurm 2.1 release - as an alternative to
'--account' for salloc/sbatch/srun.
-- Remove direct BLCR support and srun_cr.
-- Make slurm_print_node_table only print a node's slurmd version if it is
different to the one reported by slurm_load_ctl_conf.
-- Call gres plugin environment setup even if gres not requested in job.
-- Do not set CUDA_VISIBLE_DEVICES=NoDevFiles when no gres requested.
-- If GRES configuration data is unavailable from gres.conf, then use the
node's "Gres=" information in slurm.conf. This will eliminate or minimize
the gres.conf file in many situations.
-- Fix checking IPMI XCC raw command response length.
-- jobacct_gather/common - improve lightweight process identification.
-- Cloud/PowerSave Improvements:
- Better responsiveness to resuming and suspending.
- Powering down nodes not eligible to be allocated until after
SuspendTimeout.
- Powering down nodes put in "Powering Down / %" state until after
SuspendTimeout.
-- Add idle_on_node_suspend SlurmctldParameter to make nodes idle regardless
of state when suspended.
-- Add PowerSave DebugFlag for Suspend/Resume debugging.
-- Changed "scontrol reboot" to not default to ALL nodes.
-- Changed "scontrol completing" to include two new fields - EndTime and
CompletingTime.
-- select/cons_tres - prevent job from overallocating a node's memory.
-- Refactor CLI option parsing for salloc/sbatch/srun into a central set of
functions in src/common/slurm_opt.c. Note that this new option parsing can
be stricter in a few specific situations - places that used to ignore
invalid options and still submit/launch a job or job step may return an
error() and refuse to proceed instead.
-- Add preempt_send_user_signal SlurmctldParameter option to send user
signal (e.g. --signal=<SIG_NUM>) at preemption if it hasn't already been
sent.
-- Add PreemptExemptTime parameter to slurm.conf and QOS to guarantee a
minimum runtime before preemption.
-- Set job's preempt time for non-grace time preemptions.
-- Add sinfo format option to show used gres.
-- Add reboot_from_controller SlurmctldParameter to allow RebootProgram to be
run from the controller instead of the slurmds.
-- Fix increasing of job size when extern steps exist.
-- Reset GPU-related arguments to salloc/sbatch/srun for each separate
heterogeneous job component.
-- Do not set "(null)" for SLURM_JOB_CONSTRAINTS when no constraints are set
in PrologSlurmctld/EpilogSlurmctld.
-- Add SRUN_EXPORT_ENV as an input environment variable to srun.
-- Return an error for invalid #SBATCH directives, and do not submit the job.
-- Add S_JOB_ARRAY_ID and S_JOB_ARRAY_TASK_ID to spank_get_item().
-- Change container_{g,p}_add_pid() to container_{g,p}_join() and remove the
'pid_t pid' argument.
-- Add new site_factor plugin type to permit sites to build plugins to set
and modify the site priority factor value both initially on job submission,
and periodically every PriorityCalcPeriod.
-- Rename Cray plugins to cray_aries in preparation for Cray/Shasta.
-- Allow Het Jobs to work on a Cray.
-- Add new cli_filter plugin type to permit sites to build plugins to log,
modify, or reject CLI options within the salloc/sbatch/srun commands
themselves.
-- Allocate nodes that are booting. Previously, nodes that were being booted
were off limits for allocation. This caused more nodes to be booted than
needed in a cloud environment.
-- pam_slurm_adopt - inject SLURM_JOB_ID environment variable into adopted
processes.
-- PMIx - use the Tree-based collective for empty fence operations.
-- PMIx - replace use of the non-standard PMIX_VAL_SET macro with the
standardized PMIX_VALUE_LOAD macro.
-- slurm.spec - change --without cray option to set configure option of
--enable-really-no-cray.
-- slurm.spec - add new --with slurmsmwd option.
-- pmi2: add mutex locking to all API calls to ensure thread-safety.
-- Fix QOS usage factor to apply to TRES time limits and usage.
-- Fix multi-cluster srun's with Select/Cray and other_cons_res.
* Changes in Slurm 19.05.0pre3
==============================
-- Fix RPM packaging for accounting_storage/mysql.
* Changes in Slurm 19.05.0pre2
==============================
-- Removed select/serial plugin.
-- Remove 512-character line length limit in slurm_print_topo_record().
(Used by "scontrol show topology".)
-- Removed crypto/openssl plugin.
-- Tweak the sdiag gettimeofday() line format for greater clarity.
-- Add support for SALLOC/SBATCH/SLURM_NO_KILL environment variables.
Add salloc/sbatch/srun support for optional "--no-kill=off" option to
disable the environment variables.
-- Fix salloc and missing SLURM_NTASKS.
-- Alter the backfill scheduler behavior to prevent it from scheduling lower
priority jobs on resources that become available during the backfill
scheduling cycle when bf_continue is enabled. This behavior was available
as the bf_ignore_newly_avail_nodes option in 18.08.4+, but is now enabled
by default. (The SchedulerParameters option of bf_ignore_newly_avail_nodes
is also now removed, although harmless if still set.)
-- Make LaunchParameters=send_gids the default, introducing the reverse option
"disable_send_gids" to go back to the original behavior.
-- Limit pam_slurm_adopt to run only in the sshd context by default, for
security reasons. A new module option 'service=<name>' can be used to
allow different PAM applications to work. The option 'service=*' can be
used to restore the old behavior of always performing the adopt logic
regardless of the PAM application context.
-- pam_slurm_adopt: Use uid to determine whether root is logging in.
-- Remove sbatch --x11 option. Slurm's internal X11 forwarding is now only
supported from salloc, or an allocating srun command.
-- Suppressed printing of job id in sbatch when quiet flag is set.
-- Changed sreport 'SizesByAccount' and 'SizesByAccountAndWckey' default
behavior and added new 'AcctAsParent' option.
-- Add ave watts to api and sview.
-- Added printf attribute to setenvf() and corrected related warnings.
-- Kill a running/pending job if it is allocated GRES, that GRES has a "File"
configuration, and the GRES count changes.
-- Add new DebugFlag=Accrue for accrue accounting debugging purposes.
-- Change CryptoType option to CredType, and rename crypto/munge plugin to
cred/munge.
-- Add slurmd -G option to print GRES configuration and exit. This is useful
for testing and debugging.
-- Support GRES types that include numbers (e.g. "--gres=gpu:123g:2").
-- Remove MemLimitEnforce parameter and move functionality into
JobAcctGatherParam=OverMemoryKill.
-- sview - disable admin mode option (which would not work anyways) if the
user is not an admin in SlurmDBD.
-- Remove joules reporting from sview and scontrol.
-- Change the default fair share algorithm to "fair tree". The new
PriorityFlags option of NO_FAIR_TREE can be used to revert to "classic"
fair share scheduling instead.
-- libslurmdb has been merged into libslurm.
-- Added -b as a short option for --begin and removed the -b option which
was a left over artifact from the Moab compatibility work.
-- Add ArrayTaskThrottle to "scontrol show job" output.
-- Added SPRIO_FORMAT env variable to the sprio command.
-- Add batch step at the beginning of a batch job so that squeue, sstat, and
sacct will show the batch step.
-- Deprecated 32-bit builds.
-- Make -l and -o mutually exclusive in sacct, squeue, sinfo, and sprio.
-- Disable running job expansion by default. A new SchedulerParameter of
permit_job_expansion has been added for sites that wish to re-enable it.
-- Permit changing a job array's ArrayTaskThrottle value even if the job is
terminated (for job requeue).
-- Add scontrol requeue option of "Incomplete" which will requeue jobs only if
they failed to complete with an exit code of zero.
-- Modify GrpNodes limit to apply to unique nodes allocated (avoid double
counting nodes allocated to multiple jobs in the same QOS or association).
-- If a job submit does NOT include --cpus-per-task option, then report the
value as "N/A" rather than always mapping the value to 1.
-- X11 forwarding - use the raw value from gethostname() with xauth to avoid
authentication issues when Slurm has internally stripped off the domain
portion.
-- Change how slurmd fills in the registration message version string from
PACKAGE_VERSION to SLURM_VERSION_STRING, affecting how the version is
displayed with sview, sinfo, scontrol and through the API.
-- Remove autogen.sh script. Please use the autoreconf command instead.
-- Disable a configuration of SelectTypeParameters=CR_ONE_TASK_PER_CORE with
SelectType=select/cons_tres. This will be addressed later.
-- job_submit/lua - expose more fields off the partition record.
-- task/cgroup - prevent setting a memory.soft_limit_in_bytes higher than the
memory.limit_in_bytes since the hard limit will take precedence anyway.
-- If a GrpNodes limit is configured in an association, partition QOS or
job QOS then favor use of nodes already allocated to that entity. This
will result in the configured node "Weight" being incremented by one for
nodes which are not preferred. Consider adjusting configured node "Weight"
values to achieve the desired node preferences.
-- Add full node state debug2 output to slurmdbd node up/down update.
-- Set CUDA_VISIBLE_DEVICES and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment
variables in Prolog and Epilog for jobs requesting gres/mps.
-- Added thresholds for backfill parameters.
-- Fix for backfill sleep overflow when large values are set.
-- Execute Epilog on nodes relinquished from job (i.e. job resized).
-- Rename burst_buffer/cray plugin to burst_buffer/datawarp.
-- X11 Forwarding - reimplement using new internal network forwarding RPCs.
-- Remove slurm_jobcomp_get_errno and slurm_jobcomp_strerror from jobcomp
plugin API.
-- Optimize backfill for checking max jobs per assoc, partition, user, etc.
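The GRES entry in this release allows type names that contain digits, as in
"--gres=gpu:123g:2". The Python sketch below shows one way a name:type:count
string could be split so that a type such as "123g" is not mistaken for a
count. parse_gres and GresRequest are hypothetical, the default count of 1 is
an assumption, and count suffixes or other options accepted by Slurm are not
handled.

```python
from typing import NamedTuple, Optional


class GresRequest(NamedTuple):
    name: str
    type: Optional[str]
    count: int


def parse_gres(spec: str) -> GresRequest:
    """Split 'name[:type][:count]' into its fields (illustrative only)."""
    parts = spec.split(":")
    if len(parts) == 1:
        return GresRequest(parts[0], None, 1)
    if len(parts) == 2:
        # A purely numeric second field is a count; anything else is a type,
        # even when it starts with digits (for example "123g").
        if parts[1].isdigit():
            return GresRequest(parts[0], None, int(parts[1]))
        return GresRequest(parts[0], parts[1], 1)
    if len(parts) == 3 and parts[2].isdigit():
        return GresRequest(parts[0], parts[1], int(parts[2]))
    raise ValueError(f"unrecognized GRES specification: {spec!r}")


if __name__ == "__main__":
    print(parse_gres("gpu:123g:2"))
    print(parse_gres("gpu:2"))
```

The rule used here is that only an all-digit field counts as a number, so a
type name may begin with or contain digits as long as it has some other
character.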
* Changes in Slurm 19.05.0pre1
==============================
-- Run epilog and clean up allocation when a job is resized to zero and its
resources transferred to another job (--depend=expand).
-- If GRES are associated with specific sockets, identify those sockets in the
output of "scontrol show node". For example, if all 4 GPUs on a node are
associated with socket zero, then "Gres=gpu:4(S:0)". If associated
with sockets 0 and 1 then "Gres=gpu:4(S:0-1)". Which specific GPUs are
associated with which specific sockets is not reported; that information is
only available by parsing the gres.conf file.
-- Add configuration parameter "GpuFreqDef" to control a job's default GPU
frequency.
-- Add job flags to the database. Currently used to determine which scheduler
scheduled the job.
-- Add constraints/features to the database.
-- Add last reason job didn't run before resources/priority to the database.
-- Make it so we set the alloc_node in a resource allocation based on the auth
plugin instead of the rpc call.
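The socket suffix described in the "scontrol show node" entry above can be
sketched as follows. The function name format_gres_sockets() and its
parameters are assumptions made for illustration. This is not how Slurm
builds the string internally, and it only covers a single contiguous socket
range.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not Slurm's implementation): format a GRES entry
 * together with the sockets it is associated with, for example
 * "gpu:4(S:0)" or "gpu:4(S:0-1)". Returns the snprintf() result.
 */
static int format_gres_sockets(char *buf, size_t len, const char *name,
			       int count, int first_sock, int last_sock)
{
	if (first_sock == last_sock)
		return snprintf(buf, len, "%s:%d(S:%d)",
				name, count, first_sock);
	return snprintf(buf, len, "%s:%d(S:%d-%d)",
			name, count, first_sock, last_sock);
}
```

Calling it with ("gpu", 4, 0, 0) produces the first form quoted above, and
("gpu", 4, 0, 1) the second.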
* Changes in Slurm 18.08.10
===========================
* Changes in Slurm 18.08.9
==========================
-- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
-- Make sview work with glib2 v2.62.
-- Make Slurm compile on linux after sys/sysctl.h was deprecated.
-- Install slurmdbd.conf.example with 0600 permissions to encourage secure
use. CVE-2019-19727.
-- srun - do not continue with job launch if --uid fails. CVE-2019-19728.
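The reason for wrapping a multi-statement macro in "do {} while (0)", as done
for END_TIMER{,2,3} above, is that the macro then expands to a single
statement and is safe inside an unbraced if/else. The sketch below uses a
stand-in macro named DEMO_END_TIMER. Its body is invented for illustration
and is not the body of Slurm's END_TIMER.

```c
static int demo_stops;	/* how many times the macro body ran */
static int demo_logged;	/* second side effect of the macro body */

/*
 * Illustrative stand-in, not Slurm's END_TIMER: two statements wrapped
 * in "do { } while (0)" so that the expansion is one statement.
 */
#define DEMO_END_TIMER do {	\
	demo_stops++;		\
	demo_logged++;		\
} while (0)

/*
 * Without the wrapper only the first statement would be guarded by the
 * "if", and the dangling "else" would fail to compile.
 */
static void demo_finish(int timing_enabled)
{
	if (timing_enabled)
		DEMO_END_TIMER;
	else
		demo_logged = -1;
}
```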
* Changes in Slurm 18.08.8
==========================
-- Update "xauth list" to use the same 10000ms timeout as the other xauth
commands.
-- Fix issue in gres code to handle a gres cnt of 0.
-- Don't purge jobs if backfill is running.
-- Verify job is pending when adding/removing accrual time.
-- Don't abort when the job doesn't have an association that was removed
before the job was able to make it to the database.
-- Set state_reason if select_nodes() fails job for QOS or Account.
-- Avoid seg_fault on referencing association without a valid_qos bitmap.
-- If Association/QOS is removed on a pending job set that job as ineligible.
-- When changing a job's account/qos always make sure you remove the old limits.
-- Don't reset a FAIL_QOS or FAIL_ACCOUNT job reason until the qos or
account changed.
-- Restore "sreport -T ALL" functionality.
-- Correctly typecast signals being sent through the api.
-- Properly initialize structures throughout Slurm.
-- Sync "numtask" squeue format option for jobs and steps to "numtasks".
-- Fix sacct -PD to avoid CA before start jobs.
-- Fix potential deadlock with backup slurmctld.
-- Fixed issue with jobs not appearing in sacct after dependency satisfied.
-- Fix showing non-eligible jobs when asking with -j and not -s.
-- Fix issue with backfill scheduler scheduling tasks of an array
when not the head job.
-- accounting_storage/mysql - fix SIGABRT in the archive load logic.
-- accounting_storage/mysql - fix memory leak in the archive load logic.
-- Limit records per single SQL statement when loading archived data.
-- Fix unnecessary reloading of job submit plugins.
-- Allow job submit plugins to be turned on/off with a reconfigure.
-- Fix segfault when loading/unloading Lua job submit plugin multiple times.
-- Fix printing duplicate error messages of jobs rejected by job submit plugin.
-- Fix printing of job submit plugin messages of het jobs without pack id.
-- Fix memory leak in group_cache.c
-- Fix jobs stuck from FedJobLock when requeueing in a federation
-- Fix requeueing job in a federation of clusters with differing associations
-- sacctmgr - free memory before exiting in 'sacctmgr show runaway'.
-- Fix seff showing memory overflow when steps tres mem usage is 0.
-- Upon archive file name collision, create new archive file instead of
overwriting the old one to prevent lost records.
-- Limit archive files to 50000 records per file so that archiving large
databases will succeed.
-- Remove stray newlines in SPANK plugin error messages.
-- Fix archive loading events.
-- In select/cons_res: Only allocate 1 CPU per node with the --overcommit and
--nodelist options.
-- Fix main scheduler from potentially not running through whole queue.
-- cons_res/job_test - prevent a job from overallocating a node's memory.
-- cons_res/job_test - fix to consider a node's current allocated memory when
testing a job's memory request.
-- Fix issue where multi-node job steps on cloud nodes wouldn't finish cleaning
up until the end of the job (rather than the end of the step).
-- Fix issue with a 17.11 sbcast call to a 18.08 daemon.
-- Add new job bit_flags of JOB_DEPENDENT.
-- Make it so dependent jobs reset the AccrueTime and do not count against any
AccrueTime limits.
-- Fix sacctmgr --parsable2 output for reservations and tres.
-- Prevent slurmctld from potential segfault after job_start_data() called
for completing job.
-- Fix jobs getting on nodes with "scontrol reboot asap".
-- Record node reboot events to database.
-- Fix node reboot failure message getting to event table.
-- Don't write "(null)" to event table when no event reason exists.
-- Fix minor memory leak when clearing runaway jobs.
-- Avoid flooding slurmctld and logging when prolog complete RPC errors occur.
-- Fix GCC 9 compiler warnings.
-- Fix seff human readable memory string for values below a megabyte.
-- Fix dump/load of rejected heterogeneous jobs.
-- For heterogeneous jobs, do not count each component against the QOS or
association job limit multiple times.
-- slurmdbd - avoid reservation flag column corruption with the use of newer
flags, instead preserve the older flag fields that we can still fit in the
smallint field, and discard the rest.
-- Fix security issue in accounting_storage/mysql plugin on archive file loads
by always escaping strings within the slurmdbd. CVE-2019-12838.
* Changes in Slurm 18.08.7
==========================
-- Set debug statement to debug2 to avoid benign error messages.
-- Add SchedulerParameters option of bf_hetjob_immediate to attempt to start
a heterogeneous job as soon as all of its components are determined able to
do so.
-- Fix underflow causing decay thread to exit.
-- Fix main scheduler not considering hetjobs when building the job queue.
-- Fix regression for sacct to display old jobs without a start time.
-- Fix setting correct number of gres topology bits.
-- Update hetjobs pending state reason when appropriate.
-- Fix accounting_storage/filetxt's understanding of TRES.
-- Set Accrue time when not enforcing limits.
-- Fix srun segfault when requesting a hetjob with test_exec or bcast options.
-- Hide multipart priorities log message behind Priority debug flag.
-- sched/backfill - Make hetjobs sensitive to bf_max_job_start.
-- Fix slurmctld segfault due to job's partition pointer NULL dereference.
-- Fix issue with OR'ed job dependencies.
-- Add new job's bit_flags of INVALID_DEPEND to prevent rebuilding a job's
dependency string when it has at least one invalid and purged dependency.
-- Promote federation unsynced siblings log message from debug to info.
-- burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes.
-- burst_buffer/cray - fix memory leak due to unfreed job script content.
-- node_features/knl_cray - fix script_argv use-after-free.
-- burst_buffer/cray - fix script_argv use-after-free.
-- Fix invalid reads of size 1 due to non null-terminated string reads.
-- Add extra debug2 logs to identify why BadConstraints reason is set.
* Changes in Slurm 18.08.6-2
============================
-- Remove deadlock situation when logging and --enable-debug is used.
-- Fix RPM packaging for accounting_storage/mysql.
* Changes in Slurm 18.08.6
==========================
-- Added parsing of -H flag with scancel.
-- Fix slurmsmwd build on 32-bit systems.
-- acct_gather_filesystem/lustre - add support for Lustre 2.12 client.
-- Fix per-partition TRES factors/priority
-- Fix per-partition NICE priority
-- Fix partition access check validation for multi-partition job submissions.
-- Prevent segfault on empty response in 'scontrol show dwstat'.
-- node_features/knl_cray plugin - Preserve node's active features if it has
already booted when slurmctld daemon is reconfigured.
-- Detect missing burst buffer script and reject job.
-- GRES: Properly reset the topo_gres_cnt_alloc counter on slurmctld restart
to prevent underflow.
-- Avoid errors from packing accounting_storage_mysql.so when RPM is built
without mysql support.
-- Remove deprecated -t option from slurmctld --help.
-- acct_gather_filesystem/lustre - fix stats gathering.
-- Enforce documented default usage start and end times when querying jobs from
the database.
-- Fix issues when querying running jobs from the database.
-- Deny sacct request where start time is later than the end time requested.
-- Fix sacct verbose about time and states queried.
-- burst_buffer/cray - allow 'scancel --hurry <jobid>' to tear down a burst
buffer that is currently staging data out.
-- X11 forwarding - allow setup if the DISPLAY environment variable lacks
a screen number. (Permit both "localhost:10.0" and "localhost:10".)
-- docs - change HTML title to include the page title or man page name.
-- X11 forwarding - fix an unnecessary error message when using the
local_xauthority X11Parameters option.
-- Add use_raw_hostname to X11Parameters.
-- Fix smail so it passes job arrays to seff correctly.
-- Don't check InactiveLimit for salloc --no-shell jobs.
-- Add SALLOC_GRES and SBATCH_GRES as input to salloc/sbatch.
-- Remove drain state when node doesn't reboot by ResumeTimeout.
-- Fix considering "resuming" nodes in scheduling.
-- Do not kill suspended jobs due to exceeding time limit.
-- Add NoAddrCache CommunicationParameter.
-- Don't ping powering up cloud nodes.
-- Add cloud_dns SlurmctldParameter.
-- Consider --sbindir configure option as the default path to find slurmstepd.
-- Fix node state printing of DRAINED$
-- Fix spamming dbd of down/drained nodes in maintenance reservation.
-- Avoid buffer overflow in time_str2secs.
-- Calculate suspended time for suspended steps.
-- Add null check for step_ptr->step_node_bitmap in _pick_step_nodes.
-- Fix multi-cluster srun issue after 'scontrol reconfigure' was called.
-- Fix accessing response_cluster_rec outside of write locks.
-- Fix Lua user messages not showing up on rejected submissions.
-- Fix printing multi-line error messages on rejected submissions.
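The X11 forwarding entry above accepts a DISPLAY value with or without a
screen number. One way to parse both forms is sketched below. The function
parse_display() is hypothetical and is not taken from the Slurm source.
Defaulting a missing screen number to 0 is also an assumption of this sketch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical parser (not Slurm's code): read the display number and
 * optional screen number from a DISPLAY value of the form
 * "host:display[.screen]". A missing screen number is treated as 0.
 * Returns 0 on success, -1 if no valid display number is present.
 */
static int parse_display(const char *display, int *display_num,
			 int *screen_num)
{
	const char *colon = strrchr(display, ':');
	int d = 0, s = 0, matched;

	if (!colon)
		return -1;

	/* sscanf() returns 1 for "10" and 2 for "10.0" */
	matched = sscanf(colon + 1, "%d.%d", &d, &s);
	if (matched < 1 || d < 0)
		return -1;
	if (matched < 2)
		s = 0;

	*display_num = d;
	*screen_num = s;
	return 0;
}
```

Both "localhost:10.0" and "localhost:10" yield display 10 and screen 0 with
this sketch, while a value without a colon is rejected.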
* Changes in Slurm 18.08.5-2
============================
-- Fix Perl build for 32-bit systems.
* Changes in Slurm 18.08.5
==========================
-- Backfill - If a job has a time_limit guess the end time of a job better
if OverTimeLimit is Unlimited.
-- Fix "sacctmgr show events event=cluster"
-- Fix sacctmgr show runawayjobs from sibling cluster
-- Avoid bit offset of -1 in call to bit_nclear().
-- Ensure that "hbm" is a configured GresType on knl systems.
-- Fix NodeFeaturesPlugins=node_features/knl_generic to allow gres
other than knl.
-- cons_res: Prevent overflow on multiply.
-- Better debug for bad values in gres.conf.
-- Fix double accounting of energy at end of job.
-- Read gres.conf for cloud nodes on slurmctld.
-- Don't assume the first node of a job is the batch host when purging jobs
from a node.
-- Better debugging when a job doesn't have a job_resrcs ptr.
-- Store ave watts in energy plugins.
-- Add XCC plugin for reading Lenovo Power.
-- Fix minor memory leak when scheduling rebootable nodes.
-- Fix debug2 prefix for sched log.
-- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job.
-- sbatch - search current working directory first for job script.
-- Make it so held jobs reset the AccrueTime and do not count against any
AccrueTime limits.
-- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the
job sorting algorithm for scheduling heterogeneous jobs.
-- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures.
-- Fix segfault with job arrays using X11 forwarding.
-- Revert regression caused by e0ee1c7054 which caused negative values and
values starting with a decimal to be invalid for PriorityWeightTRES and
TRESBillingWeight.
-- Fix possibility to update a job's reservation to none.
-- Suppress connection errors to primary slurmdbd when backup dbd is active.
-- Suppress connection errors to primary db when backup db kicks in
-- Add missing fields for sacct --completion when using jobcomp/filetxt.
-- Fix incorrect values set for UserCPU, SystemCPU, and TotalCPU sacct fields
when JobAcctGatherType=jobacct_gather/cgroup.
-- Fix srun printing invalid option msg twice.
-- Remove unused -b flag from getopt call in sbatch.
-- Disable reporting of node TRES in sreport.
-- Re-enable features combined by OR within parentheses for non-knl setups.
-- Prevent sending duplicate requests to reboot a node before ResumeTimeout.
-- Down nodes that don't reboot by ResumeTimeout.
-- Update seff to reflect API change from rss_max to tres_usage_in_max.
-- Add missing TRES constants to perl API.
-- Fix issue where sacct would return incorrect array tasks when querying
specific tasks.
-- Add missing variables to slurmdb_stats_t in the perlapi.
-- Fix nodes not getting reboot RPC when job requires reboot of nodes.
-- Fix failure to update the partition list of a job.
-- Use slurm.conf gres ids instead of gres.conf names to get a gres type name.
-- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc.
CVE-2019-6438.
* Changes in Slurm 18.08.4
==========================
-- burst_buffer/cray - avoid launching a job that would be immediately
cancelled due to a DataWarp failure.
-- Fix message sent to user to display preempted instead of time limit when
a job is preempted.
-- Fix memory leak when a failure happens processing a node's gres config.
-- Improve error message when failures happen processing a node's gres config.
-- When building rpms ignore redundant standard rpaths and insecure relative
rpaths, for RHEL based distros which use "check-rpaths" tool.
-- Don't skip jobs in scontrol hold.
-- Avoid locking the job_list when unneeded.
-- Allow --cpu-bind=verbose to be used with SLURM_HINT environment variable.
-- Make it so fixing runaway jobs will not alter the same job requeued
when not runaway.
-- Avoid checking state when searching for runaway jobs.
-- Remove redundant check for end time of job when searching for runaway jobs.
-- Make sure that we properly check for runawayjobs where another job might
have the same id (for example, if a job was requeued) by also checking the
submit time.
-- Add scontrol update job ResetAccrueTime to clear a job's time
previously accrued for priority.
-- cons_res: Delay exiting cr_job_test until after cores/cpus are calculated
and distributed.
-- Fix bug where binary in cwd would trump binary in PATH with test_exec.
-- Fix check to test printf("%s\n", NULL); to not require
-Wno-format-truncation CFLAG.
-- Fix JobAcctGatherParams=UsePss to report the correct usage.
-- Fix minor memory leak in pmix plugin.
-- Fix minor memory leak in slurmctld when reading configuration.
-- Handle return codes correctly from pthread_* functions.
-- Fix minor memory leak when a slurmd is unable to contact a slurmctld
when trying to register.
-- Fix sreport sizesbyaccount report when using Flatview and accounts.
-- Fix incorrect shift when dealing with node weights and scheduling.
-- libslurm/perl - Fix segfault caused by incorrect hv_to_slurm_ctl_conf.
-- Add qos and assoc options to confirmation dialogs.
-- Handle updating identical license or partition information correctly.
-- Make sure accounts and QOS' are all lower case to match documentation
when read in from the slurm.conf file.
-- Don't consider partitions without enough nodes in reservation,
main scheduler.
-- Set SLURM_NTASKS correctly if having to determine from other options.
-- Removed GCP scripts from contribs. Now located at:
https://github.com/SchedMD/slurm-gcp.
-- Don't check existence of srun --prolog or --epilog executables when set to
"none" and SLURM_TEST_EXEC is used.
-- Add "P" suffix support to job and step tres specifications.
-- When doing a reconfigure handle QOS' GrpJobsAccrue correctly.
-- Remove unneeded extra parentheses from sh5util.
-- Fix jobacct_gather/cgroup to work correctly when more than one task is
started on a node.
-- If requesting --ntasks-per-node with no tasks set tasks correctly.
-- Accept modifiers for TRES originally added in 6f0342e0358.
-- Don't remove reservation on slurmctld restart if nodes are removed from
configuration.
-- Fix bad xfree in task/cgroup.
-- Fix removing counters if a job array isn't subject to limits and is
canceled while pending.
-- Make sure SLURM_NTASKS_PER_NODE is set correctly when env is overwritten
by the command line.
-- Clean up step on a failed node correctly.
-- mpi/pmix: Fixed the logging of collective state.
-- mpi/pmix: Make multi-slurmd work correctly when using ring communication.
-- mpi/pmix: Fix double invocation of the PMIx lib fence callback.
-- mpi/pmix: Remove unneeded libpmix callback drop in tree-based coll.
-- Fix race condition in route/topology when the slurmctld is reconfigured.
-- In route/topology validate the slurmctld doesn't try to initialize the
node system.
-- Fix issue when requesting invalid gres.
-- Validate job_ptr in backfill before restoring preempt state.
-- Fix issue when job's environment is minimal and only contains variables
Slurm is going to replace internally.
-- When handling runaway jobs remove all usage before rollup to remove any
time that wasn't existent instead of just updating lines that have time
with a lesser time.
-- salloc - set SLURM_NTASKS_PER_CORE and SLURM_NTASKS_PER_SOCKET in the
environment if the corresponding command line options are used.
-- slurmd - fix handling of the -f flag to specify alternate config file
locations.
-- Fix scheduling logic to avoid using nodes that require a reboot for KNL
node change when possible.
-- Fix scheduling logic bug. There should have been a test for _not_
NODE_SET_REBOOT to continue.
-- Fix a scheduling logic bug with respect to XOR operation support when there
are down nodes.
-- If there is a constraint construct of the form "[...&...]"
then an error is generated if more than one of those specifications
contains KNL NUMA or MCDRAM modes.
-- Fix stepd segfault race if slurmctld hasn't registered with the launching
slurmd yet delivering its TRES list.
-- Add SchedulerParameters option of bf_ignore_newly_avail_nodes to avoid
scheduling lower priority jobs on resources that become available during
the backfill scheduling cycle when bf_continue is enabled.
-- Decrement message_connections in stepd code on error path correctly.
-- Decrease an error message to be debug.
-- Fix missing suffixes in squeue.
-- pam_slurm_adopt - send an error message to the user if no Slurm jobs
can be located on the node.
-- Run SlurmctldPrimaryOffProg when the primary slurmctld process shuts down.
-- job_submit/lua: Add several slurmctld return codes.
-- job_submit/lua: Add user/group info to jobs.
-- Fix formatting issues when printing uint64_t.
-- Bump RLIMIT_NOFILE for daemons in systemd services.
-- Expand %x in job name in 'scontrol show job'.
-- salloc/sbatch/srun - print warning if mutually exclusive options of --mem
and --mem-per-cpu are both set.
* Changes in Slurm 18.08.3
==========================
-- Fix regression in 18.08.1 that caused dbd messages to not be queued up
when the dbd was down.
-- Fix regression in 18.08.1 that can cause a slurmctld crash when splitting
job array elements.
* Changes in Slurm 18.08.2
==========================
-- Correctly initialize variable in env_array_user_default().
-- Remove race condition when signaling starting step.
-- Fix issue where 17.11 jobs using GRES didn't initialize new 18.08
structures after unpack.
-- Stop removing nodes once the minimum CPU or node count for the job is
reached in the cons_res plugin.
-- Process any changes to MinJobAge and SlurmdTimeout in the slurmctld when
it is reconfigured to determine changes in its background timers.
-- Use previous SlurmdTimeout in the slurmctld after a reconfigure to
determine the time a node has been down.
-- Fix multi-cluster srun between clusters with different SelectType plugins.
-- Fix removing job licenses on reconfig/restart when configured license
counts are 0.
-- If a job requested multiple licenses and one license was removed then on
a reconfigure/restart all of the licenses -- including the valid ones
would be removed.
-- Fix issue where job's license string wasn't updated after a restart when
licenses were removed or added.
-- Add allow_zero_lic to SchedulerParameters.
-- Avoid scheduling tasks in excess of ArrayTaskThrottle when canceling tasks
of an array.
-- Fix jobs that request memory per node and task count that can't be
scheduled right away.
-- Avoid infinite loop with jobacct_gather/linux when pids wrap around
/proc/sys/kernel/pid_max.
-- Fix --parsable2 output for sacct and sstat commands to remove a stray
trailing delimiter.
-- When modifying a user's name in sacctmgr enforce PreserveCaseUser.
-- When adding a coordinator or user that was once deleted enforce
PreserveCaseUser.
-- Correctly handle scenarios where a partition's MaxMemPerCPU is less than
a job's --mem-per-cpu and also -c is greater than 1.
-- Set AccrueTime correctly when MaxJobsAccrue is disabled and BeginTime has
not been established.
-- Correctly account for job arrays for new {Max/Grp}JobsAccrue limits.
* Changes in Slurm 18.08.1
==========================
-- Remove commented-out parts of man pages related to cons_tres work in 19.05,
as these were showing up on the web version due to a syntax error.
-- Prevent slurmctld performance issues in main background loop if multiple
backup controllers are unavailable.
-- Add missing user read association lock in burst_buffer/cray during init().
-- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
-- Fix creation of step hwloc xml file for after cpuset cgroup has been
created.
-- Add userspace as a valid default governor.
-- Add timers to group_cache_lookup so if going slow advise
LaunchParameters=send_gids.
-- Fix SLURM_STEP_GRES=none to work correctly.
-- Fix potential memory leak when a failure happens unpacking a ctld_multi_msg.
-- Fix potential double free when a failure happens when unpacking a
node_registration_status_msg.
-- Fix sacctmgr show runaways.
-- Removed non-POSIX append operator from configure script for non-bash
support.
-- Fix sacct to not print huge reserve times when the job was never eligible.
-- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
burst buffer.
-- burst_buffer/cray - Update burst buffers when an association or qos
is removed from the system.
-- Remove documentation for deprecated Cray/ALPS systems. Please switch to
Native Cray mode instead.
-- Completely copy features when copying the list in the slurmctld.
-- PMIX - Fix issue with packing processes when using an arbitrary task
distribution.
-- Fix hostlists to be able to handle nodenames with '-' in them surrounded
by integers.
-- Added sort option to sprio output.
-- Fix correct job CPU count allocated.
-- Fix sacctmgr setting GrpJobs limit when setting GrpJobsAccrue limit.
-- Change the defaults to MemLimitEnforce=no and NoOverMemoryKill
(See RELEASE_NOTES).
-- Prevent abort when using Cray node features plugin on non-knl.
-- Add ability to reboot down nodes with scontrol reboot_nodes.
-- Protect against sending to the slurmdbd if the connection has gone away.
-- Fix invalid read when not using backup slurmctlds.
-- Prevent acct coordinators from changing default acct on add user.
-- Don't allow scontrol top to modify job priorities when priority == 1.
-- slurmsmwd - change parsing code to handle systems with the svid or inst
fields set in xtconsumer output.
-- Fix infinite loop in slurmctld if GRES is specified without a count.
-- sacct: Print error when unknown arguments are found.
-- Fix checking missing return codes when unpacking structures.
-- Fix slurm.spec-legacy including slurmsmwd
-- More explicit error message when cgroup oom-kill events detected.
-- When updating an association and are unable to find parent association
initialize old fairshare association pointer correctly.
-- Wrap slurm_cond_signal() calls with mutexes where needed.
-- Fix correct timeout with resends in slurm_send_only_node_msg.
-- Fix pam_slurm_adopt to honor action_adopt_failure.
-- Have the slurmd recreate the hwloc xml file for the full system on restart.
-- sdiag - correct the units for the gettimeofday() stat to microseconds.
-- Set SLURM_CLUSTER_NAME environment variable in MailProg to the ClusterName.
-- smail - use SLURM_CLUSTER_NAME environment variable.
-- job_submit/lua - expose argc/argv options through lua interface.
-- slurmdbd - prevent false-positive warning about innodb settings having
been set too low if they're actually set over 2GB.
* Changes in Slurm 18.08.0
==========================
-- Fix segfault on job arrays when starting controller without dbd up.
-- Fix pmi2 to build with gcc 8.0+.
-- Remove the development snapshot of select/cons_tres plugin.
-- Fix slurmd -C to not print benign error from xcpuinfo.
-- Fix potential double locks in the assoc_mgr.
-- Fix sacct truncate flag behavior. Truncated pending jobs will always
return a start and end time set to the window end time, so elapsed
time is 0.
-- Fix extern step hanging forever when canceled right after creation.
-- sdiag - add slurmctld agent count.
-- Remove requirement to have cgroup_allowed_devices_file.conf in order to
constrain devices. By default all devices are allowed and GRES, that are
associated with a device file, that are not requested are restricted.
-- Fix proper alignment of clauses when determining if more nodes are needed
for an allocation.
-- Fix race condition when canceling a federation job that just started
running.
-- Prevent extra resources from being allocated when combining certain flags.
-- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
when using --hint=nomultithread.
-- Fix left over socket file when step is ending and using pmi2 with
%n or %h in the spool dir.
-- Don't remove hwloc full system xml file when shutting down the slurmd.
-- Fix segfault that could happen with a het job when it was canceled while
starting.
-- Fix scan-build false-positive warning about invalid memory access in the
_ping_controller() function.
-- Add control_inx value to trigger_info_msg_t to permit future work in the
trigger management code to distinguish which of multiple backup controllers
has changed state.
* Changes in Slurm 18.08.0rc1
==============================