Status of slurm (as of 202512)
sinfo

workers*   up   infinite    8   idle   octopus[03,05-11]
allnodes   up   infinite    3   alloc  tux[06,08-09]
allnodes   up   infinite   11   idle   octopus[02-03,05-11],tux[05,07]
tux        up   infinite    3   alloc  tux[06,08-09]
tux        up   infinite    2   idle   tux[05,07]
1tbmem     up   infinite    1   idle   octopus02
headnode   up   infinite    1   idle   octopus01
highmem    up   infinite    2   idle   octopus[02,11]
386mem     up   infinite    6   idle   octopus[03,06-10]
lowmem     up   infinite    7   idle   octopus[03,05-10]
sinfo -R
squeue
We have draining nodes, but no jobs running on them.
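To double-check that a particular draining node really is idle before reviving it, squeue can be restricted to that node (octopus05 below is just an example name):

squeue -w octopus05   # should list no running or pending jobs on that node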
Reviving draining node (as root)
scontrol update NodeName=octopus05 State=DOWN State=DOWN Reason="undraining"
scontrol update NodeName=octopus05 State=RESUME
scontrol show node octopus05
A job step that cannot be killed within the kill timeout can put a node in drain state
scontrol show config | grep kill

UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
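If legitimate jobs routinely need more than 60 seconds to die, the timeout can be raised in /etc/slurm/slurm.conf; the value below is only an illustrative example:

# /etc/slurm/slurm.conf
UnkillableStepTimeout=120   # seconds; example value, tune for the workload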
Check that the node configuration is valid with 'slurmd -C' and update the nodes with:
scontrol reconfigure
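As a quick sanity check on a node, the detected hardware (first line of 'slurmd -C') can be compared by eye with that node's entry in slurm.conf; the slurmd path below matches the Guix profile used elsewhere in these notes:

/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -C | head -n 1   # detected NodeName=... line
grep "NodeName=$(hostname)" /etc/slurm/slurm.conf                          # configured line for this node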
So we create a script that can deploy files from octopus01 (the head node). Unfortunately the ids in passwd do not match, so we can't copy those yet.
See /etc/nodes for the script and ssh files, sudoers, etc.
Basically, the root user can copy files across.
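A minimal sketch of what such a deploy script could look like (the node list, the use of scp, and mirroring the source path on the destination are assumptions; the real script lives in /etc/nodes):

#!/bin/bash
# Hypothetical deploy sketch: push one file from octopus01 to every node
# as root, keeping the same path on the destination.
file="$1"
for node in octopus{02..11} tux{05..09}; do
    scp "$file" "root@${node}:${file}"
done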
To avoid './scratch/script.sh: Permission denied' on a partition ('device_file') mounted with 'noexec':
- 'sudo bash'
- 'ls /scratch -l' to check where '/scratch' is
- 'vim /etc/fstab'
- replace 'noexec' with 'exec' for 'device_file'
- 'mount -o remount [device_file]' to remount the partition with its new configuration (see the sketch after this list).
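A minimal before/after sketch of the fstab change (the device and filesystem below are made up; use the real entry found via 'ls /scratch -l' and /etc/fstab):

# /etc/fstab -- before (hypothetical device):
#   /dev/sdb1  /scratch  ext4  defaults,noexec  0  2
# after:
#   /dev/sdb1  /scratch  ext4  defaults,exec    0  2
mount -o remount /scratch   # apply the new options without a reboot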
Some notes:
root@tux09:~# mkdir -p /var/lib/nfs/statd
root@tux09:~# systemctl enable rpcbind
Synchronizing state of rpcbind.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable rpcbind
root@tux09:~# systemctl list-unit-files | grep -E 'rpc-statd.service'
rpc-statd.service    static    -
The NFS mount in /etc/fstab waits for network-online.target and sets a device timeout via 'x-systemd.requires=' and 'x-systemd.device-timeout=':

10.0.0.110:/export/3T /mnt/3T nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
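After editing /etc/fstab, the entry can be exercised without a reboot (a minimal check, assuming the mount point above):

systemctl daemon-reload   # pick up the regenerated (auto)mount units
mount /mnt/3T             # mount via the fstab entry
df -h /mnt/3T             # confirm the NFS share is mounted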
Current nodes in the pool have:
munge --version
munge-0.5.13 (2017-09-26)
sbatch --version
slurm-wlm 18.08.5-2
To install 'munge', go to 'octopus01' and run:
guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm
systemctl status munge   # to check if the service is running and where its service file is
We need to set up the rights for 'munge':
sudo bash
addgroup -gid 900 munge
adduser -uid 900 -gid 900 --disabled-password munge
sed 's,/home/munge:/bin/bash,/var/lib/munge:/usr/sbin/nologin,g' /etc/passwd -i
mkdir -p /var/lib/munge
chown munge:munge /var/lib/munge/
mkdir -p /etc/munge
# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
chown -R munge:munge /etc/munge
mkdir -p /run/munge
chown munge:munge /run/munge
mkdir -p /var/log/munge
chown munge:munge /var/log/munge
mkdir -p /var/run/munge   # todo: not sure why it needs such a folder
chown munge:munge /var/run/munge
# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'
systemctl daemon-reload
systemctl enable munge
systemctl start munge
systemctl status munge
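For reference, a hypothetical sketch of what /etc/systemd/system/munge.service roughly looks like; the copy taken from a working node is authoritative, and the munged path below assumes the shared Guix profile:

[Unit]
Description=MUNGE authentication service
After=network.target

[Service]
Type=forking
User=munge
Group=munge
# Path is an assumption based on the profile used elsewhere in these notes.
ExecStart=/export/octopus01/guix-profiles/slurm-2-link/sbin/munged
PIDFile=/var/run/munge/munged.pid

[Install]
WantedBy=multi-user.target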
To test the new installation, go to 'octopus01' and then:
munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge
If you get 'STATUS: Rewound credential (16)', it means there is a difference between the encoding and decoding times. To fix it, go to the new machine and set the time with:
sudo date MMDDhhmmYYYY.ss
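For example, to set the clock to 2025-07-02 15:30:00 (an arbitrary illustrative timestamp):

sudo date 070215302025.00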
To install 'slurm', go to 'octopus01' and run:
guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm
We need to set up the rights for 'slurm':
sudo bash
addgroup -gid 901 slurm
adduser -uid 901 -gid 901 --no-create-home --disabled-password slurm
sed 's,/home/slurm:/bin/bash,/var/lib/slurm:/bin/bash,g' /etc/passwd -i
mkdir -p /var/lib/slurm
chown slurm:slurm /var/lib/slurm/
mkdir -p /etc/slurm
# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'
chown -R slurm:slurm /etc/slurm
mkdir -p /run/slurm
chown slurm:slurm /run/slurm
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm
# copy 'slurm.service' to '/etc/systemd/system/slurm.service'
/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf   # add node configuration information
systemctl daemon-reload
systemctl enable slurm
systemctl start slurm
systemctl status slurm
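For reference, a hypothetical sketch of what /etc/systemd/system/slurm.service for a compute node roughly looks like; copy the unit from a working node where possible, and make sure PIDFile matches SlurmdPidFile in slurm.conf:

[Unit]
Description=Slurm node daemon
Requires=munge.service
After=munge.service network.target

[Service]
Type=forking
# Paths are assumptions based on the Guix profile used elsewhere in these notes.
ExecStart=/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf
PIDFile=/run/slurm/slurmd.pid
KillMode=process

[Install]
WantedBy=multi-user.target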
On 'octopus01' (the master):
sudo bash
# add the new node to '/etc/slurm/slurm.conf'
systemctl restart slurmctld   # after editing /etc/slurm/slurm.conf on the master
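The added lines might look roughly like this (tux10 and the hardware numbers are placeholders; take the real NodeName line from 'slurmd -C' on the new node):

NodeName=tux10 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000
PartitionName=tux Nodes=tux[05-10] Default=NO MaxTime=INFINITE State=UP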
We are removing octopus03 so it can become the new head node:
scontrol update nodename=octopus03 state=drain reason="removing"
scontrol show node octopus03 | grep State
scontrol update nodename=octopus03 state=down reason="removed"

   State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A