
Octopus Maintenance

Slurm

Status of slurm (as of 2025-12)

sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
workers*     up   infinite      8   idle octopus[03,05-11]
allnodes     up   infinite      3  alloc tux[06,08-09]
allnodes     up   infinite     11   idle octopus[02-03,05-11],tux[05,07]
tux          up   infinite      3  alloc tux[06,08-09]
tux          up   infinite      2   idle tux[05,07]
1tbmem       up   infinite      1   idle octopus02
headnode     up   infinite      1   idle octopus01
highmem      up   infinite      2   idle octopus[02,11]
386mem       up   infinite      6   idle octopus[03,06-10]
lowmem       up   infinite      7   idle octopus[03,05-10]
sinfo -R
squeue

We have draining nodes, but no jobs running on them.
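To pin down which nodes are draining and whether anything still runs on them, these variants help (octopus05 below is just a placeholder node name):

sinfo --states=drain,draining   # show only nodes in a drain state
squeue --nodelist=octopus05     # list jobs still running on a given node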

Reviving draining node (as root)

scontrol
  update NodeName=octopus05 State=DOWN Reason="undraining"
  update NodeName=octopus05 State=RESUME
  show node octopus05

Kill time can lead to drain state

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
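If jobs take longer than this to be killed (for example when they hang on a slow NFS unmount), slurmd drains the node, typically with a 'Kill task failed' reason. One possible mitigation, sketched here as an assumption rather than a change we have made, is to raise the timeout in '/etc/slurm/slurm.conf' on the master:

# in /etc/slurm/slurm.conf (hypothetical value)
UnkillableStepTimeout=120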

Check that the configuration is valid with 'slurmd -C' and update the nodes with

scontrol reconfigure

Password management

So we created a script that can deploy files from octopus01 (the head node). Unfortunately the user IDs in /etc/passwd do not match across the nodes, so we cannot copy those files yet.

See /etc/nodes for the script, the ssh files, sudoers, etc.

Basically, the root user can copy the files across.
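A minimal sketch of what such a deployment might look like (the node list and file names here are purely illustrative; the real script lives under /etc/nodes):

for node in octopus02 tux05 tux06; do
    # push shared ssh material and sudoers from the head node to each node
    rsync -a /etc/nodes/ssh/ root@${node}:/root/.ssh/
    rsync -a /etc/nodes/sudoers root@${node}:/etc/sudoers
done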

Execute binaries on mounted devices

To avoid './scratch/script.sh: Permission denied' when the script lives on a device mounted with 'noexec' ('device_file' below):

- 'sudo bash'
- 'ls /scratch -l' to check which device '/scratch' is mounted from
- 'vim /etc/fstab'
- replace 'noexec' with 'exec' in the options for 'device_file'
- 'mount -o remount [device_file]' to remount the partition with its new configuration (see the example below)
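For instance, if '/scratch' lived on a hypothetical '/dev/sdb1', the change would look like:

# /etc/fstab, before:
#   /dev/sdb1  /scratch  ext4  defaults,noexec  0  2
# /etc/fstab, after:
#   /dev/sdb1  /scratch  ext4  defaults,exec    0  2
mount -o remount /scratch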

Some notes:

root@tux09:~# mkdir -p /var/lib/nfs/statd
root@tux09:~# systemctl enable rpcbind
Synchronizing state of rpcbind.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable rpcbind
root@tux09:~# systemctl list-unit-files | grep -E 'rpc-statd.service'
rpc-statd.service                  static          -

The matching /etc/fstab entry, which makes the NFS mount wait for network-online.target with an x-systemd.device-timeout:

10.0.0.110:/export/3T /mnt/3T nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
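After adding or changing the entry, the mount can be exercised without a reboot (assuming the mount point /mnt/3T from the entry above):

systemctl daemon-reload   # regenerate the mount/automount units from /etc/fstab
ls /mnt/3T                # accessing the path triggers the systemd automount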

Installation of 'munge' and 'slurm' on a new node

Current nodes in the pool have:

munge --version
    munge-0.5.13 (2017-09-26)
sbatch --version
    slurm-wlm 18.08.5-2

To install 'munge', go to 'octopus01' and run:

guix package -i munge@0.5.14 -p /export/octopus01/guix-profiles/slurm

systemctl status munge # to check if the service is running and where its service file is

We need to set up the rights for 'munge':

sudo bash

addgroup -gid 900 munge
adduser -uid 900 -gid 900 --disabled-password munge

sed 's,/home/munge:/bin/bash,/var/lib/munge:/usr/sbin/nologin,g' /etc/passwd -i

mkdir -p /var/lib/munge
chown munge:munge /var/lib/munge/

mkdir -p /etc/munge
# copy 'munge.key' (from a working node) to '/etc/munge/munge.key'
chown -R munge:munge /etc/munge

mkdir -p /run/munge
chown munge:munge /run/munge

mkdir -p /var/log/munge
chown munge:munge /var/log/munge

mkdir -p /var/run/munge # todo: not sure why it needs such a folder
chown munge:munge /var/run/munge

# copy 'munge.service' (from a working node) to '/etc/systemd/system/munge.service'

systemctl daemon-reload
systemctl enable munge
systemctl start munge
systemctl status munge
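Before testing across the network, a quick local round trip on the new node is a useful sanity check (using the same profile path as the test below; adjust if the binaries live elsewhere):

/export/octopus01/guix-profiles/slurm-2-link/bin/munge -n | /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge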

To test the new installation, go to 'octopus01' and then:

munge -n | ssh tux08 /export/octopus01/guix-profiles/slurm-2-link/bin/unmunge

If you get 'STATUS: Rewound credential (16)', it means the encoding and decoding times differ, i.e. the clocks of the two machines are out of sync. To fix it, log into the new machine and set the time with

sudo date MMDDhhmmYYYY.ss
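For example, to set the clock to 13:45:00 on 1 December 2025 (an arbitrary example date):

sudo date 120113452025.00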

To install 'slurm', go to 'octopus01' and run:

guix package -i slurm@18.08.9 -p /export/octopus01/guix-profiles/slurm

We need to set up the rights for 'slurm':

sudo bash

addgroup -gid 901 slurm
adduser -uid 901 -gid 901 --no-create-home --disabled-password slurm

sed 's,/home/slurm:/bin/bash,/var/lib/slurm:/bin/bash,g' /etc/passwd -i

mkdir -p /var/lib/slurm
chown slurm:slurm /var/lib/slurm/

mkdir -p /etc/slurm
# copy 'slurm.conf' to '/etc/slurm/slurm.conf'
# copy 'cgroup.conf' to '/etc/slurm/cgroup.conf'

chown -R slurm:slurm /etc/slurm

mkdir -p /run/slurm
chown slurm:slurm /run/slurm

mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm

# copy 'slurm.service' to '/etc/systemd/system/slurm.service'

/export/octopus01/guix-profiles/slurm-2-link/sbin/slurmd -f /etc/slurm/slurm.conf -C | head -n 1 >> /etc/slurm/slurm.conf # add node configuration information
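For reference, the first line that 'slurmd -C' prints is the node's hardware description; it looks roughly like this (the values are illustrative, not taken from a real octopus node):

NodeName=tux08 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000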

systemctl daemon-reload
systemctl enable slurm
systemctl start slurm
systemctl status slurm

On 'octopus01' (the master):

sudo bash

# add the new node to '/etc/slurm/slurm.conf'

systemctl restart slurmctld # after editing /etc/slurm/slurm.conf on the master
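To confirm that the new node has registered with the controller (tux08 is just an example name):

sinfo --Node --long | grep tux08
scontrol show node tux08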

Removing a node

We are removing octopus03 so it can become the new head node:

scontrol update nodename=octopus03 state=drain reason="removing"
scontrol show node octopus03 | grep State
scontrol update nodename=octopus03 state=down reason="removed"
scontrol show node octopus03 | grep State
  State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
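The remaining step, not captured above, is presumably to delete the octopus03 lines from '/etc/slurm/slurm.conf' on octopus01 and restart the controller:

# on octopus01, after removing octopus03 from /etc/slurm/slurm.conf
systemctl restart slurmctld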
(made with skribilo)