Installing Slurm across a multi-node cluster

In this example Slurm cluster we have three nodes: node1, node2 and node3.

node1 is at IP 10.0.0.1, node2 at 10.0.0.2 and node3 at 10.0.0.3.

A critical prerequisite is that forward and reverse hostname/IP lookups (via /etc/hosts or DNS) work on every node and return the same, correct information for each hostname and IP in the cluster.
In this lab cluster, the /etc/hosts file on every node lists all of the hostnames and IPs used in the cluster.
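
For this lab, a minimal /etc/hosts that satisfies the prerequisite (keeping the stock loopback entry) looks the same on every node:

127.0.0.1   localhost localhost.localdomain
10.0.0.1    node1
10.0.0.2    node2
10.0.0.3    node3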

All our cluster nodes are running the latest CentOS 9 Linux, updated and rebooted after running "dnf distro-sync" so they are all on the same CentOS/RHEL software versions.
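
If you are building the nodes now, that update step is simply (assuming you are free to reboot):

dnf -y distro-sync
reboot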

Firewalld is enabled, and the following firewall-cmd commands have been run on each node to open the default slurmctld (6817), slurmd (6818) and slurmdbd (6819) ports, plus the srun port range used later in slurm.conf:

firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="6817" accept'
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="6818" accept'
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="6819" accept'
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="60001-60100" accept' && firewall-cmd --reload
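
You can confirm the rules are active on each node with:

firewall-cmd --zone=public --list-rich-rules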

Slurm uses MUNGE for authentication between the nodes, so install it on every node:

dnf -y install epel-release
dnf install -y munge munge-libs

On node1 (your "main" or "head" node) run: /usr/sbin/create-munge-key

Then set the ownership and permissions that MUNGE expects on RHEL-family systems by running the following on each node:

sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key


Then copy the /etc/munge/munge.key file to all your nodes.
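
Assuming root ssh access from node1 to the other nodes (adjust for your own setup), copying the key can be as simple as:

scp -p /etc/munge/munge.key node2:/etc/munge/munge.key
scp -p /etc/munge/munge.key node3:/etc/munge/munge.key

Re-run the chown/chmod commands above on node2 and node3 afterwards so the copied key has the correct ownership and mode.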

With munge installed (via the dnf command above) and the key in place on each node, enable and start it:

systemctl enable munge && systemctl start munge
systemctl status munge

Some tests and debugging steps need each node to be able to ssh into the others, so set up your authorized_keys files on each node as needed.
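
With ssh working, a quick way to confirm MUNGE credentials are accepted across nodes is the standard munge/unmunge round trip:

munge -n | unmunge              # local check on the current node
munge -n | ssh node2 unmunge    # a credential created here should decode on node2
munge -n | ssh node3 unmunge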

Now install Slurm with the following on node1:

dnf install -y slurm slurm-slurmctld slurm-slurmdbd mariadb-server slurm-slurmd

And do the following installs on every compute node:

dnf install -y slurm slurm-slurmd

To configure Slurm, all nodes use the same /etc/slurm/slurm.conf file, which critically has the following changes from the default (a combined sketch follows this list):

  • Set ClusterName
  • SlurmctldHost=node1(10.0.0.1)
  • List all nodes in NodeName= entries:
    NodeName=node1 CPUs=2 State=UNKNOWN
    NodeName=node2 CPUs=2 State=UNKNOWN
    NodeName=node3 CPUs=2 State=UNKNOWN
  • Set SrunPortRange so that srun uses the same port range opened in the firewall rules above:
    SrunPortRange=60001-60100
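
Putting those together, a minimal sketch of the relevant slurm.conf lines looks like this (the ClusterName value and the PartitionName line are illustrative additions, not from the stock file; you will want at least one partition so jobs have somewhere to run):

ClusterName=lab
SlurmctldHost=node1(10.0.0.1)
SrunPortRange=60001-60100
NodeName=node1 CPUs=2 State=UNKNOWN
NodeName=node2 CPUs=2 State=UNKNOWN
NodeName=node3 CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=node[1-3] Default=YES MaxTime=INFINITE State=UP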

Restart slurmctld on the head/main node (node1) and then restart slurmd on all nodes.
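
On a fresh install you will also want the services enabled at boot. Assuming the standard unit names shipped with the EL slurm packages, that is:

systemctl enable --now slurmctld     # on node1 only
systemctl enable --now slurmd        # on node1, node2 and node3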

Check the Slurm cluster status with the sinfo command from any node; in normal operation every node should return the same information.
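
As a quick end-to-end test (assuming the example partition above, or another default partition, exists), run a trivial job across all three nodes; it should print each node's hostname:

srun -N3 hostname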

Debug with logfiles in /var/log/slurm/
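
Assuming the default log file names used by the EL packages, the two most useful are:

tail -f /var/log/slurm/slurmctld.log     # on node1
tail -f /var/log/slurm/slurmd.log        # on each node running slurmd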
