In this example Slurm cluster we have 3 nodes: node1, node2, and node3.
node1 is on IP 10.0.0.1, node2 is on 10.0.0.2, and node3 is on 10.0.0.3.
A critical prerequisite is that forward and reverse hostname/IP lookups (whether via /etc/hosts or DNS) work on every node and return the same, correct information for each hostname and IP in the cluster.
In this lab cluster, the /etc/hosts file on every node lists all the hostnames and IPs used in the cluster.
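For reference, the /etc/hosts entries on each node in this lab would look like the lines below (hostnames and IPs taken from above), and you can verify both forward and reverse lookups with getent:

10.0.0.1    node1
10.0.0.2    node2
10.0.0.3    node3

getent hosts node1
getent hosts 10.0.0.1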
All our cluster nodes are running the latest CentOS Stream 9, updated and rebooted after running "dnf distro-sync" so they are all on the same CentOS/RHEL software versions.
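For example, something like the following on each node (the reboot afterwards is how we made sure everything, including the kernel, was in sync):

dnf -y distro-sync && reboot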
Firewalld is enabled and the following firewall-cmd commands have been run on each node:
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="6817" accept'
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="6818" accept'
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="6819" accept'
firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port protocol="tcp" port="60001-60100" accept' && firewall-cmd --reload
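For reference, 6817 is the default slurmctld port, 6818 the default slurmd port, and 6819 the default slurmdbd port; the 60001-60100 range is for srun, which we configure in slurm.conf below. You can confirm the rules took effect on each node with:

firewall-cmd --zone=public --list-rich-rules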
Whether or not you think you need munge, it's best to install it (Slurm uses it by default to authenticate between nodes), so do:
dnf -y install epel-release
dnf install -y munge munge-libs
On node1 (your “main” or “head” node) run: /usr/sbin/create-munge-key
Then set the file and directory ownership and permissions that munge needs on RHEL-family systems by running the following on each node:
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
Then copy the /etc/munge/munge.key file to all your nodes.
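For example, from node1 (adjust the user if you are not copying as root):

scp -p /etc/munge/munge.key root@node2:/etc/munge/munge.key
scp -p /etc/munge/munge.key root@node3:/etc/munge/munge.key

Re-run the chown/chmod commands above on the copies if the ownership or permissions did not carry over.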
Get munge running on each node (using the dnf install command above) and then:
systemctl enable munge && systemctl start munge
systemctl status munge
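Once munge is running everywhere, a quick sanity check is to encode a credential locally and decode it both locally and on a remote node (the remote test assumes ssh access, covered next):

munge -n | unmunge
munge -n | ssh node2 unmunge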
Some tests and debugging steps need each node to be able to ssh into the others, so set up the authorized_keys files on each node as needed.
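For example, to set that up for root in this lab (use a regular user instead if you prefer):

ssh-keygen -t ed25519
ssh-copy-id root@node2
ssh-copy-id root@node3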
Now install Slurm with the following on node1:
dnf install -y slurm slurm-slurmctld slurm-slurmdbd mariadb-server slurm-slurmd
And do the following installs on every compute node:
dnf install -y slurm slurm-slurmd
To configure Slurm, all nodes get the same /etc/slurm/slurm.conf file, which critically has the following changes from the default (a consolidated excerpt follows this list):
- Set ClusterName
- SlurmctldHost=node1(10.0.0.1)
- List all nodes in NodeName= entries:
NodeName=node1 CPUs=2 State=UNKNOWN
NodeName=node2 CPUs=2 State=UNKNOWN
NodeName=node3 CPUs=2 State=UNKNOWN
- Add a critical firewall compatibility setting, which must match the port range opened in the firewall rules above:
SrunPortRange=60001-60100
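Pulled together, the relevant part of /etc/slurm/slurm.conf looks something like this minimal sketch; the ClusterName and PartitionName values here are just example names for this lab:

ClusterName=lab
SlurmctldHost=node1(10.0.0.1)
SrunPortRange=60001-60100
NodeName=node1 CPUs=2 State=UNKNOWN
NodeName=node2 CPUs=2 State=UNKNOWN
NodeName=node3 CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP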
Restart slurmctld on the head/main node (node1) and then restart slurmd on all nodes.
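In practice that means, on node1:

systemctl enable slurmctld && systemctl restart slurmctld

and on all three nodes (node1 also runs slurmd here, since it is listed as a compute node):

systemctl enable slurmd && systemctl restart slurmd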
Check Slurm cluster status with the sinfo command from any node; in normal operation every node should return the same information.
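With the example partition above, the sinfo output would look roughly like this (illustrative), and a quick srun test also exercises the SrunPortRange ports through the firewall:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle node[1-3]

srun -N3 hostname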
Debug with logfiles in /var/log/slurm/
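For example (file names assume the packaging defaults for the log paths in slurm.conf):

tail -f /var/log/slurm/slurmctld.log    # on node1
tail -f /var/log/slurm/slurmd.log       # on each node running slurmd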