{"id":9367,"date":"2025-10-16T00:44:00","date_gmt":"2025-10-16T07:44:00","guid":{"rendered":"http:\/\/blog.networkpresence.co\/?p=9367"},"modified":"2026-02-22T21:27:59","modified_gmt":"2026-02-23T04:27:59","slug":"installing-slurm-across-a-multi-node-cluster","status":"publish","type":"post","link":"https:\/\/blog.networkpresence.co\/?p=9367","title":{"rendered":"Installing Slurm across a multi-node cluster"},"content":{"rendered":"\n<p>In this example Slurm cluster we have 3 nodes: node1, node2 and node3.<\/p>\n\n\n\n<p>node1 is on IP 10.0.0.1, node2 is on 10.0.0.2 and node3 is on 10.0.0.3.<\/p>\n\n\n\n<p>A critical pre-req is that forward and reverse lookups of hostnames and IP addresses, whether via \/etc\/hosts or DNS, all work and return the same, correct information about each node in the cluster.<br>In this lab cluster, the \/etc\/hosts file lists all hostnames and IPs used in the cluster.<\/p>\n\n\n\n<p>All our cluster nodes are running the latest CentOS Stream 9, updated with &#8220;dnf distro-sync&#8221; and then rebooted so that they are all on the same CentOS\/RHEL software versions.<\/p>\n\n\n\n<p>Firewalld is enabled and the following firewall-cmd commands have been run on each node to open Slurm&#8217;s default ports (6817 for slurmctld, 6818 for slurmd, 6819 for slurmdbd) and the srun port range set later in slurm.conf:<\/p>\n\n\n\n<p><code>firewall-cmd --permanent --zone=public --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0\/24\" port protocol=\"tcp\" port=\"6817\" accept'<br>firewall-cmd --permanent --zone=public --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0\/24\" port protocol=\"tcp\" port=\"6818\" accept'<br>firewall-cmd --permanent --zone=public --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0\/24\" port protocol=\"tcp\" port=\"6819\" accept'<br>firewall-cmd --permanent --zone=public --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0\/24\" port protocol=\"tcp\" port=\"60001-60100\" accept' &amp;&amp; firewall-cmd --reload<\/code><\/p>\n\n\n\n<p>Whether you need munge or not, it&#8217;s best to 
install it, so on every node run:<\/p>\n\n\n\n<p><code>dnf -y install epel-release<br>dnf install -y munge munge-libs<\/code><br><br>On node1 (your &#8220;main&#8221; or &#8220;head&#8221; node) run: <code>\/usr\/sbin\/create-munge-key<\/code><\/p>\n\n\n\n<p>Then set the file and directory ownership and permissions that munge needs on RHEL-family systems by running the following on each node:<br><br><code>chown -R munge: \/etc\/munge\/ \/var\/log\/munge\/ \/var\/lib\/munge\/ \/run\/munge\/<br>chmod 0700 \/etc\/munge\/ \/var\/log\/munge\/ \/var\/lib\/munge\/<br>chmod 0755 \/run\/munge\/<br>chmod 0700 \/etc\/munge\/munge.key<\/code><\/p>\n\n\n\n<p><br>Then copy the <strong>\/etc\/munge\/munge.key<\/strong> file from node1 to all your other nodes, and re-run the chown and chmod commands above on those nodes so that the copied key is owned by munge and not readable by anyone else.<br><br>Get munge going on each node with the above dnf install command and then:<br><br><code>systemctl enable munge &amp;&amp; systemctl start munge<br>systemctl status munge<\/code><\/p>\n\n\n\n<p>Some tests and debugging steps need each node to be able to ssh into the others, so set up your authorized_keys files on each node as needed.<\/p>\n\n\n\n<p>Now install Slurm with the following on node1:<\/p>\n\n\n\n<p><code>dnf install -y slurm slurm-slurmctld slurm-slurmdbd mariadb-server slurm-slurmd<\/code><\/p>\n\n\n\n<p>And install the following on every compute node:<\/p>\n\n\n\n<p><code>dnf install -y slurm slurm-slurmd<\/code><\/p>\n\n\n\n<p>To configure Slurm, 
all nodes have the same <strong>\/etc\/slurm\/slurm.conf<\/strong> file, which critically has the following changes from the default:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Set <code>ClusterName<\/code><\/li>\n\n\n\n<li><code>SlurmctldHost=node1(10.0.0.1)<\/code><\/li>\n\n\n\n<li>List all nodes in NodeName= entries:<br><code>NodeName=node1 CPUs=2 State=UNKNOWN<br>NodeName=node2 CPUs=2 State=UNKNOWN<br>NodeName=node3 CPUs=2 State=UNKNOWN<\/code><\/li>\n\n\n\n<li>Add a critical firewall compatibility setting, matching the port range opened in the firewall-cmd rules above:<br><code>SrunPortRange=60001-60100<\/code><\/li>\n<\/ul>\n\n\n\n<p>Restart slurmctld on the head\/main node (node1) and then restart slurmd on all nodes.<br><br>On node1, run: <code>systemctl restart slurmctld &amp;&amp; systemctl restart slurmd<\/code><br><br>On the compute nodes only (not node1), run: <code>systemctl restart slurmd<\/code><\/p>\n\n\n\n<p>Check the Slurm cluster status with the <code>sinfo<\/code> command from any node; in normal operation every node should return the same information.<br><br>Debug with the log files in \/var\/log\/slurm\/<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this example Slurm cluster we have 3 nodes, node1, node2 and node3 node1 is on IP 10.0.0.1, node2 is on 10.0.0.2 and node3 is on 10.0.0.3 A critical pre-req is that your \/etc\/hosts or DNS forward and reverse hostname &hellip; <a href=\"https:\/\/blog.networkpresence.co\/?p=9367\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[70],"tags":[168],"class_list":["post-9367","post","type-post","status-publish","format-standard","hentry","category-sales","tag-hpc"],"_links":{"self":[{"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=\/wp\/v2\/posts\/9367","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9367"}],"version-history":[{"count":8,"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=\/wp\/v2\/posts\/9367\/revisions"}],"predecessor-version":[{"id":9377,"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=\/wp\/v2\/posts\/9367\/revisions\/9377"}],"wp:attachment":[{"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9367"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9367"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.networkpresence.co\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9367"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}