
Adding new nodes to an existing cluster

The following steps apply if you are adding new nodes to your cluster and these nodes will be cloned from your existing node image.

First, edit /act/etc/act_nodes.conf and add your new node definitions below the existing node definitions. If you do not already have these definitions, ACT support can provide them.

Next, edit /act/etc/act_util.conf:

$ vi /act/etc/act_util.conf

Look for the [node] section:

[node] 
type=range 
start=1 
end=10

Increase the end value of every range by the number of nodes that you are adding. For example, if you had 10 nodes and are adding 8 more, change '10' to '18'.

The lines to look for are as follows:

end=
dev[eth0]_ipend=
dev[ipmi]_ipend=

If you have InfiniBand:

dev[ib0]_ipend=
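
The arithmetic described above can be sketched as a small shell filter. This is only an illustration and not part of the ACT tools: the function name bump_range_end is hypothetical, and it assumes the values to change are plain integers (as in end=10). Lines whose value is anything else, such as a dotted IP address, are passed through unchanged and still need to be edited by hand.

```shell
# bump_range_end ADDED
# Hypothetical helper: reads config text on stdin and adds ADDED to every
# "end=" or "dev[...]_ipend=" line whose value is a plain integer.
# All other lines are printed unchanged.
bump_range_end() {
    awk -v added="$1" '
        /^(end|dev\[[a-z0-9]+\]_ipend)=[0-9]+[ \t]*$/ {
            split($0, kv, "=")
            print kv[1] "=" (kv[2] + added)
            next
        }
        { print }
    '
}

# Example: preview the change without touching the real file.
# bump_range_end 8 < /act/etc/act_util.conf | diff /act/etc/act_util.conf -
```

Previewing the result with diff before editing keeps the real configuration file untouched until you have confirmed which lines would change.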

Regenerate all the appropriate configuration files by running this command:

$ /act/bin/act_cfgfile --hosts --ssh --cloner --dhcp --prefix=/

Restart DHCP since new hosts were added:

$ service dhcpd restart

Copy the new hosts and known_hosts files to all the nodes:

$ /act/bin/act_cp -a /etc/hosts
$ /act/bin/act_cp -a /etc/ssh/ssh_known_hosts2

Log in to node01 as root and run the following in order to update your compute node image:

$ /act/cloner/bin/cloner --server=head --image=node

Back on the head node, run the following, replacing node11-node18 with the names and range of your new nodes:

$ /act/bin/act_netboot -r node11-node18 --set=cloner3
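
The node range used in the command above follows from the old node count and the number of nodes added. A minimal sketch of that calculation, assuming the two-digit nodeNN naming used in this article (the helper name new_node_range is hypothetical):

```shell
# new_node_range EXISTING ADDED
# Hypothetical helper: prints the range of the newly added nodes, assuming
# names follow the two-digit "nodeNN" pattern used in this article.
new_node_range() {
    first=$(( $1 + 1 ))
    last=$(( $1 + $2 ))
    printf 'node%02d-node%02d\n' "$first" "$last"
}

# Example: 10 existing nodes plus 8 new ones.
# new_node_range 10 8    # prints: node11-node18
```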

When the new nodes are turned on, they will network boot, install their OS, and reboot when completed. Once the new nodes are up and accessible, continue to the next steps.
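
One way to tell when the new nodes are reachable is to ping each of them and list the ones that do not answer. The sketch below is an assumption about how you might check, not an ACT tool: both function names are hypothetical, and it relies on the Linux ping options -c (count) and -W (timeout in seconds). A node that answers ping may still be installing, so treat an empty result as a first check only.

```shell
# node_is_up HOST
# Returns success if HOST answers a single ping within 2 seconds.
node_is_up() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# list_down_nodes HOST...
# Prints the name of every host that does not answer.
list_down_nodes() {
    for host in "$@"; do
        node_is_up "$host" || echo "$host"
    done
}

# Example: repeat until nothing is printed.
# list_down_nodes `act_nodenames -r node11-node18`
```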

Synchronize the clocks on the entire cluster:

$ act_exec -a 'service ntpd stop; ntpdate 1.centos.pool.ntp.org; hwclock --systohc ; service ntpd start'

The following commands use the information in act_util.conf to set the IPMI IP address and network settings on the new nodes. Replace node11-node18 with the names and range of your new nodes:

$ act_exec -r node11-node18 "service ipmi start"
$ act_ipmi_netcfg -r node11-node18
$ act_ipmi_netcfg -a --dump_dhcp > /etc/dhcpd.d/ipmi.conf
$ service dhcpd restart
$ act_exec -r node11-node18 "service ipmi stop"
$ act_ipmi_log -a setdate

If you are using SGE (Sun Grid Engine) for your job scheduler
To add the new compute nodes to the SGE queueing system, run the following commands and follow the directions with each step:

$ qconf -mhgrp @allhosts

-- add an entry for each new host that you are adding

$ qconf -ae <hostname>

-- add an exec host entry for each new host that you are adding
-- this opens a file editor
-- set 'hostname' to the new hostname
-- set 'complex_values' to 'slots=#', where # is the number of CPU cores in that system
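
If opening an editor for every host is impractical, qconf can also add an exec host from a file with the -Ae option. The sketch below generates such a file. The helper name make_exec_host_file is hypothetical, and the field list is an assumption based on the template that qconf -ae normally presents; compare it with the template on your own cluster before relying on it.

```shell
# make_exec_host_file HOST SLOTS
# Hypothetical helper: prints an exec host definition for HOST with
# complex_values set to slots=SLOTS and every other field left at NONE.
make_exec_host_file() {
    cat <<EOF
hostname              $1
load_scaling          NONE
complex_values        slots=$2
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
EOF
}

# Example (16 cores per node is an assumed value):
# for i in `act_nodenames -r node11-node18`; do
#     make_exec_host_file $i 16 > /tmp/$i.exec
#     qconf -Ae /tmp/$i.exec
# done
```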

$ for i in `act_nodenames -r node11-node18`; do qconf -ah $i; done

-- add an administrative host entry for each new host that you are adding

$ for i in `act_nodenames -r node11-node18`; do qconf -as $i; done

-- add a submit host entry for each new host that you are adding

Each host must also have a configuration file added for it. You can create a config file for each of the new nodes from one of the already configured nodes.

$ qconf -sconf <existing hostname> > <new hostname>

So for our example above we can do the following:

$ mkdir /tmp/sge; cd /tmp/sge
$ for i in `act_nodenames -r node11-node18`; do qconf -sconf node01 > $i; done
$ for i in `act_nodenames -r node11-node18`; do qconf -Aconf $i; done

(Note: this creates a file for each hostname in the current working directory, /tmp/sge.)

If you are using Torque for your job scheduler
To add the new compute nodes to the Torque scheduler, edit the nodes list:

$ vi /var/spool/torque/server_priv/nodes

-- add an entry line for each new compute node

Next, restart the pbs_server and pbs_sched services:

$ /etc/init.d/pbs_server restart
$ /etc/init.d/pbs_sched restart
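
Each entry in the Torque nodes file is a hostname, optionally followed by attributes such as np= (the number of processors on that node). As a rough sketch, the entries for a range of new nodes could be generated rather than typed. The helper name torque_node_lines is hypothetical, and it assumes two-digit nodeNN names and the same core count on every new node.

```shell
# torque_node_lines FIRST LAST NP
# Hypothetical helper: prints one nodes-file entry per node, assuming
# two-digit "nodeNN" names and NP cores on every node.
torque_node_lines() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf 'node%02d np=%d\n' "$i" "$3"
        i=$(( i + 1 ))
    done
}

# Example: append entries for node11-node18 with 16 cores each
# (the core count is an assumed value), then restart the services.
# torque_node_lines 11 18 16 >> /var/spool/torque/server_priv/nodes
```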

If you are using SLURM for your job scheduler
To add the new compute nodes to SLURM, run the following commands and follow the directions with each step:

For GPU nodes, create the file gres.conf in /act/slurm:

$ cd /act/slurm
$ vi gres.conf

Add a line for each type of GPU node:

NodeName=node[17-18] Name=gpu Type=kepler File=/dev/nvidia0
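
The example line above describes nodes with a single GPU. For nodes with several GPUs, gres.conf accepts a bracketed device range in File=. The following sketch prints a suitable line; the helper name gres_line is hypothetical, and it assumes the devices are numbered consecutively from /dev/nvidia0.

```shell
# gres_line NODES TYPE COUNT
# Hypothetical helper: prints a gres.conf line for NODES with COUNT GPUs of
# the given TYPE, assuming devices /dev/nvidia0 .. /dev/nvidia(COUNT-1).
gres_line() {
    if [ "$3" -eq 1 ]; then
        file=/dev/nvidia0
    else
        file="/dev/nvidia[0-$(( $3 - 1 ))]"
    fi
    printf 'NodeName=%s Name=gpu Type=%s File=%s\n' "$1" "$2" "$file"
}

# Example: two nodes with four GPUs each.
# gres_line 'node[17-18]' kepler 4
```

The GPU count here should agree with the count in the Gres= field of the matching NodeName line in slurm.conf.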

Then add the GPU nodes and all other new nodes to slurm.conf:

$ vi /act/slurm/slurm.conf

At the bottom, extend the existing NodeName= line to include the additional nodes, or add a new NodeName= line if the new nodes have a different configuration.

NodeName=node[01-16] CPUs=16 RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=node[17-18] CPUs=16 RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:kepler:1 State=UNKNOWN
PartitionName=batch Nodes=node[01-16] Default=YES MaxTime=30-0:00:00 State=UP QOS=batch DefMemPerCPU=8000
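
In the NodeName lines above, CPUs equals Sockets x CoresPerSocket x ThreadsPerCore (2 x 8 x 1 = 16). A minimal sketch that builds such a line and computes CPUs from the other values is shown below. The helper name slurm_node_line is hypothetical. To see the values that SLURM detects on a node, you can run slurmd -C on that node and compare.

```shell
# slurm_node_line NODES SOCKETS CORES_PER_SOCKET THREADS_PER_CORE REAL_MEMORY_MB
# Hypothetical helper: prints a slurm.conf NodeName line, deriving CPUs
# as SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE.
slurm_node_line() {
    cpus=$(( $2 * $3 * $4 ))
    printf 'NodeName=%s CPUs=%d RealMemory=%d Sockets=%d CoresPerSocket=%d ThreadsPerCore=%d State=UNKNOWN\n' \
        "$1" "$cpus" "$5" "$2" "$3" "$4"
}

# Example: reproduces the first NodeName line shown above.
# slurm_node_line 'node[01-16]' 2 8 1 128000
```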

Then from the head node, restart the services.

CentOS/EL6

$ service slurmdbd restart
$ chkconfig --add slurmdbd
$ service slurmctld restart
$ scontrol reconfigure

CentOS/EL7

$ systemctl restart slurmdbd
$ systemctl restart slurmctld
$ scontrol reconfigure

Enable and start the slurm daemon on the new compute nodes.

CentOS/EL6

$ act_exec -r node11-node18 service slurm start
$ act_exec -r node11-node18 chkconfig slurm on

CentOS/EL7

$ act_exec -r node11-node18 systemctl start slurmd.service
$ act_exec -r node11-node18 systemctl enable slurmd.service

 
