What do I need to do when replacing a motherboard?
After replacing a failed motherboard, you need to take a few steps so that the Linux network configuration keeps working without disruption. Here, we outline those steps for an Enterprise Linux system. Console access is required on the node receiving the replacement; the local steps can be taken as soon as the motherboard is replaced […]
Repairing a corrupted SGE database
Note: It is important to understand why sgemaster is failing to start. Before running these steps, the logs should show some indication of database corruption. These logs are located in /act/sge/default/spool/qmaster/messages. A typical corruption error message looks like this:
03/07/2015 17:34:07| main|head|E|couldn't open berkeley database "sge": (22) Invalid argument
03/07/2015 17:34:07| […]
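To confirm that the logs actually point to corruption before you repair anything, you can scan the qmaster messages file for error-level Berkeley DB entries. The following is a minimal sketch. It assumes the pipe-delimited layout shown in the example message above (timestamp|thread|host|level|message), and it defaults to the log path given above. The function name `find_bdb_errors` is hypothetical and is not part of SGE or the ACT utilities.

```python
import sys
from pathlib import Path

# Log path from the article; pass a different path as the first argument if needed.
MESSAGES = Path("/act/sge/default/spool/qmaster/messages")


def find_bdb_errors(lines):
    """Return (timestamp, message) pairs for error-level Berkeley DB log entries.

    Assumes the layout seen in the example: timestamp|thread|host|level|message.
    """
    hits = []
    for line in lines:
        # Split into at most 5 fields so a '|' inside the message is preserved.
        fields = [f.strip() for f in line.rstrip("\n").split("|", 4)]
        if len(fields) < 5:
            continue  # not a qmaster log entry
        timestamp, _thread, _host, level, message = fields
        if level == "E" and "berkeley database" in message.lower():
            hits.append((timestamp, message))
    return hits


if __name__ == "__main__":
    path = Path(sys.argv[1]) if len(sys.argv) > 1 else MESSAGES
    if path.exists():
        with path.open(errors="replace") as fh:
            for ts, msg in find_bdb_errors(fh):
                print(f"{ts}: {msg}")
    else:
        print(f"{path} not found", file=sys.stderr)
```

If the script prints nothing, the failure is probably not a database problem, and you should look elsewhere in the log before attempting a repair.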
Using the ACT Yum Repo
Advanced Clustering Technologies maintains a software repository called actrepo for our ACT Utilities and other commonly used cluster software. To access the ACT yum repo, install the actrepo RPM with the command for your release:
CentOS 5: $ rpm -Uvh http://lab.advancedclustering.com/yum/centos5/actrepo-1.0-centos5.noarch.rpm
CentOS 6: $ rpm -Uvh http://lab.advancedclustering.com/yum/centos6/actrepo-1.0-centos6.noarch.rpm
CentOS 7: $ yum -y install http://lab.advancedclustering.com/yum/actel7/actrepo-7.0-el7.noarch.rpm
An Easier Way to Back Up Your HPC Cluster
Last month we reviewed the importance of making backups. Perhaps the simplest form of backup is an image of the head node. Today, Advanced Clustering Technologies releases an update to the Cloner utility that makes this much easier. The new cloner_usb command creates a bootable USB key which can restore […]
Taking Compute Nodes Down for Maintenance
When taking your compute nodes down for any reason, it's good practice to remove each node from any job queues it belongs to first. Otherwise, a node that comes up temporarily may start new jobs, only to be shut down again, killing the user's job. Here's how to safely pull a node out of service […]
Keep an Eye on Your RAID Status
Our customers frequently order systems with two hard drives to hold a RAID 1 volume mirroring the OS filesystems. This is done with Linux software RAID, and it’s important to periodically check the health of the drives. To do this, run cat /proc/mdstat. If all volume members are working properly, you should see [UU]. For […]
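If you want to check more than one array at once, or run the check from a script, you can parse /proc/mdstat and flag any array whose status string contains an underscore (for example [U_]), which marks a missing or failed member. The following is a minimal sketch. It assumes the usual mdstat layout, where an array line such as "md0 : active raid1 ..." is followed by a line ending in a status like [UU]. The function names are hypothetical.

```python
import os
import re

MDSTAT = "/proc/mdstat"
ARRAY_RE = re.compile(r"^(md\S+)\s*:")   # e.g. "md0 : active raid1 sdb1[1] sda1[0]"
STATUS_RE = re.compile(r"\[([U_]+)\]")   # e.g. "[UU]" or "[U_]"


def parse_mdstat(text):
    """Map each md array name to its member-status string, e.g. {'md0': 'UU'}."""
    status = {}
    current = None
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)
            continue
        # Record only the first status string seen after each array line.
        if current is not None and current not in status:
            s = STATUS_RE.search(line)
            if s:
                status[current] = s.group(1)
    return status


def degraded(status):
    """Return the sorted names of arrays with at least one missing member ('_')."""
    return sorted(name for name, s in status.items() if "_" in s)


if __name__ == "__main__":
    if os.path.exists(MDSTAT):
        with open(MDSTAT) as fh:
            arrays = parse_mdstat(fh.read())
        for name, s in sorted(arrays.items()):
            flag = "  <-- DEGRADED" if "_" in s else ""
            print(f"{name}: [{s}]{flag}")
    else:
        print(f"{MDSTAT} not found (no software RAID on this host?)")
```

Running this from cron and mailing any output that contains "DEGRADED" is one simple way to avoid relying on someone remembering to check.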
Adding new nodes to an existing cluster
The following steps apply if you are adding new nodes to your cluster that will be cloned from your existing node image. First, edit /act/etc/act_nodes.conf and add your new node definitions below the existing ones. If you do not already have these definitions, ACT support can provide them. Next, edit […]
Use the command line to easily find hard drive manufacturer information
If you ever need to get your hard drive’s model and serial number without physically looking at it, you can do so with the hdparm command line utility. This is especially useful if a manufacturer requires the serial number for an RMA or any other servicing needs. In this example, we are retrieving the model […]
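If you need the model and serial number for several drives, such as when filling out an RMA, you can wrap hdparm and pull out just those fields. The following is a minimal sketch. It assumes `hdparm -I <device>` is used, that it runs as root, and that its output contains "Model Number:" and "Serial Number:" lines, as is typical for ATA drives. The helper names are hypothetical.

```python
import subprocess
import sys

# hdparm -I field labels we care about, mapped to short keys.
FIELDS = {"Model Number": "model", "Serial Number": "serial"}


def parse_hdparm_identity(text):
    """Extract model and serial number from `hdparm -I` output text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in FIELDS:
            info[FIELDS[key]] = value.strip()
    return info


def drive_identity(device):
    """Run `hdparm -I` on a device (requires root) and parse the result."""
    out = subprocess.run(
        ["hdparm", "-I", device], capture_output=True, text=True, check=True
    ).stdout
    return parse_hdparm_identity(out)


if __name__ == "__main__":
    # Usage: python drive_id.py /dev/sda /dev/sdb
    for dev in sys.argv[1:]:
        info = drive_identity(dev)
        print(f"{dev}: model={info.get('model')} serial={info.get('serial')}")
```

The parsing is kept separate from the subprocess call, so you can also feed it output that was saved earlier from a node you cannot reach right now.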
How to enable IPMI SOL for ASUS machines running CentOS 6.x
A serial console lets you send all text-based output to one of the on-board serial ports. While this can be done with a physical serial port and an external terminal server, it's more common to use Serial over LAN (SOL). SOL is provided by the IPMI (Intelligent Platform Management Interface) device […]
Building WRF on ACT systems
WRF has many options that may be unique to a particular installation. This article aims to help you get up and running with WRF as quickly as possible, without having to rediscover the right settings. Below are the steps to build all of the dependencies for WRF 3.6, as of August 2014.
Background
Systems installed by ACT […]
Creating Groups of Nodes in TORQUE
Although pbs_sched is a simple first-in, first-out (FIFO) scheduler, it can use node properties to emulate host groups. This is useful if you have different types of nodes that provide different resources. The nodes available in TORQUE are defined in the file /var/spool/torque/server_priv/nodes. The most basic configuration simply lists the nodes and […]
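To see which host groups your node properties actually define, you can list the hosts under each property in the nodes file. The following is a minimal sketch. It assumes the common TORQUE layout of one node per line: a hostname, optional key=value attributes such as np=16, and then bare property names. The property names in the example (bigmem, gpu) and the function name are hypothetical.

```python
import os
from collections import defaultdict

NODES_FILE = "/var/spool/torque/server_priv/nodes"


def node_groups(nodes_text):
    """Map each node property to the list of hosts that carry it."""
    groups = defaultdict(list)
    for raw in nodes_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        host, *rest = line.split()
        for token in rest:
            # Tokens like np=16 are attributes; bare words are properties.
            if "=" not in token:
                groups[token].append(host)
    return dict(groups)


if __name__ == "__main__":
    if os.path.exists(NODES_FILE):
        with open(NODES_FILE) as fh:
            for prop, hosts in sorted(node_groups(fh.read()).items()):
                print(f"{prop}: {' '.join(hosts)}")
    else:
        print(f"{NODES_FILE} not found")
```

With a nodes file like "node01 np=16 bigmem", a job can then ask for that group with a request such as `qsub -l nodes=1:bigmem`, and pbs_sched will only place it on hosts that carry the property.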