Tech Support Advisory: Yum updates fail from slurm package conflicts
When performing a yum update or dnf update on your system, the update may fail with messages about conflicts between Slurm packages. This is caused by the addition of new Slurm packages in upstream repos that collide with custom packages installed by ACT. The errors may look like some of the following: Transaction check error: […]
Fixing Firewall Zones In CentOS 7.5
As of CentOS 7.5, the use of ZONE=<zone> no longer works in /etc/sysconfig/network-scripts/ifcfg-* files. The most notable side-effect of this is that all nodes that accessed the Internet through the head node will no longer be able to do so until this is remedied. The new way of setting up zones in the firewall is […]
Upgrading Firmware when Adding InfiniBand to an Existing Fabric
A customer recently asked, “When adding a new InfiniBand switch to an existing fabric, should the firmware on the existing switches be upgraded to the version of the firmware on the new switch before connecting the new switch?” It is not required for all switches in an InfiniBand network to have matching firmware. Since adding […]
Updating firmware on your ACT Intel system
ACT’s servers based on Intel chassis can now be updated more easily than before. We provide a package in our YUM repository that includes firmware updates and scripts to apply the updates. Here is how to do it. Make sure you have the ACT repo enabled. Run yum repolist and look for a repo named “ACT […]
Finding the serial number of your ACT system
When contacting support, it’s best to find the serial number of your ACT system and have it handy when you open a ticket via the website, via email, or when calling in. Providing the serial number allows us to quickly look up your system, and see the configuration of your system which is often relevant […]
Check the status of an LSI raid card battery backup unit
Checking the status of your RAID card’s battery backup unit (BBU) is a simple process using the following MegaCli command: $ MegaCli64 -AdpBbuCmd -a<adapter#/ALL> In the following example we have a single controller present and will pass the -a0 argument to select the controller. [root@localhost ~]# MegaCli64 -AdpBbuCmd -a0 BBU status for Adapter: […]
Setup ACT Breakin hardware diagnostics tool as a grub boot option
Breakin is Advanced Clustering Technologies’ stress-test and hardware diagnostics tool. It is extremely useful for detecting errors on your system while stress testing the hardware at the same time in order to create a more realistic test environment. This guide is best used for head nodes and workstations that do not have a built-in […]
Server doesn’t POST – Determining if a DIMM, CPU, or Motherboard is faulty
In this example we will troubleshoot a server that fully powers on but does not POST. The three most common reasons why a server will not POST are a bad DIMM, a bad CPU, or a bad motherboard. The main objective is to start with a minimum number of components in the server, […]
What is a kernel panic?
A message displayed by the Linux kernel upon detecting an internal system error from which it cannot recover. Kernel panics are often software errors, but they can also be an indicator of hardware issues. Common types of kernel panics The two most common types of kernel panics are: Kernel panic: VFS: Unable to mount root fs […]
What do I need to do when replacing a motherboard?
After replacing a failed motherboard, steps need to be taken to allow the network configuration in Linux to work without disruption. Here, we outline the steps to take on an Enterprise Linux system. Console access is required for the node getting the replacement; the local steps can be taken as soon as the motherboard is replaced […]
Sync users across nodes
Any time you add a new user on your cluster’s head node or make changes to an existing user, you will need to synchronize those changes across the entire cluster. Advanced Clustering makes this a simple task by using our act_authsync utility. This utility takes all system user configuration files and pushes them out to […]
Replacing an LSI raid disk with MegaCli
If you have identified a failed or failing disk, it is possible to replace it using the MegaCli utility. In the example below we will cover replacing a failed disk in a RAID 5 array that has three disks total. The first thing we want to check is the status of our RAID 5. [root@raid log]# MegaCli64 […]
Test a compute node’s hardware with Breakin
Clusters built by Advanced Clustering Technologies make it easy to set compute nodes to boot into our Breakin utility to stress test the machine. This is an easy way to test the node for hardware errors. To set a compute node to boot to Breakin from the head node: $ […]
How to locate a physical disk in an LSI raid array
The MegaCli command line utility can be used to locate a physical disk in an LSI RAID array by blinking the disk’s activity LED. The blinking will continue until directed to stop. Syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example we will locate disk 0 on adapter 0: [root@localhost MegaCli]# ./MegaCli64 -PdLocate -start -physdrv[252:0] […]
RAM – Checking for errors
Run BreakIn It can be difficult to tell if a memory error is related to hardware or software. To help determine this we suggest running the ACT Breakin utility to remove any possibility of software-related errors. Breakin for compute nodes Breakin for head nodes and CentOS workstations Run memtest86+ memtest86+ is a free utility […]
Repairing a corrupted SGE database
Note: Understanding the cause of sgemaster failing to start is important. Before running these steps, there should be some indication of a database corruption issue in the logs. These logs are located in /act/sge/default/spool/qmaster/messages. A typical corruption error message may look like this: 03/07/2015 17:34:07| main|head|E|couldn’t open berkeley database “sge”: (22) Invalid argument 03/07/2015 17:34:07| […]
Using the ACT Yum Repo
Advanced Clustering Technologies maintains a software repository called actrepo for our ACT Utilities and other commonly used cluster software. To access the ACT yum repo, install the actrepo RPM with these commands: CentOS 5 $ rpm -Uvh http://lab.advancedclustering.com/yum/centos5/actrepo-1.0-centos5.noarch.rpm CentOS 6 $ rpm -Uvh http://lab.advancedclustering.com/yum/centos6/actrepo-1.0-centos6.noarch.rpm CentOS 7 $ yum -y install http://lab.advancedclustering.com/yum/actel7/actrepo-7.0-el7.noarch.rpm
An Easier Way to Back Up Your HPC Cluster
Last month we reviewed the importance of making backups. Perhaps the simplest form of backup can occur by taking an image of the head node. Today, Advanced Clustering Technologies releases an update to the Cloner utility that makes this a whole lot easier. The new cloner_usb command will create a bootable USB key which can restore […]
Installing Libraries for Python Outside of System Directories
Python is being used more frequently in HPC applications. Whether a job is being run by the scheduler or pre/post-processing on login nodes, there’s a chance you may run into it. With Python comes the need for libraries. Installing the libraries in system directories normally isn’t possible, but there is a good solution for that. […]
Taking Compute Nodes Down for Maintenance
When taking your compute nodes down for any reason, it’s good to take that node out of any job queues in which it may be a member. Nodes coming up temporarily may start new jobs, only to be shut down again, killing the user’s job. Here’s how to safely pull a node out of service […]
Pinpoint a failed drive in your array
If you see that your LSI RAID array has a failed disk, but you’re not sure which physical disk in the machine it is, use the MegaCli command line utility to flash the drive’s LEDs: Command syntax: MegaCli64 -PdLocate <-start|-stop> -physdrv[<enclosure#>:<disk#>] -a<adapter#> In this example, we will locate disk 0 on adapter 0 (the first […]
Getting package information
By using the ‘rpm’ command (RPM Package Manager) it is possible to get a lot of information about installed packages on your system. To start, say we want to see if we have a specific package name installed on our system. We can search all the currently installed packages for a package named ‘actutil’ by: […]
Viewing your system’s event log through IPMI
If your system has IPMI (Intelligent Platform Management Interface), it can be useful to pull its system event log when encountering odd behavior. If you have a cluster installed with our act_utils software tools, you can use the act_ipmi_log command (replace “node01” with the hostname of the machine you wish to query): $ act_ipmi_log -n […]
What type of power receptacles do I have?
NEMA Receptacle Types There are many different types of NEMA power receptacles and plugs. If you already have power receptacles installed at your site and you want to determine what type of NEMA plug you have, included below you will find the most common types of NEMA receptacles for the PDUs we sell at […]
Using VNC to Speed Up Slow X-forwarded Sessions
Most of you know that you can use X-forwarding built into SSH to run a graphical application on a remote host: laptop$ ssh -X head.mycluster head$ firefox & (Firefox session displays on your laptop, running on the remote host) But sometimes these programs run very slowly over the network. Firefox can be slow to render, […]
Use Screen to Run Long Processes
Tech Tip: Screen is a Linux utility that allows you to run multiple terminals all within a single terminal window manager. It can be used for many things and can greatly improve your workflow. Screen enables you to run your long scripts/processes within a screen session. If you want to execute a script that generally takes a very […]
Keep an Eye on Your RAID Status
Our customers frequently order systems with two hard drives to hold a RAID 1 volume mirroring the OS filesystems. This is done with Linux software RAID, and it’s important to periodically check the health of the drives. To do this, run cat /proc/mdstat. If all volume members are working properly, you should see [UU]. For […]
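The check described above can also be scripted. What follows is a minimal Python sketch, not an ACT utility. It assumes the usual /proc/mdstat layout, in which each array's status line carries a bracketed member map such as [UU] and an underscore marks a missing or failed member. The function name degraded_arrays and the sample text are illustrative, not taken from a real system.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map contains '_' (a missing member)."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member map, e.g. [UU] or [U_], appears on the status line that follows
        member_map = re.search(r"\[([U_]+)\]", line)
        if current and member_map and "_" in member_map.group(1):
            degraded.append(current)
    return degraded

# Illustrative sample only (assumed layout, not captured output)
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      976630336 blocks [2/1] [U_]
unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat; an empty list would mean every array reported all members up.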
How to identify and prevent overheating
Symptoms of Overheating: turning off on its own, freezing, frequent memory errors. Most commonly a computer that is overheating will turn off unexpectedly, and repeat the behavior shortly after being turned back on. This happens because the CPU temperatures are always monitored and the system will […]
Adding new nodes to an existing cluster
The following steps apply if you are adding new nodes to your cluster and these nodes will be cloned from your existing node image. First, edit /act/etc/act_nodes.conf and add your new node definitions below the existing node definitions. If you do not have these already they can be provided by ACT support. Next edit […]
Update Initrd
Have you blacklisted a kernel module, but it’s still showing up at boot? You probably need to update your initrd, a compressed filesystem used to bootstrap the OS. Simply run “dracut --force”, and the initrd will be recreated, taking into account any configuration changes made in your /etc filesystem. Then reboot. Your changes are now […]
Identifying Issues with Network Connectivity
Network connectivity can cover many different areas, and diagnosing which area your problem lies in is the first step to fixing the problem. Below we will cover multiple steps for identifying a problem. Verify connections and LEDs Verify that the network cable is properly connected to the back of the computer and at the switch. […]
How do I tune my HPL.dat file?
Tuning HPL can be a long and difficult process. Once you’ve found the perfect BLAS library for your architecture, now you need to create a perfect HPL.dat file. Use the form below and generate an output file as a starting point on getting the best GFLOP number you can out of your cluster. Input Nodes: […]
Troubleshooting OpenMPI Invocation Problems
OpenMPI works with a large number of transport mechanisms, from shared memory on the local machine, to IP over Ethernet or even RDMA over InfiniBand. With default settings, when you start your program using mpirun, OpenMPI will choose the best interface available. Unfortunately, the logic isn’t foolproof, and sometimes you will hit snags and your […]
Standard Cluster – InfiniBand Fabric
This is the InfiniBand configuration for most of the HPC clusters we build.
How do I rack a .5U blade or a 2U Flex Chassis?
1U blade or 2U Flex Chassis installation & removal PLEASE NOTE: The pictorial illustrations in this FAQ show a 2U Flex chassis, however the same procedures are applicable to the 1U blade except for the fact that the 1U chassis is 1U shorter in height, uses a different size rear mounting bracket, and has fewer […]
Re-imaging a compute node back to a working state
If you accidentally misconfigure software on a cluster compute node you can always revert it back to a working image. In order to prepare a node for imaging you first set it to boot into the cloner3 image the next time it powers on: $ act_netboot -n <node name> -set=cloner3 Next you simply reboot the machine […]
Use act_locate to identify a node
Most Advanced Clustering chassis are equipped with a large locator LED on the front that can be used to easily identify a node when it’s turned on. If you’re remotely attempting to notify a technician as to which compute node needs work, you can simply run the following command from your head node: $ act_locate […]
Diagnose hardware issues with Advanced Clustering’s Breakin
If you suspect hardware problems, our clusters come with a testing facility that can test one or more nodes. Using Advanced Clustering’s Breakin software can help you look for and diagnose potential hardware issues. This software is a stress-test suite developed in-house since there were no other tools available that provided this level of rigorous […]
Checking InfiniBand
If one of your machines has an InfiniBand device installed and you want to know what state the device is in, you can use the “ibstat” command. The output of “ibstat” shows a lot of information, but the two main lines you should look at are: State: Active Physical state: LinkUp The “State” line can […]
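The two lines called out above are easy to check programmatically. The following is a small Python sketch, assuming ibstat prints one "Port N:" heading per port followed by "State:" and "Physical state:" lines. The helper names port_states and is_healthy are hypothetical, and the sample text is abbreviated rather than captured from a live system.

```python
import re

def port_states(ibstat_text):
    """Map each port number to its State and Physical state as reported by ibstat."""
    ports = {}
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        heading = re.match(r"Port (\d+):$", line)
        if heading:
            port = int(heading.group(1))
            ports[port] = {}
        elif port is not None and line.startswith("State:"):
            ports[port]["state"] = line.split(":", 1)[1].strip()
        elif port is not None and line.startswith("Physical state:"):
            ports[port]["physical"] = line.split(":", 1)[1].strip()
    return ports

def is_healthy(info):
    """Treat a port as healthy only when it is Active and its link is up."""
    return info.get("state") == "Active" and info.get("physical") == "LinkUp"

# Abbreviated, illustrative sample in the assumed layout
SAMPLE = """CA 'mlx4_0'
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
"""

for number, info in port_states(SAMPLE).items():
    print(number, info, "OK" if is_healthy(info) else "CHECK")
```

In practice the text would come from running ibstat on the node in question; any port not reported as Active and LinkUp is worth a closer look.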
Using grep to filter results
The command line utility “grep” is one of the most powerful and useful tools in Linux. Its most common use is to filter results from everyday commands. For instance, if you want to see all the hostnames your system has mapped out in /etc/hosts you can simply run: $ cat /etc/hosts But if you know […]
Use the command line to easily find hard drive manufacturer information
If you ever need to get your hard drive’s model and serial number without physically looking at it, you can do so with the hdparm command line utility. This is especially useful if a manufacturer requires the serial number for an RMA or any other servicing needs. In this example, we are retrieving the model […]
Changing Contents in a File in Every Node
Occasionally you may want to change a single string inside of a file that is on every compute node. If the file were the same on every node you could change it in one place and then copy it out like so: $ act_cp -g nodes /path/to/file Some config files are unique to each […]
Installing NVIDIA Drivers on RHEL or CentOS 7
Most users of NVIDIA graphics cards prefer to use the drivers provided by NVIDIA. These more fully support the capabilities of the card when compared to the nouveau driver that is included with the distribution. These are the steps to install the NVIDIA driver and disable the nouveau driver. Prepare your machine yum -y update yum […]
Checking and Clearing InfiniBand Errors
An easy way to check for errors on your entire cluster’s IB network is to run the command ‘ibcheckerrors’. This will print any errors, which can range from a port being down (even just unplugged temporarily) to transmission errors. After troubleshooting any errors you find, you can clear out the error counters with the command […]
How to enable IPMI SOL for ASUS machines running CentOS 6.x
A serial console will allow you to send all text-based output to one of the on-board serial ports. While this can be done using a physical serial port and an external terminal server, it’s more likely that Serial over LAN (SOL) is used. SOL is provided by the IPMI (Intelligent Platform Management Interface) device […]
Replacing an LSI raid card with a pre-configured raid array
Newer LSI RAID cards (apparently depending on their current firmware version) will auto-import RAID configurations from previous RAID cards. However, on older cards you have to import the disks’ ‘foreign’ configuration. In order to check if your RAID array was automatically imported by your new RAID card you can run the following command: $ MegaCli64 […]
Create a raid array with MegaCli64
Note: The following assumes that you have attached new drives to a newly installed LSI RAID controller. The first thing to do is to get a list of all the drives attached to the RAID controller. LSI RAID controllers identify and label their attached disks by an ‘Enclosure ID’ and the drive […]
How to expand an existing LSI raid array using MegaCli
Warning: You should ALWAYS make a backup of all of your information on the RAID array before performing any of these steps. The exact commands to do this vary depending on your current configuration and the number of disks in the array. Before adding the disks you need to get a feel for your current setup by […]
How to update the date/time on LSI Raid cards using MegaCli
Setting the date/time on your controller is advised to keep system logs in sync. Although this is normally done by the drivers after bootup, we can do this manually with the MegaCli tool using the following syntax: MegaCli64 -AdpSetTime yyyymmdd hh:mm:ss -a<adapter#> yyyy is year in 4 digit format: 2013 mm is month in 2 […]
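The date and time arguments are easy to get wrong by hand, so they can be generated instead. Below is a minimal Python sketch that formats a timestamp into the yyyymmdd hh:mm:ss form used by the syntax above and builds the argument list. The function name adp_set_time_args is hypothetical, and the sketch only constructs the command; it does not run it.

```python
from datetime import datetime

def adp_set_time_args(when, adapter=0):
    """Build the MegaCli64 -AdpSetTime argument list for a given datetime and adapter."""
    return [
        "MegaCli64",
        "-AdpSetTime",
        when.strftime("%Y%m%d"),   # yyyymmdd
        when.strftime("%H:%M:%S"), # hh:mm:ss
        f"-a{adapter}",            # adapter number, e.g. -a0
    ]

# Example with a fixed timestamp so the result is reproducible
print(" ".join(adp_set_time_args(datetime(2013, 1, 2, 3, 4, 5))))
```

On a machine with MegaCli64 installed, the returned list could be passed to subprocess.run as root, with datetime.now() supplying the current time.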
Building WRF on ACT systems
WRF has many options that may be unique to any particular installation. This article is to help you get up and running with WRF as quickly as possible without having to rediscover the right settings. Below are the steps to build all dependencies for WRF 3.6 as of August 2014. Background Systems installed by ACT […]
What is Cli64?
Cli64 is a (poorly named) proprietary tool developed by Areca that provides reporting AND management functions from userspace. If installed from the ACT repo the binary is located at /usr/local/bin/cli64. The default password for the controller is 0000. [root@localhost ~]# cli64 ? Copyright (c) 2004-2011 Areca, Inc. All Rights Reserved. Areca CLI, Version: 1.86, Arclib: […]
My Areca raid controller is beeping; how do I make it STOP?
WARNING: Only continue with this operation after the cause of the alarm has been identified. First we must authenticate to the controller by passing a password; the default is 0000. [root@localhost ~]# cli64 set password=0000 GuiErrMsg: Success. [root@localhost ~]# Now we can mute the beeping! [root@localhost ~]# cli64 sys beeper p=0 GuiErrMsg: Success. [root@localhost […]
Common questions when using Areca Cli64
What is the status of my RAID array? You can obtain an overview of the array’s status using the cli64 rsf info command. [root@localhost ~]# cli64 rsf info # Name Disks TotalCap FreeCap DiskChannels State =============================================================================== 1 Raid Set # 00 2 240.0GB 0.0GB 12 Normal =============================================================================== GuiErrMsg: Success. [root@localhost ~]# You can also obtain an overview […]
Creating Groups of Nodes in TORQUE
Despite being a simple first in/first out (FIFO) scheduler, pbs_sched can use node properties to emulate host groups. This can be useful if you have different types of nodes that provide different types of resources. The nodes available in TORQUE are controlled by the file /var/spool/torque/server_priv/nodes. The most basic configuration simply lists the nodes and […]
MPI Over InfiniBand
To take full advantage of InfiniBand, an MPI implementation with native InfiniBand support should be used. Supported MPI Types MVAPICH2, MVAPICH, and Open MPI support InfiniBand directly. Intel MPI supports InfiniBand through an abstraction layer called DAPL. Take note that DAPL adds an extra step in the communication process and therefore has increased latency and […]
InfiniBand Port States
The status for your InfiniBand Host Channel Adapter (HCA) can be found using the ‘ibstat’ command. # ibstat CA 'mlx4_0' CA type: MT4099 Number of ports: 1 Firmware version: 2.10.0 Hardware version: 0 Node GUID: 0x0002c9030031fdc0 System image GUID: 0x0002c9030031fdc3 Port 1: State: Active Physical state: LinkUp Rate: 40 Base lid: 1 LMC: 0 SM […]
Unpacking Your Cluster Hardware
Watch our instructional video on unpacking your HPC cluster from Advanced Clustering: WARNING: The components of an HPC Cluster can be very heavy. Some pre-assembled clusters are more than 2,000 lbs. Great care must be used when unpacking and/or moving a cluster or any of its components. Check For Shipping Damage Despite all attempts at safe and […]
RMAs: How to package compute blades
Half U boxes sent for compute node repair will have the following items inside them: two white bases, an anti-static bag, and a grey foam slab. Place the two white bases in the bottom of the box. Cover the compute node with the anti-static bag and place the compute node, top up, on the white bases. Place […]
RMA process – beginning to end
Once you have contacted ACT support concerning a failing device and an RMA has been created, you will: Receive email notification that the RMA was created. Receive email notification once the replacement part has shipped. Receive the replacement part; a return shipping label will be included in the packaging (or sent to you via email) […]
Example RMA Request
If you believe you have a faulty device and your system is still under warranty, this guide will cover submitting an RMA request. To start you can submit a support ticket at: http://advancedclustering.com/support/ticket.html Or you can email ACT support directly: [email protected] Below is an example email containing all the important information needed to get an […]
IPoIB – Using TCP/IP on an InfiniBand Network
Existing applications can take advantage of the higher bandwidth and lower latency of InfiniBand by use of IPoIB, Internet Protocol over InfiniBand. When the driver for IPoIB is loaded, virtual network interfaces are made visible to the operating system. These devices appear as if they were Ethernet devices and can be manipulated in the same […]
InfiniBand Cable and Port Types
QSFP QSFP cables and ports are used for DDR (20 Gbps), QDR (40 Gbps), and FDR (56 Gbps) InfiniBand links. QSFP Cable The connector on a QSFP cable is long and narrow. The connector slides into the port. QSFP Port QSFP ports are recessed openings. The QSFP cable slides into the port. CX4 CX4 cables […]
What are Machine Check Exceptions (or MCE)?
A machine check exception is an error detected by your system’s processor. There are two major types of MCE errors: a notice or warning error, and a fatal exception. The warning will be logged by a “Machine Check Event logged” notice in your system logs, and can be later viewed via some Linux utilities. A […]
Drivers: Distro vs OFED
Like all computer hardware, InfiniBand adapters need drivers in order to be used by the operating system. Most modern Linux distributions provide the kernel drivers, libraries, and support programs needed to have a functioning InfiniBand adapter. While functional, these may not be the best choice in all cases. When a new InfiniBand card, or firmware […]
Disk marked as Foreign or Bad?
Sometimes when you replace a disk you may find the new disk marked as “Foreign”. The “Foreign” state means the controller has detected a RAID signature on this disk from a previous configuration. This will prevent you from using the disk until the foreign state is cleared. In order to use the disk we […]
InfiniBand Types and Speeds
Since its release, InfiniBand has been made in 5 speeds and has used two types of connectors. FDR FDR InfiniBand provides a 56 Gbps link. The data encoding for FDR is different from the other InfiniBand speeds: for every 66 bits transmitted, 64 bits are data. This is called 64b/66b encoding. This provides actual […]
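The effect of the encoding overhead can be worked out with simple arithmetic. The sketch below is illustrative only: it takes the nominal 56 Gbps figure from the text at face value as the raw link rate and scales it by the 64/66 data fraction, which gives roughly 54.3 Gbps. The function name effective_rate_gbps is hypothetical, and figures published elsewhere may differ slightly because they start from a different raw signaling rate.

```python
def effective_rate_gbps(link_rate_gbps, data_bits=64, total_bits=66):
    """Scale a raw link rate by the fraction of transmitted bits that carry data."""
    return link_rate_gbps * data_bits / total_bits

# Nominal FDR link rate from the text, with 64b/66b encoding
print(round(effective_rate_gbps(56), 2))
```

The same function can be reused for any other encoding scheme by passing the matching data and total bit counts.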
Categories
- Getting Support (5)
- Hardware (35)
- Areca Raid Arrays (3)
- InfiniBand (10)
- LSI Raid Arrays (9)
- NVIDIA Graphics Cards (1)
- Racks (1)
- Troubleshooting (8)
- Software (11)
- ACT Utilities (5)
- HPC apps & benchmarks (1)
- Linux (3)
- Schedulers (3)
- SGE / Grid Engine (1)
- TORQUE (1)
- Tech Tips (17)