Set up the ACT Breakin hardware diagnostics tool as a GRUB boot option
Breakin is Advanced Clustering Technologies' stress-test and hardware diagnostics tool. It is extremely useful for detecting errors on your system while stress testing the hardware at the same time, which creates a more realistic test environment. This guide is best used for head nodes and workstations that do not have a built-in […]
Server doesn’t POST – Determining if a DIMM, CPU, or motherboard is faulty
In this example we will troubleshoot a server that fully powers on but does not POST. The three most common reasons a server will not POST are a bad DIMM, a bad CPU, or a bad motherboard. The main objective is to start with a minimum number of components in the server, […]
What is a kernel panic?
A message displayed by the Linux kernel upon detecting an internal system error from which it cannot recover. Kernel panics are often caused by software errors, but they can also be an indicator of hardware issues. Common types of kernel panics The two most common types of kernel panics are: Kernel panic: VFS: Unable to mount root fs […]
Test a compute node’s hardware with Breakin
Clusters built by Advanced Clustering Technologies make it easy to set compute nodes to boot into our Breakin utility to stress test the machine. This is an easy way to test the node for hardware errors. To set a compute node to boot to Breakin from the head node: $ […]
RAM – Checking for errors
Run Breakin It can be difficult to tell if a memory error is related to hardware or software. To help determine this, we suggest running the ACT Breakin utility to remove any possibility of software-related errors. Breakin for compute nodes Breakin for head nodes and CentOS workstations Run memtest86+ memtest86+ is a free utility […]
How to identify and prevent overheating
Symptoms of Overheating Turning off on its own Freezing Frequent memory errors Most commonly, a computer that is overheating will turn off unexpectedly and repeat the behavior shortly after being turned back on. This happens because the CPU temperatures are always monitored and the system will […]
Identifying Issues with Network Connectivity
Network connectivity can cover many different areas, and diagnosing which area your problem lies in is the first step to fixing it. Below we will cover multiple steps for identifying a problem. Verify connections and LEDs Verify that the network cable is properly connected to the back of the computer and at the switch. […]
What are Machine Check Exceptions (or MCE)?
A machine check exception is an error detected by your system’s processor. There are two major types of MCE: a notice or warning, and a fatal exception. The warning is logged as a “Machine Check Event logged” notice in your system logs and can later be viewed with some Linux utilities. A […]