Checking and Clearing InfiniBand Errors
An easy way to check for errors across your entire cluster IB network is to run the command ‘ibcheckerrors’.
It prints any errors it finds, which can range from a port being down (even one that was just unplugged temporarily) to transmission errors. After troubleshooting the errors you find, you can clear the error counters with the command ‘ibclearerrors’.
(Note: most IB errors can be resolved by reseating the cables on both ends for any ports that showed errors.)
The output of ‘ibcheckerrors’ can be confusing when you’re trying to determine which physical ports the errors are happening on, because it identifies ports by ‘lid’ (local identifier). A good way to see which device a given ‘lid’ belongs to is to use the tool ‘ibnetdiscover’, which prints the layout of your entire active IB network.
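As a rough illustration of matching the two outputs, the Python sketch below pulls the lid and port out of ‘ibcheckerrors’ warning lines and looks up each lid's node description in ‘ibnetdiscover’ output. The sample text, the function names, and the regular expressions are all assumptions made for this example: they are modeled on commonly seen output, and the exact format can differ between versions of the InfiniBand diagnostic tools, so check them against what your own cluster prints before relying on them.

```python
import re

# Hypothetical sample text modeled on typical tool output.
# Real output differs between versions; adjust the patterns to match yours.
SAMPLE_ERRORS = """\
#warn: counter SymbolErrors = 1200 (threshold 10) lid 4 port 1
Error check on lid 4 (node01 HCA-1) port 1:  FAILED
#warn: counter LinkDowned = 15 (threshold 10) lid 1 port 7
Error check on lid 1 (ib-switch-1) port 7:  FAILED
"""

SAMPLE_TOPOLOGY = """\
Switch\t24 "S-0008f10400411a08"\t\t# "ib-switch-1" enhanced port 0 lid 1 lmc 0
[7]\t"H-0008f10403961354"[1](8f10403961355) \t\t# "node01 HCA-1" lid 4 4xSDR

Ca\t2 "H-0008f10403961354"\t\t# "node01 HCA-1"
[1](8f10403961355) \t"S-0008f10400411a08"[7]\t\t# lid 4 lmc 0 "ib-switch-1" lid 1 4xSDR
"""

# Assumed shape of a warning line: "#warn: counter NAME = VALUE ... lid L port P"
WARN_RE = re.compile(
    r"^#warn: counter (\w+) = (\d+).*?\blid (\d+) port (\d+)", re.MULTILINE
)

# Assumed shape of a topology comment: a quoted description followed by "lid N"
NAME_LID_RE = re.compile(r'"([^"]+)"[^"\n]*?\blid (\d+)')


def parse_error_counters(text):
    """Return (counter, value, lid, port) tuples found in the warning lines."""
    return [
        (name, int(value), int(lid), int(port))
        for name, value, lid, port in WARN_RE.findall(text)
    ]


def parse_lid_names(text):
    """Return a dict mapping each lid to the first description seen for it."""
    lids = {}
    for line in text.splitlines():
        # Only the trailing comment (after '#') carries descriptions and lids.
        comment = line.partition("#")[2]
        for name, lid in NAME_LID_RE.findall(comment):
            lids.setdefault(int(lid), name)
    return lids


def describe_errors(error_text, topology_text):
    """Build one readable line per counter that exceeded its threshold."""
    names = parse_lid_names(topology_text)
    return [
        f"lid {lid} ({names.get(lid, 'unknown device')}) port {port}: "
        f"{counter} = {value}"
        for counter, value, lid, port in parse_error_counters(error_text)
    ]


if __name__ == "__main__":
    for line in describe_errors(SAMPLE_ERRORS, SAMPLE_TOPOLOGY):
        print(line)
```

To try this against a live fabric, you could save the output of each command to a file and pass the file contents to describe_errors() in place of the sample strings.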