How to identify and prevent overheating
Symptoms of Overheating
- Turning off on its own
- Freezing
- Frequent memory errors
Most commonly, an overheating computer will turn off unexpectedly and then do so again shortly after being turned back on. This happens because CPU temperatures are continuously monitored, and the system shuts itself off immediately if they get too high.
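On a Linux server, the kernel often exposes temperature readings of this kind under /sys/class/thermal. The following is a minimal sketch, assuming that sysfs interface is present (not every platform populates it). The function name read_thermal_zones and its optional directory argument are illustrative and are not part of any Advanced Clustering tool.

```shell
#!/bin/sh
# Print each thermal zone's type and temperature in whole degrees Celsius.
# Assumption: the kernel exposes zones under /sys/class/thermal, and each
# "temp" file holds a value in millidegrees Celsius.
read_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        # Skip if the glob matched nothing or the zone has no readable value.
        [ -r "$zone/temp" ] || continue
        millideg=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$millideg" ] || continue
        # The "type" file names the sensor; fall back if it is missing.
        if [ -r "$zone/type" ]; then
            name=$(cat "$zone/type")
        else
            name="unknown"
        fi
        echo "$name: $((millideg / 1000)) C"
    done
}

# An optional first argument overrides the sysfs directory.
read_thermal_zones "$@"
```

Run it repeatedly while the server is under load; readings that keep climbing until the machine powers off point to a cooling problem rather than a software fault.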
Check that all fans are working
Place the server where you can easily remove the chassis cover and watch all the fans while powering the server on. Look for fans that are not spinning at all, as well as fans that are spinning more slowly than the others. Typically you will be able to hear if fans are spinning at different speeds.
If a fan is not spinning correctly, it will need to be replaced, unless the cause is a buildup of dust that can easily be blown out.
Re-seat CPU heat sinks
Sometimes when the server is moved, the contact between the CPU and heat sink can be disrupted. To be sure, completely remove the heat sinks, check that the thermal paste is still intact, and firmly re-seat them. If you are not sure whether contact is being made, a good test is to:
- completely remove all current thermal paste and re-apply it to the heat sink
- re-seat and then remove the heat sink again
- check to see if the thermal paste was spread out on the CPU by the heat sink making contact
Clean with canned air
Look for a buildup of dust and blow it out with canned air. Pay special attention to the heat sinks, the fans, and the area around the base of the CPUs. This is good practice for regular preventative maintenance.
Compute node air ducts
Advanced Clustering Half-U compute nodes have a fitted air duct that guides air from the fans over the CPUs. For proper airflow, it is important to keep the air duct in place whenever the node is powered on.
Check temperature with act_sensors
If your server is part of a cluster with the act_dir package installed, you can use the act_sensors utility to check node temperatures.
To check temperatures on every node in the cluster:
$ act_sensors -a temps
To check a specific node:
$ act_sensors -n nodename temps
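To check a few suspect nodes in one pass, the single-node command can be wrapped in a loop. This is a minimal sketch that uses only the "act_sensors -n nodename temps" form shown above. The function name check_nodes is illustrative, and the output of act_sensors is printed as-is because its format is not assumed here.

```shell
#!/bin/sh
# Run the single-node temperature check for each node name given.
# Assumption: act_sensors is on the PATH and accepts "-n <node> temps",
# as in the single-node example above.
check_nodes() {
    for node in "$@"; do
        echo "== $node =="
        # Report a failure, then continue with the remaining nodes.
        act_sensors -n "$node" temps || echo "act_sensors failed for $node" >&2
    done
}

# Node names are taken from the command line.
check_nodes "$@"
```

Saved as a script (the file name is your choice), it would be called with the node names as arguments, for example node01 node02 as placeholder host names.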