Repairing a corrupted SGE database
Note: It is important to understand why sgemaster is failing to start. Before running these steps, confirm that the logs show signs of database corruption. The logs are located in /act/sge/default/spool/qmaster/messages. A typical corruption error message may look like this:
03/07/2015 17:34:07| main|head|E|couldn't open berkeley database "sge": (22) Invalid argument
03/07/2015 17:34:07| […]
Taking Compute Nodes Down for Maintenance
When taking a compute node down for any reason, it's good practice to remove it from any job queues of which it is a member. A node that comes up temporarily may start new jobs, only to be shut down again, killing those users' jobs. Here's how to safely pull a node out of service […]
Creating Groups of Nodes in TORQUE
Despite being a simple first-in, first-out (FIFO) scheduler, pbs_sched can use node properties to emulate host groups. This is useful if you have different types of nodes that provide different types of resources. The nodes available in TORQUE are controlled by the file /var/spool/torque/server_priv/nodes. The most basic configuration simply lists the nodes and […]