Repairing a corrupted SGE database
Note: It is important to understand why sgemaster is failing to start. Before running these steps, confirm that the logs show signs of database corruption. The logs are located in /act/sge/default/spool/qmaster/messages. A typical corruption error message may look like this:
03/07/2015 17:34:07| main|head|E|couldn't open berkeley database "sge": (22) Invalid argument
03/07/2015 17:34:07| main|head|E|startup of rule "default rule" in context "berkeleydb spooling" failed
03/07/2015 17:34:07| main|head|C|setup failed
or
03/12/2015 13:07:08| main|head|E|couldn't open database environment for server "local spooling", directory "/act/sge/default/spool/spooldb": (-30974) DB_RUNRECOVERY: Fatal error, run database recovery
03/12/2015 13:07:08| main|head|E|startup of rule "default rule" in context "berkeleydb spooling" failed
03/12/2015 13:07:08| main|head|C|setup failed
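As a rough aid for checking the log, the messages shown above can be searched for with standard tools. The sketch below is not part of SGE; the function names find_db_errors and corrupt_db_names are made up for illustration, and the patterns are taken only from the two example errors in this article, so other corruption messages may not match.

```shell
# Hypothetical helpers for scanning the qmaster messages file.
# The patterns come from the example errors in this article only.

# Print the most recent log lines that mention Berkeley DB problems.
find_db_errors() {
    grep -E 'berkeley|DB_RUNRECOVERY' "$1" | tail -n 20
}

# Print each database name that failed to open, one per line.
corrupt_db_names() {
    sed -n 's/.*open berkeley database "\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Example use with the log path given in this article:
#   find_db_errors /act/sge/default/spool/qmaster/messages
#   corrupt_db_names /act/sge/default/spool/qmaster/messages
```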
If your filesystem ever fills up or the system crashes at the wrong time, your SGE database may become corrupted. Take note of the database named in the error messages. In our example errors, “sge” is the corrupted database. The steps below can usually repair the “sge” database so that SGE runs properly again. The same steps also work for the “sge_job” database.
cd $SGE_ROOT/default/spool
cp -a spooldb spooldb.bak
cd spooldb
db_verify sge
db_recover
db_dump -f sge.out sge
mv sge sge.old
db_load -f sge.out sge
db_verify sge
chown -R sgeadmin. $SGE_ROOT/default/spool
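Since the same sequence applies to both “sge” and “sge_job”, the manual steps can be wrapped in a small function that takes the database name as an argument and stops at the first failing command. This is only a sketch under some assumptions: the function name repair_db is made up, refusing to overwrite an existing spooldb.bak is an added precaution rather than part of the original procedure, and the first db_verify is treated as informational because it is expected to complain about a corrupted file.

```shell
# Sketch: repair one Berkeley DB file inside an SGE spool directory.
# Usage: repair_db SPOOL_DIR DB_NAME    (DB_NAME is "sge" or "sge_job")
repair_db() {
    spool_dir=$1
    db=$2

    # Back up the whole spooldb directory before touching it.
    # Refusing to reuse an older backup is our own precaution.
    if [ -e "$spool_dir/spooldb.bak" ]; then
        echo "backup $spool_dir/spooldb.bak already exists" >&2
        return 1
    fi
    cp -a "$spool_dir/spooldb" "$spool_dir/spooldb.bak" || return 1

    (
        cd "$spool_dir/spooldb" || exit 1
        # Expected to report errors on a corrupted file, so not fatal here.
        db_verify "$db" || echo "db_verify reported problems in $db" >&2
        db_recover || exit 1
        # Dump the contents, set the old file aside, reload into a new file.
        db_dump -f "$db.out" "$db" || exit 1
        mv "$db" "$db.old" || exit 1
        db_load -f "$db.out" "$db" || exit 1
        # The rebuilt file should now verify cleanly.
        db_verify "$db" || exit 1
    ) || return 1

    # Restore ownership, as in the manual steps.
    chown -R sgeadmin. "$spool_dir"
}

# Example use, run as root with SGE_ROOT set:
#   repair_db "$SGE_ROOT/default/spool" sge
```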
If the above does not work, this alternative method may work instead.
cd /act/sge
./sge_inst -bak
This starts an interactive backup script. Choose the default answers. Optionally, choosing not to use tar/gzip makes the backup easier to inspect. The backup is saved to /act/sge/backup. To fix the database corruption, restore this backup with the following command:
./sge_inst -rst
This starts another interactive script, this time to restore from the backup. Answer all the questions; the default answers should be correct. You should then be able to start sgemaster.