checkBatchFarm.sh

Problem

From time to time there are problems with the batch farm: machines hang, or can no longer see a directory.

Solution

To test which machines are affected, this script can be used. It will connect via ssh to each machine and execute the following command:

# $EXECUTABLE, $DIRECTORY and $LIBDIR are expanded locally (outside the single quotes);
# $HOSTNAME stays quoted and is expanded on the remote machine.
command='if [ -x '$EXECUTABLE' ] ; then echo -n executable found; \
if [ -d '$DIRECTORY' ] ; then echo -n " data dir found"; if [ -d '$LIBDIR' ]; \
then echo -n $HOSTNAME "">> hosts.sh; echo " library dir found"; \
else echo " ko"; echo $HOSTNAME >> hosts_problem.sh; fi; else echo " ko"; \
echo $HOSTNAME >> hosts_problem.sh; fi; \
else echo $HOSTNAME >> hosts_problem.sh; fi'
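
How the script loops over the machines is not shown here. The following is only a minimal sketch of how a command string like the one above could be run on a list of hosts. The names run_on_hosts and REMOTE_SHELL are assumptions made for illustration and do not come from checkBatchFarm.sh; the host list is passed in by hand, because how the script obtains it from the batch queue is not documented on this page.

```shell
#!/bin/bash
# Sketch only: run_on_hosts and REMOTE_SHELL are hypothetical names,
# not taken from checkBatchFarm.sh.
# Usage: run_on_hosts "$command" host1 host2 ...
run_on_hosts() {
    local remote_command=$1
    shift
    local host status
    for host in "$@"; do
        printf '%s: ' "$host"
        status=0
        # -n detaches stdin; ConnectTimeout limits how long an
        # unreachable machine can block the loop.
        ${REMOTE_SHELL:-ssh -n -o ConnectTimeout=10} "$host" "$remote_command" || status=$?
        # ssh itself exits with 255 when the connection fails.
        if [ "$status" -eq 255 ]; then
            echo "no connection"
        fi
    done
}
```

Setting REMOTE_SHELL to another command (for example a small test function) allows a dry run without contacting any machine.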

The script takes four parameters:
  1. name of the batch queue to use,
  2. name of an executable (including location),
  3. name of a directory with data files,
  4. name of a directory with libraries used by the executable.
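
The four parameters could be read as sketched below. This is an assumption about the argument handling, not the script's actual code: EXECUTABLE, DIRECTORY and LIBDIR are the variable names used in the command above, while QUEUE and the function name parse_args are hypothetical.

```shell
#!/bin/bash
# Sketch only: parse_args and QUEUE are hypothetical names.
parse_args() {
    if [ "$#" -ne 4 ]; then
        echo "usage: checkBatchFarm.sh <queue> <executable> <data dir> <library dir>" >&2
        return 1
    fi
    QUEUE=$1        # batch queue to use (assumed variable name)
    EXECUTABLE=$2   # executable, including its location
    DIRECTORY=$3    # directory with data files
    LIBDIR=$4       # directory with libraries used by the executable
}
```

Calling parse_args "$@" at the top of a script fills the four variables, or prints a usage line and returns a non-zero status if the number of arguments is wrong.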

If you have never connected to the batch machines via ssh, you will have to type "yes" a couple of times to accept their host keys.

The result is two files, hosts.sh and hosts_problem.sh. Each contains one definition of an environment variable listing hostnames: hosts.sh lists the machines that were found to be OK, and hosts_problem.sh lists those that cannot see the executable or one of the required directories.

If you had to type your password for a machine and it shows up in hosts_problem.sh, the machine does not see your home directory. Otherwise, it is not accepting ssh connections at the moment.

Sending this report to the DVEE department experts will help them find out which machines are corrupted and need a reboot.

-- JoernWuestenfeld - 20 Jul 2004
Topic revision: r2 - 2004-07-20, JoernWuestenfeld