Debugging
Everything is great when things work as we expect them to; oftentimes, however, we are not so lucky. Distributed applications, and even simple jobs running remotely, are particularly challenging to debug. It is usually hard to know exactly which user account our jobs run under, which environment they are executed in, and where they run; with job schedulers, it is hard even to predict when they will run.
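One way to reduce this uncertainty is to have each job record its own runtime context as its first action. The following is a minimal sketch using only the Python standard library; the function names describe_runtime and log_runtime, and the choice of fields, are illustrative assumptions rather than part of any scheduler's API.

```python
import getpass
import os
import platform
import socket
import sys
from datetime import datetime, timezone


def describe_runtime():
    """Collect basic facts about where and how this process is running."""
    try:
        user = getpass.getuser()
    except Exception:
        # Some batch environments have no usable user database entry.
        user = "unknown"
    return {
        "time_utc": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "host": socket.gethostname(),
        "cwd": os.getcwd(),
        "python": sys.executable,
        "platform": platform.platform(),
        "path": os.environ.get("PATH", ""),
    }


def log_runtime(stream=sys.stderr):
    """Write one line per fact, so the details end up in the job's STDERR."""
    for key, value in describe_runtime().items():
        print(f"[runtime] {key}: {value}", file=stream)


if __name__ == "__main__":
    log_runtime()
```

Calling log_runtime() at the top of a job script means the account, host, working directory, and search path are available in the job's output even when the job fails later on.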
When things do not work as we expect them to, there are a few places where we can look for hints as to what happened. When working with a job scheduler, the first thing to do is look at any error messages returned by the job submission tool (that is, condor_submit, condor_submit_dag, or qsub). The second place to look for clues is the job's STDOUT, STDERR, and log files.
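When a job writes several output files, it can help to look at the end of each one in a single pass. The sketch below assumes the job's STDOUT, STDERR, and log were written to files called job.out, job.err, and job.log; these names are hypothetical and should be replaced with whatever the submit file declares. The helpers tail and summarize_job_output are likewise illustrative.

```python
from collections import deque
from pathlib import Path


def tail(path, max_lines=20):
    """Return the last max_lines lines of a text file, or [] if it is missing."""
    path = Path(path)
    if not path.is_file():
        return []
    with path.open(errors="replace") as handle:
        # A bounded deque keeps only the final lines while reading the file once.
        return [line.rstrip("\n") for line in deque(handle, maxlen=max_lines)]


def summarize_job_output(paths, max_lines=20):
    """Map each path to the last lines of the corresponding file."""
    return {str(p): tail(p, max_lines) for p in paths}


if __name__ == "__main__":
    # Hypothetical file names; use the ones declared in your submit file.
    files = ["job.out", "job.err", "job.log"]
    for name, lines in summarize_job_output(files).items():
        print(f"==> {name} <==")
        print("\n".join(lines) if lines else "(missing or empty)")
```

A missing file is reported rather than raising an error, since an absent STDERR or log file is itself a useful clue that the job may never have started.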
Usually, the job scheduler itself has tools to diagnose problematic jobs. HTCondor, for instance, provides condor_q -better-analyze to investigate why a given job might be stuck in the queue longer than expected...