Job schedulers
As mentioned in the previous section, you typically cannot run code directly on an HPC cluster; instead, you submit a request to run that code to a job scheduler. The job scheduler identifies appropriate compute resources for the application and runs the code on those nodes.
This level of indirection introduces some overhead, but it also helps ensure that every user gets a fair share of the supercomputer time, that job priorities are enforced, and that the many cores are kept busy.
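To make the indirection concrete, here is a minimal sketch of the idea in Python: jobs are queued rather than run directly, and a scheduling pass places them on nodes with enough free cores, highest priority first. This is a toy model under simplifying assumptions, not how PBS or HTCondor are implemented; the names Job, Node, ToyScheduler, and demo, as well as the node names and core counts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int          # number of cores the job requests
    priority: int = 0   # larger value means more urgent (an assumption of this sketch)


@dataclass
class Node:
    name: str
    free_cores: int     # cores not yet assigned to any job


class ToyScheduler:
    """Hypothetical in-memory scheduler; real schedulers are far more elaborate."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.queue = []

    def submit(self, job):
        # Submitting only queues the request; nothing runs yet.
        self.queue.append(job)

    def dispatch(self):
        """Run one scheduling pass and return (job name, node name) placements."""
        placements = []
        waiting = []
        # sorted() is stable, so jobs of equal priority keep submission order.
        for job in sorted(self.queue, key=lambda j: -j.priority):
            # Pick the first node that can satisfy the core request.
            node = next((n for n in self.nodes if n.free_cores >= job.cores), None)
            if node is None:
                waiting.append(job)  # no resources yet: the job stays queued
                continue
            node.free_cores -= job.cores
            placements.append((job.name, node.name))
        self.queue = waiting
        return placements


def demo():
    sched = ToyScheduler([Node("node01", 4), Node("node02", 8)])
    sched.submit(Job("small", cores=2))
    sched.submit(Job("urgent", cores=4, priority=10))
    sched.submit(Job("huge", cores=16))
    return sched.dispatch(), [job.name for job in sched.queue]


if __name__ == "__main__":
    print(demo())
```

In this sketch, the job named urgent is placed first even though it was submitted second, and the job named huge remains in the queue because no node has 16 free cores. Fair-share accounting, which real schedulers also perform, is left out to keep the example short.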
The following figure shows the basic components of a job scheduler (for example, PBS or HTCondor) as well as the sequence of events from job submission to execution:
First, let's look at a few definitions:
Job: This is the metadata describing our application, such as its executables, any input and output, its hardware and software requirements, its execution environment, and so on
Machine: This is the minimal job execution hardware; it could be a fraction of a physical compute node (for example, one single core...