Overview
The underlying problem with submitting jobs to IU's HPC environment is the latency of starting individual batch jobs. It is not uncommon for a single job to wait 12-24 hours before starting, so when processing a large batch of objects the time spent waiting far outstrips the time actually spent using the GPUs.
Scheduling
To describe the mechanism used, consider a situation with a single GPU resource and these jobs:
ID | Run Time (mins) | Notes |
---|---|---|
G1 | 120 | Random GPU User job |
G2 | 60 | Random GPU User job |
G3 | 240 | Random GPU User job |
G4 | 180 | Random GPU User job |
A1 | 15 | AMP interactive GPU Job |
A2 | 10 | AMP batch GPU Job |
A3 | 10 | AMP batch GPU Job |
A4 | 10 | AMP batch GPU Job |
A5 | 15 | AMP interactive GPU Job |
On IU's HPC environment, job scheduling is handled by the SLURM software, which manages resources and runs jobs according to available resources and priority. Since there is a single resource, the order in which the jobs are submitted into the SLURM queue makes a huge difference:
Case | Order | AMP Start Time | AMP Run Time | Notes |
---|---|---|---|---|
1 | G1, G2, G3, G4, A1, A2, A3, A4, A5 | 10 hours | 1 hour | This is the worst case for latency, and there isn't anything we can do about it |
2 | G1, A1, G2, A2, G3, A3, G4, A4, A5 | 2 hours | 9 hours | Interleaved requests. Some results come back earlier, but the overall time is longer |
3 | G1, A1, G2, A2, A3, A4, G3, A5, G4 | 2 hours | 6 hours | The batched AMP processes are queued together, shortening the overall time |
4 | G1, A1, A2, A3, A4, A5, G2, G3, G4 | 2 hours | 1 hour | All of the AMP jobs run as soon as the first AMP job reaches the head of the queue, minimizing the overall time |
5 | G1, A1, A2, G2, A3, A4, A5, G3, G4 | 2 hours | 2 hours | A random job got scheduled in the middle of the AMP jobs, stalling their completion |
Cases 1 and 4 are effectively the same case: a number of random jobs sit ahead of all of the AMP jobs, and the random jobs after the AMP work can be ignored. The other three cases are all interlaced cases, where random jobs appear between the AMP jobs.
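These cases can be checked with a small simulation. This is only a sketch: it assumes a single FIFO GPU resource and back-to-back execution with no scheduler overhead, and amp_metrics is a hypothetical helper, not part of the AMP code:

```python
# Runtimes in minutes, from the job table above.
RUNTIMES = {"G1": 120, "G2": 60, "G3": 240, "G4": 180,
            "A1": 15, "A2": 10, "A3": 10, "A4": 10, "A5": 15}

def amp_metrics(order):
    """Return (minutes until the first AMP job starts, minutes from the
    first AMP start to the last AMP finish) for a FIFO single-resource
    queue where each job runs immediately after the previous one."""
    clock = 0
    first_start = last_end = None
    for job in order:
        if job.startswith("A"):                # an AMP job
            if first_start is None:
                first_start = clock            # first AMP job begins here
            last_end = clock + RUNTIMES[job]   # track the latest AMP finish
        clock += RUNTIMES[job]
    return first_start, last_end - first_start
```

For example, case 4 gives a 120-minute wait and a 60-minute AMP window, while the interleaving in case 3 stretches the window to 360 minutes.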
So, to optimize AMP throughput, it is important to get the AMP jobs to the head of the queue with minimal interlacing with other jobs. Since there isn't any way to control when jobs are placed into the SLURM queue (AMP or otherwise), another mechanism is needed: the hpc_runner.
As AMP generates HPC requests, instead of submitting them directly into the SLURM queue, they are placed into an unordered queue managed by the hpc_runner. When the AMP jobs are submitted, the hpc_runner job is submitted to SLURM if it is not already in the queue or already running on HPC. When the hpc_runner starts, it processes any AMP jobs in the queue, including any which were submitted after the hpc_runner was submitted to SLURM or while it was processing previous jobs in this invocation.
This allows all of the AMP HPC requests to complete in the minimum amount of time.
HPC Batch Components
MGMs
Each MGM generates a YAML file describing the job it wants done on the HPC cluster and then waits for the results. A Python library with a single function (submit_and_wait) puts the information into a dropbox and waits for completion, returning the status of the job. The MGM doesn't need to know anything about HPC – just the location of the system dropbox. As such, the MGM code for a typical HPC invocation is less than 50 lines of code.
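A minimal sketch of what that library function might look like, assuming a dropbox of job files that the submitter renames to .job.finished when done. The real library uses YAML job files; JSON is used here only to keep the sketch dependency-free, and every name except submit_and_wait is hypothetical:

```python
import json
import time
from pathlib import Path

def submit_and_wait(dropbox, input_map, output_map, script,
                    poll_secs=1.0, timeout_secs=30.0):
    """Drop a job file into the dropbox, then poll until the submitter
    renames it to <name>.job.finished; return the completed job record."""
    dropbox = Path(dropbox)
    job_file = dropbox / f"job-{int(time.time() * 1000)}.job"
    job_file.write_text(json.dumps({
        "input_map": input_map,    # local input files, keyed by role
        "output_map": output_map,  # where the results should be placed
        "script": script,          # which tool template to run on HPC
    }))
    finished = job_file.with_suffix(".job.finished")
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        if finished.exists():
            return json.loads(finished.read_text())
        time.sleep(poll_secs)
    raise TimeoutError(f"no result for {job_file.name}")
```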
HPC_Submitter
The HPC submitter is a cron job that runs every minute on the AMP machines and performs three tasks:
- Any new jobs in the dropbox are pushed to the HPC cluster:
  - A workspace directory is created on HPC
  - Local input files are copied to the workspace
  - A job.sh script is created on HPC from the parameters in the job
  - The job is marked as submitted
- The hpc_queuer program is run on the HPC system (see below)
- Any jobs which are complete on HPC are resolved:
  - If the workspace has a finished.out file, the job is finished
  - The return code, stdout, and stderr for the job are retrieved
  - If the job was successful, the output files are copied from HPC to the local machine
  - The job is marked as finished so the submit_and_wait call in the MGM can complete
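The job-resolution step can be sketched as follows, again using JSON in place of YAML; resolve_finished_jobs is a hypothetical helper, but the finished.out, .job.submitted, and .job.finished conventions are the ones described in this document:

```python
import json
from pathlib import Path

def resolve_finished_jobs(dropbox):
    """For each submitted job whose workspace has a finished.out file,
    record the return code and rename the job file to .job.finished."""
    resolved = []
    for jf in Path(dropbox).glob("*.job.submitted"):
        job = json.loads(jf.read_text())
        finished_file = Path(job["hpc"]["finished_file"])
        if not finished_file.exists():
            continue  # still waiting or running on HPC
        rc = int(finished_file.read_text().strip())
        job["job"] = {"rc": rc, "status": "ok" if rc == 0 else "error"}
        # Rename so the MGM's submit_and_wait call can discover the result.
        done = jf.with_name(jf.name.replace(".job.submitted", ".job.finished"))
        done.write_text(json.dumps(job))
        jf.unlink()
        resolved.append(done)
    return resolved
```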
HPC_Queuer
The hpc_queuer script checks whether there are any outstanding AMP tasks and queues the hpc_runner if it is not currently in SLURM (either pending or running).
If the hpc_runner needs to be submitted to SLURM, a trampoline script is created which calls the hpc_runner with the appropriate arguments, and the trampoline is submitted to SLURM. The trampoline is required because SLURM copies a submitted script to a temporary directory and runs it from there, which means that any script-relative paths do not work.
The script has to be triggered by the hpc_submitter because cron jobs are not allowed on the HPC system. Because the hpc_queuer is always started by the hpc_submitter, the system is fail-safe: if jobs on HPC are interrupted or the hpc_runner fails, the hpc_queuer will requeue the hpc_runner as long as there are incomplete jobs.
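A sketch of the queueing decision and the trampoline. The SLURM interactions (the squeue check and the sbatch call) are injected as plain Python values and callables so the logic can be shown without a cluster, and the runner path is a placeholder, not the real install location:

```python
from pathlib import Path

RUNNER = "/path/to/on_hpc/scripts/hpc_runner"  # hypothetical absolute path

def make_trampoline(path, runner=RUNNER):
    """Write a trampoline script that calls hpc_runner by absolute path.
    sbatch copies the submitted script to a spool directory before running
    it, so script-relative paths inside hpc_runner would otherwise break."""
    path = Path(path)
    path.write_text(f"#!/bin/bash\n{runner}\n")
    path.chmod(0o755)
    return path

def queue_runner_if_needed(outstanding_jobs, runner_in_slurm, submit):
    """Submit the hpc_runner trampoline unless it is already pending or
    running in SLURM, or there is nothing for it to do."""
    if not outstanding_jobs or runner_in_slurm:
        return False
    submit()  # e.g. sbatch <trampoline>, injected by the caller
    return True
```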
HPC_Runner
When invoked by the trampoline script via SLURM, the hpc_runner will:
1. Scan the workspaces for any AMP jobs that need to be processed
2. If there are none, exit
3. For each outstanding job:
   - Run the job.sh that was created by the hpc_submitter, capturing the return code, stderr, and stdout
   - Write the return code to finished.out so hpc_submitter knows the job has completed
4. Go to step 1
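The loop above can be sketched as follows. Running job.sh is injected as a callable, and the workspace layout (job.sh, finished.out) follows the description in this document; everything else is hypothetical:

```python
from pathlib import Path

def run_outstanding(workspace_root, run_job):
    """Repeatedly scan for workspaces that have a job.sh but no
    finished.out, run each one, and record its return code. Rescanning
    until nothing is pending means jobs dropped off while a previous
    batch was running are still picked up in this invocation."""
    while True:
        pending = [ws for ws in Path(workspace_root).iterdir()
                   if (ws / "job.sh").exists()
                   and not (ws / "finished.out").exists()]
        if not pending:
            return  # step 2: nothing left to do
        for ws in pending:
            rc = run_job(ws / "job.sh")          # capture rc (and output)
            (ws / "finished.out").write_text(str(rc))
```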
A real-life example
This is an example using a single invocation of the INA Speech Segmenter HPC MGM.
First, from Galaxy (or by hand), this command is run:
./ina_speech_segmenter.py ../dropbox/ ~/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3 /tmp/test2.seg
This tells the MGM to create a job in ../dropbox that gets the segments from the .mp3 file and places the output in /tmp/test2.seg.
In the dropbox, a YAML file (tmp8u48rlln.job) is created with this information:
input_map:
  input: /home/bdwheele/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3
output_map:
  segments: /tmp/test2.seg
script: ina_speech_segmenter
This job file gives all of the details based on the parameters. The script will then wait for completion, checking every few seconds.
In less than a minute, a cron job will run:
./hpc_submitter hpc_submitter.ini
The submitter sees the .job file and submits it to HPC: the workspace is created and the job.sh file is generated. There is a template for each tool; for ina_speech_segmenter, the template is:
#!/bin/bash
module load singularity
exec singularity run --nv \
  --bind <<workspace>>:/mnt \
  <<containers>>/inaspeech.sif \
  /mnt/input/<<input>> \
  /mnt/output/<<segments>>
The <<...>> placeholders are replaced to generate the job.sh:
#!/bin/bash
module load singularity
exec singularity run --nv \
  --bind /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725:/mnt \
  /N/u/bdwheele/Carbonate/containers/inaspeech.sif \
  /mnt/input/01-Chapter_0A_-_Introduction.mp3 \
  /mnt/output/test2.seg
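The substitution itself is straightforward; here is a sketch with a hypothetical render_template helper, where the placeholder names follow the template above:

```python
import re

def render_template(template, values):
    """Replace every <<name>> placeholder with values[name]."""
    return re.sub(r"<<(\w+)>>", lambda m: values[m.group(1)], template)

# Example substitution, mirroring the template and job.sh above:
job_sh = render_template(
    "--bind <<workspace>>:/mnt <<containers>>/inaspeech.sif "
    "/mnt/input/<<input>> /mnt/output/<<segments>>",
    {"workspace": "/ws", "containers": "/containers",
     "input": "01-Chapter_0A_-_Introduction.mp3", "segments": "test2.seg"})
```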
Then the job file is modified to hold the required HPC information:
hpc:
  finished_file: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/finished.out
  input_map:
    input: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/input/01-Chapter_0A_-_Introduction.mp3
  output_map:
    segments: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/output/test2.seg
  workspace: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725
input_map:
  input: /home/bdwheele/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3
output_map:
  segments: /tmp/test2.seg
script: ina_speech_segmenter
and the file is renamed to tmp8u48rlln.job.submitted to indicate the job is in progress.
hpc_submitter then calls hpc_queuer on the HPC system, which sees that the hpc_runner is not in SLURM:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
70314        dl     bash ninoparu  R   10:44:43      1 dl8
70365        dl      com  shishuz  R      24:26      1 dl3
70175        dl      lin   timlai  R   19:52:52      1 dl7
70174        dl    model   timlai  R   19:53:01      1 dl5
70013        dl      emb   timlai  R 1-17:30:43      1 dl5
so this trampoline script is created to launch hpc_runner via SLURM:
#!/bin/bash
/geode2/home/u010/bdwheele/Carbonate/hpc_batch/on_hpc/scripts/hpc_runner
At that point, SLURM is aware of the job and will run it when resources become available:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
70367        dl hpc_runn bdwheele  R       0:00      1 dl1
70314        dl     bash ninoparu  R   10:44:54      1 dl8
70365        dl      com  shishuz  R      24:37      1 dl3
70175        dl      lin   timlai  R   19:53:03      1 dl7
70174        dl    model   timlai  R   19:53:12      1 dl5
70013        dl      emb   timlai  R 1-17:30:54      1 dl5
When hpc_runner has finished running the job.sh file, the directory structure looks like:
total 256
drwxrwxr-x 4 bdwheele bdwheele 32768 Sep  2 10:57 .
drwxrwxr-x 3 bdwheele bdwheele 32768 Sep  2 10:56 ..
-rw-rw-r-- 1 bdwheele bdwheele     2 Sep  2 10:57 finished.out
drwxrwxr-x 2 bdwheele bdwheele 32768 Sep  2 10:56 input
-rwxr-x--- 1 bdwheele bdwheele   318 Sep  2 10:56 job.sh
drwxrwxr-x 2 bdwheele bdwheele 32768 Sep  2 10:57 output
-rw-rw-r-- 1 bdwheele bdwheele  6734 Sep  2 10:57 stderr.txt
-rw-rw-r-- 1 bdwheele bdwheele   259 Sep  2 10:57 stdout.txt

./input:
total 576
drwxrwxr-x 2 bdwheele bdwheele  32768 Sep  2 10:56 .
drwxrwxr-x 4 bdwheele bdwheele  32768 Sep  2 10:57 ..
-rw-rw-r-- 1 bdwheele bdwheele 499840 Sep  2 10:56 01-Chapter_0A_-_Introduction.mp3

./output:
total 64
drwxrwxr-x 2 bdwheele bdwheele 32768 Sep  2 10:57 .
drwxrwxr-x 4 bdwheele bdwheele 32768 Sep  2 10:57 ..
-rw-r--r-- 1 bdwheele bdwheele    90 Sep  2 10:57 test2.seg
During this process, the hpc_submitter has been running every minute, waiting for the finished.out file to appear. When it does, the output files and process status information are collected and the workspace is cleaned up. The .job file contains the updated information and is renamed to tmp8u48rlln.job.finished so it can be discovered by the MGM. The file now has additional information in the 'job' node:
hpc:
  finished_file: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/finished.out
  input_map:
    input: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/input/01-Chapter_0A_-_Introduction.mp3
  output_map:
    segments: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/output/test2.seg
  workspace: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725
input_map:
  input: /home/bdwheele/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3
job:
  message: Successful
  rc: 0
  status: ok
  stderr: "singularity version 3.5.2 loaded.\n2020-09-02 10:57:03.330071: I
    tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully
    opened dynamic library libcudart.so.10.1\n
    [... several hundred lines of TensorFlow device initialization messages and
    Python deprecation/runtime warnings elided for readability ...]
    \n/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:193:
    RuntimeWarning: invalid value encountered in subtract\n
    x = asanyarray(arr - arrmean)\n"
  stdout: 'INFO: Container was created Fri Aug 14 19:17:31 UTC 2020
    INFO: Arguments received: /mnt/input/01-Chapter_0A_-_Introduction.mp3 /mnt/output/test2.seg
    INFO: Output will be created in /mnt''s bind
    batch_processing 1 files
    1/1 [(''/mnt/output/test2.seg'', 0, ''ok'')] '
output_map:
  segments: /tmp/test2.seg
script: ina_speech_segmenter