Overview

The underlying problem with submitting jobs to IU's HPC environment is the latency of starting individual batch jobs. It is not uncommon for a single job to wait 12-24 hours in the queue, so when processing a large batch of objects the time spent waiting far outstrips the time actually spent using the GPUs.

Scheduling

To describe the mechanism used, consider a situation with a single GPU resource and these jobs:

ID | Run Time (mins) | Notes
G1 | 120 | Random GPU user job
G2 | 60  | Random GPU user job
G3 | 240 | Random GPU user job
G4 | 180 | Random GPU user job
A1 | 15  | AMP interactive GPU job
A2 | 10  | AMP batch GPU job
A3 | 10  | AMP batch GPU job
A4 | 10  | AMP batch GPU job
A5 | 15  | AMP interactive GPU job

On IU's HPC environment, job scheduling is handled by the SLURM software, which manages resources and runs jobs according to available resources and priority.  Since there is a single GPU resource in this example, the order in which the jobs enter the SLURM queue makes a huge difference:

Case | Order | AMP Start Time | AMP Run Time | Notes
1 | G1, G2, G3, G4, A1, A2, A3, A4, A5 | 8.5 hours | 1 hour | This is really the worst case for latency, and there isn't anything we can do about it.
2 | G1, A1, G2, A2, G3, A3, G4, A4, A5 | 2 hours | 7.5 hours | Interleaved requests. Some results come back earlier, but the overall time is longer.
3 | G1, A1, G2, A2, A3, A4, G3, A5, G4 | 2 hours | 6 hours | The batched AMP processes get queued together, shortening the overall time.
4 | G1, A1, A2, A3, A4, A5, G2, G3, G4 | 2 hours | 1 hour | All of the AMP jobs are run when the first AMP job makes it into the queue, significantly shortening the overall time.
5 | G1, A1, A2, G2, A3, A4, A5, G3, G4 | 2 hours | 5 hours | A random job got scheduled in the middle of the AMP jobs, stalling completion.

Cases 1 and 4 are effectively the same case:  there are a number of random jobs ahead of all of the AMP jobs, and the random jobs after the AMP work can be ignored.  The other three cases are interlaced cases where random jobs appear between the AMP jobs.
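
To make the ordering effect concrete, here is a toy model (not part of the AMP code) that pushes a job order through a single GPU in strict FIFO fashion and reports when the first AMP job starts and how long the AMP work spans. Real SLURM scheduling involves priorities and backfill, so the figures in the table above will not necessarily match this naive model.

# Toy FIFO model of a single GPU resource; run times are in minutes and come
# from the job table above. This is an illustration only, not AMP code.
JOBS = {"G1": 120, "G2": 60, "G3": 240, "G4": 180,
        "A1": 15, "A2": 10, "A3": 10, "A4": 10, "A5": 15}

def amp_timing(order):
    """Return (minutes until the first AMP job starts, minutes from that
    point until the last AMP job finishes)."""
    clock = 0
    amp_start = amp_end = None
    for job in order:
        if job.startswith("A") and amp_start is None:
            amp_start = clock
        clock += JOBS[job]
        if job.startswith("A"):
            amp_end = clock
    return amp_start, amp_end - amp_start

# Case 4 from the table: all AMP jobs run as soon as the first one is reached.
start, span = amp_timing(["G1", "A1", "A2", "A3", "A4", "A5", "G2", "G3", "G4"])
print(f"AMP starts after {start / 60:.1f} h and finishes {span / 60:.1f} h later")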

So, to optimize AMP throughput, it's important to get the AMP jobs at the head of the queue with minimal interlacing with other jobs.  Since there isn't any way to control when jobs are placed into the SLURM queue (AMP or otherwise), another mechanism is needed:  an hpc_runner.

As AMP generates HPC requests, instead of submitting them directly into the SLURM queue, they are placed into an unordered queue managed by the hpc_runner.  When AMP jobs are submitted, the hpc_runner job is submitted to SLURM if it is not already in the queue or already running on HPC.  When the hpc_runner starts, it processes any AMP jobs in the queue, including any that were submitted after the hpc_runner was submitted to SLURM or while the hpc_runner was processing earlier jobs in the same invocation.

This allows all of the AMP HPC requests to complete in the minimum amount of time.

HPC Batch Components

MGMs

Each MGM generates a YAML file describing the job it wants done on the HPC cluster and then waits for the results.  A Python library with a single function (submit_and_wait) puts the information into a dropbox and waits for completion, returning the status of the job.  The MGM doesn't need to know anything about HPC – just the location of the system dropbox.  As such, the MGM code for a typical HPC invocation is less than 50 lines of code.
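
As an illustration, an MGM invocation might look roughly like the sketch below. The module name (hpc_submit), the keyword arguments, and the shape of the returned job record are assumptions for illustration, not the actual AMP API.

#!/usr/bin/env python3
# Illustrative MGM sketch. The hpc_submit module, its argument names, and the
# returned dictionary layout are hypothetical stand-ins for the real library.
import sys
from hpc_submit import submit_and_wait   # hypothetical AMP helper library

def main():
    dropbox, input_mp3, output_seg = sys.argv[1:4]
    # Describe the work in logical terms; the HPC side maps these names into
    # the workspace's input/ and output/ directories.
    job = submit_and_wait(
        dropbox=dropbox,
        script="ina_speech_segmenter",
        input_map={"input": input_mp3},
        output_map={"segments": output_seg},
    )
    # The MGM only sees the final status; all HPC details stay hidden.
    sys.exit(0 if job["job"]["status"] == "ok" else 1)

if __name__ == "__main__":
    main()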

HPC_Submitter

The HPC submitter is a cron job that runs every minute on the AMP machines and performs three tasks (a condensed sketch follows the list):

  1. Any new jobs in the dropbox are pushed to the HPC cluster
    1. A workspace directory is created on HPC
    2. Local input files are copied to the workspace
    3. A job.sh script is created on HPC from the parameters in the job
    4. The job is marked as submitted
  2. The hpc_queuer program is run on the HPC system (see below)
  3. Any jobs which are complete on HPC are resolved
    1. If the workspace has a finished.out file, the job is finished
    2. Retrieve the return code, stdout, and stderr for the job
    3. If the job was successful, copy the output files from HPC to the local machine
    4. Mark the job as finished so the submit_and_wait call in the MGM can complete
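
The sketch below condenses that loop. The dropbox location and the helper functions (push_to_hpc, run_hpc_queuer, job_is_finished, collect_results) are placeholders for the real remote-copy and ssh plumbing, which is not shown.

# Sketch of the hpc_submitter cron job; helpers with "..." bodies are
# placeholders for plumbing not shown here.
from pathlib import Path
import yaml

DROPBOX = Path("/srv/amp/dropbox")          # hypothetical path from hpc_submitter.ini

def push_to_hpc(job: dict) -> None: ...     # 1.1-1.3: workspace, inputs, job.sh
def run_hpc_queuer() -> None: ...           # 2: invoke hpc_queuer on the HPC host
def job_is_finished(job: dict) -> bool: ... # 3.1: does finished.out exist?
def collect_results(job: dict) -> None: ... # 3.2-3.3: rc/stdout/stderr, copy outputs

def main() -> None:
    # Task 1: push any new .job files to HPC and mark them submitted.
    for job_file in DROPBOX.glob("*.job"):
        job = yaml.safe_load(job_file.read_text())
        push_to_hpc(job)
        job_file.write_text(yaml.safe_dump(job))
        job_file.rename(job_file.with_suffix(".job.submitted"))

    # Task 2: make sure the hpc_runner gets (re)queued if there is work to do.
    run_hpc_queuer()

    # Task 3: resolve any jobs whose finished.out has appeared.
    for job_file in DROPBOX.glob("*.job.submitted"):
        job = yaml.safe_load(job_file.read_text())
        if job_is_finished(job):
            collect_results(job)
            job_file.write_text(yaml.safe_dump(job))
            finished_name = job_file.name.replace(".job.submitted", ".job.finished")
            job_file.rename(job_file.with_name(finished_name))

if __name__ == "__main__":
    main()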

HPC_Queuer

The hpc_queuer script will check if there are any outstanding AMP tasks and will queue the hpc_runner if it is not currently in SLURM (either pending or running). 

If the hpc_runner needs to be submitted to SLURM, a trampoline script is created which calls the hpc_runner with the appropriate arguments, and the trampoline is submitted to SLURM.  The trampoline is required because when a script is submitted to SLURM, SLURM copies it to a temporary directory and runs it from there, which means that any script-relative paths do not work.
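
A minimal sketch of the check-and-submit logic follows. The trampoline location, the SLURM job name, and the sbatch options are assumptions; only the hpc_runner path comes from the example later on this page.

# Sketch of hpc_queuer: submit the hpc_runner via a trampoline script unless
# it is already pending or running in SLURM.
import subprocess
from pathlib import Path

RUNNER = "/geode2/home/u010/bdwheele/Carbonate/hpc_batch/on_hpc/scripts/hpc_runner"
TRAMPOLINE = Path("/tmp/hpc_runner_trampoline.sh")   # hypothetical location
JOB_NAME = "hpc_runner"                              # assumed SLURM job name

def outstanding_amp_jobs() -> bool: ...              # placeholder: scan workspaces

def runner_in_slurm() -> bool:
    # squeue: -h drops the header, -n filters by job name, -u by user.
    out = subprocess.run(["squeue", "-h", "-n", JOB_NAME, "-u", "bdwheele"],
                         capture_output=True, text=True, check=True)
    return bool(out.stdout.strip())

def queue_runner() -> None:
    # The trampoline calls hpc_runner by absolute path, so it does not matter
    # that SLURM copies the submitted script to a temporary directory.
    TRAMPOLINE.write_text(f"#!/bin/bash\n{RUNNER}\n")
    TRAMPOLINE.chmod(0o755)
    subprocess.run(["sbatch", "--job-name", JOB_NAME, str(TRAMPOLINE)], check=True)

if __name__ == "__main__":
    if outstanding_amp_jobs() and not runner_in_slurm():
        queue_runner()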

The script has to be triggered by the hpc_submitter because cron jobs are not allowed on the HPC system.  Because the hpc_queuer is always started by the hpc_submitter, the system is fail-safe:  if jobs on HPC were interrupted or the hpc_runner fails, the hpc_queuer will requeue the hpc_runner as long as there are incomplete jobs.

HPC_Runner

When invoked by the trampoline script via SLURM, the hpc_runner will (a sketch follows the list):

  1. Scan the workspaces for any AMP jobs that need to be processed
    1. If there are none, it will exit
  2. For each outstanding job:
    1. Run the job.sh that was created by the hpc_submitter, capturing the return code, stderr, and stdout
    2. Write the return code to finished.out so hpc_submitter knows the job has completed
  3. Go to 1
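
A minimal sketch of that loop, assuming a workspace layout like the one shown in the example below (a job.sh per workspace and a finished.out written when it completes); the workspace root path is taken from the example but should be treated as illustrative.

# Sketch of the hpc_runner loop: run each outstanding job.sh, capture its
# output, and write the return code to finished.out.
import subprocess
from pathlib import Path

WORKSPACES = Path("/N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace")

def outstanding_jobs():
    # A workspace is outstanding if it has a job.sh but no finished.out yet.
    return [d for d in WORKSPACES.iterdir()
            if (d / "job.sh").exists() and not (d / "finished.out").exists()]

def run_job(workspace: Path) -> None:
    proc = subprocess.run(["bash", str(workspace / "job.sh")],
                          capture_output=True, text=True)
    (workspace / "stdout.txt").write_text(proc.stdout)
    (workspace / "stderr.txt").write_text(proc.stderr)
    # finished.out is the signal that hpc_submitter polls for.
    (workspace / "finished.out").write_text(f"{proc.returncode}\n")

if __name__ == "__main__":
    while True:
        jobs = outstanding_jobs()
        if not jobs:            # step 1.1: nothing left to do, exit
            break
        for workspace in jobs:  # step 2: run each job and record the result
            run_job(workspace)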


A real-life example

This is an example using a single invocation of the INA Speech Segmenter HPC MGM.

First, Galaxy (or a user by hand) runs this command:

./ina_speech_segmenter.py ../dropbox/ ~/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3 /tmp/test2.seg

This tells the MGM to create a job in ../dropbox that extracts the segments from the .mp3 file and places the output in /tmp/test2.seg.

In the dropbox, a YAML file (tmp8u48rlln.job) is created with this information:

input_map:
  input: /home/bdwheele/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3
output_map:
  segments: /tmp/test2.seg
script: ina_speech_segmenter

This job file carries all of the details derived from the command-line parameters.  The MGM script then waits for completion, checking every few seconds.
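
The wait half of submit_and_wait can be as simple as polling the dropbox until the job file has been renamed with a .finished suffix (as happens later in this example). A sketch, with the poll interval chosen arbitrarily:

# Sketch of the waiting loop inside submit_and_wait: poll the dropbox until
# the .job file has been renamed to .job.finished, then return its contents.
import time
from pathlib import Path
import yaml

def wait_for_job(job_file: Path, poll_seconds: int = 10) -> dict:
    finished = job_file.with_suffix(".job.finished")
    while not finished.exists():
        time.sleep(poll_seconds)
    return yaml.safe_load(finished.read_text())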

Within a minute, the hpc_submitter cron job runs:

./hpc_submitter hpc_submitter.ini

The submitter sees the .job file and submits it to HPC:  the workspace is created and the job.sh file is generated.  There is a template for each tool, and for ina_speech_segmenter, the template is:

#!/bin/bash
module load singularity
exec singularity run --nv \
        --bind <<workspace>>:/mnt \
        <<containers>>/inaspeech.sif \
        /mnt/input/<<input>> \
        /mnt/output/<<segments>>

The <<...>> placeholder variables are replaced to generate the job.sh:

#!/bin/bash
module load singularity
exec singularity run --nv \
        --bind /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725:/mnt \
        /N/u/bdwheele/Carbonate/containers/inaspeech.sif \
        /mnt/input/01-Chapter_0A_-_Introduction.mp3 \
        /mnt/output/test2.seg
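
The substitution itself is plain string replacement of the <<...>> placeholders. A sketch, where the function signature and template path handling are assumptions rather than the actual hpc_submitter code:

# Sketch of rendering job.sh from a tool template by replacing the <<name>>
# placeholders; only the placeholders visible in the template above are used.
from pathlib import Path

def render_job_sh(template_path: Path, workspace: str, containers: str,
                  input_name: str, segments_name: str) -> str:
    text = template_path.read_text()
    for key, value in {"workspace": workspace,
                       "containers": containers,
                       "input": input_name,
                       "segments": segments_name}.items():
        text = text.replace(f"<<{key}>>", value)
    return text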

Then the job file is modified to hold the required HPC information:

hpc:
  finished_file: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/finished.out
  input_map:
    input: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/input/01-Chapter_0A_-_Introduction.mp3
  output_map:
    segments: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/output/test2.seg
  workspace: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725
input_map:
  input: /home/bdwheele/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3
output_map:
  segments: /tmp/test2.seg
script: ina_speech_segmenter

and the file is renamed to tmp8u48rlln.job.submitted to indicate the job is in progress.

hpc_submitter then calls hpc_queuer on the HPC system, which sees that the hpc_runner is not in SLURM:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             70314        dl     bash ninoparu  R   10:44:43      1 dl8
             70365        dl      com  shishuz  R      24:26      1 dl3
             70175        dl      lin   timlai  R   19:52:52      1 dl7
             70174        dl    model   timlai  R   19:53:01      1 dl5
             70013        dl      emb   timlai  R 1-17:30:43      1 dl5

so this trampoline script is created to launch hpc_runner via SLURM:

#!/bin/bash
/geode2/home/u010/bdwheele/Carbonate/hpc_batch/on_hpc/scripts/hpc_runner

At that point, SLURM is aware of the job and will run it when resources become available:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             70367        dl hpc_runn bdwheele  R       0:00      1 dl1
             70314        dl     bash ninoparu  R   10:44:54      1 dl8
             70365        dl      com  shishuz  R      24:37      1 dl3
             70175        dl      lin   timlai  R   19:53:03      1 dl7
             70174        dl    model   timlai  R   19:53:12      1 dl5
             70013        dl      emb   timlai  R 1-17:30:54      1 dl5

When hpc_runner is finished running the job.sh file, the directory structure looks like:

total 256
drwxrwxr-x 4 bdwheele bdwheele 32768 Sep  2 10:57 .
drwxrwxr-x 3 bdwheele bdwheele 32768 Sep  2 10:56 ..
-rw-rw-r-- 1 bdwheele bdwheele     2 Sep  2 10:57 finished.out
drwxrwxr-x 2 bdwheele bdwheele 32768 Sep  2 10:56 input
-rwxr-x--- 1 bdwheele bdwheele   318 Sep  2 10:56 job.sh
drwxrwxr-x 2 bdwheele bdwheele 32768 Sep  2 10:57 output
-rw-rw-r-- 1 bdwheele bdwheele  6734 Sep  2 10:57 stderr.txt
-rw-rw-r-- 1 bdwheele bdwheele   259 Sep  2 10:57 stdout.txt

./input:
total 576
drwxrwxr-x 2 bdwheele bdwheele  32768 Sep  2 10:56 .
drwxrwxr-x 4 bdwheele bdwheele  32768 Sep  2 10:57 ..
-rw-rw-r-- 1 bdwheele bdwheele 499840 Sep  2 10:56 01-Chapter_0A_-_Introduction.mp3

./output:
total 64
drwxrwxr-x 2 bdwheele bdwheele 32768 Sep  2 10:57 .
drwxrwxr-x 4 bdwheele bdwheele 32768 Sep  2 10:57 ..
-rw-r--r-- 1 bdwheele bdwheele    90 Sep  2 10:57 test2.seg

During this process, the hpc_submitter process has been running every minute, waiting for the finished.out file to appear.  When it does, the output files and process status information are collected and the workspace is cleaned up.  The .job file contains the updated information and is renamed to tmp8u48rlln.job.finished so it can be discovered by the MGM.  The file now has additional information in it, in the 'job' node:

hpc:
  finished_file: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/finished.out
  input_map:
    input: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/input/01-Chapter_0A_-_Introduction.mp3
  output_map:
    segments: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725/output/test2.seg
  workspace: /N/u/bdwheele/Carbonate/hpc_batch/on_hpc/workspace/wombat-2281849691-1599058616.0142725
input_map:
  input: /home/bdwheele/projects/inaspeech_carbonate/01-Chapter_0A_-_Introduction.mp3
job:
  message: Successful
  rc: 0
  status: ok
  stderr: "singularity version 3.5.2 loaded.\n2020-09-02 10:57:03.330071: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudart.so.10.1\n2020-09-02 10:57:09.523235:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcuda.so.1\n2020-09-02 10:57:09.524845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716]\
    \ Found device 0 with properties: \npciBusID: 0000:58:00.0 name: Tesla V100-PCIE-32GB\
    \ computeCapability: 7.0\ncoreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB\
    \ deviceMemoryBandwidth: 836.37GiB/s\n2020-09-02 10:57:09.525199: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudart.so.10.1\n2020-09-02 10:57:09.751912:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcublas.so.10\n2020-09-02 10:57:09.795948: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcufft.so.10\n2020-09-02 10:57:09.996329:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcurand.so.10\n2020-09-02 10:57:10.096515: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcusolver.so.10\n2020-09-02 10:57:10.171751:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcusparse.so.10\n2020-09-02 10:57:10.314215: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudnn.so.7\n2020-09-02 10:57:10.316617:\
    \ I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu\
    \ devices: 0\n2020-09-02 10:57:10.317230: I tensorflow/core/platform/cpu_feature_guard.cc:142]\
    \ This TensorFlow binary is optimized with oneAPI Deep Neural Network Library\
    \ (oneDNN)to use the following CPU instructions in performance-critical operations:\
    \  AVX2 AVX512F FMA\nTo enable them in other operations, rebuild TensorFlow with\
    \ the appropriate compiler flags.\n2020-09-02 10:57:10.332543: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104]\
    \ CPU Frequency: 2600000000 Hz\n2020-09-02 10:57:10.332983: I tensorflow/compiler/xla/service/service.cc:168]\
    \ XLA service 0x8428250 initialized for platform Host (this does not guarantee\
    \ that XLA will be used). Devices:\n2020-09-02 10:57:10.333256: I tensorflow/compiler/xla/service/service.cc:176]\
    \   StreamExecutor device (0): Host, Default Version\n2020-09-02 10:57:10.446889:\
    \ I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x8437ca0 initialized\
    \ for platform CUDA (this does not guarantee that XLA will be used). Devices:\n\
    2020-09-02 10:57:10.447323: I tensorflow/compiler/xla/service/service.cc:176]\
    \   StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0\n\
    2020-09-02 10:57:10.450031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716]\
    \ Found device 0 with properties: \npciBusID: 0000:58:00.0 name: Tesla V100-PCIE-32GB\
    \ computeCapability: 7.0\ncoreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB\
    \ deviceMemoryBandwidth: 836.37GiB/s\n2020-09-02 10:57:10.450384: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudart.so.10.1\n2020-09-02 10:57:10.450559:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcublas.so.10\n2020-09-02 10:57:10.450721: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcufft.so.10\n2020-09-02 10:57:10.450844:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcurand.so.10\n2020-09-02 10:57:10.450970: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcusolver.so.10\n2020-09-02 10:57:10.451853:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcusparse.so.10\n2020-09-02 10:57:10.452025: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudnn.so.7\n2020-09-02 10:57:10.453998:\
    \ I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu\
    \ devices: 0\n2020-09-02 10:57:10.455794: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudart.so.10.1\n2020-09-02 10:57:13.697125:\
    \ I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect\
    \ StreamExecutor with strength 1 edge matrix:\n2020-09-02 10:57:13.697519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]\
    \      0 \n2020-09-02 10:57:13.697711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276]\
    \ 0:   N \n2020-09-02 10:57:13.704263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402]\
    \ Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with\
    \ 30132 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci\
    \ bus id: 0000:58:00.0, compute capability: 7.0)\n2020-09-02 10:57:16.160694:\
    \ I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully\
    \ opened dynamic library libcublas.so.10\n2020-09-02 10:57:17.334465: I tensorflow/stream_executor/platform/default/dso_loader.cc:48]\
    \ Successfully opened dynamic library libcudnn.so.7\n/usr/local/lib/python3.6/dist-packages/sidekit/bosaris/detplot.py:40:\
    \ MatplotlibDeprecationWarning: The 'warn' parameter of use() is deprecated since\
    \ Matplotlib 3.1 and will be removed in 3.3.  If any parameter follows 'warn',\
    \ they should be pass as keyword, not positionally.\n  matplotlib.use('PDF', warn=False,\
    \ force=True)\n/usr/local/lib/python3.6/dist-packages/pyannote/algorithms/utils/viterbi.py:88:\
    \ FutureWarning: arrays to stack must be passed as a \"sequence\" type such as\
    \ list or tuple. Support for non-sequence iterables such as generators is deprecated\
    \ as of NumPy 1.16 and will raise an error in the future.\n  for e, c in six.moves.zip(emission.T,\
    \ consecutive)\n/usr/local/lib/python3.6/dist-packages/pyannote/algorithms/utils/viterbi.py:97:\
    \ FutureWarning: arrays to stack must be passed as a \"sequence\" type such as\
    \ list or tuple. Support for non-sequence iterables such as generators is deprecated\
    \ as of NumPy 1.16 and will raise an error in the future.\n  for e, c in six.moves.zip(constraint.T,\
    \ consecutive)\n/usr/local/lib/python3.6/dist-packages/inaSpeechSegmenter/segmenter.py:60:\
    \ RuntimeWarning: invalid value encountered in subtract\n  data = (data - np.mean(data,\
    \ axis=1).reshape((len(data), 1))) / np.std(data, axis=1).reshape((len(data),\
    \ 1))\n/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:193: RuntimeWarning:\
    \ invalid value encountered in subtract\n  x = asanyarray(arr - arrmean)\n"
  stdout: 'INFO: Container was created Fri Aug 14 19:17:31 UTC 2020

    INFO: Arguments received: /mnt/input/01-Chapter_0A_-_Introduction.mp3 /mnt/output/test2.seg

    INFO: Output will be created in /mnt''s bind

    batch_processing 1 files

    1/1 [(''/mnt/output/test2.seg'', 0, ''ok'')]

    '
output_map:
  segments: /tmp/test2.seg
script: ina_speech_segmenter





