The following is a report on the results of a test of two computing environments for a Kaldi-based transcript workflow in AMP: a high-performance computing (HPC) environment in IU's Carbonate computing cluster, and a local environment in AMPPD. While it was (correctly) assumed that the HPC environment would perform significantly better than the local environment, the size of the improvement was not known. To determine this, the 32 primary files (totaling 10 hours of content) that had been submitted to the HMGM workflow were submitted to two new workflows in Galaxy, each consisting of the INA speech segmenter and Kaldi running in the respective environment. Timing data for each workflow was collected in an Excel workbook, saved as a CSV, and imported into Python for basic statistical analysis in Pandas. The results are reported below.
The data was collected by hand from Galaxy into an Excel workbook. It consists of start and end times for each primary file in each workflow, as well as running times for INA and Kaldi in each workflow. Total running time (wall time) was additionally derived for each file in each workflow from its start and end times. All running times are presented in seconds; the INA and Kaldi times for the HPC workflow are presented to six decimal places.
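The wall-time derivation described above can be sketched in Pandas as follows. This is a minimal illustration, not the script used for this report: the column names `start` and `end`, the file names, and the timestamps are all hypothetical.

```python
import pandas as pd

# Hypothetical column names and made-up timestamps, for illustration only.
timings = pd.DataFrame({
    "file": ["example_a", "example_b"],
    "start": ["2021-01-01 10:00:00", "2021-01-01 10:00:00"],
    "end": ["2021-01-01 10:45:36", "2021-01-01 11:00:00"],
})

def add_wall_time(df, start_col="start", end_col="end"):
    """Return a copy of df with a wall_time_s column (end minus start, in seconds)."""
    out = df.copy()
    start = pd.to_datetime(out[start_col])
    end = pd.to_datetime(out[end_col])
    # Subtracting two datetime columns yields timedeltas; convert them to seconds.
    out["wall_time_s"] = (end - start).dt.total_seconds()
    return out

result = add_wall_time(timings)
print(result[["file", "wall_time_s"]])
```

Parsing the timestamps first, rather than subtracting strings or Excel serial numbers by hand, keeps the arithmetic correct across hour and day boundaries.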
| Elapsed Time (INA HPC) | Elapsed Time (Kaldi HPC) | Wall Time (HPC) | Elapsed Time (INA Local) | Elapsed Time (Kaldi Local) | Wall Time (Local) |
(All times in seconds)
The results show a typical per-file time reduction of 79.52% in the HPC environment relative to the local environment. This figure is based on the median total (wall) times for each workflow: 2,736 seconds (45 minutes and 36 seconds) for HPC versus 13,356.5 seconds (3 hours, 42 minutes, and 36.5 seconds) for local. The median was used instead of the mean because the data for the local workflow was right-skewed by three extreme outliers. In other words, a typical file running through the HPC workflow can be expected to finish processing in roughly 20 percent of the time the same file would take in the local workflow. This is a substantial performance increase. It is made more substantial by the fact that the HPC workflow can handle a larger number of files at once: while the exact maximum for each workflow is unknown, all 32 files were submitted simultaneously to the HPC workflow, whereas they had to be split into groups of 6 for the local workflow.
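The arithmetic behind these figures can be checked with a short sketch. The two median wall times are taken from this report; the helper names `percent_reduction` and `format_hms` and the `skewed` sample are hypothetical and only illustrate the calculation and the reason for preferring the median.

```python
from statistics import mean, median

def percent_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline * 100

def format_hms(seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return int(hours), int(minutes), secs

# Median wall times reported above, in seconds.
hpc_median = 2736
local_median = 13356.5

wall_reduction = percent_reduction(local_median, hpc_median)
print(round(wall_reduction, 2))   # 79.52
print(format_hms(hpc_median))     # (0, 45, 36)
print(format_hms(local_median))   # (3, 42, 36.5)

# Made-up sample (not the measured data): one extreme value pulls the mean
# far above the bulk of the values, while the median stays put.
skewed = [100, 110, 120, 130, 5000]
print(median(skewed), mean(skewed))
```

With right-skewed data like the made-up sample, the mean describes no actual file well, which is why the median is the more representative per-file figure.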
For Kaldi, the typical per-file time reduction is 99.28%, based on median running times of roughly 44 seconds in the HPC workflow and 6,122 seconds in the local workflow. For the INA speech segmenter, the reduction is 94.84% (median times of 31.24 seconds and 605.5 seconds, respectively). Both individual tools therefore show an even larger speedup in the HPC workflow than the total wall times do.
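A per-tool comparison of this kind could be computed in Pandas roughly as follows. This is a sketch under stated assumptions: the column names follow the table header above, the values are made up and are not the measured data, and `median_reduction` is a hypothetical helper.

```python
import pandas as pd

# Made-up timings in seconds; NOT the measured data from this report.
timings = pd.DataFrame({
    "Elapsed Time (INA HPC)": [10.0, 20.0, 30.0],
    "Elapsed Time (INA Local)": [100.0, 200.0, 900.0],
    "Elapsed Time (Kaldi HPC)": [40.0, 50.0, 60.0],
    "Elapsed Time (Kaldi Local)": [4000.0, 5000.0, 30000.0],
})

def median_reduction(df, tool):
    """Percent reduction of the HPC median relative to the local median for one tool."""
    hpc = df[f"Elapsed Time ({tool} HPC)"].median()
    local = df[f"Elapsed Time ({tool} Local)"].median()
    return (local - hpc) / local * 100

reductions = {tool: median_reduction(timings, tool) for tool in ("INA", "Kaldi")}
print(reductions)
```

Applied to the medians reported above, the same formula gives (6,122 - 44) / 6,122 ≈ 99.28% for Kaldi and (605.5 - 31.24) / 605.5 ≈ 94.84% for INA.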