Category description and use cases
Speech-to-text (STT) MGMs produce a transcription of speech in an audio file. None of these tools generates a 100% accurate transcript, but machine-generated transcripts are useful as substitutes for labor-intensive human-generated transcripts when "close" is good enough, or for expediting the creation of a human-generated transcript. STT transcripts are also useful as a starting point for any workflow that depends on the spoken content of the media, such as named-entity recognition (NER), topic clustering, or sentiment analysis.
Workflow example:
Audio is passed through a segmenter MGM to label speech, silence, and music. If necessary, the audio file is split into segments of speech, and a new file composed of only the speech segments is sent through a speech-to-text MGM to generate a transcript. If necessary, timestamps are then adjusted to restore the original segments of silence and music.
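The timestamp-restoration step is the least obvious part of this workflow. Below is a minimal sketch in Python, assuming the segmenter returns speech regions as (start, end) pairs in the original timeline and the STT MGM returns word dicts timed against the concatenated speech-only file; both interfaces are hypothetical stand-ins, not actual AMP APIs.

def restore_timestamps(words, speech_segments):
    # words: [{"text": str, "start": float, "end": float}] in seconds,
    # timed against the concatenated speech-only audio.
    # speech_segments: [(start, end)] in the ORIGINAL timeline, in order.
    restored = []
    elapsed = 0.0  # seconds of speech consumed so far
    for seg_start, seg_end in speech_segments:
        seg_len = seg_end - seg_start
        for w in words:
            if elapsed <= w["start"] < elapsed + seg_len:
                # Shift the word forward by the amount of silence/music
                # that was removed before this segment.
                shift = seg_start - elapsed
                restored.append({**w, "start": w["start"] + shift, "end": w["end"] + shift})
        elapsed += seg_len
    return restored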
Output standard
Summary:
Element | Datatype | Obligation | Definition |
media | object | required | Wrapper for metadata about the source media file. |
media.filename | string | required | Filename of the source file. |
media.duration | string | required | The duration of the source file audio. |
results | object | required | Wrapper for transcription results. |
results.transcript | string | required | The full text string of the transcription. |
results.words | array | required | Wrapper for timecoded words in the transcript. |
results.words[*].type | string (pronunciation | punctuation) | required | Type of text, pronunciation or punctuation. |
results.words[*].start | string (s.fff) | required if words[*].type is “pronunciation” | Start time of the word, in seconds. |
results.words[*].end | string (s.fff) | required if words[*].type is “pronunciation” | End time of the word, in seconds. |
results.words[*].text | string | required | The text of the word. |
results.words[*].score | object | optional | A confidence or relevance score for the word. |
results.words[*].score.type | string (confidence | relevance) | required | The type of score, confidence or relevance. |
results.words[*].score.scoreValue | number | required | The score value, typically a float in the range of 0-1. |
JSON Schema
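The authoritative schema is maintained with the AMP specifications; the following is a sketch reconstructed from the summary table above. (The conditional requirement that start/end be present for pronunciation items is not expressible in this simple form.)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["media", "results"],
  "properties": {
    "media": {
      "type": "object",
      "required": ["filename", "duration"],
      "properties": {
        "filename": { "type": "string" },
        "duration": { "type": "string" }
      }
    },
    "results": {
      "type": "object",
      "required": ["transcript", "words"],
      "properties": {
        "transcript": { "type": "string" },
        "words": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["type", "text"],
            "properties": {
              "type": { "enum": ["pronunciation", "punctuation"] },
              "start": { "type": "string" },
              "end": { "type": "string" },
              "text": { "type": "string" },
              "score": {
                "type": "object",
                "required": ["type", "scoreValue"],
                "properties": {
                  "type": { "enum": ["confidence", "relevance"] },
                  "scoreValue": { "type": "number" }
                }
              }
            }
          }
        }
      }
    }
  }
}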
Sample output
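An illustrative record conforming to the summary table (all values invented):

{
  "media": { "filename": "forum.wav", "duration": "14.910" },
  "results": {
    "transcript": "This morning,",
    "words": [
      { "type": "pronunciation", "start": "6.030", "end": "6.180", "text": "this",
        "score": { "type": "confidence", "scoreValue": 1.0 } },
      { "type": "pronunciation", "start": "6.180", "end": "6.470", "text": "morning",
        "score": { "type": "confidence", "scoreValue": 1.0 } },
      { "type": "punctuation", "text": "," }
    ]
  }
}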
Recommended tool(s)
AWS Transcribe
URL: https://aws.amazon.com/transcribe/
Official documentation: AWS Transcribe Developer Guide
Basic information
Open source or proprietary | Proprietary |
Cost | $0.0004/second of audio |
Input | “supports both 16 kHz and 8kHz audio streams, and multiple audio encodings, including WAV, MP3, MP4 and FLAC.” (from FAQ) |
Output | JSON file (ex. https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-cli.html) |
Speaker diarization/identification | Yes |
Languages Supported | English, French, German, Italian, Korean, Portuguese, Spanish |
Other features | |
Custom vocabulary | Yes |
Programming languages | .NET, Go, Java, JavaScript, PHP, Python, and Ruby |
Training data | Unknown/black box |
Privacy/Access | “Amazon Transcribe may store and use voice inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Transcribe and other Amazon machine-learning/artificial-intelligence technologies.” “Only authorized employees will have access to your content that is processed by Amazon Transcribe.” |
Evaluation
Input formats | 16 kHz and 8 kHz WAV, MP3, MP4, and FLAC
Output formats | JSON output needs to be reshaped by script into text or VTT formats for use in production (a sketch of such a script follows the notes below).
Accuracy | 58% average accuracy (or 42% WER) across our samples |
Processing time | 0.22x real-time
Computing resources required | N/A (cloud-based) |
Growth rate | N/A |
Social impact | AWS Transcribe's algorithm is proprietary; it is unknown what pre-processing steps or models are used to generate transcripts or how results will change over time as the algorithm and models are updated. Custom vocabularies can be used to train the model for certain words, but this training may not be useful for heterogeneous groups of materials. AWS offers only a vague explanation of how it will use your data to "improve" its services, which may make it an undesirable choice for processing materials that will not be made public. |
Cost | $0.0004/second of audio |
Support | AWS offers detailed API documentation for Transcribe. Logs can be generated for jobs to assess success or completion of jobs. |
Integration capabilities | AWS makes it easy to integrate Transcribe into a pipeline of other AWS services, such as S3 for storage and Comprehend for NLP. If not using S3 for primary storage, assets must still be transferred to S3 in order to run them through Transcribe. |
Training | Accuracy would likely benefit from using custom vocabularies, but they would need to be created custom for each group of like materials. |
Notes: AWS returns diarization data in addition to speech-to-text. The AWS adapter should output both STT JSON (see schema above) and diarization/segmentation JSON. See below for details.
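As noted under Output formats above, the raw Transcribe JSON must be reshaped for production use. A minimal Python sketch of a JSON-to-WebVTT conversion; the seven-words-per-cue grouping is an arbitrary choice, and punctuation items are dropped for simplicity.

import json

def transcribe_json_to_vtt(path, words_per_cue=7):
    # Group pronunciation items into fixed-size cues.
    with open(path) as f:
        items = json.load(f)["results"]["items"]
    words = [i for i in items if i["type"] == "pronunciation"]

    def ts(seconds):
        # Format seconds as a WebVTT timestamp, e.g. "00:00:04.950".
        seconds = float(seconds)
        h, m = divmod(int(seconds) // 60, 60)
        return f"{h:02d}:{m:02d}:{seconds - h * 3600 - m * 60:06.3f}"

    lines = ["WEBVTT", ""]
    for n in range(0, len(words), words_per_cue):
        cue = words[n:n + words_per_cue]
        lines.append(f'{ts(cue[0]["start_time"])} --> {ts(cue[-1]["end_time"])}')
        lines.append(" ".join(w["alternatives"][0]["content"] for w in cue))
        lines.append("")
    return "\n".join(lines)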
See Adapters
Example Usage
[Tested through the AWS Console]
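Our tests were run through the console, but the same job can be submitted via the API. A minimal boto3 sketch; the job name, bucket, and file below are hypothetical, and the asset must already live in S3.

import boto3

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="student-admin-forum",          # hypothetical job name
    Media={"MediaFileUri": "s3://my-bucket/forum.wav"},  # hypothetical S3 asset
    MediaFormat="wav",
    LanguageCode="en-US",
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 10},
)
# Poll until the job leaves IN_PROGRESS, then fetch the result JSON
# from the TranscriptFileUri in the response.
job = transcribe.get_transcription_job(TranscriptionJobName="student-admin-forum")
print(job["TranscriptionJob"]["TranscriptionJobStatus"])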
Example Output
{ "jobName": "Student-admin_forum", "accountId": "751719168081", "results": { "transcripts": [ { "transcript": "So the 17th this morning, demonstrators blocked of access to the Brian administration building, and after that, picket lines were established around Ballantine and Rolls halls." } ],
"speaker_labels": { "speakers": 10, "segments": [ { "start_time": "3.94", "speaker_label": "spk_0", "end_time": "4.95", "items": [ { "start_time": "3.94", "speaker_label": "spk_0", "end_time": "4.17" }, { "start_time": "4.18", "speaker_label": "spk_0", "end_time": "4.31" }, { "start_time": "4.31", "speaker_label": "spk_0", "end_time": "4.95" } ] }, { "start_time": "6.03", "speaker_label": "spk_0", "end_time": "14.91", "items": [ { "start_time": "6.03", "speaker_label": "spk_0", "end_time": "6.18" }, { "start_time": "6.18", "speaker_label": "spk_0", "end_time": "6.47" }, { "start_time": "6.47", "speaker_label": "spk_0", "end_time": "7.18" }, { "start_time": "7.19", "speaker_label": "spk_0", "end_time": "7.85" }, { "start_time": "8.34", "speaker_label": "spk_0", "end_time": "8.5" }, { "start_time": "8.51", "speaker_label": "spk_0", "end_time": "8.98" }, { "start_time": "8.98", "speaker_label": "spk_0", "end_time": "9.13" }, { "start_time": "9.13", "speaker_label": "spk_0", "end_time": "9.25" }, { "start_time": "9.25", "speaker_label": "spk_0", "end_time": "9.59" }, { "start_time": "9.59", "speaker_label": "spk_0", "end_time": "10.29" }, { "start_time": "10.29", "speaker_label": "spk_0", "end_time": "10.69" }, { "start_time": "11.06", "speaker_label": "spk_0", "end_time": "11.24" }, { "start_time": "11.25", "speaker_label": "spk_0", "end_time": "11.55" }, { "start_time": "11.55", "speaker_label": "spk_0", "end_time": "11.76" }, { "start_time": "11.77", "speaker_label": "spk_0", "end_time": "12.11" }, { "start_time": "12.11", "speaker_label": "spk_0", "end_time": "12.42" }, { "start_time": "12.42", "speaker_label": "spk_0", "end_time": "12.56" }, { "start_time": "12.56", "speaker_label": "spk_0", "end_time": "13.08" }, { "start_time": "13.08", "speaker_label": "spk_0", "end_time": "13.34" }, { "start_time": "13.34", "speaker_label": "spk_0", "end_time": "13.91" }, { "start_time": "13.91", "speaker_label": "spk_0", "end_time": "14.05" }, { "start_time": "14.05", "speaker_label": "spk_0", "end_time": "14.38" }, { "start_time": "14.38", "speaker_label": "spk_0", "end_time": "14.91" } ] } ] },
"items": [ { "start_time": "3.94", "end_time": "4.17", "alternatives": [ { "confidence": "0.3686", "content": "So" } ], "type": "pronunciation" }, { "start_time": "4.18", "end_time": "4.31", "alternatives": [ { "confidence": "0.9998", "content": "the" } ], "type": "pronunciation" }, { "start_time": "4.31", "end_time": "4.95", "alternatives": [ { "confidence": "0.8866", "content": "17th" } ], "type": "pronunciation" }, { "start_time": "6.03", "end_time": "6.18", "alternatives": [ { "confidence": "1.0000", "content": "this" } ], "type": "pronunciation" }, { "start_time": "6.18", "end_time": "6.47", "alternatives": [ { "confidence": "1.0000", "content": "morning" } ], "type": "pronunciation" }, { "alternatives": [ { "confidence": "0.0000", "content": "," } ], "type": "punctuation" }, { "start_time": "6.47", "end_time": "7.18", "alternatives": [ { "confidence": "0.9997", "content": "demonstrators" } ], "type": "pronunciation" }, { "start_time": "7.19", "end_time": "7.85", "alternatives": [ { "confidence": "1.0000", "content": "blocked" } ], "type": "pronunciation" }, { "start_time": "8.34", "end_time": "8.5", "alternatives": [ { "confidence": "0.7914", "content": "of" } ], "type": "pronunciation" }, { "start_time": "8.51", "end_time": "8.98", "alternatives": [ { "confidence": "1.0000", "content": "access" } ], "type": "pronunciation" }, { "start_time": "8.98", "end_time": "9.13", "alternatives": [ { "confidence": "1.0000", "content": "to" } ], "type": "pronunciation" }, { "start_time": "9.13", "end_time": "9.25", "alternatives": [ { "confidence": "1.0000", "content": "the" } ], "type": "pronunciation" }, { "start_time": "9.25", "end_time": "9.59", "alternatives": [ { "confidence": "0.6497", "content": "Brian" } ], "type": "pronunciation" }, { "start_time": "9.59", "end_time": "10.29", "alternatives": [ { "confidence": "1.0000", "content": "administration" } ], "type": "pronunciation" }, { "start_time": "10.29", "end_time": "10.69", "alternatives": [ { "confidence": "1.0000", "content": "building" } ], "type": "pronunciation" }, { "alternatives": [ { "confidence": "0.0000", "content": "," } ], "type": "punctuation" }, { "start_time": "11.06", "end_time": "11.24", "alternatives": [ { "confidence": "0.9809", "content": "and" } ], "type": "pronunciation" }, { "start_time": "11.25", "end_time": "11.55", "alternatives": [ { "confidence": "1.0000", "content": "after" } ], "type": "pronunciation" }, { "start_time": "11.55", "end_time": "11.76", "alternatives": [ { "confidence": "1.0000", "content": "that" } ], "type": "pronunciation" }, { "alternatives": [ { "confidence": "0.0000", "content": "," } ], "type": "punctuation" }, { "start_time": "11.77", "end_time": "12.11", "alternatives": [ { "confidence": "1.0000", "content": "picket" } ], "type": "pronunciation" }, { "start_time": "12.11", "end_time": "12.42", "alternatives": [ { "confidence": "1.0000", "content": "lines" } ], "type": "pronunciation" }, { "start_time": "12.42", "end_time": "12.56", "alternatives": [ { "confidence": "0.9788", "content": "were" } ], "type": "pronunciation" }, { "start_time": "12.56", "end_time": "13.08", "alternatives": [ { "confidence": "1.0000", "content": "established" } ], "type": "pronunciation" }, { "start_time": "13.08", "end_time": "13.34", "alternatives": [ { "confidence": "0.9956", "content": "around" } ], "type": "pronunciation" }, { "start_time": "13.34", "end_time": "13.91", "alternatives": [ { "confidence": "0.9549", "content": "Ballantine" } ], "type": "pronunciation" }, { "start_time": "13.91", "end_time": "14.05", "alternatives": [ { "confidence": "0.9985", "content": "and" } ], "type": "pronunciation" }, { "start_time": "14.05", "end_time": "14.38", "alternatives": [ { "confidence": "0.2558", "content": "Rolls" } ], "type": "pronunciation" }, { "start_time": "14.38", "end_time": "14.91", "alternatives": [ { "confidence": "0.7122", "content": "halls" } ], "type": "pronunciation" } ] },
"status": "COMPLETED" }
Kaldi
URL: https://github.com/kaldi-asr/kaldi
Official documentation: Kaldi Documentation | HIPSTAS Kaldi instance
Basic information
Open source or proprietary | Open source |
Cost | Free |
Input | 16 kHz WAV, but this containerized version includes a bash script that converts MP3, MP4, WAV, and MOV files to the appropriate spec before sending them to the Kaldi scripts. |
Output | JSON and txt file |
Speaker diarization/identification | No |
Languages Supported | English |
Other features | |
Custom vocabulary | No |
Programming languages | Written in C++; driven by Bash scripts, with Python bindings available |
Other tech notes | This version is containerized. |
Privacy/Access | Will inherit whatever security protocols IU requires in IT systems. |
Evaluation
Input formats | 16kHz WAV audio file |
Output formats | The WGBH fork outputs JSON with word-level timestamps and confidence as well as plain text |
Accuracy | 46% average accuracy (or 54% WER) across our samples |
Processing time | 4x real-time |
Computing resources required | Each container requires at least 1 CPU and 6GB of memory. |
Growth rate | N/A |
Social impact | Kaldi is an open-source tool; if AMP contributes code back, this will benefit the community. Kaldi can be trained, so there is more control over the output, but training a model is difficult and time-consuming. The same level of care would need to be taken for Kaldi as for commercial services to ensure that the risks of unintended consequences are mitigated. |
Cost | Cost is related to cost of IU servers and the average throughput by users. |
Support | Open source community. Support is only via goodwill from the community. Kaldi has an active community forum. |
Integration capabilities | N/A |
Training | Successful training of Kaldi would require one or more expert speech-recognition researchers to manage model development and training for each unique recording scenario (e.g., clear English with one speaker, no background noise, and one specific type of accent; clear English with two speakers, no background noise, and a noticeable Southern accent; etc.). Additionally, different types of content will require different models (e.g., English read from the Wall Street Journal, poetry, weather news, sports news, theater, etc.). |
Installation & requirements
Our version of Kaldi runs using Docker:
docker pull hipstas/kaldi-pop-up-archive
Example Usage
See also this full walkthrough on Google Drive
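The general pattern from the walkthrough is to mount a local directory of media into the container and run the bundled conversion/transcription script; the container name and paths below are illustrative, so check the walkthrough for the exact invocation.

docker run -it --name kaldi --volume ~/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive

Inside the container, the bash script described under Input above converts each file in /audio_in/ and passes it to the Kaldi scripts, writing the JSON and txt output alongside the input.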
Example Output
Other evaluated tools
Google Cloud STT
URL: https://cloud.google.com/speech-to-text/
Basic information
Open source or proprietary | Proprietary |
Cost | $0.006 / 15 seconds ($0.009 / 15 seconds for video) |
Input | Supported encodings: FLAC, LINEAR16, MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE |
Output | JSON |
Speaker diarization/identification | Yes (Beta) |
Languages Supported | |
Custom vocabulary | Phrase hints only: up to 500 phrases per request, 10,000 total characters per request, 100 characters per phrase |
Programming languages | REST API (easy from many languages). Libraries for C#, Go, Java, Node.js, PHP, Python, Ruby |
Privacy/Access | Google offers a lower rate for opting in to data logging, but it is not specific about how the data will be used |
Evaluation
Input formats | Encodings: MP3 (beta), FLAC, LINEAR16, MULAW, AMR, OGG_OPUS. See documentation for required sample rate for each: https://cloud.google.com/speech-to-text/docs/encoding |
Output formats | JSON with word-level timestamps and "sentence"-level confidence (word-level confidence can be enabled). JSON output needs to be reshaped by script into text or VTT formats for use in production. |
Accuracy | 55% average accuracy (or 45% WER) across our samples |
Processing time | 0.16x real-time |
Computing resources required | N/A (cloud-based) |
Growth rate | N/A |
Social impact | Google's algorithm is proprietary; it is unknown what pre-processing steps or models are used to generate transcripts or how results will change over time as the algorithm and models are updated. Phrase hints can be used to tune the model for certain words, but this tuning may not be useful for heterogeneous groups of materials. Google offers a lower rate for opting in to data logging, but is not specific about uses of the data, which may make it an undesirable choice for processing materials that will not be made public. |
Cost | $0.006 / 15 seconds ($0.0004/second) for audio; $0.009 / 15 seconds for video ($0.0006/second) |
Support | Well documented. Professional support team and many community support forums |
Integration capabilities | Google makes it easy to integrate Google Cloud STT into a pipeline of other Google services, such as Google Cloud for storage and Natural Language for NLP. If not using Google Cloud for primary storage, assets must still be transferred to Google Cloud in order to run them through STT. |
Training | Google offers four models to choose from: command_and_search, phone_call, video (a premium model that costs more), and default. (Currently only phone_call supports speaker diarization.) It does not yet offer true custom vocabularies, but does offer "phrase hints" (see the sketch below). |
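A minimal sketch using the google-cloud-speech Python client (v2-style API) showing model selection and phrase hints; the bucket URI and phrases are hypothetical.

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",                    # premium model; higher per-second rate
    enable_word_time_offsets=True,    # word-level timestamps
    enable_word_confidence=True,      # per-word confidence
    speech_contexts=[speech.SpeechContext(phrases=["Ballantine Hall"])],  # phrase hints
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/forum.flac")  # hypothetical asset
response = client.long_running_recognize(config=config, audio=audio).result()
for result in response.results:
    print(result.alternatives[0].transcript)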
Mozilla DeepSpeech
URL: https://github.com/mozilla/DeepSpeech & https://research.mozilla.org/machine-learning/
Official documentation: https://github.com/mozilla/DeepSpeech
Basic information
Open source or proprietary | Open source |
Cost | Compute cost (if running on cloud services) |
Input | Currently only WAVE files with 16-bit, 16 kHz, mono are supported |
Output | Plain text |
Speaker diarization/identification | No |
Languages Supported | English |
Custom vocabulary | Yes |
Training data | Global Speech dataset (from the Common Voice project--diverse range of voices) |
Programming languages | Python, but there are a few other bindings; early Java JNI bindings: https://github.com/mozilla/DeepSpeech/tree/master/native_client/java |
Privacy/Access | Will inherit whatever security protocols IU requires in IT systems. |
Evaluation
Input formats | 16-bit, 16 kHz mono WAV audio file |
Output formats | Plain text only |
Accuracy | 25% average accuracy (or 75% WER) across our samples |
Processing time | 0.25x real-time |
Social impact | Deep Speech can be trained, so there is more control over the output, but training a model is difficult and time consuming. The same level of care would need to be taken for Deep Speech as for commercial services to ensure that the risks of unintended consequences are mitigated. Mozilla claims a more diverse body of training data (Project Common Voice) to represent a wider variety of dialects, so while current outputs are still very inaccurate, future versions may prove to offer more accurate transcripts. |
Cost | Cost is related to cost of IU servers and the average throughput by users. |
Support | Open source community. Support is only via goodwill from the community. Deep Speech has an active user community: https://discourse.mozilla.org/c/deep-speech |
Integration capabilities | Deep Speech does not provide timestamps, so its usefulness may be limited to keyword-searchable transcripts. |
Training | Deep Speech allows training and looks easier to train than Kaldi, but this will still take time and expertise to get desired results. Different types of content will require different models. |
Example Usage
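A minimal sketch with the deepspeech Python package; the single-argument Model() constructor matches the v0.6-era API (earlier releases took additional arguments), and the model path is hypothetical.

import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-models/output_graph.pbmm")  # hypothetical local model path
with wave.open("speech_16khz_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
print(ds.stt(audio))  # plain-text transcript only; no timestamps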
Example Output
PocketSphinx/CMUSphinx
URL & official documentation: https://cmusphinx.github.io/
Basic information
Open source or proprietary | Open source |
Cost | Free |
Input | WAVE (16-bit mono, 8 kHz or 16 kHz) only |
Output | Plain text |
Speaker diarization/identification | |
Languages Supported | Prebuilt models for: Mandarin, Indian English, Catalan, German, Greek, French, Dutch, US English, Spanish, Italian, Hindi, Russian, Kazakh |
Custom vocabulary | |
Training data | Unsure what the prebuilt models are trained on; ability to train your own |
Programming languages | Java, C |
Privacy/Access | Everything is handled locally, so it is what we make it |
Evaluation
Input formats | 16-bit mono WAV audio file (8 kHz or 16 kHz) |
Output formats | Sphinx4 outputs plain text, but timestamps could also be generated with some effort invested in learning the tool (the learning curve is steep). |
Accuracy | 24% average accuracy (or 76% WER) across our samples |
Processing time | 0.65x real-time |
Social impact | Sphinx can be trained, so there is more control over the output, but training a model is difficult and time consuming. The same level of care would need to be taken for Sphinx as for commercial services to ensure that the risks of unintended consequences are mitigated. |
Cost | Cost is related to cost of IU servers and the average throughput by users. |
Support | Open source community. Support is only via goodwill from the community. |
Training | Sphinx 4 allows training, but this will still take time and expertise to get desired results. |
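For reference, a minimal decoding sketch with the pocketsphinx Python package, which bundles a default US English acoustic model; the file name is hypothetical.

from pocketsphinx import AudioFile

# Iterates over recognized utterances in a 16 kHz mono WAV file and
# prints the hypothesis text for each.
for phrase in AudioFile(audio_file="speech_16khz_mono.wav"):
    print(phrase)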