
Category description and use cases

Speech-to-text (STT) MGMs produce a transcription of speech in an audio file. None of these tools generates a 100% accurate transcript, but machine-generated transcripts are useful substitutes for labor-intensive human-generated transcripts when "close" is good enough, or for expediting the creation of a human-generated transcript. STT transcripts are also useful as a starting point for any workflow that cares about the spoken content of the media, such as named-entity recognition (NER), topic clustering, or sentiment analysis.

Workflow example:

Audio is passed through a segmenter MGM to label speech, silence and music. If necessary, the audio file is split into segments of speech. A new file composed of only the speech segments is sent through a speech-to-text MGM to generate transcripts. If necessary, timestamps are adjusted to restore original segments of silence and music.
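The timestamp adjustment in the final step can be sketched in Python. The `restore_timestamps` function and its input shapes are hypothetical (a real segmenter would supply its own structures), but the offset arithmetic is the whole trick:

```python
def restore_timestamps(words, segment_offsets):
    """Shift word timings from segment-relative time back to original-file time.

    words: list of dicts with "start"/"end" (seconds, as floats) and a
    "segment" index; segment_offsets: original start time (seconds) of each
    speech segment in the source file. Both structures are hypothetical;
    the real segmenter output may differ.
    """
    restored = []
    for w in words:
        offset = segment_offsets[w["segment"]]
        restored.append({
            "text": w["text"],
            "start": w["start"] + offset,
            "end": w["end"] + offset,
        })
    return restored

words = [
    {"text": "hello", "start": 0.10, "end": 0.55, "segment": 0},
    {"text": "again", "start": 0.20, "end": 0.70, "segment": 1},
]
# Speech segments originally began at 5.0s and 42.5s in the source file.
print(restore_timestamps(words, [5.0, 42.5]))
```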

Output standard

Summary:

| Element | Datatype | Obligation | Definition |
| --- | --- | --- | --- |
| media | object | required | Wrapper for metadata about the source media file. |
| media.filename | string | required | Filename of the source file. |
| media.duration | string | required | The duration of the source file audio. |
| results | object | required | Wrapper for transcription results. |
| results.transcript | string | required | The full text string of the transcription. |
| results.words | array | required | Wrapper for timecoded words in the transcript. |
| results.words[*].type | string (pronunciation \| punctuation) | required | Type of text, pronunciation or punctuation. |
| results.words[*].start | string (s.fff) | required if words[*].type is "pronunciation" | Start time of the word, in seconds. |
| results.words[*].end | string (s.fff) | required if words[*].type is "pronunciation" | End time of the word, in seconds. |
| results.words[*].text | string | required | The text of the word. |
| results.words[*].score | object | optional | A confidence or relevance score for the word. |
| results.words[*].score.type | string (confidence \| relevance) | required | The type of score, confidence or relevance. |
| results.words[*].score.scoreValue | number | required | The score value, typically a float in the range 0-1. |

JSON Schema

Schema
{
	"$schema": "http://json-schema.org/schema#",
	"type": "object",
	"title": "Speech-to-text Transcription Schema",
    "required": [
        "media",
        "results"
    ], 
    "properties": {
    	"media": {
    		"type": "object",
    		"title": "Media",
            "description": "Wrapper for metadata about the source media file.",
    		"required": [
    			"filename",
    			"duration"
    		],
    		"properties": {
    			"filename": {
    				"type": "string",
    				"title": "Filename",
                    "description": "Filename of the source file.",
    				"default": "",
    				"examples": ["myfile.wav"]
    			},
    			"duration": {
    				"type": "string",
    				"title": "Duration",
                    "description": "Duration of the source file.",
    				"default": "",
    				"examples": ["25.888"]
    			}
    		}
    	},
    	"results": {
    		"type": "object",
    		"title": "Results", 
            "description": "Results from the transcription job.",
    		"required": [
    			"transcript",
    			"words"],
    		"properties": {
    			"transcript": {
    				"type": "string",
    				"title": "Transcript",
                    "description": "A plain text transcript of the transcription output.",
    				"default": "",
    				"examples": ["Professional answer."]
    			},
    			"words": {
    				"type": "array",
    				"title": "Words",
                    "description": "The list of words spoken by speakers in the audio.",
    				"items": {
    					"type": "object",
    					"required": [
    						"type",
    						"text"
    					],
						"properties": {
							"type": {
								"type": "string",
                                "title": "Type",
                                "description": "The type of word, pronunciation or punctuation.",
								"enum": ["pronunciation", "punctuation"]
							},
							"text": {
								"type": "string",
								"title": "Text",
                                "description": "The text of the word.",
								"default": "",
								"examples": ["professional"]
							},
							"start": {
								"type": "string",
								"title": "Start",
                                "description": "Start time of the word, in seconds.",
								"default": "0.0",
								"examples": ["0.690"]
							},
							"end": {
								"type": "string",
								"title": "End",
                                "description": "End time of the word, in seconds.",
								"default": "0.0",
								"examples": ["1.210"]
							},
                            "score": {
                                "type": "object",
                                "title": "score",
                                "description": "A confidence or relevance score for the word.",
                                "required": [
                                    "type",
                                    "scoreValue"
                                ],
                                "properties": {
                                    "type": {
                                        "type": "string",
                                        "title": "Type",
                                        "description": "The type of score, confidence or relevance.",
                                        "enum": [
                                            "confidence",
                                            "relevance"
                                        ]
                                    },
                                    "scoreValue": {
                                        "type": "number",
                                        "title": "Score value",
                                        "description": "The score value, typically a float in the range of 0-1.",
                                        "default": 0,
                                        "examples": [0.437197]
                                    }
                                }
                            }
						}
    				}
    			}
			}
		}
	}
}

Sample output

Sample output
{		
	"media": {
			"filename": "myfile.wav",
			"duration": "1.500"
		},
	"results": {
		"transcript": "Professional answer.",
		"words": [{
			"start": "0.100",
			"end": "0.690",
			"text": "Professional",
			"type": "pronunciation"
		}, {
			"start": "0.690",
			"end": "1.210",
			"text": "answer",
			"type": "pronunciation"
		}, {
			"text": ".",
			"type": "punctuation"
		}]
	}
}
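As a rough sanity check, the required elements of the standard can be verified against a document with a few lines of plain Python. This is only a sketch, not a substitute for full JSON Schema validation with a dedicated library:

```python
import json

def check_stt_output(doc):
    """Minimal structural check of the STT output standard above.

    Not a full JSON Schema validator; it only verifies the required
    elements and the conditional start/end rule for pronunciations.
    """
    errors = []
    for key in ("filename", "duration"):
        if key not in doc.get("media", {}):
            errors.append(f"media.{key} missing")
    results = doc.get("results", {})
    if "transcript" not in results:
        errors.append("results.transcript missing")
    for i, word in enumerate(results.get("words", [])):
        if word.get("type") not in ("pronunciation", "punctuation"):
            errors.append(f"words[{i}].type invalid")
        if "text" not in word:
            errors.append(f"words[{i}].text missing")
        if word.get("type") == "pronunciation":
            for key in ("start", "end"):
                if key not in word:
                    errors.append(f"words[{i}].{key} missing")
    return errors

sample = json.loads("""
{"media": {"filename": "myfile.wav", "duration": "1.500"},
 "results": {"transcript": "Professional answer.",
   "words": [{"start": "0.100", "end": "0.690", "text": "Professional", "type": "pronunciation"},
             {"text": ".", "type": "punctuation"}]}}
""")
print(check_stt_output(sample))  # an empty list means the sample passes
```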

Recommended tool(s)

AWS Transcribe

URL: https://aws.amazon.com/transcribe/
Official documentation: AWS Transcribe Developer Guide

Basic information

Open source or proprietary

Proprietary

Cost

$0.0004/second of audio

Input

“supports both 16 kHz and 8kHz audio streams, and multiple audio encodings, including WAV, MP3, MP4 and FLAC.” (from FAQ)

Output

- JSON file (ex. https://docs.aws.amazon.com/transcribe/latest/dg/getting-started-cli.html)
- Word-level objects w/ timestamp, confidence
- There are third-party libraries for converting AWS Transcribe JSON to VTT and TTML
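As a sketch of what such a conversion involves (this is not one of those third-party libraries, and the fixed words-per-cue grouping is a simplification; real captioning tools break cues on pauses and line length), word-level items can be reshaped into WebVTT like this:

```python
def aws_items_to_vtt(items, words_per_cue=8):
    """Convert AWS Transcribe word-level items into a minimal WebVTT string.

    Groups pronunciation items into fixed-size cues and attaches
    punctuation items to the preceding word.
    """
    def ts(seconds):
        s = float(seconds)
        h, rem = divmod(s, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    words = []
    for item in items:
        text = item["alternatives"][0]["content"]
        if item["type"] == "punctuation" and words:
            words[-1]["text"] += text
        elif item["type"] == "pronunciation":
            words.append({"text": text, "start": item["start_time"],
                          "end": item["end_time"]})
    lines = ["WEBVTT", ""]
    for i in range(0, len(words), words_per_cue):
        cue = words[i:i + words_per_cue]
        lines.append(f"{ts(cue[0]['start'])} --> {ts(cue[-1]['end'])}")
        lines.append(" ".join(w["text"] for w in cue))
        lines.append("")
    return "\n".join(lines)

items = [
    {"type": "pronunciation", "start_time": "0.10", "end_time": "0.69",
     "alternatives": [{"confidence": "0.99", "content": "Professional"}]},
    {"type": "pronunciation", "start_time": "0.69", "end_time": "1.21",
     "alternatives": [{"confidence": "0.98", "content": "answer"}]},
    {"type": "punctuation", "alternatives": [{"confidence": "0.0", "content": "."}]},
]
print(aws_items_to_vtt(items))
```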

Speaker diarization/identification

- Yes
- “You can specify that Amazon Transcribe identify between 2 and 10 speakers in the audio clip. You get the best performance when the number of speakers that you ask to identify matches the number of speakers in the input audio.” (from How it works page)
- Supports channel identification

Languages Supported

English, French, German, Italian, Korean, Portuguese, Spanish

Other features


Custom vocabulary

Yes, supports vocabulary files of 50 KB or less

Programming languages

.NET, Go, Java, JavaScript, PHP, Python, and Ruby

Training data

Unknown/black box

Privacy/Access

“Amazon Transcribe may store and use voice inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Transcribe and other Amazon machine-learning/artificial-intelligence technologies.” “Only authorized employees will have access to your content that is processed by Amazon Transcribe.”

Evaluation

Input formats

16 kHz and 8 kHz WAV, MP3, MP4, and FLAC

Output formats

JSON output needs to be reshaped by script into text or VTT formats for use in production.

Accuracy

58% average accuracy (or 42% WER) across our samples

Processing time

0.22x real-time

Computing resources required

N/A (cloud-based)

Growth rate

N/A

Social impact

AWS Transcribe's algorithm is proprietary; it is unknown what pre-processing steps or models are used to generate transcripts, or how results will change over time as the algorithm and models are updated. Custom vocabularies can be used to tune the model for certain words, but this tuning may not be useful for heterogeneous groups of materials. AWS offers only a vague explanation of how it will use your data to "improve" its services, which may make it an undesirable choice for processing materials that will not be made public.

Cost

$0.0004/second of audio

Support

AWS offers detailed API documentation for Transcribe. Logs can be generated for jobs to assess success or completion of jobs.

Integration capabilities

AWS makes it easy to integrate Transcribe into a pipeline of other AWS services, such as S3 for storage and Comprehend for NLP. If not using S3 for primary storage, assets must still be transferred to S3 in order to run them through Transcribe.

Training

Accuracy would likely benefit from using custom vocabularies, but they would need to be created custom for each group of like materials.

Notes: AWS returns diarization data in addition to speech-to-text. The AWS adapter should output both STT JSON (see schema above) and diarization/segmentation JSON. See below for details.
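The speech-to-text half of that adapter might look roughly like the following sketch. The function name is hypothetical, and filename/duration are passed in because AWS output does not carry them; a real adapter would supply them from job metadata. It takes the top-scoring alternative for each item:

```python
def aws_to_stt(aws_json, filename, duration):
    """Map AWS Transcribe output onto the STT output standard above.

    Sketch of the speech-to-text half of the adapter. Uses the first
    (top-scoring) alternative for each item and records its confidence
    as the word's score.
    """
    words = []
    for item in aws_json["results"]["items"]:
        best = item["alternatives"][0]
        word = {"type": item["type"], "text": best["content"],
                "score": {"type": "confidence",
                          "scoreValue": float(best["confidence"])}}
        if item["type"] == "pronunciation":
            word["start"] = item["start_time"]
            word["end"] = item["end_time"]
        words.append(word)
    return {
        "media": {"filename": filename, "duration": duration},
        "results": {
            "transcript": aws_json["results"]["transcripts"][0]["transcript"],
            "words": words,
        },
    }

aws = {"results": {
    "transcripts": [{"transcript": "Professional answer."}],
    "items": [
        {"type": "pronunciation", "start_time": "0.10", "end_time": "0.69",
         "alternatives": [{"confidence": "0.3686", "content": "Professional"}]},
        {"type": "punctuation",
         "alternatives": [{"confidence": "0.0000", "content": "."}]},
    ]}}
print(aws_to_stt(aws, "myfile.wav", "1.500"))
```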

See Adapters

Example Usage

AWS Transcribe Example
[Tested through the AWS Console]

Example Output

AWS Transcribe Output
{
    "jobName": "Student-admin_forum",
    "accountId": "751719168081",
    "results": {
        "transcripts": [
            {
                "transcript": "So the 17th this morning, demonstrators blocked of access to the Brian administration building, and after that, picket lines were established around Ballantine and Rolls halls.            }
            }
        ],
        "speaker_labels": {
            "speakers": 10,
            "segments": [
                {
                    "start_time": "3.94",
                    "speaker_label": "spk_0",
                    "end_time": "4.95",
                    "items": [
                        {
                            "start_time": "3.94",
                            "speaker_label": "spk_0",
                            "end_time": "4.17"
                        },
                        {
                            "start_time": "4.18",
                            "speaker_label": "spk_0",
                            "end_time": "4.31"
                        },
                        {
                            "start_time": "4.31",
                            "speaker_label": "spk_0",
                            "end_time": "4.95"
                        }
                    ]
                },
                {
                    "start_time": "6.03",
                    "speaker_label": "spk_0",
                    "end_time": "14.91",
                    "items": [
                        {
                            "start_time": "6.03",
                            "speaker_label": "spk_0",
                            "end_time": "6.18"
                        },
                        {
                            "start_time": "6.18",
                            "speaker_label": "spk_0",
                            "end_time": "6.47"
                        },
                        {
                            "start_time": "6.47",
                            "speaker_label": "spk_0",
                            "end_time": "7.18"
                        },
                        {
                            "start_time": "7.19",
                            "speaker_label": "spk_0",
                            "end_time": "7.85"
                        },
                        {
                            "start_time": "8.34",
                            "speaker_label": "spk_0",
                            "end_time": "8.5"
                        },
                        {
                            "start_time": "8.51",
                            "speaker_label": "spk_0",
                            "end_time": "8.98"
                        },
                        {
                            "start_time": "8.98",
                            "speaker_label": "spk_0",
                            "end_time": "9.13"
                        },
                        {
                            "start_time": "9.13",
                            "speaker_label": "spk_0",
                            "end_time": "9.25"
                        },
                        {
                            "start_time": "9.25",
                            "speaker_label": "spk_0",
                            "end_time": "9.59"
                        },
                        {
                            "start_time": "9.59",
                            "speaker_label": "spk_0",
                            "end_time": "10.29"
                        },
                        {
                            "start_time": "10.29",
                            "speaker_label": "spk_0",
                            "end_time": "10.69"
                        },
                        {
                            "start_time": "11.06",
                            "speaker_label": "spk_0",
                            "end_time": "11.24"
                        },
                        {
                            "start_time": "11.25",
                            "speaker_label": "spk_0",
                            "end_time": "11.55"
                        },
                        {
                            "start_time": "11.55",
                            "speaker_label": "spk_0",
                            "end_time": "11.76"
                        },
                        {
                            "start_time": "11.77",
                            "speaker_label": "spk_0",
                            "end_time": "12.11"
                        },
                        {
                            "start_time": "12.11",
                            "speaker_label": "spk_0",
                            "end_time": "12.42"
                        },
                        {
                            "start_time": "12.42",
                            "speaker_label": "spk_0",
                            "end_time": "12.56"
                        },
                        {
                            "start_time": "12.56",
                            "speaker_label": "spk_0",
                            "end_time": "13.08"
                        },
                        {
                            "start_time": "13.08",
                            "speaker_label": "spk_0",
                            "end_time": "13.34"
                        },
                        {
                            "start_time": "13.34",
                            "speaker_label": "spk_0",
                            "end_time": "13.91"
                        },
                        {
                            "start_time": "13.91",
                            "speaker_label": "spk_0",
                            "end_time": "14.05"
                        },
                        {
                            "start_time": "14.05",
                            "speaker_label": "spk_0",
                            "end_time": "14.38"
                        },
                        {
                            "start_time": "14.38",
                            "speaker_label": "spk_0",
                            "end_time": "14.91"
                        }
                    ]
                }
            ]
        },
        "items": [
            {
                "start_time": "3.94",
                "end_time": "4.17",
                "alternatives": [
                    {
                        "confidence": "0.3686",
                        "content": "So"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "4.18",
                "end_time": "4.31",
                "alternatives": [
                    {
                        "confidence": "0.9998",
                        "content": "the"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "4.31",
                "end_time": "4.95",
                "alternatives": [
                    {
                        "confidence": "0.8866",
                        "content": "17th"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "6.03",
                "end_time": "6.18",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "this"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "6.18",
                "end_time": "6.47",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "morning"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "alternatives": [
                    {
                        "confidence": "0.0000",
                        "content": ","
                    }
                ],
                "type": "punctuation"
            },
            {
                "start_time": "6.47",
                "end_time": "7.18",
                "alternatives": [
                    {
                        "confidence": "0.9997",
                        "content": "demonstrators"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "7.19",
                "end_time": "7.85",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "blocked"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "8.34",
                "end_time": "8.5",
                "alternatives": [
                    {
                        "confidence": "0.7914",
                        "content": "of"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "8.51",
                "end_time": "8.98",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "access"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "8.98",
                "end_time": "9.13",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "to"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "9.13",
                "end_time": "9.25",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "the"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "9.25",
                "end_time": "9.59",
                "alternatives": [
                    {
                        "confidence": "0.6497",
                        "content": "Brian"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "9.59",
                "end_time": "10.29",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "administration"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "10.29",
                "end_time": "10.69",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "building"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "alternatives": [
                    {
                        "confidence": "0.0000",
                        "content": ","
                    }
                ],
                "type": "punctuation"
            },
            {
                "start_time": "11.06",
                "end_time": "11.24",
                "alternatives": [
                    {
                        "confidence": "0.9809",
                        "content": "and"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "11.25",
                "end_time": "11.55",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "after"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "11.55",
                "end_time": "11.76",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "that"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "alternatives": [
                    {
                        "confidence": "0.0000",
                        "content": ","
                    }
                ],
                "type": "punctuation"
            },
            {
                "start_time": "11.77",
                "end_time": "12.11",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "picket"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "12.11",
                "end_time": "12.42",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "lines"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "12.42",
                "end_time": "12.56",
                "alternatives": [
                    {
                        "confidence": "0.9788",
                        "content": "were"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "12.56",
                "end_time": "13.08",
                "alternatives": [
                    {
                        "confidence": "1.0000",
                        "content": "established"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "13.08",
                "end_time": "13.34",
                "alternatives": [
                    {
                        "confidence": "0.9956",
                        "content": "around"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "13.34",
                "end_time": "13.91",
                "alternatives": [
                    {
                        "confidence": "0.9549",
                        "content": "Ballantine"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "13.91",
                "end_time": "14.05",
                "alternatives": [
                    {
                        "confidence": "0.9985",
                        "content": "and"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "14.05",
                "end_time": "14.38",
                "alternatives": [
                    {
                        "confidence": "0.2558",
                        "content": "Rolls"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "start_time": "14.38",
                "end_time": "14.91",
                "alternatives": [
                    {
                        "confidence": "0.7122",
                        "content": "halls"
                    }
                ],
                "type": "pronunciation"
            }
        ]
    },
    "status": "COMPLETED"
}

Kaldi

URL: https://github.com/kaldi-asr/kaldi

Official documentation:  Kaldi Documentation | HIPSTAS Kaldi instance

Basic information

Open source or proprietary

Open source

Cost

Free

Input

16 kHz WAV; this containerized version includes a bash script that converts MP3, MP4, WAV, and MOV files to the appropriate spec before sending them to the Kaldi scripts.
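That conversion step can be sketched by constructing the equivalent ffmpeg command line. This is an assumption about what the container's script does, not the script it actually ships, and it presumes ffmpeg is installed:

```python
def kaldi_convert_command(src, dst):
    """Build an ffmpeg command that downmixes a supported input (MP3, MP4,
    WAV, MOV, ...) to the 16 kHz mono 16-bit WAV the Kaldi scripts expect.

    Sketch only; the container's own conversion script may differ.
    """
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1",              # mono
            "-ar", "16000",          # 16 kHz sample rate
            "-acodec", "pcm_s16le",  # 16-bit PCM WAV
            dst]

cmd = kaldi_convert_command("interview.mp4", "interview.wav")
print(" ".join(cmd))
# run with: subprocess.run(cmd, check=True)
```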

Output

- JSON and txt file
- word-timestamped JSON (see https://americanarchivepb.wordpress.com/2018/06/13/aapb-transcription-workflow-part-1/ for an example)

Speaker diarization/identification

No

Languages Supported

English

Other features


Custom vocabulary

No

Programming languages

Written in C++, with bindings for Python and Bash

Other tech notes

This version is containerized.

Privacy/Access

Will inherit whatever security protocols IU requires in IT systems.

Evaluation

Input formats

16 kHz WAV audio file

Output formats

The WGBH fork outputs JSON with word-level timestamps and confidence as well as plain text

Accuracy

46% average accuracy (or 54% WER) across our samples

Processing time

4x real-time

Computing resources required

Each container requires at least 1 CPU and 6 GB of memory.

Growth rate

N/A

Social impact

Kaldi is an open-source tool; if AMP contributes code, this will benefit the community. Kaldi can be trained, so there is more control over the output, but training a model is difficult and time-consuming. The same level of care would need to be taken with Kaldi as with commercial services to ensure that the risks of unintended consequences are mitigated.

Cost

Cost is related to cost of IU servers and the average throughput by users.

Support

Open source community. Support is only via goodwill from the community. Kaldi has an active community forum.

Integration capabilities

N/A

Training

Successful training of Kaldi would require one or more expert speech recognition researchers to manage model development and training for each unique recording scenario (e.g., clear English with one speaker, no background noise, and one specific accent; clear English with two speakers, no background noise, and a noticeable Southern accent; etc.). Additionally, different types of content will require different models (e.g., English read from the WSJ, poetry, weather news, sports news, theater, etc.).

Installation & requirements

Our version of Kaldi runs using Docker

docker pull hipstas/kaldi-pop-up-archive

Example Usage

See also this full walkthrough on Google Drive

<tool name> Example
 

Example Output

<tool name> Output
 

Other evaluated tools

Google Cloud STT

URL: https://cloud.google.com/speech-to-text/

Basic information

Open source or proprietary

Proprietary

Cost

$0.006 / 15 seconds ($0.009 / 15 seconds for video)

Input

Supported encodings: FLAC, LINEAR16, MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE

Output

JSON

Speaker diarization/identification

Yes (Beta)

Languages Supported

120 languages and dialects

Custom vocabulary

Yes, up to 500 phrases per request, 10,000 total characters per request, 100 characters per phrase

Programming languages

REST API (easy from many languages). Libraries for C#, Go, Java, Node.js, PHP, Python, Ruby

Privacy/Access

Google offers a lower rate for opting in to data logging, but it is not specific about uses of the data

Evaluation

Input formats

Encodings: MP3 (beta), FLAC, LINEAR16, MULAW, AMR, OGG_OPUS. See documentation for required sample rate for each: https://cloud.google.com/speech-to-text/docs/encoding

Output formats

JSON with word-level timestamps, and "sentence"-level confidence (ability to enable word confidence). JSON output needs to be reshaped by script into text or VTT formats for use in production.

Accuracy

55% average accuracy (or 45% WER) across our samples

Processing time

0.16x real-time

Computing resources required

N/A (cloud-based)

Growth rate

N/A

Social impact

Google's algorithm is proprietary; it is unknown what pre-processing steps or models are used to generate transcripts, or how results will change over time as the algorithm and models are updated. Custom vocabularies can be used to tune the model for certain words, but this tuning may not be useful for heterogeneous groups of materials. Google offers a lower rate for opting in to data logging, but it is not specific about uses of the data, which may make it an undesirable choice for processing materials that will not be made public.

Cost

$0.006 / 15 seconds ($0.0004/second) for audio; $0.009 / 15 seconds for video ($0.0006/second)
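As a quick sanity check on those rates, the cost of one hour of audio works out the same for AWS Transcribe and Google's audio model, with Google's video model costing half again as much:

```python
# Estimated transcription cost for one hour of audio at the rates quoted
# in this document.
aws_rate = 0.0004               # AWS Transcribe, $/second
google_audio_rate = 0.006 / 15  # Google Cloud STT audio, $/second
google_video_rate = 0.009 / 15  # Google Cloud STT video model, $/second

hour = 3600  # seconds
print(f"AWS:          ${aws_rate * hour:.2f}")
print(f"Google audio: ${google_audio_rate * hour:.2f}")
print(f"Google video: ${google_video_rate * hour:.2f}")
```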

Support

Well documented. Professional support team and many community support forums

Integration capabilities

Google makes it easy to integrate Google Cloud STT into a pipeline of other Google services, such as Google Cloud for storage and Natural Language for NLP. If not using Google Cloud for primary storage, assets must still be transferred to Google Cloud in order to run them through STT.

Training

Google offers four models to choose from: command_and_search, phone_call, video (a premium model that costs more), and default. (Currently only phone_call allows speaker diarization.) It does not yet offer custom vocabularies, but it does offer "phrase hints".
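A request body along these lines selects a model and supplies phrase hints. Field names follow Google's REST API as documented; the bucket path and phrases are placeholders:

```json
{
  "config": {
    "languageCode": "en-US",
    "model": "phone_call",
    "enableWordTimeOffsets": true,
    "speechContexts": [
      {"phrases": ["Ballantine Hall", "picket line"]}
    ]
  },
  "audio": {"uri": "gs://my-bucket/myfile.flac"}
}
```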

Mozilla DeepSpeech

URL: https://github.com/mozilla/DeepSpeech & https://research.mozilla.org/machine-learning/
Official documentation: https://github.com/mozilla/DeepSpeech

Basic information

Open source or proprietary

Open source

Cost

Compute cost (if running on cloud services)

Input

Currently only WAVE files with 16-bit, 16 kHz, mono are supported

Output

Plain text

Speaker diarization/identification

No

Languages Supported

English

Custom vocabulary

Yes

Training data

Global Speech dataset (from the Common Voice project, with a diverse range of voices)

Programming languages

Python, but there are a few other bindings. An early Java JNI binding is available: https://github.com/mozilla/DeepSpeech/tree/master/native_client/java

Privacy/Access

Will inherit whatever security protocols IU requires in IT systems.

Evaluation

Input formats

16-bit, 16 kHz mono WAV audio file

Output formats

Plain text only

Accuracy

25% average accuracy (or 75% WER) across our samples

Processing time

0.25x real-time

Social impact

Deep Speech can be trained, so there is more control over the output, but training a model is difficult and time consuming. The same level of care would need to be taken for Deep Speech as for commercial services to ensure that the risks of unintended consequences are mitigated. Mozilla claims a more diverse body of training data (Project Common Voice) to represent a wider variety of dialects, so while current outputs are still very inaccurate, future versions may prove to offer more accurate transcripts.

Cost

Cost is related to cost of IU servers and the average throughput by users.

Support

Open source community. Support is only via goodwill from the community. Deep Speech has an active user community: https://discourse.mozilla.org/c/deep-speech

Integration capabilities

Deep Speech does not offer timestamps, so its use may be limited for other purposes beyond keyword searchable transcripts.

Training

Deep Speech allows training and looks easier to train than Kaldi, but this will still take time and expertise to get desired results. Different types of content will require different models.

Example Usage

<tool name> Example
 

Example Output

<tool name> Output
 

PocketSphinx/CMUSphinx

URL & official documentation: https://cmusphinx.github.io/

Basic information

Open source or proprietary

Open source

Cost

Free

Input

WAVE (16-bit mono, 8 kHz or 16 kHz) only

Output

Plain text

Speaker diarization/identification

Integration with LIUM tools

Languages Supported

Prebuilt models for: Mandarin, Indian English, Catalan, German, Greek, French, Dutch, US English, Spanish, Italian, Hindi, Russian, Kazakh

Custom vocabulary

Yes

Training data

Unsure what the prebuilt models are trained on; ability to train your own

Programming languages

Java, C

Privacy/Access

Everything is handled locally, so it is what we make it

Evaluation

Input formats

16-bit mono WAV audio file

Output formats

Sphinx4 outputs plain text, but timestamps could also be generated with some effort in learning the tool. (The learning curve is steep.)

Accuracy

24% average accuracy (or 76% WER) across our samples

Processing time

0.65x real-time

Social impact

Sphinx can be trained, so there is more control over the output, but training a model is difficult and time consuming. The same level of care would need to be taken for Sphinx as for commercial services to ensure that the risks of unintended consequences are mitigated.

Cost

Cost is related to cost of IU servers and the average throughput by users.

Support

Open source community. Support is only via goodwill from the community.

Training

Sphinx 4 allows training, but this will still take time and expertise to get desired results.

Example Usage

<tool name> Example
 

Example Output

<tool name> Output
 

