Segmentation MGMs detect when silence, speech, and/or music occur in an audio file. This information may be interesting in its own right for determining how much of an object in an archive has content (e.g. is half the tape silence?). Segment data could also be used to route files (or parts of files) to different MGMs based on the content (for example, sending the speech portions into a workflow that includes STT and the music portions into a music workflow).
Note: these tools do not split the audio files themselves; they only output timestamped labels for the contents of each segment. Splitting would need to be handled by another tool, such as ffmpeg (see the sketch below).
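For illustration only, here is a minimal sketch of how a downstream step might use ffmpeg to cut out one segment. The file names and timestamps are hypothetical; in practice the boundaries would come from the segmentation output described below.

import subprocess

# Hypothetical segment taken from a segmentation MGM's output
segment = {"label": "speech", "start": "0.0", "end": "12.35"}

# Copy the audio between start and end into a new file.
# "myfile.wav" and "myfile_speech_01.wav" are placeholder names.
subprocess.run(
    ["ffmpeg",
     "-i", "myfile.wav",
     "-ss", segment["start"],   # start time in seconds
     "-to", segment["end"],     # end time in seconds
     "-c", "copy",              # stream copy, no re-encode
     "myfile_speech_01.wav"],
    check=True,
)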
Summary: An array of segments, each with a label, start, and end. Start and end are timestamps in seconds. The label may be one of "speech", "music", or "silence". If the label is "speech", a "gender" may be specified as either "male" or "female", and a "speaker_label" from speaker diarization may be included. A top-level "num_speakers" may also report the number of speakers when the output is used for diarization.
{ "$schema": "http://json-schema.org/schema#", "type": "object", "title": "Audio Segment Schema", "required": [ "media", "segments" ], "properties": { "media": { "type": "object", "title": "Media", "required": [ "filename", "duration" ], "properties": { "filename": { "type": "string", "title": "Filename", "default": "", "examples": ["myfile.wav"] }, "duration": { "type": "string", "title": "Duration", "default": "", "examples": ["25.888"] } } }, "segments": { "type": "array", "title": "Segments", "items": { "type": "object", "required": [ "label", "start", "end" ], "oneOf": [ { "additionalProperties": false, "properties": { "label": { "type": "string", "enum": ["speech"] }, "start": { "type": "string", "description": "Start time in seconds", "default": 0.0, "examples": ["123.45"] }, "end": { "type": "string", "description": "End time in seconds", "default": 0.0, "examples": ["123.45"] }, "gender": { "type": "string", "enum": ["male", "female"], "default": "unknown" }, "speaker_label": { "type": "string", "default":"unknown", "description": "speaker label from speaker diarization" } } }, { "additionalProperties": false, "properties": { "label": { "type": "string", "enum": ["music", "silence"] }, "start": { "type": "string", "description": "Start time in seconds", "default": 0.0, "examples": ["123.45"] }, "end": { "type": "string", "description": "End time in seconds", "default": 0.0, "examples": ["123.45"] } } } ] } }, "num_speakers": { "type": "integer", "description" : "number of speakers (if used for diarization)" }, } } |
{ "media": { "filename": "mysong.wav", "duration": "124.3" }, "segments": [ { "label": "speech", "start": "0.0", "end": "12.35", "gender": "male", "speaker_label": "speaker1" }, { "label": "music", "start": "10", "end": "20" } ] } |
Official documentation: GitHub
Language: Python
Description: inaSpeechSegmenter detects music, speech, and the apparent gender of the speaker. Zones of speech over music are tagged as speech.
Cost: Free (open source)
Social impact: Trained on French-language samples, so its idea of what male and female voices sound like is based on an unknown sample of French speakers. In our initial testing the results have been more or less accurate for our English-language samples, but this is an important caveat.
Notes:
Requires ffmpeg and TensorFlow
Install via pip:
pip install inaSpeechSegmenter
None
Because inaSpeechSegmenter does not have any parameters for the minimum length of a segment or the maximum length of silence allowed within a speech/music segment, it may be beneficial to add another step to the workflow (or build one into the ina adapter) that filters or merges ina's output based on such parameters; see the sketch below.
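A rough sketch of what such a filtering/merging step could look like, operating on the (label, start, end) tuples inaSpeechSegmenter returns. The thresholds, label spellings, and merge rule are assumptions for illustration, not part of the tool:

def filter_segments(segments, max_silence=2.0, min_length=1.0):
    """Illustrative post-processing of (label, start, end) tuples."""
    # 1. Drop short silence/no-activity stretches so the speech/music
    #    segments on either side become candidates for merging.
    kept = [(label, start, end) for label, start, end in segments
            if not (label.lower() in ("noactivity", "noenergy", "silence")
                    and end - start < max_silence)]
    # 2. Merge neighbouring segments that share a label and are separated
    #    by no more than max_silence seconds.
    merged = []
    for label, start, end in kept:
        if merged and merged[-1][0] == label and start - merged[-1][2] <= max_silence:
            merged[-1] = (label, merged[-1][1], end)
        else:
            merged.append((label, start, end))
    # 3. Drop anything still shorter than the minimum segment length.
    return [s for s in merged if s[2] - s[1] >= min_length]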
All media formats accepted by ffmpeg (wav, mp3, mp4, etc.)
from inaSpeechSegmenter import Segmenter

seg = Segmenter()
segmentation = seg("path/to/file.wav")
for s in segmentation:
    label = s[0]
    start = s[1]
    end = s[2]
    print("Detected {} from {} seconds to {} seconds".format(label, start, end))
# Output has been printed in the order start, end, label
0.0                 23.76               Music
23.78               28.080000000000002  NOACTIVITY
28.080000000000002  36.6                Music
36.62               37.2                NOACTIVITY
37.2                38.04               Music
38.06               38.9                NOACTIVITY
38.9                44.72               Music
44.74               46.04               NOACTIVITY
46.04               46.58               Music
46.6                47.56               NOACTIVITY
47.56               254.24              Music
254.24              255.26000000000002  Female
255.28              274.82              Music
274.84000000000003  275.32              NOACTIVITY
275.32              277.90000000000003  Music
277.92              278.74              NOACTIVITY
278.74              279.88              Female
279.90000000000003  345.0               Music
345.02              347.5               NOACTIVITY
347.5               355.42              Music
355.44              356.34000000000003  NOACTIVITY
356.34000000000003  372.66              Music
372.68              378.12              NOACTIVITY
378.12              395.2               Music
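The raw labels above (Music, Female, Male, NOACTIVITY) would still have to be mapped onto the schema's "music" / "speech" / "silence" vocabulary. One possible mapping is sketched below; the label spellings, file name, and the way the duration is derived are assumptions, since a real adapter would read the duration from the media file itself:

import json
from inaSpeechSegmenter import Segmenter

# Assumed mapping from inaSpeechSegmenter labels (lower-cased) to schema labels;
# adjust the keys to whatever labels your installed version actually emits.
LABEL_MAP = {
    "music": ("music", None),
    "male": ("speech", "male"),
    "female": ("speech", "female"),
    "noactivity": ("silence", None),
    "noenergy": ("silence", None),
}

seg = Segmenter()
segments = []
for label, start, end in seg("path/to/file.wav"):
    schema_label, gender = LABEL_MAP.get(label.lower(), ("silence", None))
    entry = {"label": schema_label, "start": str(start), "end": str(end)}
    if gender:
        entry["gender"] = gender
    segments.append(entry)

output = {
    "media": {
        "filename": "path/to/file.wav",
        # Stand-in only: a real adapter would take the duration from the media file
        "duration": segments[-1]["end"] if segments else "0",
    },
    "segments": segments,
}
print(json.dumps(output, indent=2))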
Official documentation: GitHub
Language: Python
Description:
Cost: Free (open source)
Social impact:
Notes:
Offset: e.g. 0.94 (the value passed to Binarize in the example below)
Onset: e.g. 0.70 (the value passed to Binarize in the example below)
import sys, os
from datetime import datetime

from pyannote.audio.labeling.extraction import SequenceLabeling
from pyannote.audio.signal import Binarize


def main():
    if len(sys.argv) < 2:
        print("Arguments: input-file [output-file]")
        sys.exit(1)

    # Get input/output files
    audio_file = sys.argv[1]
    if len(sys.argv) > 2:
        out = sys.argv[2]
    else:
        out = "pyannote_{}_.txt".format(os.path.basename(audio_file))

    # Init model
    media = {'uri': 'filename', 'audio': audio_file}
    SAD_MODEL = ('pyannote-audio/tutorials/models/speech_activity_detection/train/'
                 'AMI.SpeakerDiarization.MixHeadset.train/weights/0280.pt')
    sad = SequenceLabeling(model=SAD_MODEL)
    sad_scores = sad(media)

    # Run segmentation
    print("\n\nSegmenting {}".format(media))
    startTime = datetime.now()
    binarize = Binarize(offset=0.94, onset=0.70, log_scale=True)
    speech = binarize.apply(sad_scores, dimension=1)

    # Write output
    print("\n\nWriting to {}".format(out))
    with open(out, 'w') as o:
        for s in speech:
            result = "{}\t{}\t Speech \n".format(s.start, s.end)  # start end label
            o.write(result)
            print(result)

    # Print run time
    endTime = datetime.now()
    print("Finished!\n Runtime: {}".format(endTime - startTime))


if __name__ == "__main__":
    main()
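Assuming the script above is saved as, say, pyannote_sad.py (the name is arbitrary), it would be run as python pyannote_sad.py path/to/file.wav [output-file]; if the optional second argument is omitted, the output file name is derived from the input file name.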
Official documentation:
Language:
Description:
Cost: Free (open source)
Social impact:
Notes:
All media formats accepted by ffmpeg (wav, mp3, mp4, etc.)
Output
Official documentation:
Language:
Description:
Cost: Free (open source)
Social impact:
All media formats accepted by ffmpeg (wav, mp3, mp4, etc.)
Each tool was tested with a variety of (shorter) samples pulled from our sample collection files. Outputs were reviewed in Audacity.