Category description and use cases

Segmentation MGMs detect when silence, speech, and/or music occur in an audio file. This information may be interesting in its own right for determining how much of an object in an archive has content (e.g. is half the tape silence?). Segment data could also be used to route files (or parts of files) to different MGMs based on the content (for example, sending the speech portions into a workflow that includes STT and the music portions into a music workflow).

Note: these tools do not split the audio files themselves; they only output timestamped labels for each segment's contents. Splitting would need to be handled by another tool, such as ffmpeg (see the sketch below).
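
As an illustration only, here is a minimal sketch of how the timestamped labels (in the output standard defined below) could drive ffmpeg to cut a file into per-segment clips. The input filename and output naming scheme are assumptions, not part of any existing workflow.

Splitting on segment boundaries with ffmpeg (sketch)
import json
import subprocess

# Hypothetical filename; any document following the output standard below works
with open("segmentation_output.json") as f:
    data = json.load(f)

source = data["media"]["filename"]
for i, segment in enumerate(data["segments"]):
    if segment["label"] == "silence":
        continue  # nothing to extract from silent regions
    clip_name = "{}_{:03d}_{}.wav".format(source.rsplit(".", 1)[0], i, segment["label"])
    # -ss/-to take times in seconds; -c copy avoids re-encoding the clip
    subprocess.run(
        ["ffmpeg", "-y", "-i", source,
         "-ss", segment["start"], "-to", segment["end"],
         "-c", "copy", clip_name],
        check=True)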

Workflow example:

Output standard

Summary: An array of segments, each with a label, start, and end. Start and end are timestamps in seconds. The label may be one of "speech," "music," or "silence." If the label is "speech," a "gender" may be specified as either "male" or "female," and an optional "speakerLabel" from speaker diarization may be included.

JSON Schema

Segmentation schema
{
    "$schema": "http://json-schema.org/schema#",
    "type": "object",
    "title": "Audio Segment Schema",
    "required": [
        "media",
        "segments"
    ],
    "properties": {
        "media": {
            "type": "object",
            "title": "Media",
            "required": [
                "filename",
                "duration"
            ],
            "properties": {
                "filename": {
                    "type": "string",
                    "title": "Filename",
                    "default": "",
                    "examples": ["myfile.wav"]
                },
                "duration": {
                    "type": "string",
                    "title": "Duration",
                    "default": "",
                    "examples": ["25.888"]
                }
            }
        },
        "segments": {
            "type": "array",
            "title": "Segments",
            "items": {
                "type": "object",
                "required": [
                    "label",
                    "start",
                    "end"
                ],
                "oneOf": [
                    {
                        "additionalProperties": false,
                        "properties": {
                            "label": {
                                "type": "string",
                                "enum": ["speech"]
                            },
                            "start": {
                                "type": "string",
                                "description": "Start time in seconds",
                                "default": "0.0",
                                "examples": ["123.45"]
                            },
                            "end": {
                                "type": "string",
                                "description": "End time in seconds",
                                "default": "0.0",
                                "examples": ["123.45"]
                            },
                            "gender": {
                                "type": "string",
                                "enum": ["male", "female"],
                                "default": "unknown"
                            },
                            "speakerLabel": {
                                "type": "string",
                                "default": "unknown",
                                "description": "Speaker label from speaker diarization"
                            }
                        }
                    },
                    {
                        "additionalProperties": false,
                        "properties": {
                            "label": {
                                "type": "string",
                                "enum": ["music", "silence"]
                            },
                            "start": {
                                "type": "string",
                                "description": "Start time in seconds",
                                "default": "0.0",
                                "examples": ["123.45"]
                            },
                            "end": {
                                "type": "string",
                                "description": "End time in seconds",
                                "default": "0.0",
                                "examples": ["123.45"]
                            }
                        }
                    }
                ]
            }
        },
        "numSpeakers": {
            "type": "integer",
            "description": "Number of speakers (if used for diarization)"
        }
    }
}

Sample output

Sample segmentation output
{
    "media": {
        "filename": "mysong.wav",
        "duration": "124.3"
    },
    "segments": [
        {
            "label": "speech",
            "start": "0.0",
            "end": "12.35",
            "gender": "male",
            "speakerLabel": "speaker1"
        },
        {
            "label": "music",
            "start": "10",
            "end": "20"
        }
    ]
}
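
To sanity-check MGM output against the standard, something along these lines could be used. It assumes the schema and a document like the sample above are saved to files and that the jsonschema package is installed; the filenames are illustrative.

Validating output against the schema (sketch)
import json
from jsonschema import validate  # pip install jsonschema

# Illustrative filenames for the schema and sample output shown above
with open("segmentation_schema.json") as f:
    schema = json.load(f)
with open("segmentation_output.json") as f:
    document = json.load(f)

# Raises jsonschema.ValidationError if the document does not conform
validate(instance=document, schema=schema)
print("Segmentation output conforms to the schema")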


Recommended tool(s)

inaSpeechSegmenter

Official documentation: GitHub

Language: Python

Description: inaSpeechSegmenter detects music, speech, and the apparent gender of the speaker. Zones of speech over music are tagged as speech.

Cost: Free (open source)

Social impact: Trained on French-language samples, so its notion of what male and female voices sound like is based on an unknown sample of French speakers. In initial testing the results have been more or less accurate for our English-language samples, but this caveat is worth noting.

Notes: 

Installation & requirements

Requires ffmpeg and TensorFlow

Install via pip: 

pip install inaSpeechSegmenter

Parameters

None

Because inaSpeechSegmenter does not expose parameters for the minimum length of a segment or the maximum length of silence allowed within a speech/music segment, it may be beneficial to add a step to the workflow (or build one into the ina adapter) that filters or merges ina's output according to such parameters, along the lines of the sketch below.
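
For example, a post-processing function like the following could absorb short silences and drop very short segments. The thresholds and label spellings ("NOACTIVITY" in older releases, "noEnergy" in newer ones) are assumptions for illustration, not part of any existing adapter.

Post-filtering ina output (sketch)
def smooth_segments(segments, min_length=1.0, max_silence=0.5):
    """Post-filter for (label, start, end) tuples from inaSpeechSegmenter:
    absorb silence segments shorter than max_silence seconds into the
    preceding segment, merge adjacent segments with the same label, and
    drop anything still shorter than min_length seconds."""
    smoothed = []
    for label, start, end in segments:
        short_silence = (label.lower() in ("noactivity", "noenergy")
                         and end - start <= max_silence)
        if smoothed and (short_silence or label == smoothed[-1][0]):
            # Extend the previous segment across the short silence / same label
            smoothed[-1] = (smoothed[-1][0], smoothed[-1][1], end)
        else:
            smoothed.append((label, start, end))
    return [s for s in smoothed if s[2] - s[1] >= min_length]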

Input formats

All media formats accepted by ffmpeg (wav, mp3, mp4, etc.)

Example Usage

inaSpeechSegmenter Example
from inaSpeechSegmenter import Segmenter

# Segmenter() loads the pretrained models; calling it on a file returns a
# list of (label, start, end) tuples, with times in seconds
seg = Segmenter()
segmentation = seg("path/to/file.wav")

for label, start, end in segmentation:
	print("Detected {} from {} seconds to {} seconds".format(label, start, end))

Example Output

inaSpeechSegmentation Output
# Output has been printed in the order start, end, label

0.0	23.76	Music
23.78	28.080000000000002	NOACTIVITY
28.080000000000002	36.6	Music
36.62	37.2	NOACTIVITY
37.2	38.04	Music
38.06	38.9	NOACTIVITY
38.9	44.72	Music
44.74	46.04	NOACTIVITY
46.04	46.58	Music
46.6	47.56	NOACTIVITY
47.56	254.24	Music
254.24	255.26000000000002	Female
255.28	274.82	Music
274.84000000000003	275.32	NOACTIVITY
275.32	277.90000000000003	Music
277.92	278.74	NOACTIVITY
278.74	279.88	Female
279.90000000000003	345.0	Music
345.02	347.5	NOACTIVITY
347.5	355.42	Music
355.44	356.34000000000003	NOACTIVITY
356.34000000000003	372.66	Music
372.68	378.12	NOACTIVITY
378.12	395.2	Music
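
A sketch of how these tuples could be mapped into the output standard above. The label spellings come from the sample output ("Music", "NOACTIVITY", "Female") plus the newer "noEnergy" spelling, and treating the last segment's end time as the media duration is an approximation.

Converting ina output to the output standard (sketch)
import json
from inaSpeechSegmenter import Segmenter

audio_file = "path/to/file.wav"
segmentation = Segmenter()(audio_file)  # list of (label, start, end) tuples

# Map ina labels to the standard's label (and gender for speech segments)
label_map = {"music": ("music", None),
             "noactivity": ("silence", None),
             "noenergy": ("silence", None),
             "male": ("speech", "male"),
             "female": ("speech", "female")}

segments = []
for label, start, end in segmentation:
    std_label, gender = label_map.get(label.lower(), ("speech", None))
    entry = {"label": std_label, "start": str(start), "end": str(end)}
    if gender:
        entry["gender"] = gender
    segments.append(entry)

output = {
    "media": {"filename": audio_file,
              "duration": str(segmentation[-1][2])},  # approximation
    "segments": segments,
}
print(json.dumps(output, indent=2))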

Other evaluated tools

pyannote-audio

Official documentation: GitHub

Language: Python

Description: pyannote-audio is an open-source speaker diarization toolkit built on PyTorch; it also provides pretrained models for related tasks such as speech activity detection, which the example below uses.

Cost: Free (open source)

Social impact: 

Notes:

Installation & requirements


Parameters

Offset: threshold the speech activity score must fall below before an active (speech) region is ended; the example below uses 0.94.

Onset: threshold the speech activity score must exceed for a region to be marked as speech; the example below uses 0.70.

Input formats


Example Usage

pyannote-audio Example
import sys, os
from datetime import datetime
from pyannote.audio.labeling.extraction import SequenceLabeling
from pyannote.audio.signal import Binarize

def main():
	if len(sys.argv) < 2:
		print("Arguments: input-file [output-file]")
		return

	# Get input/output files
	audio_file = sys.argv[1]
	if len(sys.argv) > 2:
		out = sys.argv[2]
	else:
		out = "pyannote_{}_.txt".format(os.path.basename(audio_file))
		
	#init model
	media = {'uri': 'filename', 'audio': audio_file}
	SAD_MODEL = ('pyannote-audio/tutorials/models/speech_activity_detection/train/'
             'AMI.SpeakerDiarization.MixHeadset.train/weights/0280.pt')
	sad = SequenceLabeling(model=SAD_MODEL)
	sad_scores = sad(media)
	
	# Run segmentation
	print("\n\nSegmenting {}".format(media))
	startTime = datetime.now()
	binarize = Binarize(offset = 0.94, onset = 0.70, log_scale = True)
	speech = binarize.apply(sad_scores, dimension = 1)
	
	
	# Write output
	print("\n\nWriting to {}".format(out))
	with open(out, 'w') as o:
		for s in speech:
			result = "{}\t{}\t Speech \n".format(s.start, s.end)  # start  end  label
			o.write(result)
			print(result)
		
	
	# Print run time
	endTime = datetime.now()
	print("Finished!\n Runtime: {}".format(endTime-startTime))
	

if __name__ == "__main__":
	main()

Example Output

 


auditok

Official documentation: 

Language: Python

Description: auditok is an audio activity detection library that splits a recording into active (non-silent) regions based on an energy threshold.

Cost: Free (open source)

Social impact: 

Notes: 

Installation & requirements


Parameters


Input formats

All media formats accepted by ffmpeg (wav, mp3, mp4, etc.)

Example Usage

auditok Example
 


Example Output

 


Sphinx with LIUM

Official documentation: 

Language: 

Description: 

Cost: Free (open source)

Social impact: 

Installation & requirements


Parameters


Input formats

All media formats accepted by ffmpeg (wav, mp3, mp4, etc.)

Example Usage

Sphinx with LIUM Example
 

Example Output

 

Evaluation summary

Each tool was tested with a variety of (shorter) samples pulled from our sample collection files. Outputs were reviewed in Audacity.
