
Category description and use cases

Video OCR is the recognition of text in video content, for example, words on objects like signs or clothing, subtitles and captions, or opening/ending credits. Video OCR algorithms may use a variety of methods for detecting text over a series of video frames. 

Accuracy (precision, recall, and F1 scores) was assessed for short clips from 4 different video samples to evaluate the appropriate level of quality for human-mediated workflows. The process and results are described in more detail here. The team assessed one open-source toolset, Tesseract and FFmpeg, and two proprietary services, Microsoft Azure Video Indexer and Google Video Intelligence Text Detection. We recommended the open-source option (Tesseract and FFmpeg) and recommended Azure Video Indexer as a proprietary option because of its higher precision scores and the possibility of using some of the other outputs of the service, like object detection, transcription, and scene/shot detection. (Google scored very high in recall, but this can produce too many results for collection managers to review.)

Workflow example:

Output standard

Summary: 


| Element | Datatype | Obligation | Definition |
| --- | --- | --- | --- |
| media | object | required | Wrapper for metadata about the source media file. |
| media.filename | string | required | Filename of the source file. |
| media.duration | string | required | The duration of the source file. |
| media.frameRate | number | required | The frame rate of the video, in FPS. |
| media.numFrames | number | required | The number of frames in the video. |
| media.resolution | object | required | Resolution of the video. |
| media.resolution.width | number | required | Width of the frame, in pixels. |
| media.resolution.height | number | required | Height of the frame, in pixels. |
| frames | array | required | List of frames containing text. |
| frames[*] | object | optional | A frame containing text. |
| frames[*].start | string (s.fff) | required | Time of the frame, in seconds. |
| frames[*].objects | array | required | List of instances in the frame containing text. |
| frames[*].objects[*] | object | required | An instance in the frame containing text. |
| frames[*].objects[*].text | string | required | The text within the instance. |
| frames[*].objects[*].language | string | optional | The language of the detected text (localized ISO 639-1 code, e.g. "en-US"). |
| frames[*].objects[*].score | object | optional | A confidence or relevance score for the text. |
| frames[*].objects[*].score.type | string (confidence or relevance) | required | The type of score, confidence or relevance. |
| frames[*].objects[*].score.scoreValue | number | required | The score value, typically a number in the range of 0-1. |
| frames[*].objects[*].vertices | object | required | The top left (xmin, ymin) and bottom right (xmax, ymax) relative bounding coordinates. |
| frames[*].objects[*].vertices.xmin | number | required | The top left x coordinate. |
| frames[*].objects[*].vertices.ymin | number | required | The top left y coordinate. |
| frames[*].objects[*].vertices.xmax | number | required | The bottom right x coordinate. |
| frames[*].objects[*].vertices.ymax | number | required | The bottom right y coordinate. |
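
The media fields above can typically be gathered with ffprobe (bundled with FFmpeg); a minimal sketch with a placeholder path, in the Colab shell style used in the Tesseract example below. In the JSON it prints, format.duration maps to media.duration, and the video stream's r_frame_rate, nb_frames, width, and height map to media.frameRate, media.numFrames, and media.resolution.

# Print source-file metadata as JSON (duration, frame rate, frame count, resolution)
!ffprobe -v quiet -print_format json -show_format -show_streams "path/to/myfile.mov"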



JSON Schema

Schema
{
	"$schema": "http://json-schema.org/schema#",
    "type": "object",
    "title": "Video OCR Schema",
    "required": [
        "media",
        "frames"
    ],
    "properties": {
        "media": {
            "type": "object",
            "title": "Media",
            "description": "Wrapper for metadata about the source media file.",
            "required": [
                "filename",
                "duration"
            ],
            "properties": {
                "filename": {
                    "type": "string",
                    "title": "Filename",
                    "description": "Filename of the source file.",
                    "default": "",
                    "examples": [
                        "myfile.wav"
                    ]
                },
                "duration": {
                    "type": "string",
                    "title": "Duration",
                    "description": "Duration of the source file audio.",
                    "default": "",
                    "examples": [
                        "25.888"
                    ]
                },
                "frameRate": {
                	"type": "number",
                	"title": "Frame rate",
                	"description": "The frame rate of the video, in FPS.",
                	"default": 0,
                	"examples": [
                		29.97
                	]
                },
                "numFrames": {
                	"type": "integer",
                	"title": "Number of frames",
                	"description": "The number of frames in the video.",
                	"default": 0,
                	"examples": [
                		1547
                	]
                },
                "resolution": {
                	"type": "object",
                	"title": "Resolution",
                	"description": "Resolution of the video.",
                	"required": [
                		"height",
                		"width"
                	],
                	"properties": {
                		"height": {
                			"type": "integer",
                			"title": "Height",
                			"description": "Height of the frame, in pixels.",
                			"default": 0
                		},
                		"width": {
                			"type": "integer",
                			"title": "Width",
                			"description": "Width of the frame, in pixels.",
                			"default": 0
                		}
                	}
                }
            }
        },
        "frames": {
        	"type": "array",
        	"title": "Frames",
        	"description": "List of frames containing text.",
        	"items": {
        		"type": "object",
        		"required": [
        			"start",
        			"objects"
        		],
        		"properties": {
        			"start": {
        				"type": "string",
        				"title": "Start",
        				"description": "Time of the frame, in seconds.",
        				"default": "",
        				"examples": [
        					"23.594"
        				]
        			},
        			"objects": {
        				"type": "array",
        				"title": "Objects",
        				"description": "List of instances in the frame containing text.",
        				"items": {
        					"type": "object",
        					"required": [
            					"text",
            					"vertices"
            				],
            				"properties": {
            					"text": {
            						"type": "string",
            						"title": "Text",
            						"description": "The text within the instance.",
            						"default": ""
            					},
                                "language": {
                                    "type": "string",
                                    "title": "Language",
                                    "description": "The language of the detected text, (in localized ISO 639-1 code, ex. “en-US”).",
                					"default": ""
                                },
                                "score": {
			                        "type": "object",
			                        "title": "Score",
			                        "description": "A confidence or relevance score for the entity.",
			                        "required": [
			                            "type",
			                            "scoreValue"
			                        ],
			                        "properties": {
			                            "type": {
			                                "type": "string",
			                                "title": "Type",
			                                "description": "The type of score, confidence or relevance.",
			                                "enum": [
			                                    "confidence",
			                                    "relevance"
			                                ]
			                            },
			                            "scoreValue": {
			                                "type": "number",
			                                "title": "Score value",
			                                "description": "The score value, typically a float in the range of 0-1.",
			                                "default": 0,
			                                "examples": [0.437197]
			                            }
			                        }
            					},
            					"vertices": {
            						"type": "object",
            						"title": "Vertices",
            						"description": "The top left (xmin, ymin) and bottom right (xmax, ymax) relative bounding coordinates.",
            						"required": [
            							"xmin",
            							"ymin",
            							"xmax",
            							"ymax"
            						],
            						"properties": {
            							"xmin": {
            								"type": "number",
            								"title": "Xmin",
            								"description": "The top left x coordinate.",
            								"default": 0
            							},
            							"ymin": {
            								"type": "number",
            								"title": "Ymin",
            								"description": "The top left y coordinate.",
            								"default": 0
            							},
            							"xmax": {
            								"type": "number",
            								"title": "Xmax",
            								"description": "The bottom right x coordinate.",
            								"default": 0
            							},
            							"ymax": {
            								"type": "number",
            								"title": "Ymax",
            								"description": "The bottom right y coordinate.",
            								"default": 0
            							}
            						}
            					}
            				}
        				}
        			}
        		}
        	}
        }
    }
}

Sample output

Sample Output
{
	"media": {
		"filename": "myfile.mov",
		"duration": "8334.335",
		"frameRate": 30.000,
		"frameNum": 1547,
		"resolution": {
			"width": 654,
			"height": 486
		}
	},
	"frames": [
		{
			"start": "625.024",
			"objects": [
				{
					"text": "Beliefs",
					"language": "en-US",
					"score": {
						"type": "confidence",
						"scoreValue": 0.9903119
					},
					"vertices": {
						"xmin": 219,
						"ymin": 21,
						"xmax": 219,
						"ymax": 21
					}
				}
			]
		}
	]
}
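
As a sanity check, tool outputs converted to this standard can be validated against the schema; a minimal sketch assuming the Python jsonschema package, with placeholder file names:

import json
from jsonschema import validate, ValidationError

# Placeholder file names: the schema above and a converted tool output
with open("video_ocr_schema.json") as f:
    schema = json.load(f)
with open("converted_output.json") as f:
    instance = json.load(f)

try:
    validate(instance=instance, schema=schema)
    print("Output conforms to the Video OCR schema.")
except ValidationError as err:
    print("Schema violation:", err.message)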


Recommended tool(s)

Microsoft Azure Video Indexer

Official documentation: https://api-portal.videoindexer.ai/

Basic Information

Open source or proprietary

Proprietary

Cost

$0.001/GB/month for storage in Block Blob. Video Indexer: 40 hours free (through API) for free trial account. After 40 hours, $0.15/minute. Appears to be included in Azure Unlimited Paid Account? Pricing from: https://azure.microsoft.com/en-us/pricing/details/media-services/#analytics

Input

Full list at: https://docs.microsoft.com/en-us/azure/media-services/latest/media-encoder-standard-formats

Output

JSON

Languages Supported

15 languages supported: https://docs.microsoft.com/en-us/azure/cognitive-services/language-support

Other features

"Included as part of a suite of video analysis services, including: 

Face detection

Celebrity identification

Account-based face identification

Thumbnail extraction for faces ("best face")

Visual text recognition (OCR)

Visual content moderation

Labels identification

Scene segmentation

Shot detection

Black frame detection

Keyframe extraction

Rolling credits

Animated characters detection (preview)

Editorial shot type detection

More info on how Video Indexer works in this post: https://azure.microsoft.com/en-us/blog/text-recognition-for-video-in-microsoft-video-indexer/

Custom vocabulary

no

Programming languages

API (Developer portal at: https://api-portal.videoindexer.ai/)

Training data

no

Privacy/Access

"Privacy info at: https://docs.microsoft.com/en-us/azure/media-services/video-indexer/faq

Not very clear. Link to Azure Online Services Terms document does not return the policy."

Other tech notes

API reference: https://api-portal.videoindexer.ai/docs/services/Operations/operations/Cancel-Project-Render-Operation and https://api-portal.videoindexer.ai/docs/services/Operations/operations/Get-Video-Index?


Evaluation

Input formats

Wide range of video inputs accepted. Language(s) may either be specified in API call or auto-detection may be used.

Output formats

Output in JSON as part of a larger Video Insights file, with results grouped by text, then instances by frame range, bounding coordinates, and confidence levels. Outputs language code in results.

Accuracy

see: https://docs.google.com/document/d/1CJA83fLvCABROGEtDDp78l9YcQEGouF6r3jJ5kIZSXc/edit?usp=sharing

Processing time

Hard to track using the web interface.

Computing resources required

N/A

Growth rate

N/A

Social impact

Microsoft's algorithm is proprietary. It is unknown what pre-processing steps or models are used to detect text, or how results will change over time as the algorithm and models are updated. Documentation is unclear about how input and generated data are used or accessed on Microsoft Azure servers.

Cost

$0.001/GB/month for storage in Block Blob. Video Indexer: 40 hours free (through API) for free trial account. After 40 hours, $0.15/minute. Appears to be included in Azure Unlimited Paid Account? Pricing from: https://azure.microsoft.com/en-us/pricing/details/media-services/#analytics

Support

Well-documented API.

Integration capabilities

Part of a larger suite of video analysis tools: face detection, face identification (celebrity or account-based), content moderation, object identification, scene detection, shot type detection, rolling credits, audio transcription, speaker diarization, audio effects, speaker stats, emotion detection, language detection, keyword extraction, and more

Training

There don't appear to be training capabilities available.

Installation & requirements


Parameters


Example Usage
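
Video Indexer is accessed through its REST API rather than run locally. Below is a minimal sketch of the upload-and-poll flow, assuming the endpoints documented in the developer portal linked above; the location, account ID, subscription key, and filename are placeholders.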

Microsoft Azure Video Indexer Example
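# Minimal sketch (not production code): get a token, upload a video, poll for the index/insights
import time
import requests

LOCATION = "trial"              # or your Azure region
ACCOUNT_ID = "<account-id>"     # placeholder: from your Video Indexer account settings
API_KEY = "<subscription-key>"  # placeholder: from https://api-portal.videoindexer.ai/

# 1. Get an account access token
token = requests.get(
    f"https://api.videoindexer.ai/Auth/{LOCATION}/Accounts/{ACCOUNT_ID}/AccessToken",
    params={"allowEdit": "true"},
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
).json()

# 2. Upload the video for indexing
upload = requests.post(
    f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}/Videos",
    params={"accessToken": token, "name": "myfile.mov", "language": "en-US"},
    files={"file": open("myfile.mov", "rb")},
).json()
video_id = upload["id"]

# 3. Poll until processing finishes; the returned index (insights) includes the OCR results
while True:
    index = requests.get(
        f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}/Videos/{video_id}/Index",
        params={"accessToken": token},
    ).json()
    if index["state"] in ("Processed", "Failed"):
        break
    time.sleep(30)

# OCR text instances are under the video's insights (e.g. index["videos"][0]["insights"]["ocr"])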
 

Example Output

Microsoft Azure Video Indexer Output
..."keywords":[
     {
         "isTranscript":false,
         "id":1,
         "name":"david neely",
         "appearances":[
          {
             "startTime":"0:00:03.18",
             "endTime":"0:00:15.752",
            "startSeconds":3.2,
            "endSeconds":15.8
           },
           {
              "startTime":"0:01:32.96",
               "endTime":"0:01:39.499",
               "startSeconds":93,
                "endSeconds":99.5
           }
       ]
   },...

Tesseract + FFmpeg

Official documentation: https://github.com/tesseract-ocr/tesseract

Basic information


Open source or proprietary

Open source

Cost

Local compute/storage

Input

BMP, PNM, PNG, JFIF, JPEG, and TIFF

Output

Determined by script

Languages Supported

Packages for over 130 languages and over 35 scripts are also available directly from the Linux distributions.

Other features


Custom vocabulary

yes

Programming languages

Wrappers in almost every language

Training data

yes

Privacy/Access

N/A

Other tech notes


Evaluation

Input formats

Frames must be extracted as images using FFmpeg or another tool, and an optimal sampling rate (every nth frame) must be selected for analysis. Managing images while analyzing may take more time and care than running a video through a cloud service. Supports multiple language detection, but languages must be identified in advance and the respective libraries downloaded.

Output formats

Output is flexible and designed by the user writing the script. That said, it will take some effort to design and script an appropriate output format. Languages can be identified in output.

Accuracy

see: https://docs.google.com/document/d/1CJA83fLvCABROGEtDDp78l9YcQEGouF6r3jJ5kIZSXc/edit?usp=sharing

Processing time

Highly variable; between 0.5x and 3x real-time, with an average of 1.8x (for the whole process: saving frames, running OCR, and writing output).

Computing resources required


Growth rate


Social impact

The source of the training data is not clear, but Tesseract is trainable, either by building on existing models or by starting from scratch. Like all other video OCR tools/services, consideration should be given to the risk involved in false positives and false negatives in the output.

Cost

Local compute and storage costs

Support

Extensive documentation on the Tesseract website and large user community with blogs and info on how to use and train.

Integration capabilities


Training

Font training capabilities, custom dictionaries, zonal OCR.

Installation & requirements

We've used the pytesseract wrapper, which can be installed with pip. This also requires the tesseract-ocr and libtesseract-dev packages, which can be installed with Linux package managers. See the official install guide for more information.
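
A minimal install sketch for a Colab or Debian-style environment, using the package names mentioned above:

# System packages for Tesseract, then the Python wrapper
!apt-get install -y tesseract-ocr libtesseract-dev
!pip install pytesseract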

Parameters

ffmpeg takes a video and a folder path for storing images (which can later be deleted)

tesseract takes a path to an image

Example Usage

Tesseract does not handle video by default. What we've done is use FFmpeg to save frames as images, then run Tesseract on the images and save the results according to our JSON schema. Below are the basics of saving frames with FFmpeg and running OCR using pytesseract, but the full code can be viewed and run on Google Colaboratory here. The Colab code includes generating short samples from our full videos for testing and adjusting for the time offset that sampling creates (i.e., aligning the sample results with the full-video timestamps).

Tesseract + FFmpeg Example
# Save 2 frames for every second using ffmpeg
name = s["filename"]  # s: metadata dict for the current clip (defined in the full Colab notebook)
!mkdir "temp/{name}"
!ffmpeg -i "../Clips/{name}" -an -vf fps=2 "temp/{name}/frame_%05d.jpg"


from PIL import Image
import pytesseract
from pytesseract import Output

# One dict per extracted frame; start_time is derived from the frame number and the fps used above
frame = {
    "start": str(start_time),
    "boundingBoxes": []
}

img = "path/to/image"  # an extracted frame inside `directory` ("temp/{name}")

# Run OCR
result = pytesseract.image_to_data(Image.open(directory + "/" + img), output_type=Output.DICT)

# For every result, make a box & add it to the list of boxes for this frame
for i in range(len(result["text"])):
    if result["text"][i].strip():  # if the text isn't empty/whitespace
        box = {
            "text": result["text"][i],
            "score": {
                "type": "confidence",
                "scoreValue": result["conf"][i]  # Tesseract word confidences range 0-100
            },
            # relative coords: pixel values divided by the frame dimensions
            # (output: the full result dict, including media resolution; see schema above)
            "vertices": {
                "xmin": result["left"][i] / output["media"]["resolution"]["width"],
                "ymin": result["top"][i] / output["media"]["resolution"]["height"],
                "xmax": (result["left"][i] + result["width"][i]) / output["media"]["resolution"]["width"],
                "ymax": (result["top"][i] + result["height"][i]) / output["media"]["resolution"]["height"]
            }
        }
        frame["boundingBoxes"].append(box)

Example Output

Tesseract outputs the following for a single image. We care about left, top, width, height, text, and conf.

Tesseract Output
{
'level': [1, 2, 3, 4, 5, 4, 5, 5, 5, 4, 5, 5, 2, 3, 4, 5, 5, 2, 3, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 2, 3, 4, 5, 4, 5], 
'page_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
'block_num': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4], 
'par_num': [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1], 
'line_num': [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 0, 0, 1, 1, 2, 2], 
'word_num': [0, 0, 0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 0, 0, 1, 0, 1], 
'left': [0, 152, 152, 567, 693, 347, 347, 483, 610, 152, 152, 287, 172, 172, 172, 172, 664, 173, 124, 262, 262, 173, 173, 1133, 195, 195, 533, 241, 241, 384, 592, 259, 259, 419, 576, 939, 604, 604, 640, 934, 1003, 472, 472, 540, 540, 472, 472], 
'top': [0, 82, 49, 82, 82, 108, 128, 128, 108, 198, 215, 198, 257, 257, 257, 257, 332, 346, 346, 397, 397, 346, 415, 346, 434, 447, 434, 493, 497, 507, 493, 464, 499, 502, 464, 528, 598, 618, 598, 616, 624, 74, 74, 74, 74, 464, 464], 
'width': [1280, 564, 564, 149, 23, 367, 27, 101, 104, 401, 109, 205, 502, 502, 502, 365, 10, 983, 1032, 40, 40, 983, 118, 23, 960, 101, 49, 352, 13, 2, 1, 837, 93, 17, 82, 157, 401, 12, 53, 5, 2, 687, 687, 291, 291, 687, 687], 
'height': [720, 248, 281, 42, 36, 116, 94, 96, 84, 132, 66, 132, 103, 103, 103, 103, 15, 308, 308, 11, 11, 101, 21, 101, 78, 65, 39, 19, 12, 5, 3, 175, 17, 55, 175, 24, 56, 19, 56, 22, 2, 646, 646, 279, 279, 256, 256], 
'conf': ['-1', '-1', '-1', '-1', 82, '-1', 0, 19, 7, '-1', 21, 37, '-1', '-1', '-1', 28, 24, '-1', '-1', '-1', 17, '-1', 13, 0, '-1', 33, 21, '-1', 5, 0, 74, '-1', 21, 84, 10, 6, '-1', 19, 17, 13, 0, '-1', '-1', '-1', 95, '-1', 95], 
'text': ['', '', '', '', '\\', '', 'N', 'Bi', 'Be', '', 'ya', 'NR', '', '', '', 'PE', 'ae', '', '', '', 'oe', '', 'ana', '\\', '', 'ak', 'rd', '', 'a', '.', '{', '', 'eels', '\\', 'ig', 'ee', '', 'lf', 'ys', "'", ':', '', '', '', '  ', '', '']
}

Other evaluated tools

Google Video Intelligence Text Detection

Official documentation: https://cloud.google.com/video-intelligence/docs/text-detection

Basic information

Open source or proprietary

Proprietary

Cost

First 1000 minutes free/month. Additional $0.15/minute after first 1000 minutes.

Input

.MOV, .MPEG4, .MP4, and .AVI

Output

JSON

Languages Supported

https://cloud.google.com/vision/docs/languages

Languages can be specified in the languageHints parameter, otherwise default language detection is used.

Other features

startTimeOffset and endTimeOffset can be specified to detect text in just a segment of the video.

Custom vocabulary

no

Programming languages

API

Training data

no

Privacy/Access

Data usage terms at: https://cloud.google.com/video-intelligence/docs/data-usage

Input data appears not to be used by Google's services. There is no tiered opt-in/opt-out pricing like STT.

Other tech notes

API reference: https://cloud.google.com/video-intelligence/docs/reference/rest/ and https://cloud.google.com/video-intelligence/docs/text-detection
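
For reference, a minimal request sketch assuming the google-cloud-videointelligence Python client; the Cloud Storage URI and language hint are placeholders.

from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

# Restrict detection to known languages via language hints (optional)
context = videointelligence.VideoContext(
    text_detection_config=videointelligence.TextDetectionConfig(language_hints=["en-US"])
)

# Placeholder Cloud Storage URI; files must be uploaded to Google storage for processing
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.TEXT_DETECTION],
        "input_uri": "gs://my-bucket/myfile.mp4",
        "video_context": context,
    }
)
result = operation.result(timeout=600)

# Results are grouped by text, then by segment with confidence and per-frame bounding boxes
for annotation in result.annotation_results[0].text_annotations:
    print(annotation.text)
    for segment in annotation.segments:
        print("  confidence:", segment.confidence)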

Evaluation


Input formats

Wide range of video inputs accepted. Language(s) may either be specified in API call or auto-detection may be used.

Output formats

Output in JSON with results grouped by text, then instances by frame range, bounding coordinates, and confidence levels. Requires scripting to convert to alternate formats. Language is not denoted in output.

Accuracy

see: https://docs.google.com/document/d/1CJA83fLvCABROGEtDDp78l9YcQEGouF6r3jJ5kIZSXc/edit?usp=sharing

Processing time

Processing took about 1 minute per 5-minute test clip.

Computing resources required

N/A

Growth rate

N/A

Social impact

Google's algorithm is proprietary. It is unknown what pre-processing steps or models are used to detect text, or how results will change over time as the algorithm and models are updated. Currently, Google does not appear to offer a tiered opt-in/opt-out service similar to STT pricing, and the terms state that input is not used by Google, but the privacy policy should be checked regularly, as terms may change without notice.

Cost

First 1000 minutes free/month. Additional $0.15/minute after first 1000 minutes.

Support

Well-documented API.

Integration capabilities

Google requires that files be uploaded to Google storage for processing, so if other Google services are also being used, running multiple services per file upload could be both cost- and time-efficient.

Training

The only customization that appears to be available is specifying the languages present in the video. AutoML is not yet available for video OCR.

