- Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006.
- It takes images as input and outputs the text it recognizes in them.
- It has been added as a tool in AMP's Galaxy, where it performs OCR on input videos.
- This is achieved by running FFmpeg before Tesseract. The video is first passed through FFmpeg, which extracts image frames at 0.5-second intervals throughout the duration of the video. These frames are then passed as input to Tesseract.
- The output produced by this composite video OCR tool is a JSON file containing the recognized text and the corresponding bounding box information for each frame of the input.
- Source Code
- galaxy/tools/tesseract.xml : This is the configuration file that describes the tool's usage, inputs, outputs, version, and other settings.
- galaxy/tools/run-tesseract.py : This is a Python wrapper that runs FFmpeg on the input video to create frames. The frames are then passed to Tesseract, which runs the OCR and produces a JSON output. The JSON output contains all the text predictions, with their corresponding bounding box coordinates, for all the frames.
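The frame-extraction step can be sketched roughly as follows. This is a minimal illustration, not the code in run-tesseract.py: the function names, the `frame_%06d.png` output pattern, and the choice of FFmpeg's `fps` filter are all assumptions. A 0.5-second interval corresponds to 2 frames per second.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_dir, interval_seconds=0.5):
    """Build an FFmpeg command that writes one frame every `interval_seconds`."""
    fps = 1.0 / interval_seconds  # 0.5 s between frames -> 2 frames per second
    # Hypothetical naming scheme for the extracted frames.
    output_pattern = str(Path(frames_dir) / "frame_%06d.png")
    return [
        "ffmpeg",
        "-i", str(video_path),   # input video
        "-vf", f"fps={fps:g}",   # resample the video to the target frame rate
        output_pattern,
    ]


def extract_frames(video_path, frames_dir, interval_seconds=0.5):
    """Run FFmpeg and return the extracted frame paths in order."""
    Path(frames_dir).mkdir(parents=True, exist_ok=True)
    command = build_ffmpeg_command(video_path, frames_dir, interval_seconds)
    subprocess.run(command, check=True)  # raises if FFmpeg exits with an error
    return sorted(Path(frames_dir).glob("frame_*.png"))
```

Keeping the command construction separate from the `subprocess` call makes the command easy to inspect or log before anything is executed.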
$ sudo apt-get install ffmpeg
$ pip install pytesseract
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev
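With these dependencies installed, OCR on a single frame might look like the sketch below. It assumes `pytesseract.image_to_data` with `Output.DICT`, which returns parallel lists (`text`, `conf`, `left`, `top`, `width`, `height`, and others). The helper names and the shape of the returned word records are hypothetical and are not taken from run-tesseract.py.

```python
def words_from_ocr_data(data, min_confidence=0.0):
    """Convert a pytesseract image_to_data dict into a list of word boxes."""
    words = []
    for i, text in enumerate(data["text"]):
        # Tesseract reports a confidence of -1 for non-word layout entries.
        confidence = float(data["conf"][i])
        if not text.strip() or confidence < min_confidence:
            continue
        words.append({
            "text": text,
            "confidence": confidence,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
        })
    return words


def ocr_frame(image_path):
    """Run Tesseract on one frame image and return its word boxes."""
    # Imported here so the helper above can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(
            image, output_type=pytesseract.Output.DICT
        )
    return words_from_ocr_data(data)
```

Calling `ocr_frame` on an extracted frame image would then return one record per recognized word, with its pixel bounding box.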
- The tool can be invoked from the Galaxy UI like any other tool. The user needs to supply the input data as a video file.
- Input File: the video file to be passed through the OCR.
- $json_file: the output JSON file containing the text recognized by the OCR.
- JSON file: It contains the OCR output, with all the text recognized in each frame and the corresponding bounding boxes. It also includes other information such as frame rate and resolution.
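To make the shape of such an output concrete, here is a small sketch of how per-frame results could be assembled and read back. The exact schema is not documented here, so every key name below (`frame_rate`, `resolution`, `frames`, `start`, `words`) is hypothetical, and `frame_rate` is assumed to mean the rate at which frames were sampled.

```python
import json


def build_ocr_document(frames, frame_rate, width, height):
    """Assemble per-frame OCR results into one JSON-serializable dict.

    All key names are hypothetical; the real tool's schema may differ.
    """
    return {
        "frame_rate": frame_rate,
        "resolution": {"width": width, "height": height},
        "frames": [
            # Start time in seconds, assuming frames are evenly spaced.
            {"start": index / frame_rate, "words": words}
            for index, words in enumerate(frames)
        ],
    }


def texts_by_frame(document):
    """Return (start_time, joined_text) pairs for each frame."""
    return [
        (frame["start"], " ".join(word["text"] for word in frame["words"]))
        for frame in document["frames"]
    ]


sample = build_ocr_document(
    frames=[
        [{"text": "Hello", "left": 10, "top": 20, "width": 30, "height": 12}],
        [],  # a frame in which no text was recognized
    ],
    frame_rate=2,  # one frame every 0.5 seconds
    width=1280,
    height=720,
)
print(json.dumps(sample, indent=2))
```

Storing the frame rate and resolution once at the top level, instead of repeating them on every frame, keeps the per-frame entries small.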
More information about Tesseract is here.