There are four Human MGMs (Transcript, Segmentation, NER, OCR), which all share a Python library for handling HMGM-related configuration, parsing context information, setting up the input/output files passed to and from the external HMGM editors (accessible via the AMPPD UI), and creating/completing tasks on various task management platforms. We have currently implemented Jira as one task management option; the other options include open source platforms such as OpenProject and Redmine. They can be implemented by extending the general TaskManager class.
- Shared Python library: under galaxy/tools/hmgm
- hmgm_main.py: the main module, containing the Python script (main method) called by all HMGM tool wrappers. It also contains the helper methods called by the main method.
- task_manager.py: the abstract superclass TaskManager, which defines the API for all subclasses; each subclass shall implement the methods to create/close tasks.
- task_jira.py: TaskJira is a subclass of TaskManager with the implementation for the Jira platform.
- task_openproject.py: TaskOpenproject is a subclass of TaskManager with the implementation for the OpenProject platform.
- task_redmine.py: TaskRedmine is a subclass of TaskManager with the implementation for the Redmine platform.
- Tool wrappers: under galaxy/tools/hmgm (Note: all HMGM tools shall have their tool ID prefixed with hmgm_)
- hmgm_transcript.xml: for correcting STT transcripts using the BBC Transcript Editor
- hmgm_segmentation.xml: for correcting segmentation using the Metadata Structural Editor
- hmgm_ner.xml: for correcting named entity recognition using Timeliner
- hmgm_ocr.xml: for correcting optical character recognition using
- There is also a sample HMGM (hmgm_sample.py and hmgm_sample.xml), which is meant only for dev testing of the integration with AMPPD and the Galaxy HMGM Job Runner. It should not be included in any user-created workflow.
- JSON converters: under galaxy/tools/amp_json_schema/. Most of the HMGM editors use a JSON format for their input/output that differs from the AMP standard JSON, so the files passed to and from these editors must be converted before and after editing.
- bbcEditor_to_schema.py: convert BBC Transcript Editor output DraftJs to AMP standard transcript JSON
- ner_to_iiif.py: convert AMP standard NER JSON to IIIF manifest
- iiif_to_ner.py: convert IIIF manifest to AMP standard NER JSON
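The relationship between TaskManager and its platform subclasses described above can be sketched roughly as follows. This is only an illustration: the method names create_task and close_task, their parameters, and the TaskInMemory class are assumptions made for this example, and the actual signatures in task_manager.py may differ.

```python
from abc import ABC, abstractmethod


class TaskManager(ABC):
    """Abstract API that every task platform implementation must provide."""

    @abstractmethod
    def create_task(self, task_type, context, editor_input):
        """Create a task on the platform and return its info (id, key, url)."""

    @abstractmethod
    def close_task(self, task_info):
        """Mark the task described by task_info as closed."""


class TaskInMemory(TaskManager):
    """Hypothetical stand-in platform that keeps tasks in a dict (illustration only)."""

    def __init__(self):
        self.tasks = {}

    def create_task(self, task_type, context, editor_input):
        task_id = len(self.tasks) + 1
        self.tasks[task_id] = {
            "type": task_type,
            "context": context,
            "input": editor_input,
            "open": True,
        }
        # The same three pieces of task information that are recorded in the task JSON file.
        return {"id": task_id, "key": f"HMGM-{task_id}", "url": f"memory://tasks/{task_id}"}

    def close_task(self, task_info):
        self.tasks[task_info["id"]]["open"] = False
```

A real subclass such as TaskJira would replace the dict operations with calls to the platform's API, while the HMGM main script only depends on the two abstract methods.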
Process flow - how HMGMs are invoked and what happens after that
- User creates a workflow in Galaxy including HMGMs as some of the steps.
- User invokes the workflow from AMPPD UI.
- AMPPD checks the workflow to see if any HMGM is involved and, if so, generates context information for each HMGM as a parameter, then sends a request to Galaxy to run the workflow.
- Galaxy executes the workflow, and when it hits an HMGM step, it invokes the HMGM tool.
- The HMGM tool then creates a task in the task platform specified in the context. The task information (ID, key, URL) is recorded in a task JSON file as one of the HMGM outputs; the input JSON file is also copied into a designated location that the task editor can access. If the HMGM editor takes a different input format than what the previous MGM tool outputs, conversion is done before the JSON input is fed to the editor.
- The Galaxy HMGM job runner then puts the HMGM tool into waiting status to free up job worker resources, and schedules it to run again in a cycle.
- Meanwhile, the task URL is accessible to authorized users. The task page shows the specification of the task, such as the task type and description, which in turn includes various information passed down in the task context. In particular, it includes a link to the editor that specifies the location of the input JSON file as well as the location of the associated primaryfile media.
- The task assignee can then click the above editor URL, which will open up the AMPPD UI page with the editor embedded. In the case of transcript correction, the BBC transcript editor will be presented, and the transcript JSON and the primaryfile media will be accessible in the editor.
- The assignee can use the editor to edit the transcript and play the media as a reference. Upon completion, they can click the "complete" button, at which point the editor saves the output JSON file into the same designated location, and control goes back to the HMGM tool.
- Once the HMGM tool gets to run again, it checks whether the output JSON file exists; if so, it moves the file back to the Galaxy output file location and closes the task. If the HMGM editor produces a different output format than what the next MGM tool takes as input, conversion is done before the JSON output is copied back to Galaxy.
- The HMGM job then completes and Galaxy continues the workflow into the next step.
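The create-then-poll behavior of the HMGM tool in the steps above can be sketched as a single function that the job runner would call once per cycle. This is a minimal sketch under stated assumptions, not the actual hmgm_main.py logic: the function name run_hmgm_cycle, the ".complete" suffix for the editor output, the use of the staged input file as the "task already created" marker, and the create_task/close_task callables are all hypothetical. Format conversion is left out.

```python
import json
import os
import shutil


def run_hmgm_cycle(create_task, close_task, io_dir, input_json, output_json, task_json, context):
    """Run one cycle; return True when the HMGM is complete, False if still waiting."""
    name = os.path.basename(input_json)
    editor_input = os.path.join(io_dir, name)                 # file the editor reads
    editor_output = os.path.join(io_dir, name + ".complete")  # hypothetical naming

    if not os.path.exists(editor_input):
        # First run: stage the input for the editor, create the task, record its info.
        shutil.copy(input_json, editor_input)
        task_info = create_task(context, editor_input)
        with open(task_json, "w") as f:
            json.dump(task_info, f)
        return False

    if os.path.exists(editor_output):
        # Editing is done: hand the result back to Galaxy and close the task.
        shutil.move(editor_output, output_json)
        with open(task_json) as f:
            close_task(json.load(f))
        return True

    # The editor has not saved any output yet; wait for the next cycle.
    return False
```

Returning a boolean keeps the sketch independent of how the real job runner learns that a job is still waiting, which the text above does not specify.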
Context information is transparent to users. It is a JSON string containing the following fields (note that the field names are used internally and interpreted by HMGM; their labels and values are displayed with corresponding text on the Jira page):
- primaryfileUrl (URL for the media file)
- primaryfileMediaInfo (local path to media info JSON file)
The taskPlatform is defined in the collection to which the item-primaryfile belongs.
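As an illustration of how such a context string might be consumed, the sketch below parses it and checks for the fields named above. The function name parse_context, the choice to treat both listed fields as required, the presence of taskPlatform as a key in the context, and all example values are assumptions for this example.

```python
import json

# Fields listed above; treating them as required is an assumption of this sketch.
REQUIRED_FIELDS = ("primaryfileUrl", "primaryfileMediaInfo")


def parse_context(context_json):
    """Parse the context JSON string and check that the expected fields are present."""
    context = json.loads(context_json)
    missing = [name for name in REQUIRED_FIELDS if name not in context]
    if missing:
        raise ValueError("context is missing fields: " + ", ".join(missing))
    return context


# Example values are made up for illustration.
example = json.dumps({
    "primaryfileUrl": "https://amp.example.org/primaryfiles/1/media",
    "primaryfileMediaInfo": "/srv/amp/example/mediainfo.json",
    "taskPlatform": "Jira",
})
context = parse_context(example)
print(context["taskPlatform"])
```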
- Set up dependencies:
- In order to run hmgm_main.py, one needs to install the Jira package for Python 3: pip3 install --user jira
- The script also expects the temporary input/output directory to exist on the local file system
- On potassium the directory is defined in /srv/amp/config/hmgm.ini: section amppd, property io_dir, which points to /srv/amp/galaxy_logs/hmgm.
- For dev, copy the above file from potassium into your local galaxy/config directory, modify the properties to point to your local amppd-ui server and directory, and create the corresponding directory on your local file system.
- Note: Never commit hmgm.ini to the repository. It contains security info.
- To run the HMGM python script hmgm_main.py:
- hmgm_main.py task_type root_dir input_json output_json task_json context_json
- Parameters passed to hmgm_main.py
- task_type: type of HMGM task (Transcript, NER, Segmentation, OCR); there is one HMGM wrapper per type
- root_dir: path to the Galaxy root directory; HMGM property files, logs and tmp files are relative to the root_dir
- input_json: input file for the HMGM task, in JSON format
- output_json: output file for the HMGM task, in JSON format
- task_json: JSON file storing information about the HMGM task, such as the ticket number
- context_json: context info, as a JSON string, needed for creating HMGM tasks
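The command line and the io_dir lookup described above could be handled along the following lines. This is a sketch only: whether hmgm_main.py uses argparse, the helper names parse_args and load_io_dir, and the assumption that hmgm.ini sits at config/hmgm.ini under root_dir are not confirmed by this page. The section name amppd and property io_dir are taken from the setup notes above.

```python
import argparse
import configparser
import os


def parse_args(argv):
    """Parse the six positional parameters in the order listed above."""
    parser = argparse.ArgumentParser(prog="hmgm_main.py")
    parser.add_argument("task_type", choices=["Transcript", "NER", "Segmentation", "OCR"])
    parser.add_argument("root_dir")
    parser.add_argument("input_json")
    parser.add_argument("output_json")
    parser.add_argument("task_json")
    parser.add_argument("context_json")
    return parser.parse_args(argv)


def load_io_dir(root_dir):
    """Read io_dir from section [amppd] of <root_dir>/config/hmgm.ini (assumed location)."""
    path = os.path.join(root_dir, "config", "hmgm.ini")
    config = configparser.ConfigParser()
    if not config.read(path):
        # ConfigParser.read silently skips missing files, so fail explicitly.
        raise FileNotFoundError(path)
    return config["amppd"]["io_dir"]
```

For example, parse_args(sys.argv[1:]) followed by load_io_dir(args.root_dir) would give the script both its parameters and the temporary input/output directory it expects to exist.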