The Audiovisual Metadata Platform, or AMP, is a software platform that aims to generate metadata for digitized and born-digital audiovisual materials using a combination of machine learning models and human intervention. It was created in collaboration between Indiana University, AVP, the University of Texas at Austin, and the New York Public Library. It is funded by a grant from the Mellon Foundation.
This guide explains AMP's front-end features and functionality and shows how to navigate the user interface. It is designed for users of AMP: collection managers, catalogers, or any other person using the platform. The guide reflects AMP as it exists in its pilot stage; some of this information is very likely to change over time.
These definitions are provided to make reading this guide easier.
- Collection: A set of items that are subject to the same access control settings
- Bundle: A set of items from a collection or multiple collections. The bundle gathers items that the user wants to submit through a workflow at the same time
- Item: A bibliographic item. It contains metadata and A/V binary content. Items belong to a collection
- File: A file is a media file (sound recording, moving image) that is part of an item. Multiple related files can exist in one item
- Primary File: A binary object that is provided as a primary resource by a collection-holding institution
- Supplemental File: Any file that is provided to supplement the information about a collection, an item, or a primary file
- Workflow: A representation of a graph that describes the routing rules for a set of MGMs. The input of a workflow may be an item or a group of items
- Metadata Generation Mechanism (MGM): A machine learning tool or other tool (e.g., automated non-machine-learning tools like ffmpeg, or manual tools like a transcript editor) provided to users to interact with AMP
- Job: One execution of a workflow for a particular Primary File
- Unit: A tenant in a multi-tenant AMP; collections belong to a Unit
Currently, there are two ways to access AMP: https://amppd.dlib.indiana.edu, and http://calcium.dlib.indiana.edu:8500/#/, which requires a connection to the IU VPN if not connected to one of IU's networks on campus. The former method is preferable, as it allows users not affiliated with IU to access it more easily.
If this is your first time using AMP (or if you are not currently logged in), the URL for AMP will direct you to the login page. Before logging in for the first time, you will need to create an account and then verify it. Currently, the email verification link is not sent until AMPPD staff approve the account, so for the short term you may need to wait until they are able to respond. Once your account has been verified, you should be able to log in with the email and password you specified when creating your account.
Uploading Files via Batch Ingest
(Note: the following borrows significantly from Maria Whitaker's documentation on creating a batch manifest file; see more at Batch Manifest)
The Batch Ingest feature is how files are uploaded to AMP. To use it, you will need to create a batch manifest in CSV format and then upload it to AMP on the Batch Ingest page. All the files in the batch manifest must first be uploaded to their respective collection's subfolder in the dropbox via an SFTP client; the Batch Ingest will fail if a file is not found in the expected dropbox subfolder. Detailed instructions for how to upload files to a dropbox using a tool like Cyberduck or WinSCP can be found here. Cyberduck is highly recommended, as it has built-in integration with Google Drive.
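Because a batch fails when a listed file is missing from its collection's dropbox subfolder, it can help to compare the manifest against a listing of the dropbox before uploading the manifest. The following is a minimal sketch, not part of AMP: the function name, the column header strings ("Collection name", "Primary File"), and the shape of the `dropbox_contents` listing are all assumptions made for illustration. You would build the listing yourself, for example from your SFTP client.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_missing_files(
    manifest_rows: Iterable[Dict[str, str]],
    dropbox_contents: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (collection, filename) pairs named in the manifest but absent from the dropbox.

    manifest_rows: one dict per manifest line (hypothetical header names).
    dropbox_contents: maps a collection name to the filenames in its subfolder.
    """
    missing = []
    for row in manifest_rows:
        collection = row["Collection name"]
        filename = row["Primary File"]
        # A collection with no listing at all is treated as an empty subfolder.
        if filename not in dropbox_contents.get(collection, set()):
            missing.append((collection, filename))
    return missing


if __name__ == "__main__":
    rows = [
        {"Collection name": "Oral Histories", "Primary File": "interview_01.wav"},
        {"Collection name": "Oral Histories", "Primary File": "interview_02.wav"},
    ]
    listing = {"Oral Histories": {"interview_01.wav"}}
    print(find_missing_files(rows, listing))
```

An empty result means every primary file in the manifest was found in the listing you supplied; it says nothing about whether AMP itself will accept the batch.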
Batch manifests must conform to a specific format for the batch to be properly ingested. The following are the column names required for a batch manifest (note that some of these are optional):
Collection name (required)
The collection must already exist in AMP, and collection names must match exactly. The name of the collection is used to determine:
- which collection dropbox holds the media files to be uploaded
- to which collection in AMP the information in that line refers.
Collection names cannot be changed by end users; however, AMPPD staff can change them.
External Source (optional, strongly recommended if using External Item ID)
This field is optional. It is used to tell AMP the source system of the item. This information is added by AMP in the bag it provides with AMP-generated metadata for target systems to consume. If External Item ID is being used, it is strongly recommended to provide an item's External Source, as it allows items in the same collection from multiple sources to have the same External Item ID.
External Item ID (optional, recommended)
This field is optional, although recommended. When provided, it is used as the unique identifier of the item during the batch ingest process (if not, Item Title is the unique identifier). This allows items within the same collection to have the same Title (this is a relatively common occurrence, at least in some of the pilot collections).
Item Title (required)
The bibliographic title of the item. If the External Item ID is not provided, this field serves as the unique identifier for an item in a collection; in this case, one cannot have multiple items with the same title within a collection.
Item Description (optional)
When provided, this field will be displayed as the item description in AMP.
Primary File (required)
This is the file name of the media file that has been placed in the dropbox for ingestion. A primary file is uniquely identified by the combination of Collection, Item/External ID, and Primary File Label. File names are unique within an item, but not necessarily across items (that is, two distinct items can have the same primary filename).
Warning: if a batch manifest includes two lines with the same value in the Primary File column for different items in the same collection, the validation step will let it pass, but the ingestion process will have a runtime error because in the dropbox there can only be one copy of a file with that name. To resolve this conflict, ingest the primary files with conflicting names using separate batches.
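Since validation does not catch this conflict, a check before upload can flag it. The sketch below is an illustration only and is not AMP's own validation logic: the header strings and the `item_key` helper are assumptions. It identifies an item by External Source plus External Item ID when an ID is given, and by Item Title otherwise, which is one reading of the field descriptions above.

```python
from collections import defaultdict


def item_key(row):
    """Identify an item within its collection (hypothetical helper)."""
    ext_id = row.get("External Item ID", "").strip()
    if ext_id:
        return ("id", row.get("External Source", "").strip(), ext_id)
    return ("title", row["Item Title"].strip())


def find_filename_conflicts(rows):
    """Map (collection, filename) to the items sharing it, when more than one item does."""
    owners = defaultdict(set)
    for row in rows:
        owners[(row["Collection name"], row["Primary File"])].add(item_key(row))
    return {key: sorted(items) for key, items in owners.items() if len(items) > 1}


if __name__ == "__main__":
    rows = [
        {"Collection name": "C1", "Item Title": "Tape A", "Primary File": "tape.wav"},
        {"Collection name": "C1", "Item Title": "Tape B", "Primary File": "tape.wav"},
    ]
    for (collection, filename), items in find_filename_conflicts(rows).items():
        print(f"{filename} in {collection} is used by {len(items)} items: {items}")
```

Any filename reported here would need to be split across separate batches, as described in the warning.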
Primary File Label (required)
Users must provide a label (or title) for the file. That label will be used to uniquely identify the file within this item. In other words, one cannot have multiple primary files with the same label if they are associated with the same item within the same collection.
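The label rule can be checked the same way before a manifest is uploaded. This is a rough sketch under stated assumptions: the header strings are hypothetical, and the item is identified by External Item ID when present and by Item Title otherwise. AMP's real validation may differ.

```python
from collections import Counter


def duplicate_labels(rows):
    """Return (collection, item identifier, label) triples that occur more than once."""

    def key(row):
        # Use External Item ID when given, otherwise fall back to Item Title.
        item = row.get("External Item ID", "").strip() or row["Item Title"].strip()
        return (row["Collection name"], item, row["Primary File Label"].strip())

    counts = Counter(key(row) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


if __name__ == "__main__":
    rows = [
        {"Collection name": "C1", "Item Title": "Tape A", "Primary File Label": "Side 1"},
        {"Collection name": "C1", "Item Title": "Tape A", "Primary File Label": "Side 1"},
        {"Collection name": "C1", "Item Title": "Tape B", "Primary File Label": "Side 1"},
    ]
    print(duplicate_labels(rows))
```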
Primary File Description (optional)
This field may be left blank. When provided, it will be displayed as the file description in AMP.
Supplemental File fields:
Multiple supplemental files can be specified per line. For each supplemental file, you need four fields:
- Supplemental File Type - The user must specify if they want to place the file at the Collection, Item, or Primary File level
- Supplemental File - the filename of the binary file (which needs to be found in the same dropbox as the media files)
- Supplemental File Label - the user must provide a label (or title) for the supplemental file. That label will be used to uniquely identify the file in association with the item. One cannot have multiple supplemental files with the same label associated with the same item within the same collection.
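A manifest with the item and primary file columns described above can be generated with Python's standard `csv` module. This is a minimal sketch, not an official template: the exact header strings and column order AMP expects are assumptions here (check them against the Batch Manifest documentation), the supplemental file columns are left out, and `build_manifest` is a hypothetical helper name.

```python
import csv
import io

# Assumed header strings, taken from the field names used in this guide.
COLUMNS = [
    "Collection name",
    "External Source",
    "External Item ID",
    "Item Title",
    "Item Description",
    "Primary File",
    "Primary File Label",
    "Primary File Description",
]
REQUIRED = ["Collection name", "Item Title", "Primary File", "Primary File Label"]


def build_manifest(rows):
    """Return CSV text for the given rows, refusing rows that lack a required value."""
    for number, row in enumerate(rows, start=1):
        empty = [col for col in REQUIRED if not row.get(col, "").strip()]
        if empty:
            raise ValueError(f"row {number} is missing required values: {empty}")
    buffer = io.StringIO()
    # restval="" leaves optional columns blank when a row omits them.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    text = build_manifest([
        {
            "Collection name": "Oral Histories",
            "Item Title": "Interview",
            "Primary File": "interview_01.wav",
            "Primary File Label": "Side A",
        }
    ])
    print(text)
```

To save the result, write the returned text to a file opened with `newline=""` so that the line endings produced by the `csv` module are kept unchanged.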
Once files have been ingested, they can be submitted to a workflow. Currently, only primary files can be submitted to workflows; support for submitting supplemental files is expected soon. To use the Workflow Submissions feature, first search for the item(s) you wish to submit to a workflow. Only items that have been submitted to AMP via the Batch Ingest feature will appear in the search results. The search feature allows users to limit search results by media type (audio, video, or other). The search results appear as items; the individual files contained in each item are visible via a dropdown menu on the item, with an "add file" button to the right of each filename. For convenience, each item has an "add all files" button that adds every file to the "Selected Files" box. Once files have been added, they can either be submitted directly to a workflow or saved as a bundle. Saving a group of files as a bundle is especially helpful when working with a large number of files, as the whole bundle can be submitted at once.
To select a workflow, click on the "Select Workflow" dropdown menu, which will provide a list of available workflows running on Galaxy. Once the workflow has been selected, the "submit to workflow" button will become enabled, allowing you to submit the files. Once one or more files have been submitted, AMP will display a message telling you how many jobs were successfully submitted to the workflow and how many failed. If any files fail, the message will also display detailed information about them, including the collection name, item name, file ID, filename, and file label. In that case, please let AMPPD staff know, providing the information given by the error message.
There are presently five working workflows: Transcript-NER-HMGM, Transcript-NER-no Human MGM, NER HMGM for Corrected Transcripts, Scene Detection with Contact Sheets, and Contact Sheets Only. The primary difference between the first two is that the former has human intervention at several steps to improve the quality of the final deliverables. These workflows achieve two primary goals: generating a transcript (whether the transcript is human-edited or not), and recognizing named entities (people, places, etc.) in that transcript. The third, related to the first two, skips the transcript steps entirely, going directly to the named entity recognition steps. Scene Detection with Contact Sheets creates a contact sheet of video content by first automatically detecting shots using a Python library called PySceneDetect, then taking a frame from the middle of each shot and placing the frames in order in a contact sheet. Contact Sheets Only creates only a contact sheet, taking frames from the video according to an arbitrary time interval. Additionally, an HPC (high-performance computing) workflow is currently being tested in hopes of processing large numbers of files much more quickly. This presently uses the INA speech segmentation tool and Kaldi (an open-source speech recognition toolkit) running in IU's HPC environment to output transcripts.
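The two contact sheet strategies differ only in how frames are chosen. The sketch below illustrates that choice in plain Python and is not AMP's implementation: it assumes shot boundaries are already available as (start frame, end frame) pairs with an exclusive end, and it does not call PySceneDetect. Both function names are hypothetical.

```python
def middle_frames(shots):
    """Pick the frame halfway through each shot.

    shots: list of (start_frame, end_frame) pairs, end exclusive (an assumption).
    """
    return [start + (end - start) // 2 for start, end in shots]


def interval_timestamps(duration_seconds, interval_seconds):
    """Pick timestamps at a fixed interval, starting at 0 and stopping before the end."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    timestamps = []
    step = 0
    while step * interval_seconds < duration_seconds:
        timestamps.append(step * interval_seconds)
        step += 1
    return timestamps


if __name__ == "__main__":
    # Scene-detection style: one frame per detected shot.
    print(middle_frames([(0, 100), (100, 250)]))
    # Interval style: one frame every 4 seconds of a 10-second video.
    print(interval_timestamps(10, 4))
```

The shot-based approach yields one frame per shot regardless of shot length, while the interval approach yields a number of frames proportional to the video's duration.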
The Dashboard allows users both to find files that have already been submitted to a workflow and to track the progress of files in a workflow. It contains a search feature that allows users to filter results on multiple attributes and to sort them. By default, the Dashboard displays all job steps by date in descending order, though you can sort by any column, ascending or descending. Users can also export the data displayed on the Dashboard as a CSV file for easier analysis in software outside of AMP.
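As one example of analysis outside of AMP, the exported CSV can be summarized with a few lines of Python. This is a sketch under an assumption: the column name "Status" is hypothetical, so check the header row of your own export and pass the actual name as `status_column`.

```python
import csv
import io
from collections import Counter


def count_by_status(csv_text, status_column="Status"):
    """Count exported job steps per value of the (assumed) status column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[status_column] for row in reader)


if __name__ == "__main__":
    # Made-up sample data in place of a real Dashboard export.
    sample = "Item,Status\nTape A,Complete\nTape B,Error\nTape C,Complete\n"
    for status, count in count_by_status(sample).most_common():
        print(status, count)
```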
The attributes you are able to filter by are as follows:
- Date (filter uses a date range)
You are able to sort by all of these attributes as well, in addition to External Source, External ID, and Output.
The search function (either the main search bar or within the filters) works slightly differently than what may be expected (and, crucially, differently than the search function on the Workflow Submission page). Like many search functions, it will return suggestions (of existing files already submitted to a workflow) to you; unlike many search functions, you must select one of these suggestions, as the search function currently does not return partial matches.
Each workflow step has a status indicator to let you keep track of what is done, what is currently processing, what is waiting to be processed, and what has failed. The workflow steps are color-coded for easy readability. The color coding is as follows:
- Scheduled: Blue
- In Progress: Yellow
- Paused: Orange
- Error: Red
- Complete: Green
- Deleted: Grey
Many of the workflow steps are routed to either a local machine (IU's Carbonate computing cluster) or a cloud-based one (Amazon Web Services, Azure) for processing by the selected machine learning algorithm. Because a file shares these resources with many other jobs in a queue, it can take some time for it to start processing. Do not worry if a step stays in Scheduled for a while: it is most likely waiting to process. This is especially true of workflow steps sent to Carbonate, which can take a few hours to start depending on what is ahead of them in Carbonate's queue. If a step fails, it will turn red and the status indicator will say "Error." All subsequent steps will then be set to "Paused," in orange. As with items failing to be submitted to a workflow, please let AMPPD staff know if this occurs, as they can figure out precisely what went wrong.
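The status colors and the failure behavior described above can be summarized in a short sketch. It only models what this guide describes (a failed step becomes "Error" and every later step becomes "Paused"); it is not AMP's code, and `mark_failure` is a hypothetical name.

```python
# Status-to-color mapping as listed in this guide.
STATUS_COLORS = {
    "Scheduled": "Blue",
    "In Progress": "Yellow",
    "Paused": "Orange",
    "Error": "Red",
    "Complete": "Green",
    "Deleted": "Grey",
}


def mark_failure(statuses, failed_index):
    """Return a copy of the step statuses after the step at failed_index fails."""
    updated = list(statuses)
    updated[failed_index] = "Error"
    # Every step after the failed one is paused; earlier steps are untouched.
    for i in range(failed_index + 1, len(updated)):
        updated[i] = "Paused"
    return updated


if __name__ == "__main__":
    steps = ["Complete", "In Progress", "Scheduled", "Scheduled"]
    for status in mark_failure(steps, 1):
        print(f"{status}: {STATUS_COLORS[status]}")
```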
The Deliverables page allows users to toggle whether or not a given output is placed in an export bag to be delivered to another system. Presently, the only feature is toggling individual completed workflow steps for delivery. Because AMP does not yet have a system to deliver metadata to at this point in the pilot stage, this feature has limited functionality. However, it will play a very important role in the final product.