The Audiovisual Metadata Platform, or AMP, is a software platform that aims to generate metadata for digitized and born-digital audiovisual materials using a combination of machine learning models and human intervention. It was created through a collaboration between Indiana University, AVP, the University of Texas at Austin, and the New York Public Library. It is funded by a grant from the Mellon Foundation.
The creation of mass digitization projects is one of the most important shifts in library and archival practices, from both a preservation and an access standpoint. As analog media continue to deteriorate, digitization is the best hope for their long-term storage. However, the sheer volume of material created by digitization makes it infeasible for catalogers to catalog everything by hand. Making matters worse, many of these materials come with only scant metadata, meaning a cataloger would need to watch or listen to each item to generate metadata. Using machine learning to automate metadata generation is therefore promising. While machine learning algorithms have improved significantly over the years, particularly in the last decade with the sudden viability of deep neural network models, they still have difficulties and biases. The goal of AMP is to introduce human intervention into its machine learning pipelines to work around some of these limitations, so that the metadata it generates is more accurate, or at least more useful.
This guide explains AMP's front-end features and functionality, and how to navigate the user interface. It is designed for users of AMP: collection managers, catalogers, or any other person using the platform. The guide reflects AMP as it exists in its pilot stage; it is very likely that some of this information will change over time.
These definitions are provided to make reading this guide easier.
- Collection: A set of items that are subject to the same access control settings.
- Bundle: A set of items from a collection or multiple collections. The bundle gathers items that the user wants to submit through a workflow at the same time.
- Item: A bibliographic item. It contains metadata and A/V binary content. Items belong to a collection.
- File: A media file (sound recording, moving image) that is part of an item. Multiple related files can exist in one item.
- Primary File: A binary object that is provided as a primary resource from a collection-holding institution.
- Supplemental File: Any file that is provided to supplement the information about a collection, an item, or a primary file.
- Workflow: A representation of a graph that describes the routing rules for a set of MGMs. The input of a workflow may be an item or a group of items.
- Metadata Generation Mechanism (MGM): A machine learning tool or a tool provided to users to interact with AMP.
- Job: One execution of a workflow for a particular Primary File.
- Unit: A tenant in a multi-tenant AMP; collections belong to a Unit.
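To make the relationships between these terms concrete, the containment hierarchy can be sketched as a handful of Python dataclasses. This is only an illustration of the definitions above, not AMP's actual data model: every class name, field name, and sample value here is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SupplementalFile:
    # May supplement a collection, an item, or a primary file.
    label: str
    filename: str


@dataclass
class PrimaryFile:
    label: str
    filename: str
    supplemental_files: List[SupplementalFile] = field(default_factory=list)


@dataclass
class Item:
    # A bibliographic item; it holds one or more related media files.
    title: str
    primary_files: List[PrimaryFile] = field(default_factory=list)
    supplemental_files: List[SupplementalFile] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    items: List[Item] = field(default_factory=list)
    supplemental_files: List[SupplementalFile] = field(default_factory=list)


@dataclass
class Unit:
    # A tenant; collections belong to a unit.
    name: str
    collections: List[Collection] = field(default_factory=list)


def count_primary_files(unit: Unit) -> int:
    """Count every primary file reachable from a unit."""
    return sum(
        len(item.primary_files)
        for collection in unit.collections
        for item in collection.items
    )


# Hypothetical sample data: one unit, one collection, one item, two files.
interview = Item(
    title="Oral history interview",
    primary_files=[
        PrimaryFile("Tape 1", "tape1.wav"),
        PrimaryFile("Tape 2", "tape2.wav"),
    ],
)
unit = Unit("Example Unit", [Collection("Example Collection", [interview])])
print(count_primary_files(unit))  # 2
```

Because a job is one execution of a workflow for one primary file, submitting every file in this sample unit to a workflow would be expected to create two jobs.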
Currently, there are two ways to access AMP: http://calcium.dlib.indiana.edu:8500/#/, which requires a connection to the IU VPN if not connected to one of IU's networks on campus, and https://amppd.dlib.indiana.edu, which does not.
If this is your first time using AMP (or if you are not currently logged in), the URL for AMP will direct you to the login page. Before logging in for the first time, you will need to create an account and then verify it. Account requests are currently approved by hand, and the email verification link is not sent until the request has been approved, so please be patient. Once your account has been verified, you should be able to log in with your email and password.
Uploading Files via Batch Ingest
(Note: the following borrows significantly from Maria Whitaker's documentation on creating a batch manifest file; for more, see the Batch Manifest page, which is a work in progress.)
The Batch Ingest feature is currently the only way to upload files to AMP. To use the Batch Ingest, you need to create a batch manifest in CSV format, and then upload it to AMP on the Batch Ingest page. All the files in the batch manifest first need to be uploaded to their respective collection's subfolder in the dropbox via an SFTP client; the Batch Ingest will fail if a file is not found. Detailed instructions for how to upload files to a dropbox using a tool like WinSCP or Cyberduck can be found here.
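Since the ingest fails when a listed file is not found, it can help to compare the manifest against a listing of the dropbox before uploading the manifest. The following is a minimal sketch of such a check, not part of AMP itself. It assumes you have already gathered, by whatever means (for example, by copying the file listing from your SFTP client), the set of filenames present in each collection's subfolder. The function name and data shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_missing_files(
    manifest_entries: Iterable[Tuple[str, str]],
    dropbox_listing: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (collection, filename) pairs that are absent from the listing."""
    missing = []
    for collection, filename in manifest_entries:
        # A collection with no listing at all is treated as an empty folder.
        if filename not in dropbox_listing.get(collection, set()):
            missing.append((collection, filename))
    return missing


# Hypothetical data: two files named in the manifest, one already uploaded.
entries = [
    ("Example Collection", "tape1.wav"),
    ("Example Collection", "tape2.wav"),
]
listing = {"Example Collection": {"tape1.wav"}}
print(find_missing_files(entries, listing))
# [('Example Collection', 'tape2.wav')]
```

An empty result suggests that every file named in the manifest is in place; anything returned should be uploaded before the manifest is submitted.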
Batch manifests must conform to a specific format for the batch to be properly ingested. The following are the column names required for a batch manifest (note that some of these may be left blank):
Collection name (required)
The collection must already exist in AMP. The name of the collection is used to determine:
- which collection dropbox holds the media files to be uploaded
- to which collection in AMP the information in that line refers.
External Source (Source is the old column name)
This field may be left blank. It is used to tell AMP the source system of the item. AMP adds this information to the bag of AMP-generated metadata that it provides for target systems to consume. Within AMP, it currently serves no other purpose.
External Item ID (Source ID is the old column name)
This field may be left blank. When provided, it is used as the unique identifier of the item during the batch ingest process (if not, Item Title is the unique identifier). This allows items within the same collection to have the same Title (this is a relatively common occurrence, at least in some of the pilot collections).
Item Title (required)
The bibliographic title of the item. If the External Item ID is not provided, this title serves as the unique identifier for an item in a collection; in this case, one cannot have multiple items with the same title within a collection.
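The identifier rule described above (use External Item ID when it is present, otherwise fall back to Item Title) can be sketched as a small helper. This is an illustration of the rule as documented here, not AMP's own code; the function name is hypothetical, and the dictionary keys assume that manifest rows are keyed by the column names used in this guide.

```python
from typing import Dict


def item_identifier(row: Dict[str, str]) -> str:
    """Return the value that identifies an item within its collection."""
    external_id = row.get("External Item ID", "").strip()
    # Fall back to the title only when no external ID was supplied.
    return external_id if external_id else row["Item Title"].strip()


# Hypothetical rows: three lines that share a title within one collection.
rows = [
    {"External Item ID": "A-001", "Item Title": "Interview"},
    {"External Item ID": "A-002", "Item Title": "Interview"},
    {"External Item ID": "", "Item Title": "Interview"},
]
print([item_identifier(row) for row in rows])
# ['A-001', 'A-002', 'Interview']
```

The first two rows resolve to distinct items even though their titles match, which is the case that External Item ID is meant to support.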
Item Description
This field may be left blank. When provided, it will be displayed as the item description in AMP.
Primary File (required)
This is the file name of the media file that has been placed in the dropbox for ingestion. A primary file is uniquely identified by the combination of Collection, Item Title or External Item ID, and Primary File Label. File names are unique within an item, but not across items (that is, two distinct items can have the same primary file name).
Warning: if a batch manifest includes two lines with the same value in the Primary File column for different items in the same collection, the validation step will let it pass, but the ingestion process will fail with a runtime error, because the dropbox can hold only one copy of a file with that name.
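Because validation does not catch this case, a manifest can be checked for it before upload. The sketch below is one possible way to do so, assuming the manifest rows have been read into dictionaries keyed by the column names used in this guide (for example with Python's `csv.DictReader`); the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def find_filename_collisions(
    rows: Iterable[Dict[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Map (collection, primary file name) to the items that share that name.

    Only names used by more than one item in the same collection are returned.
    """
    items_by_file: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for row in rows:
        # Items are identified by External Item ID, else by Item Title.
        item = row.get("External Item ID", "").strip() or row["Item Title"].strip()
        key = (row["Collection name"].strip(), row["Primary File"].strip())
        items_by_file[key].add(item)
    return {
        key: sorted(items)
        for key, items in items_by_file.items()
        if len(items) > 1
    }


# Hypothetical rows: two different items both point at tape1.wav.
rows = [
    {"Collection name": "Example Collection", "Item Title": "Interview A",
     "Primary File": "tape1.wav"},
    {"Collection name": "Example Collection", "Item Title": "Interview B",
     "Primary File": "tape1.wav"},
    {"Collection name": "Example Collection", "Item Title": "Interview B",
     "Primary File": "tape2.wav"},
]
print(find_filename_collisions(rows))
# {('Example Collection', 'tape1.wav'): ['Interview A', 'Interview B']}
```

Renaming one of the colliding files, both in the dropbox and in the manifest, would avoid the runtime error.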
Primary File Label (required)
Users must provide a label (or title) for the file. That label will be used to uniquely identify the file within this item. In other words, one cannot have multiple primary files with the same label if they are associated with the same item within the same collection.
Primary File Description
This field may be left blank. When provided, it will be displayed as the file description in AMP.
Supplemental File fields:
Multiple supplemental files can be specified per line. For each supplemental file, you need 4 fields:
- Supplemental file Type - the user must specify whether the file should be placed at the Collection, Item, or Primary File level
- Supplemental file - the filename of the binary file (which needs to be found in the same dropbox as the media files)
- Supplemental file Label - the user must provide a label (or title) for the supplemental file. That label will be used to uniquely identify the file in association with the item. One cannot have multiple supplemental files with the same label associated with the same item within the same collection.
- Supplemental file Description - this field may be left blank. When provided, it will be displayed as the file description in AMP.
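A manifest can be written by hand in a spreadsheet, but generating it with a script makes it easier to catch blank required fields early. The following is a rough sketch using Python's standard `csv` module. It makes several assumptions: that the manifest starts with a header row, that the header strings match the column names in this guide exactly (including "Item Description"), and that the supplemental file columns can be left out when there are no supplemental files. Check these against the Batch Manifest documentation before relying on them. The function name and sample values are hypothetical.

```python
import csv
import io

# Assumed header strings, in the order the columns are described above.
COLUMNS = [
    "Collection name",
    "External Source",
    "External Item ID",
    "Item Title",
    "Item Description",
    "Primary File",
    "Primary File Label",
    "Primary File Description",
]
REQUIRED = ["Collection name", "Item Title", "Primary File", "Primary File Label"]


def build_manifest(rows):
    """Return manifest CSV text; raise ValueError if a required field is blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        blank = [name for name in REQUIRED if not row.get(name, "").strip()]
        if blank:
            raise ValueError(f"row {number} is missing required fields: {blank}")
        # Optional columns that were not supplied are written as empty cells.
        writer.writerow({name: row.get(name, "") for name in COLUMNS})
    return buffer.getvalue()


manifest = build_manifest([
    {
        "Collection name": "Example Collection",
        "Item Title": "Oral history interview",
        "Primary File": "tape1.wav",
        "Primary File Label": "Tape 1",
    }
])
print(manifest)
```

The returned text can be saved to a `.csv` file and uploaded on the Batch Ingest page.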
Once files have been ingested, they can be submitted to a workflow. Currently, only primary files can be submitted to workflows, but supplemental files are expected to be supported soon. To use the Workflow Submissions feature, first search for the item(s) you wish to submit to a workflow. The search feature allows users to limit search results by media type (audio, video, or other). The search results appear as items. The individual files contained in each item are visible via a dropdown menu on the item, with an "add file" button to the right of each filename. For convenience, each item also has an "add all files" button that adds every file in the item to the "Selected Files" box. Once files have been added, they can either be submitted directly to a workflow or saved as a bundle. Saving a group of files as a bundle can be very helpful when adding a large number of files at once. To select a workflow, click on the "Select Workflow" dropdown menu. Once a workflow has been selected, the "submit to workflow" button becomes enabled, allowing you to submit the files. After the files have been submitted, AMP displays a message telling you how many jobs were successfully submitted to the workflow and how many failed.
If any files fail, please let the dev team know. Unfortunately, the message that displays after submitting files to a workflow does not specify which files failed, so to determine this, you will need to find the file(s) on the Dashboard. If a file has the same label as an existing file, this could be difficult; please let the dev team know if this happens, as they can search for the file in the database. They can also determine specifically why a file failed.
There are presently three working workflows: Transcript-NER-HMGM, Transcript-NER-no Human MGM, and NER HMGM for Corrected Transcripts. The primary difference between the first two is that the former includes human intervention at several steps to improve performance and the quality of the final deliverables. These two workflows achieve two primary goals: generating a transcript (whether the transcript is human-edited or not), and recognizing named entities (people, places, etc.) in that transcript. The third skips the transcript step entirely.
The Dashboard allows users both to find files that have already been submitted to a workflow and to track the progress of files in a workflow. It contains a fairly robust search feature that allows users to filter results on multiple attributes at once, as well as sort the results. By default, the Dashboard displays all job steps by date in descending order, though you can sort by any column in ascending or descending order.
You can filter and sort by the following attributes:
- Date (filter uses a date range)
- Workflow Name
- Source Item
- Source File
- Workflow Step
The search function (either the main search bar or within the filters) works slightly differently from what you might be used to (and, crucially, differently from the search function on the Workflow Submission page). Like many search functions, it will return suggestions (of existing files already submitted to a workflow) to you; unlike many search functions, you must select one of these suggestions, as the search function currently does not return partial matches. Until recently, the filters would not always clear properly, so any new search parameters were added to all previously active parameters; this is believed to no longer be an issue.
Each workflow step has a status indicator to let you keep track of what is done, what is currently processing, what is waiting to be processed, and what has failed. The workflow steps are color-coded for easy readability. The color coding is as follows:
- Scheduled: Blue
- In Progress: Yellow
- Paused: Orange
- Error: Red
- Complete: Green
- Deleted: Grey
As many of the workflow steps are routed to either a local (to IU's Carbonate computing cluster) or cloud-based (Amazon Web Services, Azure) machine to undergo processing via the selected machine learning algorithm, it can occasionally take some time for a file to start processing, since it is sharing resources with many other jobs in a queue. Do not worry if this is the case, and a step is in Scheduled for a while. This is especially the case with workflow steps sent to Carbonate, as it can occasionally take a few hours depending on what is ahead of it in Carbonate's queue. If a step fails, it will turn red and the status indicator will say "Error." All subsequent steps will then be set to "Paused," in orange. As with items failing to be submitted to a workflow, please let the dev team know if this occurs, as they can figure out precisely what went wrong.
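The failure behavior described above can be summarized in a few lines of Python. This is only a sketch of the rule as documented here (a failed step becomes "Error" and every later step becomes "Paused"), together with the color coding from the list above. It is not AMP's implementation, and the function name and sample step list are hypothetical.

```python
from typing import Dict, List

# Status-to-color mapping, copied from the color coding list in this guide.
STATUS_COLORS: Dict[str, str] = {
    "Scheduled": "Blue",
    "In Progress": "Yellow",
    "Paused": "Orange",
    "Error": "Red",
    "Complete": "Green",
    "Deleted": "Grey",
}


def mark_step_failed(statuses: List[str], failed_index: int) -> List[str]:
    """Return a copy of the statuses with one step failed and later steps paused."""
    updated = list(statuses)
    updated[failed_index] = "Error"
    for index in range(failed_index + 1, len(updated)):
        updated[index] = "Paused"
    return updated


# Hypothetical four-step job whose second step fails while in progress.
steps = ["Complete", "In Progress", "Scheduled", "Scheduled"]
after_failure = mark_step_failed(steps, 1)
print([(status, STATUS_COLORS[status]) for status in after_failure])
# [('Complete', 'Green'), ('Error', 'Red'), ('Paused', 'Orange'), ('Paused', 'Orange')]
```

Steps that finished before the failure keep their status, so on the Dashboard a failed job would typically show green steps, then one red step, then orange steps.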
The Deliverables page allows users to toggle whether or not a given output is placed in an export bag to be delivered to another system. Presently, the only feature is toggling individual completed workflow steps to be delivered. Because AMP does not yet have any system to deliver metadata to at this point in the pilot stage, this feature has limited practical use for now. However, it will play a very important role in the final product.