Introduction

The Audiovisual Metadata Platform, or AMP, is a software platform that aims to generate metadata for digitized and born digital audiovisual materials using a combination of machine learning models and human intervention. It was created in collaboration between Indiana University, AVP, the University of Texas at Austin, and the New York Public Library. It is funded by a grant from the Mellon Foundation.

The creation of mass digitization projects is one of the most important shifts in library and archival practices, from both a preservation and an access standpoint. As analog media continue to deteriorate, digitization is the best hope for their long-term storage. However, the sheer volume of material created by digitization makes it infeasible for catalogers to catalog everything by hand. Making matters worse, many of these materials come with only scant metadata, meaning a cataloger would need to watch or listen to each item to generate metadata. Using machine learning to automate metadata generation is therefore a promising solution. However, while machine learning algorithms have significantly improved over the years -- particularly in the last decade with the sudden explosion of deep neural networks -- they still have difficulties and biases. These pose problems for metadata in terms of usefulness, accuracy, and fairness. The goal of AMP is to introduce human intervention into its machine learning pipelines to circumvent some of these limitations, so that the metadata it generates is more correct, or at least more useful, to collection managers and catalogers.

This guide explains AMP's front-end features and functionality and how to navigate the user interface. It is designed for users of AMP: collection managers, catalogers, or anyone else using the platform. The guide reflects AMP as it exists in its pilot stage; it is very likely that some of this information will change over time.

Definitions

These definitions are provided to make reading this guide easier.

  • Collection: A set of items that are subject to the same access control settings
  • Bundle: A set of items from a collection or multiple collections. The bundle gathers items that the user wants to submit through a workflow at the same time
  • Item: A bibliographic item. It contains metadata and A/V binary content. Items belong to a collection
  • File: A file is a media file (sound recording, moving image) that is part of an item. Multiple related files can exist in one item
  • Primary File: A binary object that is provided as a primary resource from a collection-holding institution
  • Supplemental File: Any file that is provided to supplement the information about a collection, an item, or a primary file
  • Workflow: A representation of a graph that describes the routing rules for a set of MGMs. The input of a workflow may be an item or a group of items
  • Metadata Generation Mechanism (MGM): A machine learning tool or other tool (e.g., automated non-machine-learning tools like ffmpeg, or manual tools like a transcript editor) provided to users to interact with AMP
  • Job: One execution of a workflow for a particular Primary File
  • Unit: A tenant in a multi-tenant AMP; collections belong to a Unit

Logging In

Currently, there are two ways to access AMP: https://amppd.dlib.indiana.edu and http://calcium.dlib.indiana.edu:8500/#/. The latter requires a connection to the IU VPN unless you are connected to one of IU's networks on campus. The former is preferable, as it allows users not affiliated with IU to access AMP more easily.

If this is your first time using AMP (or if you are not currently logged in), the URL for AMP will direct you to the login page. Before logging in for the first time, you will need to create an account and then verify it. Currently, the email verification link is not sent until AMPPD staff approve the account, so for the short term you may need to wait until they are able to respond. Once your account has been verified, you should be able to log in with the email and password you specified when creating your account.

Uploading Files via Batch Ingest

(Note: the following borrows significantly from Maria Whitaker's documentation on creating a batch manifest file; see more at Batch Manifest - this is a work in progress page)

The Batch Ingest feature is how files are uploaded to AMP. To use Batch Ingest, you will need to create a batch manifest in CSV format and then upload it to AMP on the Batch Ingest page. All the files in the batch manifest must first be uploaded to their respective collection's subfolder in the dropbox via an SFTP client; the Batch Ingest will fail if a file is not found in the expected dropbox subfolder. Detailed instructions for how to upload files to a dropbox using a tool like WinSCP or Cyberduck can be found here.

Batch manifests must conform to a specific format for the batch to be properly ingested. A batch manifest uses the following columns, each marked as required or optional:

Collection name (required)

The collection must already exist in AMP, and collection names must match exactly. The name of the collection is used to determine:

  • which collection dropbox holds the media files to be uploaded
  • to which collection in AMP the information in that line refers.

Collection names cannot be changed by end users; however, AMPPD staff can change them.

External Source (optional, strongly recommended if using External Item ID)

This field is optional. It tells AMP the source system of the item. AMP adds this information to the bag of AMP-generated metadata that it provides for target systems to consume. If External Item ID is being used, it is strongly recommended to provide the item's External Source as well, as it allows items in the same collection from multiple sources to have the same External Item ID.

External Item ID (optional, recommended)

This field is optional, although recommended. When provided, it is used as the unique identifier of the item during the batch ingest process (if not, Item Title is the unique identifier). This allows items within the same collection to have the same Title (this is a relatively common occurrence, at least in some of the pilot collections).

Item Title (required)

The bibliographic title of the item. If the External Item ID is not provided, this field serves as the unique identifier for an item in a collection; in this case, one cannot have multiple items with the same title within a collection. 
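The identifier rules above can be sketched in a few lines of Python. This is only an illustration, not AMP's implementation: `ManifestRow` and `item_key` are hypothetical names, and including External Source in the key is an assumption drawn from the statement that it lets items from different sources share an External Item ID.

```python
from typing import NamedTuple, Optional


class ManifestRow(NamedTuple):
    """Hypothetical in-memory form of one manifest line (not an AMP type)."""
    collection: str
    item_title: str
    external_source: Optional[str] = None
    external_item_id: Optional[str] = None


def item_key(row: ManifestRow) -> tuple:
    """Return a tuple identifying the item that a manifest line refers to.

    Assumption: when an External Item ID is present, it (together with the
    External Source) identifies the item; otherwise the title does.
    """
    if row.external_item_id:
        return (row.collection, "id", row.external_source or "", row.external_item_id)
    return (row.collection, "title", row.item_title)


# Same title, different External Item IDs: treated as two distinct items.
a = ManifestRow("Oral Histories", "Interview", "MCO", "123")
b = ManifestRow("Oral Histories", "Interview", "MCO", "456")

# Same title, no External Item ID: both lines refer to the same item.
c = ManifestRow("Oral Histories", "Interview")
d = ManifestRow("Oral Histories", "Interview")

print(item_key(a) != item_key(b))
print(item_key(c) == item_key(d))
```

The collection and source names in the example are made up for illustration.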

Item Description (optional)

When provided, this field will be displayed as the item description in AMP.

Primary File (required)

This is the file name of the media file that has been placed in the dropbox for ingestion. A primary file is uniquely identified by the combination of Collection, Item Title or External Item ID, and Primary File Label. File names are unique within an item, but not necessarily across items (that is, two distinct items can have the same primary filename).

Warning: if a batch manifest includes two lines with the same value in the Primary File column for different items in the same collection, the validation step will let it pass, but the ingestion process will fail with a runtime error because the dropbox can hold only one file with that name. To resolve this conflict, ingest the primary files with conflicting names in separate batches.
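Because validation does not catch this case, it may help to check a manifest locally before uploading it. The following is a minimal sketch using Python's standard csv module; the function name is hypothetical, and the header strings are assumed to match the field names used in this guide, so adjust them to the headers in your actual manifest.

```python
import csv
import io
from collections import defaultdict


def find_filename_conflicts(manifest_text: str) -> dict:
    """Map (collection, filename) to the items sharing it, when more than one does."""
    items_by_file = defaultdict(set)
    for row in csv.DictReader(io.StringIO(manifest_text)):
        # Header strings are assumptions; change them to match your manifest.
        key = (row["Collection name"], row["Primary File"])
        item = row.get("External Item ID") or row["Item Title"]
        items_by_file[key].add(item)
    return {k: sorted(v) for k, v in items_by_file.items() if len(v) > 1}


# Made-up sample manifest: two items in one collection both use side_a.wav.
sample = """Collection name,External Item ID,Item Title,Primary File,Primary File Label
Field Recordings,,Tape 1,side_a.wav,Side A
Field Recordings,,Tape 2,side_a.wav,Side A
Field Recordings,,Tape 2,side_b.wav,Side B
"""

conflicts = find_filename_conflicts(sample)
for (collection, filename), items in conflicts.items():
    print(f"{filename} in {collection} is used by: {', '.join(items)}")
```

Any filename reported here would need to be ingested in a separate batch for each item that uses it.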

Primary File Label (required)

Users must provide a label (or title) for the file. That label will be used to uniquely identify the file within this item. In other words, one cannot have multiple primary files with the same label if they are associated with the same item within the same collection. 

Primary File Description (optional)

This field may be left blank. When provided, it will be displayed as the file description in AMP.

Supplemental File fields:

Multiple supplemental files can be specified per line. For each supplemental file, provide the following fields:

  • Supplemental File Type - The user must specify whether to place the file at the Collection, Item, or Primary File level
  • Supplemental File - The filename of the binary file (which must be in the same dropbox as the media files)
  • Supplemental File Label - The user must provide a label (or title) for the supplemental file. That label will be used to uniquely identify the file in association with the item. One cannot have multiple supplemental files with the same label associated with the same item within the same collection.
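As an illustration of the manifest layout described in this section, here is a minimal Python sketch that builds manifest CSV text and rejects lines with a blank required field. The header strings are taken from the field names in this guide; the exact header text and column order that AMP expects are assumptions, and only one set of supplemental file columns is shown. `build_manifest` is a hypothetical helper, not part of AMP.

```python
import csv
import io

# Headers follow the field names in this guide; exact text and order are assumed.
COLUMNS = [
    "Collection name", "External Source", "External Item ID", "Item Title",
    "Item Description", "Primary File", "Primary File Label",
    "Primary File Description", "Supplemental File Type",
    "Supplemental File", "Supplemental File Label",
]
REQUIRED = ["Collection name", "Item Title", "Primary File", "Primary File Label"]


def build_manifest(rows: list) -> str:
    """Return CSV text for the rows, raising ValueError if a required field is blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        blank = [name for name in REQUIRED if not row.get(name, "").strip()]
        if blank:
            raise ValueError(f"row {number} is missing required fields: {blank}")
        writer.writerow(row)  # optional columns left out of the dict stay empty
    return buffer.getvalue()


# Made-up example row with only the required fields filled in.
manifest_text = build_manifest([
    {
        "Collection name": "Field Recordings",
        "Item Title": "Tape 1",
        "Primary File": "side_a.wav",
        "Primary File Label": "Side A",
    },
])
print(manifest_text)
```

The resulting text can be saved with a .csv extension and uploaded on the Batch Ingest page once the listed files are in the collection's dropbox subfolder.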

Workflow Submissions

Once files have been ingested, they can be submitted to a workflow. Currently, only primary files can be submitted to workflows, but supplemental files will soon be able to be submitted as well.

To use the Workflow Submissions feature, first search for the item(s) you wish to submit to a workflow. Only items that have been added to AMP via the Batch Ingest feature will appear in the search results. The search feature allows users to limit search results by media type (audio, video, or other). The search results appear as items; the individual files contained in each item are visible via a dropdown menu on the item, with an "add file" button to the right of each filename. For convenience, each item also has an "add all files" button that adds every file in the item to the "Selected Files" box.

Once files have been added, they can either be submitted directly to a workflow or saved as a bundle. Saving a group of files as a bundle can be very helpful when adding a large number of files at once. To select a workflow, click the "Select Workflow" dropdown menu. Once a workflow has been selected, the "submit to workflow" button is enabled, allowing you to submit the files.

After submission, AMP displays a message telling you how many jobs were successfully submitted to the workflow and how many failed. If one or more files fail, the message also displays detailed information about each file, including the collection name, item name, file ID, filename, and file label. In that case, please let AMPPD staff know, providing the information given in the error message.

There are presently four working workflows: Transcript-NER-HMGM, Transcript-NER-no Human MGM, NER HMGM for Corrected Transcripts, and Scene Detection with Contact Sheets. The primary difference between the first two is that the former includes human intervention at several steps to improve the quality of the final deliverables. Both achieve two primary goals: generating a transcript (whether human-edited or not) and recognizing named entities (people, places, etc.) in that transcript. The third workflow, related to the first two, skips the transcript steps entirely and goes directly to the named entity recognition steps. Scene Detection with Contact Sheets creates a contact sheet of video content by first automatically detecting shots, then taking stills from those shots and placing them in order in a contact sheet.

The Dashboard

The dashboard allows users both to find files that have already been submitted to a workflow and to track the progress of files in a workflow. Its search feature allows users to filter results on multiple facets and to sort them. By default, the Dashboard displays all job steps by date in descending order, but you can sort by any column in ascending or descending order.

You can filter and sort by the following attributes:

  • Date (filter uses a date range)
  • Submitter
  • Workflow Name
  • Source Item
  • Source File
  • Workflow Step
  • Status

The search function (either the main search bar or the search within the filters) works slightly differently from what you may expect (and, crucially, differently from the search function on the Workflow Submissions page). Like many search functions, it returns suggestions drawn from files already submitted to a workflow; unlike many search functions, it requires you to select one of these suggestions, as it currently does not return partial matches.

Each workflow step has a status indicator to let you keep track of what is done, what is currently processing, what is waiting to be processed, and what has failed. The workflow steps are color-coded for easy readability. The color coding is as follows:

  • Scheduled: Blue
  • In Progress: Yellow
  • Paused: Orange
  • Error: Red
  • Complete: Green
  • Deleted: Grey

Many workflow steps are routed to either a local machine (IU's Carbonate computing cluster) or a cloud-based one (Amazon Web Services, Azure) for processing by the selected machine learning algorithm. Because a step shares resources with many other jobs in a queue, it can take some time for a file to start processing. Do not worry if a step stays in Scheduled for a while: it is most likely waiting to process. This is especially true of workflow steps sent to Carbonate, which can take a few hours to start depending on what is ahead of them in Carbonate's queue.

If a step fails, it will turn red and the status indicator will say "Error." All subsequent steps will then be set to "Paused," in orange. As with items that fail to be submitted to a workflow, please let AMPPD staff know if this occurs so they can determine precisely what went wrong.
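The status behavior described above can be modeled with a short Python sketch. It only illustrates the color coding and the rule that a failure pauses later steps; `STATUS_COLORS` and `apply_failure` are hypothetical names, and nothing here reflects how AMP is actually implemented.

```python
# Color coding as listed in this guide.
STATUS_COLORS = {
    "Scheduled": "Blue",
    "In Progress": "Yellow",
    "Paused": "Orange",
    "Error": "Red",
    "Complete": "Green",
    "Deleted": "Grey",
}


def apply_failure(statuses: list, failed_index: int) -> list:
    """Return a copy where the failed step is Error and every later step is Paused."""
    updated = list(statuses)
    updated[failed_index] = "Error"
    for i in range(failed_index + 1, len(updated)):
        updated[i] = "Paused"
    return updated


# A four-step job whose second step fails while processing.
steps = ["Complete", "In Progress", "Scheduled", "Scheduled"]
after = apply_failure(steps, 1)
colors = [STATUS_COLORS[status] for status in after]
print(after)
print(colors)
```

Steps that completed before the failure keep their Complete status in this model.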

Deliverables

The Deliverables page allows users to toggle whether a given output is placed in an export bag to be delivered to another system. Presently, the only feature is toggling individual completed workflow steps for delivery. Because AMP does not yet have a system to deliver metadata to at this point in the pilot stage, this feature has limited functionality. However, it will play a very important role in the final product.