To help decide how WGBH metadata should be structured for ingest into the system, below are examples of how the technical metadata is created, along with the current digital preservation workflow.
Incoming File Processing
When files are delivered to the WGBH Media Library and Archives for digital preservation, they can arrive on several carrier types:
- External Hard Drive
- Avid ISIS Network Storage
- Department Server
- LTO Tape
In most cases, it's critical to the other departments that we preserve the folder structure of the contents they deliver to us. Ultimately, they want a spinning-disk solution but can't afford the cost, and they also dislike a traditional HSM system where files need to be staged. Our IT department hasn't been much help, so we've had to find a solution on our own.
Because of this, we developed a Ruby script to help us process incoming files.
This script will:
- Scan source directory for viruses using Sophos
- Create text file list of all source directory folders and files
- Copy source directory and files to one or more destination directories: one onsite copy and one offsite copy on LTO-6 LTFS-formatted tapes
- Compare source and destination files for differences and report those results in a text file
- Run FITS (File Information Tool Set) on source directory and create FITS xml for all files
The FITS xml is the foundation of our records. One FITS xml file is generated for every single file found on a source directory, not just the A/V files. Each LTO tape the files end up on could hold anywhere from dozens to tens of thousands of files.
FITS generates the technical metadata such as file size, file path on the source drive, and file name, as well as format and wrapper characteristics via MediaInfo. FITS also generates an MD5 checksum and embeds it in the xml, giving us most of the technical information we care about.
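For example, those fields can be pulled out of a FITS xml file with Ruby's stdlib REXML. A sketch only: the element names below match the `fileinfo` section of FITS output, but real FITS xml is namespaced, so the XPaths may need adjusting for a given FITS version.

```ruby
require "rexml/document"

# Sketch: extract the fields we care about from one FITS xml file.
# Note: real FITS output declares the fits_output namespace; the plain
# XPaths below assume unprefixed element names.
def extract_fits_fields(xml)
  doc = REXML::Document.new(xml)
  grab = ->(name) { REXML::XPath.first(doc, "//#{name}")&.text }
  {
    filename: grab.call("filename"),
    filepath: grab.call("filepath"),
    size:     grab.call("size"),
    md5:      grab.call("md5checksum")
  }
end
```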
When it's finished processing, the user is left with a folder containing all the individual FITS xml files.
When we're processing files, we try to do it in batches based on the Series and Program the files come from. So an LTO tape may contain files from program A and program B, but as we process the technical metadata xml, we do it one program at a time: program A, then program B.
Ingest Into Filemaker DAM Database
While waiting for the HydraDAM2 application, files have continued to be delivered to the WGBH MLA and processed for digital preservation. As a result, we created a Filemaker DAM database that functions much like what we want HydraDAM2 to do.
1) FITS xml gets ingested into Filemaker DAM and one record is created for one FITS xml file.
2) Specific technical metadata is extracted and put into Tags. Tags provide searchable values for specific technical metadata and are associated with each FITS xml file record. Tags do not follow a specific schema, but they take inspiration from PBCore and can be mapped to it.
3) The next step is manually adding tags for metadata that cannot be extracted from the technical metadata:
- LTO Barcode
- LTO Serial Number
- PIM Identifier (from descriptive database)
- Basic Series and Program Information
- Fixity Information (PREMIS-ish)
Without this step there would be very little descriptive information, and you would not be able to associate the FITS xml with its LTO tape.
4) There is then a process to add tags for any derivative, proxy A/V files. They are given what is called a Generations tag in Filemaker DAM. The data in those tags points the record to the location of the proxy files on Amazon S3, and the files are displayed and made available for download in the user view of the records.
We have a script in place that is able to export all the tags for a record as a PBCore instantiation document.
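A minimal version of that export could look like the following. This is a sketch only: the element names follow the PBCore instantiation vocabulary, but the tag hash keys are illustrative, not our actual Filemaker field names.

```ruby
require "rexml/document"

# Sketch: build a PBCore instantiation document from a record's tags.
# Element names follow PBCore's instantiation vocabulary; the tag hash
# keys (:filename, :lto_barcode, :size) are illustrative placeholders.
def pbcore_instantiation(tags)
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new("1.0", "UTF-8")
  root = doc.add_element(
    "pbcoreInstantiationDocument",
    "xmlns" => "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
  )
  id = root.add_element("instantiationIdentifier", "source" => "WGBH MLA")
  id.text = tags[:filename]
  root.add_element("instantiationLocation").text = tags[:lto_barcode]
  size = root.add_element("instantiationFileSize", "unitsOfMeasure" => "bytes")
  size.text = tags[:size]
  doc
end
```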
Lastly, we create descriptive metadata records in a separate database. This is where regular users go to find information on series, program, department, etc. There are also records that describe the digital folders and files found on an LTO tape.
This database also contains the physical location of the LTO tape in our vault as well as its offsite location. We don't need to store all that descriptive information in HydraDAM2/PHYDO; all we need is the link established by the LTO barcode, serial number, or PIM identifier shared between the technical and descriptive metadata records.
In the example below, a user logged into the database can search on descriptive terms and find the proxy versions of the files described in the record.
When a department requests file retrieval, it most commonly wants a complete copy of everything we put on the LTO tape, with the folder structure preserved.
Questions About How to Ingest Metadata Into HydraDAM2/PHYDO
First there should be a discussion on what the WGBH LTO model should be.
There are several possibilities for how WGBH metadata could be ingested into HydraDAM2, but it's likely we'll simply go with the Ingest Normalization work IU has been doing.
FITS XML Files and PBCore to YAML
+ Able to account for metadata not found in FITS xml as it's created (LTO barcode, fixity info)
+ Easily exported from Filemaker DAM
+ No need to create custom ingest for WGBH
- Additional work to map exported metadata to the YAML template
I think the best option would be FITS XML Files and PBCore to YAML, mostly because ingest will line up more closely with how IU will be doing it. And as a system we're providing to other organizations, it will be nice to say, "All ingest happens this way; just map to the YAML and you're good to go."
Will we still need to construct the FITS xml in a BagIt/SIP like structure?
1) Export individual FITS xml
2) Export individual PBCore xml
3) Merge PBCore xml
4) Map to YAML
5) Create Bag for ingest with:
- individual FITS xml
- individual PBCore xml
6) Ingest to PHYDO
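Steps 3 and 4 above, merging the per-file metadata and mapping it to YAML, could be sketched like this. The key names are placeholders; the real template would be whatever IU's Ingest Normalization work defines.

```ruby
require "yaml"

# Sketch of the merge-and-map step: combine the FITS-derived and
# PBCore-derived fields for one file into a single hash, then emit it
# as a YAML record. Keys are placeholders for the real YAML template.
def to_yaml_record(fits_fields, pbcore_fields)
  merged = {
    "filename"    => fits_fields[:filename],
    "md5"         => fits_fields[:md5],
    "file_size"   => fits_fields[:size],
    "lto_barcode" => pbcore_fields[:lto_barcode],
    "pim_id"      => pbcore_fields[:pim_id]
  }
  merged.to_yaml
end
```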