As we develop Fedora Content Models, we should keep this basic philosophy in mind. The philosophy is driven by a desire to structure data in the "appropriate" way, making things as easy to use as possible, while maintaining a structure that can be searched and updated efficiently.

Most items in our repository will live in a hierarchy. A "book" object will have one or more parents that are "collection" objects. Each book will also have many children, which are "page" objects. For an object at any level in the Fedora hierarchy, all metadata associated with that object will appear in a single METS document, in the METADATA datastream. This differs from many other Fedora implementations, which break the metadata into multiple pieces that can be retrieved individually (descriptive, technical, administrative, etc.).

Reasons to store all metadata in a single METS record:

  • All objects in the repository are treated the same. There is a predictable datastream for every type of data.
  • The METS schema provides a method for different pieces of metadata to reference each other.
  • May make it easier to share data with other digital libraries.
  • Ingest tools can do more consistency checking, because all objects are treated the same.
  • It will be easier to guarantee that all representations of the object refer to the same list of files.
  • We cannot accurately predict all of the uses for the metadata, so any separation we make could be less meaningful as time passes.

Within the METS document, every section should have an ID. The ID should start with the name of the element in which it appears. If there is more than one section of each type (like <file>), or if more descriptive information is needed, a hyphen should be added, followed by a description of the section. Some examples:

  • file-thumbnail
  • dmdSec-mods
  • techMD-thumbnail
  • techMD-page-11
  • structMap-logical
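
The naming rule above can be sketched as a small helper. This is a minimal illustration, not part of any existing ingest tool: the function names make_section_id and follows_convention, and the restriction of descriptions to letters, digits, and hyphens, are assumptions made for the example.

```python
import re

# Hypothetical helper names; the allowed-character rule is an assumption.
_DESCRIPTION_RE = re.compile(r"^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$")


def make_section_id(element_name, description=None):
    """Build an ID such as 'techMD-page-11' from an element name and an optional description."""
    if not element_name or not element_name.isalnum():
        raise ValueError(f"invalid element name: {element_name!r}")
    if description is None:
        # Only one section of this type, so the element name alone is enough.
        return element_name
    if not _DESCRIPTION_RE.match(description):
        raise ValueError(f"invalid description: {description!r}")
    return f"{element_name}-{description}"


def follows_convention(section_id, element_name):
    """Return True if section_id starts with the name of its enclosing element."""
    return section_id == element_name or section_id.startswith(element_name + "-")
```

For example, make_section_id("techMD", "page-11") yields "techMD-page-11", and follows_convention("thumbnail-file", "file") is False because the element name does not come first.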

Philosophy, by metadata type

Technical metadata (tmd)

We must keep technical/provenance data for all derivative files. Technical metadata should always be stored as close to the files as possible. This means in the same Fedora object in which the files appear as datastreams.

Why? We want a consistent place to look for this information in all collections. Storing it at the collection level doesn't work, because there will almost always be too much data to handle. For an item with internal structure, storing technical data at the item level can also produce very large XML files. Technical data will almost always be different for each file, so we cannot easily eliminate redundancy.

Note that this arrangement closely mirrors the reasoning of the published METS profiles, where only a <file> element should reference technical metadata. In our case, the <file> elements have a one-to-one mapping with Fedora datastreams. Within the Fedora object that contains these datastreams, the METS document will contain <file> elements that map each datastream to the correct tmd.
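
As a rough sketch of the file-to-tmd mapping, the snippet below reads a METS document and returns, for each <file>, the techMD sections it references through the METS ADMID attribute. The sample document is deliberately simplified (the empty techMD elements would not pass schema validation), and the function name file_to_techmd is hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Simplified sample for illustration only; real techMD sections wrap actual metadata.
SAMPLE_METS = """\
<mets xmlns="http://www.loc.gov/METS/">
  <amdSec ID="amdSec">
    <techMD ID="techMD-thumbnail"/>
    <techMD ID="techMD-screen"/>
  </amdSec>
  <fileSec ID="fileSec">
    <fileGrp ID="fileGrp">
      <file ID="file-thumbnail" ADMID="techMD-thumbnail"/>
      <file ID="file-screen" ADMID="techMD-screen"/>
    </fileGrp>
  </fileSec>
</mets>
"""


def file_to_techmd(mets_xml):
    """Map each <file> ID to the list of techMD IDs named in its ADMID attribute."""
    root = ET.fromstring(mets_xml)
    ns = {"m": METS_NS}
    techmd_ids = {el.get("ID") for el in root.iterfind(".//m:techMD", ns)}
    mapping = {}
    for file_el in root.iterfind(".//m:file", ns):
        # ADMID is a space-separated list of ID references.
        refs = (file_el.get("ADMID") or "").split()
        missing = [ref for ref in refs if ref not in techmd_ids]
        if missing:
            raise ValueError(f"{file_el.get('ID')} references unknown techMD: {missing}")
        mapping[file_el.get("ID")] = refs
    return mapping
```

A check like this is one kind of consistency test an ingest tool could run, since every object keeps its tmd in the same predictable place.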

Descriptive metadata (dmd)

Descriptive metadata should be stored in the most logical place. If the authoritative dmd is a collection-level document (like EAD), it should be stored at the collection level. If the authoritative dmd is item-level, it should be stored at the item level. Every object should have a disseminator for retrieving the data it contains, transforming it into the relevant format if necessary. If performance suffers due to retrieving data from other objects, local copies of the relevant data may be stored, but there must be some way to indicate that these are not the authoritative copies.

We will use MODS to describe all image collections. However, some image collections (like Hohenberger) will have their "master" metadata in an EAD document, and the MODS will be derived from this. Since EAD is very flexible, we will need to write custom XSL to do the mapping. In the future, we will design recommended EAD practices to minimize the amount of XSL that has to be written.

We will need a method to indicate when metadata has been derived from the authoritative metadata, and describe the proper method for updating the data.

Note that this arrangement closely mirrors the reasoning of the published METS profiles, where only a <div> element in a <structMap> should reference descriptive metadata. In our case, the <div> is represented by the hierarchy of Fedora objects. Within the Fedora object that contains the dmd, there will be a <structMap> whose top-level <div> references the dmd.

Dublin Core

Fedora requires each object to have a DC datastream holding a Dublin Core record for internal use. This datastream should be as minimal as possible. It currently includes only these items:

  • Title
  • PURL for the object (if this is an item-level object) in an Identifier field
  • Fedora PID for the object in an Identifier field

The "real" DC record (if present) will be in the dmd of the METS document, alongside any other descriptive metadata.
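
A minimal sketch of building such a DC record is shown below. It assumes the usual oai_dc wrapper element and Dublin Core element namespace; the function name minimal_dc and the placeholder PID and PURL values in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def minimal_dc(title, pid, purl=None):
    """Return a DC record holding only a title and one or two identifiers."""
    # Register prefixes so serialized output uses oai_dc: and dc: rather than ns0:, ns1:.
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(root, f"{{{DC_NS}}}title").text = title
    if purl is not None:
        # Only item-level objects carry a PURL.
        ET.SubElement(root, f"{{{DC_NS}}}identifier").text = purl
    ET.SubElement(root, f"{{{DC_NS}}}identifier").text = pid
    return root
```

Calling minimal_dc("Sample item", "example:1", "http://purl.example.org/example/1") gives a record with one title and two identifiers; omitting the PURL, as for a collection-level object, leaves only the PID. ET.tostring(root, encoding="unicode") serializes the result.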

Administrative/Holdings data

For some internal processes, we have a need to track administrative/holdings information. This information doesn't fit well into the METS amdSec, so it will go in a dmdSec using our own metadata type, iudlAdmin (TODO: We need to define a schema and a namespace URI for this). This information will include:

  • Locally-used identifiers (including all versions of the PURL for an object)
  • Status/workflow information
  • Holdings information

Structural metadata (smd)

Structural data should go where it is needed, usually at the same level as descriptive.

For large items, we do not want to store structural data with technical data. A large book can result in a 2 MB METS file just to store the file pointers and structMaps. Adding technical data to this file would make it difficult to process quickly. In this case, structural data should be at the book level, while technical data is at the page level.

Transcriptions

Text transcriptions can appear either at the level of the descriptive data, or at the level of individual files. It may be useful to have a single TEI document at the same level as the descriptive data, or a datastream (still probably TEI) with each page object that transcribes the content of that page. We may need to do some performance testing, but it's quite possible that we will want the transcriptions to be available from both levels, which means we will likely want to store this information at the book level (though not necessarily in the same datastream as the regular metadata).

Logical vs physical structure

For many collections, the physical structure (pages, photographs) will fit neatly into the logical structure. In these cases, it is reasonable to combine logical and physical structure into a single hierarchy of Fedora objects. It is even reasonable to put both logical and physical information into a single METS document within a Fedora object.

In other collections, the logical structure and physical structure may not map so well. This primarily occurs with audio/video files, like Variations2 and Eviada.

We will always need Fedora objects to represent the physical structure, because it is impractical to put all of the needed datastreams in a single object. For example, if we tried to represent a 400-page book without separate page objects, we would need at least 400 datastreams, and possibly as many as 1600 depending on the storage of varying image sizes, transcriptions, etc.
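
The arithmetic behind those figures can be written out directly. The per-page datastream labels below are hypothetical; the text above only implies that a page may need up to four datastreams.

```python
# Hypothetical per-page datastream labels, chosen only to illustrate the count.
PER_PAGE_MINIMAL = ["image"]
PER_PAGE_FULL = ["thumbnail", "screen", "full-size", "transcription"]


def datastreams_in_single_object(pages, per_page):
    """Number of datastreams one Fedora object would hold without separate page objects."""
    return pages * len(per_page)


minimum = datastreams_in_single_object(400, PER_PAGE_MINIMAL)  # 400 * 1 = 400
maximum = datastreams_in_single_object(400, PER_PAGE_FULL)     # 400 * 4 = 1600
```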

Descriptive data will usually (always?) be at the hierarchical level in which physical and logical data come together.

Meta-metadata issues

Derived metadata

Sometimes, it becomes necessary to derive metadata from another location. We will indicate that metadata has been derived by putting a human-readable indication in each derived record (preferably in a "notes" field), and a machine-readable indication in the METS record with label "derivedSections", shown below. Linebreaks have been added to the derivationProcess for readability. They should not be included in the actual record.

A schema for the derivedSections data is available at http://www.dlib.indiana.edu/lib/xml/derivedSections/

The human-readable form must be in each derived record in case the record is exported.

An alternative is for the metadata to be derived on the fly. This isn't technically difficult, but performance reasons may require pre-generating the derived data.

Creating a Dissemination Information Package (DIP)

Although metadata will be scattered at various levels of the hierarchy, there needs to be a way to disseminate a single METS record that contains all metadata for a given item. This will likely involve a disseminator that is customized to a particular media type, which can collect the appropriate metadata and build up the METS document.
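
One way to picture the collection step is a depth-first walk over an item and its children that gathers every metadata section along the way. This is only a sketch under simplifying assumptions: RepoObject is a hypothetical in-memory stand-in for a Fedora object, sections are plain strings, and the walk covers descendants only, not metadata held by parent collections.

```python
from dataclasses import dataclass, field


@dataclass
class RepoObject:
    """Hypothetical stand-in for a Fedora object and the METS sections it holds."""
    pid: str
    sections: dict = field(default_factory=dict)   # section ID -> content
    children: list = field(default_factory=list)   # child RepoObject instances


def collect_sections(obj):
    """Return (pid, section ID, content) for obj and all of its descendants, parents first."""
    collected = [(obj.pid, sid, content) for sid, content in obj.sections.items()]
    for child in obj.children:
        collected.extend(collect_sections(child))
    return collected


# Usage: a book holding descriptive and structural data, with technical data on each page.
page1 = RepoObject("example:page1", {"techMD-thumbnail": "tmd for page 1"})
page2 = RepoObject("example:page2", {"techMD-thumbnail": "tmd for page 2"})
book = RepoObject(
    "example:book",
    {"dmdSec-mods": "MODS record", "structMap-logical": "structure"},
    [page1, page2],
)
sections = collect_sections(book)
```

Keeping the PID with each section matters because the same section ID (here techMD-thumbnail) can occur in many page objects, so a real disseminator would need to make the IDs unique before merging them into one METS record.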

We will not remove derived metadata from our DIPs. We have taken pains to put this information in place, and as long as we clearly mark it as derived, it can be useful for others as well.

In general, it is a good idea to include all of the object's data in a DIP, even when that data is stored in nonstandard formats. This maximizes the ability of the receiving repository to manage an object, and it is the only way to have the data available in the event we would need to retrieve a copy of the object from the other repository.

Ingesting a Submission Information Package (SIP)

There will need to be a system (similar to Eviada's METS-to-Fedora converter) that accepts a METS document and breaks it up into the relevant Fedora objects. If Fedora objects already exist for this item, they will need to be updated.

Note: There is also relevant information on the Ingest Tool page, describing how to handle SIPs that will need to be reproduced at some later time.

Metadata storage examples

Wright American Fiction:

  • Some descriptive metadata may be available at the collection level.
  • Most descriptive metadata will be at the book level.
  • Structural metadata will be at the book level.
  • Technical metadata will be at the page level.
  • Text will be stored at the book level.

Scrapbooks in the Hoagy Carmichael collection:

  • Some descriptive metadata may be available at the collection level.
  • Most descriptive metadata will be at the scrapbook level, though there may not be any.
  • Structural metadata will be at the scrapbook level.
  • Technical metadata will be at the clip level, with the first "clip" of each series being the
    picture of the page containing all of the other clips.
  • Text (if available) will be stored at the clip level.

IU Sheet music:

  • Some descriptive metadata may be available at the collection level.
  • Most descriptive metadata will be at the item level.
  • Structural metadata will be at the item level.
  • Technical metadata will be at the page level.
  • Lyrics (if available) will be stored at the page level.

Hohenberger/Steel:

  • The HohenbergerMETS.xml file provides an example of how the Hohenberger data is stored.
  • Some descriptive metadata will be available at the collection level, including
    the EAD full finding aid.
  • Most descriptive metadata will be at the item level, derived from the full finding aid (because it
    would take too long to generate it on the fly).
  • Structural metadata (practically nonexistent) will be at the item level.
  • Technical metadata will be at the item level. (This is the lowest level.)

Variations2 items (if we ever move them to the repository):

  • Some descriptive metadata may be available at the collection level.
  • Most descriptive metadata will be at the item (Container) level.
  • Physical structural metadata will be at the item level.
  • Logical structural metadata is currently stored in separate objects (Instantiation Bindings), but these
    are always associated with a single item (Container).
  • Some logical structure is stored in Playlist objects, timelines, etc.
  • Technical metadata should be at the file/page level. Some technical data is currently in the
    MediaObject, and other data is only available from the actual media files. For audio,
    files and MediaObjects have a one-to-one correspondence. For scores, metadata in the MediaObject
    purports to apply to all files, but this is not always true.
  • The item (Container) level is where the physical structure (Container Structure, Media Object) meets
    the logical structure (Instantiation).

Evia:

  • Some descriptive metadata may be available at the collection (Evia) level.
  • Most descriptive metadata is at the "field collection" level. Objects lower in the logical hierarchy
    refer back to the collection for their data.
  • Logical structural metadata is rooted at the "field collection" level. Objects lower in the logical
    hierarchy refer back to the collection for their data.
  • Physical structural metadata is at the "field collection" level.
  • Technical metadata is at the "field collection" level, but is fairly minimal.
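
For tools that need to know where to look, the simpler cases above could be captured in a lookup table. The sketch below records only the primary level for each metadata type in four of the collections; Variations2 and Evia are left out because their arrangements do not reduce to a single level. The names STORAGE_LEVELS and storage_level are hypothetical.

```python
# Primary storage level per metadata type, taken from the lists above.
STORAGE_LEVELS = {
    "Wright American Fiction": {
        "descriptive": "book", "structural": "book", "technical": "page", "text": "book",
    },
    "Hoagy Carmichael scrapbooks": {
        "descriptive": "scrapbook", "structural": "scrapbook", "technical": "clip", "text": "clip",
    },
    "IU Sheet music": {
        "descriptive": "item", "structural": "item", "technical": "page", "text": "page",
    },
    "Hohenberger/Steel": {
        "descriptive": "item", "structural": "item", "technical": "item",
    },
}


def storage_level(collection, metadata_type):
    """Return the level holding this metadata type, or None if the collection has none."""
    return STORAGE_LEVELS[collection].get(metadata_type)
```

For example, storage_level("IU Sheet music", "technical") returns "page", while storage_level("Hohenberger/Steel", "text") returns None because no text is listed for that collection.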