
This document shows the file structures required by the ingest process. The source files (e.g. master images, MIX metadata files, derivative images, OCR text output, PDF files, TEI files, and other metadata) are usually processed by the new ImageProc system developed by Brian and placed in a directory in a central location, where the ingest process finds them. Here, in an attempt to standardize that structure, I document the file structure expected by the ingest process.

Image based collections

These collections usually consist of digitized photos that are standalone items. Cushman, Hohenberger, and US Steel are some examples of this type of collection. The files for each photo usually consist of the following:

  • Master image file (usually in TIF format)
    • Or a MIX technical metadata file generated from the master image file
  • Derivative images
    • Large size (1000px), screen size (600px) and thumbnail size (200px) images in either JPG or GIF formats
    • A larger JPEG 2000 derivative is going to be added for some collections (e.g. Somali posters)
  • Descriptive metadata
    • The descriptive metadata is presented or generated in several ways. Two of them are: a) pre-generated individual metadata records for each image; b) a finding aid plus an XSLT that generates item-level metadata records.
    • A DC record is also generated and stored with the image. (With the increased performance of disseminators, we might be able to move this functionality to a disseminator instead of pre-generating and storing these records. The main purpose of the DC records right now is OAI.)

Proposed directory structure for passing files to Ingest

For image collections, a simple one-level directory hierarchy is sufficient.

| File Type | Directory | Examples |
|---|---|---|
| Master image or MIX | <ingest-dir>/masters | Master TIFF or a MIX file generated from the master file. [item-id].mix or [item-id].tif |
| Derivative images | <ingest-dir>/derivatives | Thumbnail, screen and large size JPEGs and a JPEG 2000 file. [item-id]-thumb.jpg, [item-id]-screen.jpg, [item-id]-large.jpg, [item-id].j2k |
| Descriptive metadata | <ingest-dir>/metadata | MODS or DC XML files if available (they are usually generated on the fly though). [item-id]-mods.xml, [item-id]-dc.xml |
| Ready-to-ingest flag | <ingest-dir> | An empty file used just as a flag. [item-id]-finished |

In some cases, the masters and derivatives may be generated on the fly and placed in the <ingest-dir> before an image is fully processed by the image processing system. For these cases, a flag file helps the ingest process decide whether an item is ready to be ingested. The flag file is named [item-id]-finished and placed in the <ingest-dir>.
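The flag-file convention above can be sketched in Python as follows. This is only an illustration of the proposed layout, not the actual ingest code: the function names `ready_item_ids` and `image_item_files` are hypothetical, and the sketch assumes that every file in <ingest-dir> whose name ends in -finished is a flag file.

```python
from pathlib import Path

FLAG_SUFFIX = "-finished"  # flag files are named [item-id]-finished


def ready_item_ids(ingest_dir):
    """Return the sorted item IDs that have a flag file directly in ingest_dir."""
    root = Path(ingest_dir)
    return sorted(
        p.name[: -len(FLAG_SUFFIX)]
        for p in root.iterdir()
        if p.is_file() and p.name.endswith(FLAG_SUFFIX) and p.name != FLAG_SUFFIX
    )


def image_item_files(ingest_dir, item_id):
    """Collect the files that actually exist for one image item, by directory."""
    root = Path(ingest_dir)
    # File names follow the proposed directory structure for image collections.
    candidates = {
        "masters": [f"{item_id}.tif", f"{item_id}.mix"],
        "derivatives": [
            f"{item_id}-thumb.jpg",
            f"{item_id}-screen.jpg",
            f"{item_id}-large.jpg",
            f"{item_id}.j2k",
        ],
        "metadata": [f"{item_id}-mods.xml", f"{item_id}-dc.xml"],
    }
    return {
        subdir: [root / subdir / name for name in names
                 if (root / subdir / name).is_file()]
        for subdir, names in candidates.items()
    }
```

An ingest script could loop over `ready_item_ids(ingest_dir)` and call `image_item_files` for each ID. Items without a flag file are simply left for a later run.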

Paged document collections

Books, journals, and music scores are some examples of this type of collection. In the past, we also treated photos with scanned back sides this way. The main components of this type of collection are page image objects and document (book, score, journal, etc.) objects. Page image objects are the same as the objects in image collections, except that they might also contain OCR'ed text and they lack descriptive information. Page image objects are linked to their parent document objects using Fedora-supported RDF relationships.

Document objects:

  • PDF file, if it exists
  • TEI text encoding, if it exists (for some collections, generated from OCR'ed page text)
  • Descriptive metadata, if it exists

Page-image objects

  • Master image files
  • Derivative images
  • Text from OCR

Proposed directory structure for passing files to Ingest

| File Type | Directory | Examples |
|---|---|---|
| Master image or MIX | <ingest-dir>/masters/[item-id] | Master image file or the MIX file generated from the master file. [item-id]-[page-no].mix |
| TEI file | <ingest-dir>/tei | [item-id].xml |
| Derivative images | <ingest-dir>/derivatives/[item-id] | [item-id]-[page-no]-thumb.jpg, etc. |
| PDF file | <ingest-dir>/derivatives/[item-id] | [item-id].pdf |
| OCR file | <ingest-dir>/derivatives/[item-id] | [item-id]-[page-no].txt |
| Descriptive metadata | <ingest-dir>/metadata | [item-id]-mods.xml (but this is usually retrieved on the fly, e.g., from IUCat) |
| Ready-to-ingest flag | <ingest-dir> | [item-id]-finished |

The flag file should be created when all the pieces of the paged document object are in place. The exception is when an updated source component is added after the paged document ingest is complete. For example, we might decide to add PDF files to existing paged document objects at a later date. In this case, only the PDF files need to be placed in the derivatives directory, but the flag files must still be created to signal readiness.
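As a rough sketch of how a script might gather the page-level files for one paged document under this layout, consider the Python below. The names `page_files` and `is_ready` are hypothetical, and the sketch assumes that [page-no] consists only of digits, which the naming scheme above does not guarantee.

```python
import re
from pathlib import Path


def page_files(ingest_dir, item_id, subdir, suffix):
    """Return {page_no: path} for files named [item-id]-[page-no]<suffix>.

    Assumption: [page-no] is purely numeric (e.g. 0001).
    """
    pattern = re.compile(re.escape(item_id) + r"-(\d+)" + re.escape(suffix))
    folder = Path(ingest_dir) / subdir / item_id
    pages = {}
    if folder.is_dir():
        for path in folder.iterdir():
            match = pattern.fullmatch(path.name)
            if match:
                pages[int(match.group(1))] = path
    # Sort by page number so callers can process pages in reading order.
    return dict(sorted(pages.items()))


def is_ready(ingest_dir, item_id):
    """True if the [item-id]-finished flag file exists in ingest_dir."""
    return (Path(ingest_dir) / f"{item_id}-finished").is_file()
```

For example, `page_files(ingest_dir, item_id, "derivatives", ".txt")` would collect the OCR files and `page_files(ingest_dir, item_id, "masters", ".mix")` the MIX files. In the later-update case described above, both may come back empty while `is_ready` is still true because only a PDF was added.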

Paged document objects usually have descriptive records stored in the IUCat database. These are best accessed using the "Cataloging Key" identifier (the title control number is not very suitable because it might not be unique). These items usually require a mapping from VAA number to cataloging key. Spencer Anspach has provided this mapping based on the digitization spreadsheet we sent him; the spreadsheet contains titles, call numbers, and similar information, which Spencer uses to retrieve the cataloging keys. The resulting cataloging key to VAA mapping is put in a properties file and passed to the ingest processing scripts.
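The exact layout of the properties file is not described here. Assuming it uses simple Java-style `key=value` lines with the VAA number as the key and the cataloging key as the value (both the direction and the sample values below are assumptions), a minimal reader could look like this. It is a simplification that ignores escapes and line continuations, and the function name `load_key_mapping` is hypothetical.

```python
import re

# A key and its value are separated by the first '=' or ':' on the line.
_SEPARATOR = re.compile(r"\s*[=:]\s*")


def load_key_mapping(text):
    """Parse simple key=value lines into a dict (VAA number -> cataloging key).

    Blank lines, comment lines starting with '#' or '!', and lines
    without a separator are skipped.
    """
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        parts = _SEPARATOR.split(line, maxsplit=1)
        if len(parts) == 2 and parts[0]:
            mapping[parts[0]] = parts[1]
    return mapping


# Hypothetical sample content; real identifiers will differ.
sample = """\
# VAA number = cataloging key
VAA0001 = a111
VAA0002=a222
"""
mapping = load_key_mapping(sample)
```

The ingest scripts could then look up `mapping[vaa_number]` to find the cataloging key used to retrieve the IUCat record.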

Multi-copy paged document collections

InHarmony is the project with this type of structure. Structurally, this type of collection has a three-level hierarchy: a top abstract level, a second document level, and a third page image level. The document level is the same as in the paged document collections, whereas the top level combines multiple copies of the same work (the same printed score but with different cover and back).

Multi-copy (top level) objects:

  • Descriptive metadata

Document objects

  • PDF file, if it exists
  • TEI text encoding, if it exists

Page-image objects

  • Master image files
  • Derivative images
  • Text from OCR

Proposed directory structure for passing files to Ingest

TBD.

Print journal based collections

We currently have two instances of this type of collection: I-Witness and Indiana Magazine of History. This type is relatively complicated, with three levels of object hierarchy and multiple connections among objects. The three levels are volume, issue, and page image. There is also an additional sublevel for article objects.

Volume objects

  • TEI file

Issue objects (paged document object)

  • PDF file, if it exists

Page-image objects

  • Master image files or MIX files
  • Derivative images
  • Text from OCR

Article objects

  • Article PDF file
  • TEI header file
  • Article level metadata file

Proposed directory structure for passing files to Ingest

| File Type | Directory | Examples |
|---|---|---|
| Master image or MIX | <ingest-dir>/masters/[issue-id] | [page-id].mix |
| Derivative page images | <ingest-dir>/derivatives/[issue-id] | [page-id]-thumb.jpg, [page-id]-screen.jpg, etc. |
| OCR | <ingest-dir>/derivatives/[issue-id] | [page-id].txt |
| PDF of the whole issue | <ingest-dir>/derivatives/[issue-id] | [issue-id].pdf |
| PDF of articles | <ingest-dir>/pdf | [article-id].pdf |
| TEI | <ingest-dir>/tei | [issue-id].xml |
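To make the journal layout concrete, the following Python sketch builds the expected paths for one issue from its issue ID, page IDs, and article IDs. It only composes path names and does not touch the file system. The function name `issue_paths` and the sample identifiers in the usage note are hypothetical, and the sketch assumes MIX files as masters.

```python
from pathlib import PurePosixPath


def issue_paths(ingest_dir, issue_id, page_ids, article_ids):
    """Build the expected file paths for one journal issue."""
    root = PurePosixPath(ingest_dir)
    masters = root / "masters" / issue_id
    derivatives = root / "derivatives" / issue_id
    paths = {
        "tei": root / "tei" / f"{issue_id}.xml",
        "issue_pdf": derivatives / f"{issue_id}.pdf",
        # Article PDFs live in a shared pdf directory, not under the issue.
        "article_pdfs": [root / "pdf" / f"{a}.pdf" for a in article_ids],
        "pages": {},
    }
    for page_id in page_ids:
        paths["pages"][page_id] = {
            "master": masters / f"{page_id}.mix",
            "thumb": derivatives / f"{page_id}-thumb.jpg",
            "screen": derivatives / f"{page_id}-screen.jpg",
            "ocr": derivatives / f"{page_id}.txt",
        }
    return paths
```

For instance, `issue_paths("/ingest", "issue1", ["p001"], ["art1"])` yields `/ingest/derivatives/issue1/issue1.pdf` for the issue PDF and `/ingest/pdf/art1.pdf` for the article PDF, which highlights that article PDFs are not grouped by issue in this layout.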

Multi-volume Paged documents

Indiana Authors Encyclopedia is an example of this content model. This encyclopedia contains three volumes. Since each volume making up the whole is effectively a paged document, we will treat each one as such.

| File Type | Directory | Examples |
|---|---|---|
| Master image or MIX | <ingest-dir>/masters/[volume-id] | Master image file or the MIX file generated from the master file. [volume-id]-[page-no].mix |
| TEI file | <ingest-dir>/tei | [volume-id].xml |
| Derivative images | <ingest-dir>/derivatives/[volume-id] | [volume-id]-[page-no]-thumb.jpg, etc. |
| PDF file | <ingest-dir>/derivatives/[volume-id] | [volume-id].pdf |
| OCR file | <ingest-dir>/derivatives/[volume-id] | [volume-id]-[page-no].txt |
| Descriptive metadata | <ingest-dir>/metadata | [volume-id]-mods.xml (but this is usually retrieved on the fly, e.g., from IUCat) |
| Ready-to-ingest flag | <ingest-dir> | [volume-id]-finished |
