
Note

This tool is being superseded by the Object Ingest Tool

This document explains the initial specs of how the InHarmony ingest tool should work:

File structure

InHarmony cataloging tools place files into this structure:

  • metadata
    • MODS filename: isl-aad-8761-MODS.xml
      • <instid>-<uniqid>-MODS.xml
  • masters
    • Master filename: isl-aad-8761-01-01.tif
      • <instid>-<uniqid>-<copyno>-<pageno>.tif
    • copyno: 01..99
    • pageno: 01..99 (may possibly go above 99)
  • derivatives
    • Derivative image filename: isl-aad-8761-01-01-full.jpg
      • <instid>-<uniqid>-<copyno>-<pageno>-<version>.jpg
    • Derivative PDF filename: isl-aad-8761-01.pdf
      • <instid>-<uniqid>-<copyno>.pdf
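
The naming patterns above can be turned into a small parser. The following is a minimal sketch, not the tool's actual code: the class name MasterFileName and its methods are hypothetical, and it assumes that <instid> contains no hyphen (isl), that <uniqid> may contain hyphens (aad-8761), that copyno is exactly two digits, and that pageno is two or more digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits a master filename into its parts. */
final class MasterFileName {
    // <instid>-<uniqid>-<copyno>-<pageno>.tif, e.g. isl-aad-8761-01-01.tif
    // Assumption: instid has no hyphen; uniqid may contain hyphens;
    // copyno is two digits; pageno is two or more digits.
    private static final Pattern MASTER =
            Pattern.compile("^([a-z]+)-(.+)-(\\d{2})-(\\d{2,})\\.tif$");

    final String instId;
    final String uniqId;
    final String copyNo;
    final String pageNo;

    private MasterFileName(String instId, String uniqId, String copyNo, String pageNo) {
        this.instId = instId;
        this.uniqId = uniqId;
        this.copyNo = copyNo;
        this.pageNo = pageNo;
    }

    static MasterFileName parse(String fileName) {
        Matcher m = MASTER.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a master filename: " + fileName);
        }
        return new MasterFileName(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    /** Item id shared by the MODS file and all masters, e.g. isl-aad-8761. */
    String itemId() {
        return instId + "-" + uniqId;
    }

    /** Name of the matching MODS file, e.g. isl-aad-8761-MODS.xml. */
    String modsFileName() {
        return itemId() + "-MODS.xml";
    }

    /** Name of a derivative image, e.g. isl-aad-8761-01-01-full.jpg. */
    String derivativeFileName(String version) {
        return itemId() + "-" + copyNo + "-" + pageNo + "-" + version + ".jpg";
    }

    /** Name of the per-copy PDF, e.g. isl-aad-8761-01.pdf. */
    String pdfFileName() {
        return itemId() + "-" + copyNo + ".pdf";
    }
}
```

For example, MasterFileName.parse("isl-aad-8761-01-01.tif").derivativeFileName("full") yields isl-aad-8761-01-01-full.jpg, the derivative name shown in the list above.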

InHarmony ingest tool configuration

The InHarmony ingest tool reads the path of a config directory and looks in that directory for:

  • IngestTool.properties
  • InHarmonyIngest.properties

The format of the IngestTool.properties file is the same as the infrastructure Ingest Tool config file. The InHarmonyIngest.properties file contains these configuration options:

# Properties for the In harmony ingest
metadataPath=C:/projects/infrastructure/WebRoot/tests/InHarmony/metadata
mods2dcTransform=C:/projects/infrastructure/WebRoot/tests/InHarmony/ead2dc.xsl
masterPath=C:/projects/infrastructure/WebRoot/tests/InHarmony/masters
derivPath=C:/projects/infrastructure/WebRoot/tests/InHarmony/derivs
collectionPID=iudl:10
adminEmail=
stopFile=C:/projects/infrastructure/WebRoot/tests/InHarmony/stop
log4jConfigFile=C:\\projects\\infrastructure\\WebRoot\\log4j-inharmony.properties
jhovePath=C:/projects/infrastructure/WebRoot/jhove/
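
Reading this file could look roughly like the sketch below, which uses the standard java.util.Properties loader. The class name InHarmonyConfig and the choice of required keys are assumptions made for illustration; the spec does not say which options are mandatory (adminEmail is left empty in the example, so it is treated as optional here).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical loader for InHarmonyIngest.properties. */
final class InHarmonyConfig {
    // Assumption: these four options are needed for a scan to run at all.
    private static final String[] REQUIRED =
            {"metadataPath", "masterPath", "derivPath", "stopFile"};

    /** Loads InHarmonyIngest.properties from the given config directory. */
    static Properties fromDirectory(Path configDir) {
        Path file = configDir.resolve("InHarmonyIngest.properties");
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses properties from text; convenient for trying the loader out. */
    static Properties parse(String text) {
        try {
            return read(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Properties read(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in); // a backslash is an escape character in this format
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalStateException("Missing required property: " + key);
            }
        }
        return props;
    }
}
```

Because the properties format treats a backslash as an escape character, a Windows path written with doubled backslashes, as in log4jConfigFile above, is read back with single backslashes; the forward-slash paths are read unchanged.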

InHarmony ingest tool algorithm

  • Check if the stop file exists; if so, delete the stop file and quit
  • Check if there's a MODS file (isl-aad-8761-MODS.xml) in the metadata directory (metadataPath)
    • Check if the flag file isl-aad-8761-MODS.xml-process is present
    • Read the MODS file into a MODS record
    • Get the item id in the form of /isl/sheetMusic/isl-aad-8761
      • note: right now the InHarmony cataloging tool IDs are isl-aad-8761 but will be changed to include the prefix
    • Get the item title from the MODS record
    • Apply the mods2dc transformation to the MODS record to get the DC record
    • move the MODS file to: metadataPath/completed
    • try to load the item from the repository
      • if it doesn't exist in the repository, create a new manifestation object in the repository
    • Update/Ingest the item on the repository with the new MODS and DC records
    • Link the manifestation object to the collection level object
  • Check if there's a tif file (isl-aad-8761-01-01.tif) in the masters directory (masterPath)
    • Check if the flag file isl-aad-8761-01-01.tif-process exists
    • Get the item id from the file name: isl-aad-8761-01-01.tif -> isl-aad-8761-01-01
    • Check if the derivatives (thumbnail, screen and full) are in the derivatives directory: derivPath
    • If the derivatives are ready:
    • try to load the manifestation object from the repository with id isl-aad-8761
    • If it doesn't exist, prepare to create a new manifestation object
    • try to load the page object from the repository with id isl-aad-8761-01
    • If it doesn't exist, prepare to create a new page object
    • try to load the image object from the repository with id isl-aad-8761-01-01
    • If it doesn't exist, prepare to create a new image object
    • Create MIX records from the derivatives and upload them to the repository
    • Update the structural metadata in the image object to point to the derivative images
    • Move the master file to masterPath/completed and derivative files to derivPath/completed
    • Ingest the image object
    • Update the structural metadata in the page object to point to the page image object
    • If the pdf file exists in the derivatives directory,
      • add it to the page object
      • Q: how should the structural metadata belonging to the page object be modified for the PDF file?
      • move it to derivPath/completed
    • Ingest the page object
    • Link the image object to the page object
    • Update the structural metadata in the manifestation object to point to the page (copy) object
    • Ingest the manifestation object
    • Link the page object to the manifestation object
  • Wait 15 seconds before scanning the metadata and masters directories again
  • Repeat
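
The file-system side of the loop above (stop file, flag files, derivative check, wait) can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: the class IngestScanner and its methods are hypothetical, the repository steps are left to a caller-supplied Runnable, and the derivative suffixes -thumbnail and -screen are guessed from the "thumbnail, screen and full" wording (only -full appears in the file structure section).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical scanner covering the file-system checks of the algorithm. */
final class IngestScanner {
    static final String FLAG_SUFFIX = "-process";
    // Assumption: "thumbnail" and "screen" are used as filename suffixes like "full".
    static final String[] VERSIONS = {"thumbnail", "screen", "full"};

    /** Deletes the stop file if it exists; returns true when the tool should quit. */
    static boolean consumeStopFile(Path stopFile) {
        try {
            return Files.deleteIfExists(stopFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists files with the given extension that have a "<name>-process" flag next to them. */
    static List<Path> readyFiles(Path dir, String extension) {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + extension)) {
            for (Path file : stream) {
                Path flag = file.resolveSibling(file.getFileName() + FLAG_SUFFIX);
                if (Files.isRegularFile(file) && Files.exists(flag)) {
                    ready.add(file);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(ready); // directory listing order is not guaranteed
        return ready;
    }

    /** True when every derivative image for the given image id is present. */
    static boolean derivativesReady(Path derivDir, String imageId) {
        for (String version : VERSIONS) {
            if (!Files.exists(derivDir.resolve(imageId + "-" + version + ".jpg"))) {
                return false;
            }
        }
        return true;
    }

    /** Runs one scan pass, waits, and repeats until the stop file appears. */
    static void run(Path stopFile, Runnable onePass, long waitMillis)
            throws InterruptedException {
        while (!consumeStopFile(stopFile)) {
            onePass.run();
            Thread.sleep(waitMillis);
        }
    }
}
```

With the 15-second wait from the algorithm, a caller would invoke run(stopFile, pass, 15000L), where pass processes readyFiles(metadataDir, ".xml") first and then readyFiles(masterDir, ".tif"), skipping any master whose derivatives are not yet complete.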

Questions/Issues

  • The current implementation uses flag files as described here and in the algorithm above. When the ingest tool scans the input directories, a file transfer may still be in progress (especially for large master files), so without a flag the InHarmony ingest tool could try to process a file before it is fully copied.
    • One possible solution is to use empty files as flags. For example, the InHarmony ingest tool can wait until isl-aad-8761-01-01.tif-process exists in the master directory before picking isl-aad-8761-01-01.tif up for processing. The same can be applied to the metadata files, i.e. wait until isl-aad-8761-MODS.xml-process is placed in the metadata directory before processing it.
    • For this scheme to work, files should be placed by the producer in this order:
      • For metadata files: copy the mods file (isl-aad-8761-MODS.xml) to the metadata directory and wait until it's complete
      • Create a flag file in the metadata directory: isl-aad-8761-MODS.xml-process
      • For master files: copy the derivative files to the derivative directory (the InHarmony ingest tool won't process master files unless all the derivatives are ready)
      • Copy the master file (isl-aad-8761-01-01.tif) to the master directory and wait until it's fully created in the directory
      • Create a flag file isl-aad-8761-01-01.tif-process
  • Should there be a separate Fedora collection object for each institution? Institutions include isl, lilly, etc. (see InHarmony workspace for the full list).
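
The producer-side ordering described under the first issue above (copy the file, wait for the copy to finish, then create the flag) might be sketched like this. The class FlaggedDrop and the method deliver are hypothetical names; the sketch assumes the producer copies with a blocking call, so the flag is only written once the copy call has returned.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical producer-side helper: copy first, flag second. */
final class FlaggedDrop {
    static final String FLAG_SUFFIX = "-process";

    /**
     * Copies source into targetDir and then creates the empty
     * "<filename>-process" flag next to it. Returns the flag path.
     */
    static Path deliver(Path source, Path targetDir) {
        try {
            Path target = targetDir.resolve(source.getFileName().toString());
            // Files.copy blocks until the copy call has finished.
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            Path flag = target.resolveSibling(target.getFileName() + FLAG_SUFFIX);
            // Only now signal the ingest tool that the file may be picked up.
            Files.write(flag, new byte[0]);
            return flag;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a master file, the derivative files would be copied to the derivative directory before deliver is called for the .tif, so that the flag appears only after everything the ingest tool needs is in place.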