
Note

This tool is being superseded by the more general Object Ingest Tool.

Initial implementation of the Mass Digitization Ingest Tool:

Directory structure

(This explains the structure of the samples we received.)

  • OCR text directory: VAA4276-OCR (<id>-OCR)
  • PDF file: VAA4276.pdf (<id>.pdf)
  • Page images directory: vaa4276-tif (<id>-tif)
    • Page image file: VAA4276-0092.tif (<id>-<pageno>.tif)
  • (proposed) Derivative page images directory: vaa4276-derivs (<id>-derivs)
    • VAA4276-0092-screen.jpg (<id>-<pageno>-<size>.jpg)
  • (proposed) MARC XML file: vaa4276.xml (<id>.xml)

Q: Should we standardize the letter case (upper/lower) of file and directory names?
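
The naming patterns above could be checked with a small parser. The following is only a sketch, not part of the tool: the regular expressions, the `PageFile` type, and the `parse_page_file` function are hypothetical names, and normalizing the ID to upper case is an assumption (the case question above is still open). Matching is case-insensitive because the samples mix `VAA4276` and `vaa4276`.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical patterns derived from the sample names above.
# Master page image: <id>-<pageno>.tif
PAGE_IMAGE_RE = re.compile(
    r"^(?P<id>[A-Za-z0-9]+)-(?P<pageno>\d+)\.tif$", re.IGNORECASE
)
# Proposed derivative image: <id>-<pageno>-<size>.jpg
DERIVATIVE_RE = re.compile(
    r"^(?P<id>[A-Za-z0-9]+)-(?P<pageno>\d+)-(?P<size>[A-Za-z0-9]+)\.jpg$",
    re.IGNORECASE,
)


class PageFile(NamedTuple):
    item_id: str          # normalized to upper case (an assumption)
    page_number: int      # leading zeros dropped
    size: Optional[str]   # None for a master image, e.g. "screen" for a derivative


def parse_page_file(name: str) -> Optional[PageFile]:
    """Return the parts of a page image file name, or None if it does not match."""
    match = PAGE_IMAGE_RE.match(name)
    if match:
        return PageFile(match.group("id").upper(), int(match.group("pageno")), None)
    match = DERIVATIVE_RE.match(name)
    if match:
        return PageFile(
            match.group("id").upper(),
            int(match.group("pageno")),
            match.group("size").lower(),
        )
    return None
```

Under these assumptions, `parse_page_file("VAA4276-0092.tif")` and `parse_page_file("vaa4276-0092.tif")` give the same item ID and page number, so the tool would not depend on the case used by whoever delivered the files.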

Config options

The MassDig Ingest Tool uses the Ingest Tool to perform ingests; see the relevant pages for its configuration.

Here are the options specific to the MassDig Ingest Tool:

derivativeDir=c:/projects/infrastructure/WebRoot/tests/MassDig/derivatives
marc2dcTransform=c:/projects/infrastructure/WebRoot/tests/MassDig/ead2dc.xsl
collectionPID=iudl:100
adminEmail=someone@indiana.edu
log4jConfigFile=C:/projects/infrastructure/WebRoot/log4j-massdig.properties
jhovePath=C:/projects/infrastructure/WebRoot/jhove/
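
For illustration, options in this `key=value` form could be read and checked as sketched below. This is a minimal sketch and not the tool's actual loader: `parse_properties`, `missing_keys`, and `REQUIRED_KEYS` are hypothetical names, treating all six options as required is an assumption, and the parser ignores features of the full Java properties format such as escapes and line continuations.

```python
from typing import Dict, List

# Assumption: every option listed on this page is required.
REQUIRED_KEYS = (
    "derivativeDir",
    "marc2dcTransform",
    "collectionPID",
    "adminEmail",
    "log4jConfigFile",
    "jhovePath",
)


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    options: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first "=" only, so values may contain "=" or ":".
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a key=value line: {raw_line!r}")
        options[key.strip()] = value.strip()
    return options


def missing_keys(options: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent, in declaration order."""
    return [key for key in REQUIRED_KEYS if key not in options]


SAMPLE = """\
collectionPID=iudl:100
adminEmail=someone@indiana.edu
jhovePath=C:/projects/infrastructure/WebRoot/jhove/
"""

if __name__ == "__main__":
    parsed = parse_properties(SAMPLE)
    print(parsed["collectionPID"])
    print(missing_keys(parsed))
```

Reporting all missing keys at once, rather than failing on the first, would make a misconfigured run easier to fix before any ingest starts.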

Ingest Algorithm

  • Get the paths to config directory and item directory from the command line
  • Get the item ID from the item directory name
    • Read in the MARC record and generate a DC record from it
    • Try to load the book object
    • If it doesn't exist, create a new book object
    • Scan the item directory for page images; for each master image:
      • Try to load the page image object
      • Create a new object if it doesn't exist
      • Generate MIX technical data from master and derivative image files
      • Add/Update the image item
      • Update structural metadata of the book object with the new page
      • Link page item to book item
    • Generate MIX from PDF and add to METS
    • Upload PDF
    • Link book item to collection item
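
The control flow of the steps above can be sketched as follows. This is a rough illustration under stated assumptions, not the tool's implementation: `InMemoryRepository` is a hypothetical stand-in for the real repository and ingest tool, the item directory is assumed to be named after the item ID, and the MARC-to-DC transform, MIX generation, and PDF upload are left out because they depend on external tools.

```python
from pathlib import Path
from typing import Dict, List


class InMemoryRepository:
    """Hypothetical stand-in for the object repository (not the real API)."""

    def __init__(self) -> None:
        self.objects: Dict[str, dict] = {}

    def load_or_create(self, object_id: str, kind: str) -> dict:
        """Try to load an object; create a new one if it does not exist."""
        obj = self.objects.get(object_id)
        if obj is None:
            obj = {"id": object_id, "kind": kind, "pages": [], "links": []}
            self.objects[object_id] = obj
        return obj


def find_master_images(item_dir: Path, item_id: str) -> List[Path]:
    """Return the .tif files in the <id>-tif directory, matched case-insensitively."""
    wanted = f"{item_id}-tif".lower()
    for child in item_dir.iterdir():
        if child.is_dir() and child.name.lower() == wanted:
            return sorted(p for p in child.iterdir() if p.suffix.lower() == ".tif")
    return []


def ingest_item(repo: InMemoryRepository, item_dir: Path, collection_pid: str) -> dict:
    # Assumption: the item directory name is the item ID.
    item_id = item_dir.name.upper()
    book = repo.load_or_create(item_id, "book")
    for image in find_master_images(item_dir, item_id):
        page_id = image.stem.upper()
        page = repo.load_or_create(page_id, "page")
        page["master"] = image.name           # add/update the image item
        if page_id not in book["pages"]:      # update structural metadata
            book["pages"].append(page_id)
        if item_id not in page["links"]:      # link page item to book item
            page["links"].append(item_id)
    if collection_pid not in book["links"]:   # link book item to collection item
        book["links"].append(collection_pid)
    return book
```

Because every step is written as load-or-create followed by an update, running the sketch twice on the same directory leaves the repository unchanged, which matches the "try to load, create if it doesn't exist" wording of the algorithm.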

Some issues and questions

  • What is the structure of the PURLs for book and page items? general/pageturner
  • Are we going to generate and store MODS and DC records? Do we have a sample MARC record corresponding to these?
  • Do we have an existing case where generated PDF technical metadata is stored in the book item metadata?