This tool is being superseded by the more general Object Ingest Tool.
Initial implementation of Mass Digitization ingest tool:
(This describes the structure of the samples we received.)
- OCR text directory: VAA4276-OCR (<id>-OCR)
- PDF file: VAA4276.pdf (<id>.pdf)
- Page images directory: vaa4276-tif (<id>-tif)
  - Page image file: VAA4276-0092.tif (<id>-<pageno>.tif)
- (proposed) Derivative page images directory: vaa4276-derivs (<id>-derivs)
  - Derivative page image file: VAA4276-0092-screen.jpg (<id>-<pageno>-<size>.jpg)
- (proposed) MARC XML file: vaa4276.xml (<id>.xml)
Q: Should we try to standardize the letter case (upper/lower) of file and directory names?
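The naming scheme above can be checked mechanically. The following is a minimal sketch, assuming the patterns inferred from the sample names; the function names and the choice to fold IDs to upper case are illustrative assumptions, not part of the tool. Matching is case-insensitive because the samples mix upper and lower case.

```python
import re

# Patterns inferred from the sample names (assumption, not a confirmed spec).
# <id>-<pageno>.tif, e.g. VAA4276-0092.tif
PAGE_RE = re.compile(r"^(?P<id>[a-z0-9]+)-(?P<pageno>\d+)\.tif$", re.IGNORECASE)
# <id>-<pageno>-<size>.jpg, e.g. VAA4276-0092-screen.jpg
DERIV_RE = re.compile(
    r"^(?P<id>[a-z0-9]+)-(?P<pageno>\d+)-(?P<size>[a-z0-9]+)\.jpg$", re.IGNORECASE
)


def normalize_id(item_id):
    """Fold an item ID to one case; upper case is an arbitrary choice here."""
    return item_id.upper()


def parse_page_name(name):
    """Return (item_id, page_number) for a master page image, or None."""
    m = PAGE_RE.match(name)
    if m is None:
        return None
    return normalize_id(m.group("id")), int(m.group("pageno"))


def parse_deriv_name(name):
    """Return (item_id, page_number, size) for a derivative image, or None."""
    m = DERIV_RE.match(name)
    if m is None:
        return None
    return normalize_id(m.group("id")), int(m.group("pageno")), m.group("size").lower()
```

Normalizing on read, as in this sketch, would let the tool accept the samples as delivered even if the naming is never standardized on disk.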
The MassDig Ingest Tool uses the Ingest Tool to perform ingests; see the relevant pages for that tool's configuration.
Here are the processing steps specific to the MassDig Ingest Tool:
- Get the paths to the config directory and item directory from the command line
- Get the item ID from the item directory name
- Read in the MARC record and generate a DC (Dublin Core) record from it
- Try to load the book object
- If it doesn't exist, create a new book object
- Scan the item directory for page images; for each master image:
  - Try to load the page image object
  - Create a new object if it doesn't exist
  - Generate MIX technical data from master and derivative image files
  - Add/Update the image item
  - Update structural metadata of the book object with the new page
  - Link page item to book item
- Generate MIX from PDF and add to METS
- Upload PDF
- Link book item to collection item
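The control flow of these steps can be sketched as follows. This is only an illustration: `InMemoryRepo`, `RepoObject`, and `ingest_item` are hypothetical stand-ins, since the real Ingest Tool API is not described on this page, and the MARC/DC and MIX generation steps are omitted. The sketch shows the load-or-create pattern and the page-to-book and book-to-collection linking.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RepoObject:
    """Hypothetical stand-in for a repository object (book or page)."""
    object_id: str
    kind: str
    links: list = field(default_factory=list)  # IDs of parent objects
    pages: list = field(default_factory=list)  # structural metadata (books only)
    files: list = field(default_factory=list)  # uploaded file names


class InMemoryRepo:
    """Hypothetical repository; the real tool would talk to the Ingest Tool."""

    def __init__(self):
        self.objects = {}

    def load_or_create(self, object_id, kind):
        # Try to load the object; create a new one if it does not exist.
        if object_id not in self.objects:
            self.objects[object_id] = RepoObject(object_id, kind)
        return self.objects[object_id]


def ingest_item(repo, item_dir, collection_id):
    item_dir = Path(item_dir)
    # The item ID comes from the directory name; upper-casing is an assumption.
    item_id = item_dir.name.upper()
    book = repo.load_or_create(item_id, "book")

    # Find the <id>-tif directory case-insensitively, then list master images.
    tif_dirs = [d for d in item_dir.iterdir()
                if d.is_dir() and d.name.lower() == f"{item_id.lower()}-tif"]
    masters = sorted(p for d in tif_dirs for p in d.iterdir()
                     if p.suffix.lower() == ".tif")

    for master in masters:
        page_id = master.stem.upper()
        page = repo.load_or_create(page_id, "page")
        if master.name not in page.files:
            page.files.append(master.name)  # add/update the image item
        if page_id not in book.pages:
            book.pages.append(page_id)      # update the book's structural metadata
        if item_id not in page.links:
            page.links.append(item_id)      # link page item to book item

    # Upload the PDF if one is present (<id>.pdf, matched case-insensitively).
    for p in item_dir.iterdir():
        if p.is_file() and p.name.lower() == f"{item_id.lower()}.pdf":
            if p.name not in book.files:
                book.files.append(p.name)

    if collection_id not in book.links:
        book.links.append(collection_id)    # link book item to collection item
    return book
```

Because every step checks for existing state before adding, running the sketch twice on the same item directory leaves the repository unchanged, which matches the "try to load, else create" wording of the steps.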
Some issues and questions:
- What is the structure of the PURLs for book and page items? general/pageturner
- Are we going to generate and store MODS and DC records? Do we have a sample MARC record corresponding to these?
- Do we have an existing case where generated PDF technical metadata is stored in the book item metadata?
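Until a sample MARC record is available, the MARC-to-DC step can only be sketched against an invented record. The example below assumes the record arrives as MARCXML and maps a handful of common fields (245$a to title, 100$a to creator, 260$b to publisher, 260$c to date). The sample record, the `CROSSWALK` table, and the `marc_to_dc` function are all illustrative assumptions; a real crosswalk would cover many more fields.

```python
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace
MARC_NS = {"marc": MARC_URI}

# Invented record for illustration only; not taken from the samples.
SAMPLE_MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">Jane Doe.</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="b">Example Press,</subfield>
    <subfield code="c">1900.</subfield>
  </datafield>
</record>"""

# (MARC tag, subfield code) -> DC element; a deliberately small subset.
CROSSWALK = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
}


def marc_to_dc(marcxml):
    """Return a dict of DC element name -> list of values."""
    root = ET.fromstring(marcxml)
    dc = {}
    for datafield in root.iter(f"{{{MARC_URI}}}datafield"):
        tag = datafield.get("tag")
        for sub in datafield.findall("marc:subfield", MARC_NS):
            element = CROSSWALK.get((tag, sub.get("code")))
            if element and sub.text:
                # Strip trailing cataloguing punctuation. This is crude: it
                # would also drop the final period of an abbreviation.
                value = sub.text.strip().rstrip(" /,.:;")
                dc.setdefault(element, []).append(value)
    return dc
```

Calling `marc_to_dc(SAMPLE_MARCXML)` yields one value each for title, creator, publisher, and date. Whether the result should be stored as DC, MODS, or both is still the open question above.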