
Refactorings that would make the ingest tool more useful:

  1. Make it more like (or more integrated with?) the Directory Ingest Service.
  2. The current ingest tool contains a lot of complexity for dealing with cases where an "item" is defined by the presence of metadata, the presence of a media file, or a list of IDs. It would be better if the "item" could be simply defined as "whatever is in the SIP", where the SIP is a file or set of files dropped in some drop box. This would allow removal of the whole "Ingestion ID" concept.
    • Should we require that the SIP includes METS to describe the relationship between the files? If so, processing becomes much simpler, and we may even be able to use the Fedora Directory Ingest tool. The big disadvantage is that something must always create this METS document.
  3. Require that the metadata always be the definitive source of information like "which objects to ingest". An object without metadata must have metadata created before it can be ingested; otherwise it is useless. (This is relatively trivial: just add a title like "Unknown" and a valid ID number.)
  4. Move things around:
    • existingItem should go under items.
  5. Remove comments element. (Any metadata comments should be in the actual metadata, right?)
  6. Logging. Remove all logger settings (except email address) from the collection config file. A summary of success/failure should be sent to the email address, along with a URL for the full log file, but nothing else. Users will seldom need more, and the current system is unnecessarily complicated.
  7. In IngestToolWorker, refactor to make all public/package variables private.
  8. Why is the IngestToolWorker called as a separate process? Is there anything preventing it from being a thread?
  9. Put XPath for the primary identifier in collection config, instead of in
  10. Remove all the switch statements; replace with inheritance.
  11. Instead of using nested Hashtables and Pairs, which are difficult to follow, use real classes and have appropriate methods, so the abstraction is clear.
  12. Provide a simple interface that makes it easier to run.
  13. Remove the "requestID" functionality. We seldom need an external ID to manage an ingest process. Eventually, we will want a way to cancel a particular ingest (particularly if it is a long one), but we could easily allow the "ingest" function to return an ID number that could later be used to access the process.
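
Items 10 and 11 could look roughly like the sketch below. It is an illustration only, not the ingest tool's actual code: `IngestItem`, `ItemSource`, `IdListSource`, and `MediaFileSource` are hypothetical names, and the sketch assumes that the existing switch statements branch on how an "item" is defined (a list of IDs, a media file, and so on). The placeholder title "Unknown" follows item 3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical value class standing in for a nested Hashtable/Pair structure.
final class IngestItem {
    private final String id;
    private final String title;
    private final List<String> mediaFiles;

    IngestItem(String id, String title, List<String> mediaFiles) {
        this.id = id;
        this.title = title;
        // Defensive copy so callers cannot change the item afterwards.
        this.mediaFiles = Collections.unmodifiableList(new ArrayList<String>(mediaFiles));
    }

    String getId() { return id; }
    String getTitle() { return title; }
    List<String> getMediaFiles() { return mediaFiles; }
    boolean hasMedia() { return !mediaFiles.isEmpty(); }
}

// Each way of defining an item becomes a subclass instead of a switch case.
abstract class ItemSource {
    abstract List<IngestItem> listItems();
}

// Items defined by a list of IDs; no media is attached.
final class IdListSource extends ItemSource {
    private final List<String> ids;

    IdListSource(List<String> ids) { this.ids = ids; }

    @Override
    List<IngestItem> listItems() {
        List<IngestItem> items = new ArrayList<IngestItem>();
        for (String id : ids) {
            items.add(new IngestItem(id, "Unknown", Collections.<String>emptyList()));
        }
        return items;
    }
}

// Items defined by media files; the ID is assumed to be the file name
// without its extension (an assumption made for this sketch).
final class MediaFileSource extends ItemSource {
    private final List<String> fileNames;

    MediaFileSource(List<String> fileNames) { this.fileNames = fileNames; }

    @Override
    List<IngestItem> listItems() {
        List<IngestItem> items = new ArrayList<IngestItem>();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            String id = dot > 0 ? name.substring(0, dot) : name;
            items.add(new IngestItem(id, "Unknown", Collections.singletonList(name)));
        }
        return items;
    }
}
```

Code that consumes items would call `listItems()` on whatever `ItemSource` it is given, without needing to know which kind of source it has.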

Ingest tool requirements

  • [Done] basic object creation – Add one or more objects to the repository. All needed metadata and media is supplied at the same time. In some cases, the metadata will be very minimal (ID only), but it will always be present. In some cases, media files may not be present, and only the metadata will be ingested.
  • [High want] metadata change – Also known as "cataloging". The metadata has changed in some way, and it must be updated without disturbing other types of metadata.
  • [Want] media change – A media file is removed and replaced with another file. This is rare, and may be handled by a manual process, unless it is easy to create an automatic process. We may do a "media add" (a change without deleting the previous) in cases where metadata pre-exists (usually derived from EAD), and an item has been newly digitized. This type of add would be much more common than a change.
  • [Want] media change with structural update – insert a page into an existing book
  • [Must] status tracking – so newly-created items can be made available only after they have passed a QC process.
  • [May want] locking system – records being edited are unavailable for other editors. The need for this depends on how the repository is used. If everything is "finalized" before going into the repository, we may not need a locking system, or we may be able to integrate it into the cataloging tool.
  • [May want] SIP ingest – collection configuration can specify how to pull metadata out of a METS file and store it in Fedora datastreams. It may be simpler to process the SIP as a normal set of metadata from an external source, where we have to apply custom transforms to create the datastreams we want.
  • [High want] integrated derivative creation – collection configuration should specify how to create derivative media files and metadata records. While the derivative creation process should be in a separate module (to improve understanding of the system, and allow it to be used more flexibly), the normal workflow should be triggered by a change in the master files, with changes automatically propagated to the derivatives.
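
The status tracking requirement could be sketched as below. This is a minimal illustration under stated assumptions: `ItemStatus` and `StatusTracker` are hypothetical names, the three states are guesses at what a QC workflow needs, and the in-memory map stands in for wherever the status would really be stored, which these notes leave open.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical states for a newly ingested item.
enum ItemStatus { INGESTED, QC_PASSED, QC_FAILED }

// Hypothetical tracker; a real one would persist the status somewhere.
final class StatusTracker {
    private final Map<String, ItemStatus> statuses = new HashMap<String, ItemStatus>();

    // Called when an item has been created in the repository.
    void recordIngest(String itemId) {
        statuses.put(itemId, ItemStatus.INGESTED);
    }

    // Called when QC has finished for an item that was ingested earlier.
    void recordQcResult(String itemId, boolean passed) {
        if (!statuses.containsKey(itemId)) {
            throw new IllegalArgumentException("Unknown item: " + itemId);
        }
        statuses.put(itemId, passed ? ItemStatus.QC_PASSED : ItemStatus.QC_FAILED);
    }

    // An item is made available only after it has passed QC.
    boolean isAvailable(String itemId) {
        return statuses.get(itemId) == ItemStatus.QC_PASSED;
    }
}
```

With this shape, a delivery application would only need to ask `isAvailable(itemId)` and would never see items that are still waiting for QC.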

Refactoring plan

We will slowly make the ingest tool more modular, and at the same time incorporate features that are needed by new development. Eventually, we may move some pieces (like JHOVE processing or the actual upload to Fedora) into separate tools.

Possible goal configuration

  • Run JHOVE beforehand, similar to creating derivatives
  • Provide XSL for combining JHOVE with metadata into our METS format
  • Create small tool to generate foxml (or diringest METS?) from our METS and media files
  • Use DirIngest (or the core of the current ingest tool) to perform the actual ingest
  • Use batchmodify to attach disseminators? Could our ingest tool generate the batchmodify commands and run them, or is it easier to just keep this in the ingest tool?

This works for object creation, but how would it work for the change operations?

  • metadata change – Get pre-existing METS, replace the appropriate parts, upload it. Check whether the object title needs to change. No need for DirIngest or batchmodify steps.
  • media change – Run JHOVE, combine with existing METS. Run DirIngest. May need to run batchmodify.
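
One way to picture the goal configuration is as a list of steps per operation, as in the sketch below. Everything here is hypothetical: `IngestStep`, `PlaceholderStep`, and `IngestPlans` are made-up names, and the steps only record their own names rather than calling JHOVE, DirIngest, or batchmodify. The batchmodify step is listed for object creation and media change even though the notes above say it may not always be needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unit of work in an ingest plan.
interface IngestStep {
    String name();
    void run(List<String> log);
}

// Placeholder that records its name instead of invoking a real tool.
final class PlaceholderStep implements IngestStep {
    private final String name;

    PlaceholderStep(String name) { this.name = name; }

    public String name() { return name; }

    public void run(List<String> log) { log.add(name); }
}

// Step lists for the operations described above.
final class IngestPlans {
    private IngestPlans() {}

    private static List<IngestStep> steps(String... names) {
        List<IngestStep> plan = new ArrayList<IngestStep>();
        for (String n : names) {
            plan.add(new PlaceholderStep(n));
        }
        return plan;
    }

    static List<IngestStep> create() {
        return steps("run JHOVE", "combine JHOVE output with metadata into METS",
                "generate FOXML", "DirIngest", "batchmodify");
    }

    static List<IngestStep> metadataChange() {
        return steps("fetch existing METS", "replace changed metadata",
                "upload METS", "check object title");
    }

    static List<IngestStep> mediaChange() {
        return steps("run JHOVE", "combine with existing METS",
                "DirIngest", "batchmodify");
    }

    // Runs the steps in order and returns the names of the steps that ran.
    static List<String> execute(List<IngestStep> plan) {
        List<String> log = new ArrayList<String>();
        for (IngestStep step : plan) {
            step.run(log);
        }
        return log;
    }
}
```

Seen this way, the change operations reuse steps from object creation, which supports moving pieces such as JHOVE processing into separate tools.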

Work in Progress

Currently, we are working on making the ingest tool more modular. The ingest tool was developed mainly to make bulk ingests possible (single ingests are possible by providing only one object), so updates to existing objects are not directly supported. We are reorganizing and reimplementing parts of the existing ingest tool code. Since the code has been tested and works, we will reuse its components widely in future modifications. Here is what the new architecture looks like:

           FEDORA CORE API(-A and -M) (prov. by Fedora)
                FEDORA-DLP API  (prov. by IU DL)
                 /    \      \
                /      \      \
             Ingest   Update  Catalog ...
          Batch Ingest ....

The ingest tool implements useful classes that can be part of a core FEDORA-DLP API. We're trying to separate those classes from the ingest tool and trying to provide cleaner interfaces. FEDORA-DLP API consists of objects that are specific to the IU DLP vision of the repository architecture.
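
The layering in the diagram could be sketched as follows. This is an assumption-laden illustration, not the planned API: `DlpRepository` is a hypothetical stand-in for the FEDORA-DLP layer, its three methods are invented for the example, and `InMemoryDlpRepository` replaces a real implementation that would sit on top of the Fedora core APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the FEDORA-DLP API layer.
interface DlpRepository {
    void createObject(String id, String metadata);
    void updateMetadata(String id, String metadata);
    String getMetadata(String id);
}

// In-memory implementation for illustration only.
final class InMemoryDlpRepository implements DlpRepository {
    private final Map<String, String> objects = new LinkedHashMap<String, String>();

    public void createObject(String id, String metadata) {
        if (objects.containsKey(id)) {
            throw new IllegalStateException("Object already exists: " + id);
        }
        objects.put(id, metadata);
    }

    public void updateMetadata(String id, String metadata) {
        if (!objects.containsKey(id)) {
            throw new IllegalStateException("No such object: " + id);
        }
        objects.put(id, metadata);
    }

    public String getMetadata(String id) { return objects.get(id); }
}

// "Ingest" in the diagram: depends only on the DLP layer.
final class IngestTool {
    private final DlpRepository repo;

    IngestTool(DlpRepository repo) { this.repo = repo; }

    void ingest(String id, String metadata) { repo.createObject(id, metadata); }
}

// "Batch Ingest" in the diagram: built on top of the single ingest.
final class BatchIngestTool {
    private final IngestTool ingestTool;

    BatchIngestTool(IngestTool ingestTool) { this.ingestTool = ingestTool; }

    // Ingests every entry (ID mapped to metadata) and returns the count.
    int ingestAll(Map<String, String> items) {
        int count = 0;
        for (Map.Entry<String, String> e : items.entrySet()) {
            ingestTool.ingest(e.getKey(), e.getValue());
            count++;
        }
        return count;
    }
}

// "Update" in the diagram: shares the same DLP layer as ingest.
final class UpdateTool {
    private final DlpRepository repo;

    UpdateTool(DlpRepository repo) { this.repo = repo; }

    void changeMetadata(String id, String metadata) { repo.updateMetadata(id, metadata); }
}
```

Because each tool depends only on the interface, the ingest, update, and cataloging tools can be developed and tested separately while sharing one repository layer.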

Open questions

  1. If we build a (very simple) GUI on top of the ingest tool for developer use (better GUIs would be built for catalogers), what functionality would be required?
    • I think this is a good idea. One important part of the GUI right now is the CollectionConfiguration.xml file. A GUI to create and manipulate this file would be useful. Another GUI to call the actual IngestTool servlet might be useful, too, so that a user can select the CollectionConfiguration file to send to the IngestTool using a browse interface and see the ingests that are in progress or completed. Right now, to do these things, you have to manipulate a JSP file on the server.