This page is approaching stability. We have agreed on the basic structure of textual documents, but the disseminators are not complete.
Textual documents will store all text in a single Fedora object. When the textual document content model is combined with the paged document content model, there may be sub-objects representing individual pages, but these sub-objects will not contain textual content.
In some cases (like the Newton project), the rendering of textual information will include images. These images will be Fedora objects that are children of the main "textual document" object. In the case of a paged document with separate "detail" images, the images may be two or more levels below the object containing the text.
Disseminator for text (bdef:iudlText)
getSummary – text (UTF-8) suitable for rendering in a result list. Does not include the title (which should be retrieved through the default disseminator). It is possible that this will be very collection-dependent, and may require its own disseminator.
getRawText – returns the text of the document, marked up in XML (often TEI, but could be something else)
getFriendlyText – returns the text of the document, in an XML form with no external dependencies (often TEI, but could be something else)
getChunkList – returns a simple XML list of all chunks in the document along with their labels
getChunk(label) – returns a portion of the source XML document, possibly the entire document if the "root" label is used
getTextPage(num) – returns a single page of marked-up text from the XML document
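As a rough illustration of how a client might address these methods, the sketch below builds dissemination URLs. It assumes the Fedora API-A-Lite layout `.../get/PID/bDefPID/methodName?param=value`. The base URL and the object PID `iudl:1234` are hypothetical, and the parameter names `label` and `num` are taken from the method list above.

```python
from urllib.parse import quote, urlencode

BDEF_PID = "bdef:iudlText"  # behavior definition named on this page


def dissemination_url(base_url, pid, method, **params):
    """Build a dissemination URL (assumed API-A-Lite style layout)."""
    url = "{}/get/{}/{}/{}".format(
        base_url.rstrip("/"),
        quote(pid, safe=":"),       # keep the PID's namespace colon readable
        quote(BDEF_PID, safe=":"),
        quote(method),
    )
    if params:
        # Method parameters travel as an ordinary query string.
        url += "?" + urlencode(params)
    return url


# Hypothetical repository location and object PID, for illustration only.
BASE = "http://localhost:8080/fedora"
chunk_list_url = dissemination_url(BASE, "iudl:1234", "getChunkList")
root_chunk_url = dissemination_url(BASE, "iudl:1234", "getChunk", label="root")
page_two_url = dissemination_url(BASE, "iudl:1234", "getTextPage", num=2)
```

A caller would typically fetch `chunk_list_url` first, then request individual chunks by the labels it returns.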
When a textual document is ingested (including EAD and TEI), any structure present will be translated into a METS structmap. This structmap will serve as the authoritative version of the document's structure. It will be used to drive MetsNav, as well as to guide any communication between MetsNav and the text rendering system.
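The translation at ingest could look something like the sketch below, which mirrors nested TEI divisions as nested METS `div` elements. It is only an outline under several assumptions: the TEI is un-namespaced, structure is expressed with `div`, `div1`, `div2`, and so on, labels come from each division's `head`, and the function name `tei_to_structmap` is hypothetical. A real ingest would also need file pointers and EAD handling.

```python
import re
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def tei_to_structmap(tei_root):
    """Mirror the nested <div> structure of a TEI <body> as a METS structMap."""
    body = tei_root.find(".//body")
    if body is None:
        raise ValueError("no <body> element found")

    struct_map = ET.Element("{%s}structMap" % METS_NS, {"TYPE": "logical"})
    # A structMap holds a single root <div>; the divisions hang below it.
    root_div = ET.SubElement(struct_map, "{%s}div" % METS_NS, {"TYPE": "text"})

    def walk(tei_parent, mets_parent):
        for child in tei_parent:
            tag = child.tag
            # Assumption: structural units are <div>, <div1>, <div2>, ...
            if not (isinstance(tag, str) and re.fullmatch(r"div\d?", tag)):
                continue
            attrs = {"TYPE": child.get("type", "div")}
            head = child.find("head")
            if head is not None and head.text and head.text.strip():
                attrs["LABEL"] = head.text.strip()
            mets_div = ET.SubElement(mets_parent, "{%s}div" % METS_NS, attrs)
            walk(child, mets_div)

    walk(body, root_div)
    return struct_map
```

Calling `ET.tostring(tei_to_structmap(root), encoding="unicode")` on a parsed TEI document would give a structmap fragment that could be embedded in the object's METS.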
Selecting a single page out of TEI is difficult because no element encloses each page. Page boundaries are marked only by empty <pb/> milestone elements, which can fall anywhere in the element hierarchy.
The encyclopedia for the Indiana authors project may need different handling than a typical TEI document.
Implementations of similar models
- Virginia's TEI model
- Slides 34-44 of the Virginia Content Models presentation
- At Tufts – Note that for behaviors that return XML, you may have to "View Source" in your browser. They're not returning the correct MIME type.
  - A TEI encyclopedia in Fedora and in user view
  - A TEI score in Fedora and in user view
  - A TEI essay in Fedora and in user view
  - An EAD file that is mapped to a paged view in Fedora and in user view
  - In general, Tufts stores paged documents as TEI, and has the TEI point to "figure" elements, which are indicated by a unique ID. When rendering, this ID is sent through their name resolver to produce a Fedora URL.
- The MANU project for TEI manuscript storage in Fedora. Warning: MANU is GPLed. MANU is an end-to-end solution. It is tied to MySQL. For these reasons, we are unlikely to use their code, but it is still a worthwhile reference.
Possible implementation of getTextPage(num)
Here is how MANU retrieves single pages out of TEI (according to Eric Jansson):
The goal of the getPage dissemination is to return well-formed (and hopefully valid) TEI, and the way that is done is to execute this algorithm:
1. remove all text node content not between the <pb/> tags in the TEI file
2. remove all matching tags with empty content (no text nodes, for example '<p></p>') before or after the <pb/> tags
3. repeat step 2 until no more matching tags are removed
4. remove all TEI outside the <body> element
So a TEI file like this:
A request for page #2 would return this:
Is this really a "TEI page"? I'm not sure, but doing this considerably simplified our problems and seems to have worked well in practice. I hope that explanation helps.
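The steps quoted above can be sketched roughly as follows. This is not MANU's code, only one reading of the description. It assumes un-namespaced TEI, counts pages by the position of each <pb/> (text before the first <pb/> is page 0), and keeps the <pb/> that opens the requested page, a detail the description leaves open. The function name `get_text_page` is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def get_text_page(tei_root, num):
    """Return a pruned copy of <body> holding only the text of page `num`."""
    body = tei_root.find(".//body")
    if body is None:
        raise ValueError("no <body> element found")
    body = copy.deepcopy(body)  # last step: nothing outside <body> is returned
    page = 0
    opening_pb = None

    def blank_off_page(elem):
        # First step: walk in document order and drop text on other pages.
        nonlocal page, opening_pb
        if elem.tag == "pb":
            page += 1
            if page == num:
                opening_pb = elem
        if page != num:
            elem.text = None
        for child in elem:
            blank_off_page(child)
            if page != num:
                child.tail = None

    def remove_keeping_tail(parent, child):
        # ElementTree stores trailing text on the child, so move it first.
        if child.tail:
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + child.tail
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + child.tail
        parent.remove(child)

    def prune(elem):
        # Remaining steps: pruning bottom-up reaches the same fixed point
        # as repeatedly removing empty elements.
        for child in list(elem):
            prune(child)
            is_empty = len(child) == 0 and not (child.text or "").strip()
            if is_empty and child is not opening_pb:
                remove_keeping_tail(elem, child)

    blank_off_page(body)
    prune(body)
    return body
```

Taken literally, the empty-element rule also removes content-free elements on the requested page, such as line breaks or figure references, so a production version would probably need an exception list. The input below is made up for illustration and is not the example from the original explanation: given `<body><div><p>zero</p><pb n='1'/><p>one <pb n='2'/>two a</p><p>two b</p></div></body>`, requesting page 2 keeps the `div`, the first `p` containing only the second `pb` followed by "two a", and the second `p`.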
- Should we include something like getXMLBundle that exports a source document with all of the necessary supporting files? If so, it should probably be part of a separate disseminator, or perhaps a separate tool that sits on top of Fedora. It may in fact be the same thing as the DIP generator.
- Validating complex XML documents is easier when the supporting DTDs or schemas are abstracted out. Should we use XML catalogs? If so, can we find a catalog-aware tool that will integrate easily into Fedora?
- Can this same set of disseminators be used for EAD documents? We think so, but they may need more functionality.
- How do we handle a document that has more than one TEI file (for example, multiple translations)?