Data Processing
First, remove out-of-scope records:
- Records with no location/url or identifier/uri elements. UM processing: taken care of through the re-exposed records.
- Records with a subject/hierarchicalGeographic element that:
  - Has a country subelement but none with the value United States (or reasonable other values, such as U.S., America, etc.), or;
  - Has a state subelement but none with a value that matches any of the 50 states, written out or abbreviated.
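The three checks above can be sketched as a single filter. This is a minimal illustration, not the UM implementation: the function name `in_scope`, the accepted spellings in `US_NAMES`, and the decision to test country and state values across the whole record (rather than per hierarchicalGeographic element) are all assumptions, and `US_STATES` lists only three states to keep the sketch short.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}

# Assumption: which spellings count as "reasonable other values".
US_NAMES = {"united states", "united states of america",
            "u.s.", "u.s.a.", "us", "usa", "america"}

# Assumption: truncated for illustration; a real table needs all 50 states.
US_STATES = {"michigan": "MI", "illinois": "IL", "indiana": "IN"}
STATE_VALUES = set(US_STATES) | {abbr.lower() for abbr in US_STATES.values()}


def _values(record, path):
    """Lower-cased, stripped text of every element matching path."""
    return [(el.text or "").strip().lower() for el in record.findall(path, NS)]


def in_scope(record):
    """Return True if a mods element survives the out-of-scope checks."""
    has_url = (record.findall("m:location/m:url", NS)
               or record.findall("m:identifier[@type='uri']", NS))
    if not has_url:
        return False
    countries = _values(record, "m:subject/m:hierarchicalGeographic/m:country")
    if countries and not any(c in US_NAMES for c in countries):
        return False
    states = _values(record, "m:subject/m:hierarchicalGeographic/m:state")
    if states and not any(s in STATE_VALUES for s in states):
        return False
    return True
```

A record with no hierarchicalGeographic element at all passes the geographic checks, since the rules only exclude records that name a country or state outside the United States.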
Basic search implies Google-like functionality, so when "basic" is noted in the following table, the element(s) are part of the basic search index. This index should also be available on the advanced search page, as "keyword".
Levels of adoption, according to SWG's page:
- Level A. A user is able to identify a resource (to reference, for future re-discovery).
- Level B. A user is able to find resources through a process (search and/or browse) that offers a modest amount of precision.
- Level C. Everything else. These fields allow users to perform searches with a high degree of precision, browse winnowing, and disambiguation between related resources.
Item being Processed | Processing Notes | XPath Query | Level of Adoption | Brief/Full Display | Record Display Label | Basic/Advanced Search Index | In Advanced Search? | Browse Facet |
---|---|---|---|---|---|---|---|---|
title | For brief display, clicking on the title hotlinks to the URL. | indexing: | level A | brief, full | Title | basic, advanced | yes, selectable from drop-down | no |
date¹ | First, use keyDate, if it exists. Should be one and only one keyDate. @w3cdtf strongly preferred. | (in order) | level A | brief, full | Date | basic, advanced | yes, choice of single date entry, range, and era/decade | yes |
language | UM processing: re-exposed records contain exploded language codes. If there is a @type="code", another sub-element is added with @type="text" that includes the exploded code. If @type="text" already exists, it is left alone. | mods/language/languageTerm[@type='text'] | level C | full | Language | advanced | yes, selectable from drop-down | yes |
URL | Two fields can contain clickable URLs: location/url and identifier@uri. For display, only the primary URL in location/url should be used, if available. For brief display, clicking on the title hotlinks to the primary URL. For full display, the URL displays as-is. | mods/location/url[@usage='primary display'] or mods/identifier[@type='uri'] | level B | [brief], full | URL | neither | no | no |
creator | Separate name and namePart, affiliation, role or description with a space comma. | mods/name/namePart\|affiliation\|description and role/roleTerm[@type='text'] | level B | full | Related Names | basic, advanced | yes, selectable from drop-down | no |
subject² | Record display and browse facets are driven by the subject indexes. They should be generated from all subelements of subject, regardless of whether they appeared within a single subject container. Therefore, split pre-coordinated headings (e.g., United States - Social conditions - 1980- - Juvenile literature - Bibliography) into their component parts for indexing and browse display, but not for record display. | mods/subject/geographic\|hierarchicalGeographic\|geographicCode | level B | full | Subject | basic, advanced | yes, limiter by subject index type | yes |
physical description | Sub-elements should be separated by a space semicolon space. | mods/physicalDescription/* | level C | full | Physical Description | basic | no | no |
publisher and place | placeTerm@code should be exploded as described above for subject/geographicCode, roleTerm and language. | mods/originInfo/place/placeTerm[@type='text'] and publisher | level B | full | Publisher | basic (publisher), advanced (publisher and place) | yes, selectable from drop-down | no |
origin aspects | Sub-elements should be separated by a space semicolon space. | mods/originInfo/edition\|issuance\|frequency and mods/part/* | level C | full | Publication Specifics | basic, advanced | no | no |
resource type | Ignore attributes. | mods/typeOfResource | level B | full | Resource Type | basic, advanced | yes, limiter by value | no |
genre | Ignore attributes. | mods/genre | level B | full | Genre | basic, advanced | yes, selectable from drop-down | yes |
location | Separate multiple instances of physicalLocation by a comma space. | mods/location/physicalLocation | level C | full | Physical Location | advanced | no | no |
identifiers | If @type="uri" is used for URL, exclude it here. | mods/identifier | level C | full | Identifier | neither | no | no |
classification³ | Ignore attributes, for now. | mods/classification | level C | full | Classification | neither | no | no |
table of contents | Ignore attributes. | mods/tableOfContents | level C | full | Table of Contents | basic | no | no |
abstract | Ignore attributes. | mods/abstract | level C | full | Abstract | basic, advanced | yes, selectable from drop-down | no |
note | Ignore attributes. | mods/note | level C | full | Note | basic | no | no |
audience | Ignore attributes. | mods/targetAudience | level B | full | Audience | basic, advanced | no | yes |
rights | Ignore attributes. | mods/accessCondition | level B | full | Terms and Conditions of Use | neither | yes, limiter by value | no |
related item | Exclude the "dlfaqcoll" attribute here, because used for collection. | mods/relatedItem/* | level C | full | Related Item | basic | no | no |
preview | UM processing: re-exposed records contain a thumbnail image in the @access="preview" attribute. If the re-exposed records do not contain a preview image, the Thumbgrabber can be used to gather them. | mods/location/url[@access='preview'] | level A | brief, full | n/a | n/a | n/a | n/a |
collection | UM processing: re-exposed records contain the "dlfaqcoll" attribute that concatenates repository name and OAI setName into a readable collection phrase. | mods/relatedItem/titleInfo[@authority='dlfaqcoll']/title | level A | brief, full | Collection | basic, advanced | yes, limiter by collection | yes |
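A few of the table's processing notes can be illustrated with a short sketch: choosing the display URL, joining sub-elements with a separator, and building subject index terms separately from the subject display string. The function names (`display_url`, `joined`, `subject_index_terms`, `subject_display`) are hypothetical, and the " - " display separator is an assumption taken from the example heading in the subject row.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}


def display_url(record):
    """Prefer the primary-display location/url, else fall back to identifier[@type='uri']."""
    primary = record.find("m:location/m:url[@usage='primary display']", NS)
    if primary is not None and primary.text:
        return primary.text.strip()
    fallback = record.find("m:identifier[@type='uri']", NS)
    if fallback is not None and fallback.text:
        return fallback.text.strip()
    return None


def joined(record, path, sep):
    """Join the non-empty text of all elements matching path with sep."""
    texts = ((el.text or "").strip() for el in record.findall(path, NS))
    return sep.join(t for t in texts if t)


def _leaf_texts(element):
    """Text of every leaf descendant of element, in document order."""
    return [el.text.strip() for el in element.iter()
            if el is not element and len(el) == 0 and el.text and el.text.strip()]


def subject_index_terms(record):
    """One index term per subject subelement, across all subject containers."""
    terms = []
    for subject in record.findall("m:subject", NS):
        for text in _leaf_texts(subject):
            if text not in terms:
                terms.append(text)
    return terms


def subject_display(record):
    """One display string per subject container, components kept together."""
    return [" - ".join(_leaf_texts(s)) for s in record.findall("m:subject", NS)]
```

For example, `joined(record, "m:physicalDescription/*", " ; ")` would produce the "space semicolon space" form asked for in the physical description row, and `joined(record, "m:location/m:physicalLocation", ", ")` the "comma space" form for location.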
1 We would recommend including the following in this methodology:
- Some set-level analysis to determine which date to use (only feasible for relatively small harvesters)
- If more than one date appears, throw out any dates after about 1996 or so as they're likely digitization dates
- Use the one that's machine readable if some are not
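The last two recommendations might look like the following sketch. The cutoff constant, the definition of "machine readable" as a W3CDTF-style `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` string, and the fallback to the original list when every date is past the cutoff are all assumptions made for illustration.

```python
import re

# Assumption: "about 1996 or so" expressed as a tunable constant.
DIGITIZATION_CUTOFF = 1996

# Assumption: "machine readable" means YYYY, YYYY-MM, or YYYY-MM-DD.
YEAR = re.compile(r"^(\d{4})(?:-\d{2}(?:-\d{2})?)?$")


def machine_year(value):
    """Return the year of a machine-readable date string, else None."""
    match = YEAR.match(value.strip())
    return int(match.group(1)) if match else None


def filter_candidate_dates(values):
    """Drop likely digitization dates, then prefer machine-readable ones."""
    if len(values) <= 1:
        return list(values)
    kept = [v for v in values
            if machine_year(v) is None or machine_year(v) <= DIGITIZATION_CUTOFF]
    kept = kept or list(values)  # assumption: never discard every date
    readable = [v for v in kept if machine_year(v) is not None]
    return readable or kept
```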
2 Investigate supplementing the time browse facet that contains mods/subject/temporal with data from date elements. Also, investigate using @authority to determine if certain controlled vocabularies (e.g., LCSH) can help us create more consistent subject indexes. If clustering is a possibility, this will also aid this effort.
3 Look into whether classification can supplement genre or subject. For instance, High Level Browse at UM can be used to map classification numbers to a set of topics.
4 Date processing rules per the MWG and the SWG:
for both indexing and sorting:
- choose keyDate="yes" and w3cdtf="yes", if exists
- if those two attributes don't exist, choose keyDate="yes"
- if no keyDate, choose one of these (in order): dateCreated, dateIssued, copyrightDate, dateOther
- if none of those dates exist, choose one of these (in order): dateCaptured, dateValid, dateModified
- assumption is there is only one sort date and only one indexing field (which may have multiple values)
other indexing rules:
- all chosen dates are normalized to a year value
- for a single date, e.g., 1986, index only that date
- for a range of known dates, e.g., 1944-1950, index each of those dates inclusive
- for uncertain dates, e.g., 198?, 1980s, 198-, each date is indexed inclusive, e.g., 1980-1989
- for circa dates, e.g., ca. 1945, each date is expanded for indexing +/- 5 years, e.g., 1940-1950
- for expanded indexing, these dates will be searchable across decades, e.g., ca. 1945 will be searchable in the 1940s and the 1950s
- non-dates, e.g., n.d., don't get indexed or normalized
- centuries are indexed as such, e.g., 17th century/cent. as 1601-1700
- for date elements with start and end attributes, use first start/end pair and treat these as a known date, e.g., start=1900, end=1920, index 1900-1920
sorting:
- for a range of known or circa dates, choose the mid-point date, e.g., for 1944-1950, choose 1947; for ca. 1945, choose 1945 (because indexing expanded to 1940-1950)
- for a range of uncertain dates, choose the beginning date, e.g., for 1940?, choose 1940
- for a single date, choose that date
- non-dates, e.g., n.d., [no date], should sort at the end, no matter whether sort is chronological or reverse chronological
display:
- display all dates in the original encoding
- do not display the normalized value for the indexed date field
- records with no date should not appear if a date or date range is searched
- display copyright, circa and uncertain dates as is, e.g., c1945, 1845?, ca. 1944
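The selection, indexing, and sorting rules above can be sketched as follows. This is an illustration under stated assumptions, not the MWG/SWG implementation: all function names are hypothetical; dates are passed in as plain dicts rather than parsed from XML; the shorthand w3cdtf="yes" is read as the MODS attribute encoding="w3cdtf"; a four-digit year with a trailing "?" is treated as a single year; centuries sort at their mid-point like other known ranges; and start/end attribute pairs are left out.

```python
import re

CIRCA = re.compile(r"^(?:ca\.?|circa)\s*(\d{4})$")
RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})$")
DECADE = re.compile(r"^(\d{3})(?:\?|-|0s)$")          # 198?, 198-, 1980s
CENTURY = re.compile(r"^(\d{1,2})(?:st|nd|rd|th)\s+cent(?:ury|\.)?$")
SINGLE = re.compile(r"^c?(\d{4})\??$")                # 1986, c1945, 1845?

ELEMENT_ORDER = ["dateCreated", "dateIssued", "copyrightDate", "dateOther",
                 "dateCaptured", "dateValid", "dateModified"]


def choose_date(dates):
    """dates: dicts with 'element', 'value', optional 'keyDate' and 'encoding'."""
    key_dates = [d for d in dates if d.get("keyDate") == "yes"]
    for d in key_dates:
        if d.get("encoding") == "w3cdtf":
            return d
    if key_dates:
        return key_dates[0]
    for name in ELEMENT_ORDER:
        for d in dates:
            if d["element"] == name:
                return d
    return None


def parse_date(text):
    """Return (kind, start_year, end_year), or None for non-dates like 'n.d.'."""
    s = text.strip().lower()
    if m := CIRCA.match(s):
        year = int(m.group(1))
        return ("circa", year - 5, year + 5)
    if m := RANGE.match(s):
        return ("range", int(m.group(1)), int(m.group(2)))
    if m := DECADE.match(s):
        start = int(m.group(1)) * 10
        return ("uncertain", start, start + 9)
    if m := CENTURY.match(s):
        end = int(m.group(1)) * 100
        return ("range", end - 99, end)
    if m := SINGLE.match(s):
        year = int(m.group(1))
        return ("single", year, year)
    return None


def index_years(text):
    """Every year to index for a date string; empty for non-dates."""
    parsed = parse_date(text)
    if parsed is None:
        return []
    _, start, end = parsed
    return list(range(start, end + 1))


def sort_year(text):
    """Single sort year: start for uncertain dates, mid-point otherwise."""
    parsed = parse_date(text)
    if parsed is None:
        return None
    kind, start, end = parsed
    return start if kind == "uncertain" else (start + end) // 2


def sort_dates(values, reverse=False):
    """Sort date strings; non-dates go last in either direction."""
    dated = [v for v in values if sort_year(v) is not None]
    undated = [v for v in values if sort_year(v) is None]
    return sorted(dated, key=sort_year, reverse=reverse) + undated
```

Because `index_years("ca. 1945")` covers 1940 through 1950, a decade facet built from the indexed years would place the record in both the 1940s and the 1950s, as the expanded-indexing rule describes. The original string is kept separately for display, so the normalized years never need to be shown.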
MODS fields not used for data processing, although they may be used for other things, are:
- mods
- modsCollection
- recordInfo
- dateCaptured
- dateModified
- dateValid
- extension -- being used to contain asset action information, but not correctly; on hold for now
Remaining questions:
- Should all elements be displayed in full display?
- How does one choose the best URL to use for display? Is /mods/location/url[@usage='primary display'] sufficient?
- Should we ignore location/url@displayLabel?
- Is typeOfResource beneficial as a browse facet?