Data Processing

...

Initially, remove out-of-scope records:

  • Records with zero location/url or identifier/uri elements. UM processing: taken care of through the re-exposed records.

...

  • Records with a subject/hierarchicalGeographic element, that:

      ...

        • Has a country subelement but none with the value United States (or reasonable other values, such as U.S., America, etc.), or;

      ...

        • Has a state subelement but none with a value that matches any of the 50 states, written out or in abbreviation
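As a rough illustration, the out-of-scope rules visible above could be sketched in Python as follows. This is only a sketch under several assumptions: the records use the MODS v3 namespace, "zero location/url or identifier/uri elements" is read as "has neither", the accepted country spellings are guesses at the "reasonable other values", and the state lookup is deliberately truncated. The function names are hypothetical, and the conditions elided from this page are not covered.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Assumed spellings; the page only says "reasonable other values".
US_COUNTRY_VALUES = {"united states", "united states of america",
                     "u.s.", "u.s.a.", "us", "usa", "america"}

# Truncated sample for illustration; a real table would list all 50 states,
# written out and abbreviated.
US_STATES = {"michigan", "mi", "indiana", "in", "illinois", "il"}


def _values(parent, tag):
    """Lower-cased text of every mods:<tag> child of parent."""
    return [(el.text or "").strip().lower()
            for el in parent.findall(f"mods:{tag}", MODS_NS)]


def has_url_or_uri(record):
    """True if the record has a location/url or an identifier of type uri."""
    urls = record.findall("mods:location/mods:url", MODS_NS)
    uris = record.findall("mods:identifier[@type='uri']", MODS_NS)
    return bool(urls or uris)


def geography_in_scope(record):
    """False if any hierarchicalGeographic names only non-US countries or states."""
    for hg in record.findall("mods:subject/mods:hierarchicalGeographic", MODS_NS):
        countries = _values(hg, "country")
        if countries and not any(c in US_COUNTRY_VALUES for c in countries):
            return False
        states = _values(hg, "state")
        if states and not any(s in US_STATES for s in states):
            return False
    return True


def in_scope(record):
    """Combine the two visible filters."""
    return has_url_or_uri(record) and geography_in_scope(record)
```

A harvester would call `in_scope` on each parsed `mods` element and drop the records for which it returns `False`.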

      Basic search implies Google-like functionality, so when basic is noted in the following table, it means the element[s] are part of the basic search index. This index should also be an option in the advanced search page, as "keyword".

      Levels of adoption, according to SWG's page (http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferServ/Service-Oriented+Levels+of+Metadata+Adoption):

      • Level A. A user is able to identify a resource (to reference, for future re-discovery).

      • Level B. A user is able to find resources through a process (search and/or browse) that offers a modest amount of precision.

      • Level C. Everything else. These fields allow users to perform searches with a high degree of precision, browse winnowing, and disambiguation between related resources.

      Item being Processed

      Processing Notes

      XPath Query

      Level of Adoption

      Brief/Full Display

      Record Display Label

      Basic/Advanced Search Index

      In Advanced Search?

      Browse Facet

      title

      For brief display, clicking on the title hotlinks to the URL.

      In the brief display, only show one title, and prefer one with no type attribute. If more than one title lacks a type attribute, display the first title without a type. If all titles include a type attribute, display the first.

      In the full display, show all titles with the label as described elsewhere on this page, none hyperlinked.

      nonSort and the title should be separated by a space. The title and the subTitle, partName or partNumber should be separated by a space colon space.

      The non-sort attribute should not be in the title for sorting purposes.

      indexing:
      mods/titleInfo/[title and subTitle|partName|partNumber]
      brief display:
      mods/titleInfo/[nonSort|title[0] and subTitle|partName|partNumber]
      full display:
      • mods/titleInfo/[nonSort|title and subTitle|partName|partNumber]
      • mods/titleInfo/[nonSort|title[@abbreviated|alternative|translated|uniform]]

      level A

      brief, full

      Title
      For type attributes:
      • Abbreviated Title
      • Alternative Title
      • Translated Title
      • Uniform Title

      basic, advanced

      yes, selectable from drop-down

      no

      date¹

      First, use keyDate, if it exists. There should be one and only one keyDate; w3cdtf encoding is strongly preferred.

      Second, dateIssued and dateCreated are the priority dates for indexing and display. One or the other of these sub-elements should be available in the record. If neither is, copyrightDate or dateOther should be used.

      Exclude dateCaptured, dateModified, dateValid.

      When normalized dates are available, these should be used for sorting and searching purposes only, not for display.

      (in order)
      1. mods/originInfo/date*[@keyDate][@*='w3cdtf']
      2. mods/originInfo/[dateIssued or dateCreated]
      3. mods/originInfo/copyrightDate|dateOther

      Further processing rules below.⁴

      level A

      brief, full

      Date

      Specifically,

      • Issue Date
      • Creation Date
      • Copyright Date
      • Date (for dateOther)

      basic, advanced

      yes, choice of single date entry, range, and era/decade

      yes

      language

      UM processing: re-exposed records contain exploded language codes. If there is a @type="code", another sub-element is added with @type="text" that includes the exploded code. If @type="text" already exists, it is left alone.

      mods/language/languageTerm[@type='text']

      level C

      full

      Language

      advanced

      yes, selectable from drop-down

      yes

      URL

      Two fields can contain clickable URLs: location/url and identifier@uri. For display, only the primary URL in location/url should be used, if available. For brief display, clicking on the title hotlinks to the primary URL. For full display, the URL displays as-is.

      In the event there is no location/url, identifier@uri may be used. UM processing: both location/url and identifier@uri are used to filter digital object records, in the event the latter may be useful.

      Exclude any that lead to a 404.

      mods/location/url[@usage='primary display'] or mods/identifier[@type='uri']

      level B

      [brief], full

      URL

      neither

      no

      no

      creator

      Separate name and namePart, affiliation, role or description with a space comma.

      Explode (http://www.loc.gov/marc/sourcecode/relator/relatorlist.html) the role/roleTerm@type="code" attributes. Add a new sub-element that contains the exploded code in a @type="text" attribute.

      mods/name/namePart|affiliation|description and role/roleTerm[@type='text']

      level B

      full

      Related Names

      basic, advanced

      yes, selectable from drop-down

      no

      subject²

      Record display and browse facets are driven by the subject indexes. They should be generated from all subelements of subject, regardless of whether they appeared within a single subject container. Therefore, split pre-coordinated headings (e.g., United States - Social conditions - 1980- - Juvenile literature - Bibliography) into their component parts for indexing and browse display, but not for record display.

      As noted, geographicCode should be exploded, as is already done for language codes at UM and as is recommended for roleTerm.

      Indexes should:

      • Combine geographic, hierarchicalGeographic, geographicCode (exploded) into one "geographic" index
      • Combine topic, occupation, titleInfo into one "topic" index
      • All other subject subelements (cartographic, temporal, name, genre) should be their own indexes
        • Genre facet should include data from both mods/genre and mods/subject/genre

      mods/subject/geographic|hierarchicalGeographic|geographicCode
      mods/subject/topic|occupation|titleInfo
      mods/subject/[cartographic or temporal or name or //genre]

      level B

      full

      Subject

      basic, advanced

      yes, limiter by subject index type

      yes

      Specifically, 

      • Subject: Geographic
      • Subject: Topical
      • Subject: Cartographic
      • Subject: Temporal
      • Subject: Genre
      • Subject: Related Names

      physical description

      Sub-elements should be separated by a space semicolon space.

      Ignore element and note sub-element attributes.

      mods/physicalDescription/*

      level C

      full

      Physical Description

      basic

      no

      no

      publisher and place

      placeTerm@code should be exploded as described above for subject/geographicCode, roleTerm and language.

      placeTerm and publisher should be separated by a space semicolon space.

      mods/originInfo/place/placeTerm[@type='text'] and publisher

      level B

      full

      Publisher

      basic (publisher), advanced (publisher and place)

      yes, selectable from drop-down

      no

      origin aspects

      Sub-elements should be separated by a space semicolon space.

      mods/originInfo/edition|issuance|frequency and mods/part/*

      level C

      full

      Publication Specifics

      basic, advanced

      no

      no

      resource type

      Ignore attributes.

      mods/typeOfResource

      level B

      full

      Resource Type

      basic, advanced

      yes, limiter by value

      no

      genre

      Ignore attributes.

      mods/genre

      level B

      full

      Genre

      basic, advanced

      yes, selectable from drop-down

      yes

      location

      Separate multiple instances of physicalLocation by a comma space.

      mods/location/physicalLocation

      level C

      full

      Physical Location

      advanced

      no

      no

      identifiers

      If @type="uri" is used for URL, exclude it here.

      Separate multiple instances of identifier by a comma space. 

      mods/identifier

      level C

      full

      Identifier

      neither

      no

      no

      classification³

      Ignore attributes, for now.

      Separate multiple instances of classification by a comma space. 

      mods/classification

      level C

      full

      Classification

      neither

      no

      no

      table of contents

      Ignore attributes.

      mods/tableOfContents

      level C

      full

      Table of Contents

      basic

      no

      no

      abstract

      Ignore attributes.

      mods/abstract

      level C

      full

      Abstract

      basic, advanced

      yes, selectable from drop-down

      no

      note

      Ignore attributes.

      Separate multiple instances of note by a comma space. 

      mods/note

      level C

      full

      Note

      basic

      no

      no

      audience

      Ignore attributes.

      mods/targetAudience

      level B

      full

      Audience

      basic, advanced

      no

      yes

      rights

      Ignore attributes.

      mods/accessCondition

      level B

      full

      Terms and Conditions of Use

      neither

      yes, limiter by value

      no

      related item

      Exclude the "dlfaqcoll" attribute here, because it is used for collection.

      Separate multiple instances of relatedItem by a semicolon space. If possible, use the processing logic enumerated above to handle subelements of relatedItem.

      mods/relatedItem/*

      level C

      full

      Related Item

      basic

      no

      no

      preview

      UM processing: re-exposed records contain a thumbnail image in the @access="preview" attribute. If the re-exposed records do not contain a preview image, the Thumbgrabber (http://sourceforge.net/project/showfiles.php?group_id=47963&package_id=159364) can be used to gather them.

      mods/location/url[@access='preview']

      level A

      brief, full

      n/a

      n/a

      n/a

      n/a

      collection

      UM processing: re-exposed records contain the "dlfaqcoll" attribute that concatenates repository name and OAI setName into a readable collection phrase.

      mods/relatedItem/titleInfo[@authority='dlfaqcoll']/title

      level A

      brief, full

      Collection

      basic, advanced

      yes, limiter by collection

      yes
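The title rules in the table above (brief display shows one title, preferring the first titleInfo without a type attribute; nonSort and title joined by a space; subTitle, partName and partNumber appended with a space colon space) could be sketched roughly as below. The sketch assumes records in the MODS v3 namespace parsed with Python's `xml.etree.ElementTree`; the function names are hypothetical and not part of any existing Aquifer code.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def _text(parent, tag):
    """Stripped text of the first mods:<tag> child, or '' if absent."""
    el = parent.find(f"mods:{tag}", MODS_NS)
    return (el.text or "").strip() if el is not None else ""


def format_title(title_info):
    """Join nonSort and title with a space, then the rest with ' : '."""
    main = " ".join(p for p in (_text(title_info, "nonSort"),
                                _text(title_info, "title")) if p)
    rest = [_text(title_info, t) for t in ("subTitle", "partName", "partNumber")]
    return " : ".join(p for p in [main] + rest if p)


def brief_title(record):
    """Pick the one title shown in the brief display."""
    infos = record.findall("mods:titleInfo", MODS_NS)
    if not infos:
        return None
    untyped = [ti for ti in infos if ti.get("type") is None]
    # First untyped title wins; if every title is typed, fall back to the first.
    chosen = untyped[0] if untyped else infos[0]
    return format_title(chosen)
```

For the full display, the same `format_title` helper could be applied to every titleInfo, with the label chosen from the type attribute.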

      1 We would recommend including the following in this methodology:

      • Some set-level analysis to determine which date to use (only feasible for relatively small harvesters)
      • If more than one date appears, throw out any dates after about 1996 or so as they're likely digitization dates
      • Use the one that's machine readable if some are not

      2 Investigate supplementing the time browse facet that contains mods/subject/temporal with data from date elements. Also, investigate using @authority to determine if certain controlled vocabularies (e.g., LCSH) can help us create more consistent subject indexes. If clustering is a possibility, this will also aid this effort.

      3 Look into whether classification can supplement genre or subject. For instance, High Level Browse (http://www.lib.umich.edu/browse/categories/) at UM can be used to map classification numbers to a set of topics.

      4 Date processing rules per the MWG and the SWG:

      for both indexing and sorting:

      • choose the date with keyDate="yes" and encoding="w3cdtf", if one exists
      • if no date has both of those attributes, choose keyDate="yes"
      • if no keyDate, choose one of these (in order): dateCreated, dateIssued, copyrightDate, dateOther
      • if none of those dates exist, choose one of these (in order): dateCaptured, dateValid, dateModified
      • assumption is there is only one sort date and only one indexing field (which may have multiple values)
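The selection order above could be sketched as a simple fallback chain. This is a minimal sketch, assuming MODS v3 records parsed with `xml.etree.ElementTree` and assuming that w3cdtf is signalled by the `encoding` attribute on the date element; `choose_date` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"
PRIMARY = ("dateCreated", "dateIssued", "copyrightDate", "dateOther")
FALLBACK = ("dateCaptured", "dateValid", "dateModified")


def local_name(el):
    """Element name without its namespace."""
    return el.tag.rsplit("}", 1)[-1]


def choose_date(record):
    """Return the one date element used for sorting and indexing, or None."""
    dates = [el
             for origin in record.findall(f"{{{MODS}}}originInfo")
             for el in origin
             if local_name(el) in PRIMARY + FALLBACK]

    key_dates = [d for d in dates if d.get("keyDate") == "yes"]
    # 1. keyDate that is also w3cdtf-encoded
    for d in key_dates:
        if d.get("encoding") == "w3cdtf":
            return d
    # 2. any keyDate
    if key_dates:
        return key_dates[0]
    # 3. and 4. first match walking the preferred order, then the fallback order
    for name in PRIMARY + FALLBACK:
        for d in dates:
            if local_name(d) == name:
                return d
    return None
```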

      other indexing rules:

      • all chosen dates are normalized to a year value
      • for a single date, e.g., 1986, index only that date
      • for a range of known dates, e.g., 1944-1950, index each of those dates inclusive
      • for uncertain dates, e.g., 198?, 1980s, 198-, each date is indexed inclusive, e.g., 1980-1989
      • for circa dates, e.g., ca. 1945, each date is expanded for indexing +/- 5 years, e.g., 1940-1950
      • for expanded indexing, these dates will be searchable across decades, e.g., ca. 1945 will be searchable in the 1940s and the 1950s
      • non-dates, e.g., n.d., don't get indexed or normalized
      • centuries are indexed as such, e.g., 17th century/cent. as 1601-1700
      • for date elements with start and end attributes, use first start/end pair and treat these as a known date, e.g., start=1900, end=1920, index 1900-1920
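To make the expansion rules concrete, here is a small sketch that turns a raw date string into the list of years to index. It is only an illustration: the regular expressions cover just the example shapes named above, the treatment of a four-digit year followed by "?" (indexed as that single year) is an assumption, start/end attribute pairs are assumed to be joined into a "start-end" string by the caller, and `index_years` is a hypothetical name.

```python
import re


def index_years(raw):
    """Expand a raw date string into the list of years to index."""
    s = raw.strip().lower()

    # circa dates: +/- 5 years, e.g. "ca. 1945" -> 1940..1950
    m = re.fullmatch(r"(?:ca\.?|circa)\s*(\d{4})", s)
    if m:
        year = int(m.group(1))
        return list(range(year - 5, year + 6))

    # known range: every year inclusive, e.g. "1944-1950"
    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", s)
    if m:
        start, end = int(m.group(1)), int(m.group(2))
        return list(range(start, end + 1)) if start <= end else []

    # uncertain decade: "198?", "1980s", "198-" -> 1980..1989
    m = re.fullmatch(r"(\d{3})(?:\?|-|0s)", s)
    if m:
        start = int(m.group(1)) * 10
        return list(range(start, start + 10))

    # centuries: "17th century" / "17th cent." -> 1601..1700
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)\s+(?:century|cent\.?)", s)
    if m:
        century = int(m.group(1))
        return list(range((century - 1) * 100 + 1, century * 100 + 1))

    # single year, optionally a copyright "c" prefix or trailing "?"
    m = re.fullmatch(r"c?(\d{4})\??", s)
    if m:
        return [int(m.group(1))]

    # non-dates such as "n.d." are not indexed
    return []
```

Because a circa date expands across a decade boundary, "ca. 1945" ends up searchable in both the 1940s and the 1950s, as the rules require.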

      sorting:

      • for a range of known or circa dates, choose the mid-point date, e.g., for 1944-1950, choose 1947; for ca. 1945, choose 1945 (because indexing expanded to 1940-1950)
      • for a range of uncertain dates, choose the beginning date, e.g., for 1940?, choose 1940
      • for a single date, choose that date
      • non-dates, e.g., n.d., [no date], should sort at the end, no matter whether sort is chronological or reverse chronological
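A sort key following these rules might look like the sketch below. It assumes the raw date strings take the shapes used in the examples above, and the names `sort_year` and `sort_dates` are hypothetical. Undated values are split off and appended after sorting, so they land at the end in either direction.

```python
import re


def sort_year(raw):
    """Return the single year used for sorting, or None for a non-date."""
    s = raw.strip().lower()

    # known range: mid-point, e.g. "1944-1950" -> 1947
    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", s)
    if m:
        return (int(m.group(1)) + int(m.group(2))) // 2

    # uncertain decade: beginning date, e.g. "198?" -> 1980
    m = re.fullmatch(r"(\d{3})(?:\?|-|0s)", s)
    if m:
        return int(m.group(1)) * 10

    # single, circa, copyright or uncertain year: the year itself
    m = re.fullmatch(r"(?:ca\.?\s*|circa\s*|c)?(\d{4})\??", s)
    if m:
        return int(m.group(1))

    return None


def sort_dates(raw_dates, reverse=False):
    """Sort date strings chronologically, keeping non-dates at the end."""
    dated = [d for d in raw_dates if sort_year(d) is not None]
    undated = [d for d in raw_dates if sort_year(d) is None]
    dated.sort(key=sort_year, reverse=reverse)
    return dated + undated
```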

      display:

      • display all dates in the original encoding
      • do not display the normalized value for the indexed date field
      • records with no date should not appear if a date or date range is searched
      • display copyright, circa and uncertain dates as is, e.g., c1945, 1845?, ca. 1944

      MODS fields not used for data processing, although they may be used for other things, are:

      • mods
      • modsCollection
      • recordInfo
      • dateCaptured
      • dateModified
      • dateValid
      • extension -- being used to contain asset action information, but not correctly; on hold for now

      Remaining questions:

      • Should all elements be displayed in full display?
      • How does one choose the best URL to use for display? Is /mods/location/url[@usage='primary display'] sufficient? Should we ignore location/url@displayLabel?
      • Is typeOfResource beneficial as a browse facet?

      [original page]