Metadata harvesting and processing

When a new collection is added

  • Update the collections registry
    • Make sure the setSpec values are entered correctly.  If a collection has no setSpec, you need to create a unique value and add it to the database manually in the Collections.identifier column (a sketch of this is shown below); this process needs to be improved in the future.
    • Find a thumbnail image to represent the new collection.  The image must be a JPEG about 144x144 pixels in size, named set_spec_value.thumb.jpg, with any colons in the set_spec replaced by underscores in the file name (see the example below).  The image is placed in the Rails public/images/collection directory.
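    For example, the identifier value can be set directly in the database.  This is only a sketch, assuming a MySQL backend; the database name, credentials, identifier value, and row id are all placeholders:
    Code Block
    mysql -u aquifer -p aquifer_production \
      -e "UPDATE Collections SET identifier = 'local:new_collection' WHERE id = 42;"
    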
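    As an illustration, for a hypothetical set_spec of oai:example:maps the thumbnail file would be named oai_example_maps.thumb.jpg.  A sketch of preparing it, assuming ImageMagick is installed and source.jpg stands in for the original image:
    Code Block
    convert source.jpg -resize 144x144 public/images/collection/oai_example_maps.thumb.jpg
    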
  • Harvest the new collection into the portal.  This is a fairly quick rake task:
    Code Block
    rake aquifer:reharvest_collection_descriptions
    
    It will harvest all the collection descriptions, not just the new one. 
  • Harvest the MODS metadata for the new collections.  This is also a rake task, but one which can run for a long time depending on the number of new records:
    Code Block
    rake aquifer:reharvest_metadata
    
    This harvests metadata from the aggregation that has been added or changed since the last harvest.  If a harvest fails partway through, you can restart it by specifying the resumption token, such as
    Code Block
    rake aquifer:reharvest_metadata resume=token
    
    The resumption token can be found in the log output for this task.
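    If the task's output was redirected to a log file (see the note on background execution below), one way to find the most recent token is a sketch like this, assuming the token appears on lines mentioning "resumption":
    Code Block
    grep -i resumption log/task.log | tail -1
    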
  • Next you need to update the collection counter cache (count_set_spec).  This is a count of records for each collection.  Run this rake task:
    Code Block
    rake aquifer:update_one_collection_count set_spec=?
    
    where the set_spec parameter value is the collection_code or set_spec of the newly added collection.
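    For example, using the hypothetical set_spec from above:
    Code Block
    rake aquifer:update_one_collection_count set_spec=oai:example:maps
    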
  • Next the raw_xml needs to be transformed and copied into the items table.  Run this rake task: 
    Code Block
    rake aquifer:transform_raw_xml incr=yyyy-mm-dd
    
    The date is the date on which the last incremental harvest was run, i.e. the date of the last aquifer:reharvest_metadata task.  Only records which have been modified or added since that date will be processed.
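    For example, if the last metadata harvest ran on a hypothetical date of 2011-03-01:
    Code Block
    rake aquifer:transform_raw_xml incr=2011-03-01
    
    The same incr=yyyy-mm-dd form is used by the aquifer:geo_resolver and aquifer:index_items tasks below.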
  • Sometimes the linkage between the collection and the item or raw_xml records is not set properly when the items are harvested. (This issue is under investigation.) To correct it, run this task:
    Code Block
    rake aquifer:fix_orphaned_items
    
  • Sometimes items may also be orphaned from their corresponding raw_xml records. (This issue is also under investigation; we think it occurs when a metadata harvest fails and has to be restarted.) This task will delete any orphaned item records:
    Code Block
    rake aquifer:delete_orphaned_items
    
  • Next you need to update the geographic name mappings for the newly harvested records. Run this task:
    Code Block
    rake aquifer:geo_resolver incr=yyyy-mm-dd
    
    The date is the date on which the last transformation from raw_xml to items was run, i.e. the date of the last aquifer:transform_raw_xml task.  Only records which have been modified or added since that date will be processed.
  • Next the statistics used to create the browse tag clouds and related functionality need to be updated. There are several rake tasks for this:
    Code Block
    rake aquifer:clear_all_stats   # optional
    rake aquifer:heading_builder
    rake aquifer:update_stat_totals
    
  • Finally, the Solr indexes need to be updated. Run this rake task:
    Code Block
    rake aquifer:index_items incr=yyyy-mm-dd
    
    The date is the date on which the last transformation from raw_xml to items was run, i.e. the date of the last aquifer:transform_raw_xml task.  Only records which have been modified or added since that date will be processed.

NOTE:  Rake tasks can be run in the background like this:

Code Block
nohup rake task:task > log/task.log &

Most of these tasks should be run in the background because they can take a long time.  They are also prone to failing for various reasons, so be sure to check the resulting log file.  You can generally restart a failed job at a specific point by adding a start_id= parameter whose value is the id of the record at which to restart processing, as shown below.  See the aquifer.rake file for details on other parameters.
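For example, to restart a failed transformation run at a specific record (the date, record id, and log file name are placeholders):

Code Block
nohup rake aquifer:transform_raw_xml incr=2011-03-01 start_id=12345 > log/transform.log &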

You must also specify the Rails production environment with the RAILS_ENV=production parameter, such as:

Code Block
nohup rake RAILS_ENV=production task:task > log/task.log &
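
Putting it all together, a complete run for a newly added collection might look like the sketch below.  The set_spec and dates are placeholders (a single date is used here because all steps run on the same day), and the orphan-fixing tasks are included for completeness; in practice each task's log should be checked for success before the next one is started:

Code Block
export RAILS_ENV=production
rake aquifer:reharvest_collection_descriptions
rake aquifer:reharvest_metadata
rake aquifer:update_one_collection_count set_spec=oai:example:maps
rake aquifer:transform_raw_xml incr=2011-03-01
rake aquifer:fix_orphaned_items
rake aquifer:delete_orphaned_items
rake aquifer:geo_resolver incr=2011-03-01
rake aquifer:heading_builder
rake aquifer:update_stat_totals
rake aquifer:index_items incr=2011-03-01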