This page collects miscellaneous scripts used to verify and correct data during and after the upgrade from R5 to R6. These scripts are for reference only and are not generally applicable; they serve as examples of how to work with the data in R5 and R6.

Missing Posters

After migration we noticed that some of our videos didn't have thumbnails or posters. We ran these scripts to check the status of those items in R6 and found that they either had no derivatives or had a duration of 0. We then checked the same items in R5 and found the same issues there.

Scripts used to determine the state of the problem MasterFiles:

# On Fedora 4 system, find migrated MasterFiles that don't have posters or durations 

mfs = [] 
durations = [] 
count = 0 
MasterFile.find_each({},{batch_size:5}) do |mf| 
  count += 1 
  mfs << mf.id unless mf.file_format == 'Sound' or mf.has_poster? 
  durations << mf.id if mf.duration.nil? 
  if count%200 == 0 
    puts " #{count} - #{mfs.count}, #{durations.count}" 
  else 
    print "." 
  end 
end 

# Get the Fedora 3 pids for those items 

f3s = [] 
MigrationStatus.where( f4_pid: mfs|durations ).each do |status| 
  f3s << status.f3_pid 
end 

f3s 

# On Fedora 3 system, see if those MasterFiles had derivatives/duration before migrating 

f3s.each do |id| 
  mf = MasterFile.find(id) 
  puts "#{id}: #{mf.duration} #{mf.derivatives.count}" 
end 
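
The scan above flags only a nil duration, while the problem items described earlier also included durations of 0. A minimal sketch of a broader check follows. `duration_missing?` is a hypothetical helper, not an Avalon method, and it assumes the duration may arrive as nil, a number, or a numeric string.

```ruby
# Hypothetical helper (not part of Avalon): treats nil, blank, and zero
# durations as missing. Assumes numeric strings such as "0" or "1500".
# String#to_f returns 0.0 for non-numeric text, so such values are also
# reported as missing.
def duration_missing?(value)
  value.to_s.strip.to_f.zero?
end

# Invented sample values covering the cases mentioned above
samples = [nil, '', '0', 0, '1500', 2300]
samples.each do |value|
  puts "#{value.inspect}: #{duration_missing?(value) ? 'missing' : 'ok'}"
end
```

In the scan, `durations << mf.id if duration_missing?(mf.duration)` could stand in for the nil check, assuming `mf.duration` returns one of the value types listed above.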

Permalink Collisions

For some reason, after migrating we had permalinks that pointed to more than one MasterFile. We ran this script to identify those collisions and then fix them.

#!/usr/bin/env ruby
require 'open-uri'
require 'nokogiri'
def report(collisions)
  puts "Found #{collisions.keys.size} Collisions" unless collisions.empty?
  collisions.each do |id,field_values|
    puts "-------------" 
    puts "#{id} ->"
    field_values.each {|doc| puts doc}
  end
  puts "-------------"
  f4_ids = collisions.values.collect {|arr| arr.collect {|h| h[:id]}}.flatten.uniq
  puts "Collisions involve #{f4_ids.size} objects:"
  puts f4_ids.inspect
end
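# Note: on Ruby 3.0 and later, Kernel#open no longer fetches URLs; call URI.open instead, for this request and for the per-id request further down.
result = Nokogiri::XML(open("http://localhost:8983/solr/avalon/select?q=*&rows=0&facet=on&facet.field=identifier_ssim&facet.limit=-1&facet.mincount=2"))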
collided_ids = result.xpath('//lst[@name="identifier_ssim"]/int/@name').collect(&:value)
collisions = {}
collided_ids.each do |id|
  fields = ["id", "has_model_ssim", "system_create_dtsi", "system_modified_dtsi"]
  collided_docs = Nokogiri::XML(open("http://localhost:8983/solr/avalon/select?q=identifier_ssim:#{id}&fl=#{fields.join(',')}"))
  collided_docs.xpath('//doc').each do |doc|
    field_values = {}
    fields.each {|f| field_values[f.to_sym] = doc.xpath("*[@name='#{f}']").text }
    obj_id = field_values[:id]
    collisions[id] ||= []
    collisions[id] << field_values
  end
end
report(collisions)

# use the above to generate a list of collisions, then munge them into this form:
# h={permalink_id1: [masterfile_id1, masterfile_id2], ... } 
# once your h hash looks good, pass it to split_mfs to correct collisions
def split_mfs h
  mo_cache = {}
  ids_cache = {}
  h.values.each do |vals|
    mf1, mf2 = vals
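    m1 = MasterFile.find(mf1) rescue nil
    m2 = MasterFile.find(mf2) rescue nil
    # skip this pair if either MasterFile could not be loaded
    if m1.nil? || m2.nil?
      puts "#{mf1} (1) / #{mf2} (2): master file not found"
      next
    end
    good_mf = nil
    bad_mf = nil
    mo = nil
    print "#{m1.id} (1) / #{m2.id} (2): MediaObject "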

    if m1.derivatives.count > 0
      good_mf = m1
      bad_mf = m2
    elsif m2.derivatives.count > 0
      good_mf = m2
      bad_mf = m1
    end

    if good_mf.present? and bad_mf.present?

      # use media_object_id on both sides so a missing association yields nil instead of raising
      mo_id = good_mf.media_object_id || bad_mf.media_object_id
      mo_cache[mo_id] ||= MediaObject.find(mo_id) rescue nil
      mo = mo_cache[mo_id]

      if mo.present?
        print "#{mo_id} "
        ids_cache[mo_id] ||= mo.master_files.collect(&:id)
        if ids_cache[mo_id].include? good_mf.id
          puts " correctly associated with good mf #{good_mf.id}, deleting bad_mf #{bad_mf.id}"
          bad_mf.delete
        elsif ids_cache[mo_id].include? bad_mf.id
          mf_index = mo.ordered_master_file_ids.index bad_mf.id
          puts " dropping association with and deleting bad_mf #{bad_mf.id} at index #{mf_index}. Associating good_mf #{good_mf.id}"
          good_mf.media_object_id = mo_id
          good_mf.save!
          mo.ordered_master_files.delete_at( mf_index )
          mo.ordered_master_files.insert_at( mf_index, good_mf )
          mo.master_files -= [bad_mf]
          mo.save!
          bad_mf.delete
        else
          puts " not associated with either mf #{good_mf.id} or #{bad_mf.id}"
        end
      else
        puts " media_object not found"
      end
    else
      puts " derivatives not found "
    end
  end
end
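
The munging step mentioned in the comments before `split_mfs` can be sketched as follows. `collisions_to_h` is a hypothetical helper that is not part of the original scripts. It assumes the `collisions` hash built by the report script, that `has_model_ssim` reads `MasterFile` for master files, and that only permalinks shared by exactly two MasterFiles should be passed on, since `split_mfs` works on pairs. The ids in the sample data are invented.

```ruby
# Hypothetical helper (not in the original scripts): converts the
# collisions hash produced by the report script into the
# { permalink_id => [masterfile_id1, masterfile_id2] } form.
def collisions_to_h(collisions)
  collisions.each_with_object({}) do |(permalink_id, docs), h|
    # keep only MasterFile documents; other models are ignored
    mf_ids = docs.select { |d| d[:has_model_ssim] == 'MasterFile' }
                 .map { |d| d[:id] }
    # split_mfs expects pairs, so anything else is left for manual review
    h[permalink_id] = mf_ids if mf_ids.size == 2
  end
end

# Invented sample data in the same shape the report script builds
sample = {
  'abc123' => [
    { id: 'mf-1', has_model_ssim: 'MasterFile' },
    { id: 'mf-2', has_model_ssim: 'MasterFile' }
  ],
  'def456' => [
    { id: 'mo-1', has_model_ssim: 'MediaObject' },
    { id: 'mf-3', has_model_ssim: 'MasterFile' }
  ]
}

h = collisions_to_h(sample)
puts h.inspect
```

Inspect the resulting hash by hand before passing it to `split_mfs`, since that method deletes objects.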

Update Permalinks

This script rewrites http permalinks as https. The MediaObject pass is commented out, so only the MasterFile pass runs as written.

def permalink_https!(obj)
  permalink_uri = URI.parse(obj.permalink)
  if permalink_uri.scheme == 'http'
    permalink_uri.scheme = 'https'
    obj.permalink = permalink_uri.to_s
    obj.save!
  else
    false
  end
end
 
#puts "Moving media object permalinks to https"
#count = 0
#MediaObject.find_each({},{batch_size: 5}) do |obj|
#  begin
#    result = permalink_https!(obj)
#    if result
#      count = count + 1
#      puts "#{count}: Updated #{obj.id}: #{obj.permalink}"
#    else
#      puts "#{count}: Skipping #{obj.id}: #{obj.permalink}"
#    end
#  rescue URI::InvalidURIError => e
#     puts "Failed for #{obj.id}: #{obj.permalink}"
#     puts e.backtrace
#  end
#end
#puts "Solr commit and optimize"
#ActiveFedora::SolrService.instance.conn.commit
#ActiveFedora::SolrService.instance.conn.optimize
puts "Moving master file permalinks to https"
count = 0
MasterFile.find_each({},{batch_size:5}) do |obj|
  begin
    result = permalink_https!(obj)
    if result
      count = count + 1
      puts "#{count}: Updated #{obj.id}: #{obj.permalink}"
    else
      puts "#{count}: Skipping #{obj.id}: #{obj.permalink}"
    end
  rescue URI::InvalidURIError => e
     puts "Failed for #{obj.id}: #{obj.permalink}"
     puts e.backtrace
  end
end
puts "Solr commit and optimize"
ActiveFedora::SolrService.instance.conn.commit
ActiveFedora::SolrService.instance.conn.optimize
puts "Done."
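
The scheme rewrite inside `permalink_https!` can be tried in isolation, without a repository object, before running the full pass. The sketch below uses only Ruby's standard `uri` library. `https_permalink` is a hypothetical helper and the URLs are invented examples.

```ruby
require 'uri'

# Hypothetical helper: returns the https form of an http URL, or nil
# when the URL does not use the http scheme and needs no change.
# Raises URI::InvalidURIError for values that cannot be parsed.
def https_permalink(url)
  uri = URI.parse(url)
  return nil unless uri.scheme == 'http'
  uri.scheme = 'https'
  uri.to_s
end

puts https_permalink('http://purl.example.edu/abc123').inspect
puts https_permalink('https://purl.example.edu/abc123').inspect
```

Returning nil for unchanged URLs mirrors the `false` branch of `permalink_https!`, so a caller can tell updated permalinks from skipped ones.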

Reindex PURLS

puts "Reindexing media objects"
count = 0
MediaObject.find_each({},{batch_size: 5}) do |obj|
  obj.update_index
  count = count + 1
  puts "#{count}: Updated index for #{obj.id}: #{obj.identifier.join(', ')}"
end
puts "Solr commit and optimize"
ActiveFedora::SolrService.instance.conn.commit
ActiveFedora::SolrService.instance.conn.optimize
puts "Reindexing master files"
count = 0
MasterFile.find_each({},{batch_size: 5}) do |obj|
  obj.update_index
  count = count + 1
  puts "#{count}: Updated index for #{obj.id}: #{obj.identifier.join(', ')}"
end
puts "Solr commit and optimize"
ActiveFedora::SolrService.instance.conn.commit
ActiveFedora::SolrService.instance.conn.optimize
puts "done"

Permalink Validation

Script to double check that permalinks for MediaObjects and MasterFiles ended up with the correct values and in the correct places after migration.

load '/srv/avalon/avalon_r6/f3_mopermalinks.rb' # includes F3P = { f3_pid: permalink } 
load '/srv/avalon/avalon_r6/f3_mfpermalinks.rb' # includes F3MFP = { f3_pid: permalink } 

puts 'Inspecting F4 permalinks for MediaObjects' 
puts '1) If permalink in Fedora 3, the Fedora 4 object should have the same permalink' 
puts '2) If permalink in Fedora 4, the permalink noid should be persisted in Fedora 4 in identifiers:local ' 
puts '3) If permalink in Fedora 4, the permalink noid should be persisted in Solr in identifier_ssim ' 

count = 0 

MediaObject.find_each({},{batch_size: 5}) do |obj| 
  count += 1 
  migration_status = MigrationStatus.where(f4_pid: obj.id).first 
  unless migration_status.nil? 
    f3_permalink = F3P[migration_status.f3_pid] 
    unless f3_permalink.nil? 
      unless f3_permalink.split('://').last == obj.permalink.to_s.split('://').last # to_s reports a missing F4 permalink as a mismatch instead of raising
        puts "#{obj.id} F3/F4 permalink mismatch: #{f3_permalink} / #{obj.permalink}" 
      end 
    end 
  end 
  if obj.permalink.present? 
    unless obj.identifier.include? obj.permalink.split('/').last 
      puts "#{obj.id} F4 permalink not found in identifiers" 
    end 
    unless obj.to_solr['permalink_tesim'].include? obj.permalink 
      puts "#{obj.id} F4 permalink not found in solr" 
    end 
  end 
  print '.' if (count%10==0) 
  puts count if (count%1000==0) 
  STDOUT.flush if (count%10==0) 
end 

count = 0 

puts '' 
puts 'Inspecting F4 permalinks for MasterFiles' 

MasterFile.find_each({},{batch_size: 5}) do |obj| 
  count += 1 
  migration_status = MigrationStatus.where(f4_pid: obj.id).first 
  unless migration_status.nil? 
    f3_permalink = F3MFP[migration_status.f3_pid] 
    unless f3_permalink.nil? 
      unless f3_permalink.split('://').last == obj.permalink.to_s.split('://').last # to_s reports a missing F4 permalink as a mismatch instead of raising
        puts "#{obj.id} F3/F4 permalink mismatch: #{f3_permalink} / #{obj.permalink}" 
      end 
    end 
  end 
  if obj.permalink.present? 
    unless obj.identifier.include? obj.permalink.split('/').last 
      puts "#{obj.id} F4 permalink not found in identifiers" 
    end 
    unless obj.to_solr['permalink_tesim'].include? obj.permalink 
      puts "#{obj.id} F4 permalink not found in solr" 
    end 
  end 
  print '.' if (count%10==0) 
  puts count if (count%1000==0) 
  STDOUT.flush if (count%10==0) 
end
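
The two string comparisons that the validation relies on can be checked on their own. The sketch below is illustrative only: `SAMPLE_F3P` is a hypothetical stand-in for the hash loaded from `f3_mopermalinks.rb`, with string keys on the assumption that `f3_pid` is stored as a string, and the pid and URLs are invented.

```ruby
# Hypothetical stand-in for the F3P hash; the pid and URL are invented.
SAMPLE_F3P = { 'avalon:101' => 'http://purl.example.edu/abc123' }

# Compares two permalinks while ignoring the scheme, so an http
# permalink from Fedora 3 matches its https counterpart in Fedora 4.
# Two missing (nil) permalinks also compare equal.
def same_permalink?(a, b)
  a.to_s.split('://').last == b.to_s.split('://').last
end

# Extracts the trailing path segment (the noid) from a permalink.
def permalink_noid(permalink)
  permalink.to_s.split('/').last
end

f3_permalink = SAMPLE_F3P['avalon:101']
f4_permalink = 'https://purl.example.edu/abc123'

puts same_permalink?(f3_permalink, f4_permalink) # prints true
puts permalink_noid(f4_permalink)                # prints abc123
```

Because a missing permalink is converted to an empty string, a Fedora 4 object with no permalink shows up as a mismatch rather than raising an error.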