This page holds notes on the current Fedora configuration, as well as miscellaneous information that must be understood when
configuring Fedora.

Also see: Asset Definitions, Fedora Resource Index, Behavior Mechanisms, Directing PURLs to Fedora, Fedora Batch Modify


Requirements

In order to run a large repository (more than 100k Fedora Objects) while using the ResourceIndex, you must have a 64-bit version of Java running on a 64-bit OS.

Rhyme setup

Connects to Oracle database on ora-iudlu-dev (urania). When we move Fedora to production, we will need to convert to a production database.

Uses port 9090 for service, 9005 for shutdown, and 9443 for redirect. This keeps it from conflicting with any Tomcat instances running on the same machine.

The fedora.sh script has been modified to increase the available Java heap space (-Xmx768m).

Since Fedora isn't very tolerant of losing its database connection, there is a cron job to stop it before the database is shut down for backups, and another cron job to start it afterwards.

To start Fedora on Rhyme

  1. Log in to your rhyme account
  2. Log in to the Fedora account: su - fedora
  3. Type in the password (same as the Fedora administrator password)
  4. Ensure that Fedora is not running: fedora-stop
  5. Start Fedora: fedora-start oracle
  6. Log out

Current test setup (on mallow)

Must run fedora-convert-demos to put the correct hostname in the demo objects.

Current McKoi username & password: fedora

Running on port 8080

For Fedora 2.0, you MUST INSTALL the patch available at http://scripta.lib.virginia.edu/bugs/show_bug.cgi?id=83 (the attachments are near the top of the page, and they download with a CGI extension that must be changed to the correct file type)

Demo object XML is in My Documents\fedora-2.0-src\dist\client\demo\foxml (there is a parallel directory for the METS versions, but it's unlikely that we will use these)

Start with:

  mckoi-start
  fedora-start mckoi

Stop with:

  fedora-stop
  mckoi-stop username password

Administration tool:

fedora-admin mallow.dlib.indiana.edu 8080 fedoraAdmin fedoraAdmin

General notes

Fedora runs on its own (modified?) instance of Tomcat. It is currently not advisable to run anything besides Fedora on this version of Tomcat, because it has been tuned to give some performance enhancements for Fedora use. Be very careful when selecting ports so they don't conflict with another Tomcat that may be running on the same machine. If you change the port on which Fedora runs, it will automatically reconfigure the Fedora Tomcat, since this is really the service that's running on that port. Certain types of changes to the Tomcat config are overwritten by Fedora, so it is unlikely that we could use this copy of Tomcat for anything else.

When ingesting objects, use the admin password, not the database password.

The Fedora server must be restarted for any configuration changes to take effect.

The documentation makes it seem fairly easy to move data from one repository to another: just tell the new Fedora instance to ingest all of the data from the old instance. No idea how long this would take, though.

Object records must be in XML form (METS or FOXML) to be ingested.

In the sample web interface, "View the Item Index" means "View the datastreams".

Fedora provides a lot of undocumented services. See the <fedora-home>/server/tomcat/webapps/fedora/WEB-INF/web.xml file for a full listing. The more interesting ones are:

  • report: information on objects that were recently created/modified
  • risearch: search the resource index (Kowari)
  • getObjectHistory/<pid>: list timestamps of changes to the object
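As a rough illustration, the services above could be addressed with URLs built like this. This is only a sketch: it assumes the services are mounted under the fedora web application path (inferred from the webapps/fedora directory named above), it reuses the host and port of the mallow test setup, and both the helper name service_url and the PID demo:5 are hypothetical.

```python
from urllib.parse import quote

# Assumption: the services live under /fedora/ on the Fedora Tomcat instance.
BASE = "http://mallow.dlib.indiana.edu:8080/fedora"

def service_url(service, pid=None, base=BASE):
    """Build a URL for one of the services listed in web.xml (hypothetical helper)."""
    url = f"{base}/{service}"
    if pid is not None:
        # Keep the colon in the PID readable; escape anything else unusual.
        url += "/" + quote(pid, safe=":")
    return url

print(service_url("report"))
print(service_url("risearch"))
print(service_url("getObjectHistory", pid="demo:5"))  # demo:5 is a placeholder PID
```

Check the servlet mappings in web.xml before relying on these paths; any query parameters the services accept are not covered here.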

More documentation of API-A-LITE can be found at fedora-2.0-src/dist/userdocs/client/browser/webexp.html

Once created, behavior definitions cannot be changed. Behavior mechanisms can only be changed marginally.

When copying Fedora objects between repositories, Fedora-level references to the local repository are changed. This means that for a datastream that redirects to another object in the same repository, or a behavior mechanism that contains the URL of the local saxon, the machine name and port number will be updated. However, references inside a datastream (like a reference in XSL) will not be updated.

When making a change to an XSL file, there is no simple way to reset the cache, unless the behavior mechanism explicitly uses the clear-stylesheet-cache option. The only thing you can do is restart Fedora (which restarts Tomcat).

Fedora bugs

Bugs can be reported to Fedora's Bugzilla
user: fedora-bugreport at comm.nsdl.org
pass: bugreport

OAI

OAI export works automatically.

For example, see:

However, we will probably devise a separate export system to provide more data (unless recent updates to the OAI provider can meet all of our needs).

Data storage

The XML records that represent Fedora objects are stored in Fedora's objects directory (fedora2_0_objects by default). Underneath this directory, they are organized by a crazy date/time directory structure. Even though they don't have an XML extension, the files are really XML.

Objects that are loaded as "Fedora managed" content have their datastreams stored in the datastreams directory (fedora2_0_datastreams) using the same crazy directory structure. The file content is unchanged from the file that was loaded, but the filename is changed to reflect the PID and datastream ID.

The database coordinates all of these objects and datastreams using a fairly straightforward table setup.

If we want to convert from Managed to External content, we can just purge and re-create the datastreams. Of course, this would lose any version information.

Space issues

We are going to initially use rhyme.dlib.indiana.edu. Its core stats are: dual 3GHz CPUs, 6GB RAM, 420GB usable disk space.

Current stats for other collections:

  • Hohenberger: 2143 images, each with master, thumbnail, screen, and large JPG. Masters take 13GB. Derivatives take 1.4GB. Full Fedora storage (derivatives, metadata, and resource index) takes 2.5GB.
  • DIDO: 40,000 images, each with master, thumbnail, screen, and large images. Derivatives take 15GB.
  • US Steel: 2200 images with master, thumbnail, and screen. Masters take 20GB.
  • Cushman: 14,500 images. Derivatives (including notebooks) 3.2GB.
  • Victorian Women Writers: (text only)
  • Wright American Fiction: 400,000 pages. Derivatives are generated on the fly. Master images and text files are 111GB.
  • Sheet Music (DeVincent): 50,000 pages. Screen and full/large for all pages; thumbnails for some. Derivatives 11GB.
  • Sheet Music (Starr): Derivatives 600MB.
  • Jane Johnson: Derivatives 300MB.
  • Letopis
  • Hoagy
  • FLI
  • Eviada
  • Newton
  • IN Harmony
  • Camva
  • IU Archives: 600 images. Does not have large size yet. Derivatives 113MB.

Rough guideline: If we create three derivative sizes for each image, 1GB per thousand images/pages.

An incredibly rough estimate: 100GB to store derivatives for all current collections.
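The 1GB-per-thousand guideline can be applied to the collections above that list an image or page count. The sketch below is only an illustration of that arithmetic, not how the 100GB figure was derived. Wright American Fiction is left out because its derivatives are generated on the fly, and collections without a count are skipped. The function name is hypothetical.

```python
# Guideline from the notes: about 1 GB of derivatives per 1,000 images/pages
# when three derivative sizes are created for each image.
GB_PER_THOUSAND = 1.0

def estimate_derivative_gb(image_count, gb_per_thousand=GB_PER_THOUSAND):
    """Return a rough derivative storage estimate in GB."""
    return image_count / 1000 * gb_per_thousand

# Image/page counts copied from the collection list above.
collections = {
    "Hohenberger": 2143,
    "DIDO": 40000,
    "US Steel": 2200,
    "Cushman": 14500,
    "Sheet Music (DeVincent)": 50000,
    "IU Archives": 600,
}

for name, count in collections.items():
    print(f"{name}: ~{estimate_derivative_gb(count):.1f} GB")

total = sum(estimate_derivative_gb(c) for c in collections.values())
print(f"Total for listed collections: ~{total:.0f} GB")
```

The guideline appears to err on the high side: it gives roughly 2.1GB for Hohenberger, where the measured derivatives take 1.4GB.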

Note that if we move our derivatives to JPEG2000, storage needs may change slightly, but not by very much.

The Fedora project has done some performance testing on a repository with 1 million objects.

Other system limits

How many items can share a PID prefix? A PID can be up to 64 characters long, so if we use the prefix "iudl:", we have plenty of room left for numerical data. We could even add a collection code, like "iudl:hohenberger-1214".
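To put a number on "plenty", the sketch below builds PIDs in the two styles mentioned and counts how many numeric suffixes fit in the remaining characters. It takes the 64 limit from the note above as given. The helper names are hypothetical, and the hyphen separator simply follows the "iudl:hohenberger-1214" example.

```python
MAX_PID_LENGTH = 64  # limit as stated in the notes above

def make_pid(prefix, number, collection=None):
    """Build a PID such as 'iudl:1214' or 'iudl:hohenberger-1214' (hypothetical helper)."""
    local = f"{collection}-{number}" if collection else str(number)
    pid = f"{prefix}:{local}"
    if len(pid) > MAX_PID_LENGTH:
        raise ValueError(f"PID is {len(pid)} characters; limit is {MAX_PID_LENGTH}")
    return pid

def max_numeric_ids(prefix, collection=None):
    """Count the numeric suffixes that fit in the characters left after the prefix."""
    # Characters used by "prefix:" plus, if present, "collection-".
    used = len(prefix) + 1 + (len(collection) + 1 if collection else 0)
    digits_left = MAX_PID_LENGTH - used
    return 10 ** digits_left if digits_left > 0 else 0

print(make_pid("iudl", 1214))
print(make_pid("iudl", 1214, collection="hohenberger"))
print(max_numeric_ids("iudl"))
print(max_numeric_ids("iudl", collection="hohenberger"))
```

Even with a collection code in the PID, the number of available identifiers is far larger than any collection listed on this page.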

Purging a repository

(from the Fedora mailing list)
The best way to purge all objects from a Fedora repository is to reset the repository. Here are the steps:

  1. Stop the Fedora instance.
  2. Drop the Fedora database (MySQL?)
  3. Create a blank Fedora database with the same permissions/privileges
  4. Delete the files and subdirectories from the Fedora objects, datastreams, temp, and resourceIndex directories.
  5. Start the Fedora server.
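Step 4 is the easiest one to get wrong by hand, so here is a minimal sketch of it. It empties each storage directory but keeps the directory itself. The directory names passed in are placeholders (the notes above mention fedora2_0_objects and fedora2_0_datastreams as defaults); substitute the real paths from the Fedora configuration. Stopping Fedora, dropping and re-creating the database, and restarting remain manual steps.

```python
import shutil
from pathlib import Path

def clear_directory(directory):
    """Delete everything inside `directory` but keep the directory itself."""
    directory = Path(directory)
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # remove a whole subdirectory tree
        else:
            entry.unlink()         # remove a plain file or a symlink

def clear_repository_storage(
    fedora_home,
    names=("objects", "datastreams", "temp", "resourceIndex"),
):
    """Step 4 only. The names are placeholders; pass the real directory names."""
    for name in names:
        target = Path(fedora_home) / name
        if target.is_dir():
            clear_directory(target)
```

Run it only while Fedora is stopped, for example clear_repository_storage("/path/to/fedora/data"), where the path shown is a placeholder.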

Multiple repositories

We will likely want to run more than one repository, at least one for cataloging/testing use and one for production use. Will thinks it may be useful to keep one centralized repository for the master metadata and periodically export that data to one or more production repositories.

If we do split up the repositories like this, will we want to also have duplicate copies of the media files? Or should all media files be stored outside the repositories, on a separately managed filesystem (or set of filesystems)?

Moving data between repositories can be an issue if relationships are present.

Fedora will eventually have built-in support for federated repositories.