We will need to track various types of statistics, and each type will likely need to be tracked at a different location.

  • The Ingest Tool will need to track ingests and updates to objects in the repository.
  • The SRU Server will need to track searching information.
  • Fedora (or the PURL resolver) will need to track access to actual media files.
  • In some cases, individual collection webapps will need to track their own information.

The Fedora Statistics page has some information about how to collect statistics.
Also see the types of statistics currently collected, listed below.

Note on current statistics collection: For collections based on the searchWebapp, two types of detail pages are tracked. One includes "action=detail" in the URL. This indicates that the page was accessed as the result of a search, browse, or some other action that carries contextual information. The type of URL that does not include "action=detail" indicates that the detail page was accessed directly, probably via a PURL.
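
The distinction above can be sketched in a few lines of Python. This is only an illustration: the function name and the "contextual"/"direct" labels are assumptions, not part of the searchWebapp.

```python
from urllib.parse import urlparse, parse_qs


def classify_detail_access(url: str) -> str:
    """Classify a detail-page URL by whether it carries action=detail.

    'contextual' = reached via search, browse, or a similar action;
    'direct'     = reached directly, probably via a PURL.
    (Both labels are hypothetical names chosen for this sketch.)
    """
    query = parse_qs(urlparse(url).query)
    return "contextual" if "detail" in query.get("action", []) else "direct"


url = ("http://erato.dlib.indiana.edu/slocum/results/detail.do"
       "?action=detail&fullItemID=/lilly/slocum/LL-SLO-011279")
print(classify_detail_access(url))
```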

Summary of current directions

We'll be using a combination of server-level access statistics and application-level statistics. There are ongoing discussions about establishing a DLP-level statistics framework that will be used across applications. Server-level statistics will be produced by AWStats using:

  • Web application logs
  • Fedora web application logs

The main source of access logs for web applications will be the production web servers. These logs will be processed using AWStats. In these web applications, individual items are served by detail page processors. In AWStats output, we'll see entries like http://erato.dlib.indiana.edu/slocum/results/detail.do?action=detail&fullItemID=/lilly/slocum/LL-SLO-011279. We might need a way to aggregate these as "Slocum detail accesses" without referring to the actual item being served. We can also have aggregate statistics such as "Slocum Browses", etc. This can be accomplished by customizing the AWStats configuration (todo: add a section that describes how this is done in AWStats). This is already done for the webapp1 statistics (see the link at the top of the page). Here's an example: awstats.erato.html (note the extra sections at the end of the file). We can also have collection-level AWStats statistics; see awstats.erato-slocum.html for an example for the Slocum Puzzles collection.
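
To make the aggregation idea concrete, here is a minimal Python sketch that collapses item-level URLs into labels such as "slocum detail". It is not the AWStats customization itself. It assumes that the first path segment names the collection and that the "action" query parameter names the activity; both assumptions come from the sample URL above and may not hold for every web application.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs


def aggregate_label(url: str) -> str:
    """Reduce an item-level URL to a '<collection> <action>' label."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Assumption: the first path segment is the collection name.
    collection = segments[0] if segments else "(root)"
    # Assumption: the 'action' parameter names the activity; 'other' if absent.
    action = parse_qs(parsed.query).get("action", ["other"])[0]
    return f"{collection} {action}"


def count_by_label(urls):
    """Count accesses per aggregate label, ignoring the item being served."""
    return Counter(aggregate_label(u) for u in urls)


url = ("http://erato.dlib.indiana.edu/slocum/results/detail.do"
       "?action=detail&fullItemID=/lilly/slocum/LL-SLO-011279")
print(count_by_label([url, url]))
```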

We can also gather some statistics from Fedora Tomcat logs. Tomcat logging needs to be enabled because it might be disabled by default. AWStats can be used to process these logs as well. We can customize AWStats to aggregate accesses to datastreams for image items, such as thumbnail, large, or screen. However, the log entries don't have readable identifiers, only Fedora PIDs (see below for a sample log entry). This will make interpreting the statistics much more difficult, since no collection name or image label is logged. Here's an example of AWStats output generated from Fedora Tomcat logs on Thalia: awstats.fed-thalia.html (see also the customized access section at the end of the attached HTML file).

In short, using AWStats at the web application level, we can see which individual pages are accessed. We can use the extensibility capabilities of AWStats to calculate some aggregated statistics similar to the ones currently done. At the back-end Fedora level, we can access raw item-level logs, but making sense of these will require a mapping between Fedora PIDs and item labels (or titles) and their appropriate parent collections.
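
The PID mapping mentioned above could be applied roughly as follows. This is a sketch under stated assumptions: where the mapping comes from (Fedora, the PurlResolver, or elsewhere) is left open, and the labels and collection names in the example are made up for illustration.

```python
from collections import Counter


def counts_by_collection(pid_hits, pid_info):
    """Roll up per-PID hit counts into per-collection totals.

    pid_hits: {pid: number_of_hits}, e.g. derived from Fedora Tomcat logs.
    pid_info: {pid: (item_label, collection_name)}, a hypothetical mapping.
    PIDs missing from the mapping are grouped under '(unmapped)'.
    """
    totals = Counter()
    for pid, hits in pid_hits.items():
        _label, collection = pid_info.get(pid, (pid, "(unmapped)"))
        totals[collection] += hits
    return totals


# Hypothetical data: the labels and collection names are invented.
pid_hits = {"iudl:19860": 3, "iudl:19888": 2, "iudl:10": 1}
pid_info = {
    "iudl:19860": ("Example item A", "Example Collection"),
    "iudl:19888": ("Example item B", "Example Collection"),
}
print(counts_by_collection(pid_hits, pid_info))
```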

Sample URLs for accessing collections

Slocum

Hoagy

Issues

  • Privacy: Visitor IP numbers cannot be made public and there are rules for how long logs can and should be stored.
  • There are different requirements for different collections. (Randall) IUScholarWorks needs to report to publishers and authors how many readers accessed their documents. (Michelle) We might need a general framework to profile user navigation in dlib applications to be used in usability studies. There is a general requirement for funding and internal reporting purposes to gather usage statistics.

How we use the collected statistics:

Statistics collection in the Fedora Repository

There are several levels where statistics can be collected with different granularities.

At the web application server level, we can collect statistics similar to the ones that are currently collected. These statistics are at the HTML-page and image/object level. Every request from a user's browser to a web page will create a server log entry for the HTML page and one or more log entries for the objects (e.g. images on the page) referenced from the page. One drawback of this scheme is that logos and other static images that are displayed on every page will get a high number of hits. Most of the time, the most meaningful "hit" is a request for an HTML page. AWStats can be configured to disregard log records caused by image (or CSS, etc.) accesses. Since log collection is handled by the web server, this type of statistics collection, accompanied by log display software like AWStats, requires much less effort to implement.
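
As an illustration of separating meaningful page hits from static-asset hits, here is a small Python sketch. The extension list is an assumption and would need tuning per collection; for image collections, some image requests are themselves the meaningful hits.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: these extensions mark static assets rather than pages.
STATIC_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}


def is_page_hit(request_path: str) -> bool:
    """Return True unless the requested path ends in a static-asset extension."""
    path = urlparse(request_path).path  # drop any query string
    extension = posixpath.splitext(path)[1].lower()
    return extension not in STATIC_EXTENSIONS


print(is_page_hit("/cushman/images/helpIcon.gif"))   # static image
print(is_page_hit("/slocum/results/detail.do?action=detail"))  # page request
```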

At the web application level, we can have finer control of what gets logged and hence what counts as a meaningful hit. This requires a) a mechanism implemented by the web application to store the logs (or aggregated statistics) and b) wrapping all links to objects served by this app so that they go through a statistics layer first, before being handled by the content service (i.e. web server). Another option is doing what Google Analytics does: adding a JavaScript call to a statistics service on every page of the web application. This will only work if the user has JavaScript enabled. The statistics service needs to log each access to a database or a log file. We will need an application to analyze the logs and display the statistics.

We'll be using PURLs instead of direct links to Fedora objects. So, if we collect the statistics from the PurlResolve service (which matches Fedora PIDs and PURL/Item IDs), we might be able to solve some of the problems above. Note that web statistics collection is usually not precise due to the statelessness of HTTP and caching.

Search requests need to be collected, too. Those should be collected by the search service.

Statistics from different statistics collectors might need to be combined in some situations. For example, collection front pages (e.g. http://www.dlib.indiana.edu/collections/slocum/) are not served by the PURL service but should be included in the respective collection's statistics.

One advantage of using a tool like AWStats is that it can nicely present the data in tables and charts. We can write small scripts to preprocess the web server logs and pass these processed portions to AWStats for analysis and presentation.
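
A preprocessing script of the kind mentioned above might look like this minimal sketch, which keeps only the log lines whose request path belongs to one collection. It assumes that a collection's pages live under a "/<collection>/" path prefix; the function name is hypothetical. The selected lines could then be written to a file for AWStats to analyze.

```python
import re

# Matches the request path inside the quoted request field of a log line.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')


def lines_for_collection(lines, collection):
    """Yield only the log lines whose request path starts with /<collection>/."""
    prefix = f"/{collection}/"
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(1).startswith(prefix):
            yield line


# Example usage (file names are placeholders):
# with open("access_log") as src, open("slocum_access_log", "w") as dst:
#     dst.writelines(lines_for_collection(src, "slocum"))
```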

Here's an example output from AWStats: awstats.erato.html

Web/Application server logs

Purl Server on gigue (won't be used, webapp links don't go thru this)

  • This is an NCSA httpd server located in /usr/local/etc/purl/current
  • Server log files are in WebRoot/logs: access_log and error_log
  • Log files are in the NCSA/Apache common log format:
    65.6.79.180 - - [27/Apr/2006:23:00:09 -0400] "GET /iudl/archives/cushman/full/P10411.jpg HTTP/1.1" 302 259
    65.6.79.180 - - [27/Apr/2006:23:00:10 -0400] "GET /iudl/archives/cushman/full/P10413.jpg HTTP/1.1" 302 259
    66.249.72.231 - - [27/Apr/2006:23:00:11 -0400] "GET /iudl/archives/cushman/screen/P03571.jpg HTTP/1.1" 302 261
    
  • It seems like the server stopped logging in April 2006; there are no records after that.
  • Log entries consist of HTTP redirect codes (302), so AWStats does not consider them hits. However, we can configure it to display PURL access statistics.
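
For illustration, the redirect entries shown above can also be counted outside AWStats. The Python sketch below parses common log format lines and tallies 302 responses. It assumes that the third and fourth path segments are the collection and the image size, which is inferred only from the sample entries above.

```python
import re
from collections import Counter

# NCSA/Apache common log format: host ident user [time] "request" status size
CLF_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def purl_redirect_counts(lines):
    """Count 302 redirects per (collection, image size) pair."""
    counts = Counter()
    for line in lines:
        m = CLF_RE.match(line.strip())
        if not m or m.group("status") != "302":
            continue
        parts = m.group("path").strip("/").split("/")
        # Assumed layout: /iudl/archives/<collection>/<size>/<file>
        if len(parts) >= 5:
            counts[(parts[2], parts[3])] += 1
    return counts


sample = [
    '65.6.79.180 - - [27/Apr/2006:23:00:09 -0400] "GET /iudl/archives/cushman/full/P10411.jpg HTTP/1.1" 302 259',
    '66.249.72.231 - - [27/Apr/2006:23:00:11 -0400] "GET /iudl/archives/cushman/screen/P03571.jpg HTTP/1.1" 302 261',
]
print(purl_redirect_counts(sample))
```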

Tomcat on rhyme (won't be used)

  • Tomcat 5.5, located in /opt/tomcat. I believe this Tomcat runs the infrastructure web apps such as IngestTool and PurlResolver.
  • Log files are in the logs directory.
  • This page describes the config options for configuring Apache/NCSA style logging in Tomcat: Tomcat access log config

Fedora Tomcat on rhyme (won't be used, this is dev)

  • Fedora is installed in /usr/local/fedora
  • Tomcat logs are in server/jakarta-tomcat-5.0.28/logs
  • Logs are in the NCSA Apache common log format:
    129.79.184.183 - - [11/Oct/2006:20:58:39 -0500] "GET /fedora/search?terms=iudl:10&xml=true&pid=true HTTP/1.0" 200 200
    129.79.184.183 - - [11/Oct/2006:21:05:09 -0500] "GET /fedora/search?terms=iudl:10&xml=true&pid=true HTTP/1.0" 200 200
    129.79.184.183 - - [11/Oct/2006:21:11:49 -0500] "GET /fedora/search?terms=iudl:10&xml=true&pid=true HTTP/1.0" 200 200
    

Tomcat on thalia (won't be used)

  • Similar config as rhyme Tomcat but located in /usr/local/tomcat

Fedora Tomcat on thalia (will be used for direct Fedora accesses)

  • The same config as rhyme
  • Log entries are like this:
    129.79.184.183 - - [14/Oct/2006:12:28:36 -0500] "GET /fedora/search?terms=iudl:10&xml=true&pid=true HTTP/1.0" 200 200
    156.56.241.27 - - [14/Oct/2006:12:33:34 -0500] "GET /fedora/get/iudl:19860/SCREEN HTTP/1.1" 200 58223
    156.56.241.27 - - [14/Oct/2006:12:34:21 -0500] "GET /fedora/get/iudl:19888/THUMBNAIL HTTP/1.1" 200 45609
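
As a sketch of how datastream accesses could be pulled out of entries like these, the Python below extracts the PID and datastream ID from "/fedora/get/<PID>/<datastream>" requests and counts accesses per datastream. The function names are hypothetical, and only the URL shape seen in the sample entries is handled.

```python
import re
from collections import Counter

# Matches requests of the form: GET /fedora/get/<PID>/<DATASTREAM_ID>
GET_RE = re.compile(r'"GET /fedora/get/(?P<pid>[^/\s]+)/(?P<dsid>[^/\s?]+)')


def parse_fedora_get(line):
    """Return (pid, datastream_id) for a datastream request, else None."""
    m = GET_RE.search(line)
    return (m.group("pid"), m.group("dsid")) if m else None


def datastream_counts(lines):
    """Count accesses per datastream ID (e.g. SCREEN, THUMBNAIL)."""
    counts = Counter()
    for line in lines:
        parsed = parse_fedora_get(line)
        if parsed:
            counts[parsed[1]] += 1
    return counts


line = ('156.56.241.27 - - [14/Oct/2006:12:33:34 -0500] '
        '"GET /fedora/get/iudl:19860/SCREEN HTTP/1.1" 200 58223')
print(parse_fedora_get(line))
```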
    

Apache web server on clio (won't be used, this is dev)

  • This is Apache+JServ. Apache access logs are stored in the /www/log/apache directory and JServ access logs in /www/log/apache-jserv
  • Logs are rolled monthly (compared to Tomcat's default daily rolling)
  • Logs are in the Apache combined format, including the referrer and the user agent fields.
  • AWStats is installed on clio

Apache web server on erato (webapp1) (will be used for main web app logs)

  • Apache+mod_jk?, Apache access logs are in /www/log/apache.
  • Rotated monthly
  • AWStats is installed and logs can be seen here

Statistics that are currently collected

Here's a list of statistics collected from erato (follow the links at the top to see the actual statistics pages)
Monthly statistics from the web server logs of erato, using AWStats:

  • Monthly
    • Unique visitors
    • Number of visits
    • Pages
    • Hits
    • Bandwidth
  • Daily
    • Number of visits
    • Pages
    • Hits
    • Bandwidth
  • By day of week
    • Pages
    • Hits
    • Bandwidth
  • Hourly
    • Pages
    • Hits
    • Bandwidth
  • Visitors by domain/country (doesn't work right now)
  • Visits from hosts
    • Top 10
    • all hosts
  • Bot/Spider visits (e.g. Googlebot, Askjeeves)
    • Top 10
    • All spiders/bots
  • Worms
  • Duration of visits
  • File types served (jpg, html, etc.)
  • Pages by URLs (e.g. /cushman/images/helpIcon.gif)
    • Top 10
    • Full list
  • Visitor operating systems
  • Visitor browser
  • Referred by application
    • direct address or bookmark
    • Links from newsgroups
    • Links from search engines
    • Other web sites
  • Search key phrases used to find the URL
    • Top 10
    • Full list
  • Search key words used to find the URL
    • Top 10
    • Full list
  • HTTP status codes
  • Total number of hits for:
    • Cushman Searches
    • Cushman Browses
    • Cushman Accesses
    • Sheet Music Accesses
    • Dido Searches
    • Dido Accesses

Combined statistics of collection accesses and searches (using erato web server logs)
Hits and Bandwidth statistics of:

  • Cushman Accesses
  • Cushman Browses
  • Cushman Searches
  • Dido Accesses
  • Dido Searches
  • Hoagy Carmichael Accesses
  • Hoagy Carmichael Browse
  • Hoagy Carmichael Searches
  • Hohenberger Accesses
  • Hohenberger Searches
  • Nuer Accesses
  • Sheet Music Accesses
  • Sheet Music Browse
  • Sheet Music Search
  • Steel Accesses
  • Steel Browses
  • Steel Searches
  • Swinburne Accesses
  • Swinburne Browses
  • Swinburne Searches
  • Victorian Women Writers Accesses
  • Victorian Women Writers Search
  • Wright American Fiction Accesses
  • Wright American Fiction Accesses (cont.)
  • Wright American Fiction Browse
  • Wright American Fiction Search