We will need to track various types of statistics, and each type will likely need to be tracked at a different location.
- The Ingest Tool will need to track ingests and updates to objects in the repository.
- The SRU Server will need to track searching information.
- Fedora (or the PURL resolver) will need to track access to actual media files.
- In some cases, individual collection webapps will need to track their own information.
Note on current statistics collection: For collections based on the searchWebapp, two types of detail pages are tracked. One includes "action=detail" in the URL. This indicates that the page was accessed as the result of a search, browse, or some other action that carries contextual information. The type of URL that does not include "action=detail" indicates that the detail page was accessed directly, probably via a PURL.
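The distinction above can be sketched in a few lines of Python. This is only an illustration of the rule described in the note, not the code the searchWebapp uses; the function name and the returned labels are made up for this example.

```python
from urllib.parse import urlparse, parse_qs


def classify_detail_access(url: str) -> str:
    """Classify a detail page URL by whether it carries action=detail.

    'contextual' means the page was reached from a search, browse, or
    similar action; 'direct' means it was probably reached via a PURL.
    Both labels are hypothetical names chosen for this sketch.
    """
    query = parse_qs(urlparse(url).query)
    return "contextual" if "detail" in query.get("action", []) else "direct"
```

For example, a detail.do URL with "action=detail" in its query string is classified as "contextual", and the same URL without that parameter as "direct".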
Summary of current directions
We'll be using a combination of server-level access statistics and application-level statistics. Discussions are under way to establish a DLP-level statistics framework that will be used across applications. Server-level statistics will be produced by AWStats from:
- Web application logs
- Fedora web application logs
The main source of access logs for web applications will be the production web servers. These logs will be processed using AWStats. In these web applications, individual items are served by detail page processors, so the AWStats output will contain entries like http://erato.dlib.indiana.edu/slocum/results/detail.do?action=detail&fullItemID=/lilly/slocum/LL-SLO-011279. We might need a way to aggregate these as "Slocum detail accesses" without referring to the actual item being served. We can also have aggregate statistics such as "Slocum Browses", etc. This can be accomplished by customizing the AWStats configuration (todo: add a section that describes how this is done in AWStats). This is already done for the webapp1 statistics (see the link at the top of the page). Here's an example: awstats.erato.html (note the extra sections at the end of the file). We can also have collection-level AWStats statistics; see awstats.erato-slocum.html for an example for the Slocum Puzzles collection.
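To make the aggregation idea concrete, here is a rough Python sketch of the grouping logic. It is not AWStats configuration and not how the webapp1 statistics are produced. It assumes that the first path segment of a URL names the collection and that detail pages end in detail.do, as in the Slocum example above; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlparse


def aggregate_label(url: str) -> Optional[str]:
    """Map a detail page URL to a per-collection label, ignoring the item ID.

    Assumption: the first path segment is the collection name and detail
    pages are served by a resource named detail.do.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2 or segments[-1] != "detail.do":
        return None
    return f"{segments[0]} detail accesses"


def count_detail_accesses(urls: Iterable[str]) -> Counter:
    """Count detail page hits per collection, skipping all other URLs."""
    counts: Counter = Counter()
    for url in urls:
        label = aggregate_label(url)
        if label is not None:
            counts[label] += 1
    return counts
```

Two Slocum detail URLs with different fullItemID values would both be counted under "slocum detail accesses", which is the kind of item-independent total described above.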
We can also gather some statistics from the Fedora Tomcat logs. Tomcat logging needs to be enabled because it might be disabled by default. AWStats can be used to process these logs as well. We can customize AWStats to aggregate accesses to datastreams for image items, such as thumbnail, large or screen. However, the log entries don't have readable identifiers, only Fedora PIDs (see below for a sample log entry). This will make interpreting the statistics much more difficult, since no collection name or image label is logged. Here's an example of AWStats output generated from Fedora Tomcat logs on thalia: awstats.fed-thalia.html (see also the customized access section at the end of the attached HTML file).
In short, using AWStats at the web application level, we can see which individual pages are accessed. We can use the extensibility capabilities of AWStats to calculate some aggregated statistics similar to the ones currently produced. On the back end, at the Fedora level, we can access raw item-level logs, but making sense of these will require a mapping between Fedora PIDs and item labels (or titles) and their parent collections.
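The PID mapping mentioned above could look roughly like the following Python sketch. Everything specific here is an assumption: the request path shape /fedora/get/<PID>/<datastream ID>, the function name, and the idea that a PID index (PID to collection and label) is available from somewhere such as the repository itself.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed request path shape: /fedora/get/<PID>/<datastream ID>
DATASTREAM_PATH = re.compile(r"^/fedora/get/(?P<pid>[^/]+)/(?P<dsid>[^/?]+)")


def summarize_datastream_hits(
    paths: Iterable[str],
    pid_index: Dict[str, Tuple[str, str]],
) -> Counter:
    """Count datastream requests keyed by (collection, item label, datastream).

    pid_index maps a Fedora PID to (collection name, item label). PIDs that
    are missing from the index are reported under 'unknown collection' with
    the raw PID as the label, so no hits are silently dropped.
    """
    counts: Counter = Counter()
    for path in paths:
        match = DATASTREAM_PATH.match(path)
        if match is None:
            continue  # not a datastream request
        pid, dsid = match["pid"], match["dsid"]
        collection, label = pid_index.get(pid, ("unknown collection", pid))
        counts[(collection, label, dsid)] += 1
    return counts
```

With an index in place, the counts can be rolled up per collection or per datastream type (thumbnail, screen, large) instead of per opaque PID.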
Sample URLs for accessing collections
- Item Access:
- (Search and browse is handled by another system right now)
- Item Access:
- (note: full access not logged?)
- Privacy: Visitor IP numbers cannot be made public and there are rules for how long logs can and should be stored.
- There are different requirements for different collections. (Randall) IUScholarWorks needs to report to publishers and authors how many readers accessed their documents. (Michelle) We might need a general framework to profile user navigation in dlib applications to be used in usability studies. There is a general requirement for funding and internal reporting purposes to gather usage statistics.
How we use the collected statistics:
Statistics collection in the Fedora Repository
There are several levels where statistics can be collected with different granularities.
At the web application server level, we can collect statistics similar to the ones that are currently collected. These statistics are at the HTML-page and image/object level. Every request from a user's browser for a web page creates a server log entry for the HTML page and one or more log entries for the objects (e.g. images on the page) referenced from the page. One drawback of this scheme is that logos and other static images that are displayed on every page will get a high number of hits. Most of the time, the most meaningful "hit" is a request for an HTML page. AWStats can be configured to disregard log records caused by image (or CSS, etc.) accesses. Since log collection is handled by the web server, this type of statistics collection, accompanied by log display software like AWStats, requires much less effort to implement.
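The page-versus-object distinction can be illustrated with a small Python filter over common log format lines. This is a sketch of the idea only; AWStats has its own configuration for this. The list of static extensions and the function name are assumptions.

```python
import posixpath
import re
from urllib.parse import urlparse

# Common log format: host ident user [time] "METHOD target PROTOCOL" status size
CLF = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumed set of extensions to treat as static objects rather than pages.
STATIC_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}


def is_page_hit(line: str) -> bool:
    """Return True if the log line is a request for a page, not a static file."""
    match = CLF.match(line)
    if match is None:
        return False  # unparseable lines are not counted
    path = urlparse(match["target"]).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in STATIC_EXTENSIONS
```

A request for /cushman/images/helpIcon.gif would be dropped, while a request for a detail.do page would be kept as a meaningful hit.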
We'll be using PURLs instead of direct links to Fedora objects. So, if we collect the statistics from the PurlResolve service (which matches Fedora PIDs and PURL/Item IDs), we might be able to solve some of the problems above. Note that web statistics collection is usually not precise due to the statelessness of HTTP and caching.
Search requests need to be collected, too. Those should be collected by the search service.
Statistics from different statistics collectors might need to be combined in some situations. For example, collection front pages (e.g. http://www.dlib.indiana.edu/collections/slocum/) are not served by the PURL service but should be included in the respective collection's statistics.
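Combining per-collection counts from several collectors is simple once each collector reports its counts keyed by collection. A minimal Python sketch, assuming each source is a mapping from collection name to hit count (the function name and the data shape are hypothetical):

```python
from collections import Counter
from typing import Mapping


def combine_collection_counts(*sources: Mapping[str, int]) -> Counter:
    """Sum per-collection hit counts reported by different collectors."""
    total: Counter = Counter()
    for source in sources:
        total.update(source)  # Counter.update adds counts, it does not replace
    return total
```

For instance, front page hits from the web server logs and item hits from the PURL service could be passed in as two mappings and summed per collection.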
One advantage of using a tool like AWStats is that it can nicely present the data in tables and charts. We can write small scripts to preprocess the web server logs and pass these processed portions to AWStats for analysis and presentation.
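As an illustration of such a preprocessing script, the Python sketch below selects only the log lines that belong to one collection, so that the result could be handed to AWStats as a smaller log. It assumes the collection is identified by the first path segment of the request (e.g. /slocum/...); the function name is made up.

```python
import re
from typing import Iterable, Iterator

# The request is the first quoted field: "METHOD target PROTOCOL"
REQUEST = re.compile(r'"\S+ (?P<target>\S+)')


def select_collection_lines(lines: Iterable[str], collection: str) -> Iterator[str]:
    """Yield only the log lines whose request path starts with /<collection>/."""
    prefix = f"/{collection}/"
    for line in lines:
        match = REQUEST.search(line)
        if match and match["target"].startswith(prefix):
            yield line
```

The selected lines are passed through unchanged, so they stay in the original log format and can be written to a separate file for analysis.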
Here's an example output from AWStats: awstats.erato.html
Web/Application server logs
Purl Server on gigue (won't be used, webapp links don't go thru this)
- This is an NCSA httpd server located in /usr/local/etc/purl/current
- Server log files (access_log and error_log) are in WebRoot/logs
- Log files are in the NCSA/Apache common log format:
- It seems the server stopped logging in April 2006; there are no records after that.
- Log entries consist of HTTP redirect codes (302), so AWStats does not count them as hits. However, we can configure it to display PURL access statistics.
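Since PURL accesses show up as 302 redirects, they can also be counted directly from the log. The Python sketch below does that for common log format lines; it is an illustration only, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Request field followed by the status code: "METHOD target PROTOCOL" status
REQUEST_STATUS = re.compile(r'"\S+ (?P<target>\S+)[^"]*" (?P<status>\d{3})\s')


def count_purl_redirects(lines: Iterable[str]) -> Counter:
    """Count requests answered with HTTP 302, keyed by the requested path."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_STATUS.search(line)
        if match and match["status"] == "302":
            counts[match["target"]] += 1
    return counts
```

Each key in the result is a requested PURL path, so the totals give per-PURL access counts even though a hit-based report would skip these entries.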
Tomcat on rhyme (won't be used)
- Tomcat 5.5, located in /opt/tomcat. I believe this Tomcat runs the infrastructure web apps such as IngestTool and PurlResolver
- Log files are in the logs directory.
- This page describes the config options for configuring Apache/NCSA style logging in Tomcat: Tomcat access log config
Fedora Tomcat on rhyme (won't be used, this is dev)
- Fedora is installed in /usr/local/fedora
- Tomcat logs are in server/jakarta-tomcat-5.0.28/logs
- Logs are in the NCSA Apache common log format:
Tomcat on thalia (won't be used)
- Similar config to the rhyme Tomcat, but located in /usr/local/tomcat
Fedora Tomcat on thalia (will be used for direct Fedora accesses)
- The same config as rhyme
- Log entries are like this:
Apache web server on clio (won't be used, this is dev)
- This is Apache+JServ. Apache access logs are stored in the /www/log/apache directory and JServ access logs in /www/log/apache-jserv
- Logs are rolled monthly (compared to Tomcat's default daily rolling)
- Logs are in the Apache combined format, including the referrer and the user agent fields.
- AWStats is installed on clio
Apache web server on erato (webapp1) (will be used for main web app logs)
- Apache+mod_jk?, Apache access logs are in /www/log/apache.
- Rotated monthly
- AWStats is installed and logs can be seen here
Statistics that are currently collected
Here's a list of statistics collected from erato (follow the links at the top to see the actual statistics pages)
Monthly statistics from the web server logs of erato, using AWStats:
- Unique visitors
- Number of visits
- Number of visits
- By day of week
- Visitors by domain/country (doesn't work right now)
- Visits from hosts
- Top 10
- all hosts
- Bot/Spider visits (e.g. Googlebot, Askjeeves)
- Top 10
- All spiders/bots
- Duration of visits
- File types served (jpg, html, etc.)
- Pages by URLs (e.g. /cushman/images/helpIcon.gif)
- Top 10
- Full list
- Visitor operating systems
- Visitor browser
- Referred by application
- direct address or bookmark
- Links from newsgroups
- Links from search engines
- Other web sites
- Search key phrases used to find the URL
- Top 10
- Full list
- Search key words used to find the URL
- Top 10
- Full list
- HTTP status codes
- Total number of hits for:
- Cushman Searches
- Cushman Browses
- Cushman Accesses
- Sheet Music Accesses
- Dido Searches
- Dido Accesses
Combined statistics of collection accesses and searches (using erato web server logs)
- Hits and bandwidth statistics of:
- Hoagy Carmichael Accesses
- Hoagy Carmichael Browse
- Hoagy Carmichael Searches
- Sheet Music Accesses
- Sheet Music Browse
- Sheet Music Search
- Victorian Women Writers Accesses
- Victorian Women Writers Search
- Wright American Fiction Accesses
- Wright American Fiction Accesses (cont.)
- Wright American Fiction Browse
- Wright American Fiction Search