  OCR guidelines
Added by Michelle Dalmau, last edited by Michelle Dalmau on Jul 06, 2007  (view change)

OCR Guidelines for E-text Projects

Status

This page is a work in progress.

File Conventions

Whether full text is generated in-house or outsourced, the following are required:

  • Files must be UTF-8 encoded
  • Filenames should follow conventions as required by our infrastructure policy and as defined by the project:
    • (e.g., VAA#-page#.txt => VAA2222-001.txt)
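
The two requirements above can be checked before files are handed off for encoding. The following is a minimal sketch in Python, not part of the project tooling. The filename pattern is an assumption inferred from the single example VAA2222-001.txt (four-digit ID, three-digit page number); adjust it to the conventions defined for each project. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the single example VAA2222-001.txt:
# "VAA" + four digits, a hyphen, a three-digit page number, ".txt".
FILENAME_RE = re.compile(r"VAA\d{4}-\d{3}\.txt")


def is_valid_name(filename: str) -> bool:
    """Return True if the whole filename matches the assumed pattern."""
    return FILENAME_RE.fullmatch(filename) is not None


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_directory(directory: str) -> list:
    """Return a list of problem descriptions for the .txt files in a directory."""
    problems = []
    for path in sorted(Path(directory).glob("*.txt")):
        if not is_valid_name(path.name):
            problems.append(f"{path.name}: unexpected filename")
        if not is_utf8(path.read_bytes()):
            problems.append(f"{path.name}: not valid UTF-8")
    return problems
```

Calling check_directory("/data/VAA2345") would return an empty list if every .txt file in that directory passes both checks.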

Concatenating Files for Encoding

When generating full text in-house, the individual .txt files need to be concatenated for encoding.
Below are instructions for running a script that generates a P4 TEI shell document:

  • Place the script in an appropriately named directory on a Unix server; the XML/TEI file will inherit the directory name.
  • Run:
    java -cp $CLASSPATH:. OCR2TEI <path_to_directory_of_OCR_files>

The script creates an XML file in the directory where you ran the command, named after that directory. For example, if you ran the script in the /data/VAA2345 directory, you should find a newly created file called VAA2345.xml there.
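
To illustrate the kind of concatenation described here, the following Python sketch joins page files in filename order and wraps them in a bare TEI P4-style shell. It is not the OCR2TEI script, whose source is not shown on this page. The function names are hypothetical, and the choices to mark each page with a pb element and to leave the header empty are assumptions. The output is a starting shell only and is not validated against the P4 DTD.

```python
from pathlib import Path
from xml.sax.saxutils import escape


def build_tei_shell(ocr_dir: str) -> str:
    """Concatenate page files in filename order inside a minimal TEI.2 shell."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<TEI.2>",
        "<teiHeader><!-- to be completed by the encoder --></teiHeader>",
        "<text><body>",
    ]
    # Zero-padded page numbers mean a plain sort yields page order.
    for page in sorted(Path(ocr_dir).glob("*.txt")):
        # Assumes names like VAA2345-001.txt; the part after the last
        # hyphen is taken as the page number.
        page_no = page.stem.rsplit("-", 1)[-1]
        parts.append(f'<pb n="{page_no}"/>')
        # Escape &, < and > so the OCR text cannot break the XML.
        parts.append(escape(page.read_text(encoding="utf-8")))
    parts += ["</body></text>", "</TEI.2>"]
    return "\n".join(parts) + "\n"


def write_tei_shell(ocr_dir: str, work_dir: str = None) -> Path:
    """Write <work_dir name>.xml into work_dir (default: current directory)."""
    work = Path(work_dir) if work_dir else Path.cwd()
    out_path = work / f"{work.resolve().name}.xml"
    out_path.write_text(build_tei_shell(ocr_dir), encoding="utf-8")
    return out_path
```

Run from /data/VAA2345 with the OCR files in the same directory, write_tei_shell(".") would write /data/VAA2345/VAA2345.xml, mirroring the naming behavior described above.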

Quality Control

  • To be completed with Kara's input
