OCR Guidelines for E-text Projects
Status: This page is a work in progress.
File Conventions
When generating full text, whether in-house or outsourced, the following are required:
- Files should be encoded as UTF-8
- Filenames should follow the conventions required by our infrastructure policy and defined by the project (e.g., VAA#-page#.txt => VAA2222-001.txt)
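As a rough illustration of how these two requirements could be checked before files are accepted, here is a minimal Python sketch. The filename pattern is an assumption inferred from the single example above (three capital letters, digits, a hyphen, and a three-digit page number); the function names are hypothetical and the pattern should be adjusted to whatever the project actually defines.

```python
import re
from pathlib import Path

# Assumed pattern, inferred only from the example VAA2222-001.txt:
# three capital letters, one or more digits, a hyphen, a three-digit page number.
NAME_PATTERN = re.compile(r"[A-Z]{3}\d+-\d{3}\.txt")


def is_valid_name(filename: str) -> bool:
    """Return True if the filename matches the assumed project convention."""
    return NAME_PATTERN.fullmatch(filename) is not None


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_directory(directory: str) -> list:
    """Collect a human-readable problem report for every .txt file in a directory."""
    problems = []
    for path in sorted(Path(directory).glob("*.txt")):
        if not is_valid_name(path.name):
            problems.append(f"{path.name}: unexpected filename")
        if not is_utf8(path.read_bytes()):
            problems.append(f"{path.name}: not valid UTF-8")
    return problems
```

Calling check_directory on a delivery folder returns an empty list when every file passes both checks, so it can be run as a quick gate on outsourced batches.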
Concatenating Files for Encoding
When generating full text in-house, the individual .txt files need to be concatenated for encoding.
Below are instructions for running a script that generates a P4 TEI shell document:
- Place the script in an appropriately named directory on a Unix server; the XML/TEI file will inherit the directory name.
- Run:
The script creates an XML file in the directory where you ran the command. For example, if you ran the script in the /data/VAA2345 directory, you should find a newly created file called VAA2345.xml in that directory.
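To make the concatenation step more concrete, here is a hedged Python sketch of the general idea. It is not the project's script, whose internals are not documented on this page. It assumes that page files are named after the directory (e.g., VAA2345-001.txt), that page numbers are zero-padded so alphabetical order equals page order, and that a bare TEI.2 root with a placeholder comment instead of a real teiHeader is acceptable as a starting shell. The function name build_shell and the pb/p layout are illustrative choices only.

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def build_shell(directory: str) -> Path:
    """Concatenate page files into <directory name>.xml and return its path."""
    root = Path(directory).resolve()
    doc_id = root.name  # the output file inherits the directory name

    # Zero-padded page numbers sort correctly as plain strings (assumption).
    pages = sorted(root.glob(f"{doc_id}-*.txt"))

    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<TEI.2>",
        "<!-- placeholder: a real teiHeader must be supplied during encoding -->",
        "<text>",
        "<body>",
    ]
    for page in pages:
        page_number = page.stem.split("-")[-1]
        text = page.read_text(encoding="utf-8").strip()
        # Mark the page boundary, then add the escaped page text.
        parts.append(f"<pb n={quoteattr(page_number)}/>")
        parts.append(f"<p>{escape(text)}</p>")
    parts += ["</body>", "</text>", "</TEI.2>"]

    output = root / f"{doc_id}.xml"
    output.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return output
```

Under these assumptions, build_shell("/data/VAA2345") would write /data/VAA2345/VAA2345.xml, mirroring the naming behavior described above. The result is well-formed XML but is not expected to validate against the TEI P4 DTD until a proper header and structural markup are added.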
Quality Control
- To be completed with Kara's input