This page is stable, but sections are still in progress and/or being updated. Certain links to specific DLP project information are restricted to Digital Library Program staff.
A Perl script has been developed to concatenate the text files produced by OCR in preparation for TEI encoding. The script creates a TEI "shell" file that contains a TEI header with boilerplate information particular to a given project, followed by the full text wrapped in anonymous block (<ab>) tags. Page break (<pb>) tags with corresponding ID attributes are inserted automatically. Encoders then update the header and encode the text according to the project's encoding guidelines.
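The following is a minimal sketch of the idea described above, not the actual DLP script. The sub names build_tei_shell and xml_escape, the <TEI> root element, the xml:id attribute, and the ID scheme (item ID plus a zero-padded page counter) are all assumptions made for illustration. The real script reads its pages from OCR text files and prints project-specific header boilerplate.

```perl
use strict;
use warnings;

# Escape the characters that cannot appear as-is in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a TEI shell from an item ID and a list of OCR page texts.
# The header here is a placeholder, not a real project header.
sub build_tei_shell {
    my ($id, @pages) = @_;
    my $out = "<TEI>\n<teiHeader><!-- project boilerplate goes here --></teiHeader>\n<text>\n<body>\n";
    my $n = 0;
    for my $page (@pages) {
        $n++;
        # Hypothetical ID scheme: item ID plus a zero-padded page counter.
        $out .= sprintf(qq{<pb xml:id="%s-%03d"/>\n}, $id, $n);
        $out .= "<ab>" . xml_escape($page) . "</ab>\n";
    }
    $out .= "</body>\n</text>\n</TEI>\n";
    return $out;
}

print build_tei_shell('VAA0001', "First page text", "Fish & chips");
```

Each page becomes one <pb> followed by one <ab>, which leaves the structural encoding to the encoder, as the description above suggests.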
Step 1: Download the Concatenation Script
- Download the Perl script.
- Upload the Perl (.pl) file to your home account on http://bleu.dlib.indiana.edu or another Unix server.
- Make the script executable by running the command: chmod 755 filename.pl
Step 2: Customize Concatenation Script per Project
The teiHeader information will likely change from project to project. Update the subroutine print_header to reflect the project-specific teiHeader, and make sure that the corresponding closing tags in print_footer are correct. Inside a double-quoted Perl string, use a backslash ( \ ) to escape any double quote ("), as well as any dollar sign ($) or at sign (@) that should be printed literally rather than interpolated as a variable. A backslash before a forward slash ( / ) or a dot ( . ) is harmless in these strings but not required.
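As an illustration of the quoting rules, here is a small sketch that compares three ways of producing the same line of header markup. The value of $issue is hypothetical, and whether the existing script uses qq{} or here-documents is not stated on this page. They are shown only as standard Perl alternatives that reduce the amount of escaping.

```perl
use strict;
use warnings;

my $issue = 3;   # hypothetical value, for illustration only

# Double-quoted string: the inner double quotes must be escaped.
my $escaped = "<biblScope type=\"issue\">$issue</biblScope>\n";

# qq{} interpolates variables like double quotes do, but uses braces as
# delimiters, so the attribute quotes need no backslashes.
my $with_qq = qq{<biblScope type="issue">$issue</biblScope>\n};

# A here-document suits multi-line header boilerplate.
my $heredoc = <<"END_HEADER";
<biblScope type="issue">$issue</biblScope>
END_HEADER

print $escaped eq $with_qq ? "same output\n" : "different output\n";
```

For a long teiHeader, the here-document form may be the easiest to maintain, since the markup can be pasted in nearly unchanged. Only literal $ and @ characters would still need a backslash.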
- Copy the existing script for modifications. For example:
- Update teiHeader information in the subroutine print_header. For example:
print OUTF "<biblScope type=\"issue\">$issue<\/biblScope>\n"; (double quotes with backslash)
print OUTF "<biblScope type='issue'>$issue<\/biblScope>\n"; (single quotes around the attribute value; the string itself must stay in double quotes so that variables such as $issue are interpolated)
- Update VAA number regular expression
- The code snippet below shows the regular expression used to extract the VAA number from the folder path provided when executing the script:
if ($dirname =~ /(VAA[0-9]+)/) {
    $id = $1;
}
- For example, the regular expression to update the VAA# for the IMH is as follows:
if ($dirname =~ /(VAA\d\d\d\d-\d\d\d-\d)/) {
    $id = $1;
}
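To see why the pattern needs updating per project, the sketch below runs both regular expressions against sample paths. The helper extract_id and the IMH-style path are hypothetical and used only for illustration. The patterns themselves are the ones shown above, with \d{4} and \d{3} as shorthand for the repeated \d.

```perl
use strict;
use warnings;

# Return the first VAA identifier found in a path, or undef if none matches.
sub extract_id {
    my ($dirname, $pattern) = @_;
    return $dirname =~ $pattern ? $1 : undef;
}

my $simple = qr/(VAA[0-9]+)/;
my $imh    = qr/(VAA\d{4}-\d{3}-\d)/;   # same as \d\d\d\d-\d\d\d-\d

print extract_id('VAA3893/OCR', $simple), "\n";
print extract_id('data/VAA1234-056-7/OCR', $imh), "\n";   # hypothetical IMH-style path
```

Note that the simple pattern would still match the IMH-style path, but it would stop at the first hyphen and capture only VAA1234, so the resulting ID would be incomplete.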
Step 3: Execute the Concatenation Script
- Use the perl command, followed by the path to the Perl file, followed by the path to the folder containing the OCR text files. The resulting XML file will be placed in the VAA directory. For example:
perl ocr2tei.pl VAA0001/OCR
Another example: Assuming that the user is logged into bleu and the Perl file is in their home directory, the following command will concatenate the OCR files for an Indiana Authors book. The XML file will be named VAA3893.xml and will be saved in the VAA3893 directory.
perl ocr2tei.pl ../../digitize/data/Indiana/Indiana_Authors/TIFFs_for_unencoded_books/VAA3893/OCR
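The relationship between the folder argument and the output location can be sketched as follows. This is an assumption about how such a path could be derived, based only on the behavior described above (the XML file is named after the VAA number and saved in the VAA directory). The sub name output_path is hypothetical, and the actual script may compute the path differently.

```perl
use strict;
use warnings;

# Given the OCR folder path passed on the command line, work out where the
# XML file would go: <path up to and including the VAA folder>/<ID>.xml
sub output_path {
    my ($dirname) = @_;
    if ($dirname =~ m{^(.*?(VAA[0-9]+))(?:/|\z)}) {
        return "$1/$2.xml";
    }
    return;   # no VAA folder found in the path
}

print output_path('VAA0001/OCR'), "\n";
print output_path('../../digitize/data/Indiana/Indiana_Authors/TIFFs_for_unencoded_books/VAA3893/OCR'), "\n";
```

For a project with a different ID format, such as the IMH, the VAA[0-9]+ part of the pattern would need the same adjustment as the regular expression in Step 2.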