NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Merging AnnoLex corrections with adorned TEI XML

AnnoLex is a collaborative data curation tool for use with Text Creation Partnership texts. It supports identifying and correcting incompletely or incorrectly transcribed words. It can also be used to correct algorithmically applied lemmatization and part-of-speech tagging by hand. AnnoLex was developed by Craig Berry and Martin Mueller.

MergeAnnolexCorrectionsIntoAdornedXML merges corrections made in AnnoLex back into the source adorned TEI XML files.

Usage:

mergeannolexcorrectionsintoadornedxml correctionsdirectory outputdirectory inputfiles

where

  • correctionsdirectory is the input directory containing the AnnoLex correction files in tabular format.
  • outputdirectory is the output directory for the corrected adorned TEI XML files.
  • inputfiles specifies the input adorned XML files into which the AnnoLex-produced corrections are merged. These must be in the base adorned format, not the simplified TEI P5 format.

The corrections file is a tab-separated UTF-8 file containing the following columns:

  1. Work ID.
  2. Word ID.
  3. Old spelling.
  4. Corrected spelling.
  5. Standard spelling.
  6. Corrected lemmata.
  7. Corrected parts of speech.
  8. Operation: 1=update, 2=insert, 3=delete, 5=delete nearest gap.

The corrected spelling, lemmata, and parts of speech may all be empty when the operation is 3 (delete).
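The column layout above can be sketched as a small reader. This is only an illustration, not the code MorphAdorner itself uses: the `Correction` class, the `read_corrections` function, and the sample IDs are hypothetical, and the sketch assumes the file has no header row and no quoted fields.

```python
import csv
import io
from dataclasses import dataclass

# Operation codes exactly as listed in column 8 above.
OPERATIONS = {1: "update", 2: "insert", 3: "delete", 5: "delete nearest gap"}


@dataclass
class Correction:
    """One row of a corrections file (hypothetical container)."""
    work_id: str
    word_id: str
    old_spelling: str
    corrected_spelling: str
    standard_spelling: str
    lemmata: str
    parts_of_speech: str
    operation: int


def read_corrections(stream):
    """Parse tab-separated correction rows from an open text stream.

    Assumes eight columns per row and no header line.
    """
    corrections = []
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 8:
            raise ValueError(f"line {line_number}: expected 8 columns, got {len(row)}")
        operation = int(row[7])
        if operation not in OPERATIONS:
            raise ValueError(f"line {line_number}: unknown operation {operation}")
        corrections.append(Correction(*row[:7], operation))
    return corrections


# Two made-up rows: an update, and a delete whose corrected fields are empty.
sample = (
    "A00001\tA00001-000100\tteh\tthe\tthe\tthe\td\t1\n"
    "A00001\tA00001-000200\tx\t\t\t\t\t3\n"
)
for correction in read_corrections(io.StringIO(sample)):
    print(correction.word_id, OPERATIONS[correction.operation])
```

For a real file, the stream would be opened with `encoding="utf-8"` to match the format described above.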

The value of the "ord" (word ordinal) attribute for each word is adjusted to account for inserted and deleted words. The values of the "reg" (standard spelling) and "tok" (original token) attributes are generated as needed for updated and inserted words.
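The ordinal adjustment can be pictured with a minimal sketch. It assumes that words are held as dictionaries in document order and that ordinals are simply consecutive integers; the `renumber_ordinals` function and the `start` parameter are hypothetical, and the actual numbering scheme used by the tool may differ.

```python
def renumber_ordinals(words, start=1):
    """Reassign consecutive "ord" values to word records in document order.

    `words` is a list of dictionaries from which deleted words have already
    been removed and into which inserted words have already been placed.
    """
    for offset, word in enumerate(words):
        word["ord"] = start + offset
    return words


# Made-up example: the word with ord 2 was deleted and a new word was added.
words = [
    {"id": "w1", "ord": 1},
    {"id": "w3", "ord": 3},
    {"id": "w3a", "ord": None},  # newly inserted word, no ordinal yet
]
renumber_ordinals(words)
print([word["ord"] for word in words])
```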

Whitespace markers "<c> </c>" are added and deleted as needed when tokens are added or deleted. In general, added punctuation and symbols do not require added whitespace markers. When tokens are deleted, sequences of "<c> </c><c> </c> ..." are compressed to a single "<c> </c>" entry.
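The compression of repeated whitespace markers can be sketched with a regular expression. This is a simplified illustration that works on a text fragment and assumes the markers are directly adjacent and carry no attributes; the `compress_whitespace_markers` function is hypothetical, and the tool itself may well operate on parsed XML instead.

```python
import re

# Two or more directly adjacent whitespace markers.
_REPEATED_MARKERS = re.compile(r"(?:<c> </c>){2,}")


def compress_whitespace_markers(xml_fragment):
    """Collapse each run of adjacent "<c> </c>" markers into a single marker."""
    return _REPEATED_MARKERS.sub("<c> </c>", xml_fragment)


# Made-up fragment: a token between the two markers has been deleted.
fragment = "<w>a</w><c> </c><c> </c><w>b</w>"
print(compress_whitespace_markers(fragment))
```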
