NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Removing cruft from TEI XML files

RemoveCruft cleans Text Creation Partnership (TCP) TEI XML files. Among other things, it replaces long "s" characters (ſ) with regular "s", removes brace-enclosed entities and certain superscripts, and splits ligatures into separate characters.
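The kind of character-level cleanup described above can be sketched in a few lines of Python. This is only an illustration, not MorphAdorner's implementation: the function name clean_text, the ligature table, and the brace pattern are all assumptions, and the sketch works on a plain string rather than on parsed XML.

```python
import re

# Hypothetical ligature table (Unicode Alphabetic Presentation Forms).
# The set of ligatures RemoveCruft actually splits may differ.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, with the long s normalized
    "\ufb06": "st",
}

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S

# Assumption: a "brace-enclosed entity" is any run of text between { and }.
BRACE_ENTITY = re.compile(r"\{[^{}]*\}")


def clean_text(text):
    """Return text with long s replaced, ligatures split, and brace entities removed."""
    text = text.replace(LONG_S, "s")
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    return BRACE_ENTITY.sub("", text)
```

Calling clean_text on a string containing long "s" characters and an "fi" ligature returns the same words spelled with ordinary letters.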

Usage:

removecruft outputdirectory superscriptmap.tab input1.xml input2.xml ...

where

  • outputdirectory is the directory to which the cleaned XML files are written.
  • superscriptmap.tab is a two-column tab-separated file. The first column contains tokens with tagged superscript characters. The second column contains the replacement tokens, in which the tagged superscripts are replaced by Unicode superscript characters. This file can be empty if no replacements are wanted.
  • input*.xml are the input TEI XML files.
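As a rough illustration of the two-column format described for superscriptmap.tab, the following Python sketch reads such a file into a dictionary. The function name load_superscript_map and the sample token are hypothetical. In particular, the form "y^e" is a made-up stand-in, since the exact tagged form of the tokens in the first column is not shown here.

```python
def load_superscript_map(lines):
    """Build a token -> replacement dict from tab-separated lines.

    Blank lines are skipped, so an empty file yields an empty map,
    which means no replacements are made.
    """
    mapping = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 2 tab-separated columns")
        token, replacement = parts
        mapping[token] = replacement
    return mapping


# Made-up sample entry: "y^e" stands in for a token with a tagged superscript,
# and the replacement uses U+1D49 (MODIFIER LETTER SMALL E).
sample = ["y^e\ty\u1d49\n"]
superscripts = load_superscript_map(sample)
```

To read a real file, open it with an explicit encoding, for example open("superscriptmap.tab", encoding="utf-8"), and pass the file object to load_superscript_map.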