public class AdornWithNamedEntities
extends java.lang.Object
AdornWithNamedEntities adorns texts with named entities such as person, location, time, date, and organization.
Usage:
java edu.northwestern.at.morphadorner.tools.namedentities.AdornWithNamedEntities outputdirectory input1.xml input2.xml ...
outputdirectory -- output directory to receive xml files adorned with named entities.
input*.xml -- input TEI XML files.
Note: The named entity adorner does not always recognize entities that cross soft tags. Thus "<hi>Emma</hi> Woodhouse" may be recognized as two separate entities. AdornWithNamedEntities should be run on the input files before they are submitted to MorphAdorner.
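The usage line above puts the output directory first and the input files after it. A minimal sketch of reading that argument layout follows. The class name UsageArgsSketch, its helper methods, and the constant LEADING_PARAMS are hypothetical and are not part of MorphAdorner; the value 1 is an assumption drawn from the usage line, not from the value of INITPARAMS.

```java
import java.util.Arrays;

public class UsageArgsSketch {
    // Assumption: exactly one parameter (the output directory) precedes the input files.
    static final int LEADING_PARAMS = 1;

    /** Returns true when an output directory and at least one input file are given. */
    static boolean hasEnoughArgs(String[] args) {
        return args != null && args.length > LEADING_PARAMS;
    }

    /** Returns the output directory, taken from the first argument. */
    static String outputDirectory(String[] args) {
        return args[0];
    }

    /** Returns the input file names that follow the leading parameters. */
    static String[] inputFiles(String[] args) {
        return Arrays.copyOfRange(args, LEADING_PARAMS, args.length);
    }

    public static void main(String[] args) {
        String[] sample = {"adorned", "input1.xml", "input2.xml"};
        if (hasEnoughArgs(sample)) {
            System.out.println("Output directory: " + outputDirectory(sample));
            System.out.println("Input files: " + Arrays.toString(inputFiles(sample)));
        }
    }
}
```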
Modifier and Type | Field and Description |
---|---|
protected static Annie | annie: Annie annotator. |
protected static int | currentDocNumber: Current document. |
protected static int | docsToProcess: Number of documents to process. |
protected static org.w3c.dom.Document | document: DOM document. |
protected static java.util.List<PatternReplacer> | fixupsList: Fixups list. |
protected static java.lang.String | fixupsURL: Fixups list resource URL. |
protected static int | INITPARAMS: Number of parameters before the input file specifications. |
protected static java.lang.String | outputDirectory: Output directory. |
protected static java.lang.String | teiHeaderPattern: TEI header element pattern. |
Modifier | Constructor and Description |
---|---|
protected | AdornWithNamedEntities(): Allow overrides but not instantiation. |
Modifier and Type | Method and Description |
---|---|
protected static java.lang.String | addNamedEntities(java.lang.String text): Adorn text with named entities. |
protected static java.lang.String | applyFixups(java.lang.String text): Apply fixups. |
protected static org.w3c.dom.Node | findTextNodesParent(org.w3c.dom.Document document): Find parent of text nodes in a DOM document. |
protected static boolean | initialize(java.lang.String[] args): Initialize. |
protected static boolean | loadFixups(): Load fixup definitions. |
static void | main(java.lang.String[] args): Main program. |
protected static int | processFiles(java.lang.String[] args): Process files. |
protected static void | processOneFile(java.lang.String xmlFileName): Process one file. |
protected static java.lang.String[] | splitDocumentText(java.lang.String docText, java.lang.String splitString): Split document text. |
protected static void | terminate(int filesProcessed, long processingTime): Terminate. |
protected static void | traverse(org.w3c.dom.Node node): Traverse DOM tree and fix quotes. |
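The summary for traverse(org.w3c.dom.Node) describes walking the DOM tree and fixing quotes. A minimal sketch of such a walk follows, using only the standard org.w3c.dom API. This page does not say what "fix quotes" involves, so the sketch assumes, purely for illustration, that typographic quotes in text nodes are replaced with ASCII quotes. The class TraverseSketch and its parse helper are hypothetical and do not reproduce the MorphAdorner implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TraverseSketch {
    /** Visits the given node and all of its descendants, depth first. */
    static void traverse(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Assumed quote fix: map typographic quotes to ASCII quotes.
            String fixed = node.getNodeValue()
                .replace('\u201C', '"')
                .replace('\u201D', '"')
                .replace('\u2018', '\'')
                .replace('\u2019', '\'');
            node.setNodeValue(fixed);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            traverse(child);
        }
    }

    /** Parses an XML string into a DOM document (hypothetical helper). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<text><p>\u201CEmma\u201D <hi>Woodhouse</hi></p></text>");
        traverse(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Because the walk rewrites only text nodes, element structure such as the <hi> soft tag is left in place.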
protected static org.w3c.dom.Document document
protected static final int INITPARAMS
protected static int docsToProcess
protected static int currentDocNumber
protected static java.lang.String outputDirectory
protected static Annie annie
protected static java.lang.String fixupsURL
protected static java.util.List<PatternReplacer> fixupsList
protected static final java.lang.String teiHeaderPattern
protected AdornWithNamedEntities()
public static void main(java.lang.String[] args)
args - Program parameters.
protected static boolean initialize(java.lang.String[] args)
protected static boolean loadFixups()
protected static void processOneFile(java.lang.String xmlFileName)
xmlFileName - XML input file name.
protected static int processFiles(java.lang.String[] args)
protected static void terminate(int filesProcessed, long processingTime)
filesProcessed - Number of files processed.
processingTime - Processing time in seconds.
protected static void traverse(org.w3c.dom.Node node)
node - Root node of tree.
protected static java.lang.String addNamedEntities(java.lang.String text)
text - The text.
protected static java.lang.String applyFixups(java.lang.String text)
text - The text to which to apply fixups.
protected static java.lang.String[] splitDocumentText(java.lang.String docText, java.lang.String splitString)
docText - The document text.
splitString - The regular expression string at which to split the document. If this appears more than once, the document is split at the first appearance.
protected static org.w3c.dom.Node findTextNodesParent(org.w3c.dom.Document document)
document - The document.
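The splitString parameter of splitDocumentText is described as a regular expression, with the split made at its first appearance. A minimal sketch of splitting at the first match follows, using java.util.regex. The class SplitSketch and the method splitAtFirstMatch are hypothetical, and the three-part result (text before the match, the match itself, text after it) is an assumption; this page does not state how many pieces the real method returns or what they contain.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSketch {
    /**
     * Splits docText at the first match of splitRegex.
     * Assumed result layout: {before match, matched text, after match}.
     * If nothing matches, the whole text is returned as the first part.
     */
    static String[] splitAtFirstMatch(String docText, String splitRegex) {
        Matcher matcher = Pattern.compile(splitRegex).matcher(docText);
        if (!matcher.find()) {
            return new String[] {docText, "", ""};
        }
        // find() stops at the first match, so later matches are left untouched.
        return new String[] {
            docText.substring(0, matcher.start()),
            matcher.group(),
            docText.substring(matcher.end())
        };
    }

    public static void main(String[] args) {
        String doc = "<TEI><teiHeader>h</teiHeader><text>a</text></TEI>";
        // Illustrative pattern only; not the value of teiHeaderPattern.
        String[] parts = splitAtFirstMatch(doc, "(?s)<teiHeader.*?</teiHeader>");
        System.out.println(Arrays.toString(parts));
    }
}
```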