edu.northwestern.at.morphadorner.tools.namedentities
Class AdornWithNamedEntities

java.lang.Object
  extended by edu.northwestern.at.morphadorner.tools.namedentities.AdornWithNamedEntities

public class AdornWithNamedEntities
extends java.lang.Object

Adorn XML files with named entities.

AdornWithNamedEntities adorns texts with named entities such as person, location, time, date, and organization.

Usage:

        java edu.northwestern.at.morphadorner.tools.namedentities.AdornWithNamedEntities outputdirectory input1.xml input2.xml ...
        

outputdirectory -- output directory to receive the XML files adorned with named entities.
input*.xml -- input TEI XML files.

Note: The named entity adorner does not always recognize entities that cross soft tags. For example, "<hi>Emma</hi> Woodhouse" may be recognized as two separate entities. Run AdornWithNamedEntities on the input files before submitting them to MorphAdorner.


Field Summary
protected static Annie annie
          Annie annotator.
protected static int currentDocNumber
          Number of the current document.
protected static int docsToProcess
          Number of documents to process.
protected static org.w3c.dom.Document document
          DOM document.
protected static java.util.List<PatternReplacer> fixupsList
          Fixups list.
protected static java.lang.String fixupsURL
          Fixups list resource URL.
protected static int INITPARAMS
          Number of parameters before the input file specifications.
protected static java.lang.String outputDirectory
          Output directory.
protected static java.lang.String teiHeaderPattern
          TEI header element pattern.
 
Constructor Summary
protected AdornWithNamedEntities()
          Allow overrides but not instantiation.
 
Method Summary
protected static java.lang.String addNamedEntities(java.lang.String text)
          Adorn text with named entities.
protected static java.lang.String applyFixups(java.lang.String text)
          Apply fixups.
protected static org.w3c.dom.Node findTextNodesParent(org.w3c.dom.Document document)
          Find parent of text nodes in a DOM document.
protected static boolean initialize(java.lang.String[] args)
          Initialize.
protected static boolean loadFixups()
          Load fixup definitions.
static void main(java.lang.String[] args)
          Main program.
protected static int processFiles(java.lang.String[] args)
          Process files.
protected static void processOneFile(java.lang.String xmlFileName)
          Process one file.
protected static java.lang.String[] splitDocumentText(java.lang.String docText, java.lang.String splitString)
          Split document text.
protected static void terminate(int filesProcessed, long processingTime)
          Terminate.
protected static void traverse(org.w3c.dom.Node node)
          Traverse DOM tree and fix quotes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

document

protected static org.w3c.dom.Document document
DOM document.


INITPARAMS

protected static final int INITPARAMS
Number of parameters before the input file specifications.

See Also:
Constant Field Values

docsToProcess

protected static int docsToProcess
Number of documents to process.


currentDocNumber

protected static int currentDocNumber
Number of the current document.


outputDirectory

protected static java.lang.String outputDirectory
Output directory.


annie

protected static Annie annie
Annie annotator.


fixupsURL

protected static java.lang.String fixupsURL
Fixups list resource URL.


fixupsList

protected static java.util.List<PatternReplacer> fixupsList
Fixups list.


teiHeaderPattern

protected static final java.lang.String teiHeaderPattern
TEI header element pattern.

See Also:
Constant Field Values

Constructor Detail

AdornWithNamedEntities

protected AdornWithNamedEntities()
Allow overrides but not instantiation.

Method Detail

main

public static void main(java.lang.String[] args)
Main program.

Parameters:
args - Program parameters.

initialize

protected static boolean initialize(java.lang.String[] args)
Initialize.


loadFixups

protected static boolean loadFixups()
Load fixup definitions.


processOneFile

protected static void processOneFile(java.lang.String xmlFileName)
Process one file.

Parameters:
xmlFileName - XML input file name.

processFiles

protected static int processFiles(java.lang.String[] args)
Process files.


terminate

protected static void terminate(int filesProcessed,
                                long processingTime)
Terminate.

Parameters:
filesProcessed - Number of files processed.
processingTime - Processing time in seconds.

traverse

protected static void traverse(org.w3c.dom.Node node)
Traverse DOM tree and fix quotes.

Parameters:
node - Root node of tree.
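
A recursive DOM traversal of the kind described here might look like the following sketch. The description does not say what "fix quotes" involves, so the replacement of curly double quotes with straight ones is only an assumption made for illustration; the class name TraverseSketch and the helper parse are hypothetical and are not part of MorphAdorner.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TraverseSketch {
    /**
     * Visits the node and all of its descendants, depth first.
     * Assumption: "fixing quotes" means replacing curly double quotes
     * with straight ones in text nodes.
     */
    static void traverse(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String value = node.getNodeValue();
            node.setNodeValue(value.replace('\u201C', '"').replace('\u201D', '"'));
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            traverse(child);
        }
    }

    /** Hypothetical helper: parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>\u201CEmma\u201D <hi>said</hi></p>");
        traverse(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Only text nodes are modified, so element structure and attributes are left untouched.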

addNamedEntities

protected static java.lang.String addNamedEntities(java.lang.String text)
Adorn text with named entities.

Parameters:
text - The text.
Returns:
The adorned text, or null if annotation could not be done.

applyFixups

protected static java.lang.String applyFixups(java.lang.String text)
Apply fixups.

Parameters:
text - The text to which to apply fixups.
Returns:
The text after applying fixups.
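
As a rough illustration of what applying a list of fixups could involve, the sketch below runs a text through an ordered list of regular-expression replacements. It is not the MorphAdorner implementation: the nested Fixup class is a hypothetical stand-in for PatternReplacer, whose API is not shown on this page, and the two example rules are invented, since the real definitions come from the resource named by fixupsURL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class FixupSketch {
    /** Hypothetical stand-in for a pattern/replacement pair. */
    static final class Fixup {
        final Pattern pattern;
        final String replacement;

        Fixup(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }

        String apply(String text) {
            return pattern.matcher(text).replaceAll(replacement);
        }
    }

    /** Invented example rules, for illustration only. */
    static List<Fixup> exampleFixups() {
        return Arrays.asList(
            new Fixup("\\s{2,}", " "),   // collapse runs of whitespace
            new Fixup("``|''", "\"")     // normalize typewriter-style quotes
        );
    }

    /** Applies each fixup in list order; later rules see earlier results. */
    static String applyFixups(String text, List<Fixup> fixups) {
        String result = text;
        for (Fixup fixup : fixups) {
            result = fixup.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(applyFixups("Emma  said ``yes''", exampleFixups()));
    }
}
```

Because the rules run in sequence, the order of the list can change the result when one rule's output matches another rule's pattern.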

splitDocumentText

protected static java.lang.String[] splitDocumentText(java.lang.String docText,
                                                      java.lang.String splitString)
Split document text.

Parameters:
docText - The document text.
splitString - The regular expression string at which to split the document. If this appears more than once, the document is split at the first appearance.
Returns:
Two-element string array. [0] = document text up to the first appearance of the split string; empty if the split string is not found. [1] = document text right after the start of the split string through the end of the document.
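
The splitting behavior described above can be sketched with java.util.regex. This is a minimal sketch, not the actual method body. It assumes that element [1] begins at the start of the first match, and that it holds the whole text when the split string is not found; the description does not settle either point. The class name SplitSketch, the method name splitAtFirstMatch, and the sample pattern are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitSketch {
    /**
     * Splits docText at the first match of the regular expression splitString.
     * Assumptions: [1] starts at the beginning of the match, and [1] is the
     * whole text when there is no match.
     */
    static String[] splitAtFirstMatch(String docText, String splitString) {
        Matcher matcher = Pattern.compile(splitString).matcher(docText);
        if (!matcher.find()) {
            return new String[] { "", docText };
        }
        int start = matcher.start();
        return new String[] { docText.substring(0, start), docText.substring(start) };
    }

    public static void main(String[] args) {
        String doc = "<TEI><teiHeader>meta</teiHeader><text>Emma</text></TEI>";
        String[] parts = splitAtFirstMatch(doc, "<text[\\s>]");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Only the first match is used, so later occurrences of the pattern stay inside element [1].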

findTextNodesParent

protected static org.w3c.dom.Node findTextNodesParent(org.w3c.dom.Document document)
Find parent of text nodes in a DOM document.

Parameters:
document - The document.
Returns:
Node which is parent of the text nodes.