public class MorphAdorner
extends java.lang.Object
Given an input text, the morphological adorner adorns each word with morphological information such as part of speech, lemma and standardized spelling.
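To make the idea of "adorning" concrete, the following is a toy sketch of attaching a part of speech, a lemma and a standardized spelling to each word. It does not use MorphAdorner's classes or algorithms: `AdornmentSketch`, `AdornedToken`, the tiny lexicon and the tag names are all hypothetical, chosen only to show the shape of the output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy illustration only: not MorphAdorner's classes or algorithm. */
final class AdornmentSketch {

    /** Hypothetical holder for the adornments attached to one word. */
    static final class AdornedToken {
        final String spelling;
        final String standardSpelling;
        final String lemma;
        final String partOfSpeech;

        AdornedToken(String spelling, String standardSpelling,
                     String lemma, String partOfSpeech) {
            this.spelling = spelling;
            this.standardSpelling = standardSpelling;
            this.lemma = lemma;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Hypothetical hand-built lexicon: spelling -> {standard spelling, lemma, tag}.
    private static final Map<String, String[]> LEXICON = new HashMap<>();
    static {
        LEXICON.put("loue", new String[] {"love", "love", "VERB"});
        LEXICON.put("dogges", new String[] {"dogs", "dog", "NOUN"});
    }

    /** Splits on whitespace and looks each word up in the toy lexicon. */
    static List<AdornedToken> adorn(String text) {
        List<AdornedToken> result = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields one empty piece; skip it
            }
            String key = word.toLowerCase(Locale.ROOT);
            String[] entry = LEXICON.get(key);
            if (entry == null) {
                // Unknown word: keep the spelling, mark the tag as unknown.
                result.add(new AdornedToken(word, word, key, "UNKNOWN"));
            } else {
                result.add(new AdornedToken(word, entry[0], entry[1], entry[2]));
            }
        }
        return result;
    }
}
```

A real adorner replaces the lookup table with the tagger, lemmatizer and spelling standardizer listed in the fields below; the sketch only shows that each input word ends up carrying several parallel annotations.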
Modifier and Type | Field and Description |
---|---|
Abbreviations | abbreviations: Abbreviations. |
int | defaultKWICWidth: Number of characters in a KWIC line. |
TaggedStrings | extraWords: Extra words. |
java.lang.String | extraWordsFileName: Extra words file. |
java.lang.String | latinWordsFileName: Latin words file. |
java.lang.String | lemmaSeparator |
Lemmatizer | lemmatizer: Lemmatizer. |
Abbreviations | mainAbbreviations: Main text abbreviations. |
MorphAdornerLogger | morphAdornerLogger: MorphAdorner logger. |
MorphAdornerSettings | morphAdornerSettings: MorphAdorner settings. |
Names | names: Proper names. |
NameStandardizer | nameStandardizer: Proper name standardizer. |
PartOfSpeechGuesser | partOfSpeechGuesser: Part of speech guesser. |
PartOfSpeechTags | partOfSpeechTags: Part of speech tags. |
PartOfSpeechRetagger | retagger: Part of speech retagger. |
Abbreviations | sideAbbreviations: Side text abbreviations. |
SpellingMapper | spellingMapper: Spelling mapper. |
SpellingStandardizer | spellingStandardizer: Spelling standardizer. |
WordTokenizer | spellingTokenizer: Spelling tokenizer for lemmatization. |
protected static java.util.Map<java.lang.String,MorphAdorner> | storedAdorners: Stores initialized MorphAdorner objects for reuse. |
Lexicon | suffixLexicon: Suffix lexicon. |
TEITagClassifier | tagClassifier: Tag classifier. |
PartOfSpeechTagger | tagger: Part of speech tagger. |
java.lang.String | tagSeparator: Part of speech tag separator. |
MorphAdornerSettings | tokenizationSettings: MorphAdorner settings for XML tokenization. |
TransitionMatrix | transitionMatrix: Transition matrix. |
Lexicon | wordLexicon: Word lexicon. |
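
The `defaultKWICWidth` field above gives the number of characters in a KWIC (keyword in context) line. As a rough illustration of what a fixed-width KWIC line is, here is a generic sketch that centers a keyword and fills the rest of the line with surrounding text. `KwicSketch` and `kwicLine` are hypothetical names, and this is not taken from MorphAdorner's own KWIC code.

```java
/** Generic KWIC formatting sketch; not MorphAdorner's implementation. */
final class KwicSketch {

    /**
     * Builds one keyword-in-context line.
     *
     * @param text  the full text
     * @param start offset of the keyword's first character
     * @param end   offset just past the keyword's last character
     * @param width total number of characters wanted in the line
     */
    static String kwicLine(String text, int start, int end, int width) {
        String keyword = text.substring(start, end);

        // Split the space left over after the keyword between both sides.
        // If the keyword is longer than width, the line is just the keyword.
        int remaining = Math.max(0, width - keyword.length());
        int leftWidth = remaining / 2;
        int rightWidth = remaining - leftWidth;

        String left = text.substring(Math.max(0, start - leftWidth), start);
        String right = text.substring(end, Math.min(text.length(), end + rightWidth));

        StringBuilder line = new StringBuilder();
        // Pad on the left so the keyword always starts in the same column.
        for (int i = left.length(); i < leftWidth; i++) {
            line.append(' ');
        }
        line.append(left).append(keyword).append(right);
        // Pad on the right so every line has the same length.
        for (int i = right.length(); i < rightWidth; i++) {
            line.append(' ');
        }
        return line.toString();
    }
}
```

Because the keyword starts at the same column in every line, stacked KWIC lines can be scanned vertically, which is the usual reason for fixing the line width.
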
Constructor and Description |
---|
MorphAdorner(): Create empty MorphAdorner object. |
MorphAdorner(java.lang.String[] args): Create MorphAdorner object. |
MorphAdorner(java.lang.String[] args, java.lang.String logConfiguration, java.lang.String logDirectory): Create MorphAdorner object. |
Modifier and Type | Method and Description |
---|---|
void | addAbbreviations(Abbreviations abbreviations, java.lang.String abbreviationsURL, java.lang.String loadedMessage): Add abbreviations. |
AdornedWordOutputter | adornFile(java.lang.String fileName): Perform word adornment processes for a single input file. |
AdornedWordOutputter | adornText(java.lang.String textToAdorn, java.net.URL outputURL): Perform word adornment processes for a single input text. |
void | adornXML(java.lang.String inputFileName, boolean tokenizeOnly): Adorn XML file. |
static MorphAdorner | createAdorner(java.lang.String adornerName, boolean replaceAdorner, java.lang.String[] adornerArgs, java.lang.String adornerLogConfig, java.lang.String adornerLogDirectory): Create a Morphological Adorner. |
static MorphAdorner | createAndRunAdorner(java.lang.String adornerName, boolean replaceAdorner, java.lang.String[] adornerArgs, java.lang.String adornerLogConfig, java.lang.String adornerLogDirectory, java.lang.String outputDirectory, java.lang.String[] filesToAdorn, boolean tokenizeOnly): Create and run a Morphological Adorner. |
boolean | doesOutputFileNameExist(java.lang.String inputFileName): Check if output file name for adorned output already exists. |
void | finalize(): Finalize. |
protected void | fixSideWords(org.w3c.dom.Document document, Abbreviations sideAbbreviations): Fix abbreviations in side text. |
java.lang.String | getOutputFileName(java.lang.String inputFileName): Generate output file name for adorned output. |
static java.util.Map<java.lang.String,MorphAdorner> | getStoredAdorners(): Get map of stored adorners. |
protected void | initializeAdornment(): Initialize adornment classes. |
protected boolean | inSideText(org.w3c.dom.Node element): Is element in side text. |
static void | main(java.lang.String[] args): Create and run a Morphological Adorner. |
protected static void | mergeXML(TextInputter inputter, java.lang.String xmlFileName): Merge XML fragments into one XML file. |
protected void | printWords(org.w3c.dom.Document document): Print words in DOM document. |
void | processInputFiles(): Process list of files containing text to adorn. |
void | processInputFiles(boolean xmlTokenizeOnly): Process list of files containing text to adorn. |
void | readorn(java.lang.String inputFileName): Readorn adorned XML file. |
static MorphAdorner | runAdorner(MorphAdorner adorner, java.lang.String outputDirectory, java.lang.String[] filesToAdorn, boolean tokenizeOnly): Run a Morphological Adorner. |
static MorphAdorner | runAdorner(java.lang.String adornerName, java.lang.String outputDirectory, java.lang.String[] filesToAdorn, boolean tokenizeOnly): Run a Morphological Adorner. |
static void | setStoredAdorners(java.util.Map<java.lang.String,MorphAdorner> storedAdorners): Set map of stored adorners. |
void | updateAdornedSentence(java.util.List<ExtendedAdornedWord> sentence, java.util.Set<java.lang.String> regIDSet): Update adornments for a previously adorned sentence. |
void | updateAdornedSentences(java.util.List<java.util.List<ExtendedAdornedWord>> sentences, java.util.Set<java.lang.String> regIDSet): Adorn a list of sentences containing adorned words. |
protected void | updateSplitWordAdornments(ExtendedAdornedWordFilter wordFilter): Update adornments for split words. |
protected static java.util.Map<java.lang.String,MorphAdorner> storedAdorners
public int defaultKWICWidth
public java.lang.String latinWordsFileName
public java.lang.String extraWordsFileName
public TaggedStrings extraWords
public WordTokenizer spellingTokenizer
public PartOfSpeechTags partOfSpeechTags
public PartOfSpeechTagger tagger
public PartOfSpeechRetagger retagger
public Lexicon wordLexicon
public PartOfSpeechGuesser partOfSpeechGuesser
public Lexicon suffixLexicon
public TransitionMatrix transitionMatrix
public SpellingStandardizer spellingStandardizer
public SpellingMapper spellingMapper
public NameStandardizer nameStandardizer
public Lemmatizer lemmatizer
public Names names
public Abbreviations abbreviations
public Abbreviations mainAbbreviations
public Abbreviations sideAbbreviations
public java.lang.String tagSeparator
public java.lang.String lemmaSeparator
public MorphAdornerLogger morphAdornerLogger
public MorphAdornerSettings morphAdornerSettings
public MorphAdornerSettings tokenizationSettings
public TEITagClassifier tagClassifier
public MorphAdorner()
public MorphAdorner(java.lang.String[] args, java.lang.String logConfiguration, java.lang.String logDirectory)
args
- Command line parameters.
logConfiguration
- Log file configuration.
logDirectory
- Log file directory.
public MorphAdorner(java.lang.String[] args)
args
- Parameters.
public static java.util.Map<java.lang.String,MorphAdorner> getStoredAdorners()
public static void setStoredAdorners(java.util.Map<java.lang.String,MorphAdorner> storedAdorners)
storedAdorners
- Map from names to adorner instances.
protected void initializeAdornment()
public void processInputFiles(boolean xmlTokenizeOnly)
xmlTokenizeOnly
- Only tokenize XML files.
public void processInputFiles()
public void adornXML(java.lang.String inputFileName, boolean tokenizeOnly) throws java.lang.Exception
inputFileName
- File name of XML file to adorn.
tokenizeOnly
- Only tokenize.
java.lang.Exception
- For variety of errors.
protected void printWords(org.w3c.dom.Document document)
document
- The DOM document containing words to print.
The text of
protected void fixSideWords(org.w3c.dom.Document document, Abbreviations sideAbbreviations)
document
- DOM document containing words to fix.
sideAbbreviations
- Abbreviations list for side text.
protected boolean inSideText(org.w3c.dom.Node element)
element
- Element.
public java.lang.String getOutputFileName(java.lang.String inputFileName) throws java.io.IOException
inputFileName
- The input file name.
java.io.IOException
- if output directory cannot be created.
public boolean doesOutputFileNameExist(java.lang.String inputFileName)
inputFileName
- The input file name.
public AdornedWordOutputter adornFile(java.lang.String fileName) throws java.io.IOException
fileName
- Input file name.
java.lang.Exception
- if an error occurs.
java.io.IOException
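
The pairing of getOutputFileName (which may fail if the output directory cannot be created) and doesOutputFileNameExist above can be sketched with java.nio.file. This is an illustration under assumptions, not MorphAdorner's code: the class `OutputNameSketch` is hypothetical, and the naming rule used here (keep the input file's base name and place it in the output directory) is a guess, since the documentation does not state how the name is derived.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of output-name handling; not MorphAdorner's code. */
final class OutputNameSketch {

    private final Path outputDirectory;

    OutputNameSketch(Path outputDirectory) {
        this.outputDirectory = outputDirectory;
    }

    /**
     * Assumed rule: same base name as the input, inside the output directory.
     * Expects inputFileName to name a regular file, not a bare root such as "/".
     *
     * @throws IOException if the output directory cannot be created
     */
    String getOutputFileName(String inputFileName) throws IOException {
        // Creates missing parent directories too; no-op if it already exists.
        Files.createDirectories(outputDirectory);
        Path baseName = Paths.get(inputFileName).getFileName();
        return outputDirectory.resolve(baseName).toString();
    }

    /** Reports whether the assumed output file is already present. */
    boolean doesOutputFileNameExist(String inputFileName) {
        Path baseName = Paths.get(inputFileName).getFileName();
        return Files.exists(outputDirectory.resolve(baseName));
    }
}
```

A caller can use the existence check to skip inputs that were adorned in an earlier run, and only then ask for the output name, which is the step that can throw.
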
public AdornedWordOutputter adornText(java.lang.String textToAdorn, java.net.URL outputURL) throws java.io.IOException
textToAdorn
- Text to adorn.
outputURL
- URL for output.
java.lang.Exception
- if an error occurs.
java.io.IOException
public void readorn(java.lang.String inputFileName) throws org.xml.sax.SAXException, java.io.IOException, java.io.FileNotFoundException
inputFileName
- Input XML file name.
org.xml.sax.SAXException
java.io.IOException
java.io.FileNotFoundException
public void updateAdornedSentences(java.util.List<java.util.List<ExtendedAdornedWord>> sentences, java.util.Set<java.lang.String> regIDSet)
sentences
- Previously adorned sentences to readorn.
regIDSet
- Word IDs of words with preset standard spellings.
protected void updateSplitWordAdornments(ExtendedAdornedWordFilter wordFilter)
wordFilter
- ExtendedAdornedWordFilter with words to update.
public void updateAdornedSentence(java.util.List<ExtendedAdornedWord> sentence, java.util.Set<java.lang.String> regIDSet)
sentence
- Previously adorned sentence to update.
regIDSet
- Word IDs of words with preset standard spellings.
public void addAbbreviations(Abbreviations abbreviations, java.lang.String abbreviationsURL, java.lang.String loadedMessage)
abbreviationsURL
- Abbreviations URL.
loadedMessage
- Message to display when words loaded.
protected static void mergeXML(TextInputter inputter, java.lang.String xmlFileName)
public static void main(java.lang.String[] args)
args
- Program arguments.
public static MorphAdorner createAdorner(java.lang.String adornerName, boolean replaceAdorner, java.lang.String[] adornerArgs, java.lang.String adornerLogConfig, java.lang.String adornerLogDirectory)
adornerName
- Name for this adorner.
replaceAdorner
- Replace existing adorner.
adornerArgs
- Adorner arguments.
adornerLogConfig
- Adorner log file configuration.
adornerLogDirectory
- Adorner log directory.
public static MorphAdorner runAdorner(MorphAdorner adorner, java.lang.String outputDirectory, java.lang.String[] filesToAdorn, boolean tokenizeOnly)
adorner
- The adorner to run.
outputDirectory
- Adorned files output directory.
filesToAdorn
- File names to adorn.
tokenizeOnly
- Only tokenize XML files.
If the adorner specified is null, no processing is performed, and null is returned as the adorner used.
public static MorphAdorner runAdorner(java.lang.String adornerName, java.lang.String outputDirectory, java.lang.String[] filesToAdorn, boolean tokenizeOnly)
adornerName
- Name for this adorner.
outputDirectory
- Adorned files output directory.
filesToAdorn
- File names to adorn.
tokenizeOnly
- Only tokenize XML files.
If the requested adorner was not found, no processing is performed, and null is returned as the adorner used.
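
The named-adorner methods above (createAdorner with its replaceAdorner flag, runAdorner by name returning null when the name is unknown, and the storedAdorners map behind them) follow a familiar registry pattern. The sketch below shows that pattern with hypothetical stand-in types; `AdornerRegistrySketch` and its nested `Adorner` are not MorphAdorner classes. It also assumes that createAdorner hands back the stored instance when one exists and replacement was not requested, which the documentation implies but does not state.

```java
import java.util.HashMap;
import java.util.Map;

/** Registry-pattern sketch with stand-in types; not MorphAdorner's code. */
final class AdornerRegistrySketch {

    /** Hypothetical stand-in for an expensive-to-initialize adorner. */
    static final class Adorner {
        final String[] args;
        int filesProcessed = 0;

        Adorner(String[] args) {
            this.args = args;
        }

        void process(String[] files) {
            filesProcessed += files.length; // placeholder for real work
        }
    }

    // Plays the role of storedAdorners: name -> initialized instance.
    // Not thread-safe; a shared registry would need synchronization.
    private static final Map<String, Adorner> stored = new HashMap<>();

    /** Reuses a stored adorner unless replace is true (assumed behavior). */
    static Adorner createAdorner(String name, boolean replace, String[] args) {
        Adorner existing = stored.get(name);
        if (existing != null && !replace) {
            return existing;
        }
        Adorner created = new Adorner(args);
        stored.put(name, created);
        return created;
    }

    /** Runs a stored adorner by name; returns null if the name is unknown. */
    static Adorner runAdorner(String name, String[] files) {
        Adorner adorner = stored.get(name);
        if (adorner == null) {
            return null; // documented contract: no processing, null result
        }
        adorner.process(files);
        return adorner;
    }
}
```

Callers of the by-name form therefore need a null check on the return value to tell "nothing was processed" apart from a successful run.
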
public static MorphAdorner createAndRunAdorner(java.lang.String adornerName, boolean replaceAdorner, java.lang.String[] adornerArgs, java.lang.String adornerLogConfig, java.lang.String adornerLogDirectory, java.lang.String outputDirectory, java.lang.String[] filesToAdorn, boolean tokenizeOnly)
adornerName
- Name for this adorner.
replaceAdorner
- Replace existing adorner.
adornerArgs
- Adorner arguments.
adornerLogConfig
- Adorner log file configuration.
adornerLogDirectory
- Adorner log directory.
outputDirectory
- Adorned files output directory.
filesToAdorn
- File names to adorn.
tokenizeOnly
- Only tokenize XML files.
public void finalize() throws java.lang.Throwable
Overrides finalize in class java.lang.Object
java.lang.Throwable