public class MergeAnnolexCorrectionsIntoAdornedXML
extends java.lang.Object
Usage:
MergeAnnolexCorrectionsIntoAdornedXML correctionsdirectory outputdirectory inputfiles
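The usage line implies that two directory parameters come first and that every remaining argument is an input file specification. A minimal sketch of that split is shown below. The class `UsageSketch`, its method names, and the value 2 for the leading parameter count are assumptions made for illustration; the documented `INITPARAMS` constant plays this role, but its actual value is not stated here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the command-line layout; not part of the documented class. */
final class UsageSketch {
    /** Assumed number of parameters before the input file specs (the two directories). */
    static final int INIT_PARAMS = 2;

    /** Returns true when at least one input file spec follows the leading parameters. */
    static boolean hasEnoughArgs(String[] args) {
        return args != null && args.length > INIT_PARAMS;
    }

    /** Returns the input file specs, i.e. everything after the leading parameters. */
    static List<String> inputFiles(String[] args) {
        if (!hasEnoughArgs(args)) {
            throw new IllegalArgumentException(
                "Usage: MergeAnnolexCorrectionsIntoAdornedXML "
                + "correctionsdirectory outputdirectory inputfiles");
        }
        return Arrays.asList(Arrays.copyOfRange(args, INIT_PARAMS, args.length));
    }
}
```

With the arguments `corr out a.xml b.xml`, `inputFiles` returns the two XML file names and leaves the directories to be read from positions 0 and 1.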
The corrections file is a tab-separated utf-8 file containing the following columns.
The corrected spelling, lemmata, and parts of speech may all be empty when the operation is 3 (delete).
The value of the "ord" (word ordinal) attribute for each word is adjusted to account for inserted and deleted words. The values of the "reg" (standard spelling) and "tok" (original token) attributes are generated as needed for updated and inserted words.
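The ordinal adjustment can be pictured as a renumbering pass over the words in document order. The sketch below models each word as a plain attribute map rather than a JDOM element, and it assumes ordinals are consecutive and start at 1; neither detail is confirmed by this page. `OrdinalSketch` and `renumber` are hypothetical names.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of renumbering "ord" attributes after inserts and deletes. */
final class OrdinalSketch {
    /**
     * Assigns "ord" values 1, 2, 3, ... to the words in the order given.
     * Starting at 1 is an assumption of this sketch.
     */
    static void renumber(List<Map<String, String>> wordsInDocumentOrder) {
        int ord = 1;
        for (Map<String, String> attributes : wordsInDocumentOrder) {
            attributes.put("ord", Integer.toString(ord++));
        }
    }
}
```

If the words carry ordinals 1, 2, 4 because word 3 was deleted, the pass leaves them numbered 1, 2, 3.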
Whitespace markers "<c> </c>" are added and deleted as needed when tokens are added or deleted. Most added punctuation and symbols do not require added whitespace markers. When tokens are deleted, sequences of "<c> </c><c> </c> ..." are compressed to a single "<c> </c>" entry.
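The compression of repeated whitespace markers can be sketched as follows. The documented class works on JDOM2 documents; this sketch swaps in the DOM classes bundled with the JDK so that it runs without extra libraries. `CompressCSketch` and its methods are hypothetical, and the rule that only directly adjacent sibling markers count as a sequence is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical illustration of collapsing runs of blank "c" elements. */
final class CompressCSketch {
    /** True when the node is a "c" element whose text is only whitespace. */
    static boolean isBlankC(Node node) {
        return node instanceof Element
            && "c".equals(node.getNodeName())
            && node.getTextContent().trim().isEmpty();
    }

    /** Removes every blank "c" element that directly follows another blank "c" sibling. */
    static void compress(Node parent) {
        Node child = parent.getFirstChild();
        boolean previousWasBlankC = false;
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before any removal
            if (isBlankC(child)) {
                if (previousWasBlankC) {
                    parent.removeChild(child);
                }
                previousWasBlankC = true;
            } else {
                previousWasBlankC = false;
                compress(child); // descend into nested elements
            }
            child = next;
        }
    }

    /** Parses the XML, compresses blank "c" runs, and counts the "c" elements left. */
    static int countCAfterCompress(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            compress(doc.getDocumentElement());
            return doc.getElementsByTagName("c").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Because the next sibling is captured before a node is removed, the loop stays valid while it edits the child list.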
Modifier and Type | Field and Description |
---|---|
protected static int | addedWords: Words added, deleted, or modified in current document. |
protected static java.util.Set<java.lang.String> | badPosTags: Holds bad pos tags. |
protected static org.jdom2.Element | clonableCElement: "c" element for cloning. |
protected static java.util.Set<java.lang.String> | combinedBadPosTags: Holds combined bad pos tags. |
protected static java.util.Set<java.lang.String> | combinedMismatches: Holds combined mismatches. |
protected static java.util.Map<java.lang.String,CorrectedWord> | correctedWordsMap: Map word IDs to corrected words. |
protected static int | currentDocNumber: Current document. |
protected static boolean | debug: True to produce debugging output. |
protected static int | deletedGaps |
protected static int | deletedWords |
protected static int | docsToProcess: Number of documents to process. |
protected static org.jdom2.Document | document: DOM document. |
protected static java.util.List<org.jdom2.Element> | gapElementsToDelete: List of gaps to delete. |
protected static java.util.Map<java.lang.String,org.jdom2.Element> | gapIDsToElements: Map gap IDs to gap elements in DOM document. |
protected static int | INITPARAMS: # params before input file specs. |
protected static java.lang.String | inputCorrectionsDirectory: Input directory for corrections files. |
protected static java.lang.String | inputXMLDirectory: Input directory for adorned files. |
protected static java.util.Set<java.lang.String> | mismatches: Holds individual mismatches. |
protected static int | modifiedWords |
protected static java.lang.String | outputDirectory: Output directory for corrected adorned files. |
protected static java.io.PrintStream | outputFileStream: Output file stream. |
protected static PartOfSpeechTags | posTags: Pos tags. |
protected static java.io.PrintStream | printStream: Wrapper for printStream to allow utf-8 output. |
protected static boolean | verbose: True to produce verbose output. |
protected static java.util.List<org.jdom2.Element> | wordElementsToDelete: List of word elements to delete. |
protected static java.util.Map<java.lang.String,org.jdom2.Element> | wordIDsToElements: Map word IDs to word elements in DOM document. |
Constructor and Description |
---|
MergeAnnolexCorrectionsIntoAdornedXML() |
Modifier and Type | Method and Description |
---|---|
protected static void | applyCorrections(org.jdom2.Document document): Apply corrections to XML document from corrections file. |
protected static boolean | changeAttribute(org.jdom2.Element element, java.lang.String attrName, java.lang.String oldValue, java.lang.String newValue): Change attribute value. |
protected static void | compressCElements(org.jdom2.Document document): Compress sequences of "<c> </c>" elements. |
protected static int | countSeparators(java.lang.String s, char sep): Count separators in string. |
protected static void | deleteGapElement(org.jdom2.Element gapElement): Delete a gap element. |
protected static void | deleteGapElements(java.util.List<org.jdom2.Element> gapElementsToDelete): Delete gaps in XML file. |
protected static void | deleteWordElement(org.jdom2.Element wordElement, java.util.List<java.lang.String> sortedWordIDs): Delete a word element. |
protected static void | deleteWordElements(java.util.List<org.jdom2.Element> wordElementsToDelete, java.util.List<java.lang.String> sortedWordIDs): Delete words in XML file. |
protected static void | deleteWordOld(org.jdom2.Element wordElement, java.util.List<java.lang.String> correctedWordIDs, int i): Delete a word. |
protected static org.jdom2.Element | extractCElement(org.jdom2.Document document): Extract a "c" element from XML file to use for cloning copies. |
protected static java.util.Map<java.lang.String,org.jdom2.Element> | extractGaps(org.jdom2.Document document): Extract gaps from "gap" elements in XML document file. |
protected static java.util.Map<java.lang.String,org.jdom2.Element> | extractWords(org.jdom2.Document document): Extract words specified by "w" elements in XML document file. |
protected static void | fixEOSAttributes(java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements): Fix EOS attributes. |
protected static void | fixSplitWordIDs(org.jdom2.Document document): Fix split word IDs. |
protected static void | gapToWord(org.jdom2.Element gapElement, CorrectedWord correctedWord): Replace a gap element by a word element. |
static java.util.List<java.lang.String> | getRelatedWordIDs(java.lang.String wordID): Get related adorned word IDs for a word ID. |
protected static java.util.List<java.lang.String> | getSortedWordIDs(): Get sorted word IDs. |
protected static boolean | initialize(java.lang.String[] args): Initialize. |
protected static int | insertWord(java.lang.String idToInsert, CorrectedWord correctedWord, java.util.List<java.lang.String> correctedWordIDs, int i): Insert word in XML file. |
protected static CorrectedWordsFileReader | loadCorrectionsFile(java.lang.String correctionFileName): Load word corrections file. |
protected static org.jdom2.Document | loadXML(java.lang.String inputXMLFileName): Load XML document file. |
static void | main(java.lang.String[] args): Main program. |
protected static <K,V> void | printMap(java.lang.String mapLabel, java.util.Map<K,V> map): Print contents of a map (HashMap, TreeMap, etc.). |
protected static void | printMismatches(): Print mismatches. |
protected static <V> void | printSet(java.lang.String setLabel, java.util.Set<V> set): Print contents of a set (HashSet, TreeSet, etc.). |
protected static int | processFiles(java.lang.String[] args): Process files. |
protected static void | processOneFile(java.lang.String xmlInputFileName): Process corrections for one XML file. |
protected static void | replaceGapElementsWithWords(java.util.List<org.jdom2.Element> gapElementsToUpdate): Replace gaps in XML file with real words. |
protected static void | resplit(java.lang.String id, java.lang.String[] spellingParts, java.lang.String oldJoinedSpelling, java.lang.String updatedSpelling): Fix spelling parts of updated split word. |
protected static void | terminate(int filesProcessed, long processingTime): Terminate. |
protected static boolean | updateWord(org.jdom2.Document document, org.jdom2.Element wordElement, CorrectedWord correctedWord, java.util.List<java.lang.String> correctedWordIDs, int i, java.lang.String correctedSpelling): Update word in XML file. |
protected static void | updateWordOrdinals(java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements): Update word ordinals. |
protected static int | validateCorrections(java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements): Validate entries in corrections file. |
protected static int docsToProcess
protected static int currentDocNumber
protected static java.lang.String inputXMLDirectory
protected static java.lang.String inputCorrectionsDirectory
protected static java.lang.String outputDirectory
protected static java.io.PrintStream outputFileStream
protected static java.io.PrintStream printStream
protected static org.jdom2.Document document
protected static java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements
protected static java.util.Map<java.lang.String,org.jdom2.Element> gapIDsToElements
protected static java.util.Map<java.lang.String,CorrectedWord> correctedWordsMap
protected static final int INITPARAMS
protected static PartOfSpeechTags posTags
protected static java.util.Set<java.lang.String> badPosTags
protected static java.util.Set<java.lang.String> combinedBadPosTags
protected static java.util.Set<java.lang.String> mismatches
protected static java.util.Set<java.lang.String> combinedMismatches
protected static int addedWords
protected static int deletedWords
protected static int modifiedWords
protected static int deletedGaps
protected static org.jdom2.Element clonableCElement
protected static java.util.List<org.jdom2.Element> wordElementsToDelete
protected static java.util.List<org.jdom2.Element> gapElementsToDelete
protected static boolean verbose
protected static boolean debug
public MergeAnnolexCorrectionsIntoAdornedXML()
public static void main(java.lang.String[] args)
args - Program parameters.
protected static boolean initialize(java.lang.String[] args) throws java.lang.Exception
java.lang.Exception
protected static void processOneFile(java.lang.String xmlInputFileName) throws java.io.IOException
xmlInputFileName - XML input file.
java.io.IOException
protected static org.jdom2.Document loadXML(java.lang.String inputXMLFileName) throws org.jdom2.JDOMException, java.io.IOException
inputXMLFileName - Input XML file name.
org.jdom2.JDOMException
java.io.IOException
protected static java.util.Map<java.lang.String,org.jdom2.Element> extractWords(org.jdom2.Document document)
document - Parsed XML document.
protected static java.util.Map<java.lang.String,org.jdom2.Element> extractGaps(org.jdom2.Document document)
document - Parsed XML document.
protected static org.jdom2.Element extractCElement(org.jdom2.Document document)
document - Parsed XML document.
protected static int validateCorrections(java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements)
wordIDsToElements - Map of document word IDs to document "w" elements.
protected static void applyCorrections(org.jdom2.Document document) throws java.lang.Exception
document - Document to update.
java.lang.Exception
protected static void resplit(java.lang.String id, java.lang.String[] spellingParts, java.lang.String oldJoinedSpelling, java.lang.String updatedSpelling)
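protected static void compressCElements(org.jdom2.Document document)
document - Document in which to compress "<c> </c>" elements.
Deleting words may have left sequences of "<c> </c><c> </c> ..."; each such sequence is compressed to a single "<c> </c>" entry.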
protected static void fixSplitWordIDs(org.jdom2.Document document)
document - Parsed XML document.
protected static void fixEOSAttributes(java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements)
wordIDsToElements - Map of word ID to parsed "w" elements.
Adding or deleting words may have left sequences of "w" elements which all have the "eos" attribute set to "1". This method updates the "w" elements in such a sequence so that only the last "w" element in the sequence has "eos" set to "1".
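The end-of-sentence repair described above can be sketched on a simplified model in which each word is reduced to a boolean saying whether its "eos" attribute is "1". `EosSketch` and `fixEos` are hypothetical names, and the sketch assumes the flags are supplied in document order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of keeping only the last "eos" flag in each run. */
final class EosSketch {
    /**
     * Returns a copy of the flags in which every run of consecutive
     * true values keeps only its final true.
     */
    static List<Boolean> fixEos(List<Boolean> eosInDocumentOrder) {
        List<Boolean> fixed = new ArrayList<>(eosInDocumentOrder);
        for (int i = 0; i + 1 < fixed.size(); i++) {
            // A flag followed directly by another flag is not the end of the run.
            if (fixed.get(i) && fixed.get(i + 1)) {
                fixed.set(i, false);
            }
        }
        return fixed;
    }
}
```

Clearing a flag at position `i` never changes the comparison made at `i + 1`, because each step only looks forward, so a single left-to-right pass is enough.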
protected static void updateWordOrdinals(java.util.Map<java.lang.String,org.jdom2.Element> wordIDsToElements)
wordIDsToElements - Map of document word IDs to document "w" elements.
protected static int countSeparators(java.lang.String s, char sep)
s - The string in which to count separators.
sep - The separator character.
protected static boolean changeAttribute(org.jdom2.Element element, java.lang.String attrName, java.lang.String oldValue, java.lang.String newValue)
element - DOM element.
attrName - Attribute name.
oldValue - Old attribute value.
newValue - New attribute value.
public static java.util.List<java.lang.String> getRelatedWordIDs(java.lang.String wordID)
wordID - Word ID for which related IDs are wanted.
Related word IDs are the word IDs for the other parts of a split word. The returned list includes the given wordID.
For unsplit words, the single given wordID is returned in the list.
Null is returned when the wordID does not exist.
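The three documented outcomes (all parts of a split word, a one-element list for an unsplit word, and null for an unknown ID) can be illustrated with a small sketch. This page does not say how the parts of a split word are identified, so the sketch assumes, purely for illustration, that the parts share a base ID and differ in a trailing ".N" suffix. `RelatedIdsSketch`, `baseOf`, and the explicit `knownIDs` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical illustration of looking up the parts of a split word. */
final class RelatedIdsSketch {
    /** Strips a trailing ".N" part suffix (an assumed convention) to get the base ID. */
    static String baseOf(String wordID) {
        int dot = wordID.lastIndexOf('.');
        return dot < 0 ? wordID : wordID.substring(0, dot);
    }

    /**
     * Returns all known IDs sharing the base of the given ID, including the ID itself,
     * or null when the ID is not known.
     */
    static List<String> relatedWordIDs(String wordID, Collection<String> knownIDs) {
        if (!knownIDs.contains(wordID)) {
            return null; // mirrors the documented null for a nonexistent word ID
        }
        String base = baseOf(wordID);
        List<String> related = new ArrayList<>();
        for (String id : knownIDs) {
            if (baseOf(id).equals(base)) {
                related.add(id);
            }
        }
        return related;
    }
}
```

Callers need a null check before iterating over the result, since an unknown ID yields null rather than an empty list.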
protected static int processFiles(java.lang.String[] args) throws java.lang.Exception
java.lang.Exception
protected static void printMismatches()
protected static <K,V> void printMap(java.lang.String mapLabel, java.util.Map<K,V> map)
mapLabel - Label for map.
map - The map to print.
N.B. This method assumes both the keys and values have toString() methods.
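A generic map printer of this kind can be sketched as below. The output layout (label on the first line, then one `key=value` line per entry) is an assumption, since the actual format is not documented here, and the sketch returns a string instead of writing to a stream so that the result is easy to inspect. `PrintMapSketch` and `formatMap` are hypothetical names.

```java
import java.util.Map;

/** Hypothetical illustration of rendering a labelled map as text. */
final class PrintMapSketch {
    /** Renders the label followed by one "key=value" line per map entry. */
    static <K, V> String formatMap(String mapLabel, Map<K, V> map) {
        StringBuilder sb = new StringBuilder(mapLabel).append('\n');
        for (Map.Entry<K, V> entry : map.entrySet()) {
            // StringBuilder.append(Object) uses String.valueOf, which calls toString().
            sb.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

Entries appear in the map's own iteration order, so a `TreeMap` gives sorted output while a `HashMap` gives no guaranteed order.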
protected static <V> void printSet(java.lang.String setLabel, java.util.Set<V> set)
setLabel - Label for set.
set - The set to print.
N.B. This method assumes set values have toString() methods.
protected static void terminate(int filesProcessed, long processingTime)
filesProcessed - Number of files processed.
processingTime - Processing time in seconds.
protected static CorrectedWordsFileReader loadCorrectionsFile(java.lang.String correctionFileName) throws java.lang.Exception
correctionFileName - Name of word correction file.
java.lang.Exception
protected static boolean updateWord(org.jdom2.Document document, org.jdom2.Element wordElement, CorrectedWord correctedWord, java.util.List<java.lang.String> correctedWordIDs, int i, java.lang.String correctedSpelling)
wordElement - Word to update.
correctedWord - Correction for word.
correctedWordIDs - List of corrected word IDs.
i - Index of word in list of word IDs.
correctedSpelling - Corrected spelling.
protected static int insertWord(java.lang.String idToInsert, CorrectedWord correctedWord, java.util.List<java.lang.String> correctedWordIDs, int i)
idToInsert - Word ID to insert.
correctedWord - Data for word to insert.
correctedWordIDs - List of corrected word IDs.
i - Index of word in list of word IDs.
protected static void deleteWordElements(java.util.List<org.jdom2.Element> wordElementsToDelete, java.util.List<java.lang.String> sortedWordIDs)
wordElementsToDelete - Word elements to delete.
sortedWordIDs - Sorted word IDs in document.
protected static void deleteGapElements(java.util.List<org.jdom2.Element> gapElementsToDelete)
gapElementsToDelete - Gap elements to delete.
protected static void replaceGapElementsWithWords(java.util.List<org.jdom2.Element> gapElementsToUpdate)
gapElementsToUpdate - Gap elements to update.
protected static void gapToWord(org.jdom2.Element gapElement, CorrectedWord correctedWord)
gapElement - Gap element to replace.
correctedWord - Word to replace gap element.
protected static void deleteGapElement(org.jdom2.Element gapElement)
gapElement - Gap element to delete.
protected static void deleteWordElement(org.jdom2.Element wordElement, java.util.List<java.lang.String> sortedWordIDs)
wordElement - Word element to delete.
protected static java.util.List<java.lang.String> getSortedWordIDs()
protected static void deleteWordOld(org.jdom2.Element wordElement, java.util.List<java.lang.String> correctedWordIDs, int i)
wordElement - Word to delete.
correctedWordIDs - List of corrected word IDs.
i - Index of word in list of word IDs.