public class StripWordAttributes
extends java.lang.Object
Usage:
java edu.northwestern.at.morphadorner.tools.stripwordattributes.StripWordAttributes input.xml output.xml output.tab [/[no]id] [/[no]trim]
input.xml | Input MorphAdorner-adorned XML file. |
output.xml | Derived adorned file with word element attributes stripped. |
output.tab | Tab delimited file of word element attribute values. |
/id or /noid | Optional parameter indicating whether the xml:id attribute should be left attached to each word (<w>) element. Default is /noid, which removes the xml:id attribute and its value. |
/trim or /notrim | Optional parameter indicating whether whitespace should be trimmed from the start and end of each XML text line. Default is /notrim, which leaves the original whitespace intact. |
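The following is a minimal sketch of how the three file names and the optional switches listed above could be interpreted. The class StripOptions and its parse method are hypothetical and are not part of MorphAdorner. Rejecting unknown switches is an assumption of this sketch, not documented behavior.

```java
import java.util.Arrays;

/** Hypothetical holder for the command line settings (not MorphAdorner code). */
final class StripOptions {
    final String inputXml;
    final String outputXml;
    final String outputTab;
    final boolean leaveId;        // false corresponds to the documented default /noid
    final boolean trimWhitespace; // false corresponds to the documented default /notrim

    private StripOptions(String inputXml, String outputXml, String outputTab,
                         boolean leaveId, boolean trimWhitespace) {
        this.inputXml = inputXml;
        this.outputXml = outputXml;
        this.outputTab = outputTab;
        this.leaveId = leaveId;
        this.trimWhitespace = trimWhitespace;
    }

    /** Parses: input.xml output.xml output.tab [/[no]id] [/[no]trim] */
    static StripOptions parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "Expected: input.xml output.xml output.tab [/[no]id] [/[no]trim]");
        }
        boolean leaveId = false;
        boolean trimWhitespace = false;
        for (String option : Arrays.copyOfRange(args, 3, args.length)) {
            switch (option) {
                case "/id":     leaveId = true;         break;
                case "/noid":   leaveId = false;        break;
                case "/trim":   trimWhitespace = true;  break;
                case "/notrim": trimWhitespace = false; break;
                default:
                    // Assumption: this sketch rejects anything it does not recognize.
                    throw new IllegalArgumentException("Unknown option: " + option);
            }
        }
        return new StripOptions(args[0], args[1], args[2], leaveId, trimWhitespace);
    }
}
```

The parsed values line up with the five constructor parameters of StripWordAttributes (three file names, leaveID, trimWhitespace).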
The derived adorned output file "output.xml" has all attributes stripped from each <w> tag.
The attribute values for each <w> element in the input.xml file are extracted and written to the tab-separated output.tab file. The order of the attribute lines matches the order of appearance of the <w> elements in the XML output file. When /id is used, the xml:id value in each <w> element in output.xml can be matched with the corresponding xml:id value in output.tab.
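The stripping step described above can be illustrated with a regular expression sketch. This is not the MorphAdorner implementation. The class WordTagStripper is hypothetical, and it assumes that attribute values are double-quoted, contain no '>' character, and that <w> elements are never self-closing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: strips attributes from <w> tags and records them in order. */
final class WordTagStripper {
    // Opening <w> tag with optional attributes (assumes no '>' inside values).
    private static final Pattern W_TAG = Pattern.compile("<w(\\s[^>]*)?>");
    // One attr="value" pair (assumes double-quoted values).
    private static final Pattern ATTR =
        Pattern.compile("([\\w:.-]+)\\s*=\\s*\"([^\"]*)\"");

    /** One attribute map per <w> element, in order of appearance. */
    final List<Map<String, String>> rows = new ArrayList<>();

    /** Returns the line with <w> attributes removed, keeping xml:id when leaveId is true. */
    String strip(String line, boolean leaveId) {
        Matcher tag = W_TAG.matcher(line);
        StringBuffer out = new StringBuffer();
        while (tag.find()) {
            Map<String, String> attrs = new LinkedHashMap<>();
            if (tag.group(1) != null) {
                Matcher attr = ATTR.matcher(tag.group(1));
                while (attr.find()) {
                    attrs.put(attr.group(1), attr.group(2));
                }
            }
            rows.add(attrs);
            String id = attrs.get("xml:id");
            String replacement = (leaveId && id != null)
                ? "<w xml:id=\"" + id + "\">"
                : "<w>";
            tag.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        tag.appendTail(out);
        return out.toString();
    }
}
```

Because rows is filled in the same order in which the tags are rewritten, the recorded attribute maps stay aligned with the <w> elements of the stripped output, which is the property the tab file relies on.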
The first line in output.tab contains the attribute names for each column. Each subsequent line in the output.tab file contains at least the following information corresponding to a single word "<w>" element. Some adorned files may add extra word attributes, resulting in more columns.
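To show how the header line and the value lines relate, here is a small sketch that turns the lines of a tab-delimited file into one map per word, keyed by the column names from the first line. The class TabFileReader is hypothetical. Padding short rows with empty strings is an assumption of the sketch, not documented behavior of the tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for a tab-delimited attribute file with a header line. */
final class TabFileReader {
    /** Maps each data line to column name -> value, using the first line as header. */
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // Limit -1 keeps trailing empty fields instead of dropping them.
        String[] names = lines.get(0).split("\t", -1);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < names.length; i++) {
                // Assumption: a missing trailing column is treated as empty.
                row.put(names[i], i < values.length ? values[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on output.tab. When /id was used, the xml:id entry of each map can then be looked up against the xml:id attributes remaining in output.xml.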
Modifier and Type | Field and Description |
---|---|
protected static java.util.Set<java.lang.String> | attrsToOmit - Attributes to omit from output attributes file. |
protected static java.util.Map<java.lang.String,java.lang.String> | entitiesMap - Standard entities generated by MorphAdorner. |
protected static java.util.regex.Matcher | entitiesMatcher - Entities pattern matcher. |
protected static java.util.regex.Pattern | entitiesPattern - Entities pattern. |
protected static java.lang.String | LINE_SEPARATOR - Line separator. |
Constructor and Description |
---|
StripWordAttributes(java.lang.String inputXMLFileName, java.lang.String outputXMLFileName, java.lang.String outputTabFileName, boolean leaveID, boolean trimWhitespace) - Create a derived adorned file with word attributes stripped and a tab-delimited file of the attribute values. |
Modifier and Type | Method and Description |
---|---|
protected static java.lang.String | cleanAttributeValue(java.lang.String attrValue) - Cleans an attribute value of enclosing quotes and internal entities. |
protected static void | displayUsage() - Display program usage. |
protected static java.util.Map<java.lang.String,java.lang.String> | fillInMissingAttributes(java.util.Map<java.lang.String,java.lang.String> attrMap, java.lang.String wordText) - Fill in missing attribute values. |
protected static java.util.Map<java.lang.String,java.lang.String> | getAttributes(java.lang.String attrsText, java.lang.String wordText) - Get map of attribute values for a <w> element. |
static void | main(java.lang.String[] args) - Main program. |
protected static void | setMissingValue(java.util.Map<java.lang.String,java.lang.String> attrMap, java.lang.String attrName, java.lang.String defaultAttrValue) - Set missing attribute value. |
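The summary of cleanAttributeValue mentions removing enclosing quotes and resolving internal entities. A minimal sketch of that kind of cleaning follows. The class AttributeValueCleaner is hypothetical, and the entity table is assumed to be the five predefined XML entities; the actual contents of entitiesMap are not listed in this documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of quote removal and entity replacement. */
final class AttributeValueCleaner {
    // Assumption: only the five predefined XML entities are handled.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&amp;", "&");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&apos;", "'");
    }
    private static final Pattern ENTITY = Pattern.compile("&(amp|lt|gt|quot|apos);");

    /** Removes one pair of matching enclosing quotes, then replaces entities. */
    static String clean(String value) {
        String result = value;
        if (result.length() >= 2) {
            char first = result.charAt(0);
            char last = result.charAt(result.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                result = result.substring(1, result.length() - 1);
            }
        }
        // Single pass, so the text "&amp;lt;" becomes "&lt;" rather than "<".
        Matcher matcher = ENTITY.matcher(result);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = ENTITIES.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Replacing all entities in one pass over a single pattern avoids double unescaping, which a chain of separate replace calls can introduce.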
protected static final java.lang.String LINE_SEPARATOR
protected static java.util.Set<java.lang.String> attrsToOmit
protected static java.util.Map<java.lang.String,java.lang.String> entitiesMap
protected static java.util.regex.Pattern entitiesPattern
protected static java.util.regex.Matcher entitiesMatcher
public StripWordAttributes(java.lang.String inputXMLFileName, java.lang.String outputXMLFileName, java.lang.String outputTabFileName, boolean leaveID, boolean trimWhitespace)
inputXMLFileName - Input adorned XML file name.
outputXMLFileName - Output modified XML file name.
outputTabFileName - Output attribute values file name.
leaveID - true to leave xml:id in output.
trimWhitespace - true to trim whitespace from start and end of input XML lines.
public static void main(java.lang.String[] args)
protected static void displayUsage()
protected static java.util.Map<java.lang.String,java.lang.String> getAttributes(java.lang.String attrsText, java.lang.String wordText)
attrsText - String containing attribute values in attr="value" form as extracted from a <w> tag.
wordText - Word text.
protected static java.lang.String cleanAttributeValue(java.lang.String attrValue)
attrValue - Input attribute value with possible enclosing quotes and internal entities.
protected static java.util.Map<java.lang.String,java.lang.String> fillInMissingAttributes(java.util.Map<java.lang.String,java.lang.String> attrMap, java.lang.String wordText)
attrMap - Map with attribute name -> attribute value entries.
wordText - Word text.
protected static void setMissingValue(java.util.Map<java.lang.String,java.lang.String> attrMap, java.lang.String attrName, java.lang.String defaultAttrValue)
attrMap - Map with attribute name -> attribute value entries.
attrName - Attribute name.
defaultAttrValue - Default attribute value for attribute attrName if not present in attrMap.
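The parameter descriptions for setMissingValue suggest a "set only if absent" operation on the attribute map. The sketch below shows one way that could look. The class MissingAttributes is hypothetical, and so is the idea of passing in the names of attributes that default to the word text; this documentation does not say which attributes fillInMissingAttributes actually fills in or with what values.

```java
import java.util.Map;

/** Hypothetical sketch of filling in absent attribute values. */
final class MissingAttributes {
    /** Stores defaultAttrValue under attrName only when attrMap has no such key. */
    static void setMissingValue(Map<String, String> attrMap,
                                String attrName,
                                String defaultAttrValue) {
        if (!attrMap.containsKey(attrName)) {
            attrMap.put(attrName, defaultAttrValue);
        }
    }

    /**
     * Assumption: each listed attribute falls back to the word text.
     * The real choice of attributes and defaults is not documented here.
     */
    static Map<String, String> fillInMissing(Map<String, String> attrMap,
                                             String wordText,
                                             String... namesDefaultingToWordText) {
        for (String name : namesDefaultingToWordText) {
            setMissingValue(attrMap, name, wordText);
        }
        return attrMap;
    }
}
```

Existing entries are never overwritten, so a value supplied by the adorned file always takes precedence over a default.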