public class XGParser
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
(package private) java.util.List | adornedWordDataList - List of adorned word data entries. |
(package private) AdornedWordOutputter | adornerOutputter |
(package private) boolean | boolDot |
(package private) java.io.BufferedReader | brCurrent |
(package private) static java.lang.String | FILE_SEPARATOR - File separator. |
(package private) UnicodeReader | frCurrent |
(package private) java.util.Map<java.lang.Integer,XGPair> | hMap |
(package private) java.util.Map<java.lang.String,java.lang.String> | hmAttributes |
(package private) int | intCountNonBlanks |
(package private) int | intCountTags |
(package private) int | intCpt |
(package private) int | intID |
(package private) int | intLongWord |
(package private) int | intStrWordIndex |
(package private) int | intStrWordLength |
(package private) int | nextAdornedWord - Next adorned word to process. |
(package private) org.w3c.dom.NamedNodeMap | nnmEntities |
(package private) XGOptions | options |
(package private) java.lang.StringBuffer | sbWord |
(package private) java.util.Map<java.lang.Integer,java.lang.Integer> | splitWords - Map of multipart word IDs to # of parts. |
(package private) java.lang.String | strLine |
(package private) java.lang.String | strWord |
(package private) java.lang.String | surroundMarker - Surrounding sentence/phrase marker. |
(package private) int | surroundMarkerLength - Surround marker string length. |
(package private) java.lang.String | surroundMarkerTrim |
(package private) int | wordNodesCreated - Number of word nodes created. |
Constructor and Description |
---|
XGParser(XGOptions options, org.w3c.dom.Document document) - Create parser. |
Modifier and Type | Method and Description |
---|---|
protected org.w3c.dom.Node | cloneEntityReference(org.w3c.dom.EntityReference er, org.w3c.dom.Document doc) - Clone a read-only EntityReference into a writable Node. |
protected static org.w3c.dom.Node | cloneNode(org.w3c.dom.Node node) - Clone a node and its sub-elements. |
protected int | countNonBlankCharacters(java.lang.String strString) - Count non-blank characters in a String and update the tag HashMap. |
protected int | createNewNode(org.w3c.dom.Document doc, org.w3c.dom.Node node, org.w3c.dom.Node nodeChild, java.lang.String strCurrentPath, java.lang.Integer integerTagNumber) - Create new document node. |
java.lang.StringBuffer | extractText(org.w3c.dom.Node node) - Extract text from node. |
static java.lang.Object[] | extractText(XGOptions options, org.w3c.dom.Document document) - Extract text from DOM document. |
protected void | getNextEntry() - Reads next entry of adorner and updates appropriate class variables. |
int | getNumberOfAdornedWords() - Get number of adorned words. |
int | getRunningWordID() - Get word ID. |
static boolean | isPunctuationAndNotGap(java.lang.String s) - Check if string is punctuation, but not a gap span. |
static java.util.Map<java.lang.Integer,java.lang.Integer> | mergeAdornments(XGOptions options, XGParser instance, org.w3c.dom.Document document, java.lang.String segmentName, AdornedWordOutputter outputter, TextInputter inputter) - Merge adornments with original XML text. |
org.w3c.dom.Document | modifyDOM(org.w3c.dom.Node node, org.w3c.dom.Document doc, java.lang.String strCurrentPath) - Modify element to add adornments and remove initial text node. |
protected int | read() - Reads an integer from the adorner. |
void | setRunningWordID(int runningWordID) - Set running word ID. |
static org.w3c.dom.Document | textToDOM(XGOptions options, java.lang.String xmlText) - Create DOM from XML text. |
XGOptions options
java.util.Map<java.lang.Integer,XGPair> hMap
java.util.Map<java.lang.String,java.lang.String> hmAttributes
org.w3c.dom.NamedNodeMap nnmEntities
boolean boolDot
int intCountNonBlanks
int intCountTags
int intCpt
java.lang.String strLine
java.lang.StringBuffer sbWord
int intStrWordIndex
int intStrWordLength
java.lang.String strWord
int intLongWord
int intID
UnicodeReader frCurrent
java.io.BufferedReader brCurrent
AdornedWordOutputter adornerOutputter
int nextAdornedWord
java.util.List adornedWordDataList
java.lang.String surroundMarker
java.lang.String surroundMarkerTrim
int surroundMarkerLength
java.util.Map<java.lang.Integer,java.lang.Integer> splitWords
Records for each word split by soft or jump tags, the ID for that word and the number of parts into which it is split.
int wordNodesCreated
static final java.lang.String FILE_SEPARATOR
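The splitWords field above maps the ID of a word split by soft or jump tags to the number of parts it is split into. The following is a minimal sketch of that kind of bookkeeping, not XGParser's own code; the class SplitWordCounter and its methods recordPart and partsOf are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; XGParser's own bookkeeping is not shown on this page.
final class SplitWordCounter {
    // Maps a multipart word ID to the number of parts seen so far.
    private final Map<Integer, Integer> splitWords = new HashMap<>();

    // Record one more part for the given word ID.
    void recordPart(int wordID) {
        splitWords.merge(wordID, 1, Integer::sum);
    }

    // Number of parts recorded for a word ID, or 0 if none were recorded.
    int partsOf(int wordID) {
        return splitWords.getOrDefault(wordID, 0);
    }

    public static void main(String[] args) {
        SplitWordCounter counter = new SplitWordCounter();
        counter.recordPart(7);
        counter.recordPart(7);
        counter.recordPart(9);
        System.out.println("Parts of word 7: " + counter.partsOf(7));
    }
}
```

Using Map.merge keeps the increment to a single call and avoids a separate containsKey check.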
public XGParser(XGOptions options, org.w3c.dom.Document document)
Create parser.
Parameters:
options - Options for processing.
document - Document to process.

public void setRunningWordID(int runningWordID)
Set running word ID.
Parameters:
runningWordID - The running word ID.

public int getRunningWordID()
Get word ID.

public int getNumberOfAdornedWords()
Get number of adorned words.
protected int read() throws java.io.IOException, java.io.FileNotFoundException
Reads an integer from the adorner.
Returns:
An int read from the output stream. If the output is split into several files, multiple buffers are handled.
Throws:
java.io.IOException
java.io.FileNotFoundException
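The note on read() about output split into several files can be illustrated with a reader that moves on to the next source when the current one is exhausted. This is a sketch under assumptions, not the XGParser implementation: ChainedReader is a hypothetical class, and in-memory readers stand in for the adorner's output files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical reader that treats several sources as one continuous stream.
final class ChainedReader {
    private final Deque<BufferedReader> pending = new ArrayDeque<>();

    ChainedReader(List<? extends Reader> sources) {
        for (Reader source : sources) {
            pending.add(new BufferedReader(source));
        }
    }

    // Returns the next character as an int, or -1 once every source is exhausted.
    int read() throws IOException {
        while (!pending.isEmpty()) {
            int c = pending.peek().read();
            if (c != -1) {
                return c;
            }
            // The current source is finished: close it and try the next one.
            pending.poll().close();
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        ChainedReader reader = new ChainedReader(
            Arrays.asList(new StringReader("ab"), new StringReader("c")));
        StringBuilder sb = new StringBuilder();
        for (int c = reader.read(); c != -1; c = reader.read()) {
            sb.append((char) c);
        }
        System.out.println(sb);
    }
}
```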
protected void getNextEntry() throws java.io.IOException, java.io.FileNotFoundException
Reads next entry of adorner and updates appropriate class variables.
Throws:
java.io.IOException
java.io.FileNotFoundException
public java.lang.StringBuffer extractText(org.w3c.dom.Node node) throws java.io.IOException
Extract text from node.
Parameters:
node - The Node to parse.
Returns:
StringBuffer containing the element text, taking reading context into account. The algorithm used to parse children (soft, jump, hard tags) is the same as that in modifyDOM(org.w3c.dom.Node, org.w3c.dom.Document, java.lang.String).
Throws:
java.io.IOException
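The description of extractText(Node) mentions soft, jump, and hard tags without defining them. The sketch below shows one way such a traversal could look. It assumes that soft tags do not interrupt a word, that jump tag content is left out of the main text, and that every other (hard) tag acts as a word boundary. Those semantics, the class TextExtractionSketch, and the tag names hi and note are assumptions made for illustration; none of them come from this page.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: not XGParser's implementation.
final class TextExtractionSketch {
    // Hypothetical tag sets; in practice these would come from configuration.
    static final Set<String> SOFT_TAGS = new HashSet<>(Arrays.asList("hi"));
    static final Set<String> JUMP_TAGS = new HashSet<>(Arrays.asList("note"));

    static StringBuffer extractText(Node node) {
        StringBuffer sb = new StringBuffer();
        collect(node, sb);
        return sb;
    }

    private static void collect(Node node, StringBuffer sb) {
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                String name = child.getNodeName();
                if (JUMP_TAGS.contains(name)) {
                    continue; // Assumed: jump tag content is left out.
                }
                // Assumed: any tag that is not soft is hard and breaks words.
                boolean hard = !SOFT_TAGS.contains(name);
                if (hard) sb.append(' ');
                collect(child, sb);
                if (hard) sb.append(' ');
            }
        }
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<p>un<hi>break</hi>able<note>skip</note><q>word</q></p>");
        System.out.println("[" + extractText(doc.getDocumentElement()) + "]");
    }
}
```

Under these assumptions the soft tag hi leaves "unbreakable" as one word, the note content is dropped, and the hard tag q is set off by blanks.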
protected int createNewNode(org.w3c.dom.Document doc, org.w3c.dom.Node node, org.w3c.dom.Node nodeChild, java.lang.String strCurrentPath, java.lang.Integer integerTagNumber)
Create new document node.
Parameters:
doc - The document we're processing.
node - The current node we're processing.
nodeChild - The child node we're processing.
strCurrentPath - Current XML path to this node.
integerTagNumber - Integer tag number for path.

protected static org.w3c.dom.Node cloneNode(org.w3c.dom.Node node)
Clone a node and its sub-elements.
Parameters:
node - The Node to clone.
Returns:
The cloned Node.

protected org.w3c.dom.Node cloneEntityReference(org.w3c.dom.EntityReference er, org.w3c.dom.Document doc)
Clone a read-only EntityReference into a writable Node.
Parameters:
er - The EntityReference to clone.
doc - The parent Document.
Returns:
Node containing the same sub-elements as er, in writable form.

public org.w3c.dom.Document modifyDOM(org.w3c.dom.Node node, org.w3c.dom.Document doc, java.lang.String strCurrentPath) throws org.w3c.dom.DOMException, java.io.IOException
Modify element to add adornments and remove initial text node.
Parameters:
node - The Node to parse.
doc - The Document to modify.
strCurrentPath - The XPath of the last Node explored.
Returns:
The modified Document. The algorithm used to parse children (soft, jump, hard tags) is the same as that used in extractText(org.w3c.dom.Node).
Throws:
org.w3c.dom.DOMException
java.io.IOException
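The two cloning methods above can be approximated with the standard DOM API. The sketch below is not XGParser's code: deepClone uses Node.cloneNode(true), and copyChildren imports each child of a node into a DocumentFragment, which is one possible way to obtain writable copies of the children of a read-only node. The class CloneSketch and both method names are hypothetical, and an ordinary element stands in for an EntityReference.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: not XGParser's implementation.
final class CloneSketch {
    // Deep copy of a node together with all of its descendants.
    static Node deepClone(Node node) {
        return node.cloneNode(true);
    }

    // Copy each child of source into a new fragment owned by doc.
    static DocumentFragment copyChildren(Node source, Document doc) {
        DocumentFragment fragment = doc.createDocumentFragment();
        for (Node child = source.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            // importNode leaves the source child in place and returns a copy.
            fragment.appendChild(doc.importNode(child, true));
        }
        return fragment;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<r><a>x<b/></a></r>");
        Node a = doc.getDocumentElement().getFirstChild();
        Node clone = deepClone(a);
        clone.getFirstChild().setNodeValue("y"); // Only the copy changes.
        System.out.println("original: " + a.getFirstChild().getNodeValue());
        System.out.println("clone: " + clone.getFirstChild().getNodeValue());
    }
}
```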
protected int countNonBlankCharacters(java.lang.String strString) throws java.io.IOException
Count non-blank characters in a String and update the tag HashMap.
Parameters:
strString - The text to analyze. strString should have all whitespace characters mapped to blanks before this method is called.
Throws:
java.io.IOException
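The precondition on strString (whitespace already mapped to blanks) can be made concrete with a small sketch. It covers only the counting step and leaves out the tag HashMap update, whose details are not given on this page. NonBlankCounter and its two methods are hypothetical names, not part of XGParser.

```java
// Illustrative only: counts non-blank characters; the tag map update is omitted.
final class NonBlankCounter {
    // Map every whitespace character (tab, newline, etc.) to a plain blank.
    static String mapWhitespaceToBlanks(String s) {
        return s.replaceAll("\\s", " ");
    }

    // Count characters other than the blank; assumes whitespace is already mapped.
    static int countNonBlank(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != ' ') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String normalized = mapWhitespaceToBlanks("a\tb\n c");
        System.out.println(countNonBlank(normalized));
    }
}
```

If the mapping step were skipped, a tab or newline would be counted as a non-blank character, which is why the precondition matters.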
public static java.lang.Object[] extractText(XGOptions options, org.w3c.dom.Document document) throws java.io.IOException
Extract text from DOM document.
Parameters:
options - The processing options.
document - The document to process.
Throws:
java.io.IOException
public static java.util.Map<java.lang.Integer,java.lang.Integer> mergeAdornments(XGOptions options, XGParser instance, org.w3c.dom.Document document, java.lang.String segmentName, AdornedWordOutputter outputter, TextInputter inputter) throws java.io.IOException
Merge adornments with original XML text.
Parameters:
options - XGTagger options.
instance - XGParser instance.
document - Document being processed.
segmentName - Name of document segment being processed.
outputter - Adorned word outputter.
inputter - Text inputter.
Throws:
java.io.IOException
public static org.w3c.dom.Document textToDOM(XGOptions options, java.lang.String xmlText) throws java.io.IOException
Create DOM from XML text.
Parameters:
options - The processing options.
xmlText - The XML text.
Throws:
java.io.IOException
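Building a DOM from XML text, as textToDOM does, can be sketched with the JDK's javax.xml.parsers API. This version takes no XGOptions argument and is not the XGParser implementation; the class name DomFromText is hypothetical. Parser errors are wrapped in IOException to mirror the single exception type in the documented signature, which is a design choice of this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative only: a JDK-based stand-in, without the XGOptions parameter.
final class DomFromText {
    static Document textToDOM(String xmlText) throws IOException {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xmlText)));
        } catch (ParserConfigurationException | SAXException e) {
            // Report configuration and parse failures as IOException.
            throw new IOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Document doc = textToDOM("<doc><w>hello</w></doc>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```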
public static boolean isPunctuationAndNotGap(java.lang.String s)
Check if string is punctuation, but not a gap span.
Parameters:
s - String to check for being punctuation.
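A check of this kind can be sketched as a punctuation test combined with an exclusion for a gap marker. This page does not say how a gap span is represented, so GAP_MARKER below is a made-up value chosen only so that it would otherwise pass the punctuation test. The class PunctuationCheck is hypothetical, and the regular expression covers ASCII punctuation only.

```java
import java.util.regex.Pattern;

// Illustrative only: the real gap marker and punctuation rules are not documented here.
final class PunctuationCheck {
    // Hypothetical gap marker, invented for this sketch.
    static final String GAP_MARKER = "^^^";

    // One or more ASCII punctuation characters.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}+");

    static boolean isPunctuationAndNotGap(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        return PUNCT.matcher(s).matches() && !s.contains(GAP_MARKER);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuationAndNotGap("?!"));
        System.out.println(isPunctuationAndNotGap(GAP_MARKER));
    }
}
```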