public class ExtendedAdornedWordFilter extends ExtendedXMLFilterImpl
Modifier and Type | Class | Description |
---|---|---|
(package private) class | ExtendedAdornedWordFilter.MainFront | Holds main/front information. |
Modifier and Type | Field | Description |
---|---|---|
protected QueueStack<java.lang.String> | divStack | Div tag stack. |
protected java.lang.String | firstWordID | First word ID found in text. |
protected int | gapCount | Gap count for generating IDs. |
protected boolean | generateGapWords | Generate words for gaps. |
protected java.util.List<java.lang.String> | idList | List of String word IDs. |
protected java.util.Map<java.lang.String,ExtendedAdornedWord> | idToWordInfo | Map word ID to list of adorned word information objects. |
protected java.lang.String | lastElementName | Last element name encountered. |
protected java.lang.String | lastID | Last word ID. |
protected ExtendedAdornedWord | lastWordInfo | Last word encountered. |
protected java.util.List<ExtendedAdornedWord> | lastWordList | List of current last word for a text section. |
protected ExtendedAdornedWord | lastWordPartInfo | Last word part encountered for a split word. |
protected java.util.List<ExtendedAdornedWord> | lastWordPartList | List of current last word part for a text section. |
protected java.util.Map<java.lang.String,ExtendedAdornedWord> | leadingGapWords | Gap words which appear before first real word. |
protected int | pageNumber | Running page number. |
protected XMLTagClassifier | tagClassifier | XML tag classes. |
protected java.util.List<java.lang.String> | tagList | List of tags for determining node ancestry of each word. |
Constructor | Description |
---|---|
ExtendedAdornedWordFilter(org.xml.sax.XMLReader reader) | Create adorned word info filter. |
ExtendedAdornedWordFilter(org.xml.sax.XMLReader reader, boolean generateGapWords) | Create adorned word info filter. |
ExtendedAdornedWordFilter(org.xml.sax.XMLReader reader, XMLTagClassifier tagClassifier, boolean generateGapWords) | Create adorned word info filter. |
Modifier and Type | Method | Description |
---|---|---|
protected void | addWordOrdinals() | Generate missing word ordinals. |
void | characters(char[] ch, int start, int length) | Handle character data. |
void | endDocument() | End of document found. |
void | endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) | Handle end of an element. |
java.util.List<java.lang.String> | findWordsByMatchingLeadingPath(java.lang.String pattern) | Find words whose paths start with a given string. |
java.util.List<java.lang.String> | findWordsByMatchingPath(java.lang.String pattern) | Find words matching a specified path regular expression pattern. |
protected void | fixLeadingGapWords() | Correct IDs for leading gap words. |
protected void | generateMissingExtendedAdornedWordInformation() | Generate missing adorned word information. |
java.util.List<java.lang.String> | getAdornedWordIDs() | Return list of adorned word IDs. |
java.util.List<java.lang.String> | getAdornedWordIDsInReadingContextOrder() | Get adorned word IDs in reading context order. |
int | getAdornedWordIndexByID(java.lang.String id) | Get index for a word ID. |
protected java.lang.String | getDivType() | Get nearest ancestral div type for word. |
ExtendedAdornedWord | getExtendedAdornedWord(int index) | Get adorned word information for a word index. |
ExtendedAdornedWord | getExtendedAdornedWord(java.lang.String id) | Get adorned word information for a word ID. |
protected boolean | getInJumpTag() | Get in jump tag flag. |
protected ExtendedAdornedWordFilter.MainFront | getMainFront() | Get main/side and front/main/back text divisions. |
int | getNumberOfWords() | Get number of words read. |
java.util.List<java.lang.String> | getRelatedSplitWordIDs(java.lang.String wordID) | Get related adorned word IDs for a word ID of a split word. |
java.util.List<ExtendedAdornedWord> | getRelatedSplitWords(ExtendedAdornedWord adornedWordInfo) | Get related adorned words. |
protected java.util.List<java.lang.String> | getSelectedWordIDs(java.lang.String startingWordID, java.lang.String endingWordID) | Get list of selected word IDs from specified ID range. |
java.util.List<java.util.List<ExtendedAdornedWord>> | getSentences() | Get adorned words as a list of sentences. |
protected java.util.List<java.util.List<ExtendedAdornedWord>> | getSentencesFromEOS() | Get adorned words as a list of sentences using EOS attributes. |
protected java.util.List<java.util.List<ExtendedAdornedWord>> | getSentencesFromSentenceNumbers() | Get adorned words as a list of sentences using sentence numbers. |
java.util.List<java.lang.String> | getSiblingWordIDs(java.lang.String wordID) | Get sibling words. |
protected boolean | getSpoken() | Get spoken/not spoken word flag. |
protected boolean | getUnclear() | Get unclear word flag. |
protected boolean | getVerse() | Get verse flag. |
protected java.lang.String | getWordText(ExtendedAdornedWord adornedWord) | Get word text for an extended adorned word. |
void | ignorableWhitespace(char[] ch, int start, int length) | Handle whitespace. |
java.lang.String[] | splitPath(java.lang.String path) | Split word path into separate tags. |
java.lang.String[] | splitPathFull(java.lang.String path) | Split word path into separate tags. |
void | startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts) | Handle start of an XML element. |
java.lang.String | trimTag(java.lang.String tag) | Trim tag number from XML tag. |
java.lang.String | trimTrailingSoftTags(java.lang.String path) | Remove trailing soft tags from a path. |
Methods inherited from class ExtendedXMLFilterImpl: removeAttribute, setAttributeValue, setAttributeValue, setAttributeValue
Methods inherited from class org.xml.sax.helpers.XMLFilterImpl: endPrefixMapping, error, fatalError, getContentHandler, getDTDHandler, getEntityResolver, getErrorHandler, getFeature, getParent, getProperty, notationDecl, parse, parse, processingInstruction, resolveEntity, setContentHandler, setDocumentLocator, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setParent, setProperty, skippedEntity, startDocument, startPrefixMapping, unparsedEntityDecl, warning
public ExtendedAdornedWordFilter(org.xml.sax.XMLReader reader)
Parameters:
reader - XML input reader to which this filter applies.
public ExtendedAdornedWordFilter(org.xml.sax.XMLReader reader, boolean generateGapWords)
Parameters:
reader - XML input reader to which this filter applies.
generateGapWords - true to generate "words" for gaps.
public ExtendedAdornedWordFilter(org.xml.sax.XMLReader reader, XMLTagClassifier tagClassifier, boolean generateGapWords)
Parameters:
reader - XML input reader to which this filter applies.
tagClassifier - XML tag class.
generateGapWords - true to generate "words" for gaps.
public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts) throws org.xml.sax.SAXException
Specified by: startElement in interface org.xml.sax.ContentHandler
Overrides: startElement in class org.xml.sax.helpers.XMLFilterImpl
Parameters:
uri - The XML element's URI.
localName - The XML element's local name.
qName - The XML element's qname.
atts - The XML element's attributes.
Throws:
org.xml.sax.SAXException
public void characters(char[] ch, int start, int length) throws org.xml.sax.SAXException
Specified by: characters in interface org.xml.sax.ContentHandler
Overrides: characters in class org.xml.sax.helpers.XMLFilterImpl
Parameters:
ch - Array of characters.
start - The starting position in the array.
length - The number of characters.
Throws:
org.xml.sax.SAXException - If there is an error.
public void ignorableWhitespace(char[] ch, int start, int length) throws org.xml.sax.SAXException
Specified by: ignorableWhitespace in interface org.xml.sax.ContentHandler
Overrides: ignorableWhitespace in class org.xml.sax.helpers.XMLFilterImpl
Parameters:
ch - Array of characters.
start - The starting position in the array.
length - The number of characters.
Throws:
org.xml.sax.SAXException - If there is an error.
public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) throws org.xml.sax.SAXException
Specified by: endElement in interface org.xml.sax.ContentHandler
Overrides: endElement in class org.xml.sax.helpers.XMLFilterImpl
Parameters:
uri - The XML element's URI.
localName - The XML element's local name.
qName - The XML element's qname.
Throws:
org.xml.sax.SAXException
public void endDocument() throws org.xml.sax.SAXException
Specified by: endDocument in interface org.xml.sax.ContentHandler
Overrides: endDocument in class org.xml.sax.helpers.XMLFilterImpl
Throws:
org.xml.sax.SAXException
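The handler methods above follow the usual SAX filter pattern: override startElement, characters, and endElement, record what is needed, then pass the event on to the superclass. The following is a minimal sketch of that pattern only, not the actual ExtendedAdornedWordFilter. The class name WordCollectingFilterSketch, the helper readWords, the element name w, and the attribute name xml:id are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-in for illustration; not the real ExtendedAdornedWordFilter.
public class WordCollectingFilterSketch extends XMLFilterImpl {
    private final List<String> idList = new ArrayList<>();
    private final Map<String, String> idToText = new LinkedHashMap<>();
    private final StringBuilder currentText = new StringBuilder();
    private String currentID; // non-null only while inside a word element with an ID

    public WordCollectingFilterSketch(XMLReader reader) {
        super(reader);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Assumption: words are <w> elements identified by an xml:id attribute.
        if ("w".equals(qName)) {
            currentID = atts.getValue("xml:id");
            currentText.setLength(0);
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (currentID != null) {
            currentText.append(ch, start, length);
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("w".equals(qName) && currentID != null) {
            idList.add(currentID);
            idToText.put(currentID, currentText.toString());
            currentID = null;
        }
        super.endElement(uri, localName, qName);
    }

    public List<String> getWordIDs() {
        return idList;
    }

    public String getWordText(String id) {
        return idToText.get(id);
    }

    // Convenience helper (hypothetical): wrap a default JAXP reader and parse a string.
    public static WordCollectingFilterSketch readWords(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            WordCollectingFilterSketch filter = new WordCollectingFilterSketch(reader);
            filter.parse(new InputSource(new StringReader(xml)));
            return filter;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Because the filter wraps an XMLReader, it can also sit in the middle of a longer filter chain; events still reach any downstream content handler through the super calls.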
protected ExtendedAdornedWordFilter.MainFront getMainFront()
public int getNumberOfWords()
protected boolean getSpoken()
protected boolean getVerse()
protected boolean getUnclear()
protected boolean getInJumpTag()
protected java.lang.String getDivType()
public java.util.List<java.lang.String> getAdornedWordIDs()
public ExtendedAdornedWord getExtendedAdornedWord(java.lang.String id)
Parameters:
id - The String word ID.
public ExtendedAdornedWord getExtendedAdornedWord(int index)
Parameters:
index - The word index.
public int getAdornedWordIndexByID(java.lang.String id)
Parameters:
id - The String word ID.
public java.util.List<java.lang.String> getAdornedWordIDsInReadingContextOrder()
public java.util.List<java.util.List<ExtendedAdornedWord>> getSentences()
This method tries to return sentences in as close to their order of appearance in the text as possible. Sentences from intrusive jump tags will generally appear after the text section into which they intrude, and so may be dislodged an arbitrary distance from their actual position in the text.
protected java.util.List<java.util.List<ExtendedAdornedWord>> getSentencesFromEOS()
This method tries to return sentences in as close to their order of appearance in the text as possible. Sentences from intrusive jump tags will generally appear after the text section into which they intrude, and so may be dislodged an arbitrary distance from their actual position in the text.
protected java.util.List<java.util.List<ExtendedAdornedWord>> getSentencesFromSentenceNumbers()
This method tries to return sentences in as close to their order of appearance in the text as possible. Sentences from intrusive jump tags will generally appear after the text section into which they intrude, and so may be dislodged an arbitrary distance from their actual position in the text.
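As a rough illustration of building sentences from end-of-sentence (EOS) flags, the sketch below walks the words in order and closes a sentence whenever a word is flagged. It is a simplification under stated assumptions: the Word class and its eos field are hypothetical stand-ins for ExtendedAdornedWord, and the handling of intrusive jump tags described above is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real method works on ExtendedAdornedWord objects.
public class SentenceGroupingSketch {

    /** Minimal stand-in for an adorned word: its text and an end-of-sentence flag. */
    public static final class Word {
        public final String text;
        public final boolean eos;

        public Word(String text, boolean eos) {
            this.text = text;
            this.eos = eos;
        }
    }

    /** Groups words into sentences, ending a sentence after each word flagged as EOS. */
    public static List<List<Word>> groupByEOS(List<Word> words) {
        List<List<Word>> sentences = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        for (Word word : words) {
            current.add(word);
            if (word.eos) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing words that were never closed by an EOS flag.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```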
public java.util.List<java.lang.String> getRelatedSplitWordIDs(java.lang.String wordID)
Parameters:
wordID - Word ID for which related IDs are wanted.
Related word IDs are the word IDs for the other parts of a split word. The returned list includes the given wordID.
For unsplit words, the single given wordID is returned in the list.
Null is returned when the wordID does not exist.
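The three cases described above (split word, unsplit word, unknown ID) can be sketched as follows. How the real class links the parts of a split word is not stated here, so this example assumes a hypothetical group key shared by all parts of one word; the class name SplitWordLookupSketch and the method addWord are likewise illustrative only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the documented return contract, not the real implementation.
public class SplitWordLookupSketch {
    // Word ID -> group key, kept in order of appearance.
    // Assumption: all parts of a split word share one group key.
    private final Map<String, String> idToGroup = new LinkedHashMap<>();

    public void addWord(String wordID, String groupKey) {
        idToGroup.put(wordID, groupKey);
    }

    /**
     * Returns the IDs of all parts of the word that wordID belongs to,
     * including wordID itself, or null when wordID is unknown.
     */
    public List<String> getRelatedSplitWordIDs(String wordID) {
        String group = idToGroup.get(wordID);
        if (group == null) {
            return null;
        }
        List<String> related = new ArrayList<>();
        for (Map.Entry<String, String> entry : idToGroup.entrySet()) {
            if (entry.getValue().equals(group)) {
                related.add(entry.getKey());
            }
        }
        return related;
    }
}
```

Callers should check for null before iterating, since an unknown ID yields null rather than an empty list.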
public java.util.List<ExtendedAdornedWord> getRelatedSplitWords(ExtendedAdornedWord adornedWordInfo)
Parameters:
adornedWordInfo - Adorned word for which related words are wanted.
Related words are those corresponding to the parts of a split word. The returned list includes the given word.
For unsplit words, the single given adorned word is returned in the list.
public java.lang.String trimTag(java.lang.String tag)
Parameters:
tag - XML tag to trim.
public java.lang.String[] splitPathFull(java.lang.String path)
Parameters:
path - The word path.
public java.lang.String[] splitPath(java.lang.String path)
Parameters:
path - The word path.
public java.lang.String trimTrailingSoftTags(java.lang.String path)
Parameters:
path - Path from which to remove trailing soft tags.
public java.util.List<java.lang.String> getSiblingWordIDs(java.lang.String wordID)
Parameters:
wordID - The word ID of the word for which to find siblings.
Sibling words have the same parent hard or jump tag.
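The path helpers above (trimming a tag number, splitting a path into tags, removing trailing soft tags) might look roughly like the sketch below. The path syntax is an assumption: this example uses "/" as the separator and a "[n]" suffix for tag numbers, and it takes the set of soft tag names as a plain Set rather than consulting an XMLTagClassifier. The method names stripTagNumber and splitTags are hypothetical and are not claimed to match trimTag, splitPath, or splitPathFull.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; separator and tag-number syntax are assumptions.
public class WordPathSketch {

    /** Removes a trailing "[n]" tag number, e.g. "p[2]" becomes "p". */
    public static String stripTagNumber(String tag) {
        int bracket = tag.indexOf('[');
        return bracket < 0 ? tag : tag.substring(0, bracket);
    }

    /** Splits a "/"-separated path into its tags, skipping empty segments. */
    public static List<String> splitTags(String path) {
        List<String> tags = new ArrayList<>();
        for (String part : path.split("/")) {
            if (!part.isEmpty()) {
                tags.add(part);
            }
        }
        return tags;
    }

    /** Drops tags from the end of the path for as long as they are soft tags. */
    public static String trimTrailingSoftTags(String path, Set<String> softTags) {
        List<String> tags = splitTags(path);
        while (!tags.isEmpty()
                && softTags.contains(stripTagNumber(tags.get(tags.size() - 1)))) {
            tags.remove(tags.size() - 1);
        }
        return "/" + String.join("/", tags);
    }
}
```

Under these assumptions, two words whose paths trim to the same parent path would count as siblings in the sense described above.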
public java.util.List<java.lang.String> findWordsByMatchingPath(java.lang.String pattern)
Parameters:
pattern - The regular expression pattern to match.
public java.util.List<java.lang.String> findWordsByMatchingLeadingPath(java.lang.String pattern)
Parameters:
pattern - The pattern to match.
protected java.util.List<java.lang.String> getSelectedWordIDs(java.lang.String startingWordID, java.lang.String endingWordID)
Parameters:
startingWordID - Starting word ID.
endingWordID - Ending word ID.
protected void addWordOrdinals()
protected void fixLeadingGapWords()
protected void generateMissingExtendedAdornedWordInformation()
Generates any missing word and sentence numbers, end of sentence flags, word paths, gap IDs, etc.
protected java.lang.String getWordText(ExtendedAdornedWord adornedWord)
Parameters:
adornedWord - The extended adorned word.
The word text for an unsplit word is just the word's text. The word text for a split word is the joined text for all word parts, in order of appearance of the parts.
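A minimal sketch of the joining rule described above: order the parts by where they appear, then concatenate their text. The Part class and its position field are hypothetical, and joining without a separator is an assumption, since the description only says the text is "joined".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the real method takes an ExtendedAdornedWord.
public class SplitWordTextSketch {

    /** Stand-in for one part of a word: its text and its position in the document. */
    public static final class Part {
        public final String text;
        public final int position;

        public Part(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    /** Joins part texts in order of appearance; an unsplit word has a single part. */
    public static String joinInOrder(List<Part> parts) {
        List<Part> sorted = new ArrayList<>(parts);
        sorted.sort(Comparator.comparingInt((Part p) -> p.position));
        StringBuilder joined = new StringBuilder();
        for (Part part : sorted) {
            joined.append(part.text); // assumption: no separator between parts
        }
        return joined.toString();
    }
}
```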