public class TransitionMatrix extends IsCloseableObject implements UsesLogger
Holds the unigram, bigram, and trigram counts and probabilities.
Call calculateProbabilities() to calculate tag transition probabilities. Weights for the ngrams are computed using deleted interpolation.
Modifier and Type | Field and Description |
---|---|
protected static int | BIGRAM |
protected Map2D<java.lang.String,java.lang.String,java.lang.Integer> | bigramCountMap |
protected Map2D<java.lang.String,java.lang.String,java.lang.Double> | bigramProbMap |
protected double[] | bigramWeights - Bigram weights from deleted interpolation. |
protected static boolean | debug - True if debugging output enabled. |
protected boolean | haveProbabilities - True if probabilities calculated. |
protected Logger | logger - Logger used for output. |
protected int[] | totalNGrams - Total ngram tag counts. |
protected int | totalWords - Total number of words. |
protected static int | TRIGRAM |
protected Map3D<java.lang.String,java.lang.String,java.lang.String,java.lang.Integer> | trigramCountMap |
protected Map3D<java.lang.String,java.lang.String,java.lang.String,java.lang.Double> | trigramProbMap |
protected double[] | trigramWeights - Trigram weights from deleted interpolation. |
protected static int | UNIGRAM - Constants for clarification. |
protected java.util.Map<java.lang.String,java.lang.Integer> | unigramCountMap - HashMaps with part of speech tags as the keys and counts as the values. |
protected java.util.Map<java.lang.String,java.lang.Double> | unigramProbMap - HashMaps with part of speech tags as the keys and transition probability as the values. |
protected int[] | uniqueNGrams - Unique ngram tag counts. |
Constructor and Description |
---|
TransitionMatrix() |
Modifier and Type | Method and Description |
---|---|
void | calculateProbabilities() - Calculate transition probabilities from counts. |
java.util.Set<java.lang.String> | columnKeySet() - Get column key set. |
protected void | computeBigramWeights() - Calculate bigram weights for contextual smoothing. |
protected void | computeTrigramWeights() - Calculate trigram weights for contextual smoothing. |
void | displayNGramCounts() - Display the ngram counts. |
double[] | getBigramWeights() - Return weights for bigrams using deleted interpolation. |
int | getCount(java.lang.String tag) - Look up unigram count. |
int | getCount(java.lang.String tag1, java.lang.String tag2) - Look up bigram count. |
int | getCount(java.lang.String tag1, java.lang.String tag2, java.lang.String tag3) - Look up trigram count. |
Logger | getLogger() - Get the logger. |
double | getProbability(java.lang.String tag) - Look up unigram probability. |
double | getProbability(java.lang.String tag1, java.lang.String tag2) - Look up bigram probability. |
double | getProbability(java.lang.String tag1, java.lang.String tag2, java.lang.String tag3) - Look up trigram probability. |
int | getTotalWordCount() - Get total number of words. |
double[] | getTrigramWeights() - Return weights for trigrams using deleted interpolation. |
void | incrementCount(java.lang.String tag, int increment) - Increment unigram tag count. |
void | incrementCount(java.lang.String tag1, java.lang.String tag2, int increment) - Increment bigram tag count. |
void | incrementCount(java.lang.String tag1, java.lang.String tag2, java.lang.String tag3, int increment) - Increment trigram tag count. |
void | loadTransitionMatrix(java.io.Reader reader, char delimChar) - Load transition matrix from a reader. |
void | loadTransitionMatrix(java.net.URL url, boolean compressed, java.lang.String encoding, char delimChar) - Load transition matrix from a URL. |
void | loadTransitionMatrix(java.net.URL url, java.lang.String encoding, char delimChar) - Load transition matrix from a URL. |
java.util.Set<java.lang.String> | rowKeySet() - Get row key set. |
double | safelyDivideCount(int numerator, int denominator) - Safely divide two counts. |
double | safelyDivideSmoothedCount(int numerator, int denominator) - Safely divide two counts. |
void | saveTransitionMatrix(java.lang.String transitionFileName, java.lang.String encoding, char delimChar) - Save transition matrix to a file. |
void | saveTransitionMatrix(java.io.Writer writer, char delimChar) - Save transition matrix to a writer. |
void | setLogger(Logger logger) - Set the logger. |
java.util.Set<java.lang.String> | sliceKeySet() - Get slice key set. |
Methods inherited from class IsCloseableObject: close
protected static boolean debug
protected java.util.Map<java.lang.String,java.lang.Integer> unigramCountMap
protected Map2D<java.lang.String,java.lang.String,java.lang.Integer> bigramCountMap
protected Map3D<java.lang.String,java.lang.String,java.lang.String,java.lang.Integer> trigramCountMap
protected java.util.Map<java.lang.String,java.lang.Double> unigramProbMap
protected Map2D<java.lang.String,java.lang.String,java.lang.Double> bigramProbMap
protected Map3D<java.lang.String,java.lang.String,java.lang.String,java.lang.Double> trigramProbMap
protected int[] totalNGrams
protected int[] uniqueNGrams
protected int totalWords
protected boolean haveProbabilities
protected double[] bigramWeights
protected double[] trigramWeights
protected static final int UNIGRAM
protected static final int BIGRAM
protected static final int TRIGRAM
protected Logger logger
public Logger getLogger()
getLogger in interface UsesLogger
public void setLogger(Logger logger)
setLogger in interface UsesLogger
logger - The logger.
public void incrementCount(java.lang.String tag, int increment)
tag - The part of speech tag.
increment - The increment.
public void incrementCount(java.lang.String tag1, java.lang.String tag2, int increment)
tag1 - The first part of speech tag.
tag2 - The second part of speech tag.
increment - The increment.
public void incrementCount(java.lang.String tag1, java.lang.String tag2, java.lang.String tag3, int increment)
tag1 - The first part of speech tag.
tag2 - The second part of speech tag.
tag3 - The third part of speech tag.
increment - The increment.
public double safelyDivideCount(int numerator, int denominator)
numerator - The undiscounted numerator value.
denominator - The undiscounted denominator value.
public double safelyDivideSmoothedCount(int numerator, int denominator)
numerator - The undiscounted numerator value.
denominator - The undiscounted denominator value.
public void calculateProbabilities()
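As a rough illustration of turning counts into transition probabilities, the sketch below derives a maximum likelihood estimate with a guarded division. The class MleSketch, its method names, the tab-joined map keys, and the choice to return 0.0 for a non-positive denominator are all assumptions made for this example; this page does not say how safelyDivideCount treats that case.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the TransitionMatrix implementation.
final class MleSketch {
    // Assumed behaviour: a zero or negative denominator yields 0.0.
    static double safeDivide(int numerator, int denominator) {
        return denominator > 0 ? (double) numerator / denominator : 0.0;
    }

    // Maximum likelihood estimate: p(tag2 | tag1) = count(tag1, tag2) / count(tag1).
    // Bigram keys are "tag1<TAB>tag2" purely for brevity in this example.
    static double bigramMle(Map<String, Integer> unigrams,
                            Map<String, Integer> bigrams,
                            String tag1, String tag2) {
        int pairCount = bigrams.getOrDefault(tag1 + "\t" + tag2, 0);
        int contextCount = unigrams.getOrDefault(tag1, 0);
        return safeDivide(pairCount, contextCount);
    }

    public static void main(String[] args) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        unigrams.put("DT", 4);
        bigrams.put("DT\tNN", 3);
        // Seen context: pair count 3 over context count 4.
        System.out.println(bigramMle(unigrams, bigrams, "DT", "NN"));
        // Unseen context: the guarded division avoids dividing by zero.
        System.out.println(bigramMle(unigrams, bigrams, "VB", "NN"));
    }
}
```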
protected void computeTrigramWeights()
The trigram weights are computed using deleted interpolation.
protected void computeBigramWeights()
The bigram weights are computed using deleted interpolation.
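For readers unfamiliar with deleted interpolation, here is a minimal sketch of the bigram case in the style commonly used by trigram taggers. Each observed bigram votes, with its count, for whichever estimate (bigram or unigram) is larger after removing one occurrence, and the votes are then normalized. The class DeletedInterpolationSketch, the nested-map layout, the tie-breaking rule, and the zero result for an empty denominator are assumptions; the actual computeBigramWeights() may differ in detail. The trigram case follows the same pattern with a third candidate estimate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the TransitionMatrix implementation.
final class DeletedInterpolationSketch {
    // Assumed convention: an empty denominator contributes 0.0.
    static double safeDivide(int numerator, int denominator) {
        return denominator > 0 ? (double) numerator / denominator : 0.0;
    }

    // Returns { lambda[0] (bigram weight), lambda[1] (unigram weight) }.
    static double[] bigramWeights(Map<String, Integer> unigrams,
                                  Map<String, Map<String, Integer>> bigrams,
                                  int totalTags) {
        double[] lambda = new double[2];
        for (Map.Entry<String, Map<String, Integer>> row : bigrams.entrySet()) {
            String tag1 = row.getKey();
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                int count = cell.getValue();
                // Estimates with one occurrence "deleted" from each count.
                double viaBigram =
                    safeDivide(count - 1, unigrams.getOrDefault(tag1, 0) - 1);
                double viaUnigram =
                    safeDivide(unigrams.getOrDefault(cell.getKey(), 0) - 1, totalTags - 1);
                // The larger estimate receives this bigram's count (ties favor the bigram).
                if (viaBigram >= viaUnigram) {
                    lambda[0] += count;
                } else {
                    lambda[1] += count;
                }
            }
        }
        double sum = lambda[0] + lambda[1];
        if (sum > 0) {
            lambda[0] /= sum;
            lambda[1] /= sum;
        }
        return lambda;
    }

    public static void main(String[] args) {
        Map<String, Integer> unigrams = new HashMap<>();
        unigrams.put("A", 3);
        unigrams.put("B", 3);
        unigrams.put("C", 1);
        Map<String, Map<String, Integer>> bigrams = new HashMap<>();
        bigrams.computeIfAbsent("A", k -> new HashMap<>()).put("B", 3);
        bigrams.computeIfAbsent("B", k -> new HashMap<>()).put("A", 2);
        bigrams.computeIfAbsent("C", k -> new HashMap<>()).put("A", 1);
        double[] lambda = bigramWeights(unigrams, bigrams, 7);
        System.out.println(lambda[0] + " " + lambda[1]);
    }
}
```

Because the votes are normalized at the end, the returned weights sum to 1.0 whenever at least one bigram was observed.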
public int getCount(java.lang.String tag)
tag - The part of speech tag.
public int getCount(java.lang.String tag1, java.lang.String tag2)
tag1 - The first part of speech tag.
tag2 - The second part of speech tag.
public int getCount(java.lang.String tag1, java.lang.String tag2, java.lang.String tag3)
tag1 - The first part of speech tag.
tag2 - The second part of speech tag.
tag3 - The third part of speech tag.
public double getProbability(java.lang.String tag)
tag - The part of speech tag.
public double getProbability(java.lang.String tag1, java.lang.String tag2)
tag1 - The first part of speech tag.
tag2 - The second part of speech tag.
public double getProbability(java.lang.String tag1, java.lang.String tag2, java.lang.String tag3)
tag1 - The first part of speech tag.
tag2 - The second part of speech tag.
tag3 - The third part of speech tag.
public java.util.Set<java.lang.String> rowKeySet()
public java.util.Set<java.lang.String> columnKeySet()
public java.util.Set<java.lang.String> sliceKeySet()
public int getTotalWordCount()
public void loadTransitionMatrix(java.net.URL url, boolean compressed, java.lang.String encoding, char delimChar) throws java.io.IOException
url - URL from which to load transition matrix.
compressed - true if gzip compressed.
encoding - Character encoding for file text.
delimChar - Column separator character. Usually a tab (\t).
java.io.IOException - when an I/O error occurs.
public void loadTransitionMatrix(java.net.URL url, java.lang.String encoding, char delimChar) throws java.io.IOException
url - URL from which to load transition matrix.
encoding - Character encoding for file text.
delimChar - Column separator character. Usually a tab (\t).
java.io.IOException - when an I/O error occurs.
public void loadTransitionMatrix(java.io.Reader reader, char delimChar) throws java.io.IOException
reader - Reader from which to read transition matrix.
delimChar - Column separator character. Usually a tab (\t).
java.io.IOException - when an I/O error occurs.
public void displayNGramCounts()
public void saveTransitionMatrix(java.lang.String transitionFileName, java.lang.String encoding, char delimChar) throws java.io.IOException
transitionFileName - File to receive the transition matrix.
encoding - Character encoding for file text.
delimChar - Column separator character. Usually a tab (\t).
java.io.IOException - when an I/O error occurs.
Each unigram, bigram, and trigram entry in the transition matrix is saved in a columnar format, with the specified delimiter character acting as the column separator. The counts are saved rather than the probabilities, so that different smoothing methods can be applied without recreating the training data.
tag count
tag1 tag2 count
tag1 tag2 tag3 count
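To make the columnar layout concrete, the following sketch writes one entry of each order and recovers the n-gram order of a line from its column count. The class CountFormatSketch and its methods are hypothetical helpers written for this example, not part of TransitionMatrix, and the sketch assumes that tags never contain the delimiter character.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.regex.Pattern;

// Hypothetical sketch of the saved layout; not the TransitionMatrix implementation.
final class CountFormatSketch {
    // Joins the tags with the delimiter and appends the count as the last column.
    static String formatLine(char delimChar, int count, String... tags) {
        return String.join(String.valueOf(delimChar), tags) + delimChar + count;
    }

    // Writes one entry followed by a newline.
    static void writeEntry(Writer writer, char delimChar, int count, String... tags)
            throws IOException {
        writer.write(formatLine(delimChar, count, tags));
        writer.write('\n');
    }

    // The last column is the count, so the n-gram order is the column count minus one.
    static int orderOf(String line, char delimChar) {
        return line.split(Pattern.quote(String.valueOf(delimChar)), -1).length - 1;
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeEntry(out, '\t', 120, "NN");             // unigram entry
        writeEntry(out, '\t', 45, "DT", "NN");        // bigram entry
        writeEntry(out, '\t', 7, "DT", "JJ", "NN");   // trigram entry
        System.out.print(out);
    }
}
```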
public void saveTransitionMatrix(java.io.Writer writer, char delimChar) throws java.io.IOException
writer - Writer to use to save transition matrix.
delimChar - Column separator character. Usually a tab (\t).
java.io.IOException - when an I/O error occurs.
public double[] getBigramWeights()
The lambda values sum to 1.0. The adjusted probability for a bigram is computed from the maximum likelihood (i.e., undiscounted) probabilities as follows:
p*( tag2 | tag1 ) =
lambda[0] * p( tag2 | tag1 ) +
lambda[1] * p( tag2 )
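The formula above can be applied directly once the weights and the two maximum likelihood probabilities are known. The sketch below is a hypothetical helper (BigramInterpolationSketch is not part of this class), and the numeric values are made up for illustration.

```java
// Hypothetical sketch; not the TransitionMatrix implementation.
final class BigramInterpolationSketch {
    // p*(tag2 | tag1) = lambda[0] * p(tag2 | tag1) + lambda[1] * p(tag2)
    static double smoothed(double[] lambda, double pBigram, double pUnigram) {
        return lambda[0] * pBigram + lambda[1] * pUnigram;
    }

    public static void main(String[] args) {
        double[] lambda = {0.7, 0.3};   // illustrative weights that sum to 1.0
        double pBigram = 0.5;           // illustrative p(tag2 | tag1)
        double pUnigram = 0.1;          // illustrative p(tag2)
        System.out.println(smoothed(lambda, pBigram, pUnigram));
    }
}
```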
public double[] getTrigramWeights()
The lambda values sum to 1.0. The adjusted probability for a trigram is computed from the maximum likelihood (i.e., undiscounted) probabilities as follows:
p*( tag3 | tag1 , tag2 ) =
lambda[0] * p( tag3 | tag1 , tag2 ) +
lambda[1] * p( tag3 | tag2 ) +
lambda[2] * p( tag3 )
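One practical consequence of the trigram formula is that a trigram never seen in training still receives a nonzero adjusted probability, as long as its bigram or unigram estimate is nonzero. The sketch below shows this with made-up values; TrigramInterpolationSketch is a hypothetical helper, not part of this class.

```java
// Hypothetical sketch; not the TransitionMatrix implementation.
final class TrigramInterpolationSketch {
    // p*(tag3 | tag1, tag2) = lambda[0] * p(tag3 | tag1, tag2)
    //                       + lambda[1] * p(tag3 | tag2)
    //                       + lambda[2] * p(tag3)
    static double smoothed(double[] lambda, double pTrigram,
                           double pBigram, double pUnigram) {
        if (lambda.length != 3) {
            throw new IllegalArgumentException("expected three weights");
        }
        return lambda[0] * pTrigram + lambda[1] * pBigram + lambda[2] * pUnigram;
    }

    public static void main(String[] args) {
        double[] lambda = {0.6, 0.3, 0.1};  // illustrative weights that sum to 1.0
        // Unseen trigram: its maximum likelihood probability is 0.0,
        // but the lower-order terms still contribute.
        System.out.println(smoothed(lambda, 0.0, 0.4, 0.2));
    }
}
```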