public class PunktTokenCounter
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
protected double | abbreviationThreshold: Threshold for considering a token to be an abbreviation. |
protected java.lang.StringBuilder | b |
protected java.util.Map<java.lang.String,java.lang.Integer> | c |
protected static int | CANDIDATE_1 |
protected static int | CANDIDATE_2 |
protected java.util.Set<java.lang.String> | candidates |
protected boolean | ignoreAbbreviationPenalty: Allow disabling the abbreviation penalty heuristic, which exponentially disadvantages words that are sometimes found without a final period. |
protected int | n |
protected static int | START |
protected int | state |
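The description of ignoreAbbreviationPenalty mentions a penalty that grows exponentially with the number of times a word appears without a final period. The exact formula used by this class is not given on this page, so the following is only a sketch under an assumed form: the factor is 1 divided by the token length raised to the number of period-free occurrences. The class name PenaltySketch and the method penalty are hypothetical and are not part of PunktTokenCounter.

```java
// Illustrative sketch only; this is NOT the PunktTokenCounter implementation.
class PenaltySketch {

    /**
     * Hypothetical penalty factor in the range (0, 1].
     * Assumed form: 1 / tokenLength^countWithoutPeriod, so every extra
     * occurrence without a final period shrinks the factor multiplicatively.
     *
     * @param tokenLength               length of the candidate token (at least 1)
     * @param countWithoutPeriod        times the token was seen without a final period
     * @param ignoreAbbreviationPenalty true to skip the penalty entirely
     */
    static double penalty(int tokenLength,
                          int countWithoutPeriod,
                          boolean ignoreAbbreviationPenalty) {
        if (ignoreAbbreviationPenalty) {
            // Penalty disabled: the candidate's score is left unchanged.
            return 1.0;
        }
        return 1.0 / Math.pow(tokenLength, countWithoutPeriod);
    }

    public static void main(String[] args) {
        // Never seen without a period: no penalty is applied.
        System.out.println(penalty(3, 0, false));
        // Seen twice without a period: the factor drops sharply.
        System.out.println(penalty(3, 2, false));
        // Same counts, but with the penalty switched off.
        System.out.println(penalty(3, 2, true));
    }
}
```

Under this assumption, a candidate's score would be multiplied by the factor before being compared with abbreviationThreshold, so setting ignoreAbbreviationPenalty to true makes words that sometimes occur without a period more likely to be accepted as abbreviations.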
Constructor and Description |
---|
PunktTokenCounter(): Create Punkt token counter with default settings. |
PunktTokenCounter(double abbreviationThreshhold, boolean ignoreAbbreviationPenalty): Create Punkt token counter. |
Modifier and Type | Method and Description |
---|---|
protected void | count(PunktToken t) |
protected void | finish() |
java.util.Set<java.lang.String> | getAbbreviations(): Get set of detected abbreviations. |
java.util.Set<java.lang.String> | getCandidates(): Return set of abbreviation candidates. |
int | getCount(java.lang.String tokenString): Get count for a token string. |
int | getN(): Return total number of stored tokens. |
protected void | inc(java.lang.String s) |
protected boolean | isAnAbbreviation(java.lang.String candidate): Determine if abbreviation candidate is actually an abbreviation. |
protected boolean | isPeriod(PunktToken token): Check if token is a period. |
protected static final int START
protected static final int CANDIDATE_1
protected static final int CANDIDATE_2
protected int state
protected java.lang.StringBuilder b
protected java.util.Map<java.lang.String,java.lang.Integer> c
protected java.util.Set<java.lang.String> candidates
protected int n
protected double abbreviationThreshold
protected boolean ignoreAbbreviationPenalty
PunktTokenCounter(double abbreviationThreshhold, boolean ignoreAbbreviationPenalty)
abbreviationThreshhold - Threshold for considering a token to be an abbreviation. 0.3D is the usual value.
ignoreAbbreviationPenalty - True to disable the abbreviation penalty heuristic. Usually false.
PunktTokenCounter()
protected void count(PunktToken t)
protected void finish()
protected boolean isPeriod(PunktToken token)
token - The token to check.
protected void inc(java.lang.String s)
public int getCount(java.lang.String tokenString)
tokenString - Token string for which to get the count.
public int getN()
public java.util.Set<java.lang.String> getCandidates()
public java.util.Set<java.lang.String> getAbbreviations()
protected boolean isAnAbbreviation(java.lang.String candidate)
candidate - Candidate abbreviation token.
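
The bookkeeping implied by the field c, the field n, and the methods inc, getCount and getN can be sketched with a small stand-in class. This is a minimal illustration, not the library's code: the class name TokenCountSketch is hypothetical, and it assumes that inc both increments the per-string count and the running total, and that getCount returns 0 for a string that was never seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in that mirrors the documented members; not the real class.
class TokenCountSketch {

    // Mirrors the documented field c: token string -> number of occurrences.
    private final Map<String, Integer> c = new HashMap<>();

    // Mirrors the documented field n: total number of stored tokens.
    private int n = 0;

    // Assumed behavior: bump the count for s and the running total.
    void inc(String s) {
        c.merge(s, 1, Integer::sum);
        n++;
    }

    // Assumed behavior: unseen token strings have a count of zero.
    int getCount(String tokenString) {
        return c.getOrDefault(tokenString, 0);
    }

    int getN() {
        return n;
    }

    public static void main(String[] args) {
        TokenCountSketch counter = new TokenCountSketch();
        for (String token : new String[] {"Dr.", "Smith", "met", "Dr.", "Jones"}) {
            counter.inc(token);
        }
        System.out.println(counter.getCount("Dr.")); // count for one token string
        System.out.println(counter.getN());          // total tokens stored
    }
}
```

Keeping the per-string counts and the total side by side is what would let a caller turn getCount into a relative frequency by dividing by getN.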