public class LancasterStemmer extends java.lang.Object implements Stemmer
Paice/Husk Stemmer - License Statement.
This software was designed and developed at Lancaster University, Lancaster, UK, under the supervision of Dr Chris Paice. It is fully in the public domain, and may be used or adapted by any organisation or individual. Neither Dr Paice nor Lancaster University accepts any responsibility whatsoever for its use by other parties, and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.
It is assumed that, as a matter of professional courtesy, anyone who incorporates this software into a system of their own, whether for commercial or research purposes, will acknowledge the source of the code.
Modified from the original Java programs written by Christopher O'Neill and Rob Hooper.
Modifier and Type | Field and Description |
---|---|
static java.lang.String[] | defaultStemmingRules: Default stemming rules. |
static java.lang.String[] | prefixes: Prefixes to remove from words before stemming. |
protected boolean | preStrip |
protected java.util.Vector<java.lang.String> | ruleTable |
protected int[] | ruleTableIndex |
protected static char | zeroDigit: Character for "0" digit. |
Constructor and Description |
---|
LancasterStemmer(): Create a Paice/Husk stemmer using the default stemming rules. |
LancasterStemmer(java.lang.String[] rules): Create a Paice/Husk stemmer from a string list of rules. |
LancasterStemmer(java.lang.String[] rules, boolean preStrip): Create a Paice/Husk stemmer from a string list of rules. |
Modifier and Type | Method and Description |
---|---|
protected int | charCode(char ch): Converts a lower case letter to an index. |
protected java.lang.String | clean(java.lang.String s): Remove non-letters from a string. |
protected int | firstVowel(java.lang.String s, int last): Returns index of first vowel in string. |
protected boolean | isDigit(char ch): Determine if character is a digit. |
protected boolean | isLetter(char ch): Determine if character is a letter. |
protected void | loadRules(java.lang.String[] rules): Loads the stemming rules. |
java.lang.String | stem(java.lang.String s): Stem a specified string. |
protected java.lang.String | stripPrefixes(java.lang.String s): Removes prefixes from a string. |
protected java.lang.String | stripSuffixes(java.lang.String s): Strip suffixes from a string. |
protected boolean | vowel(char ch, char prev): Determine if character is a vowel or not. |
public static final java.lang.String[] prefixes
public static final java.lang.String[] defaultStemmingRules
These rules MUST be stored in ascending alphanumeric order of the first character.
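The ordering requirement can be illustrated with a minimal sketch. It assumes (this is not stated in this page) that the order matters because rules sharing a first character are looked up as one contiguous group through an index such as ruleTableIndex. The class RuleOrderSketch, both helper methods, and the sample rule strings are hypothetical and are not part of LancasterStemmer.

```java
import java.util.Arrays;

// Hypothetical helper class; not part of the LancasterStemmer API.
class RuleOrderSketch {

    /** Returns true if the first characters of the rules never decrease. Rules must be non-empty. */
    static boolean isOrderedByFirstChar(String[] rules) {
        for (int i = 1; i < rules.length; i++) {
            if (rules[i].charAt(0) < rules[i - 1].charAt(0)) {
                return false;
            }
        }
        return true;
    }

    /** Maps each letter 'a'..'z' to the position of its first rule, or -1 if it has none. */
    static int[] buildIndex(String[] rules) {
        int[] index = new int[26];
        Arrays.fill(index, -1);
        for (int i = 0; i < rules.length; i++) {
            int code = rules[i].charAt(0) - 'a';
            if (code >= 0 && code < 26 && index[code] == -1) {
                index[code] = i;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Placeholder rule strings, used only to show the ordering.
        String[] rules = {"ai*2.", "a*1.", "bb1.", "ytl2."};
        System.out.println(isOrderedByFirstChar(rules)); // prints "true"
        System.out.println(buildIndex(rules)['b' - 'a']); // prints "2"
    }
}
```

With the rules sorted this way, a single start position per letter is enough to locate every rule for that letter; an unsorted array would scatter a letter's rules and break that kind of lookup.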
protected static final char zeroDigit
protected java.util.Vector<java.lang.String> ruleTable
protected int[] ruleTableIndex
protected boolean preStrip
public LancasterStemmer()
Prefixes are automatically removed from words with more than two characters.
Throws:
StemmerException - if something goes wrong.
public LancasterStemmer(java.lang.String[] rules)
Prefixes are automatically removed from words with more than two characters.
Parameters:
rules - The stemming rules as an array of String.
public LancasterStemmer(java.lang.String[] rules, boolean preStrip)
When preStrip is true, prefixes are removed from words with more than two characters.
Parameters:
rules - The stemming rules as an array of String.
preStrip - True to remove prefixes from words with more than two characters.
protected void loadRules(java.lang.String[] rules)
Parameters:
rules - String array of rules.
protected int firstVowel(java.lang.String s, int last)
Parameters:
s - String to search for vowel.
last - Last position to search for vowel.
protected java.lang.String stripSuffixes(java.lang.String s)
Parameters:
s - The string from which to remove suffixes.
protected boolean vowel(char ch, char prev)
Parameters:
ch - The potential vowel.
prev - The previous character.
When the character is a "y", the previous character is checked to see if it is a vowel. If so, "y" is not considered a vowel.
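The "y" rule described for vowel can be sketched as follows. This is an illustration only, not the class's actual source: the class VowelSketch and the helper isPlainVowel are hypothetical, and the sketch assumes that "a", "e", "i", "o" and "u" are the only characters that count as vowels when the previous character is tested.

```java
// Hypothetical illustration of the documented "y" rule; not the library source.
class VowelSketch {

    /** Assumption: only these five letters are unconditional vowels. */
    static boolean isPlainVowel(char ch) {
        return ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u';
    }

    /** A "y" counts as a vowel only when the previous character is not a vowel. */
    static boolean vowel(char ch, char prev) {
        if (ch == 'y') {
            return !isPlainVowel(prev);
        }
        return isPlainVowel(ch);
    }

    public static void main(String[] args) {
        System.out.println(vowel('y', 'r')); // "y" after a consonant, as in "cry": prints "true"
        System.out.println(vowel('y', 'a')); // "y" after a vowel, as in "day": prints "false"
    }
}
```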
protected boolean isDigit(char ch)
Parameters:
ch - The character to check.
protected boolean isLetter(char ch)
Parameters:
ch - The character to check.
protected int charCode(char ch)
Parameters:
ch - The character. Must be in the range 'a' .. 'z'.
protected java.lang.String stripPrefixes(java.lang.String s)
Parameters:
s - The string from which to remove prefixes.
protected java.lang.String clean(java.lang.String s)
Parameters:
s - String from which to remove non-letters.
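The two string-preparation helpers above, stripPrefixes and clean, might behave roughly like the sketch below. Several things here are assumptions rather than documented behaviour: the input is taken to be lower case already, a letter is taken to mean 'a' .. 'z' (matching the range required by charCode), only the first matching prefix is removed, and the three entries in PREFIXES are placeholders for whatever the prefixes field really contains. The class PrepSketch is hypothetical.

```java
// Hypothetical illustration; the real prefix list and edge-case handling may differ.
class PrepSketch {

    // Placeholder values standing in for LancasterStemmer.prefixes.
    static final String[] PREFIXES = {"kilo", "micro", "milli"};

    /** Keeps only the characters 'a'..'z' (assumes the input is already lower case). */
    static String clean(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= 'a' && ch <= 'z') {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    /** Removes the first matching prefix, provided something is left afterwards. */
    static String stripPrefixes(String s) {
        for (String prefix : PREFIXES) {
            if (s.startsWith(prefix) && s.length() > prefix.length()) {
                return s.substring(prefix.length());
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(clean("don't"));             // prints "dont"
        System.out.println(stripPrefixes("kilometre")); // prints "metre"
    }
}
```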