public class Dictionary
extends java.lang.Object
The dictionary format:
In what follows: Every "%" symbol and everything after it is ignored on every line. Every newline or tab is replaced by a space.
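The two preprocessing rules above can be sketched in a few lines of Java. This is only an illustration: the class and method names are invented here, and it ignores any special handling of quoted characters that the real reader may apply (the `quote_mode` parameter of `get_character` suggests there is some).

```java
class DictPreprocess {
    /** Drops everything from '%' to the end of each line, then maps newlines and tabs to spaces. */
    static String preprocess(String raw) {
        StringBuilder out = new StringBuilder();
        for (String line : raw.split("\n", -1)) {
            int pct = line.indexOf('%');
            out.append(pct >= 0 ? line.substring(0, pct) : line);
            out.append(' '); // the line break itself becomes a space
        }
        return out.toString().replace('\t', ' ').replace('\r', ' ');
    }
}
```

Note that this sketch also emits a space after the last line, which is harmless because runs of spaces carry no meaning in the format described here.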
The dictionary file is a sequence of ENTRIES. Each ENTRY is one or more WORDS (a sequence of upper or lower case letters) separated by spaces, followed by a ":", followed by an EXPRESSION, followed by a ";". An EXPRESSION is a lisp expression where the functions are "&" or "and" or "|" or "or", and there are three types of parentheses: "()", "{}", and "[]". The terminal symbols of this grammar are the connectors, which are strings of letters or numbers or *s. (This description applies to the prefix form of the dictionary. The current dictionary is written in infix form. If the constant INFIX_NOTATION is defined, infix is used; otherwise prefix is used.)
The connector begins with an optional @, which is followed by a sequence of upper case letters. Each subsequent *, lower case letter or number is a subscript. At the end is a + or - sign. The "@" allows this connector to attach to one or more other connectors.
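As a rough check of the connector syntax just described, the rule can be written as a regular expression. This is a sketch and not the library's `check_connector`: the class and method names are hypothetical, and the `isReserved` helper only reflects the rule, stated further down, that names beginning with "ID" are reserved for idioms.

```java
import java.util.regex.Pattern;

class ConnectorSketch {
    // optional '@', one or more upper case letters, any number of subscript
    // characters (lower case letter, digit or '*'), then a single '+' or '-'
    private static final Pattern CONNECTOR = Pattern.compile("@?[A-Z]+[a-z0-9*]*[+-]");

    static boolean looksLikeConnector(String s) {
        return CONNECTOR.matcher(s).matches();
    }

    /** True if the connector name (without a leading '@') starts with the reserved prefix "ID". */
    static boolean isReserved(String s) {
        String name = s.startsWith("@") ? s.substring(1) : s;
        return name.startsWith("ID");
    }
}
```

With this pattern, "T-" and "@EV+" are accepted, while "ev+" (no upper case part) and "T" (no direction sign) are rejected.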
Here is a sample dictionary entry (in infix form):
gone: T- & {@EV+};
(See our paper for more about how to interpret the meaning of the dictionary expressions.)
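To make the ENTRY layout concrete, here is a small Java sketch that cuts already-preprocessed dictionary text into (words, expression) pairs at each ":" and ";". The class and method names are invented for illustration, and the sketch assumes that neither character appears inside a word or an expression.

```java
import java.util.ArrayList;
import java.util.List;

class EntrySplit {
    /** Returns one {words, expression} pair per entry; text without a ':' is skipped. */
    static List<String[]> splitEntries(String text) {
        List<String[]> entries = new ArrayList<>();
        for (String chunk : text.split(";")) {
            int colon = chunk.indexOf(':');
            if (colon < 0) {
                continue; // e.g. whitespace left over after the final ';'
            }
            String words = chunk.substring(0, colon).trim();
            String expression = chunk.substring(colon + 1).trim();
            entries.add(new String[] { words, expression });
        }
        return entries;
    }
}
```

Applied to the sample entry above, this yields the word "gone" and the expression "T- & {@EV+}".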
A previously defined word (such as "gone" above) may be used instead of a connector to specify the expression it was defined to be. Of course, in this case, it must uniquely specify a word in the dictionary, and have been previously defined.
If a word is of the form "/foo", then the file current-dir/foo is a so-called word file, and is read in as a list of words. A word file is just a list of words separated by blanks or newlines.
A word that contains the character "_" defines an idiomatic use of the words separated by the "_". For example "kind of" is an idiomatic expression, so a word "kind_of" is defined in the dictionary. Idiomatic expressions of any number of words can be defined in this way. When the word "kind" is encountered, all the idiomatic uses of the word are considered.
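The idiom convention can be illustrated with a short Java sketch. The well-formedness test follows the two conditions listed for `is_idiom_string` in the method details (no "." and only non-empty parts between "_"). The class and method names here are hypothetical, and whether the real method also insists on at least one "_" is not stated, so this sketch leaves that check to a separate helper.

```java
class IdiomSketch {
    static boolean containsUnderbar(String s) {
        return s.indexOf('_') >= 0;
    }

    /** False if the string contains a '.' or has an empty part before, between or after '_'. */
    static boolean isWellFormedIdiom(String s) {
        if (s.indexOf('.') >= 0) {
            return false;
        }
        for (String part : s.split("_", -1)) { // -1 keeps trailing empty parts
            if (part.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** Splits an idiom such as "kind_of" into its component words. */
    static String[] idiomWords(String s) {
        return s.split("_");
    }
}
```

For "kind_of" the component words are "kind" and "of"; strings such as "_of" or "kind_" are rejected because one part is empty.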
An expression enclosed in "[..]" is given a cost of 1. This means that if any of the connectors inside the square brackets are used, a cost of 1 is incurred. (This cost is the first element of the cost vector printed when a sentence is parsed.) Of course if something is inside of 10 levels of "[..]" then using it incurs a cost of 10. These costs are called "disjunct costs". The linkages are printed out in order of non-increasing disjunct cost.
The expression "(A+ or ())" means that you can choose either "A+" or the empty expression "()", that is, that the connector "A+" is optional. This is more compactly expressed as "{A+}". In other words, curly braces indicate an optional expression.
The expression "(A+ or [])" is the same as that above, but there is a cost of 1 incurred for choosing not to use "A+". The expression "(EXP1 & [EXP2])" is exactly the same as "[EXP1 & EXP2]". The difference between "({[A+]} & B+)" and "([{A+}] & B+)" is that the latter always incurs a cost of 1, while the former only gets a cost of 1 if "A+" is used.
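The bracket and brace rules above can be checked mechanically. The following is a minimal Java sketch, not the parser's own expression machinery: the `Expr` interface, the `Choice` class and the helpers `conn`, `empty`, `or`, `and`, `optional` and `costly` are all invented for illustration. Each expression simply enumerates the ways it can be satisfied, together with the cost incurred.

```java
import java.util.ArrayList;
import java.util.List;

class CostSketch {
    /** One way to satisfy an expression: the connectors used and the cost incurred. */
    static final class Choice {
        final List<String> connectors;
        final int cost;
        Choice(List<String> connectors, int cost) { this.connectors = connectors; this.cost = cost; }
        @Override public String toString() { return connectors + " cost=" + cost; }
    }

    interface Expr { List<Choice> choices(); }

    static Expr conn(String name) {               // a single connector, e.g. A+
        return () -> {
            List<String> c = new ArrayList<>();
            c.add(name);
            List<Choice> r = new ArrayList<>();
            r.add(new Choice(c, 0));
            return r;
        };
    }
    static Expr empty() {                          // the empty expression ()
        return () -> {
            List<Choice> r = new ArrayList<>();
            r.add(new Choice(new ArrayList<String>(), 0));
            return r;
        };
    }
    static Expr or(Expr a, Expr b) {               // pick either side
        return () -> {
            List<Choice> r = new ArrayList<>(a.choices());
            r.addAll(b.choices());
            return r;
        };
    }
    static Expr and(Expr a, Expr b) {              // use both sides, costs add up
        return () -> {
            List<Choice> r = new ArrayList<>();
            for (Choice x : a.choices()) {
                for (Choice y : b.choices()) {
                    List<String> c = new ArrayList<>(x.connectors);
                    c.addAll(y.connectors);
                    r.add(new Choice(c, x.cost + y.cost));
                }
            }
            return r;
        };
    }
    static Expr optional(Expr e) { return or(e, empty()); }   // {e} == (e or ())
    static Expr costly(Expr e) {                               // [e] adds 1 to every choice
        return () -> {
            List<Choice> r = new ArrayList<>();
            for (Choice x : e.choices()) r.add(new Choice(x.connectors, x.cost + 1));
            return r;
        };
    }

    public static void main(String[] args) {
        Expr former = and(optional(costly(conn("A+"))), conn("B+")); // ({[A+]} & B+)
        Expr latter = and(costly(optional(conn("A+"))), conn("B+")); // ([{A+}] & B+)
        System.out.println(former.choices()); // prints [[A+, B+] cost=1, [B+] cost=0]
        System.out.println(latter.choices()); // prints [[A+, B+] cost=1, [B+] cost=1]
    }
}
```

The two printed lists show the difference described above: in the second form the cost of 1 is paid whether or not "A+" is used.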
The dictionary writer is not allowed to use connectors that begin with "ID". This prefix is reserved for the connectors automatically generated for idioms.
One more thing...
The Dictionary is a binary tree
The data structure storing the dictionary is simply a binary tree.
There is one catch, however. The ordering of the words is not
exactly the order given by strcmp. It was necessary to
modify the order so that "make" < "make.n" < "make-up".
The problem is that if some other string could lie between '\0'
and '.' (such as '-', which strcmp would place there) then it becomes much
harder to return all the strings that match a given word.
For example, if "make-up" was inserted, then "make" was inserted,
and then a search was done for "make.n", the obvious algorithm would
not find the match.
int dict_compare(String s, String t) {
    int ss, tt;
    while (*s != '\0' && *s == *t) {
        s++;
        t++;
    }
    if (*s == '.') {
        ss = 1;
    } else {
        ss = (*s)<<1;
    }
    if (*t == '.') {
        tt = 1;
    } else {
        tt = (*t)<<1;
    }
    return (ss - tt);
}
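To see the effect of this ordering, the comparison can be written against Java strings and used to sort the three example words. This is a sketch under the assumption that running off the end of a string plays the role of '\0'; the class name and the `weight` helper are invented for illustration.

```java
import java.util.Arrays;

class DictOrderSketch {
    /** Weight of the character at position i: 0 past the end, 1 for '.', otherwise twice its code. */
    static int weight(String s, int i) {
        if (i >= s.length()) {
            return 0;
        }
        char c = s.charAt(i);
        return c == '.' ? 1 : (c << 1);
    }

    static int dictCompare(String s, String t) {
        int i = 0;
        while (i < s.length() && i < t.length() && s.charAt(i) == t.charAt(i)) {
            i++;
        }
        return weight(s, i) - weight(t, i);
    }

    public static void main(String[] args) {
        String[] words = { "make-up", "make.n", "make" };
        Arrays.sort(words, DictOrderSketch::dictCompare);
        System.out.println(Arrays.toString(words)); // prints [make, make.n, make-up]
    }
}
```

Sorting the same array with the natural `String` ordering would instead place "make-up" before "make.n", because '-' has a smaller character code than '.'.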
Constructor and Description |
---|
Dictionary(ParseOptions opts,
java.lang.String dict_name,
java.lang.String pp_name,
java.lang.String cons_name,
java.lang.String affix_name)
This is the dictionary constructor method.
|
Modifier and Type | Method and Description |
---|---|
(package private) DictNode |
abridged_lookup(java.lang.String s) |
(package private) boolean |
advance()
this reads the next token from the input into token
|
(package private) boolean |
boolean_abridged_lookup(java.lang.String s) |
(package private) boolean |
boolean_dictionary_lookup(java.lang.String s) |
(package private) java.lang.String |
build_idiom_word_name(java.lang.String s)
Allocates string space and returns a pointer to it.
|
(package private) boolean |
check_connector(java.lang.String s)
makes sure the string s is a valid connector
|
(package private) Exp |
connector()
the current token is a connector (or a dictionary word)
make a node for it
|
(package private) static boolean |
contains_underbar(java.lang.String s)
Returns true if the string contains an underbar character.
|
(package private) int |
dict_compare(java.lang.String s,
java.lang.String t)
The data structure storing the dictionary is simply a binary tree.
|
(package private) void |
dict_display_word_info(java.lang.String s) |
(package private) void |
dict_error(java.lang.String s) |
(package private) int |
dict_match(java.lang.String s,
java.lang.String t)
assuming that s is a pointer to a dictionary string, and that
t is a pointer to a search string, this returns 0 if they
match, >0 if s>t, and <0 if s<t.
|
DictNode |
dictionary_lookup(java.lang.String s)
Returns a pointer to a lookup list of the words in the dictionary.
|
(package private) static java.io.Reader |
dictopen(ParseOptions opts,
java.lang.String dictname,
java.lang.String filename)
This function is used to open a dictionary file or a word file,
or any associated data file (like a post process knowledge file).
|
(package private) Exp |
Exp_create()
allocate a new Exp node and link it into the
exp_list for freeing later
|
(package private) Exp |
expression() |
(package private) java.lang.String |
generate_id_connector()
generate a new connector name
obtained from the current_name
allocate string space for it.
|
(package private) java.lang.String |
get_a_word(java.io.Reader fp)
Reads in one word from the file, allocates space for it,
and returns it.
|
(package private) int |
get_character(boolean quote_mode)
This gets the next character from the input, eliminating comments.
|
(package private) void |
increment_current_name() |
(package private) DictNode |
insert_dict(DictNode n,
DictNode newNode)
Insert the new node into the dictionary below node n
give error message if the new element's string is already there
assumes that the "n" field of new is already set, and the left
and right fields of it are null
|
(package private) void |
insert_idiom(DictNode dn)
Takes as input a pointer to a DictNode.
|
(package private) void |
insert_list(DictNode p,
int l)
Insert the list into the dictionary.
|
(package private) static boolean |
is_ed_word(java.lang.String s) |
(package private) boolean |
is_equal(int c)
returns true if this token is a special token and it is equal to c
|
(package private) static boolean |
is_idiom_number(java.lang.String s)
return true if the string s is a sequence of digits.
|
(package private) static boolean |
is_idiom_string(java.lang.String s)
Returns false if it is not a correctly formed idiom string.
|
(package private) static boolean |
is_idiom_word(java.lang.String s) |
(package private) static boolean |
is_ing_word(java.lang.String s) |
(package private) static boolean |
is_initials_word(java.lang.String word)
This might be a good place for entity extraction since all cap words
often represent entities US, DOD etc.
|
(package private) static boolean |
is_ly_word(java.lang.String s) |
(package private) static boolean |
is_number(java.lang.String s) |
(package private) static boolean |
is_s_word(java.lang.String s) |
(package private) static boolean |
ishyphenated(java.lang.String s)
returns true iff it's an appropriately formed hyphenated word.
|
(package private) DictNode |
make_idiom_DictNodes(java.lang.String string)
Tear the idiom string apart.
|
(package private) Exp |
make_optional_node(Exp e)
This creates an OR node with two children, one the given node,
and the other a zeroary node.
|
(package private) Exp |
make_unary_node(Exp e)
This creates a node with one child (namely e).
|
(package private) Exp |
make_zeroary_node()
This creates a node with zero children.
|
(package private) int |
max_postfix_found(DictNode d) |
(package private) static int |
numberfy(java.lang.String s)
if the string contains a single ".", and ends in ".Ix" where
x is a number, return x.
|
(package private) boolean |
open_dictionary(java.lang.String dict_path_name)
Opens the dictionary, sets the path and assigns the Dictionary object's
filepointer to the dictionary specified in ParseOptions.
|
(package private) Postprocessor |
post_process_open(ParseOptions opts,
java.lang.String dictname,
java.lang.String path)
read rules from path and initialize the appropriate fields in
a postprocessor structure, a pointer to which is returned.
|
(package private) void |
prune_lookup_list(java.lang.String s) |
(package private) void |
rabridged_lookup(DictNode dn,
java.lang.String s) |
(package private) void |
rdictionary_lookup(DictNode dn,
java.lang.String s) |
void |
read_dictionary()
Read the dictionary into memory.
|
(package private) boolean |
read_entry()
Starting with the current token parse one dictionary entry.
|
(package private) DictNode |
read_word_file(DictNode dn,
java.lang.String filename)
(1) opens the word file and adds it to the word file list
(2) reads in the words
(3) puts each word in a DictNode
(4) links these together by their left pointers at the front of the list pointed to by dn
(5) returns a pointer to the first of this list
|
(package private) Exp |
restricted_expression(boolean and_ok,
boolean or_ok) |
(package private) boolean |
true_dict_match(java.lang.String s,
java.lang.String t)
We need to prune out the lists thus generated.
|
(package private) void |
warning(java.lang.String s) |
public DictNode root
public java.lang.String name
public boolean use_unknown_word
public boolean unknown_word_defined
public boolean capitalized_word_defined
public boolean pl_capitalized_word_defined
public boolean hyphenated_word_defined
public boolean number_word_defined
public boolean ing_word_defined
public boolean s_word_defined
public boolean ed_word_defined
public boolean ly_word_defined
public boolean left_wall_defined
public boolean right_wall_defined
public Postprocessor postprocessor
public Postprocessor constituent_pp
public Dictionary affix_table
public boolean andable_defined
public ConnectorSet andable_connector_set
public ConnectorSet unlimited_connector_set
public int max_cost
public int num_entries
public ParseOptions opts
public WordFile word_file_header
public Exp exp_list
public java.io.Reader fp
public java.lang.StringBuffer token
public boolean is_special
public int already_got_it
public int line_number
DictNode lookup_list
static boolean rand_table_inited
static java.lang.StringBuffer current_name
static final int CN_size
public Dictionary(ParseOptions opts, java.lang.String dict_name, java.lang.String pp_name, java.lang.String cons_name, java.lang.String affix_name) throws java.io.IOException
java.io.IOException
Postprocessor post_process_open(ParseOptions opts, java.lang.String dictname, java.lang.String path) throws java.io.IOException
opts - the parse options. These are kept in many places, use care!
dictname - the dictionary to use. If fully qualified then sets the path for affix, etc.
path - Colon separated list of directories to search for dictionary, postprocessor etc.
java.io.IOException
Postprocessor, Dictionary(ParseOptions, String, String, String, String)
boolean open_dictionary(java.lang.String dict_path_name) throws java.io.IOException
dict_path_name - the fully qualified? path to the dictionary?
java.io.IOException
dictopen(ParseOptions, String, String)
public void read_dictionary() throws java.io.IOException
java.io.IOException
open_dictionary(String)
public DictNode dictionary_lookup(java.lang.String s)
void prune_lookup_list(java.lang.String s)
s -
void rdictionary_lookup(DictNode dn, java.lang.String s)
dn -
s -
boolean boolean_dictionary_lookup(java.lang.String s)
s -
void rabridged_lookup(DictNode dn, java.lang.String s)
dn -
s -
DictNode abridged_lookup(java.lang.String s)
s -
boolean boolean_abridged_lookup(java.lang.String s)
int dict_match(java.lang.String s, java.lang.String t)
boolean true_dict_match(java.lang.String s, java.lang.String t)
void dict_display_word_info(java.lang.String s)
s -
static boolean is_idiom_string(java.lang.String s)
A correctly formed idiom string:
() contains no "."
() consists of non-empty strings separated by "_"
s - word to look up
static boolean is_idiom_word(java.lang.String s)
s -
static boolean is_initials_word(java.lang.String word)
word -
static boolean is_number(java.lang.String s)
s -
static boolean ishyphenated(java.lang.String s)
s -
static boolean is_ing_word(java.lang.String s)
s -
static boolean is_s_word(java.lang.String s)
s -
static boolean is_ed_word(java.lang.String s)
s -
static boolean is_ly_word(java.lang.String s)
s -
static int numberfy(java.lang.String s)
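The method summary describes numberfy as returning x when the string contains a single "." and ends in ".Ix", where x is a number. A minimal Java sketch of that rule follows. The class name is invented, and the value returned when the string does not have this form is not stated on this page, so the -1 used here is purely an assumption.

```java
class NumberfySketch {
    /** Returns x for strings of the form "word.Ix"; returns -1 (an assumed sentinel) otherwise. */
    static int numberfy(String s) {
        int dot = s.indexOf('.');
        if (dot < 0 || dot != s.lastIndexOf('.')) {
            return -1; // no "." at all, or more than one
        }
        String suffix = s.substring(dot + 1);
        if (suffix.length() < 2 || suffix.charAt(0) != 'I') {
            return -1; // must be 'I' followed by at least one digit
        }
        String digits = suffix.substring(1);
        for (char c : digits.toCharArray()) {
            if (c < '0' || c > '9') {
                return -1;
            }
        }
        try {
            return Integer.parseInt(digits);
        } catch (NumberFormatException e) {
            return -1; // digit string too long for an int
        }
    }
}
```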
static boolean is_idiom_number(java.lang.String s)
static boolean contains_underbar(java.lang.String s)
void dict_error(java.lang.String s) throws java.io.IOException
java.io.IOException
void warning(java.lang.String s)
Exp Exp_create()
int get_character(boolean quote_mode) throws java.io.IOException
java.io.IOException
boolean advance() throws java.io.IOException
java.io.IOException
boolean is_equal(int c)
boolean check_connector(java.lang.String s) throws java.io.IOException
java.io.IOException
Exp connector() throws java.io.IOException
java.io.IOException
Exp make_unary_node(Exp e)
Exp make_zeroary_node()
Exp make_optional_node(Exp e)
Exp expression() throws java.io.IOException
java.io.IOException
Exp restricted_expression(boolean and_ok, boolean or_ok) throws java.io.IOException
java.io.IOException
int dict_compare(java.lang.String s, java.lang.String t)
verbose version
int dict_compare(String s, String t) {
    int ss, tt;
    while (*s != '\0' && *s == *t) {
        s++;
        t++;
    }
    if (*s == '.') {
        ss = 1;
    } else {
        ss = (*s)<<1;
    }
    if (*t == '.') {
        tt = 1;
    } else {
        tt = (*t)<<1;
    }
    return (ss - tt);
}
terse version
int dict_compare(String s, String t) {
    int i = 0;
    while (i < s.length() && i < t.length() && s.charAt(i) == t.charAt(i)) {
        i++;
    }
    return (i >= s.length() ? 0 : (s.charAt(i) == '.' ? 1 : (s.charAt(i) << 1)))
         - (i >= t.length() ? 0 : (t.charAt(i) == '.' ? 1 : (t.charAt(i) << 1)));
}
DictNode insert_dict(DictNode n, DictNode newNode) throws java.io.IOException
java.io.IOException
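The method summary says insert_dict places the new node below n and reports an error if the string is already present. A plain binary-tree insertion with that behaviour might look like the following Java sketch. The `Node` class is a hypothetical stand-in for DictNode, the ordering is passed in as a `Comparator` (the dictionary itself is described as using dict_compare), and an `IllegalArgumentException` stands in for whatever dict_error actually does.

```java
import java.util.Comparator;

class TreeSketch {
    /** Hypothetical stand-in for DictNode: a string plus left and right children. */
    static final class Node {
        final String string;
        Node left, right;
        Node(String string) { this.string = string; }
    }

    /** Inserts newNode below n and returns the (possibly new) root of that subtree. */
    static Node insert(Node n, Node newNode, Comparator<String> order) {
        if (n == null) {
            return newNode;
        }
        int c = order.compare(newNode.string, n.string);
        if (c == 0) {
            throw new IllegalArgumentException("duplicate entry: " + newNode.string);
        }
        if (c < 0) {
            n.left = insert(n.left, newNode, order);
        } else {
            n.right = insert(n.right, newNode, order);
        }
        return n;
    }

    /** In-order traversal, each word followed by a space. */
    static String inOrder(Node n) {
        return n == null ? "" : inOrder(n.left) + n.string + " " + inOrder(n.right);
    }
}
```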
void insert_list(DictNode p, int l) throws java.io.IOException
p - points to a list of dict_nodes connected by their left pointers
l - is the length of this list (the last ptr may not be null)
java.io.IOException
boolean read_entry() throws java.io.IOException
java.io.IOException
void insert_idiom(DictNode dn) throws java.io.IOException
java.io.IOException
DictNode read_word_file(DictNode dn, java.lang.String filename) throws java.io.IOException
java.io.IOException
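Steps (2) to (5) of the read_word_file description in the method summary can be sketched as below. This is an illustration only: `Node` is a hypothetical stand-in for DictNode, step (1), opening the file and recording it in the word file list, is left out, and the sketch assumes each new word is pushed onto the front of the list as it is read.

```java
import java.util.Scanner;

class WordFileSketch {
    /** Hypothetical stand-in for DictNode: a word and the "left" link used to chain the list. */
    static final class Node {
        final String string;
        final Node left;
        Node(String string, Node left) { this.string = string; this.left = left; }
    }

    /** Reads whitespace-separated words and prepends one node per word to the list dn. */
    static Node readWords(Readable in, Node dn) {
        Scanner sc = new Scanner(in); // default delimiter is any run of whitespace
        while (sc.hasNext()) {
            dn = new Node(sc.next(), dn); // the new node becomes the head of the list
        }
        return dn;
    }
}
```

Because every word is prepended, the returned list holds the words in the reverse of their order in the file.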
java.lang.String get_a_word(java.io.Reader fp) throws java.io.IOException
java.io.IOException
static java.io.Reader dictopen(ParseOptions opts, java.lang.String dictname, java.lang.String filename) throws java.io.IOException
It works as follows. If the file name begins with a "/", then it's assumed to be an absolute file name and it tries to open that exact file.
If the filename does not begin with a "/", then it uses the dictpath mechanism to find the right file to open. This looks for the file in a sequence of directories until it finds it. The sequence of directories is specified in a dictpath string, in which each directory is followed by a ":".
The dictpath that it uses is constructed as follows. If the dictname is non-null, and is an absolute path name (beginning with a "/"), then the part after the last "/" is removed and this is the first directory on the dictpath. After this comes the DICTPATH environment variable, followed by the DEFAULTPATH.
java.io.IOException
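The search order described for dictopen can be sketched in Java as follows. The class and method names are invented, and the values of DICTPATH and DEFAULTPATH are passed in as plain strings rather than read from the environment, so this is an outline of the described mechanism and not the library's code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class DictPathSketch {
    /** Builds the directory list: the dictionary's own directory, then DICTPATH, then DEFAULTPATH. */
    static List<String> buildDictPath(String dictname, String envDictPath, String defaultPath) {
        List<String> dirs = new ArrayList<>();
        if (dictname != null && dictname.startsWith("/")) {
            // drop the part after the last "/", keeping the directory prefix
            dirs.add(dictname.substring(0, dictname.lastIndexOf('/') + 1));
        }
        addAll(dirs, envDictPath);
        addAll(dirs, defaultPath);
        return dirs;
    }

    private static void addAll(List<String> dirs, String colonSeparated) {
        if (colonSeparated == null) {
            return;
        }
        for (String d : colonSeparated.split(":")) {
            if (!d.isEmpty()) {
                dirs.add(d);
            }
        }
    }

    /** Absolute names are tried as-is; other names are looked up in each directory in turn. */
    static File find(String filename, List<String> dirs) {
        if (filename.startsWith("/")) {
            File f = new File(filename);
            return f.isFile() ? f : null;
        }
        for (String d : dirs) {
            File f = new File(d, filename);
            if (f.isFile()) {
                return f;
            }
        }
        return null; // not found anywhere on the path
    }
}
```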
java.lang.String generate_id_connector()
java.lang.String build_idiom_word_name(java.lang.String s)
int max_postfix_found(DictNode d)
DictNode make_idiom_DictNodes(java.lang.String string)
void increment_current_name()