MergeBrillLexicon
merges the contents of a Brill format lexicon with a MorphAdorner
format lexicon into a combined MorphAdorner lexicon.
Usage:
mergebrilllexicon lexicon.lex brilllexicon.txt mergedlexicon.lex
where
- lexicon.lex -- input MorphAdorner format word lexicon.
- brilllexicon.txt -- input Brill format word lexicon to be merged
with MorphAdorner word lexicon.
- mergedlexicon.lex -- output merged lexicon in MorphAdorner format.
A Brill lexicon is a plain text file, encoded in UTF-8, containing
words and their possible part of speech tags. Each word
appears on a separate line. The first token on each line is the
word. The remaining tokens are the potential parts of speech
for the word, separated by blanks or tab characters.
The most commonly occurring part of speech should be the first one listed.
word pos1 pos2 pos3 ...
This type of lexicon was popularized by Eric Brill's part of speech tagger
in the early 1990s.
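The line format described above can be read with a few lines of code. The following is a minimal Python sketch, not MorphAdorner's own parser: the function name parse_brill_lexicon is hypothetical, and skipping lines that carry no tags is an assumption made for illustration.

```python
from typing import Dict, List


def parse_brill_lexicon(text: str) -> Dict[str, List[str]]:
    """Map each word to its part of speech tags, most common tag first."""
    entries: Dict[str, List[str]] = {}
    for line in text.splitlines():
        # split() with no argument splits on runs of blanks or tabs.
        tokens = line.split()
        if len(tokens) < 2:
            # Assumption: ignore blank lines and words listed without tags.
            continue
        word, tags = tokens[0], tokens[1:]
        # Assumption: a repeated word replaces its earlier entry.
        entries[word] = tags
    return entries


sample = "Yelton np1\nlit\tfw-fr\nliterary j\n"
lexicon = parse_brill_lexicon(sample)
print(lexicon)
```

The tag order is preserved in each list, so the first element is the tag the lexicon author considers most common.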
The Brill entries are merged with the input MorphAdorner lexicon
to produce an updated output MorphAdorner format lexicon. The first
part of speech for each word is added with a count of two, while the
remaining parts of speech are added with a count of one. The default
English lemmatizer is used to determine lemmata for the Brill
words. When a word to be added already exists in the MorphAdorner
lexicon, only the new parts of speech are added to the existing
lexicon entry.
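The counting rule described above can be illustrated as follows. This is a rough Python sketch under stated assumptions, not the MergeBrillLexicon implementation: the in-memory layout (word mapped to tag counts), the function name merge_brill_entries, and the sample counts are all hypothetical, lemmatization is omitted, and it assumes a new tag on an existing word still receives a count based on its position in the Brill entry.

```python
from typing import Dict, List

# Hypothetical in-memory layout: word -> {part of speech: count}.
Lexicon = Dict[str, Dict[str, int]]


def merge_brill_entries(lexicon: Lexicon, brill: Dict[str, List[str]]) -> Lexicon:
    """Return a copy of lexicon with the Brill entries merged in."""
    merged = {word: dict(tags) for word, tags in lexicon.items()}
    for word, tags in brill.items():
        entry = merged.setdefault(word, {})
        for position, tag in enumerate(tags):
            if tag in entry:
                # Parts of speech already present keep their counts.
                continue
            # First listed tag gets a count of two, the rest a count of one.
            entry[tag] = 2 if position == 0 else 1
    return merged


existing = {"lit": {"vvn": 5}}
brill = {"lit": ["fw-fr", "vvn"], "Yelton": ["np1"]}
merged = merge_brill_entries(existing, brill)
print(merged)
```

In this sketch the existing count for an already known part of speech is left untouched, which matches the statement that only new parts of speech are added to an existing entry.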
Brill lexicons are convenient for adding large lists of words
such as proper names, place names, foreign language words, and
so on. Here is a small section of a sample Brill lexicon:
Yellott np1
Yellowby np1
Yellville np1
Yelton np1
Yelverton np1
lieu fw-fr
lieux fw-fr
lire fw-fr
lit fw-fr
literary j
livre fw-fr
livres fw-fr
loi fw-fr
lois fw-fr
loix fw-fr
MorphAdorner also defines an enhanced Brill lexicon which provides the
lemmata for each word's parts of speech.
MergeEnhancedBrillLexicon
allows you to merge an enhanced Brill lexicon into a MorphAdorner lexicon.