NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Merging An Enhanced Brill Lexicon

MergeEnhancedBrillLexicon merges the contents of an enhanced Brill format lexicon with a MorphAdorner format lexicon into a combined MorphAdorner lexicon.

Usage:

mergeenhancedbrilllexicon lexicon.lex enhancedbrilllexicon.txt mergedlexicon.lex

where

  • lexicon.lex -- input MorphAdorner format word lexicon.
  • enhancedbrilllexicon.txt -- input enhanced Brill format word lexicon to be merged with MorphAdorner word lexicon.
  • mergedlexicon.lex -- output merged lexicon in MorphAdorner format.

An enhanced Brill lexicon is a simple UTF-8 encoded text file containing words and their possible parts of speech, along with the lemma for each part of speech. Each word appears on a separate line. The first token on each line is the word. The remaining tokens form a set of pairs: a potential part of speech for the word, followed by a blank, followed by the lemma for that word and part of speech. The most commonly occurring part of speech should be listed first.

word pos1 lemma1 pos2 lemma2 pos3 lemma3 ...
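As a rough illustration of the line layout above, the following Python sketch splits one line into the word and its (part of speech, lemma) pairs. It is not MorphAdorner's own parser: the function name parse_enhanced_brill_line is hypothetical, and the sketch assumes that tokens are separated by whitespace and that neither tags nor lemmata contain blanks.

```python
def parse_enhanced_brill_line(line):
    """Split one lexicon line into (word, [(pos, lemma), ...]).

    Assumption: tokens are whitespace-separated and no token
    contains an embedded blank.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    word, rest = tokens[0], tokens[1:]
    # After the word, tokens must come in pos/lemma pairs.
    if not rest or len(rest) % 2 != 0:
        raise ValueError(f"expected pos/lemma pairs after word: {line!r}")
    pairs = list(zip(rest[0::2], rest[1::2]))
    return word, pairs


word, pairs = parse_enhanced_brill_line("word pos1 lemma1 pos2 lemma2")
print(word, pairs)
```

Because the first pair on a line is meant to be the most common part of speech, the sketch keeps the pairs in their original order rather than storing them in an unordered structure.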

This type of lexicon is an enhancement over the simple lexicon format popularized by Eric Brill's part of speech tagger in the early 1990s. The original Brill lexicon did not provide for specifying the lemmata.

The enhanced Brill entries are merged with the input MorphAdorner lexicon to produce an updated output MorphAdorner format lexicon. The first part of speech for each word is added with a count of two, while the remaining parts of speech are added with a count of one. When a word to be added already exists in the MorphAdorner lexicon, only the new parts of speech are added to the existing lexicon entry.
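The merge rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not MorphAdorner's implementation: the lexicon is modeled as an in-memory dictionary (not the on-disk MorphAdorner lexicon format), the name merge_entry is hypothetical, and the sketch assumes that new parts of speech added to an existing word follow the same two/one counting rule based on their position in the Brill entry.

```python
def merge_entry(lexicon, word, pairs):
    """Merge one enhanced Brill entry into an in-memory lexicon.

    lexicon: dict mapping word -> {pos: {"lemma": str, "count": int}}
             (a hypothetical model, not the MorphAdorner file format).
    pairs:   list of (pos, lemma) tuples, most common part of speech first.
    """
    entry = lexicon.setdefault(word, {})
    for index, (pos, lemma) in enumerate(pairs):
        if pos in entry:
            # Parts of speech already in the lexicon are left untouched.
            continue
        # Assumed rule: first listed part of speech gets 2, the rest get 1.
        entry[pos] = {"lemma": lemma, "count": 2 if index == 0 else 1}
    return lexicon


# Placeholder tags and counts, for illustration only.
lexicon = {"word": {"pos2": {"lemma": "lemma2", "count": 10}}}
merge_entry(lexicon, "word", [("pos1", "lemma1"), ("pos2", "lemma2")])
merge_entry(lexicon, "Chippewas", [("np2", "Chippewa")])
print(lexicon)
```

In this sketch an existing part of speech keeps its original count and lemma, so merging the same Brill lexicon twice leaves the result unchanged.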

Enhanced Brill lexicons are convenient for adding large lists of words such as proper and place names, foreign language words, and so on. Here is a small section of a sample enhanced Brill lexicon.

Chippewas np2 Chippewa
mor'n d|cs more|than
quicker'n jc|cs quick|than
y'r po22 you
you'se pn22|vbb you|be
youv'e pn22|vhb you|have

MorphAdorner also allows you to merge a simple Brill lexicon into a MorphAdorner lexicon. A simple Brill lexicon only provides the list of parts of speech for each word, not the lemmata.
