  Merging A Brill Lexicon
 
 

MergeBrillLexicon merges the contents of a Brill format lexicon with a MorphAdorner format lexicon into a combined MorphAdorner lexicon.

Usage:

mergebrilllexicon lexicon.lex brilllexicon.txt mergedlexicon.lex

where

  • lexicon.lex -- input MorphAdorner format word lexicon.
  • brilllexicon.txt -- input Brill format word lexicon to be merged with MorphAdorner word lexicon.
  • mergedlexicon.lex -- output merged lexicon in MorphAdorner format.

A Brill lexicon is a plain text file, encoded in UTF-8, that lists words and their possible part of speech tags. Each word appears on a separate line. The first token on each line is the word. The remaining tokens are the potential parts of speech for the word, separated by blanks or tab characters. The most commonly occurring part of speech should be listed first.

word pos1 pos2 pos3 ...

This type of lexicon was popularized by Eric Brill's part of speech tagger in the early 1990s.
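
The line format described above can be read with a few lines of code. The following is a minimal Python sketch, not part of MorphAdorner; the function names parse_brill_line and parse_brill_lexicon are hypothetical. It assumes that a word with no tags is kept with an empty tag list and that a repeated word replaces the earlier entry, neither of which is specified here.

```python
from typing import Dict, List, Optional, Tuple


def parse_brill_line(line: str) -> Optional[Tuple[str, List[str]]]:
    """Split one Brill lexicon line into (word, [part of speech tags])."""
    tokens = line.split()  # splits on any run of whitespace (blanks or tabs)
    if not tokens:
        return None  # blank line: nothing to parse
    return tokens[0], tokens[1:]


def parse_brill_lexicon(text: str) -> Dict[str, List[str]]:
    """Parse Brill lexicon text; the first tag listed is the most common one."""
    lexicon: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parsed = parse_brill_line(line)
        if parsed is None:
            continue
        word, tags = parsed
        lexicon[word] = tags  # assumption: a repeated word replaces the earlier entry
    return lexicon


# Small in-memory example mixing blank and tab separators.
sample = "Yellott np1\nlit\tfw-fr\nliterary j\n"
entries = parse_brill_lexicon(sample)
```

To read a lexicon from disk, the text could be loaded with open(path, encoding="utf-8").read() and passed to parse_brill_lexicon.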

The Brill entries are merged with the input MorphAdorner lexicon to produce an updated output lexicon in MorphAdorner format. The first part of speech for each word is added with a count of two, while the remaining parts of speech are added with a count of one. The default English lemmatizer is used to determine lemmata for the Brill words. When a word to be added already exists in the MorphAdorner lexicon, only the new parts of speech are added to the existing lexicon entry.
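
The counting rule above can be illustrated with a short Python sketch. This is not the MergeBrillLexicon implementation: it models a lexicon as an in-memory dictionary from word to part of speech counts, it does not model the MorphAdorner lexicon file format or the lemmatizer step, and the function name merge_brill_entries is hypothetical. It also assumes that a new part of speech added to an existing word follows the same rule (two if listed first, one otherwise), which is not stated explicitly.

```python
from typing import Dict, List

Lexicon = Dict[str, Dict[str, int]]  # word -> {part of speech: count}


def merge_brill_entries(lexicon: Lexicon, brill: Dict[str, List[str]]) -> Lexicon:
    """Return a new lexicon with the Brill entries merged in."""
    # Copy the input so the original lexicon is left unchanged.
    merged: Lexicon = {word: dict(counts) for word, counts in lexicon.items()}
    for word, tags in brill.items():
        counts = merged.setdefault(word, {})
        for position, tag in enumerate(tags):
            if tag in counts:
                continue  # existing parts of speech keep their current counts
            # Assumption: first-listed tag gets 2, every later tag gets 1.
            counts[tag] = 2 if position == 0 else 1
    return merged


# Illustrative data only; the existing count of 7 is made up.
existing = {"literary": {"j": 7}}
brill = {"literary": ["j", "np1"], "Yelton": ["np1"], "lit": ["fw-fr", "j"]}
merged = merge_brill_entries(existing, brill)
```

In this example "literary" keeps its existing count for j and gains np1 with a count of one, while the new words "Yelton" and "lit" receive a count of two for their first tag.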

Brill lexicons are convenient for adding large lists of words such as proper names, place names, foreign language words, and so on. Here is a small section of a sample Brill lexicon:

Yellott np1
Yellowby np1
Yellville np1
Yelton np1
Yelverton np1
lieu fw-fr
lieux fw-fr
lire fw-fr
lit fw-fr
literary j
livre fw-fr
livres fw-fr
loi fw-fr
lois fw-fr
loix fw-fr
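
A word list such as a set of place names can be turned into Brill lexicon lines like those above with very little code. The following Python sketch is only an illustration; the function name brill_lines is hypothetical. It assumes every word in the list receives the same single tag, and it skips entries that contain whitespace, because the first whitespace-delimited token on a line is taken as the word.

```python
from typing import Iterable


def brill_lines(words: Iterable[str], tag: str) -> str:
    """Return Brill lexicon text that gives every word the same single tag."""
    lines = []
    for word in words:
        word = word.strip()
        # Skip empty entries and entries with internal whitespace (assumption:
        # such entries cannot be expressed as a single-token word).
        if not word or any(ch.isspace() for ch in word):
            continue
        lines.append(f"{word} {tag}\n")
    return "".join(lines)


text = brill_lines(["Yelton", "Yelverton", "New York", ""], "np1")
```

The resulting text can be written to a UTF-8 file and passed as the Brill lexicon argument to mergebrilllexicon.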

MorphAdorner also defines an enhanced Brill lexicon which provides the lemmata for each word's parts of speech. MergeEnhancedBrillLexicon allows you to merge an enhanced Brill lexicon into a MorphAdorner lexicon.

 

E-mail: pib@northwestern.edu
Last updated Sun Mar 15 05:52:36 2009. © 2007, 2008 Northwestern University.