NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Lexicon Lookup

A MorphAdorner word lexicon for a corpus stores every spelling that appears in the corpus, along with the lemmata and parts of speech for each spelling. Each lexicon entry also records the number of times that spelling appears, both overall and broken down by part of speech. MorphAdorner currently provides two English language lexicons, one for Early Modern English and one for Nineteenth Century Fiction.

MorphAdorner augments the lexicons with auxiliary lists of words which do not appear in the corpus. These include extensive lists of proper names, common foreign words, and combinations of existing words with parts of speech that do not appear in the corpus. These are assigned an "occurrence" count of one. These auxiliary lists improve the ability of MorphAdorner to adorn text with parts of speech and recognize proper names and places.

Lexicon File Format

Lexicon files are plain text files encoded in UTF-8. Each line in the lexicon file takes the following form:

spelling countspelling pos1 lemma1 countpos1 pos2 lemma2 countpos2 ...

where

  • spelling is the spelling for a word,
  • countspelling is the number of times the spelling appears in the training data,
  • pos1 is the tag corresponding to the most commonly occurring part of speech for this spelling,
  • lemma1 is the lemma form for this spelling when it is tagged pos1,
  • countpos1 is the number of times the pos1 tag appeared, and
  • pos2, lemma2, countpos2, etc. are the other possible parts of speech with their lemmata and counts.

These fields are separated by tab characters.
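The line format described above can be read with a few lines of code. The following is a minimal sketch, not MorphAdorner's own parser: the names PosEntry, LexiconEntry, and parse_lexicon_line are hypothetical, and the sketch assumes every line holds a spelling, its count, and one or more complete (pos, lemma, count) triples.

```python
from typing import NamedTuple


class PosEntry(NamedTuple):
    pos: str    # part-of-speech tag, e.g. "vvd"
    lemma: str  # lemma for the spelling under this tag
    count: int  # times the spelling appeared with this tag


class LexiconEntry(NamedTuple):
    spelling: str
    count: int                # times the spelling appeared overall
    analyses: list[PosEntry]  # one item per (pos, lemma, count) triple


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one tab-separated lexicon line (hypothetical helper)."""
    fields = line.rstrip("\r\n").split("\t")
    # Two leading fields, then at least one complete triple.
    if len(fields) < 5 or (len(fields) - 2) % 3 != 0:
        raise ValueError(f"malformed lexicon line: {line!r}")
    analyses = [
        PosEntry(fields[i], fields[i + 1], int(fields[i + 2]))
        for i in range(2, len(fields), 3)
    ]
    return LexiconEntry(fields[0], int(fields[1]), analyses)


entry = parse_lexicon_line("died\t803\tvvd\tdie\t607\tvvn\tdie\t196")
print(entry.spelling, entry.count, [a.pos for a in entry.analyses])
```

Splitting on the tab character only, rather than on any whitespace, keeps the fields aligned even if a spelling or lemma were to contain a space.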

The raw counts are stored rather than probabilities so that new training data can be used to update the lexicon easily, and so that individual part of speech taggers can apply different methods of count smoothing.
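As an illustration of why raw counts are convenient, the sketch below merges counts from new training data by simple addition and then derives smoothed tag probabilities on demand. Add-k (Laplace) smoothing is used here only as an example; it is an assumption, not necessarily the method any MorphAdorner tagger applies. The function names, the three-tag tag set, and the "new data" counts are all made up for the illustration.

```python
from collections import Counter


def update_counts(old: Counter, new: Counter) -> Counter:
    """Fold counts from new training data into existing counts."""
    return old + new


def tag_probabilities(pos_counts, tagset, k=1.0):
    """Add-k smoothed P(tag | spelling) for every tag in tagset.

    Assumes every tag in pos_counts is also in tagset.
    """
    total = sum(pos_counts.values()) + k * len(tagset)
    return {tag: (pos_counts.get(tag, 0) + k) / total for tag in tagset}


# Counts for "died" taken from the sample lexicon lines.
died = Counter({"vvd": 607, "vvn": 196})
# Hypothetical counts from additional training data.
died = update_counts(died, Counter({"vvd": 3, "vvn": 1}))

tagset = ["vvd", "vvn", "n1"]  # toy tag set for the example
for tag, p in tag_probabilities(died, tagset).items():
    print(tag, round(p, 4))
```

Had the lexicon stored probabilities instead, merging new data would require knowing the original totals, and every tagger would be tied to whatever smoothing produced those probabilities.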

The following are a few lines from the Nineteenth Century Fiction lexicon.

die 1660 vvi die 1164 n1 die 22 vvb die 474
die-away 2 j die-away 2
died 803 vvd die 607 vvn die 196

For example, the spelling died appears 803 times in the training data. It appears 607 times as the part of speech vvd and 196 times as the part of speech vvn. Its lemma in both cases is die.
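In the sample lines, the per-tag counts add up to the spelling count (607 + 196 = 803 for died). Assuming this holds for corpus-derived entries in general, which the format description does not state outright, a quick sanity check on a lexicon line might look like the sketch below. The function name pos_counts_match is hypothetical.

```python
# The three sample lines, with fields separated by tab characters.
SAMPLE_LINES = [
    "die\t1660\tvvi\tdie\t1164\tn1\tdie\t22\tvvb\tdie\t474",
    "die-away\t2\tj\tdie-away\t2",
    "died\t803\tvvd\tdie\t607\tvvn\tdie\t196",
]


def pos_counts_match(line: str) -> bool:
    """True if the per-tag counts sum to the overall spelling count."""
    fields = line.split("\t")
    total = int(fields[1])
    # Counts sit at positions 4, 7, 10, ... (last field of each triple).
    per_tag = [int(c) for c in fields[4::3]]
    return sum(per_tag) == total


for line in SAMPLE_LINES:
    print(line.split("\t")[0], pos_counts_match(line))
```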

When lemmata are not available, an "*" appears in the lemma field. Suffix lexicons, for example, contain "*" for all lemmata.
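Code that reads the lemma field needs to treat "*" as "no lemma" rather than as a lemma string. A minimal sketch follows; the helper name known_lemma is hypothetical, and the suffix-lexicon line is invented for illustration rather than taken from a real MorphAdorner lexicon.

```python
from typing import Optional


def known_lemma(lemma_field: str) -> Optional[str]:
    """Return the lemma, or None when the field holds the "*" placeholder."""
    return None if lemma_field == "*" else lemma_field


# Invented suffix-lexicon line, for illustration only.
fields = "ness\t12\tn1\t*\t12".split("\t")
print(known_lemma(fields[3]))  # placeholder, so no lemma is available
print(known_lemma("die"))      # ordinary lemma is passed through
```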

You can try looking up spellings in MorphAdorner's Lexicon lookup online.
