MorphAdorner Lexicon Lookup

Lexicon Lookup

A MorphAdorner word lexicon for a corpus stores all the spellings of words that appear in the corpus, along with the lemmata and parts of speech for each spelling. Each lexicon entry also provides the number of times that spelling appears, both overall and broken down by part of speech. MorphAdorner currently provides two English language lexicons, one for Early Modern English and one for Nineteenth Century Fiction.

MorphAdorner augments the lexicons with auxiliary lists of words which do not appear in the corpus. These include extensive lists of proper names, common foreign words, and combinations of existing words with parts of speech that do not appear in the corpus. These are assigned an "occurrence" count of one. These auxiliary lists improve the ability of MorphAdorner to adorn text with parts of speech and recognize proper names and places.

Lexicon File Format

Lexicon files are plain text files encoded in UTF-8. Each line in the lexicon file takes the following form:

spelling countspelling pos1 lemma1 countpos1 pos2 lemma2 countpos2 ...

where

spelling is the spelling for a word,
countspelling is the number of times the spelling appears in the training data,
pos1 is the tag corresponding to the most commonly occurring part of speech for this spelling,
lemma1 is the lemma form for this spelling,
countpos1 is the number of times the pos1 tag appeared, and
pos2, lemma2, countpos2, etc. are the other possible parts of speech for this spelling, with their lemmata and counts.

These fields are separated by tab characters.
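As an illustration of the format described above, here is a minimal Python sketch that splits one lexicon line into its fields. The names `PosEntry`, `LexiconEntry`, and `parse_lexicon_line` are hypothetical and are not part of MorphAdorner. The sketch assumes that every part of speech contributes exactly three tab-separated fields (tag, lemma, count).

```python
from typing import List, NamedTuple


class PosEntry(NamedTuple):
    pos: str    # part-of-speech tag, e.g. "vvd"
    lemma: str  # lemma, or "*" when no lemma is available
    count: int  # times the spelling occurred with this tag


class LexiconEntry(NamedTuple):
    spelling: str
    count: int                 # total occurrences of the spelling
    categories: List[PosEntry]


def parse_lexicon_line(line: str) -> LexiconEntry:
    """Parse one tab-separated lexicon line (hypothetical helper)."""
    fields = line.rstrip("\r\n").split("\t")
    # Expect spelling, count, then one or more (pos, lemma, count) triples.
    if len(fields) < 5 or (len(fields) - 2) % 3 != 0:
        raise ValueError(f"malformed lexicon line: {line!r}")
    categories = [
        PosEntry(fields[i], fields[i + 1], int(fields[i + 2]))
        for i in range(2, len(fields), 3)
    ]
    return LexiconEntry(fields[0], int(fields[1]), categories)


entry = parse_lexicon_line("died\t803\tvvd\tdie\t607\tvvn\tdie\t196")
print(entry.spelling, entry.count, [c.pos for c in entry.categories])
```

A file would be read with `open(path, encoding="utf-8")` and each line passed to this function.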

The raw counts are stored rather than probabilities so that new training data can be used to update the lexicon easily, and so that individual part of speech taggers can apply different methods of count smoothing.
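To show why raw counts are convenient, the following sketch merges counts from additional training data by simple addition and then derives tag probabilities with add-one (Laplace) smoothing. Both function names, the added counts, and the three-tag tagset are made up for the example, and add-one smoothing is only one possible method. This document does not say which smoothing MorphAdorner's taggers actually use.

```python
from collections import Counter
from typing import Dict, Iterable


def update_counts(existing: Counter, new: Counter) -> Counter:
    """Merge counts from new training data into existing raw counts."""
    return existing + new


def smoothed_probabilities(
    tag_counts: Counter, tagset: Iterable[str], alpha: float = 1.0
) -> Dict[str, float]:
    """Additive smoothing over a fixed tagset (alpha=1 is add-one)."""
    tags = list(tagset)
    total = sum(tag_counts.get(t, 0) for t in tags) + alpha * len(tags)
    return {t: (tag_counts.get(t, 0) + alpha) / total for t in tags}


# Raw counts for the spelling "died" from the sample lexicon entry.
died = Counter({"vvd": 607, "vvn": 196})

# Hypothetical counts from additional training data.
died = update_counts(died, Counter({"vvd": 3, "vvn": 1}))

# A tiny illustrative tagset; a real tagset is much larger.
probs = smoothed_probabilities(died, ["vvd", "vvn", "n1"])
print(died, probs)
```

Had the lexicon stored probabilities instead, the update step would require knowing the original totals, and each tagger would be tied to one smoothing choice.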

The following lines are taken from the Nineteenth Century Fiction lexicon.

die 1660 vvi die 1164 n1 die 22 vvb die 474
die-away 2 j die-away 2
died 803 vvd die 607 vvn die 196

For example, the spelling died appears 803 times in the training data. It appears 607 times as the part of speech vvd and 196 times as the part of speech vvn. Its lemma in both cases is die.
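The counts in the sample entries can be checked arithmetically: for each spelling, the per-tag counts add up to the spelling count (for instance 1164 + 22 + 474 = 1660 for die). The short sketch below runs that check over the sample lines. The function name `count_mismatches` is hypothetical, and the assumption that the counts always sum this way is drawn only from these samples.

```python
# The three sample entries, with fields separated by tab characters.
sample = (
    "die\t1660\tvvi\tdie\t1164\tn1\tdie\t22\tvvb\tdie\t474\n"
    "die-away\t2\tj\tdie-away\t2\n"
    "died\t803\tvvd\tdie\t607\tvvn\tdie\t196\n"
)


def count_mismatches(text: str) -> list:
    """Return (spelling, total, sum of tag counts) for inconsistent lines."""
    mismatches = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        total = int(fields[1])
        # Tag counts sit at positions 4, 7, 10, ... (every third field).
        pos_total = sum(int(c) for c in fields[4::3])
        if pos_total != total:
            mismatches.append((fields[0], total, pos_total))
    return mismatches


print(count_mismatches(sample))
```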

When lemmata are not available, an "*" appears in the lemma field. Suffix lexicons, for example, contain "*" for all lemmata.

You can try looking up spellings in MorphAdorner's Lexicon lookup online.
