NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Spelling Standardization Process

This section describes the process by which MorphAdorner maps a variant spelling to a standard (usually modern) form.

Spelling Map File Formats

Spelling maps are the key to MorphAdorner's methodology for standardizing or modernizing spelling. A spelling map is a UTF-8 text file in which each line contains two fields separated by a tab character. The first field is a variant spelling. The second field is the standardized spelling for the variant.

Currently MorphAdorner uses two maps. The first is culled primarily from nineteenth-century fiction texts and currently contains about 5,000 entries. The second is culled from Early Modern English texts and contains over 350,000 known variants. There is also a short list of about 400 variants whose standard spelling is known to vary by word class.

Here are some entries from the Early Modern English spelling map showing standard spellings for forms of "advance." The first column is the variant, the second column is the standard spelling.

aduauce	advance
aduauced	advanced
aduauceing	advancing
aduaucement	advancement
aduauceth	advanceth
aduaucing	advancing
aduaucyng	advancing
aduaucynge	advancing
aduaunc'd	advanced
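
Reading a map of this shape takes only a few lines. The following is a minimal Python sketch, assuming one tab-separated pair per line as described above. The function name `parse_spelling_map` is hypothetical and is not part of MorphAdorner's own API.

```python
def parse_spelling_map(lines):
    """Build a {variant: standard} dict from tab-separated lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        variant, sep, standard = line.partition("\t")
        if not sep:
            continue  # skip malformed lines that contain no tab
        mapping[variant] = standard
    return mapping


# A few of the entries shown above, as they would appear in the file.
sample = ["aduauce\tadvance\n", "aduaucyng\tadvancing\n", "aduaunc'd\tadvanced\n"]
spelling_map = parse_spelling_map(sample)
print(spelling_map["aduaucyng"])  # advancing
```

For a real file, the same function could be fed an open file handle, for example `open(path, encoding="utf-8")`, since a file handle iterates over its lines.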

The file of spellings by word class is similar except that it contains multiple sections. Each section is headed by a word class name followed by a colon. This is followed by the list of variant and standard spelling pairs for that word class. For example, the adjectives section starts:

adjective:
agean	again
bad	bad
blew	blue
browne	brown
chaste	chaste
christen	christian
clere	clear
cliver	clever
cold	cold
cross	cross
cumfbler	cumfortabler

while the verb section starts:

verb:
d'	do
'm	am
'old	hold
's	is
aint	aren't
ain't	aren't
allays	allays
an't	aren't
ar	are
ar'	are
arena	aren't
bad	bade

Some spellings map to themselves in one word class because they have a different standard spelling in another. The spelling "bad" is an example: it stays "bad" as an adjective but maps to "bade" as a verb.
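
The sectioned layout can be read into one map per word class. This is a sketch only, assuming that entries within a section are tab-separated like the main spelling maps and that a header is any tab-free line ending in a colon. The name `parse_word_class_maps` is hypothetical.

```python
def parse_word_class_maps(lines):
    """Return {word_class: {variant: standard}} from sectioned lines."""
    maps = {}
    current = None  # the section currently being filled
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if "\t" not in line and line.endswith(":"):
            # Section header such as "adjective:" starts a new map.
            current = maps.setdefault(line[:-1].strip(), {})
        elif current is not None and "\t" in line:
            variant, standard = line.split("\t", 1)
            current[variant] = standard
    return maps


sample = ["adjective:", "bad\tbad", "blew\tblue", "verb:", "bad\tbade", "'s\tis"]
by_class = parse_word_class_maps(sample)
print(by_class["adjective"]["bad"], by_class["verb"]["bad"])  # bad bade
```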

Standardization Steps

MorphAdorner attempts to standardize a spelling as follows.

  1. Load the list of known standard spellings. This is a combination of entries from the 1911 Webster's Dictionary and entries verified against the Oxford English Dictionary from ongoing work with the Monk project texts.

  2. Load maps of known variant spellings to modern spellings as described above.

  3. Create a ternary trie of all the standard and variant spellings. A ternary trie allows very efficient extraction of strings within a specified edit distance of a given string. In other words, it allows efficient extraction of the list of words whose spellings are close to any given word's spelling.

  4. Load a list of modernization rules. Currently MorphAdorner defines about 70 such rules which can transform many variant spellings to their modern spellings, or come very close. The rules also provide for correcting defective spellings that contain "gap" markers reflecting illegible letters in the original text. Some sample rules include:

  • Transform the ending "me~" to "men"
  • Transform the ending "ynge" to "ing"
  • Transform "uu" to "w"
  • Transform "v" followed by a non-vowel to "u"
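
The four sample rules can be expressed as ordered regular-expression substitutions. The sketch below covers only those four rules, not the full set of about 70. The rule order, the lowercasing step, and the choice of a, e, i, o, u as the vowels are assumptions, and `apply_rules` is a hypothetical name.

```python
import re

# (pattern, replacement) pairs for the four sample rules only.
RULES = [
    (re.compile(r"me~$"), "men"),        # ending "me~" -> "men"
    (re.compile(r"ynge$"), "ing"),       # ending "ynge" -> "ing"
    (re.compile(r"uu"), "w"),            # "uu" -> "w"
    (re.compile(r"v(?=[^aeiou])"), "u"), # "v" before a non-vowel -> "u"
]


def apply_rules(spelling):
    """Apply each rule once, in order, to the lowercased spelling."""
    result = spelling.lower()
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result


for word in ["wome~", "uuritynge", "vnder", "vniuersitie"]:
    print(word, "->", apply_rules(word))
# wome~ -> women
# uuritynge -> writing
# vnder -> under
# vniuersitie -> uniuersitie
```

With only these four rules, "vniuersitie" stops at "uniuersitie". Reaching "universitie" needs further rules from the full set, which are not listed here.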

Now for each old spelling, perform the following steps.

  1. Apply all applicable transformation rules, which yields an improved spelling. If this spelling appears in the standard spellings list, we're done. For example, applying the rules to "strykynge" directly produces the modern standard spelling "striking".

  2. See if the transformed spelling appears in the variant spellings map. If so, assign the mapped spelling value as the standard spelling, and we're done. For example, applying the rules to "vniuersitie" produces "universitie". This is not the modern spelling, but it is close. The mapped spelling list for Early Modern English provides an entry for "universitie", giving the modern spelling as "university".

  3. Compile a list of words whose spellings are "close to" the transformed spelling by using the ternary trie to search quickly for all words within a specified edit distance of the transformed word.

  4. Compute a measure of string similarity between each found spelling and the transformed spelling. String similarity measures how similar two strings of characters are. A similarity of 0.0 indicates two strings are completely different, while a similarity of 1.0 indicates two strings are identical. MorphAdorner uses a weighted similarity score based upon letter pair similarity, phonetic distance, and edit distance.

  5. Choose the found spelling with the highest similarity as the most probable correct/standard spelling. If this spelling appears in the standard spellings list, we're done. If not, see if it appears in the mapped spellings list. If so, take the mapped spelling value as the standard spelling, and we're done. Otherwise, accept the transformed spelling as the standard spelling, with the proviso that it may not be a proper standard spelling and requires further review.
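
The five steps above can be sketched as a single function. This is an illustration under stated assumptions, not MorphAdorner's implementation: a brute-force Levenshtein scan stands in for the ternary trie search, plain letter-pair similarity stands in for the weighted score, and all names (`standardize`, `transform`, `needs_review`) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Letter-pair similarity: shared adjacent pairs over total pairs."""
    pairs_a = [a[i:i + 2] for i in range(len(a) - 1)]
    remaining = [b[i:i + 2] for i in range(len(b) - 1)]
    total = len(pairs_a) + len(remaining)
    if total == 0:
        return 1.0 if a == b else 0.0
    shared = 0
    for pair in pairs_a:
        if pair in remaining:
            remaining.remove(pair)
            shared += 1
    return 2.0 * shared / total


def standardize(word, transform, standard, variant_map, max_dist=2):
    """Return (spelling, needs_review) following steps 1 to 5."""
    spelling = transform(word)                       # step 1
    if spelling in standard:
        return spelling, False
    if spelling in variant_map:                      # step 2
        return variant_map[spelling], False
    known = set(standard) | set(variant_map)         # step 3 (trie stand-in)
    near = [w for w in known if edit_distance(spelling, w) <= max_dist]
    if near:                                         # steps 4 and 5
        best = max(near, key=lambda w: (similarity(spelling, w), w))
        if best in standard:
            return best, False
        return variant_map[best], False
    return spelling, True                            # flag for review


standard = {"university", "striking", "advance"}
variant_map = {"universitie": "university"}
print(standardize("universitie", str.lower, standard, variant_map))
print(standardize("strikeing", str.lower, standard, variant_map))
print(standardize("zzzzqq", str.lower, standard, variant_map))
```

Here `str.lower` is passed as a placeholder for the rule-based transformation, and the small `standard` and `variant_map` collections are made-up test data.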

Interactions with Part Of Speech

The standard spelling for some words cannot be determined until the word's part of speech is known. Examples of such words include doe, bee, poor, marie, and wast. Thus "doe" is most likely "doe" (a female deer) when it appears as a noun, but most likely "do" when it appears as a verb. Similarly, "marie" is probably "merry" when it appears as an adjective, but most likely "marry" when used as a verb.

MorphAdorner keeps a short list of variant spellings by general word class. The final standardized spelling is not assigned until a part of speech has been assigned, so these special cases can usually be disambiguated properly.
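
A sketch of this deferred lookup follows. It assumes the word-class list is consulted first and the general map second. The entries are taken from the examples in the prose above rather than from the actual list, and `standardize_with_pos` is a hypothetical name.

```python
# Illustrative entries only, based on the "doe" and "marie" examples.
WORD_CLASS_MAPS = {
    "noun": {"doe": "doe"},
    "verb": {"doe": "do", "marie": "marry"},
    "adjective": {"marie": "merry"},
}


def standardize_with_pos(spelling, word_class, general_map):
    """Prefer a word-class-specific mapping, else use the general map."""
    by_class = WORD_CLASS_MAPS.get(word_class, {})
    if spelling in by_class:
        return by_class[spelling]
    # No class-specific entry: fall back, leaving unknown spellings as-is.
    return general_map.get(spelling, spelling)


print(standardize_with_pos("doe", "noun", {}))         # doe
print(standardize_with_pos("doe", "verb", {}))         # do
print(standardize_with_pos("marie", "adjective", {}))  # merry
```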

Standardizing Proper Names

Proper names can appear with a bewildering variety of spellings even within a single work. Some variants can be transformed to their modern standard forms by using the general standardization rules presented above. For example, the spellings Syracvse and Vlysses, which are the commonest variants of those proper name spellings in the TCP/EEBO version of Plutarch's Lives, both transform by rule to their modern spellings Syracuse and Ulysses.

Other variants are not so easily rectified. The place name Cappadocia appears in Plutarch's Lives as

CPADOCIA	1
Cappadocia	21
OHPPADOCIA	1
Coppadocia	1
CAPRADOCIA	1

where the frequency of occurrence follows each variant.

MorphAdorner currently uses the following algorithm to look for standard spelling candidates for proper names. This is a variant of the extended search algorithm for standard spellings described above. Because we know we are looking for proper names, we can do a better job by limiting the search space to known proper names.

Proper name search algorithm

  1. Collect the list of known spellings of proper names (tagged with NUPOS parts of speech np1 and np2) in the early modern English lexicon. Currently there are around 66,000 such spellings.

  2. Construct a "name" ternary trie of the lowercase versions of all these names. A ternary trie allows very efficient extraction of strings within a specified edit distance of a given string.

  3. Construct a "consonant" ternary trie of the lowercase versions of the names with all vowels removed. For each unique combination of consonants (in order), store the list of spellings which reduce to that consonant string.
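
To make the two tries concrete, here is a small ternary search trie with an edit-distance search, written as a sketch rather than a copy of MorphAdorner's data structure. The search carries one row of the Levenshtein table down each branch and prunes a branch once every cell exceeds the limit. Treating only a, e, i, o, u as vowels is an assumption, and the three names are sample data.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False


class TernaryTrie:
    def __init__(self):
        self.root = None

    def insert(self, word):
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def near(self, target, max_dist):
        """Return {word: distance} for stored words within max_dist."""
        results = {}
        self._near(self.root, target, max_dist,
                   list(range(len(target) + 1)), "", results)
        return results

    def _near(self, node, target, max_dist, prev_row, prefix, results):
        if node is None:
            return
        # lo/hi siblings share this node's prefix, so they reuse prev_row.
        self._near(node.lo, target, max_dist, prev_row, prefix, results)
        self._near(node.hi, target, max_dist, prev_row, prefix, results)
        row = [prev_row[0] + 1]
        for j in range(1, len(target) + 1):
            cost = 0 if target[j - 1] == node.ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        word = prefix + node.ch
        if node.is_word and row[-1] <= max_dist:
            results[word] = row[-1]
        if min(row) <= max_dist:  # otherwise no extension can qualify
            self._near(node.eq, target, max_dist, row, word, results)


def consonant_skeleton(name):
    """Lowercase the name and drop the vowels a, e, i, o, u."""
    return "".join(ch for ch in name.lower() if ch not in "aeiou")


name_trie, consonant_trie, by_skeleton = TernaryTrie(), TernaryTrie(), {}
for name in ["cappadocia", "syracuse", "ulysses"]:
    name_trie.insert(name)
    skeleton = consonant_skeleton(name)
    consonant_trie.insert(skeleton)
    by_skeleton.setdefault(skeleton, []).append(name)

print(name_trie.near("cpadocia", 2))                                 # {'cappadocia': 2}
print(consonant_trie.near(consonant_skeleton("OHPPADOCIA"), 3))      # {'cppdc': 1}
```

The `by_skeleton` dictionary plays the role of the per-consonant-string list of spellings described in step 3.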

For each unknown name, perform the following steps.

  1. Find all strings in the "name" trie within a specified edit distance of the unknown name. An edit distance of 2 seems to be a good choice.

  2. If any names were found in step 1, compute a measure of string similarity between each found name and the unknown name. Choose the found name with the highest similarity as the most probable correct/standard spelling. Letter-pair similarity seems to work well as a measure of string similarity, but there are many other possible choices.

  3. If no names were found in step 1, find all strings in the "consonant" trie within a specified edit distance of the unknown name with vowels removed. An edit distance of 3 seems to be a good choice.

  4. If any consonant strings were found in step 3, perform the following steps for each consonant string.

     1. Pick up all the names which reduce to this consonant string.

     2. For each of those names, compute a measure of string similarity between the name and the unknown name (that is, between the full spellings).

     3. Keep a list of those found names with a similarity score above a reasonable threshold. 0.75 seems to be a good choice.

     4. Choose the found name with the highest similarity as the most probable correct/standard spelling.

If no names were found by either lookup procedure, leave the unknown name alone.
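
The whole lookup procedure can be sketched as follows. As assumptions, brute-force scans replace the two ternary tries, a score equal to the threshold is accepted, and only a, e, i, o, u count as vowels. The function names and the name list are hypothetical, and "Caeppadoociae" is a contrived input chosen to exercise the consonant fallback.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Letter-pair similarity between two lowercase strings."""
    pairs_a = [a[i:i + 2] for i in range(len(a) - 1)]
    remaining = [b[i:i + 2] for i in range(len(b) - 1)]
    total = len(pairs_a) + len(remaining)
    if total == 0:
        return 1.0 if a == b else 0.0
    shared = 0
    for pair in pairs_a:
        if pair in remaining:
            remaining.remove(pair)
            shared += 1
    return 2.0 * shared / total


def skeleton(name):
    return "".join(ch for ch in name if ch not in "aeiou")


def find_standard_name(unknown, known_names, name_dist=2, cons_dist=3, threshold=0.75):
    target = unknown.lower()
    # Steps 1 and 2: full-spelling neighbours, best similarity wins.
    near = [n for n in known_names if edit_distance(target, n) <= name_dist]
    if near:
        return max(near, key=lambda n: (similarity(target, n), n))
    # Steps 3 and 4: fall back to consonant strings.
    by_skeleton = {}
    for n in known_names:
        by_skeleton.setdefault(skeleton(n), []).append(n)
    target_skeleton = skeleton(target)
    candidates = []
    for skel, names in by_skeleton.items():
        if edit_distance(target_skeleton, skel) <= cons_dist:
            candidates += [n for n in names if similarity(target, n) >= threshold]
    if candidates:
        return max(candidates, key=lambda n: (similarity(target, n), n))
    return unknown  # nothing found: leave the name alone


names = ["cappadocia", "syracuse", "ulysses"]
print(find_standard_name("CPADOCIA", names))       # cappadocia
print(find_standard_name("Caeppadoociae", names))  # cappadocia
print(find_standard_name("Zzyzx", names))          # Zzyzx
```

The result is returned in lowercase because the known names are stored in lowercase. Restoring the original capitalization would be a separate step.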

Here is an example of the algorithm applied to the list of names above. In each case, only one candidate spelling (the correct one, it turns out) was found.

Names near CPADOCIA

cappadocia (0.75)

Names near Cappadocia

cappadocia (1.0)

Names near OHPPADOCIA

cappadocia (0.7777777777777778)

Names near Coppadocia

cappadocia (0.7777777777777778)

Names near CAPRADOCIA

cappadocia (0.7777777777777778)
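
The scores above are consistent with letter-pair similarity computed as twice the number of shared adjacent letter pairs divided by the total number of pairs in both strings. The sketch below assumes that definition and a case-insensitive comparison. It is not taken from the MorphAdorner source, and `letter_pair_similarity` is a hypothetical name.

```python
def letter_pairs(text):
    """Adjacent two-letter pairs of the lowercased text, with repeats."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def letter_pair_similarity(a, b):
    pairs_a, remaining = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(remaining)
    if total == 0:
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = 0
    for pair in pairs_a:
        if pair in remaining:
            remaining.remove(pair)  # each pair in b can match only once
            shared += 1
    return 2.0 * shared / total


for variant in ["CPADOCIA", "Cappadocia", "OHPPADOCIA", "Coppadocia", "CAPRADOCIA"]:
    print(variant, letter_pair_similarity(variant, "cappadocia"))
# CPADOCIA 0.75
# Cappadocia 1.0
# OHPPADOCIA 0.7777777777777778
# Coppadocia 0.7777777777777778
# CAPRADOCIA 0.7777777777777778
```

For example, "cpadocia" has 7 letter pairs and "cappadocia" has 9, of which 6 are shared, giving 2 × 6 / 16 = 0.75.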
