|
English texts of the past exhibit far greater spelling variance
than contemporary texts. Texts from the seventeenth century and earlier
times use conventions that differ from contemporary standards in the use
of "u" and "v", of "y", and of capitalization, among others.
Often the same word is spelled differently even within the same work.
By the eighteenth century, texts employ much more modern orthographic
standards, except for capitalization.
MorphAdorner uses rules, word lists, and extended search techniques
such as spelling correction methods and other heuristics to map variant
spellings to their standard (usually modern) form. For obsolete words,
a representative standard form is chosen, usually
the Oxford English Dictionary headword form. Presently MorphAdorner knows
a couple of hundred thousand variant spellings. Using this list,
MorphAdorner can automatically determine the correct standard form for
previously unseen spellings in many cases.
Sometimes a new spelling is just too different from any of the ones
MorphAdorner already knows. Using the extended search facilities
on such a spelling may result in a "standard spelling" which
veers far from the correct form. As time goes on we hope to reduce
the occurrence of such errors.
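
The layered lookup described above (word list first, then rules, then an
extended search) can be sketched roughly as follows. This is only an
illustration, not MorphAdorner's actual implementation: the tiny word
lists, the three rewrite rules, the function names, and the similarity
cutoff are all assumptions made for the example, and Python's difflib
stands in for whatever spelling correction methods MorphAdorner uses.

```python
import difflib

# Hypothetical stand-ins for MorphAdorner's much larger lists.
KNOWN_VARIANTS = {"loue": "love", "vnto": "unto", "haue": "have"}
STANDARD_WORDS = {"love", "unto", "have", "heaven", "give", "very", "joy"}


def rule_candidates(spelling):
    """Generate candidate spellings from a few illustrative rewrite rules."""
    candidates = []
    # Initial "v" often stands where modern spelling has "u" ("vnder").
    if spelling.startswith("v"):
        candidates.append("u" + spelling[1:])
    # Non-initial "u" often stands where modern spelling has "v" ("giue").
    if "u" in spelling[1:]:
        candidates.append(spelling[0] + spelling[1:].replace("u", "v"))
    # "y" often stands where modern spelling has "i".
    if "y" in spelling:
        candidates.append(spelling.replace("y", "i"))
    return candidates


def standardize(spelling, cutoff=0.8):
    """Map a variant spelling to a standard form, or return it unchanged."""
    word = spelling.lower()
    # 1. Already a standard spelling.
    if word in STANDARD_WORDS:
        return word
    # 2. A variant we have seen before.
    if word in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[word]
    # 3. A rule turns it into a standard spelling.
    for candidate in rule_candidates(word):
        if candidate in STANDARD_WORDS:
            return candidate
    # 4. Extended search: closest standard word above the cutoff.
    close = difflib.get_close_matches(
        word, sorted(STANDARD_WORDS), n=1, cutoff=cutoff
    )
    # Give up rather than guess when nothing is close enough.
    return close[0] if close else spelling


if __name__ == "__main__":
    for variant in ["loue", "giue", "heavven", "xyzzy"]:
        print(variant, "->", standardize(variant))
```

Lowering the cutoff lets the last step match more unseen spellings, but
it also makes it more likely that the result veers far from the correct
form, which is the trade-off noted above.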
Orthographic standardization improves the quality of
part-of-speech tagging,
name recognition,
and text searching.
However, standardization by itself isn't sufficient to fix some other
problems. These include the lack of the apostrophe
to mark the possessive case and the inconsistent practices
of capitalization as markers of proper nouns.
In English before 1700 the apostrophe never indicates
the genitive, and "her mother's daughter" is written
"her mothers daughter". An even more problematic example is
"her majesty's daughter" which appears in early texts as
"her majesties daughter". The use of the apostrophe as a genitive
marker gained ground during the eighteenth century, and has been used
as it is today since the early nineteenth century.
In the eighteenth century, the apostrophe is sometimes used as a plural
marker in certain character combinations. Thus "canoe's" is much more
likely to be a plural than a possessive form.
The modern practice of restricting capitalization to names,
namelike entities, and certain emphatic uses is about two centuries old.
In earlier English, nouns are freely capitalized, and capitalization is not
a reliable way of picking out proper nouns. However, proper nouns have
usually been capitalized in all forms of written English since about 1550.
Before that, names can appear in lower case.
In poetry the first word of each line is often capitalized even
when that word does not start a sentence. For purposes of part-of-speech
tagging, a simple workaround is to use the lower case form of a word
that does not start a sentence, except if the word appears in a list
of known proper names.
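
The workaround just described can be sketched in a few lines. This is a
minimal illustration under stated assumptions, not MorphAdorner's code:
the input is assumed to be already tokenized, sentence boundaries are
assumed to be marked only by ".", "!", and "?", and the proper name list
and the function name are hypothetical.

```python
# Hypothetical list of known proper names; a real list would be far larger.
KNOWN_PROPER_NAMES = {"Rome", "Caesar"}

# Assumption: only these tokens end a sentence.
SENTENCE_END = {".", "!", "?"}


def normalize_case(tokens):
    """Lowercase words that do not start a sentence, keeping known names."""
    result = []
    starts_sentence = True
    for token in tokens:
        if token in SENTENCE_END:
            result.append(token)
            starts_sentence = True
            continue
        if starts_sentence or token in KNOWN_PROPER_NAMES:
            # Keep sentence-initial words and known proper names as written.
            result.append(token)
        else:
            # Any other word is passed to the tagger in lower case.
            result.append(token.lower())
        starts_sentence = False
    return result


if __name__ == "__main__":
    # Two verse lines: "And" starts a line but not a sentence.
    tokens = ["When", "Caesar", "came", ",", "And", "Rome", "rejoiced", "."]
    print(normalize_case(tokens))
```

The same pass also lowercases freely capitalized common nouns in earlier
prose, since it treats every word that is neither sentence-initial nor in
the name list alike.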
You can read a more detailed description of the
spelling standardization process.
You can try MorphAdorner's
spelling standardizer online.
|