MorphAdorner Language Recognizer

Language Recognizer

Literary texts are generally composed in one principal language, possibly with short passages (letters, quotations) in other languages. It is helpful to categorize texts by principal language and by the most prominent secondary language, if any. MorphAdorner includes a simple statistical method, based on character n-grams and rank-order statistics, that determines the principal language of a text and lists possible secondary languages. The method is described by William B. Cavnar and John M. Trenkle in "N-Gram-Based Text Categorization", which appeared in the Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. MorphAdorner's implementation follows one written by Nakatani Shuyo.
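To make the idea of character n-gram rank-order comparison concrete, here is a minimal Python sketch in the spirit of the Cavnar and Trenkle paper. It is an illustration only, not MorphAdorner's code: the function names, the n-gram lengths (1 to 3), the profile size of 300, and the two tiny training sentences are all assumptions chosen for brevity, and the tokenization is simpler than the paper's. Each language is summarized by its most frequent n-grams in rank order, and a document is compared to each profile by summing how far each of its n-grams is "out of place".

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "  # mark word boundaries with spaces
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a maximum penalty."""
    rank_in_lang = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - rank_in_lang[gram]) if gram in rank_in_lang else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def rank_languages(text, lang_profiles):
    """Return language names ordered from smallest to largest distance."""
    doc = ngram_profile(text)
    distances = {lang: out_of_place(doc, prof) for lang, prof in lang_profiles.items()}
    return sorted(distances, key=distances.get)


# Toy training data (hypothetical); real profiles need far more text per language.
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
FR_SAMPLE = "le renard brun saute par dessus le chien paresseux et le chat est assis sur le tapis"
PROFILES = {"english": ngram_profile(EN_SAMPLE), "french": ngram_profile(FR_SAMPLE)}

print(rank_languages("the dog and the cat", PROFILES))
```

The first name in the returned list plays the role of the principal language and the following names are candidate secondary languages. With training samples this small the ordering is not trustworthy; the sketch only shows the mechanics.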

During the Monk project we used this language recognition mechanism to help screen documents that were nominally English but in fact contained large admixtures of unmarked foreign language text. Some examples:

EEBO document A36803 had an English introduction, a lot of Latin, and a lot of English names. It had a low English score and a non-negligible Latin score. We excluded this from the Monk corpus of EEBO texts.
EEBO document A57469 had an English title but was classified as primarily French. It turned out to be a legal text with a lot of French and Latin. We also excluded this from the Monk corpus.
EEBO document A34069 had a low English score (~0.7). It turned out to be an account of a trading voyage containing a lot of Dutch interaction.

From these and other experiences we determined that the language recognizer scores offered a reliable way to identify texts that might contain significant amounts of non-English text. The specific language labels were less reliable. For example, French and Latin -- particularly in older texts -- were difficult to distinguish from each other, but both were clearly distinguishable from English. Likewise, Scots often appeared as the second choice for English texts. The Scots score was typically higher for older English texts, which contain many old-fashioned variant spellings. In more modern texts a high Scots score often pointed to novels containing swaths of Scots dialect.
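The screening described above can be sketched as a small Python check. This is a hypothetical illustration, not part of MorphAdorner: the function name, the assumption that scores lie between 0 and 1 with higher meaning a better match, and the 0.8 cutoff are all invented for the example. The check relies on the score for the expected language rather than on the label of the runner-up, since the labels proved less reliable than the scores.

```python
def needs_review(scores, expected="english", min_score=0.8):
    """Flag a document whose expected language is not the top choice or scores low.

    scores maps language name -> score (assumed: higher is better).
    min_score is a hypothetical cutoff; tune it against your own corpus.
    """
    if not scores:
        return True  # nothing was recognized, so a person should look
    best = max(scores, key=scores.get)
    return best != expected or scores.get(expected, 0.0) < min_score


# Made-up score tables, for illustration only.
print(needs_review({"english": 0.70, "dutch": 0.20}))
print(needs_review({"english": 0.95, "scots": 0.90}))
print(needs_review({"french": 0.60, "english": 0.50}))
```

A flagged document is only a candidate for exclusion; the examples above were all confirmed by inspecting the text itself.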

You can try MorphAdorner's default language recognizer online. It recognizes over 70 languages. The longer the text, the more reliable the detection.
