edu.northwestern.at.morphadorner.corpuslinguistics.languagerecognizer (MorphAdorner)

Interface Summary
Interface	Description
LanguageRecognizer	Interface for a Language Recognizer.

Class Summary
Class	Description
AbstractLanguageRecognizer	Abstract Language Recognizer.
CybozuLabsLanguageRecognizer	Cybozu Labs Language Recognizer.
DefaultLanguageRecognizer	DefaultLanguageRecognizer determines the language in which a text is written.
LanguageRecognizerFactory	LanguageRecognizer factory.

Package edu.northwestern.at.morphadorner.corpuslinguistics.languagerecognizer Description

Language recognizer.

Literary texts are generally composed in one principal language, possibly with short passages (letters, quotations) in other languages. It is helpful to categorize texts by principal language and by the most prominent secondary language, if any. MorphAdorner includes a simple statistical method, based on character n-grams and rank-order statistics, that determines the principal language of a text and lists possible secondary languages. The method is described in the paper "N-Gram-Based Text Categorization" by William B. Cavnar and John M. Trenkle, which appeared in the Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. MorphAdorner's implementation follows one written by Nakatani Shuyo.
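
The rank-order idea from the Cavnar and Trenkle paper can be illustrated with a minimal, self-contained sketch. This is not MorphAdorner's code and does not use the LanguageRecognizer interface: the class name NGramRankSketch, its methods, the tiny training sentences, and the constants (n-grams of length 1 to 5, profiles capped at 300 entries) are all assumptions chosen for illustration. Each text is reduced to a list of its most frequent character n-grams, and candidate languages are ordered by the "out-of-place" distance between the document's ranks and each language's ranks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of MorphAdorner.
class NGramRankSketch {
    static final int MAX_N = 5;          // longest n-gram considered (assumed)
    static final int PROFILE_SIZE = 300; // n-grams kept per profile (assumed)

    /** Builds a profile mapping each kept n-gram to its rank (0 = most frequent). */
    static Map<String, Integer> profile(String text, int maxN, int maxSize) {
        // Lowercase, then collapse every run of non-letters into one "_" boundary.
        String s = ("_" + text.toLowerCase(Locale.ROOT) + "_")
                .replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically so ranks are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int r = 0; r < entries.size() && r < maxSize; r++) {
            ranks.put(entries.get(r).getKey(), r);
        }
        return ranks;
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for misses. */
    static long distance(Map<String, Integer> doc, Map<String, Integer> lang, int penalty) {
        long total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer langRank = lang.get(e.getKey());
            total += (langRank == null) ? penalty : Math.abs(langRank - e.getValue());
        }
        return total;
    }

    /** Returns language names ordered from closest to farthest profile. */
    static List<String> rankLanguages(String text, Map<String, String> samples) {
        Map<String, Integer> doc = profile(text, MAX_N, PROFILE_SIZE);
        Map<String, Long> distances = new HashMap<>();
        for (Map.Entry<String, String> e : samples.entrySet()) {
            Map<String, Integer> lang = profile(e.getValue(), MAX_N, PROFILE_SIZE);
            distances.put(e.getKey(), distance(doc, lang, PROFILE_SIZE));
        }
        List<String> languages = new ArrayList<>(samples.keySet());
        languages.sort((a, b) -> Long.compare(distances.get(a), distances.get(b)));
        return languages;
    }

    public static void main(String[] args) {
        // Toy training data; real profiles would be built from much larger corpora.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("english", "the cat and the dog are in the house with the children");
        samples.put("french", "le chat et le chien sont dans la maison avec les enfants");
        List<String> ranked = rankLanguages("the children are in the house", samples);
        System.out.println("Closest profile first: " + ranked);
    }
}
```

In this sketch the first entry of the returned list plays the role of the principal language, and the remaining entries, in order of increasing distance, are the candidates for secondary languages. With training samples this small the ordering is only suggestive.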