Package edu.northwestern.at.morphadorner.corpuslinguistics.languagerecognizer

Language recognizer.

Literary texts are generally composed in one principal language, possibly with short passages (letters, quotations) in other languages. It is helpful to categorize texts by principal language and by most prominent secondary language, if any. MorphAdorner includes a simple statistical method, based on character n-grams and rank-order statistics, that determines the principal language of a text and lists possible secondary languages. The method is described in the paper "N-Gram-Based Text Categorization" by William B. Cavnar and John M. Trenkle, which appeared in the Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. MorphAdorner's implementation follows one written by Nakatani Shuyo.
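The rank-order idea can be illustrated with a minimal sketch of the "out-of-place" measure from the Cavnar and Trenkle paper. This is not MorphAdorner's code and does not use its API: the class name NGramRankSketch, its method names, the tiny training samples, and the parameter choices (n-grams of length 1 to 3, top 300 retained) are all assumptions made for illustration, and the tokenization is simplified.

```java
import java.util.*;

// Hypothetical illustration only; not part of MorphAdorner.
class NGramRankSketch {

    /** Returns the topK most frequent character n-grams (n = 1..maxN), most frequent first. */
    static List<String> profile(String text, int maxN, int topK) {
        // Simplification (assumption): lowercase, collapse non-letter runs to "_".
        String s = text.toLowerCase().replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Descending frequency; ties broken alphabetically so ranks are deterministic.
        grams.sort((a, b) -> {
            int c = Integer.compare(counts.get(b), counts.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(topK, grams.size())));
    }

    /** Sums rank differences; an n-gram missing from the language profile costs lang.size(). */
    static int outOfPlace(List<String> doc, List<String> lang) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < lang.size(); i++) {
            rank.put(lang.get(i), i);
        }
        int distance = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            distance += (r == null) ? lang.size() : Math.abs(r - i);
        }
        return distance;
    }

    /** Orders language names by increasing distance: principal candidate first, then secondary ones. */
    static List<String> rankLanguages(String text, Map<String, String> samples) {
        List<String> doc = profile(text, 3, 300);
        Map<String, Integer> dist = new HashMap<>();
        for (Map.Entry<String, String> e : samples.entrySet()) {
            dist.put(e.getKey(), outOfPlace(doc, profile(e.getValue(), 3, 300)));
        }
        List<String> names = new ArrayList<>(dist.keySet());
        names.sort((a, b) -> {
            int c = Integer.compare(dist.get(a), dist.get(b));
            return c != 0 ? c : a.compareTo(b);
        });
        return names;
    }

    public static void main(String[] args) {
        // Toy training samples (assumption); real profiles need far more text.
        Map<String, String> samples = new HashMap<>();
        samples.put("english", "the quick brown fox jumps over the lazy dog and then runs to the river");
        samples.put("french", "le renard brun saute par dessus le chien paresseux et court vers la riviere");
        System.out.println(rankLanguages("the dog runs over to the fox", samples));
    }
}
```

The first name in the returned list plays the role of the principal language, and the remaining names, in order of increasing distance, are the candidates for secondary languages. Profiles built from such short samples are unreliable; the sketch only shows the mechanics of the comparison.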