Literary texts are generally composed in one principal language with
possible inclusions of short passages (letters, quotations) from
other languages. It is helpful to categorize texts by principal
language and most prominent secondary language, if any. MorphAdorner
includes a simple statistical method based upon character n-grams and
rank-order statistics to determine the principal language of a text
and list possible secondary languages. The method is described in
a paper by William B. Cavnar and John M. Trenkle entitled
"N-Gram-Based Text Categorization", which appeared
in the Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis
and Information Retrieval. MorphAdorner's implementation follows one written
by Frank S. Nestel.
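The general idea of the Cavnar and Trenkle method can be illustrated with a minimal Python sketch. This is not MorphAdorner's code: the function names, the tiny training samples, the underscore padding, and the profile size of 300 are all assumptions made for illustration. The sketch builds a rank-ordered profile of character n-grams for each language, then compares a document's profile against each language profile using an "out-of-place" rank distance, where a lower distance means a closer match.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    ordered from most to least frequent."""
    counts = Counter()
    # Keep letters only; everything else becomes a token separator.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    for token in cleaned.split():
        # Pad with underscores so word boundaries contribute n-grams
        # (a simplification of the padding described in the paper).
        padded = "_" + token + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram missing
    from lang_profile costs the maximum penalty (the profile length)."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )

def rank_languages(text, lang_profiles):
    """Return (language, distance) pairs, closest language first."""
    doc_profile = ngram_profile(text)
    distances = {
        lang: out_of_place(doc_profile, profile)
        for lang, profile in lang_profiles.items()
    }
    return sorted(distances.items(), key=lambda kv: kv[1])

# Toy training samples (illustrative only; real profiles need far more text).
ENGLISH_SAMPLE = ("the quick brown fox jumps over the lazy dog and the cat "
                  "sat on the mat with the other animals")
LATIN_SAMPLE = ("gallia est omnis divisa in partes tres quarum unam incolunt "
                "belgae aliam aquitani tertiam qui ipsorum lingua celtae")

PROFILES = {
    "english": ngram_profile(ENGLISH_SAMPLE),
    "latin": ngram_profile(LATIN_SAMPLE),
}
```

Calling `rank_languages(some_text, PROFILES)` returns the candidate languages ordered by distance, so the first entry plays the role of the principal language and later entries are possible secondary languages. The sketch reports a raw distance where lower is better. The scores quoted in this document (such as ~0.7) appear to be on a different scale where higher is better, and the sketch does not attempt to reproduce that scale.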
During the Monk project, we used this language recognition mechanism
to help screen documents that were nominally English but in
fact contained large admixtures of unmarked foreign language text.
Some examples:
- EEBO document A36803 had an English introduction, a lot of Latin,
and a lot of English names. It had a low English score and
a non-negligible Latin score. We excluded this from the Monk
corpus of EEBO texts.
- EEBO document A57469 had an English title but was classified
as primarily French. It turned out to be a legal text with
a lot of French and Latin. We also excluded this from the
Monk corpus.
- EEBO document A34069 had a low English score (~0.7). It turned
out to be an account of a trading voyage containing a lot of
Dutch interaction.
From these and other experiences, we determined that the language recognizer's
scores offered a reliable way to identify texts that might contain
significant amounts of non-English text. The specific language
labels were not quite so reliable. For example, French and Latin --
particularly in older texts -- were difficult to distinguish from each
other, but both were clearly distinguishable from English. Likewise, Scots
often appeared as a second choice for English texts. The Scots score was
typically higher for older English texts that contain large amounts of
old-fashioned variant spellings. In more modern texts, a high Scots
score often pointed to novels containing swaths of Scots dialect.
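The screening approach described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the procedure actually used in the Monk project: the function name `screen_document`, the dictionary of per-language scores, and the 0.8 threshold are all assumptions, and the sketch assumes scores where higher means a better match. Because the specific language labels were less reliable than the scores, the sketch only flags a document for manual review and does not exclude it automatically.

```python
def screen_document(scores, expected="english", min_score=0.8):
    """Return a list of reasons to review a document by hand.

    scores    -- dict mapping language name to a score (higher = better match)
    expected  -- the language the document is nominally written in
    min_score -- hypothetical threshold below which the expected language
                 is considered suspiciously weak
    An empty list means nothing was flagged.
    """
    if not scores:
        return ["no language scores available"]
    reasons = []
    # Flag documents whose best-scoring language is not the expected one.
    top_language = max(scores, key=scores.get)
    if top_language != expected:
        reasons.append(f"principal language is {top_language}, not {expected}")
    # Flag documents whose expected-language score is low, whatever the label.
    expected_score = scores.get(expected, 0.0)
    if expected_score < min_score:
        reasons.append(
            f"{expected} score {expected_score:.2f} is below {min_score:.2f}"
        )
    return reasons
```

For example, with invented scores, `screen_document({"english": 0.70, "dutch": 0.40})` flags the low English score even though English is still the top-ranked language, which mirrors the situation where the score is more informative than the label.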
You can try MorphAdorner's
default language extractor online.
This extractor attempts to recognize only the following languages:
- dutch
- english
- french
- german
- italian
- latin
- scots
- spanish
- welsh