MorphAdorner: Finding Languages of a TEI Encoded Text

FindTeiTextLanguage determines the language(s) in which a TEI text is written.

Usage:

findteitextlanguage output.tab input1.xml input2.xml ...

where

output.tab -- output tab-separated values file described below.
input*.xml -- input TEI XML files whose language is to be found.
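
For batch runs, the command line above can be assembled programmatically. The following is a minimal Python sketch, not part of MorphAdorner itself. It assumes that a findteitextlanguage launcher is available on the PATH. The helper names build_command and run_on_directory, and the default file pattern, are hypothetical.

```python
import glob
import subprocess


def build_command(output_path, input_paths, program="findteitextlanguage"):
    """Return the argument list: program, output file, then the input files."""
    if not input_paths:
        raise ValueError("at least one input TEI XML file is required")
    return [program, output_path, *input_paths]


def run_on_directory(output_path, pattern="texts/*.xml"):
    """Run the tool on every file matching pattern (a hypothetical location)."""
    inputs = sorted(glob.glob(pattern))
    # Assumption: the launcher is on the PATH and exits non-zero on failure.
    subprocess.run(build_command(output_path, inputs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.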

The output file is a tab-delimited UTF-8 text file containing the following fields, in order:

1. The original XML file name.
2. The length of the plain text from the TEI file, ignoring XML markup, in characters.
3. The most likely language.
4. The language recognizer score for the most likely language.
5. The second most likely language.
6. The language recognizer score for the second most likely language.
7. The third most likely language.
8. The language recognizer score for the third most likely language.

For texts with fewer than three recognizable languages, the missing language names are left blank and their scores are set to zero.
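
The field layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the file has no header row, the fields appear exactly in the order listed, and a blank language name marks an unused slot. The names Guess, LanguageRecord, parse_row and read_language_file are illustrative and are not part of MorphAdorner.

```python
import csv
from typing import NamedTuple


class Guess(NamedTuple):
    language: str
    score: float


class LanguageRecord(NamedTuple):
    file_name: str
    text_length: int
    guesses: list  # up to three Guess items, most likely first


def parse_row(row):
    """Convert one row of fields into a LanguageRecord."""
    # Pad short rows so that missing trailing fields read as blank.
    padded = list(row) + [""] * (8 - len(row))
    guesses = []
    for i in (2, 4, 6):  # positions of the three language names
        name = padded[i].strip()
        if name:  # a blank name means no language was recognized here
            guesses.append(Guess(name, float(padded[i + 1].strip() or "0")))
    return LanguageRecord(padded[0], int(padded[1]), guesses)


def read_language_file(path):
    """Read the whole tab-delimited output file (assumed to have no header)."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [parse_row(row) for row in reader if row]
```

Dropping the blank slots means that the length of guesses gives the number of languages recognized for each text.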

Language recognizer scores range from 0.0 (not a match) to 1.0 (perfect match). A non-negligible score for the second or third language indicates a potential problem for processing, unless the words in that secondary language are marked up in the TEI document.
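
One way to act on these scores is to list the texts whose second or third language scores above a cutoff. The Python sketch below assumes the eight-field row layout described earlier. The cutoff of 0.1 is an arbitrary illustration, since the documentation does not say what counts as non-negligible, and the function name secondary_languages is hypothetical.

```python
def secondary_languages(row, threshold=0.1):
    """Return (language, score) pairs for the second and third languages
    whose score is at least threshold (an assumed cutoff)."""
    flagged = []
    for i in (4, 6):  # positions of the second and third language names
        if i + 1 < len(row):
            name = row[i].strip()
            score = float(row[i + 1].strip() or "0")
            if name and score >= threshold:
                flagged.append((name, score))
    return flagged
```

A text for which this function returns a non-empty list may be worth checking for passages in a secondary language that lack language markup.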
