Package edu.northwestern.at.morphadorner.tools.findteitextlanguage

Determines the language(s) in which a TEI text is written.

Usage:

java edu.northwestern.at.morphadorner.tools.findteitextlanguage output.tab input1.xml input2.xml ...

output.tab -- output file of tab-separated values, described below.
input*.xml -- input TEI XML files whose language(s) are to be determined.

The output file is a tab-delimited UTF-8 text file. Each line describes one input file and contains the following fields, in order:

  1. The original XML file name.
  2. The length, in characters, of the plain text of the TEI file, ignoring XML markup.
  3. The most likely language.
  4. The language recognizer score for the most likely language.
  5. The second most likely language.
  6. The language recognizer score for the second most likely language.
  7. The third most likely language.
  8. The language recognizer score for the third most likely language.

For texts with fewer than three recognizable languages, the missing language names are left blank and their scores are set to zero.

Language recognizer scores range from 0.0 (not a match) to 1.0 (perfect match).
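
The eight-field layout above can be read back with a few lines of Java. The following is only a minimal sketch and is not part of MorphAdorner: the class name LanguageResultRow and its members are hypothetical, and it assumes that the text length is written as plain digits and that every line carries all eight fields.

```java
import java.util.ArrayList;
import java.util.List;

/** One line of the tab-separated output file (hypothetical helper class). */
final class LanguageResultRow {
    final String fileName;                            // field 1
    final int textLength;                             // field 2
    final List<String> languages = new ArrayList<>(); // fields 3, 5, 7
    final List<Double> scores = new ArrayList<>();    // fields 4, 6, 8

    private LanguageResultRow(String fileName, int textLength) {
        this.fileName = fileName;
        this.textLength = textLength;
    }

    /** Parses a single output line into a row object. */
    static LanguageResultRow parse(String line) {
        // A limit of -1 keeps trailing empty fields, which matter because
        // missing language names are written as blanks.
        String[] fields = line.split("\t", -1);
        if (fields.length < 8) {
            throw new IllegalArgumentException(
                "Expected 8 tab-separated fields, found " + fields.length);
        }
        // Assumption: the length field contains plain digits only.
        LanguageResultRow row =
            new LanguageResultRow(fields[0], Integer.parseInt(fields[1].trim()));
        // Fields 3 to 8 are (language, score) pairs, most likely first.
        for (int i = 2; i + 1 < 8; i += 2) {
            String language = fields[i].trim();
            if (language.isEmpty()) {
                continue; // blank name: no language recognized for this slot
            }
            row.languages.add(language);
            row.scores.add(Double.parseDouble(fields[i + 1].trim()));
        }
        return row;
    }
}
```

A caller could read the file with java.nio.file.Files.readAllLines(path, StandardCharsets.UTF_8) and pass each line to LanguageResultRow.parse. Slots with a blank language name are skipped, so the languages list holds between zero and three entries.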