|
FindTeiTextLanguage
determines the language(s) in which a TEI text is written.
Usage:
findteitextlanguage output.tab input1.xml input2.xml ...
where
- output.tab -- tab-separated values output file, described below.
- input*.xml -- input TEI XML files whose language is to be found.
The output file is a tab-delimited UTF-8 text file containing the
following fields, in order:
- The original XML file name.
- The length of the plain text from the TEI file, ignoring XML markup,
in characters.
- The most likely language.
- The language recognizer score for the most likely language.
- The second most likely language.
- The language recognizer score for the second most likely language.
- The third most likely language.
- The language recognizer score for the third most likely language.
For texts with fewer than three recognizable languages, the unused
language name fields are left blank and their scores are set to zero.
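As an illustration of the field layout above, here is a minimal Python
sketch that reads the output file into records. It is not part of
FindTeiTextLanguage. The names LanguageRow and parse_rows are hypothetical,
and the sketch assumes that scores are written as plain decimal numbers.
Rows that do not have eight fields or whose numeric fields do not parse
(for example a header line, if one is present) are skipped.

```python
import csv
from typing import List, NamedTuple, Tuple


class LanguageRow(NamedTuple):
    """One row of the tab-separated output (hypothetical record type)."""
    file_name: str
    text_length: int
    languages: List[Tuple[str, float]]  # up to three (name, score) pairs


def parse_rows(lines):
    """Parse an iterable of tab-separated lines into LanguageRow records."""
    rows = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 8:
            continue  # not a complete eight-field row
        try:
            length = int(fields[1])
            pairs = [(fields[i], float(fields[i + 1])) for i in (2, 4, 6)]
        except ValueError:
            continue  # non-numeric length or score; assumed not a data row
        # Blank language names mark unused slots, so drop those pairs.
        languages = [(name, score) for name, score in pairs if name.strip()]
        rows.append(LanguageRow(fields[0], length, languages))
    return rows


# Typical use, assuming the tool wrote its results to output.tab:
#     with open("output.tab", encoding="utf-8", newline="") as f:
#         rows = parse_rows(f)
```

Dropping the blank slots means len(row.languages) gives the number of
languages actually recognized for each text.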
Language recognizer scores range from 0.0 (not a match) to 1.0 (perfect
match). A non-negligible score for the second or third language
indicates potential problems for later processing unless the words in
the secondary language are marked up in the TEI document.
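The check described here can be sketched in Python as a filter over the
output file. This is an illustration only. The function name
flag_mixed_language is hypothetical, and the 0.1 threshold is an assumed
stand-in for "non-negligible", which this document does not define; adjust
it to suit your corpus.

```python
def flag_mixed_language(lines, threshold=0.1):
    """Return file names whose second or third language score is at or
    above the threshold (an assumed cutoff, not defined by the tool)."""
    flagged = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete eight-field row
        try:
            second = float(fields[5])  # score of second most likely language
            third = float(fields[7])   # score of third most likely language
        except ValueError:
            continue  # non-numeric score; assumed not a data row
        if max(second, third) >= threshold:
            flagged.append(fields[0])
    return flagged


# Typical use, assuming the tool wrote its results to output.tab:
#     with open("output.tab", encoding="utf-8") as f:
#         for name in flag_mixed_language(f):
#             print(name)
```

The files it returns are candidates for checking whether passages in the
secondary language are marked up in the TEI source.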
|