Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



  Finding Languages of a TEI Encoded Text
 
 

FindTeiTextLanguage determines the language(s) in which a TEI text is written.

Usage:

findteitextlanguage output.tab input1.xml input2.xml ...

where

  • output.tab -- output tab-separated values file described below.
  • input*.xml -- input TEI XML files whose language is to be found.

The output file is a tab-delimited UTF-8 text file. Each line describes one input file and contains the following fields, in order:

  1. The original XML file name.
  2. The length of the plain text from the TEI file, ignoring XML markup, in characters.
  3. The most likely language.
  4. The language recognizer score for the most likely language.
  5. The second most likely language.
  6. The language recognizer score for the second most likely language.
  7. The third most likely language.
  8. The language recognizer score for the third most likely language.

For a text with fewer than three recognizable languages, each missing language name is left blank and its score is set to zero.
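
As an illustration of the layout above, the following minimal Python sketch reads such a file into structured records. It is not part of MorphAdorner. The names `LanguageGuess`, `TextLanguages`, `parse_row`, and `read_language_table` are hypothetical, and the sketch assumes that the file has no header row, that scores are written as plain decimal numbers, and that fields contain no quoted tabs. The sample row is invented for demonstration only, including the form of the language names.

```python
import csv
import io
from typing import List, NamedTuple


class LanguageGuess(NamedTuple):
    language: str
    score: float


class TextLanguages(NamedTuple):
    file_name: str
    text_length: int              # plain-text length in characters
    guesses: List[LanguageGuess]  # up to three, most likely first


def parse_row(fields: List[str]) -> TextLanguages:
    """Turn one eight-field output row into a TextLanguages record."""
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    guesses = []
    # Language names sit at indexes 2, 4, 6; each score follows its name.
    for i in (2, 4, 6):
        name = fields[i].strip()
        if name:  # a blank name means no language was recognized in this slot
            guesses.append(LanguageGuess(name, float(fields[i + 1])))
    return TextLanguages(fields[0], int(fields[1]), guesses)


def read_language_table(stream) -> List[TextLanguages]:
    """Read every non-empty row from an open tab-separated text stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [parse_row(row) for row in reader if row]


# Invented sample row: two recognized languages, third slot blank with score 0.
sample = "example.xml\t180000\tenglish\t0.98\tlatin\t0.02\t\t0\n"
for record in read_language_table(io.StringIO(sample)):
    print(record.file_name, record.text_length, record.guesses)
```

To read a real output file, open it with `open("output.tab", encoding="utf-8", newline="")` and pass the file object to `read_language_table`.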

Language recognizer scores range from 0.0 (not a match) to 1.0 (perfect match). A non-negligible score for the second or third language suggests that the document mixes languages. Such documents may cause problems in later processing unless the words in the secondary language are marked up in the TEI document.
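
One way to act on this is to flag documents whose second or third language scores above some cutoff. The sketch below is illustrative only. The function name `secondary_languages` is hypothetical, and the default threshold of 0.1 is an arbitrary assumption, since the documentation does not say what counts as a non-negligible score. The example values are made up.

```python
from typing import List, Tuple


def secondary_languages(
    guesses: List[Tuple[str, float]],
    threshold: float = 0.1,  # assumed cutoff, not defined by MorphAdorner
) -> List[Tuple[str, float]]:
    """Return the second and third guesses whose score reaches the threshold.

    `guesses` holds (language, score) pairs ordered from most to least likely,
    as in the output file. Blank language names are ignored.
    """
    return [
        (language, score)
        for language, score in guesses[1:]
        if language and score >= threshold
    ]


# Made-up scores: a mostly English text with a sizeable Latin component.
flagged = secondary_languages([("english", 0.7), ("latin", 0.25), ("", 0.0)])
if flagged:
    print("check secondary-language markup for:", flagged)
```

A document that returns a non-empty list is a candidate for manual review of its TEI language markup. The cutoff should be tuned against texts whose language mix is already known.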

 

Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Wed Mar 25 15:00:22 2009. © 2007, 2008 Northwestern University