Abbott is a framework for converting texts encoded in
disparate versions of TEI into a
common format called TEI Analytics.
Abbott was developed by Brian Pytlik Zillig and
Stephen Ramsay at the University of Nebraska.
An adorned corpus is a corpus in which the words in each work in the corpus have been adorned with morphological information such as part of speech and lemma.
Adornment is the process of adding information such as
morphological information to texts. We use the term "adornment" in
preference to terms such as "annotation" or "tagging" which carry too
many alternative and confusing meanings. Adornment harkens back to the
medieval sense of manuscript adornment or illumination performed by
monks - attaching pictures and marginal comments to texts.
An affix is a prefix or suffix which can be added to a morpheme or word to modify its meaning.
An attribute in machine learning
terms is a property of an object which may be used to determine its
classification. For example, one attribute of a literary work is its
genre: play, novel, short story, etc.
Bayes's rule defines the conditional probability for
two events A and B as follows:
Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)
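A quick numeric check of the rule, using made-up probabilities for a part of speech example (all values here are hypothetical):

```python
# Hypothetical probabilities:
#   A = "word is a noun", B = "word follows 'the'"
pr_a = 0.3          # Pr(A)
pr_b = 0.2          # Pr(B)
pr_b_given_a = 0.5  # Pr(B | A)

# Bayes's rule: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)
pr_a_given_b = pr_b_given_a * pr_a / pr_b
print(pr_a_given_b)  # 0.75
```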
A bigram is an ordered sequence of two adjacent words, characters, or morphological adornments.
A bound morpheme is a prefix or suffix which is not a word by itself but which can be attached to a free morpheme
to modify its meaning. For example, the bound morpheme "un" may be
attached to the free morpheme "known" to form the new word "unknown".
A chunk or work part is a part of a work residing in a corpus.
A chunk consists of an ordered series of words and associated
morphological information with a label. A chunk may be treated as a bag
of words or ngrams for data analysis and navigation.
Words which appear near each other in a text more frequently than we would expect by chance are called collocates. Collocates may be ngrams, but may also consist of multiple words with gaps between one or more of the words.
A corpus is a collection of natural language texts. The plural is corpora. Each individual text in a corpus is called a work.
Data herding is the process of acquiring, combining, editing, normalizing, and warehousing texts so they can be used for further analysis.
Document Coordinate System
A document coordinate system assigns a numeric vector of
coordinate values to the position of each token in a document. A
typical coordinate value might consist of a pair of line and column
values based upon the printed form of the text, or a character offset
and length pair based upon the digitized text.
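A minimal sketch of the second kind of coordinate, assigning a (character offset, length) pair to each token of a digitized text by scanning left to right (the function name is illustrative, not part of MorphAdorner):

```python
def char_coordinates(text, tokens):
    """Assign a (character offset, length) coordinate pair to each
    token, scanning left to right through the digitized text."""
    coords = []
    pos = 0
    for tok in tokens:
        offset = text.index(tok, pos)  # next occurrence of the token
        coords.append((offset, len(tok)))
        pos = offset + len(tok)
    return coords

text = "Call me Ishmael."
print(char_coordinates(text, ["Call", "me", "Ishmael", "."]))
# [(0, 4), (5, 2), (8, 7), (15, 1)]
```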
A free morpheme is the basic or root form of a word. Bound morphemes can be attached to modify the meaning.
A hard tag is an SGML, HTML, or XML tag which starts a new
text segment but does not interrupt the reading sequence of a text.
Examples of hard tags include <div> and <p>.
Hidden Markov Model
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters. The problem is to find the unknown parameters using values of the observable model parameters.
Abbreviation for hidden Markov model.
A jump tag is an SGML, HTML, or XML tag which interrupts
the reading sequence of a text and starts a new text segment. Examples
of jump tags include <note> and <speaker>.
Keyword extraction identifies "interesting" phrases which characterize a text.
Language recognition attempts to determine the language(s)
in which a text is written. Literary texts are generally composed in
one principal language with possible inclusions of short passages
(letters, quotations) from other languages. It is helpful to categorize
texts by principal language and most prominent secondary language, if
any. We can use statistical methods based upon character ngrams and
rank order statistics to determine the principal language of a text and
list possible secondary languages.
The lemma form or lexical root of an inflected spelling is
the base form or head word form you would find in a dictionary. A lemma
can also refer to the set of lexemes with the same lexical root, the same major word class, and the same word-sense.
Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or head word form you would find in a dictionary.
A lexeme is the combination of the lemma form of a spelling along with its word class (noun, verb, etc.).
A lexicon is a collection of words and their associated morphological information as used in a corpus.
Machine learning occurs when a computer program modifies
itself or "learns" so that subsequent executions with the same input
result in a different and hopefully more accurate output. Machine
learning methods may be supervised, i.e., using training data, or
unsupervised, without using training data.
A Markov process is a discrete state random process in
which the conditional probability distribution of the future states of
the process depends only upon the present state and not on any past states.
MorphAdorner is a suite of Java programs which performs
morphological adornment of words in a text. A high-level description of
MorphAdorner's capabilities appears on the
MorphAdorner home page.
A morpheme is a minimal grammatical unit of a language. A
morpheme consists of a word or meaningful part of a word that cannot be
divided into smaller independent grammatical units.
A multiword unit is a special type of collocate
in which the component words comprise a meaningful phrase.
A named entity is a multiword unit consisting of a type of name such as a personal name, corporate name, place name, or date.
An ngram is an ordered sequence of n adjacent words, characters, or morphological adornments.
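Extracting ngrams from a token sequence is a simple sliding-window operation; the sketch below shows the bigram case (n = 2):

```python
def ngrams(tokens, n):
    """Return the ordered sequence of ngrams (as tuples) over a
    list of tokens, using a sliding window of width n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["to", "be", "or", "not", "to", "be"]
print(ngrams(words, 2))  # bigrams
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```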
NUPOS is a part of speech
tag set devised by Martin Mueller to allow part of speech tagging of
English texts from all periods as well as texts in classical languages.
Further information about NUPOS appears in
Morphology and NUPOS.
Part of Speech
The part of speech is the role a word performs in a
sentence. A simple list of the parts of speech for English includes
adjective, adverb, conjunction, noun, preposition, pronoun, and verb.
For computational purposes, however, each of these major word classes
is usually subdivided to reflect more granular syntactic and
morphological behavior.
Part of Speech Tagging
Part of speech tagging adorns or "tags" words in a
text with each word's corresponding part of speech.
Part of speech tagging relies both on the meaning of the word
and its positional relationship with adjacent words.
A phone is an acoustic pattern which speakers of a
particular natural language consider distinguishable and linguistically
important. Distinct phones in one language may be grouped together and
treated as the same sound in another language.
A phoneme is a group of phones considered to be the same sound by speakers of a specific natural language. One or more phonemes combine to form a morpheme.
A prefix consists of characters comprising one or more bound morphemes which can be added to the front of a word to modify its meaning.
Pronoun Coreference Resolution
Pronoun coreference resolution matches pronouns with the
nouns to which they refer. Some pronouns may not actually refer to a
specific noun. For example, in the sentence "It is not clear how to
proceed" the initial pronoun "It" does not refer to any specific noun.
A pseudo-bigram generalizes the computation of bigram statistical measures to ngrams longer than two words by splitting the original multiword units into two groups of words, each treated as a single "word".
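As an illustration, a trigram can be split into a one-word group and a two-word group, and any bigram association score then applied to the two halves. The sketch below uses pointwise mutual information as the bigram measure, with made-up counts:

```python
import math

def pseudo_bigram_pmi(ngram_count, left_count, right_count, total):
    """Treat the two halves of a longer ngram as single 'words' and
    compute a bigram association score for them (here, pointwise
    mutual information)."""
    p_ngram = ngram_count / total
    p_left = left_count / total
    p_right = right_count / total
    return math.log2(p_ngram / (p_left * p_right))

# Trigram "once upon a" split as ("once", "upon a"); counts are toy values.
score = pseudo_bigram_pmi(ngram_count=20, left_count=100,
                          right_count=50, total=100000)
print(round(score, 2))
```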
Sentence splitting assembles a tokenized text into
sentences. Recognizing sentence boundaries is a difficult task for a
computer and generally requires a combination of rules and statistical
methods.
A soft tag is an SGML, HTML, or XML tag which does not
interrupt the reading sequence of a text and does not start a new text
segment. Examples of soft tags include <hi> and <em>.
The spelling is the orthographic representation of a spoken word. Words may have more than one spelling, particularly in texts dating from earlier periods when spelling was not standardized.
Spelling standardization is the mapping of variant, often archaic, spellings to standard modern forms.
Stemming removes affixes from a spelling. The resulting stem is not necessarily a proper lexeme. Stemming offers a simpler alternative to lemmatization.
Stemming can be useful in information retrieval applications, but is
much less useful in literary applications. Popular stemmers include
Martin Porter's stemmer and the Lancaster (Paice-Husk) stemmer.
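A deliberately naive suffix-stripping stemmer (not Porter's or the Lancaster algorithm) shows both the idea and its limitation: the resulting stem need not be a proper lexeme.

```python
def toy_stem(word, suffixes=("edly", "ing", "ed", "ly", "es", "s")):
    """Strip the longest matching suffix, keeping a stem of at
    least three characters.  A toy illustration only."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem("walking"))  # walk
print(toy_stem("flies"))    # fli  -- not a proper lexeme
```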
A suffix consists of characters comprising one or more bound morphemes which can be added to the end of a word to modify its meaning.
Supervised learning is a machine learning
technique which predicts the value of a given function for any valid
input after having been presented with training examples (i.e. pairs of
input and correct output).
See adorned corpus.
Abbreviation for Text Encoding Initiative.
TEI Analytics is a literary DTD jointly developed by
Martin Mueller at Northwestern University and Brian Pytlik Zillig and
Stephen Ramsay at the University of Nebraska. TEI Analytics is the default
XML input format assumed by MorphAdorner. TEI Analytics is a
minor modification of the P5 TEI-Lite schema,
with additional elements from the Linguistic Segment Categories to
support morphosyntactic annotation and lemmatization.
Text Encoding Initiative
The Text Encoding Initiative (TEI) Guidelines "are an
international and interdisciplinary standard that enables libraries,
museums, publishers, and individual scholars to represent a variety of
literary and linguistic texts for online research, teaching, and
preservation." More information may be found at the
official Text Encoding Initiative site.
A trigram is an ordered sequence of three adjacent
words, characters, or morphological adornments.
Unsupervised learning is a machine learning method which fits a model to observed data without benefit of training data.
The Viterbi algorithm allows a space containing
an apparently exponential number of candidate paths to be searched in polynomial
time. The Viterbi algorithm is frequently used in statistical part of speech tagging applications based on hidden Markov models to reduce the time complexity of searches for the best tags for a sequence of spellings in a sentence.
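A minimal sketch of the algorithm for a two-tag model. All probabilities here are made up; a production tagger would work in log space and handle unseen words.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Find the most probable state (tag) sequence for an observation
    (spelling) sequence in O(len(observations) * len(states)**2) time,
    instead of enumerating all len(states)**len(observations) paths."""
    # V[t][s] = (best probability of any path ending in state s, the path)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], [s])
          for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 V[-1][prev][1] + [s])
                for prev in states)
        V.append(row)
    return max(V[-1].values())[1]

# Hypothetical two-tag model: noun vs. verb.
states = ["noun", "verb"]
start_p = {"noun": 0.6, "verb": 0.4}
trans_p = {"noun": {"noun": 0.3, "verb": 0.7},
           "verb": {"noun": 0.8, "verb": 0.2}}
emit_p = {"noun": {"dogs": 0.7, "bark": 0.3},
          "verb": {"dogs": 0.1, "bark": 0.9}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))
# ['noun', 'verb']
```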
A word is the basic unit of a language. Words are composed of morphemes.
Word Sense Disambiguation
Word sense disambiguation is the process of distinguishing
different meanings of the same word in different textual contexts. For
example, a "bank" can be either a financial institution or a geographic
location next to a river.
Word tokenization splits a text into words, whitespace, and punctuation.
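A bare-bones regular-expression tokenizer illustrates the three token classes; a real tokenizer such as MorphAdorner's must also handle abbreviations, hyphenation, and archaic spellings:

```python
import re

# Words (with internal apostrophes), single punctuation marks, and
# runs of whitespace each become separate tokens.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]|\s+")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Don't stop now."))
# ["Don't", ' ', 'stop', ' ', 'now', '.']
```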
A work is a single text which is a member of a corpus. Each work consists of one or more text segments called work parts or chunks.