Northwestern University Information Technology
This section details Martin Mueller's "NUPOS" part of speech tagset and makes explicit the structure of the tagset and other related morphology objects such as "spellings", "word classes", "lemmata", and "word parts".
As a convention, in this discussion, when we use the term "word", it means "a specific single occurrence of a word somewhere in a text." For the concept of a "word in general", we will use the terms "headword" and "lemma", which we'll define and discuss in detail later.
The full version of NUPOS can handle both Greek and English texts and part of speech tagging. Here we only describe the subset of NUPOS that deals with English. For more information, see Martin Mueller's fuller description.
The first and most basic attribute of a word is its spelling. This may seem to be a simple concept, but especially for earlier texts from periods before spelling became regularized, it is useful to distinguish among several different meanings of the term "spelling". In NUPOS there are three different "spellings" for each word:
1. The "token spelling". This is the spelling of the word exactly as it appears in the original digital source for the text, including all capitalization and any typographical conventions that might be used in the source as markup for various purposes. For example, the original source for a text might contain a word token "common|lie", where the encoders used the vertical bar character "|" to mark up a soft hyphen at the end of a line. As another example, in some early printed texts, a "y" with a superscript "t" was used to represent the word "that". Such a word might be marked up as "y^t" in the source for such a text. As a final example, the token "@abper;fecit" might appear in the source for an early text. In this example "&abper;" is a symbol used in early typesetting as an abbreviation for "per" or "par".
The token spelling retains as much fidelity as possible with the original digital source. It will often contain various kinds of non-uniform markup, as used by the organizations that digitally encoded the texts. It may be of interest to some researchers, but most people will be more interested in the other two kinds of spellings.
The token spelling may be of importance in contexts where an application wishes to reproduce as much visual fidelity as possible with original printed texts when displaying the text to users.
2. The "standard original spelling". This is a version of the spelling with the typographical conventions normalized, and in most contexts is probably what one thinks of when one uses the general term "the spelling of the word". It is usually identical with the token spelling, but not always. In the examples above, the three tokens become the following "standard original spellings":
common|lie --> commonlie
y^t --> that
@abper;fecit --> perfecit
3. The "standard modern spelling". This is the standard modern orthographic form of the original spelling. But the morphological form is not modernized. Thus a spelling like "lovyth" is regularized to "loveth". "loveth" is not, however, regularized to "loves", but is rather recognized as a standard archaic form. In the three examples above, the standard modern spellings are as follows:
common|lie --> commonlie --> commonly
y^t --> that --> that
@abper;fecit --> perfecit --> perfecit
Note that "perfecit" is a Latin word, and at no point is there an attempt made to translate foreign words into English.
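The three spellings can be pictured as three fields on a single record. The following is a minimal sketch in Python, not MorphAdorner's own data model; the class name `Spellings`, its field names, and the helper `is_regularized` are hypothetical, and the data are the three examples given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spellings:
    """The three NUPOS spellings of one word occurrence (hypothetical record)."""
    token: str              # exactly as it appears in the digital source
    standard_original: str  # typographical conventions normalized
    standard_modern: str    # modern orthography, morphology left unchanged


# The three examples from the text above.
EXAMPLES = [
    Spellings("common|lie", "commonlie", "commonly"),
    Spellings("y^t", "that", "that"),
    Spellings("@abper;fecit", "perfecit", "perfecit"),
]


def is_regularized(s: Spellings) -> bool:
    """True when modernization changed the standard original spelling."""
    return s.standard_original != s.standard_modern


# Tokens whose standard modern spelling differs from the standard original.
changed = [s.token for s in EXAMPLES if is_regularized(s)]
print(changed)
```

Only "common|lie" is regularized here; the Latin word "perfecit" passes through unchanged, as the note above describes.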
For modern texts, the three spellings are nearly always identical. The main exceptions will be for words in XML texts split by decorator (soft) tags.
Words have spellings, as outlined above. We also want to enumerate and discuss in detail their other tagging attributes, such as word class, part of speech, and lemma. Before we can do this, however, we need to discuss a pesky complexity of texts - contractions.
Consider as an example the first word of Hamlet, "Who's". This is a single lexical word, and in this example all three spellings of the word are the same string "Who's".
In terms of the other attributes, however, this word is properly considered to be a lexical representation of the two separate words "who" and "is". Each part has its own word class, part of speech and lemma. In this particular example, it might also be possible to think of each part as having its own spelling or "sub-spelling", "who" and "'s", but in the general case it might be difficult to reasonably split up a spelling into its pieces, and the current version of NUPOS does not attempt to do this.
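In NUPOS, this word "who's" is tagged as two parts. The first part has the lemma "who (crq)" and the word class "crq". The second part has the lemma "be (va)", the major word class "verb", the word class "va", and the part of speech "vaz".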
While we might wish that this complexity didn't exist or could be safely ignored, it can be important when analyzing texts. For example, consider the set of all words in Shakespeare which are instances of the auxiliary verb "be". In NUPOS, the first word of Hamlet is correctly included as a member of this set. It is also a member of the set of all words in Shakespeare which are instances of the wh-word "who".
As another example, consider the general notion of counting different kinds of words in Shakespeare. In NUPOS, the count of the total number of occurrences of the auxiliary verb "be" includes the first word of Hamlet, as it should, as does the count of the total number of occurrences of the wh-word "who". The first word of Hamlet is counted twice, once as "be" and once as "who". Consequently, the sum of the counts of the number of different kinds of words in Hamlet is equal to the number of word parts in Hamlet, not the number of words.
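The counting rule can be made concrete with a small sketch. This is an illustration only, assuming hypothetical `Word` and `WordPart` records; the toy word list is not a real passage, and it uses only taggings that appear in this note.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WordPart:
    headword: str
    word_class: str

    @property
    def lemma(self) -> str:
        # NUPOS names a lemma as "headword (word class)".
        return f"{self.headword} ({self.word_class})"


@dataclass(frozen=True)
class Word:
    spelling: str
    parts: Tuple[WordPart, ...]  # ordered; usually one, two for contractions


# Toy data, not a real passage.
words = [
    Word("Who's", (WordPart("who", "crq"), WordPart("be", "va"))),
    Word("shouldst", (WordPart("shall", "vm"),)),
    Word("loves", (WordPart("love", "v"),)),
    Word("long", (WordPart("long", "j"),)),
]

# Count lemmata over word parts, so a contraction contributes once per part.
lemma_counts = Counter(part.lemma for word in words for part in word.parts)

total_parts = sum(lemma_counts.values())
print(len(words), total_parts)
```

The sum of the lemma counts equals the number of word parts (5), not the number of words (4), because "Who's" is counted once as "who (crq)" and once as "be (va)".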
As a final example, consider an analysis of bigrams in Shakespeare. In NUPOS, the first word of Hamlet is considered to be an instance of the bigram "the lemma who (crq) followed by the lemma be (va)", as well as an instance of the bigram "word class crq followed by part of speech vaz".
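One way to obtain such bigrams is to flatten the ordered parts of every word into a single sequence and then pair adjacent items. The sketch below assumes a hypothetical tuple representation and a toy two-word sequence; it is not how MorphAdorner itself computes n-grams.

```python
from typing import List, Sequence, Tuple

Part = Tuple[str, str]          # (headword, word class)
Word = Tuple[str, List[Part]]   # (spelling, ordered list of parts)

# Toy data: a contraction followed by a one-part word.
words: List[Word] = [
    ("Who's", [("who", "crq"), ("be", "va")]),
    ("loves", [("love", "v")]),
]


def flatten_parts(ws: Sequence[Word]) -> List[Part]:
    """Return all word parts in text order, ignoring word boundaries."""
    return [part for _, parts in ws for part in parts]


def bigrams(items: Sequence) -> List[Tuple]:
    """Pair each item with its successor."""
    return list(zip(items, items[1:]))


parts = flatten_parts(words)
lemma_bigrams = bigrams([f"{h} ({wc})" for h, wc in parts])
word_class_bigrams = bigrams([wc for _, wc in parts])
print(lemma_bigrams)
print(word_class_bigrams)
```

Because the parts of a word are ordered, the contraction yields the bigram "who (crq)" followed by "be (va)" inside a single word.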
In the general case, each word, while it usually only has one part, might have more than one part -- two parts in the case of most contractions, but at least conceivably perhaps even more than two parts. While it is words which possess spelling attributes, it is their parts which possess the other morphological attributes, and this is an important distinction to keep in mind.
In the normal case, when a word has only one part, we often use the simple term "word" to refer to its unique part. For example, we say "this word is a verb", when to be precise what we are really saying is "the one and only part of this word is a verb."
In NUPOS, each word part has a "major word class" and a "word class". These concepts provide the coarsest ways to categorize words.
There are 17 major word classes, which should be self-explanatory:
|Major word classes
Major word classes are subdivided into a slightly finer categorization by "word class". There are 34 word classes in NUPOS:
Each word class has a very short string which provides a name for the word class, and each word class belongs to one and only one of the major word classes.
For example, for the major word class "verb", there are three word classes "va" (auxiliary verb), "vm" (modal verb), and "v" (verb). So in NUPOS, there are three kinds of verbs.
NUPOS has a fine-grained part of speech tagset, much finer-grained than the word classes and major word classes. There are 241 total English parts of speech in the current version of NUPOS (not counting punctuation).
Each part of speech belongs to one and only one word class, so the part of speech tagset in NUPOS represents a subdivision of the word class tagset, in the same way that the word class tagset represents a subdivision of the major word class tagset.
To continue the example of verbs, in NUPOS each of the verb word classes contains a number of parts of speech:
word class va (auxiliary verb): 19 parts of speech
word class vm (modal verb): 14 parts of speech
word class v (verb): 27 parts of speech
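The three-level subdivision can be sketched as two lookup tables. The tables below are deliberately tiny and hypothetical in form: they contain only the handful of tags and classes named in this note, not the full NUPOS tagset.

```python
# Part of speech -> word class (only tags mentioned in this note).
WORD_CLASS_OF_POS = {
    "vmd2": "vm",
    "vvz": "v",
    "av-j": "j",
    "j": "j",
}

# Word class -> major word class.
MAJOR_CLASS_OF_WORD_CLASS = {
    "va": "verb",
    "vm": "verb",
    "v": "verb",
    "j": "adjective",
}

# Number of parts of speech in each verb word class, as listed above.
VERB_POS_COUNTS = {"va": 19, "vm": 14, "v": 27}


def major_word_class(pos: str) -> str:
    """Walk up the hierarchy: part of speech -> word class -> major word class."""
    return MAJOR_CLASS_OF_WORD_CLASS[WORD_CLASS_OF_POS[pos]]


print(major_word_class("vmd2"))
print(sum(VERB_POS_COUNTS.values()))
```

Since each part of speech belongs to exactly one word class, and each word class to exactly one major word class, plain dictionaries are enough to express the hierarchy. Summing the three verb word classes gives 60 verb parts of speech.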
Each part of speech, in addition to belonging to a word class, is also characterized by, and largely defined by, how it is used in various grammatical categories. These categories and their possible values should be mostly self-explanatory to those familiar with English grammar.
Syntax (used as): See below.
Tense: pres, past, or empty (not applicable).
Mood: ppl, inf, impt, or empty (not applicable).
Case: gen, obj, subj, or empty (not applicable).
Person: 1st, 2nd, 3rd, or empty (not applicable).
Number: sg, pl, or empty (not applicable).
Degree: comp, sup, or empty (not applicable).
Negative: no, nor, not, or empty (not applicable).
As an example, the NUPOS part of speech "vmd2" is used for modal verbs used in the second person singular past tense. It has the following attributes in addition to its name "vmd2":
word class = vm (modal verb)
syntax = vm
tense = past
mood = empty
case = empty
person = 2nd
number = sg
degree = empty
negative = empty
An example of this part of speech occurs in Act 5, Scene 1 of Hamlet, where Gertrude says "I hoped thou shouldst have been my Hamlet's wife;" In this passage, the word "shouldst" is tagged with the lemma "shall (vm)" and the part of speech "vmd2". By virtue of this tagging, we know all of the following facts about this word:
It is an instance of the headword "shall".
It is a verb.
It is a modal verb.
It has NUPOS part of speech "vmd2".
It is in the past tense.
It is in the second person.
It is singular.
In a full implementation of NUPOS, any of these attributes can be used as a criterion for searching, grouping, sorting, counting, and analysis. For example, a researcher might compare the use of past tense modal verbs by one author to their use by another, or search for all uses of second person singular verbs in the works of Chaucer. Or the researcher might find all of the verbs used in Spenser and generate a report that counts how many times each is used in the various possible combinations of person and number.
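Attribute-based searching of this kind can be sketched with a small record type and a filter. This is a hedged illustration, not a MorphAdorner API: the class `PartOfSpeech` and the function `select` are hypothetical, `None` stands for "empty (not applicable)", and the empty attributes shown for "vvz" are an assumption (this note states only that "vvz" is third person singular present).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PartOfSpeech:
    tag: str
    word_class: str
    tense: Optional[str] = None     # None means empty (not applicable)
    mood: Optional[str] = None
    case: Optional[str] = None
    person: Optional[str] = None
    number: Optional[str] = None
    degree: Optional[str] = None
    negative: Optional[str] = None


TAGSET = [
    # Attributes of "vmd2" as listed above.
    PartOfSpeech("vmd2", "vm", tense="past", person="2nd", number="sg"),
    # "vvz": third person singular present; other attributes assumed empty.
    PartOfSpeech("vvz", "v", tense="pres", person="3rd", number="sg"),
]


def select(tagset: List[PartOfSpeech], **criteria: Optional[str]) -> List[str]:
    """Return the tags whose attributes match every given criterion."""
    return [
        pos.tag
        for pos in tagset
        if all(getattr(pos, name) == value for name, value in criteria.items())
    ]


print(select(TAGSET, tense="past"))
print(select(TAGSET, person="2nd", number="sg"))
print(select(TAGSET, number="sg"))
```

A search for second person singular verbs then reduces to selecting the matching tags and looking up the words that carry them.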
The "syntax" attribute is used to specify how the part of speech is used. For example, the part of speech "av-j" is used for adjectives that are used as adverbs. The "syntax" attribute of this part of speech is "av". An example of this part of speech occurs in Act 1, Scene 1 of Hamlet, where Bernardo says "Long live the king!" The word "Long" in this passage is used as an adverb modifying the verb "live" and has the NUPOS part of speech "av-j". Contrast this with the word "long" in Act 3, Scene 1, where Hamlet says "That makes calamity of so long life;". In this passage, the word "long" is tagged with the part of speech "j", the part of speech for "normal" uses of adjectives. Both of the parts of speech "av-j" and "j" have the word class "j" and major word class "adjective", but "av-j" has the syntax attribute "av", while "j" has the syntax attribute "j".
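The difference between what a word is (word class) and how it is used (syntax) can be shown with the two tags just discussed. A minimal sketch, assuming a hypothetical dictionary layout for the tag attributes:

```python
# Attributes of the two tags discussed above.
POS_ATTRIBUTES = {
    "av-j": {"word_class": "j", "syntax": "av"},
    "j": {"word_class": "j", "syntax": "j"},
}

# The two Hamlet occurrences cited above, as (spelling, part of speech).
tagged = [("Long", "av-j"), ("long", "j")]

# Grouping by word class finds both occurrences of the adjective.
adjectives = [w for w, pos in tagged if POS_ATTRIBUTES[pos]["word_class"] == "j"]

# Grouping by syntax finds only the one used as an adverb.
used_as_adverb = [w for w, pos in tagged if POS_ATTRIBUTES[pos]["syntax"] == "av"]

print(adjectives)
print(used_as_adverb)
```

Both occurrences count as adjectives by word class, while only "Long" in "Long live the king!" is selected when filtering by syntax.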
Martin has also mentioned the possibility of more coarse-grained versions of NUPOS, finer grained than word classes but coarser than the full set of 220+ parts of speech. These intermediate levels of NUPOS may be useful for data mining and other kinds of analysis. We have not yet worked out the details of this idea.
Another distinctive feature of NUPOS is that it offers some ambiguous word classes, like 'jn' for words that hover between noun and adjective or 'an' for words that hover between noun and adverb (home, tomorrow).
All of the NUPOS parts of speech are displayed at the end of this appendix.
A lemma is a dictionary "headword" plus its word class.
For example, consider the verb "love" in Shakespeare. This lemma has the headword "love" and the word class "v". He uses this common lemma in 41 of his 42 works, a total of 1,135 times, in a variety of contexts with quite a few different parts of speech and spellings. For example, he uses it a total of 153 times with the part of speech "vvz", which is the NUPOS part of speech tag for verbs used in the third person singular in the present tense. 150 of these uses are spelled "loves", and three of them are spelled "loveth".
There is, of course, also a noun named "love". In NUPOS, there are two separate lemmata for the headword "love", one for the noun and one for the verb. In general, headwords like "love" are used to form NUPOS lemmata based on their word class, and the word class is listed along with the headword when naming the lemma. In our example, the NUPOS names for the two "love" lemmata are "love (n)" and "love (v)".
The set of all lemmata used in a work or collection of works is called the "lexicon" for the work or collection.
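Lemma naming and the lexicon can be sketched in a few lines. The function `lemma_name` and the toy occurrence list are hypothetical; the naming pattern "headword (word class)" follows the examples above.

```python
from typing import List, Tuple


def lemma_name(headword: str, word_class: str) -> str:
    """Name a lemma as the headword followed by its word class in parentheses."""
    return f"{headword} ({word_class})"


# Toy list of (headword, word class) pairs for the word parts of a work.
occurrences: List[Tuple[str, str]] = [
    ("love", "v"),
    ("love", "n"),
    ("love", "v"),
    ("shall", "vm"),
]

# The lexicon is the set of distinct lemmata; sorted here for stable display.
lexicon = sorted({lemma_name(h, wc) for h, wc in occurrences})
print(lexicon)
```

The headword "love" contributes two lemmata to the lexicon, one for the noun and one for the verb, however many times each occurs.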
MorphAdorner reads source XML texts, locates sentence and word boundaries, and marks each word with five morphological tags -- the three spellings, the NUPOS part of speech, and the lemma headword. For contractions, MorphAdorner emits multiple parts of speech and headwords.
It's important to recall that MorphAdorner is more than just a part of speech tagger. It's also a spelling normalizer and a lemma tagger.
The tagging data emitted by MorphAdorner is sufficient to recover all of the information mentioned above for each word and word part, including the major word class, word class, part of speech category values, and lemma (headword plus word class). Note that MorphAdorner only emits the lemma headword. The word class may be deduced from the part of speech.
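Recovering the full lemma from the emitted headword and part of speech can be sketched as a table lookup. The mapping below is a hypothetical fragment covering only two tags named in this note; a real implementation would need the complete part of speech table.

```python
# Part of speech -> word class, for two tags named in this note.
WORD_CLASS_OF_POS = {
    "vmd2": "vm",
    "vvz": "v",
}


def recover_lemma(headword: str, pos: str) -> str:
    """Rebuild "headword (word class)" from the headword and part of speech."""
    try:
        word_class = WORD_CLASS_OF_POS[pos]
    except KeyError:
        raise ValueError(f"unknown part of speech tag: {pos!r}") from None
    return f"{headword} ({word_class})"


print(recover_lemma("shall", "vmd2"))
print(recover_lemma("love", "vvz"))
```

This works because each part of speech belongs to one and only one word class, so the lookup is never ambiguous.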
Following the approach to contracted forms taken by NUPOS, MorphAdorner treats contracted forms as a single token for two reasons:
1. The orthographic practice reflects an underlying linguistic reality that the tokenization should respect.
2. In Early Modern English (as in Shaw's orthographic reforms) contracted forms appear without apostrophes, as in 'noot' for 'knows not' or 'niltow' for 'wilt thou not'. It's not obvious how to split these forms. The situation is even less clear for dialectal forms.
Contracted forms get two part of speech tags separated by a vertical bar. Forms like "don't", "cannot", and "ain't" are an exception: MorphAdorner analyzes them as the negative form of a verb and does not treat them as contractions. It uses the symbol 'x' to mark a negative part of speech tag.
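Splitting a contracted form's tags back into word parts can be sketched as follows. Two things here are assumptions rather than facts from this note: that the headwords of a contraction are also separated by a vertical bar, and the tag "q-crq" used for the first part of "who's" (only the second tag, "vaz", is given above). The function name `split_parts` is hypothetical.

```python
from typing import List, Tuple


def split_parts(pos_field: str, headword_field: str) -> List[Tuple[str, str]]:
    """Split bar-separated tags and headwords into (headword, tag) pairs.

    Assumes headwords use the same "|" separator as part of speech tags.
    """
    tags = pos_field.split("|")
    headwords = headword_field.split("|")
    if len(tags) != len(headwords):
        raise ValueError("mismatched number of tags and headwords")
    return list(zip(headwords, tags))


# A contraction: two parts. "q-crq" is an illustrative placeholder tag.
print(split_parts("q-crq|vaz", "who|be"))

# An ordinary word: one part.
print(split_parts("vmd2", "shall"))
```

A form that MorphAdorner analyzes as a negative verb rather than a contraction carries a single tag, so it comes back from this function as a single part.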
NUPOS comprises the following objects, attributes, and relationships:
The following diagram is useful as a way of summarizing NUPOS. It's not a formal UML diagram, and the drawing has no particular implementation implications, other than as a way of summarizing some of the functionality that any particular full implementation of NUPOS must support. It's just an informal way of making a picture out of the objects, attributes, and relationships enumerated above and described and defined in detail in this note. The double-headed arrow is used to indicate the relationship "may have more than one of", while the single-headed arrow indicates "has one and only one of". The term "list of" in the one-to-many relationship between words and their parts indicates that the parts of a word are ordered -- there's a first one, then a second one, and so on. This is important for dealing with n-grams.
The following table lists all the non-punctuation parts of speech defined by NUPOS. The first column provides the NUPOS part of speech tag. The second column describes the tag. The third column offers an example of the part of speech. The fourth column provides a rounded count of occurrences of the tag in the MorphAdorner training data, expressed as parts per million, which shows how commonly the tag occurs. The training data consists of about six million words drawn from the following texts:
Examples are chosen for the most part from the training data.
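The parts per million figure is a simple rescaling of a raw count. A minimal sketch, assuming a hypothetical function `per_million` and a rounding precision of two decimal places (the note does not state the precision used); the counts below are made-up inputs, not values from the training data.

```python
def per_million(count: int, total_words: int, ndigits: int = 2) -> float:
    """Scale a raw tag count to occurrences per million words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return round(count * 1_000_000 / total_words, ndigits)


# Illustrative inputs against a corpus of six million words.
TOTAL = 6_000_000
print(per_million(12, TOTAL))
print(per_million(918, TOTAL))
```

With a corpus of about six million words, a tag seen 12 times corresponds to 2 occurrences per million.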
|Occurrences per million words
|acp word as adverb
|I have not seen him since
|noun-adverb as adverb
|comparative adj/noun as adverb
|determiner/adverb as adverb
|comparative determiner/adverb as adverb
|can lesser hide his love
|superlative determiner as adverb
|negative determiner as adverb
|adjective as adverb
|comparative adjective as adverb
|he fared worse
|adj/noun as adverb
|duly, right honourable
|superlative adjective as adverb
|in you it best lies
|noun as adverb
|had been cannibally given
|superlative adj/noun as adverb
|hee being the worthylest constant
|present participle as adverb
|past participle as adverb
|Stands Macbeth thus amazedly
|acp word as conjunction
|since I last saw him
|acp word as coordinating conjunction
|wh-word as conjunction
|when she saw
|2, two, ii
|'that' as conjunction
|I saw that it was hopeless
|that man, much money
|determiner in possessive use
|a man, the man
|negative determiner as adverb
|word in unspecified other language
|adverb as adjective
|the then king
|yet she much whiter
|present participles as comparative adjective
|for what pleasinger then varietie, or sweeter then flatterie?
|past participle as comparative adjective
|shall find curster than she
|the sky is blue
|present participle as superlative adjective
|the lyingest knave in Christendom
|past participle as superlative adjective
|present participle as adjective
|past participle as adjective
|noun-adverb as singular noun
|adjective as singular noun
|acp word as plural noun
|and many such-like "As'es" of great charge
|noun-adverb as plural noun
|all our yesterdays
|adverb as plural noun
|and are etcecteras no things
|determiner/adverb negative as plural noun
|yeas and honest kerysey noes
|adjective as plural noun
|give me particulars
|adj/noun as plural noun
|the subjects of his substitute
|present participle as plural noun, 'do'
|present participle as plural noun, 'have'
|my present havings
|present participle as plural noun
|the desperate languishings
|past participle as plural noun
|there was no necessity of a Letter of Slains for Mutilation
|singular possessive, noun
|noun-adverb in singular possessive use
|adjective as possessive noun
|the Eternal's wrath
|adj/noun as possessive noun
|our sovereign's fall
|past participle as possessive noun
|the late lamented's house
|plural possessive, noun
|adj/noun as plural possessive noun
|mortals' chiefest enemy
|adj/noun as noun
|a deep blue
|proper adjective as noun
|proper adjective as plural noun
|proper adjective as possessive noun
|The Roman's courage
|proper adjective as plural possessive noun
|The Romans' courage
|singular, proper noun
|plural, proper noun
|The Nevils are thy subjects
|singular possessive, proper noun
|plural possessive, proper noun
|will take the Nevils' part
|singular noun as proper noun
|at the Porpentine
|plural noun as proper noun
|such Brooks are welcome to me
|singular possessive noun as proper noun
|and through Wall's chink
|present participle as noun, 'do'
|present participle as noun, 'have'
|present participle as noun
|the running of the deer
|past participle as noun
|acp word as preposition
|to my brother
|acp word as particle
|singular, indefinite pronoun
|plural, indefinite pronoun
|from wicked ones
|plural, indefinite pronoun
|To hear my nothings monstered
|singular possessive, indefinite pronoun
|the pairings of one's nail
|possessive case, indefinite pronoun
|2nd person, personal pronoun
|3rd singular, personal pronoun
|1st singular possessive, personal pronoun
|a book of mine
|1st plural possessive, personal pronoun
|this land of ours
|2nd singular possessive, personal pronoun
|this is thine
|2nd person, possessive, personal pronoun
|this is yours
|3rd singular possessive, personal pronoun
|a cousin of his
|3rd plural possessive, personal pronoun
|this is theirs
|1st singular objective, personal pronoun
|1st plural objective, personal pronoun
|2nd singular objective, personal pronoun
|3rd singular objective, personal pronoun
|3rd plural objective, personal pronoun
|1st singular subjective, personal pronoun
|1st plural subjective, personal pronoun
|2nd singular subjective, personal pronoun
|3rd singular subjective, personal pronoun
|3rd plural subjective, personal pronoun
|1st singular, possessive pronoun
|1st plural, possessive pronoun
|2nd singular, possessive pronoun
|2nd person possessive pronoun
|3rd singular, possessive pronoun
|its, her, his
|3rd plural, possessive pronoun
|1st singular reflexive pronoun
|1st plural reflexive pronoun
|2nd singular reflexive pronoun
|2nd plural reflexive pronoun
|3rd singular reflexive pronoun
|herself, himself, itself
|3rd plural reflexive pronoun
|2nd singular possessive, reflexive pronoun
|interrogative use, wh-word
|Who? What? How?
|relative use, wh-word
|the girl who ran
|alphabetical or other symbol
|adverb as interjection
|wh-word as interjection
|Why, there were but four
|adjective as interjection
|adjective/noun as interjection
|And welcome, Somerset
|noun as interjection
|verb as interjection
|My gracious silence, hail
|2nd singular present of 'be'
|2nd plural present imperative, 'be'
|2nd singular present, 'be'
|thow nart yit blisful
|present tense, 'be'
|present tense negative, 'be'
|aren't, ain't, beant
|past tense, 'be'
|2nd singular past of 'be'
|thou wast, thou wert
|2nd singular past, 'be'
|plural past tense, 'be'
|whose yuorie shoulders weren couered all
|past tense negative, 'be'
|present participle, 'be'
|1st singular, 'be'
|1st singular negative, 'be'
|I nam nat lief to gabbe
|past participle, 'be'
|plural present, 'be'
|Thise arn the wordes
|3rd singular present, 'be'
|3rd singular present negative, 'be'
|2nd singular present of 'do'
|2nd plural present imperative, 'do'
|Dooth digne fruyt of Penitence
|2nd singular present negative, 'do'
|thee dostna know the pints of a woman
|present tense, 'do'
|present tense negative, 'do'
|past tense, 'do'
|2nd singular past of 'do'
|2nd singular past negative, verb
|Why, thee thought'st Hetty war a ghost, didstna? 0.20
|plural past tense, 'do'
|on Job , whom that we diden wo
|past tense negative, 'do'
|present participle, 'do'
|past participle, 'do'
|plural present, 'do'
|As freendes doon whan they been met
|3rd singular present, 'do'
|3rd singular present negative, 'do'
|2nd singular present of 'have'
|2nd plural present imperative, 'have'
|O haveth of my deth pitee!
|2nd singular present negative, 'have'
|present tense, 'have'
|present tense negative, 'have'
|past tense, 'have'
|2nd singular past of 'have'
|plural past tense, 'have'
|Of folkes that hadden grete fames
|past tense negative, 'have'
|present participle, 'have'
|past participle, 'have'
|plural present, 'have'
|They han of us no jurisdiccioun,
|3rd singular present, 'have'
|3rd singular present negative, 'have'
|Ther loveth noon, that she nath why to pleyne.
|2nd singular present of modal verb
|2nd singular present negative, modal verb
|O deth, allas, why nyltow do me deye
|present tense, modal verb
|can, may, shall, will
|1st singular present, modal verb
|Chill not let go, zir, without vurther 'cagion
|present tense negative, modal verb
|cannot; won't; I nyl nat lye
|past tense, modal verb
|could, might, should, would
|2nd singular past of modal verb
|couldst, shouldst, wouldst; how gret scorn woldestow han
|2nd singular present, modal verb
|Why noldest thow han writen of Alceste
|plural past tense, modal verb
|tho thinges ne scholden nat han ben doon
|past negative, modal verb
|couldn't; She nolde do that vileynye or synne
|infinitive, modal verb
|Criseyde shal nought konne knowen me.
|past participle, modal verb
|I had oones or twyes ycould
|plural present tense, modal verb
|and how ye schullen usen hem
|2nd singular present of verb
|2nd present imperative, verb
|For, sire and dame, trusteth me right weel,
|2nd singular present negative, verb
|"Yee!" seyde he, "thow nost what thow menest;
|present tense, verb
|present tense negative, verb
|What shall I don? For certes, I not how
|past tense, verb
|2nd singular past of verb
|2nd singular past negative, verb
|thou seidest that thou nystist nat
|past plural, verb
|They neuer strouen to be chiefe
|past tense negative, verb
|she caredna to gang into the stable
|present participle, verb
|past participle, verb
|plural present, verb
|Those faytours little regarden their charge
|3rd singular present, verb
|3rd singular present negative, verb
|She caresna for Seth.
|unknown or unparsable token
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.