NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Word Tokenizer: Tokenization Problems

The following presents some of the problems and solutions encountered while developing the word tokenizers for MorphAdorner. One important general principle is that MorphAdorner's word tokenizer and sentence splitter iterate back and forth as needed to achieve the best possible sentence splitting and tokenization.

Commas in numbers

MorphAdorner treats a comma as a separator in all cases except when the comma appears in the middle of a number. For example, the string 1,250 represents a number (one thousand two hundred fifty). MorphAdorner leaves such number strings intact so that the part-of-speech taggers can treat them as numbers.
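
The comma rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not MorphAdorner's own code; the function name tokenize_commas and the choice to treat "between two digits" as the test for "inside a number" are assumptions.

```python
import re

# Assumption: a comma is "inside a number" only when a digit stands
# directly on both sides of it. Every other comma is a separator.
SEPARATOR_COMMA = re.compile(r"(?<!\d),|,(?!\d)")

def tokenize_commas(text):
    """Split text on whitespace, making separator commas their own tokens."""
    return SEPARATOR_COMMA.sub(" , ", text).split()

print(tokenize_commas("paid 1,250 pounds, then left"))
```

A comma that follows a number but is not followed by a digit, as in "In 1850, he", is still treated as a separator.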

Missing whitespace after a period

Many sentences in literary text transcriptions run together without a space after the period. Example:

	systematic."How

Here the sentence should be split after the period and before the double quote.

For

	systematic.'How

the sentence split should likewise fall after the period, before the single quote, because a contraction should rarely, if ever, contain a "." followed by a single quote.
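
A rough sketch of this rule follows. It is not MorphAdorner's implementation. The function name is hypothetical, and the requirement that the quote be followed by a capital letter is an added assumption, used here so that possessives such as R.'s are left alone.

```python
import re

# Letter, period, quote mark, capital letter: a likely sentence boundary
# with the space missing. The capital-letter test is an assumption.
MERGED_QUOTE = re.compile(r"(?<=[A-Za-z])\.(?=[\"'][A-Z])")

def split_before_quote(text):
    """Insert the missing space between the period and the opening quote."""
    return MERGED_QUOTE.sub(". ", text)

print(split_before_quote('systematic."How'))
print(split_before_quote("systematic.'How"))
```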

Some commonly merged forms should always be split:

	Mr.Capitalname -> Mr. Capitalname
	&c.crap -> &c. crap
	Mrs.Howell -> Mrs. Howell
	St.Miriam -> St. Miriam
	Mr.Doyce! -> Mr. Doyce!
	Dr.Mull -> Dr. Mull
	Mr.R.'s -> Mr. R.'s
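
One way to express the "always split" forms is a short list of abbreviations that are split whenever another word character follows the period. The list below contains only the forms shown in the examples above; the names TITLES and split_title are hypothetical, and MorphAdorner's actual list is not reproduced here.

```python
import re

# Illustrative list only, taken from the examples above.
TITLES = ("Mr", "Mrs", "Dr", "St", "&c")

MERGED_TITLE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(t) for t in TITLES) + r")\.(?=\w)"
)

def split_title(text):
    """Insert a space after a listed abbreviation that is fused to the next word."""
    return MERGED_TITLE.sub(r"\1. ", text)

for sample in ("Mrs.Howell", "&c.crap", "Mr.R.'s"):
    print(sample, "->", split_title(sample))
```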

Examples of other merged strings which should be split include

	stairs.The
	pleasing.How
	emotions.What!
	bloodthirstiness.The
	on.Think
	Tom.You
	spring.The
	it.Or
	houses.But,
	door.The
	in.He
	right.The
	so.But
	sufferable.The
	dishonour.But
	emotion.She
	Esq.Advocate

Here the decision to split comes from the nature of the tokens on the left-hand and right-hand sides of the period. In each case, both tokens are known words or abbreviations in their own right.

On the other hand, common abbreviations should not be split. MorphAdorner keeps a list of these. Examples:

	i.e.
	p.m.

In some cases it is difficult to decide whether a string is a legitimate abbreviation. For example, e.g_ is presumably a variant of e.g., but what about etc.s? When in doubt, MorphAdorner leaves a potential abbreviation unsplit.
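
The split decision described in this section can be sketched as follows: keep listed abbreviations whole, split when both sides of the period are known words, and otherwise leave the token alone. The tiny ABBREVIATIONS and LEXICON sets and the function name are placeholders for illustration; MorphAdorner's real word lists and logic are not shown here.

```python
# Toy data for illustration only.
ABBREVIATIONS = {"i.e.", "e.g.", "p.m."}
LEXICON = {"stairs", "the", "it", "or", "houses", "but", "esq", "advocate"}

def split_at_period(token, lexicon=LEXICON, abbreviations=ABBREVIATIONS):
    """Return [left, right] if the token should be split, else [token]."""
    if token.lower() in abbreviations:
        return [token]                      # listed abbreviation: keep whole
    left, dot, right = token.partition(".")
    right_word = right.rstrip("!?,;:")      # ignore trailing punctuation for lookup
    if dot and left.lower() in lexicon and right_word.lower() in lexicon:
        return [left + ".", right]          # both sides are known words
    return [token]                          # in doubt: leave unsplit

for sample in ("stairs.The", "houses.But,", "i.e.", "etc.s", "e.g_"):
    print(sample, "->", split_at_period(sample))
```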

Roman numerals

Roman numerals in older texts exhibit considerably more orthographic variation than contemporary usage allows. For example, the letter "j" is often used as a substitute for the letter "i" and "u" for "v". Runs of letters may exceed the nominal length, e.g., "iiiii" may be used where "v" would normally appear in current usage. Particularly in early modern texts, numerals may be preceded and/or followed by a period. Examples:

	xviiii = 19
	xxc = 80
	.XVI. = 16
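
A permissive reader for such variants might look like the sketch below. It is an assumption about one workable approach, not MorphAdorner's algorithm: periods are stripped, "j" and "u" are mapped to "i" and "v", and the string is scanned from right to left, subtracting any letter smaller than the largest letter seen so far. The function name roman_value is hypothetical.

```python
VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_value(text):
    """Return the value of a Roman numeral variant, or None if it is not one."""
    s = text.strip(".").lower().replace("j", "i").replace("u", "v")
    if not s or any(ch not in VALUES for ch in s):
        return None
    total = 0
    largest = 0                  # largest letter value seen so far, from the right
    for ch in reversed(s):
        value = VALUES[ch]
        if value < largest:
            total -= value       # smaller letter before a larger one: subtract
        else:
            total += value
            largest = value
    return total

for sample in ("xviiii", "xxc", ".XVI.", "iiij"):
    print(sample, roman_value(sample))
```

A check this loose also accepts ordinary words made of the same letters, such as "mix" or "did", so a real tokenizer would need context to tell numerals from words.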

Some Roman numerals are followed by the letter "o" or "m" in a <sup> tag, e.g., DCCXXV<sup>o</sup>. These letters are Latin or quasi-Latin inflection markers for a dative or accusative form. Such a numeral should be treated as the same word without the trailing marker, e.g., DCCXXV<sup>o</sup> should be treated as DCCXXV.
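
Removing the marker could be done along these lines. The pattern assumes the marker appears exactly as <sup>o</sup> or <sup>m</sup> directly after the numeral letters; the function name is hypothetical and the sketch does not reflect how MorphAdorner handles the markup internally.

```python
import re

# Roman numeral letters (including the j and u variants), then a
# superscript "o" or "m" marker.
SUP_MARKER = re.compile(r"^([IVXLCDMJUivxlcdmju]+)<sup>[om]</sup>$")

def strip_inflection_marker(token):
    """Return the numeral without its superscript marker, or the token unchanged."""
    match = SUP_MARKER.match(token)
    return match.group(1) if match else token

print(strip_inflection_marker("DCCXXV<sup>o</sup>"))
```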

MorphAdorner attempts to recognize many of these variants so that they can be assigned one of the number part-of-speech tags.
