Northwestern University Information Technology
The following presents some of the problems and solutions encountered while developing the word tokenizers for MorphAdorner. One important general principle is that MorphAdorner's word tokenizer and sentence splitter iterate back and forth as needed to achieve the best possible sentence splitting and tokenization.
MorphAdorner treats a comma as a separator in all cases except when the comma appears in the middle of a number. For example, the string 1,250 represents a number (one thousand two hundred fifty). MorphAdorner leaves such number strings intact so that the part of speech taggers can treat them as numbers.
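The comma rule can be illustrated with a small regular expression. This is only a minimal sketch in Python, not MorphAdorner's own code; the names TOKEN_PATTERN and split_commas are hypothetical, and "in the middle of a number" is assumed here to mean a comma with digits directly on both sides.

```python
import re

# Alternatives are tried left to right:
#   1. digits joined by commas, e.g. "1,250" (kept as one token)
#   2. any run of characters that are neither whitespace nor commas
#   3. a lone comma, emitted as its own separator token
TOKEN_PATTERN = re.compile(r"\d+(?:,\d+)+|[^\s,]+|,")

def split_commas(text):
    """Split text on whitespace and commas, keeping digit,digit groups whole."""
    return TOKEN_PATTERN.findall(text)

print(split_commas("He paid 1,250 pounds, then left"))
```

In this sketch the comma after "pounds" comes out as a separate token, while 1,250 survives as a single string for a later tagging stage.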
Many sentences in literary text transcriptions run together without a space after the period. Example:
Here the sentence should be split after the period and before the double quote.
the sentence split should occur on the single quote because contractions should rarely, if ever, have a "." followed by a single quote.
Some commonly merged forms should always be split:
Mr.Capitalname -> Mr. Capitalname
&c.crap -> &c. crap
Mrs.Howell -> Mrs. Howell
St.Miriam
Mr.Doyce!
Dr.Mull
Mr.R.'s -> Mr. R.'s
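A rule of this kind might be sketched as follows. The code is an illustration in Python under stated assumptions, not MorphAdorner's implementation: the prefix list contains only the forms shown in the examples above, and the names MERGED_PREFIX and split_merged_prefix are hypothetical.

```python
import re

# Prefixes taken from the examples above; a real list would be longer.
# The lookahead (?=\S) requires that something follows the period,
# so a bare "Mr." is left alone.
MERGED_PREFIX = re.compile(r"(Mr|Mrs|Dr|St|&c)\.(?=\S)(.+)")

def split_merged_prefix(token):
    """Split 'Mr.Doyce!' into ['Mr.', 'Doyce!']; return other tokens unchanged."""
    match = MERGED_PREFIX.fullmatch(token)
    if match is None:
        return [token]
    return [match.group(1) + ".", match.group(2)]

for sample in ["Mrs.Howell", "Mr.R.'s", "&c.crap", "Mr."]:
    print(sample, "->", split_merged_prefix(sample))
```

The period stays attached to the prefix, mirroring the arrows in the examples; any further handling of the right-hand piece (such as the trailing "!" in Mr.Doyce!) is left to later stages.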
Examples of other merged strings which should be split include:
stairs.The pleasing.How emotions.What! bloodthirstiness.The on.Think Tom.You spring.The it.Or houses.But, door.The in.He stairs.The right.The so.But sufferable.The dishonour.But emotion.She Esq.Advocate
Here the decision to split depends on the tokens to the left and right of the period. In each case, both tokens are known words or abbreviations in their own right.
On the other hand, common abbreviations should not be split. MorphAdorner keeps a list of these. Examples:
It can be difficult to decide in some cases when a string is a legitimate abbreviation. For example, e.g_ is presumably a variant of e.g., but what about etc.s? When in doubt, MorphAdorner leaves a potential abbreviation unsplit.
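The split-or-keep decision for strings with an internal period can be sketched roughly as below. This is a hedged illustration in Python, not MorphAdorner's actual logic: KNOWN_WORDS and ABBREVIATIONS are tiny hypothetical stand-ins for a real lexicon and abbreviation list, and split_at_period is an invented name. Keeping the period attached to the left-hand piece is also an assumption.

```python
import re

# Hypothetical stand-ins for a real lexicon and abbreviation list.
KNOWN_WORDS = {"stairs", "the", "it", "or", "door", "esq", "advocate"}
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

# Letters, one period, letters: the shape of a merged string like "stairs.The".
MERGED = re.compile(r"([A-Za-z]+)\.([A-Za-z]+)")

def split_at_period(token):
    """Split only when both sides are known; when in doubt, leave unsplit."""
    if token.lower() in ABBREVIATIONS:
        return [token]                      # listed abbreviation: never split
    match = MERGED.fullmatch(token)
    if match is None:
        return [token]                      # not a simple merged form
    left, right = match.groups()
    if left.lower() in KNOWN_WORDS and right.lower() in KNOWN_WORDS:
        return [left + ".", right]          # both sides recognized: split
    return [token]                          # doubtful case: keep whole

for sample in ["stairs.The", "Esq.Advocate", "i.e.", "etc.s", "e.g_"]:
    print(sample, "->", split_at_period(sample))
```

With these sample lists, stairs.The and Esq.Advocate are split, while i.e., etc.s, and e.g_ are all returned unchanged.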
Roman numerals in older texts exhibit considerably more orthographic variation than contemporary usage allows. For example, the letter "j" is often used as a substitute for the letter "i" and "u" for "v". Runs of letters may exceed the nominal length, e.g., "iiiii" may be used where "v" would normally appear in current usage. Particularly in early modern texts, numerals may be preceded and/or followed by a period. Examples:
Some Roman numerals are followed by the letter "o" or "m" in a <sup> tag, e.g., DCCXXV<sup>o</sup>. These are Latin or quasi-Latin inflection markers for a dative or accusative form. These should be treated as a form of the word without the trailing marker characters, e.g., DCCXXV<sup>o</sup> should be treated as DCCXXV.
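Stripping the marker might look like the following sketch. It is written in Python for illustration only; the pattern and the name strip_inflection_marker are assumptions, and the marker is removed only when the base consists entirely of Roman numeral letters (including the "j" and "u" variants).

```python
import re

# A run of Roman numeral letters followed by <sup>o</sup> or <sup>m</sup>.
SUP_MARKED_NUMERAL = re.compile(
    r"([ivxlcdmju]+)<sup>[om]</sup>", re.IGNORECASE
)

def strip_inflection_marker(token):
    """Return the bare numeral for 'DCCXXV<sup>o</sup>'; otherwise the token itself."""
    match = SUP_MARKED_NUMERAL.fullmatch(token)
    if match is None:
        return token
    return match.group(1)

print(strip_inflection_marker("DCCXXV<sup>o</sup>"))
```

Tokens that carry other superscripts, or whose base is not made up of numeral letters, pass through unchanged in this sketch.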
MorphAdorner attempts to recognize many of these variants so that they can be assigned one of the number part of speech tags.
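One way to recognize such variants is sketched below in Python. It is a loose illustration rather than MorphAdorner's recognizer, and the names ROMAN_LIKE, VALUES, and roman_value are hypothetical. The sketch accepts an optional period on each side, maps "j" to "i" and "u" to "v", and tolerates over-long runs such as "iiiii". It would also accept ordinary words spelled only with numeral letters (for example "mix"), so a real tokenizer would need context to rule those out.

```python
import re

# Optional leading/trailing period around a run of numeral letters,
# including the older "j" (for "i") and "u" (for "v") spellings.
ROMAN_LIKE = re.compile(r"\.?([ijvuxlcdm]+)\.?", re.IGNORECASE)

VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_value(token):
    """Return the integer value of a Roman-like token, or None if it is not one."""
    match = ROMAN_LIKE.fullmatch(token)
    if match is None:
        return None
    letters = match.group(1).lower().replace("j", "i").replace("u", "v")
    total = 0
    for index, letter in enumerate(letters):
        value = VALUES[letter]
        following = VALUES[letters[index + 1]] if index + 1 < len(letters) else 0
        # A smaller value before a larger one is subtracted (as in "iv");
        # everything else is added, so "iiiii" simply sums to 5.
        total += -value if value < following else value
    return total

for sample in [".xxiij.", "iiiii", "uj", "hello"]:
    print(sample, "->", roman_value(sample))
```

A token for which roman_value returns a number could then be passed on as a candidate for one of the number part of speech tags.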
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.