The following presents some of the problems and solutions
encountered while developing the word tokenizers for MorphAdorner.
One important general principle is that MorphAdorner's word tokenizer and
sentence splitter iterate back and forth as needed to achieve the best
possible sentence splitting and tokenization.
Commas in numbers
MorphAdorner treats a comma as a separator in all cases except when
a comma appears in the middle of a number. For example, the string
1,250 represents a number (one thousand two hundred fifty).
MorphAdorner leaves such number strings intact so that the part-of-speech
taggers can treat them as numbers.
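
The comma rule can be illustrated with a short Python sketch. This is not MorphAdorner's own code; the function name split_commas and the regular expression are assumptions chosen for illustration. A comma is treated as a separator unless it has a digit on both sides.

```python
import re

# Hypothetical pattern: a comma counts as a separator unless it has a
# digit immediately before it AND a digit immediately after it.
COMMA_OUTSIDE_NUMBER = re.compile(r"(?<!\d),|,(?!\d)")

def split_commas(token):
    """Split a token around separator commas, keeping numbers like 1,250 whole."""
    # Pad each separator comma with spaces, then split on whitespace.
    return COMMA_OUTSIDE_NUMBER.sub(" , ", token).split()
```

With this sketch, a token such as "houses," yields the word and the comma as two tokens, while "1,250" comes back unchanged.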
Missing whitespace after a period
Many sentences in literary text transcriptions run together without a
space after the period. Example:
systematic."How
Here the sentence should be split after the period and before the double
quote.
For
systematic.'How
the sentence split should occur at the single quote, because a contraction
should rarely, if ever, contain a "." followed by a single quote.
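
A minimal Python sketch of this rule follows. The name split_period_quote is hypothetical, and the requirement that an uppercase letter follow the quote is an added assumption, made so that possessives such as R.'s are left alone.

```python
import re

# Hypothetical pattern: a period, then a double or single quote, then an
# uppercase letter (assumed here to signal the start of a new sentence).
PERIOD_QUOTE = re.compile(r"""(\.)(["'])(?=[A-Z])""")

def split_period_quote(text):
    """Insert a space between a sentence-final period and an opening quote."""
    return PERIOD_QUOTE.sub(r"\1 \2", text)
```

The inserted space marks the proposed sentence boundary. A fuller implementation would pass this boundary to the sentence splitter rather than modify the text.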
Some commonly merged forms should always be split:
Mr.Capitalname -> Mr. Capitalname
&c.crap -> &c. crap
Mrs.Howell -> Mrs. Howell
St.Miriam -> St. Miriam
Mr.Doyce! -> Mr. Doyce!
Dr.Mull -> Dr. Mull
Mr.R.'s -> Mr. R.'s
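
The always-split forms can be sketched in Python as follows. The TITLES tuple holds only the forms shown in the examples and the function name split_merged_title is hypothetical. MorphAdorner's actual list is not given here.

```python
import re

# Illustrative list only: the abbreviations that appear in the examples.
TITLES = ("Mr", "Mrs", "Dr", "St", "&c")

# A listed abbreviation, not preceded by a letter, followed by a period
# that runs straight into another letter.
TITLE_MERGED = re.compile(
    r"(?<![A-Za-z])(" + "|".join(re.escape(t) for t in TITLES) + r")\.(?=[A-Za-z])"
)

def split_merged_title(text):
    """Insert the missing space after a merged title abbreviation."""
    return TITLE_MERGED.sub(r"\1. ", text)
```

Text that already has a space after the period is returned unchanged, because the pattern requires a letter directly after the period.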
Examples of other merged strings which should be split include
stairs.The
pleasing.How
emotions.What!
bloodthirstiness.The
on.Think
Tom.You
spring.The
it.Or
houses.But,
door.The
in.He
stairs.The
right.The
so.But
sufferable.The
dishonour.But
emotion.She
Esq.Advocate
Here the decision to split comes from the nature of the tokens
on the left and right hand sides of the period. In each case,
the token on either side is a known word or abbreviation in its own right.
On the other hand, common abbreviations should not be split.
MorphAdorner keeps a list of these. Examples:
i.e.
p.m.
It can be difficult to decide in some cases when a string is a
legitimate abbreviation. For example, e.g_
is presumably a variant of e.g., but what about
etc.s? When in doubt,
MorphAdorner leaves a potential abbreviation unsplit.
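
The decision logic described above can be sketched in Python. The word list, the abbreviation list, and the function name split_merged are all hypothetical stand-ins. A real tokenizer would consult a full lexicon. The sketch splits only when both sides of the period are known words, and otherwise leaves the token unsplit.

```python
# Tiny illustrative lexicons (assumptions, not MorphAdorner's data).
KNOWN_WORDS = {"stairs", "the", "emotions", "what", "it", "or", "esq", "advocate"}
ABBREVIATIONS = {"i.e.", "p.m.", "e.g."}

def split_merged(token):
    """Split 'left.Right' when both halves are known; otherwise keep the token."""
    if token.lower() in ABBREVIATIONS:
        return [token]  # listed abbreviations are never split
    left, sep, right = token.partition(".")
    if not sep or not left or not right:
        return [token]  # no internal period to split on
    # Ignore trailing punctuation when looking up the right-hand word.
    right_core = right.rstrip("!?,;:")
    if left.lower() in KNOWN_WORDS and right_core.lower() in KNOWN_WORDS:
        return [left + ".", right]
    return [token]  # when in doubt, leave unsplit
```

Under these assumptions "stairs.The" is split into "stairs." and "The", while "i.e.", "e.g_" and "etc.s" are all returned whole.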
Roman numerals
Roman numerals in older texts exhibit considerably more
orthographic variation than contemporary usage allows.
For example, the letter "j" is often used as a substitute
for the letter "i" and "u" for "v". Runs of letters may exceed the
nominal length, e.g., "iiiii" may be used where "v" would
normally appear in current usage. Particularly in early
modern texts, numerals may be preceded and/or followed by
a period. Examples:
xviiii 19
xxc 80
.XVI. 16
Some Roman numerals are followed by the letter "o" or "m"
in a <sup> tag, e.g., DCCXXV<sup>o</sup>.
These are Latin or quasi-Latin inflection markers
for a dative or accusative form. These should be treated as a form of
the word without the trailing marker characters, e.g.,
DCCXXV<sup>o</sup> should be treated as DCCXXV.
MorphAdorner attempts to recognize many of these variants
so that they can be assigned one of the numeric part-of-speech
tags.
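
One way to handle these variants is sketched below in Python. The function names and the evaluation strategy are assumptions for illustration, not MorphAdorner's implementation. The sketch strips surrounding periods and a trailing <sup>o</sup> or <sup>m</sup> marker, maps "j" to "i" and "u" to "v", and then reads the letters right to left, subtracting any letter smaller than the largest one already seen.

```python
import re

VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Trailing inflection marker such as <sup>o</sup> or <sup>m</sup>.
SUP_MARKER = re.compile(r"<sup>[om]</sup>$", re.IGNORECASE)

def normalize_roman(token):
    """Reduce a variant spelling to lowercase letters from the set ivxlcdm."""
    core = token.strip().strip(".")
    core = SUP_MARKER.sub("", core).strip(".")
    return core.lower().replace("j", "i").replace("u", "v")

def roman_value(token):
    """Return the numeric value, or None if the token is not a numeral."""
    core = normalize_roman(token)
    if not core or any(ch not in VALUES for ch in core):
        return None
    total = 0
    largest = 0
    # Right to left: a letter smaller than the largest seen so far is
    # subtractive, which also covers runs such as "xx" before "c".
    for ch in reversed(core):
        value = VALUES[ch]
        if value < largest:
            total -= value
        else:
            total += value
            largest = value
    return total
```

This sketch accepts any string made up only of numeral letters, so ordinary words such as "mix" would also be read as numbers. A practical recognizer would need context or a lexicon check to rule those out.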