|
Extracting words and extracting sentences from a text are fundamental
operations on which other language processing functions depend.
Word tokenization splits a text into
words and punctuation marks.
Sentence splitting
assembles the tokenized text into sentences.
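As a rough illustration of the second step, the sketch below groups an already tokenized text into sentences. It is not MorphAdorner's sentence splitter: the function name `split_sentences` and the set `SENTENCE_ENDERS` are hypothetical, and the only rule applied is that a stand-alone terminator token closes the current sentence.

```python
# Hypothetical set of tokens that close a sentence in this sketch.
SENTENCE_ENDERS = {".", "!", "?"}


def split_sentences(tokens):
    """Group tokens into sentences, closing one after each terminator token."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        # Only a stand-alone terminator ends a sentence; an abbreviation
        # kept as one token (for example "Mr.") does not.
        if token in SENTENCE_ENDERS:
            sentences.append(current)
            current = []
    # Keep any trailing tokens that were never terminated.
    if current:
        sentences.append(current)
    return sentences
```

Because the grouping works on tokens rather than raw characters, its quality depends entirely on how the tokenizer has already treated periods, which is why the tokenization decisions matter so much.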
The first step in word tokenization is recognizing word boundaries.
The tokenizer uses white space such as blanks and tabs as the primary
cue for splitting the text into tokens. Punctuation marks are
split from the initial tokens. This is not as easy as it sounds.
For example, when should a token containing a hyphen be split into
two or more tokens? When does a period indicate the end of an
abbreviation as opposed to a sentence or a number or a Roman numeral?
Sometimes a period can act as a sentence terminator and an abbreviation
terminator at the same time. When should a single quote be split from a
word? Early Modern English included many contractions with a
leading quote, such as 'tis.
MorphAdorner's tokenizers use a number of
heuristics and a list of common abbreviations to produce a sequence of
punctuation and spellings that will be consistent with the subsequent
operations of sentence boundary identification, part of speech tagging,
and lemmatization.
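The following is a minimal sketch of this kind of heuristic tokenizer, not MorphAdorner's actual implementation. The names `tokenize`, `ABBREVIATIONS` and `LEADING_QUOTE_WORDS` are hypothetical, and the two word lists are deliberately tiny stand-ins for the much larger lists a real tokenizer would need.

```python
# Hypothetical, deliberately tiny lists used only for illustration.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "etc.", "viz."}
LEADING_QUOTE_WORDS = {"'tis", "'twas", "'twere", "'twill"}

# Punctuation peeled from the start and the end of a whitespace-delimited chunk.
LEADING = "\"'([{"
TRAILING = "\"')]}.,;:!?"


def tokenize(text):
    """Split text on white space, then peel punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        leading = []
        trailing = []
        # Peel opening punctuation, unless the chunk is a 'tis-style word
        # whose leading quote is part of the spelling.
        while chunk and chunk[0] in LEADING:
            if chunk.rstrip(TRAILING).lower() in LEADING_QUOTE_WORDS:
                break
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Peel closing punctuation, unless what is left is a known
        # abbreviation or a 'tis-style word.
        while chunk and chunk[-1] in TRAILING:
            lowered = chunk.lower()
            if lowered in ABBREVIATIONS or lowered in LEADING_QUOTE_WORDS:
                break
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        # Trailing marks were collected right to left, so restore their order.
        tokens.extend(reversed(trailing))
    return tokens
```

With these lists, `tokenize("'Tis true, Mr. Smith said.")` yields `["'Tis", "true", ",", "Mr.", "Smith", "said", "."]`. The sketch does not handle the harder cases mentioned above, such as hyphenated tokens or a period that ends an abbreviation and a sentence at the same time.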
Different part of speech tag sets may require different tokenization.
The Penn Treebank tag set assumes contractions should be split into
separate tokens. Thus the token
can't appears as two tokens,
ca and n't. The NUPOS tag set
can work with tokens split this way, but at present we prefer to
keep contracted forms as a single token.
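One way to support both conventions is to make contraction splitting a switch on the tokenizer's output. The sketch below assumes the usual Penn Treebank-style convention of splitting off clitics such as n't, 's and 'll. The names `CLITIC`, `split_contraction` and `tokens_for_tagset` are hypothetical and the clitic list is not exhaustive.

```python
import re

# Clitic suffixes split off under the assumed Penn Treebank-style convention.
CLITIC = re.compile(r"^(.+?)(n't|'s|'ll|'re|'ve|'d|'m)$", re.IGNORECASE)


def split_contraction(token):
    """Return [stem, clitic] for a recognised contraction, else [token]."""
    match = CLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def tokens_for_tagset(tokens, split=False):
    """Split contractions only when the target tag set expects it."""
    result = []
    for token in tokens:
        if split:
            result.extend(split_contraction(token))
        else:
            result.append(token)
    return result
```

For example, `tokens_for_tagset(["it's", "late"], split=True)` gives `["it", "'s", "late"]`, while the default `split=False` leaves `it's` as a single token. A leading-quote form such as 'tis matches none of the suffixes and is left whole in either mode.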
Even when the text has been more or less correctly tokenized, the
individual tokens may still be erroneous.
The digital text of many Early Modern English works was created
using scanners and optical character recognition (OCR) software.
Such digitized text frequently contains all manner of orthographic errors.
Examples include substitution of "~" for the letters "m" or "n" and
misreading of the archaic long "s" as the letter "f".
Some of these errors can be corrected automatically using heuristics and a
spelling standardizer.
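A simple way to picture such a correction heuristic is to generate the alternative readings of each suspicious character and keep a candidate only when a word list confirms it. The sketch below is an assumption-laden illustration, not MorphAdorner's spelling standardizer: `LEXICON`, `ALTERNATIVES`, `candidates` and `correct` are hypothetical names, and the lexicon is a toy.

```python
from itertools import product

# Hypothetical mini-lexicon standing in for a real spelling standardizer.
LEXICON = {"some", "sin", "most", "from", "first", "himself"}

# Each suspicious character maps to the readings worth trying.
ALTERNATIVES = {"~": ("m", "n"), "f": ("f", "s")}


def candidates(token):
    """Yield every spelling obtained by re-reading the suspicious characters."""
    options = [ALTERNATIVES.get(ch, (ch,)) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct(token):
    """Return the unique lexicon match among the candidates, else the token.

    Corrected spellings are returned in lower case for simplicity.
    """
    if token.lower() in LEXICON:
        return token
    matches = {c for c in candidates(token.lower()) if c in LEXICON}
    # Only correct when exactly one known word fits; otherwise leave as is.
    return matches.pop() if len(matches) == 1 else token
```

Here `correct("fo~e")` returns `"some"` and `correct("moft")` returns `"most"`, while a token with no lexicon match, or with several, is returned unchanged. The number of candidates doubles with every suspicious character, so this brute-force approach is only reasonable for short tokens.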
In the print world, a punctuation mark does not count as a word.
Instead, punctuation separates groups of words. In computer
terms, punctuation is a kind of "meta-data", not qualitatively very
different from SGML or XML markup. MorphAdorner's word tokenizers
nevertheless treat punctuation marks as words. This is justified because
the punctuation "meta-data" added by authors (or editors)
lives at the same level of data as the words, and treating both alike
allows a consistent treatment of token transition probabilities for
adornment processes such as part of speech tagging.
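To see why treating punctuation as ordinary tokens keeps things uniform, consider a plain relative-frequency estimate of token transitions. The sketch below is illustrative only: `transition_counts` and `transition_probability` are hypothetical names, no smoothing is applied, and a real tagger would work with a far richer model.

```python
from collections import Counter


def transition_counts(tokens):
    """Count adjacent token pairs, treating punctuation like any other token."""
    return Counter(zip(tokens, tokens[1:]))


def transition_probability(counts, previous, current):
    """Estimate P(current | previous) by relative frequency, without smoothing."""
    total = sum(n for (first, _), n in counts.items() if first == previous)
    return counts[(previous, current)] / total if total else 0.0
```

Given the tokens `["Yes", ",", "sir", ".", "No", ",", "madam", "."]`, the comma is followed once by `sir` and once by `madam`, so `transition_probability(counts, ",", "sir")` is `0.5`. No special case is needed for the comma or the period, because they sit in the same sequence as the words.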
You may be interested in reading about some
tokenization problems
we encountered while processing literary texts.
You can try MorphAdorner's
default word tokenizer online. The example
only works with plain unmarked text.
|