Interface | Description |
---|---|
CanSplitAroundPeriods | Interface for a tokenizer which can split tokens around a period. |
CanTokenizeWhitespace | Interface for a tokenizer which can tokenize whitespace. |
PostTokenizer | Interface for processing an extracted token. |
PreTokenizer | Interface for preparing a string for tokenization. |
WordTokenizer | Interface for tokenizing a string into "words". |

Class | Description |
---|---|
AbstractPostTokenizer | Abstract post tokenizer which processes an extracted token. |
AbstractPreTokenizer | Abstract pretokenizer which prepares a string for tokenization. |
AbstractWordTokenizer | Base class for deriving word tokenizers. |
ApostrophesAreNotQuotesWordTokenizer | Word tokenizer that treats apostrophes as distinct from single quotes. |
ContractionTokenizer | Splits text containing contractions into separate tokens. |
DefaultPostTokenizer | Default post tokenizer which processes tokens after extraction. |
DefaultPreTokenizer | Default pretokenizer which prepares a string for tokenization. |
DefaultWordTokenizer | Default word tokenizer. |
EccoPostTokenizer | ECCO text post tokenizer which processes tokens after extraction. |
EccoPreTokenizer | A pretokenizer for ECCO texts. |
EEBOPostTokenizer | Post tokenizer for EEBO texts. |
EEBOPreTokenizer | A pretokenizer for original form EEBO texts (not converted to TEIAnalytics). |
EEBOWordTokenizer | Word tokenizer for EEBO texts. |
ICU4JBreakIteratorWordTokenizer | Word tokenizer which uses the ICU library for tokenization. |
NoopPreTokenizer | A "no-op" pretokenizer which leaves the input string unchanged. |
NoopWordTokenizer | Word tokenizer which leaves the original text untokenized. |
PennTreebankTokenizer | Splits text into tokens according to the Penn Treebank tokenization rules. |
PostTokenizerFactory | PostTokenizer factory. |
PreTokenizerFactory | PreTokenizer factory. |
TokenizerUtils | Tokenizer utilities. |
WhitespaceWordTokenizer | Simple word tokenizer which splits on whitespace only. |
WordTokenizerFactory | WordTokenizer factory. |

Extracting words and sentences from a text are fundamental operations required by other language processing functions. Word tokenization splits a text into words and punctuation marks. Sentence splitting assembles the tokenized text into sentences.
The first step in word tokenization is recognizing word boundaries. The tokenizer uses white space such as blanks and tabs as the primary cue for splitting the text into initial tokens, and then splits punctuation marks from those tokens. This is not as easy as it sounds. For example, when should a token containing a hyphen be split into two or more tokens? When does a period indicate the end of an abbreviation as opposed to the end of a sentence, a number, or a Roman numeral? Sometimes a period acts as a sentence terminator and an abbreviation terminator at the same time. When should a single quote be split from a word? Early Modern English included many contractions, such as "'tis", that begin with an apostrophe.
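
The two-stage approach described above (split on white space, then peel punctuation from each chunk) can be sketched as follows. This is a minimal illustration, not MorphAdorner's implementation: the class name `SimpleTokenizerSketch`, the tiny abbreviation list, and the list of leading-apostrophe words are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of whitespace-then-punctuation tokenization. */
class SimpleTokenizerSketch {
    // Illustrative lists only; a real tokenizer would use far larger ones.
    private static final Set<String> ABBREVIATIONS = Set.of("Mr.", "Mrs.", "Dr.", "etc.");
    private static final Set<String> LEADING_APOSTROPHE_WORDS = Set.of("'tis", "'twas");
    private static final String PUNCTUATION = ".,;:!?\"'()";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Stage 1: white space is the primary cue for token boundaries.
        for (String chunk : text.trim().split("\\s+")) {
            if (!chunk.isEmpty()) {
                splitChunk(chunk, tokens);
            }
        }
        return tokens;
    }

    // Stage 2: separate leading and trailing punctuation from one chunk.
    private static void splitChunk(String chunk, List<String> out) {
        // A known abbreviation keeps its period.
        if (ABBREVIATIONS.contains(chunk)) {
            out.add(chunk);
            return;
        }
        // Peel trailing punctuation, preserving its original order.
        int end = chunk.length();
        List<String> trailing = new ArrayList<>();
        while (end > 0 && isPunctuation(chunk.charAt(end - 1))) {
            trailing.add(0, String.valueOf(chunk.charAt(end - 1)));
            end--;
        }
        String core = chunk.substring(0, end);
        if (LEADING_APOSTROPHE_WORDS.contains(core.toLowerCase())) {
            // Contractions such as 'tis keep their leading apostrophe.
            out.add(core);
        } else {
            // Peel leading punctuation, then emit what remains.
            int start = 0;
            while (start < core.length() && isPunctuation(core.charAt(start))) {
                out.add(String.valueOf(core.charAt(start)));
                start++;
            }
            if (start < core.length()) {
                out.add(core.substring(start));
            }
        }
        out.addAll(trailing);
    }

    private static boolean isPunctuation(char c) {
        return PUNCTUATION.indexOf(c) >= 0;
    }
}
```

Calling `SimpleTokenizerSketch.tokenize("Mr. Smith said, 'tis late.")` yields the tokens `Mr.`, `Smith`, `said`, `,`, `'tis`, `late`, and `.`. The sketch deliberately ignores the harder cases named above: hyphens are never split, and an abbreviation that also ends a sentence is not detected.
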
MorphAdorner's tokenizers use a number of heuristics and a list of common abbreviations to produce a sequence of punctuation marks and spellings that is consistent with the subsequent operations of sentence boundary identification, part of speech tagging, and lemmatization. Different part of speech tag sets may require different tokenization. The Penn Treebank tag set assumes contractions are split into separate tokens. Thus the token "can't" appears as two tokens, "ca" and "n't". The NUPOS tag set can work with tokens split this way, but at present we prefer to keep contracted forms as a single token. MorphAdorner includes tokenizers for each approach.
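
The difference between the two approaches can be illustrated with a small sketch. It assumes the usual Penn Treebank convention of detaching `n't` and clitics such as `'re` or `'ll`; the class and method names are hypothetical and the clitic list is not exhaustive, so this should not be read as the behavior of `ContractionTokenizer` or `PennTreebankTokenizer`.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch contrasting split and unsplit contractions. */
class ContractionSketch {
    // "n't" is detached as a unit, so "don't" becomes "do" + "n't".
    private static final Pattern NEGATION =
        Pattern.compile("(?i)^(.+)(n't)$");
    // A small, non-exhaustive set of other clitics.
    private static final Pattern CLITIC =
        Pattern.compile("(?i)^(.+)('s|'re|'ve|'ll|'d|'m)$");

    /** Penn Treebank style: split a contraction into two tokens. */
    static List<String> splitContraction(String token) {
        Matcher m = NEGATION.matcher(token);
        if (!m.matches()) {
            m = CLITIC.matcher(token);
        }
        if (m.matches()) {
            return List.of(m.group(1), m.group(2));
        }
        return List.of(token);
    }

    /** Alternative style: keep the contracted form as a single token. */
    static List<String> keepContraction(String token) {
        return List.of(token);
    }
}
```

With this sketch, `splitContraction("they're")` returns `they` and `'re`, while `keepContraction("they're")` returns the single token `they're`. A tagger trained on one convention generally expects its input tokenized the same way, which is why the choice of tokenizer follows the choice of tag set.
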
Even when the text has been more or less correctly tokenized, the individual tokens may still be erroneous. The digital text of many Early Modern English works was created using scanners and optical character recognition (OCR) software. Such digitized text frequently contains all manner of orthographic errors. Examples include the substitution of "~" for the letters "m" or "n" and the rendering of the archaic long "s" as the letter "f". Some of these errors can be corrected automatically using heuristics and a spelling standardizer.
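
As an illustration of the kind of heuristic involved, the sketch below expands each "~" into both "m" and "n" and accepts a correction only when exactly one candidate appears in a word list. The class name, the method names, and the lexicon passed in are assumptions for the example; this is not MorphAdorner's spelling standardizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of lexicon-checked correction of "~" in OCR text. */
class TildeCorrectionSketch {
    /** Generate every spelling obtained by reading each "~" as "m" or "n". */
    static List<String> candidates(String token) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : token.toCharArray()) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                if (c == '~') {
                    next.add(prefix + 'm');
                    next.add(prefix + 'n');
                } else {
                    next.add(prefix + c);
                }
            }
            results = next;
        }
        return results;
    }

    /**
     * Return the unique candidate found in the lexicon, or the original
     * token when no candidate, or more than one, is a known word.
     */
    static String correct(String token, Set<String> lexicon) {
        if (token.indexOf('~') < 0) {
            return token;
        }
        String match = null;
        for (String candidate : candidates(token)) {
            if (lexicon.contains(candidate)) {
                if (match != null) {
                    return token; // ambiguous: leave for later review
                }
                match = candidate;
            }
        }
        return match != null ? match : token;
    }
}
```

For example, with a lexicon containing "from", the token "fro~" is corrected to "from". With a lexicon containing both "cam" and "can", the token "ca~" is left unchanged because the heuristic cannot choose between them.
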
In the print world, a punctuation mark does not count as a word. Instead, punctuation separates groups of words. In computer terms, punctuation is a kind of "meta-data", not so different in kind from SGML or XML markup. MorphAdorner's word tokenizers nevertheless treat punctuation marks as words. This procedure is justified because the punctuation "meta-data" added by authors (or editors) lives at the same level of data as the words, and treating both alike allows a consistent treatment of token transition probabilities for adornment processes such as part of speech tagging.
All MorphAdorner word tokenizers must implement the WordTokenizer interface. The WordTokenizerFactory provides the mechanism for instantiating a default or specified instance of a WordTokenizer implementation. The AbstractWordTokenizer serves as a base class for deriving concrete implementations of word tokenizers.
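
The interface, base class, and factory arrangement described above follows a common Java pattern, sketched below in a self-contained form. Every type here is a hypothetical stand-in declared for the example (note the `Sketch` in each name); the method names and signatures are assumptions and should be checked against the actual WordTokenizer and WordTokenizerFactory documentation before use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a word tokenizer interface. */
interface SketchWordTokenizer {
    List<String> extractWords(String text);
}

/** Hypothetical base class holding helpers shared by implementations. */
abstract class AbstractSketchWordTokenizer implements SketchWordTokenizer {
    protected List<String> splitOnWhitespace(String text) {
        List<String> words = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }
}

/** Concrete implementation: split on white space only. */
class WhitespaceSketchTokenizer extends AbstractSketchWordTokenizer {
    @Override
    public List<String> extractWords(String text) {
        return splitOnWhitespace(text);
    }
}

/** Concrete implementation: leave the text untokenized. */
class NoopSketchTokenizer extends AbstractSketchWordTokenizer {
    @Override
    public List<String> extractWords(String text) {
        return List.of(text);
    }
}

/** Hypothetical factory producing a default or a named implementation. */
class SketchWordTokenizerFactory {
    /** Default instance (the choice of default is an assumption). */
    static SketchWordTokenizer newWordTokenizer() {
        return new WhitespaceSketchTokenizer();
    }

    /** Instance of the named class, falling back to the default on failure. */
    static SketchWordTokenizer newWordTokenizer(String className) {
        try {
            return (SketchWordTokenizer) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            return newWordTokenizer();
        }
    }
}
```

Client code depends only on the interface, for example `SketchWordTokenizerFactory.newWordTokenizer().extractWords("a b c")`, so an alternative tokenizer can be swapped in by class name without changing the callers.
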
MorphAdorner word tokenizers may need to preprocess the text to regularize white space or perform other operations before splitting the text into tokens. A word tokenizer can invoke a class implementing the PreTokenizer interface for this purpose. The PreTokenizerFactory instantiates a default or specified instance of a PreTokenizer implementation. The AbstractPreTokenizer serves as a base class for deriving concrete implementations of pretokenizers.
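
To show why a pretokenizing step is useful, the sketch below pairs a naive single-blank splitter with two interchangeable pretokenizers. All type and method names are hypothetical stand-ins declared for the example; they are assumptions, not the signatures of MorphAdorner's PreTokenizer classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a pretokenizer interface. */
interface SketchPreTokenizer {
    String pretokenize(String text);
}

/** Collapses every run of white space into one blank and trims the ends. */
class WhitespaceNormalizingPreTokenizer implements SketchPreTokenizer {
    @Override
    public String pretokenize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }
}

/** Leaves the input unchanged. */
class NoopSketchPreTokenizer implements SketchPreTokenizer {
    @Override
    public String pretokenize(String text) {
        return text;
    }
}

/** A deliberately naive splitter that relies on its pretokenizer. */
class PreTokenizingSplitter {
    private final SketchPreTokenizer preTokenizer;

    PreTokenizingSplitter(SketchPreTokenizer preTokenizer) {
        this.preTokenizer = preTokenizer;
    }

    List<String> extractWords(String text) {
        // Prepare the text first, then split on single blanks.
        String prepared = preTokenizer.pretokenize(text);
        List<String> words = new ArrayList<>();
        if (prepared.isEmpty()) {
            return words;
        }
        for (String piece : prepared.split(" ")) {
            words.add(piece);
        }
        return words;
    }
}
```

With the normalizing pretokenizer, the input `" a \t b\n"` yields the two tokens `a` and `b`. With the no-op pretokenizer, the input `"a  b"` (two blanks) yields an empty token between `a` and `b`, which is the kind of artifact that regularizing white space beforehand avoids.
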