AbstractPreTokenizer (MorphAdorner)

java.lang.Object
- edu.northwestern.at.utils.IsCloseableObject
- - edu.northwestern.at.morphadorner.corpuslinguistics.tokenizer.AbstractPreTokenizer

All Implemented Interfaces:

PreTokenizer, UsesLogger

Direct Known Subclasses:

DefaultPreTokenizer, EccoPreTokenizer, EEBOPreTokenizer, NoopPreTokenizer
```
public abstract class AbstractPreTokenizer
extends IsCloseableObject
implements PreTokenizer, UsesLogger
```
Default pretokenizer which prepares a string for tokenization.

Field Summary

Fields
Modifier and Type	Field and Description
`protected static java.lang.String`	`alwaysSeparators` Pattern to match characters which are always separators.
`protected static PatternReplacer`	`alwaysSeparatorsReplacer` Always Separators replacer pattern.
`protected static java.lang.String`	`asterisks` Pattern to match one or more asterisks.
`protected static java.lang.String`	`commaSeparator` Pattern to match comma as a separator.
`protected static PatternReplacer`	`commaSeparatorReplacer` Comma separator replacer pattern.
`protected static java.lang.String`	`hyphens` Pattern to match two or more hyphens in a row.
`protected Logger`	`logger` Logger used for output.
`protected static java.lang.String`	`periods` Pattern to match three or more periods.
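
The pattern fields above are described only in words, and their actual values live on the Constant Field Values page, which is not reproduced here. As a rough illustration, the following sketch shows regular expressions that would satisfy three of those descriptions. The class name `PatternSketch`, the constant names, and the regex strings are all assumptions made for this example, not the values MorphAdorner defines.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-ins for the documented pattern fields (assumed values). */
final class PatternSketch {
    // Assumed regex for "three or more periods".
    static final String PERIODS = "\\.{3,}";
    // Assumed regex for "one or more asterisks".
    static final String ASTERISKS = "\\*+";
    // Assumed regex for "two or more hyphens in a row".
    static final String HYPHENS = "-{2,}";

    /** Returns true when the entire text matches the given regex. */
    static boolean matchesAll(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll(PERIODS, "..."));  // three periods
        System.out.println(matchesAll(PERIODS, ".."));   // only two periods
        System.out.println(matchesAll(ASTERISKS, "*"));  // a single asterisk
        System.out.println(matchesAll(HYPHENS, "--"));   // two hyphens
        System.out.println(matchesAll(HYPHENS, "-"));    // a single hyphen
    }
}
```

Keeping the expressions as `String` constants, as the field types above suggest, lets subclasses combine them into larger patterns before compiling.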

Constructor Summary

Constructors
Constructor and Description
`AbstractPreTokenizer()` Create a preTokenizer.

Method Summary

Methods
Modifier and Type	Method and Description
`Logger`	`getLogger()` Get the logger.
`java.lang.String`	`pretokenize(java.lang.String line)` Prepare text for tokenization.
`void`	`setLogger(Logger logger)` Set the logger.
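
The three methods above can be pictured with a small self-contained sketch. It does not use the MorphAdorner classes: `SketchLogger`, `SketchPreTokenizerBase`, and `LoggingPreTokenizerSketch` are hypothetical stand-ins that only mirror the documented method names, and the placeholder `pretokenize` body simply returns its input, since the real preparation steps are not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Logger type; not the MorphAdorner interface. */
interface SketchLogger {
    void logInfo(String message);
}

/** Stand-in base class mirroring getLogger, setLogger, and pretokenize. */
class SketchPreTokenizerBase {
    protected SketchLogger logger;

    public SketchLogger getLogger() {
        return logger;
    }

    public void setLogger(SketchLogger logger) {
        this.logger = logger;
    }

    /** Placeholder: returns the line unchanged (assumed, for illustration only). */
    public String pretokenize(String line) {
        return line;
    }
}

/** Example subclass that reports each call to the logger, when one is set. */
class LoggingPreTokenizerSketch extends SketchPreTokenizerBase {
    @Override
    public String pretokenize(String line) {
        String result = super.pretokenize(line);
        if (logger != null) {
            logger.logInfo("pretokenized " + line.length() + " characters");
        }
        return result;
    }
}

final class PreTokenizerUsageSketch {
    public static void main(String[] args) {
        List<String> messages = new ArrayList<>();
        LoggingPreTokenizerSketch pre = new LoggingPreTokenizerSketch();
        pre.setLogger(messages::add);  // collect log messages in a list
        System.out.println(pre.pretokenize("Hello, world."));
        System.out.println(messages);
    }
}
```

A caller would typically set the logger once after construction and then pass each input line through `pretokenize` before handing it to a word tokenizer.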

Methods inherited from class edu.northwestern.at.utils.IsCloseableObject
close

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface edu.northwestern.at.morphadorner.corpuslinguistics.tokenizer.PreTokenizer
close

- Field Detail
  - periods
```
protected static final java.lang.String periods
```
    Pattern to match three or more periods.
    
    See Also:
    Constant Field Values
  - asterisks
```
protected static final java.lang.String asterisks
```
    Pattern to match one or more asterisks.
    
    See Also:
    Constant Field Values
  - hyphens
```
protected static final java.lang.String hyphens
```
    Pattern to match two or more hyphens in a row.
    
    See Also:
    Constant Field Values
  - commaSeparator
```
protected static final java.lang.String commaSeparator
```
    Pattern to match comma as a separator.
    
    See Also:
    Constant Field Values
  - logger
```
protected Logger logger
```
    Logger used for output.
  - alwaysSeparators
```
protected static final java.lang.String alwaysSeparators
```
    Pattern to match characters which are always separators.
    Unicode U+25CF (BLACKCIRCLE) is the dot character which marks character lacunae. This is not a token separator. Neither is SOLIDCIRCLE, which was used in the old EEBO format TCP files to mark character lacunae.
    
    Unicode U+2011, the non-breaking hyphen, is not treated as a token separator.
    
    DEGREES_MARK is the degrees quote symbol. MINUTES_MARK is the minutes quote symbol. SECONDS_MARK is the seconds quote symbol. These are not token separators.
    
    Unicode U+2018 (LSQUOTE) is the left single curly quote. Unicode U+2019 (RSQUOTE) is the right single curly quote. These may or may not be token separators. It is up to the word tokenizer to decide.
    
    Unicode U+201C (LDQUOTE) is the left double curly quote. Unicode U+201D (RDQUOTE) is the right double curly quote. These are token separators.
    
    See Also:
    Constant Field Values
  - alwaysSeparatorsReplacer
```
protected static PatternReplacer alwaysSeparatorsReplacer
```
    Always Separators replacer pattern.
  - commaSeparatorReplacer
```
protected static PatternReplacer commaSeparatorReplacer
```
    Comma separator replacer pattern.
- Constructor Detail
  - AbstractPreTokenizer
```
public AbstractPreTokenizer()
```
    Create a preTokenizer.
- Method Detail
  - getLogger
```
public Logger getLogger()
```
    Get the logger.
    
    Specified by:
    
    getLogger in interface UsesLogger
    
    Returns:
    The logger.
  - setLogger
```
public void setLogger(Logger logger)
```
    Set the logger.
    
    Specified by:
    
    setLogger in interface UsesLogger
    
    Parameters:
    logger - The logger.
  - pretokenize
```
public java.lang.String pretokenize(java.lang.String line)
```
    Prepare text for tokenization.
    
    Specified by:
    
    pretokenize in interface PreTokenizer
    
    Parameters:
    line - The text to prepare for tokenization.
    
    Returns:
    The pretokenized text.
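
The `alwaysSeparators` notes state that the left and right double curly quotes (U+201C and U+201D) are token separators, while the non-breaking hyphen (U+2011) is not. A minimal sketch of how a pretokenizing step could act on that distinction follows. Padding separators with spaces is an assumption made for this example, as are the class name `SeparatorPaddingSketch` and the method `padDoubleQuotes`; the page does not say how `pretokenize` actually marks separators.

```java
/** Hypothetical illustration of separator handling; not the MorphAdorner code. */
final class SeparatorPaddingSketch {
    static final char LDQUOTE = '\u201C';    // left double curly quote
    static final char RDQUOTE = '\u201D';    // right double curly quote
    static final char NB_HYPHEN = '\u2011';  // non-breaking hyphen, left untouched

    /**
     * Surrounds each double curly quote with spaces so that a later
     * whitespace-based tokenizer sees it as its own token.
     * All other characters, including the non-breaking hyphen, are copied as-is.
     */
    static String padDoubleQuotes(String line) {
        StringBuilder sb = new StringBuilder(line.length() + 8);
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == LDQUOTE || c == RDQUOTE) {
                sb.append(' ').append(c).append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = LDQUOTE + "no" + NB_HYPHEN + "one" + RDQUOTE;
        // The quotes gain surrounding spaces; the hyphenated word stays whole.
        System.out.println(padDoubleQuotes(input));
    }
}
```

Because the single curly quotes are left to the word tokenizer to decide, a step like this one would pass them through unchanged.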
