The configuration settings for MorphAdorner appear in utf-8 text
files. Each setting takes the form setting=value and appears
on a separate line in the configuration file. The default settings
file is called morphadorner.properties. Overriding settings
may be specified in a file named by the p= parameter on the
MorphAdorner Command Line. A number of sample settings files
are provided in the MorphAdorner release materials, corresponding
to settings used when adorning files in the various corpora used
in the Monk project.
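
As an illustration of the setting=value layout, an overriding settings file for Early Modern English texts might look like the sketch below. The file paths are the ones given in the table that follows; the selection of settings is only an example, and the # comment line assumes the usual Java properties comment convention, which this page does not confirm.

```properties
# Illustrative override file (comment syntax is an assumption).
lexicon.word_lexicon=data/emewordlexicon.lex
lexicon.suffix_lexicon=data/emesuffixlexicon.lex
partofspeechtagger.transition_matrix=data/emetransmat.mat
spelling.spelling_pairs=data/ememergedspellingpairs.tab
```

Any setting left out of such a file would keep the value given in the default settings file.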
The following table lists the setting names and their definitions,
along with typical values.
MorphAdorner Configuration Settings
|
Setting Name
|
Description and Values
|
|
abbreviations.abbreviations_url
|
Specifies the URL for an extra list of abbreviations.
Such a list for Early Modern English texts may be found
in data/emeabbreviations.
|
|
adornedwordoutputter.class
|
Class which produces adorned word values. The following output classes
are currently implemented in MorphAdorner.
- PrintStreamAdornedWordOutputter writes words and their
adornments as plain utf-8 text in tab-separated columns to a file.
This is the default output format.
- ConsoleAdornedWordOutputter writes words and their adornments
as plain utf-8 text in tab-separated columns to the default
system output device.
- ListAdornedWordOutputter writes words and adornments
to an internal list of strings. This is used when processing
XML input files.
- SimpleXMLAdornedWordOutputter outputs words and their
adornments to a file in a simple XML format.
<words>
<word id="1">
<tok>Poets</tok>
<spe>Poets</spe>
<pos>n2</pos>
<reg>Poets</reg>
<lem>poet</lem>
<eos>0</eos>
</word>
<word id="2">
...
</word>
...
</words>
- ByteStreamAdornedWordOutputter writes words and adornments
to an internal byte stream.
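
To select one of these output classes, the setting would presumably be written as in the sketch below. It assumes the class is given by the bare name used in the list above; this page does not show whether a fully qualified Java class name is required, and the # comment syntax is likewise assumed.

```properties
# Assumption: the bare class name from the list above is accepted as the value.
adornedwordoutputter.class=SimpleXMLAdornedWordOutputter
```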
|
|
adorner.handle_xml
|
true to use the TEI XML handler, false to use the ordinary text handler.
|
|
adorner.lemmatization.ignorelexiconentries
|
true to ignore lemma definitions in the current lexicon file when
generating output lemmata, and use only the current lemmatizer.
false to look at the lemma definitions in the lexicon first, and
use the lemmatizer only when there is no lemma definition in the lexicon.
|
|
adorner.output.end_of_sentence_flag
|
true to output an end of sentence flag for each adorned word,
false to not generate this flag. The attribute value is set to "1"
when a word ends a sentence and "0" otherwise.
|
|
adorner.output.end_of_sentence_flag_attribute
|
The name of the XML word attribute for the end of sentence flag.
The default value is eos.
|
|
adorner.output.kwic
|
true to output keyword in context (kwic) entries for each adorned word,
false to not generate these entries.
|
|
adorner.output.kwic.width
|
The number of characters of kwic text to output. 80 is a typical value,
which is split between the left and right kwic text.
|
|
adorner.output.kwic_left_attribute
|
The name of the XML word attribute for the kwic text appearing
before a word.
The default value is kl.
|
|
adorner.output.kwic_right_attribute
|
The name of the XML word attribute for the kwic text appearing
after a word.
The default value is kr.
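
Taken together, the kwic settings might be combined as in this sketch, which enables keyword in context output with the typical width and the default attribute names stated in the descriptions above.

```properties
adorner.output.kwic=true
adorner.output.kwic.width=80
adorner.output.kwic_left_attribute=kl
adorner.output.kwic_right_attribute=kr
```

If the width is divided evenly, which is an assumption here, a value of 80 would give about 40 characters of context on each side of the word.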
|
|
adorner.output.lemma
|
true to output the lemma for an adorned word, false otherwise.
|
|
adorner.output.lemma_attribute
|
The name of the XML word attribute for the lemmata of an adorned word.
The default value is lem.
|
|
adorner.output.original_token
|
true to output the original word token for an adorned word,
false otherwise.
|
|
adorner.output.original_token_attribute
|
The name of the XML word attribute for the original word token of an
adorned word.
The default value is tok.
|
|
adorner.output.part_of_speech
|
true to output the part of speech for an adorned word,
false otherwise.
|
|
adorner.output.part_of_speech_attribute
|
The name of the XML word attribute for the part of speech of an
adorned word.
The default value is pos.
|
|
adorner.output.running_word_numbers
|
true to output the word numbers for adorned words as continuously
ascending values. false to restart the word numbers for
each sentence.
|
|
adorner.output.sentence_number
|
true to output the sentence number for an adorned word,
false otherwise.
|
|
adorner.output.sentence_number_attribute
|
The name of the XML word attribute for the sentence number for an
adorned word.
The default value is sn.
|
|
adorner.output.spelling
|
true to output the spelling for an adorned word,
false otherwise.
|
|
adorner.output.spelling_attribute
|
The name of the XML word attribute for the spelling for an
adorned word.
The default value is spe.
|
|
adorner.output.standard_spelling
|
true to output the standard spelling for an adorned word,
false otherwise.
|
|
adorner.output.standard_spelling_attribute
|
The name of the XML word attribute for the standard spelling for an
adorned word.
The default value is reg.
|
|
adorner.output.word_number
|
true to output the word number for an adorned word,
false otherwise.
|
|
adorner.output.word_number_attribute
|
The name of the XML word attribute for the word number for an
adorned word.
The default value is wn.
|
|
adorner.output.word_ordinal
|
true to output the word ordinal for an adorned word,
false otherwise.
|
|
adorner.output.word_ordinal_attribute
|
The name of the XML word attribute for the word ordinal for an
adorned word.
The default value is ord.
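
The adorner.output settings come in pairs: a true/false switch and a companion setting naming the XML attribute. The sketch below shows a few such pairs. The attribute names are the documented defaults; which outputs are switched on or off is only an example.

```properties
adorner.output.lemma=true
adorner.output.lemma_attribute=lem
adorner.output.part_of_speech=true
adorner.output.part_of_speech_attribute=pos
adorner.output.standard_spelling=true
adorner.output.standard_spelling_attribute=reg
adorner.output.end_of_sentence_flag=true
adorner.output.end_of_sentence_flag_attribute=eos
adorner.output.word_ordinal=false
```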
|
|
initialspellingstandardizer.class
|
The initial spelling standardizer class.
This is used when guessing parts of speech for words not present
in the lexicon. NoopSpellingStandardizer, the default,
leaves spellings unstandardized when guessing parts of speech.
|
|
lexicon.suffix_lexicon
|
The file containing the suffix lexicon.
For the standard MorphAdorner release, the lexicon files
appear in the data/ subdirectory.
The 19th century fiction suffix lexicon is
data/ncfsuffixlexicon.lex and the
Early Modern English suffix lexicon is
data/emesuffixlexicon.lex.
This value may be overridden on the MorphAdorner command
line by the -u parameter.
|
|
lexicon.word_lexicon
|
The file containing the word lexicon.
For the standard MorphAdorner release, the lexicon files
appear in the data/ subdirectory.
The 19th century fiction word lexicon is
data/ncfwordlexicon.lex and the
Early Modern English word lexicon is
data/emewordlexicon.lex.
This value may be overridden on the MorphAdorner command
line by the -l parameter.
|
|
morphadornerxmlwriter.class
|
The class for writing adorned XML files.
DefaultMorphAdornerXMLWriter is the default. This should not be
changed unless you implement a new XML writer class.
|
|
namestandardizer.class
|
The proper name standardizer class.
DefaultNameStandardizer is the default, which implements
the scheme described in
Standardizing Proper Names.
The NoopNameStandardizer class leaves names unstandardized.
The EEBOSimpleNameStandardizer
class corrects a handful of names when processing
early modern English texts.
|
|
partofspeechguesser.check_possessives
|
true to check for possessive endings when guessing the part
of speech for an unknown word, false otherwise.
The default setting is false, which is also the recommended setting.
|
|
partofspeechguesser.class
|
The part of speech guesser class, which tries to determine the
most likely parts of speech for an unknown word.
DefaultPartOfSpeechGuesser is the default,
which is designed for English words.
|
|
partofspeechguesser.try_standard_spellings
|
true to use standard spellings when guessing the parts of speech
for unknown words, false to use the original spellings only.
The default setting is true.
|
|
partofspeechretagger.class
|
The class which corrects the initial part of speech tagging.
The IRetagger class applies
a short list of fixup rules to improve the tagging of I
tokens. The NoopRetagger class leaves the original tagging
unchanged. The PronounRetagger class applies a short list
of fixup rules to improve the tagging of pronouns.
The DefaultPartOfSpeechRetagger is the same as
IRetagger.
|
|
partofspeechtagger.class
|
The class which performs part of speech tagging. The default is
TrigramTagger, which is a hidden Markov model
based trigram tagger. This is the workhorse tagger in MorphAdorner.
Other taggers, mostly experimental, include:
- AffixTagger uses an affix lexicon to assign
a part of speech tag to a word based upon the prefixes or suffixes
of the word.
- BigramTagger is a hidden Markov model
based bigram tagger. It is faster but less accurate than
the trigram tagger.
- BigramHybridTagger combines the bigram tagger with
a second pass by a Hepple tagger to correct the initial tagging.
Note: you must supply the correction rules; none are provided
by default.
- HeppleTagger is Mark Hepple's rule-based part of speech
tagger modified from the version in Gate to work with the
MorphAdorner lexicons, guessers, etc.
- RegexpTagger uses regular expressions to assign a part of
speech tag to a word. You must supply the regular expressions;
none are provided by default.
- SimpleTagger assigns a "noun" part of speech to all words,
except those that appear to be numbers. Numbers are assigned a "number"
part of speech. Words starting with a capital letter can be assigned a
separate "proper name" part of speech. This tagger is mostly useful
as a backup to a more sophisticated tagger.
- SimpleRuleBasedTagger assigns the most commonly occurring
part of speech to all words using a lexicon, and then applies a small
set of contextual rules to "fix up" the tagging. This simple tagger is
useful when very fast tagging without high accuracy is needed, e.g.,
in sentence splitting.
- TrigramHybridTagger combines the trigram tagger with
a second pass by a Hepple tagger to correct the initial tagging.
Note: you must supply the correction rules; none are provided
by default.
- UnigramTagger uses a lexicon to assign the most frequently
occurring part of speech tag to a word.
|
|
partofspeechtagger.transition_matrix
|
The file containing the tag transition probability matrix data.
For the standard MorphAdorner release, these files
appear in the data/ subdirectory.
The 19th century fiction transition matrix file is
data/ncftransmat.mat and the
Early Modern English transition matrix file is
data/emetransmat.mat.
This value may be overridden on the MorphAdorner command
line by the -t parameter.
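
For example, the default tagger paired with the 19th century fiction transition matrix could be configured as sketched below. Giving the class by its bare name is an assumption based on how classes are named in this table, as is the # comment syntax.

```properties
# Assumption: the bare class name is accepted as the value.
partofspeechtagger.class=TrigramTagger
partofspeechtagger.transition_matrix=data/ncftransmat.mat
```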
|
|
pretokenizer.class
|
The class which applies any pretokenization corrections to the
text to prepare it for initial token extraction.
The default is DefaultPreTokenizer which ensures that
characters which should always be separate tokens are surrounded
by whitespace. In general this class should always be used.
The EEBOPreTokenizer was written to correct the text
for EEBO texts before those texts were modified by Abbott to conform
to TEI Analytics standards.
|
|
posttokenizer.class
|
The class which applies any tokenization corrections to the
initial token extraction. The default is DefaultPostTokenizer.
The EEBOPostTokenizer was written to correct tokens extracted
from EEBO texts before those texts were modified by Abbott to conform
to TEI Analytics standards.
|
|
sentencesplitter.class
|
The class which determines sentence boundaries.
ICU4JBreakIteratorSentenceSplitter uses an ICU4J BreakIterator to
identify candidate sentences.
Several heuristics are used to correct the initial sentence identification
for English sentences.
The DefaultSentenceSplitter is the same as
ICU4JBreakIteratorSentenceSplitter.
|
|
spelling.spelling_pairs
|
The spelling data file which maps variant spellings to standard
spellings.
For the standard MorphAdorner release, these files
appear in the data/ subdirectory.
The 19th century fiction spelling map file is
data/ncfmergedspellingpairs.tab and the
Early Modern English spelling map file is
data/ememergedspellingpairs.tab.
This value may be overridden on the MorphAdorner command
line by the -a parameter.
|
|
spelling.spelling_pairs_by_word_class
|
The spelling data file which maps variant spellings to standard
spellings by word class.
For the standard MorphAdorner release, these files
appear in the data/ subdirectory.
The spelling map by word class file used for all periods
is data/spellingsbywordclass.txt.
This value may be overridden on the MorphAdorner command
line by the -w parameter.
|
|
spelling.standard_spellings
|
The spelling data file which lists standard spellings.
For the standard MorphAdorner release, this file
is data/standardspellings.txt.
This value may be overridden on the MorphAdorner command
line by the -s parameter.
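
The three spelling data settings could appear together as in this sketch for 19th century fiction, using only the file names given in the descriptions above.

```properties
spelling.spelling_pairs=data/ncfmergedspellingpairs.tab
spelling.spelling_pairs_by_word_class=data/spellingsbywordclass.txt
spelling.standard_spellings=data/standardspellings.txt
```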
|
|
spellingmapper.class
|
The spelling mapper class which maps spellings from one dialect to another.
The USToBritishSpellingMapper maps United States spellings
to British spellings, while BritishToUSSpellingMapper maps
British spellings to United States spellings.
|
|
spellingstandardizer.class
|
The class which maps variant spellings to standard spellings.
The DefaultSpellingStandardizer class is the
ExtendedSimpleSpellingStandardizer which uses spelling maps
along with a few simple heuristics to find standard spellings given a variant
spelling. The SimpleSpellingStandardizer class only uses
spelling maps. The ExtendedSearchSpellingStandardizer
implements the full scheme discussed at
Spelling Standardization Process, which can lead to exotically erroneous
standard spellings in some cases.
|
|
textinputter.class
|
The class which reads input text for adornment.
The DefaultTextInputter class is the
URLTextInputter which reads utf-8 text from a URL.
The SimpleXMLTextInputter reads utf-8 text from a TEI or EEBO XML
file. The DiskBasedXMLTextInputter also reads utf-8 text from a
TEI or EEBO XML file, but divides the file into smaller sections
which are stored in temporary disk files and adorned separately.
This is useful for working with large XML input files.
The FirstTokenURLTextInputter reads only the first token in each
line from a URL.
|
|
wordlists.use_latin_word_list
|
true to use an extended list of Latin words when adding part of speech
tags to words, false to not use the extended list.
|
|
wordtokenizer.class
|
Class which splits a sentence into word tokens.
DefaultWordTokenizer is the default and is suitable for
English text.
|
|
xml.adorn_existing_xml_files
|
true to adorn XML files with an existing adorned version
in the output directory, false to skip adornment for
existing files. true is the default value.
When set to true and an adorned file already
exists, a versioned output file name is created to avoid overwriting
the previous adorned version. For example, if the file "aaa.xml"
is to be adorned, and the adorned version "aaa.xml" already exists in
the output directory, then the file "aaa-001.xml" is created.
If "aaa-001.xml" already exists, "aaa-002.xml" is created, and so on.
|
|
xml.close_sentence_at_end_of_hard_tag
|
true to force a sentence to close at the end of a hard tag,
false to allow a sentence to cross hard tags. In many
literary texts sentences do cross hard tag (usually paragraph)
boundaries, so this setting should be set false.
|
|
xml.close_sentence_at_end_of_jump_tag
|
true to force a sentence to close at the end of a jump tag,
false to allow a sentence to cross jump tags. This
setting should generally be set to true.
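
For literary texts of the kind described above, the two sentence-closing recommendations would correspond to the following pair of lines. This is a sketch of the suggested values, not a required configuration.

```properties
xml.close_sentence_at_end_of_hard_tag=false
xml.close_sentence_at_end_of_jump_tag=true
```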
|
|
xml.disallow_word_elements_in
|
Specifies the XML elements in which to disallow generated <w> and <c>
elements. Element names are separated by blanks. The default list
is figDesc sic.
|
|
xml.field_delimiters
|
Field delimiters for adorned word output.
The default is the ASCII tab character \t.
This should not generally be changed.
|
|
xml.fix_gap_tags
|
true to fix <gap> tags in XML texts, false to
leave them alone. In general, if the input texts are in
TEI Analytics format, this setting should be false.
|
|
xml.fix_orig_tags
|
true to fix <orig> tags in XML texts, false to
leave them alone. In general, if the input texts are in
TEI Analytics format, this setting should be false.
|
|
xml.fix_split_words
|
true to fix split words in XML texts, false to
leave them alone. The match patterns are regular expressions
specified by the settings
xml.fix_split_words.match1,
xml.fix_split_words.match2, etc. The corresponding
corrections are specified by the settings
xml.fix_split_words.replace1,
xml.fix_split_words.replace2, etc. These patterns
may actually be used for more general purposes than splitting or
joining words. Examples of these settings may be found in
the eme.properties settings file in the MorphAdorner release.
|
|
xml.id.attribute
|
The name of the XML word ID attribute.
The default value is xml:id.
|
|
xml.id.spacing
|
This setting gives the spacing between ID values. For
example, a spacing of 10 increments successive
reading_context_order or wordinblock values by 10.
This allows new values to be interpolated for editing
purposes. The default value is 10.
|
|
xml.id.type
|
Word IDs start with the work identifier, taken from the file name
of the work.
reading_context_order appends integer values
whose order gives the reading context order defined
by the classification of hard, soft, and jump tags.
word_within_page_block appends two integer values in
the form pageblocknumber-wordinblock, where pageblocknumber
is the ordinal of the current (page break) entry,
and wordinblock is the number of the word within
the page block (starting at 1).
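
The three xml.id settings might be set together as in this sketch, which uses the documented default attribute name and spacing. The choice of reading_context_order over word_within_page_block is only an example.

```properties
xml.id.attribute=xml:id
xml.id.type=reading_context_order
xml.id.spacing=10
```

With a spacing of 10, neighboring values differ by 10, so up to nine new values can be interpolated between any two of them.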
|
|
xml.ignore_tag_case
|
true to ignore the case of XML tags when processing them,
false to consider different tag case significant.
The default is true.
|
|
xml.jump_tags
|
The list of XML jump tags, separated by blanks.
MorphAdorner uses the following jump tags for the default
TEI Analytics XML input files.
bibl figdesc figDesc figure footnote note ref stage tailnote
|
|
xml.log
|
true to enable extended logging, false otherwise.
The default is false.
|
|
xml.output_nonredundant_attributes_only
|
true to emit only non-redundant word tag attributes,
false to emit all word attributes.
A word attribute is redundant if its value can be determined
from the data enclosed by the tags or from another
tag value.
By default MorphAdorner emits all word tag values even if redundant.
|
|
xml.output_nonredundant_token_attribute
|
true to emit only non-redundant token attributes,
false to emit all token attributes.
A redundant token attribute specifies the same text as the
data enclosed by the tags.
By default MorphAdorner emits all token values even if redundant.
|
|
xml.output_pseudo_page_boundary_milestones
|
true to emit XML pseudopage boundary milestone elements,
false to not emit these milestones.
|
|
xml.output_whitespace_elements
|
true to emit whitespace elements (e.g., <c> </c>) between
word elements in XML, false to not emit these whitespace elements.
This setting should be true in most cases.
|
|
xml.pseudo_page_container_div_types
|
The list of XML div types which close a pseudopage, separated by blanks.
MorphAdorner uses the following div types for the default
TEI Analytics XML input files.
volume chapter sermon
|
|
xml.pseudo_page_size
|
The maximum length in words of a pseudopage. The default is 300 words.
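
A sketch of the pseudopage settings used together, with the default size and the container list given above. Enabling the milestones is an example choice, since this page does not state a default for that setting.

```properties
xml.output_pseudo_page_boundary_milestones=true
xml.pseudo_page_size=300
xml.pseudo_page_container_div_types=volume chapter sermon
```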
|
|
xml.soft_tags
|
The list of XML soft tags, separated by blanks.
MorphAdorner uses the following soft tags for the default
TEI Analytics XML input files.
abbr add address author c cl corr date emph
foreign gap hi l lb location m mentioned
milestone money name num organization orig
pb person phr reg rs s sb seg sic soCalled
sub sup term time title unclear w zzzzsw
|
|
xml.surround_marker
|
The marker character used internally for surrounding distinct segments of text.
The default is Unicode character \ue501. This should not be changed.
|
|
xml.word_delimiters
|
Output word delimiters for adorned word output.
The default is ASCII \r\n. This should not be changed.
|
|
xml.word_tag_name
|
The name of the XML tag which is used to mark an adorned output word.
The default is w.
|
|
xml.xml_schema
|
The name of the default schema used for parsing an XML file when
none appears in the XML text. For MorphAdorner, the default is
the TEI Analytics schema, which appears at
http://ariadne.northwestern.edu/monk/schemata/TEIAnalytics.rng
|