Configuration Settings

The configuration settings for MorphAdorner appear in utf-8 text files. Each setting takes the form setting=value and appears on a separate line in the configuration file. The default settings file is called morphadorner.properties. To change the default settings file name, specify the alternative default using the d= parameter on the MorphAdorner Command Line.

Overriding settings may be specified in a file named by the p= parameter on the MorphAdorner Command Line. A number of sample settings files are provided in the MorphAdorner release materials. They correspond to the settings used to adorn files in the corpora of projects that used MorphAdorner.
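As a rough illustration of how a default settings file and an overriding settings file might combine, the following Python sketch parses setting=value lines and overlays the overrides on the defaults. It is not MorphAdorner's own loader: the function names are hypothetical, the treatment of # comment lines is an assumption borrowed from the usual properties-file convention, and escape sequences are not handled.

```python
def parse_settings(text):
    """Parse setting=value lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no "=" at all
            settings[name.strip()] = value.strip()
    return settings


def effective_settings(default_text, override_text=""):
    """Overlay the overriding settings on top of the default settings."""
    merged = parse_settings(default_text)
    merged.update(parse_settings(override_text))
    return merged


# Hypothetical file contents, used only for illustration.
defaults = "adorner.handle_xml=false\nadorner.output.lemma=true\n"
overrides = "# corpus-specific settings\nadorner.handle_xml=true\ncorpus.name=ncf\n"

settings = effective_settings(defaults, overrides)
print(settings["adorner.handle_xml"])
```

In this sketch a setting present in both files takes its value from the overriding file, while settings that appear only in the defaults are kept unchanged.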

Properties file            Corpus
docsouth.properties        Documenting the American South XML files.
eaf.properties             Early American Fiction XML files.
ecco.properties            Eighteenth Century Collections Online XML files.
ece.properties             Eighteenth century English XML files.
eme.properties             Early Modern English XML files.
emeplaintext.properties    Early Modern English plain text files.
ncf.properties             Nineteenth Century Fiction XML files in which the apostrophe, opening single quote, and closing single quote are the same character.
ncfa.properties            Nineteenth Century Fiction XML files in which the apostrophe is distinguished from the opening and closing single quote characters.
plaintext.properties       Plain text of nineteenth century vintage or later.
wright.properties          Wright Fiction Archive XML files.

The following table lists the setting names and their definitions, along with typical values.

MorphAdorner Configuration Settings
Setting Name Description and Values
abbreviations.abbreviations_url Specifies the URL for an extra list of abbreviations. Such a list for Early Modern English texts may be found in data/emeabbreviations.
abbreviations.main.abbreviations_url Specifies the URL for an extra list of abbreviations to be used only for main text.
abbreviations.side.abbreviations_url Specifies the URL for an extra list of abbreviations to be used only for side text, e.g., paratext.
adornedwordoutputter.class Class which produces adorned word values. The following output classes are currently implemented in MorphAdorner.
  • PrintStreamAdornedWordOutputter writes words and their adornments as plain utf-8 text in tab-separated columns to a file. This is the default output format.
  • ConsoleAdornedWordOutputter writes words and their adornments as plain utf-8 text in tab-separated columns to the default system output device.
  • ListAdornedWordOutputter writes words and adornments to an internal list of strings. This is used when processing XML input files.
  • SimpleXMLAdornedWordOutputter outputs words and their adornments to a file in a simple XML format.
    
    <words>
      <word id="1">
        <tok>Poets</tok>
        <spe>Poets</spe>
        <pos>n2</pos>
        <reg>Poets</reg>
        <lem>poet</lem>
        <eos>0</eos>
      </word>
      <word id="2">
        ...
      </word>
      ...
    </words>
    
    
  • ByteStreamAdornedWordOutputter writes words and adornments to an internal byte stream.
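The simple XML format shown for SimpleXMLAdornedWordOutputter is straightforward to read back with a standard XML parser. The following Python sketch is not part of MorphAdorner; it assumes the element names from the sample above (words, word, tok, spe, pos, reg, lem, eos), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# A one-word document in the simple XML format shown above.
sample = """<words>
  <word id="1">
    <tok>Poets</tok>
    <spe>Poets</spe>
    <pos>n2</pos>
    <reg>Poets</reg>
    <lem>poet</lem>
    <eos>0</eos>
  </word>
</words>"""


def read_adorned_words(xml_text):
    """Return one dict per <word>, keyed by the child element names."""
    root = ET.fromstring(xml_text)
    words = []
    for word in root.iter("word"):
        record = {"id": word.get("id")}
        for child in word:
            record[child.tag] = child.text
        words.append(record)
    return words


words = read_adorned_words(sample)
print(words[0]["lem"])
```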
adorner.handle_xml true to use the TEI XML handler, false to use the ordinary text handler.
adorner.lemmatization.ignorelexiconentries true to ignore lemma definitions in the current lexicon file when generating output lemmata, and use only the current lemmatizer. false to look at the lemma definitions in the lexicon first, and use the lemmatizer only when there is no lemma definition in the lexicon.
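The lookup order described for adorner.lemmatization.ignorelexiconentries can be sketched in Python as follows. This is an illustration only: the lexicon is modeled as a plain dict, and the lemmatizer is a deliberately naive stand-in, not MorphAdorner's lemmatizer.

```python
def choose_lemma(spelling, lexicon_lemmata, lemmatize, ignore_lexicon_entries=False):
    """Pick a lemma, consulting the lexicon first unless told to ignore it."""
    if not ignore_lexicon_entries:
        lemma = lexicon_lemmata.get(spelling)
        if lemma is not None:
            return lemma
    # No usable lexicon entry: fall back to the lemmatizer.
    return lemmatize(spelling)


def naive_lemmatize(spelling):
    """Hypothetical stand-in lemmatizer: strips a trailing 's' and nothing else."""
    return spelling[:-1] if spelling.endswith("s") else spelling


lexicon_lemmata = {"mice": "mouse"}  # hypothetical lexicon content

from_lexicon = choose_lemma("mice", lexicon_lemmata, naive_lemmatize)
from_lemmatizer = choose_lemma("mice", lexicon_lemmata, naive_lemmatize,
                               ignore_lexicon_entries=True)
not_in_lexicon = choose_lemma("poets", lexicon_lemmata, naive_lemmatize)
```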
adorner.output.end_of_sentence_flag true to output an end of sentence flag for each adorned word, false to not generate this flag. The attribute value is set to "1" when a word ends a sentence and "0" otherwise.
adorner.output.end_of_sentence_flag_attribute The name of the XML word attribute for the end of sentence flag. The default value is eos.
adorner.output.kwic true to output keyword in context (kwic) entries for each adorned word, false to not generate these entries.
adorner.output.kwic.width The number of characters of kwic text to output. 80 is a typical value, which is split between the left and right kwic text.
adorner.output.kwic_left_attribute The name of the XML word attribute for the kwic text appearing before a word. The default value is kl.
adorner.output.kwic_right_attribute The name of the XML word attribute for the kwic text appearing after a word. The default value is kr.
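To illustrate how a kwic width might be divided between the left and right context, here is a minimal Python sketch. It assumes the width is split evenly and that context is cut at plain character offsets. Both are assumptions for illustration, and the function name is hypothetical.

```python
def kwic(text, start, end, width=80):
    """Return (left, right) context around text[start:end].

    Assumes half of the width goes to each side of the keyword.
    """
    half = width // 2
    left = text[max(0, start - half):start]
    right = text[end:end + half]
    return left, right


text = "The Poets of old sang loudly"
start = text.index("Poets")
left, right = kwic(text, start, start + len("Poets"), width=8)
```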
adorner.output.lemma true to output the lemma for an adorned word, false otherwise.
adorner.output.lemma_attribute The name of the XML word attribute for the lemmata of an adorned word. The default value is lem.
adorner.output.original_token true to output the original word token for an adorned word, false otherwise.
adorner.output.original_token_attribute The name of the XML word attribute for the original word token of an adorned word. The default value is tok.
adorner.output.part_of_speech true to output the part of speech for an adorned word, false otherwise.
adorner.output.part_of_speech_attribute The name of the XML word attribute for the part of speech of an adorned word. The default value is pos.
adorner.output.running_word_numbers true to output the word numbers for adorned words as continuously ascending values, false to restart the word numbers for each sentence.
adorner.output.sentence_number true to output the sentence number for an adorned word, false otherwise.
adorner.output.sentence_number_attribute The name of the XML word attribute for the sentence number for an adorned word. The default value is sn.
adorner.output.spelling true to output the spelling for an adorned word, false otherwise.
adorner.output.spelling_attribute The name of the XML word attribute for the spelling for an adorned word. The default value is spe.
adorner.output.standard_spelling true to output the standard spelling for an adorned word, false otherwise.
adorner.output.standard_spelling_attribute The name of the XML word attribute for the standard spelling for an adorned word. The default value is reg.
adorner.output.word_number true to output the word number for an adorned word, false otherwise.
adorner.output.word_number_attribute The name of the XML word attribute for the word number for an adorned word. The default value is wn.
adorner.output.word_ordinal true to output the word ordinal for an adorned word, false otherwise.
adorner.output.word_ordinal_attribute The name of the XML word attribute for the word ordinal for an adorned word. The default value is ord.
corpus.name The name of the corpus for this configuration. Usually a short string such as "ncf" for "nineteenth century fiction." Used by the MorphAdorner server when displaying the available configurations. The server ignores MorphAdorner configurations which do not have the corpus.name set.
corpus.description Longer description of the corpus for this configuration. Used by the MorphAdorner server when displaying the available configurations. The server ignores MorphAdorner configurations which do not have the corpus.description set.
initialspellingstandardizer.class The initial spelling standardizer class. This is used when guessing parts of speech for words not present in the lexicon. NoopSpellingStandardizer, the default, leaves spellings unstandardized when guessing parts of speech.
lexicon.suffix_lexicon The file containing the suffix lexicon. For the standard MorphAdorner release, the lexicon files appear in the data/ subdirectory. The 19th century fiction suffix lexicon is data/ncfsuffixlexicon.lex and the Early Modern English suffix lexicon is data/emesuffixlexicon.lex. This value may be overridden on the MorphAdorner command line by the -u parameter.
lexicon.word_lexicon The file containing the word lexicon. For the standard MorphAdorner release, the lexicon files appear in the data/ subdirectory. The 19th century fiction word lexicon is data/ncfwordlexicon.lex and the Early Modern English word lexicon is data/emewordlexicon.lex. This value may be overridden on the MorphAdorner command line by the -l parameter.
morphadornerxmlwriter.class The class for writing adorned XML files. DefaultMorphAdornerXMLWriter is the default. This should not be changed unless you implement a new XML writer class.
namestandardizer.class The proper name standardizer class. DefaultNameStandardizer is the default, which implements the scheme described in Standardizing Proper Names. The NoopNameStandardizer class leaves names unstandardized. The EEBOSimpleNameStandardizer class corrects a handful of names when processing early modern English texts.
partofspeechguesser.check_possessives true to check for possessive endings when guessing the part of speech for an unknown word, false otherwise. The default setting is false, which is also the recommended setting.
partofspeechguesser.class The part of speech guesser class, which tries to determine the most likely parts of speech for an unknown word. DefaultPartOfSpeechGuesser is the default which is designed for English words.
partofspeechguesser.try_standard_spellings true to use standard spellings when guessing the parts of speech for unknown words, false to use the original spellings only. The default setting is true.
partofspeechretagger.class The class which corrects the initial part of speech tagging. The IRetagger class applies a short list of fixup rules to improve the tagging of I tokens. The NoopRetagger class leaves the original tagging unchanged. The PronounRetagger class applies a short list of fixup rules to improve the tagging of pronouns. The DefaultPartOfSpeechRetagger is the same as IRetagger.
partofspeechtagger.class The class which performs part of speech tagging. The default is TrigramTagger, which is a hidden Markov model based trigram tagger. This is the workhorse tagger in MorphAdorner. Other taggers, mostly experimental, include:
  • AffixTagger uses an affix lexicon to assign a part of speech tag to a word based upon the prefixes or suffixes of the word.
  • BigramTagger is a hidden Markov model based bigram tagger. It is faster but less accurate than the trigram tagger.
  • BigramHybridTagger combines the bigram tagger with a second pass by a Hepple tagger to correct the initial tagging. Note: you must supply the correction rules, none are provided by default.
  • HeppleTagger is Mark Hepple's rule-based part of speech tagger modified from the version in Gate to work with the MorphAdorner lexicons, guessers, etc.
  • RegexpTagger uses regular expressions to assign a part of speech tag to a word. You must supply the regular expressions, none are provided by default.
  • SimpleTagger assigns a "noun" part of speech to all words, except those that appear to be numbers. Numbers are assigned a "number" part of speech. Words starting with a capital letter can be assigned a separate "proper name" part of speech. This tagger is mostly useful as a backup to a more sophisticated tagger.
  • SimpleRuleBasedTagger assigns the most commonly occurring part of speech to all words using a lexicon, and then applies a small set of contextual rules to "fix up" the tagging. This simple tagger is useful when very fast tagging matters more than high accuracy, e.g., in sentence splitting.
  • TrigramHybridTagger combines the trigram tagger with a second pass by a Hepple tagger to correct the initial tagging. Note: you must supply the correction rules, none are provided by default.
  • UnigramTagger uses a lexicon to assign the most frequently occurring part of speech tag to a word.
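The idea behind the simplest of these taggers, the unigram approach, can be sketched in a few lines of Python. This is not MorphAdorner's UnigramTagger class: the lexicon layout (a dict of tag counts per lowercased spelling), the tag for unknown words, and the sample counts are all assumptions made for illustration.

```python
def unigram_tag(words, lexicon, unknown_tag="unknown"):
    """Assign each word its most frequent part of speech tag from the lexicon."""
    tagged = []
    for word in words:
        counts = lexicon.get(word.lower())
        if counts:
            tag = max(counts, key=counts.get)  # tag with the highest count
        else:
            tag = unknown_tag
        tagged.append((word, tag))
    return tagged


# Hypothetical lexicon: spelling -> {tag: number of occurrences}.
lexicon = {
    "the": {"dt": 120},
    "walks": {"vvz": 7, "n2": 3},
}

tagged = unigram_tag(["The", "walks", "zyx"], lexicon)
```

A tagger of this kind ignores context entirely, which is why the hidden Markov model taggers above are preferred when accuracy matters.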
partofspeechtagger.transition_matrix The file containing the tag transition probability matrix data. For the standard MorphAdorner release, these files appear in the data/ subdirectory. The 19th century fiction transition matrix file is data/ncftransmat.mat and the Early Modern English transition matrix file is data/emetransmat.mat. This value may be overridden on the MorphAdorner command line by the -t parameter.
pretokenizer.class The class which applies any pretokenization corrections to the text to prepare it for initial token extraction. The default is DefaultPreTokenizer which ensures that characters which should always be separate tokens are surrounded by whitespace. In general this class should always be used. The EEBOPreTokenizer was written to correct the text for EEBO texts before those texts were modified by Abbott to conform to TEI-Analytics standards.
posttokenizer.class The class which applies any tokenization corrections to the initial token extraction. The default is DefaultPostTokenizer. The EEBOPostTokenizer was written to correct tokens extracted from EEBO texts before those texts were modified by Abbott to conform to TEI-Analytics standards.
sentencesplitter.class The class which determines sentence boundaries. ICU4JBreakIteratorSentenceSplitter uses an ICU4J BreakIterator to identify candidate sentences. Several heuristics are used to correct the initial sentence identification for English sentences. The DefaultSentenceSplitter is the same as ICU4JBreakIteratorSentenceSplitter.
spelling.spelling_pairs The spelling data file which maps variant spellings to standard spellings. For the standard MorphAdorner release, these files appear in the data/ subdirectory. The 19th century fiction spelling map file is data/ncfmergedspellingpairs.tab and the Early Modern English spelling map file is data/ememergedspellingpairs.tab. This value may be overridden on the MorphAdorner command line by the -a parameter.
spelling.spelling_pairs_by_word_class The spelling data file which maps variant spellings to standard spellings by word class. For the standard MorphAdorner release, these files appear in the data/ subdirectory. The spelling map by word class file used for all periods is data/spellingsbywordclass.txt . This value may be overridden on the MorphAdorner command line by the -w parameter.
spelling.standard_spellings The spelling data file which list standard spellings. For the standard MorphAdorner release, this file is data/standardspellings.txt . This value may be overridden on the MorphAdorner command line by the -s parameter.
spellingmapper.class The spelling mapper class which maps spellings from one dialect to another. The USToBritishSpellingMapper maps United States spellings to British spellings, while BritishToUSSpellingMapper maps British spellings to United States spellings.
spellingstandardizer.class The class which maps variant spellings to standard spellings. The DefaultSpellingStandardizer class is the ExtendedSimpleSpellingStandardizer which uses spelling maps along with a few simple heuristics to find standard spellings given a variant spelling. The SimpleSpellingStandardizer class only uses spelling maps. The ExtendedSearchSpellingStandardizer implements the full scheme discussed at Spelling Standardization Process which can lead to exotically erroneous standard spellings in some cases.
textinputter.class The class which reads input text for adornment. The DefaultTextInputter class is the URLTextInputter which reads utf-8 text from a URL. The SimpleXMLTextInputter reads utf-8 text from a TEI or EEBO XML file. The DiskBasedXMLTextInputter also reads utf-8 text from a TEI or EEBO XML file, but divides the file into smaller sections which are stored in temporary disk files and adorned separately. This is useful for working with large XML input files. The FirstTokenURLTextInputter reads only the first token in each line from a URL.
wordlists.use_latin_word_list true to use an extended list of Latin words when adding part of speech tags to words, false to not use the extended list.
wordtokenizer.class Class which splits a sentence into word tokens. DefaultWordTokenizer is the default and is suitable for English text.
xml.adorn_existing_xml_files true to adorn XML files with an existing adorned version in the output directory, false to skip adornment for existing files. true is the default value. When set true and an existing adorned file exists, a versioned output file name is created to avoid overwriting the previous adorned version. For example, if the file "aaa.xml" is to be adorned, and the adorned version "aaa.xml" already exists in the output directory, then the file "aaa-001.xml" is created. If "aaa-001.xml" already exists, "aaa-002.xml" is created, and so on.
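The versioned naming scheme in the example above ("aaa.xml", then "aaa-001.xml", then "aaa-002.xml") can be sketched in Python as follows. The function name is hypothetical, and a set of names stands in for the contents of the output directory so that the sketch does not touch the file system.

```python
import os


def versioned_output_name(file_name, existing_names):
    """Return file_name, or the first free name of the form stem-NNN.ext."""
    if file_name not in existing_names:
        return file_name
    stem, ext = os.path.splitext(file_name)
    version = 1
    while True:
        candidate = f"{stem}-{version:03d}{ext}"
        if candidate not in existing_names:
            return candidate
        version += 1


first = versioned_output_name("aaa.xml", {"aaa.xml"})
second = versioned_output_name("aaa.xml", {"aaa.xml", "aaa-001.xml"})
```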
xml.close_sentence_at_end_of_hard_tag true to force a sentence to close at the end of a hard tag, false to allow a sentence to cross hard tags. In many literary texts sentences do cross hard tag (usually paragraph) boundaries, so this setting should be set false.
xml.close_sentence_at_end_of_jump_tag true to force a sentence to close at the end of a jump tag, false to allow a sentence to cross jump tags. This setting should generally be set to true.
xml.disallow_word_elements_in Specifies the XML elements in which generated word-level elements, such as <w>, are disallowed. Element names are separated by blanks. The default list is figDesc sic.
xml.field_delimiters Field delimiters for adorned word output. The default is the ASCII tab character \t . This should not generally be changed.
xml.fix_gap_tags true to fix <gap> tags in XML texts, false to leave them alone. In general, if the input texts are in TEI-Analytics format, this setting should be false.
xml.fix_orig_tags true to fix <orig> tags in XML texts, false to leave them alone. In general, if the input texts are in TEI-Analytics format, this setting should be false.
xml.fix_split_words true to fix split words in XML texts, false to leave them alone. The match patterns are regular expressions specified by the settings xml.fix_split_words.match1, xml.fix_split_words.match2, etc. The corresponding corrections are specified by the settings xml.fix_split_words.replace1, xml.fix_split_words.replace2, etc. These patterns may actually be used for more general purposes than splitting or joining words. Examples of these settings may be found in the eme.properties settings file in the MorphAdorner release.
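The numbered match/replace pairs can be pictured as an ordered list of regular expression substitutions. The Python sketch below is an illustration only. The pattern shown is hypothetical and is not taken from eme.properties, and since MorphAdorner is a Java program its patterns presumably follow Java regular expression syntax, where group references in replacements are written $1 rather than the \1 used here.

```python
import re


def load_fix_rules(settings, prefix="xml.fix_split_words"):
    """Collect (pattern, replacement) pairs from match1/replace1, match2/replace2, ..."""
    rules = []
    if settings.get(prefix, "false") != "true":
        return rules
    index = 1
    while f"{prefix}.match{index}" in settings:
        pattern = re.compile(settings[f"{prefix}.match{index}"])
        replacement = settings.get(f"{prefix}.replace{index}", "")
        rules.append((pattern, replacement))
        index += 1
    return rules


def apply_fix_rules(text, rules):
    """Apply each substitution in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical rule: rejoin a word hyphenated across a line break.
settings = {
    "xml.fix_split_words": "true",
    "xml.fix_split_words.match1": r"(\w+)-\s*\n\s*(\w+)",
    "xml.fix_split_words.replace1": r"\1\2",
}

fixed = apply_fix_rules("exam-\nple text", load_fix_rules(settings))
```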
xml.id.attribute The name of the XML word ID attribute. The default value is xml:id.
xml.id.spacing This setting gives the spacing between ID values. For example, an increment of 10 spaces reading_context_order or wordinblock values by 10. This allows new values to be interpolated for editing purposes. The default value is 10.
xml.id.type The type of word ID to generate. Word IDs start with the work identifier, taken from the file name of the work. The following types are available.
  • reading_context_order appends integer values whose order gives the reading context order defined by the classification of hard, soft, and jump tags.
  • word_within_page_block appends two integer values in the form pageblocknumber-wordinblock, where pageblocknumber is the ordinal of the current pb (page break) entry, and wordinblock is the number of the word within the page block (starting at 1).
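A small Python sketch may help show why spaced ID values are useful. It is not MorphAdorner's ID generator. The hyphen between the work identifier and the appended values, the multiplication of the word position by the spacing, and the function name are all assumptions made for illustration.

```python
def word_within_page_block_id(work_id, page_block_number, word_in_block, spacing=10):
    """Build an ID of the assumed form workid-pageblocknumber-wordinblock."""
    return f"{work_id}-{page_block_number}-{word_in_block * spacing}"


# The first three words of page block 3 in a hypothetical work "aaa".
ids = [word_within_page_block_id("aaa", 3, n) for n in (1, 2, 3)]
print(ids)
```

With a spacing of 10, a word added by hand between the first and second words could take a value such as 15 without renumbering its neighbors.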
xml.ignore_tag_case true to ignore the case of XML tags when processing them, false to consider different tag case significant. The default is true.
xml.jump_tags The list of XML jump tags, separated by blanks.

MorphAdorner uses the following jump tags for the default TEI-Analytics XML input files.

bibl figdesc figDesc figure footnote note ref stage tailnote
xml.log true to enable extended logging, false otherwise. The default is false.
xml.output_nonredundant_attributes_only true to emit only non-redundant word tag attributes, false to emit all word attributes. A word attribute is redundant if its value can be determined from the data enclosed by the tags or from another tag value. By default MorphAdorner emits all word tag values even if redundant.
xml.output_nonredundant_eos_attribute true to emit only non-redundant eos attributes, false to emit all eos attributes. By default MorphAdorner emits all eos values even if redundant, assuming that the adorner.output.end_of_sentence_flag is true, and the xml.use_pc_to_mark_end_of_sentence is false.
xml.output_nonredundant_part_attribute true to emit only non-redundant part attributes, false to emit all part attributes. By default MorphAdorner emits all part values even if redundant.
xml.output_nonredundant_token_attribute true to emit only non-redundant token attributes, false to emit all token attributes. A redundant token attribute specifies the same text as the data enclosed by the tags. By default MorphAdorner emits all token values even if redundant.
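The notion of a redundant token attribute can be sketched in Python as follows. This is an illustration, not MorphAdorner's writer: word attributes are modeled as a plain dict, the function name is hypothetical, and tok is used because it is the default original token attribute name.

```python
def word_attributes(text, attributes, nonredundant_token_only=False,
                    token_attribute="tok"):
    """Drop the token attribute when it merely repeats the enclosed text."""
    emitted = dict(attributes)
    if nonredundant_token_only and emitted.get(token_attribute) == text:
        del emitted[token_attribute]
    return emitted


attributes = {"tok": "Poets", "lem": "poet", "pos": "n2"}

all_attributes = word_attributes("Poets", attributes)
trimmed = word_attributes("Poets", attributes, nonredundant_token_only=True)
```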
xml.output_pseudo_page_boundary_milestones true to emit XML pseudopage boundary milestone elements, false to not emit these milestones.
xml.output_whitespace_elements true to emit whitespace elements between word elements in XML, false to not emit these whitespace elements. This setting should be true in most cases.
xml.pseudo_page_container_div_types The list of XML tags which close a pseudopage, separated by blanks.

MorphAdorner uses the following div types for the default TEI-Analytics XML input files.

volume chapter sermon
xml.pseudo_page_size The maximum length in words of a pseudopage. The default is 300 words.
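As a simplified picture of the size limit, the Python sketch below cuts a word sequence into pseudopages of at most a given number of words. It deliberately ignores the container div types that can also close a pseudopage, and the function name is hypothetical.

```python
def pseudo_pages(words, page_size=300):
    """Split a word list into consecutive pseudopages of at most page_size words."""
    return [words[i:i + page_size] for i in range(0, len(words), page_size)]


# 650 placeholder words yield two full pseudopages and one partial one.
pages = pseudo_pages([f"w{n}" for n in range(650)])
page_lengths = [len(page) for page in pages]
```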
xml.soft_tags The list of XML soft tags, separated by blanks.

MorphAdorner uses the following soft tags for the default TEI-Analytics XML input files.

abbr add address author c cl corr date emph foreign gap hi l lb location m mentioned milestone money name num organization orig pb person phr reg rs s sb seg sic soCalled sub sup term time title unclear w zzzzsw
xml.surround_marker The marker character used internally for surrounding distinct segments of text. Default is Unicode character \ue501 . This should not be changed.
xml.tokenlabel.emit true to emit a token label which contains an image number for the current page, a letter for the current column on the page, and the word number within the column multiplied by the label spacing. This is used when adorning Text Creation Partnership texts to relate words to the source page images.
xml.tokenlabel.attribute The token label attribute name. The default is n.
xml.token.label.spacing Increment value for generating token labels. The default is 10.
xml.token.label.prependworkname Set to true to prepend the work name to the token label. The default is false; the work name will not be prepended to the token label.
xml.use_pc_to_mark_end_of_sentence true to mark the end of a sentence by adding a unit="sentence" attribute to a pc element. This is the default in MorphAdorner v2 (in v1, the eos attribute was used instead).
xml.word_delimiters Output word delimiters for adorned word output. The default is ASCII \r\n . This should not be changed.
xml.word_tag_name The name of the XML tag which is used to mark an adorned output word. The default is w.
xml.xml_schema The name of the default schema used for parsing an XML file when none appears in the XML text. For MorphAdorner, the default is the TEI-Analytics schema which appears at http://morphadorner.northwestern.edu/schemata/TEIAnalytics.rng .