Northwestern University Information Technology |
MorphAdorner V2.0 | Site Map |
The configuration settings for MorphAdorner appear in utf-8 text files. Each setting takes the form setting=value and appears on a separate line in the configuration file. The default settings file is called morphadorner.properties. To change the default settings file name, specify the alternative default using the d= parameter on the MorphAdorner Command Line.
Overriding settings may be specified in a file named by the p= parameter on the MorphAdorner Command Line. A number of sample settings files are provided in the MorphAdorner release materials. Each corresponds to the settings used when adorning files in one of the corpora from projects which used MorphAdorner.
Properties file | Corpus |
---|---|
docsouth.properties | Documenting the American South XML files. |
eaf.properties | Early American Fiction XML files. |
ecco.properties | Eighteenth Century Collections Online XML files. |
ece.properties | Eighteenth century English XML files. |
eme.properties | Early Modern English XML files. |
emeplaintext.properties | Early Modern English plain text files. |
ncf.properties | Nineteenth Century Fiction XML files in which the apostrophe, opening single quote, and closing single quote are the same character. |
ncfa.properties | Nineteenth Century Fiction XML files in which the apostrophe is distinguished from the opening and closing single quote characters. |
plaintext.properties | Plain text of nineteenth century vintage or later. |
wright.properties | Wright Fiction Archive XML files. |
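The layering of default and overriding settings described above can be illustrated with a short Python sketch. It is a simplified illustration, not MorphAdorner's own loader: the function names `parse_settings` and `merge_settings` are hypothetical, the sample values are made up, and skipping blank lines and `#` comment lines is an assumption about the file format.

```python
def parse_settings(text):
    """Parse setting=value lines into a dict, one setting per line."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and lines starting with '#' are skipped.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so a value may itself contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a setting=value line; ignored in this sketch
        settings[name.strip()] = value.strip()
    return settings


def merge_settings(defaults, overrides):
    """Return a new dict in which overriding settings replace defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Hypothetical contents of a default file and an overriding file.
defaults = parse_settings("adorner.handle_xml=true\nxml.id.spacing=10\n")
overrides = parse_settings("# hypothetical override file\nxml.id.spacing=5\n")
merged = merge_settings(defaults, overrides)
print(merged)
```

Settings absent from the overriding file keep their default values, so an override file only needs to list the settings that differ.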
The following table lists the setting names and their definitions, along with typical values.
Setting Name | Description and Values |
---|---|
abbreviations.abbreviations_url | Specifies the URL for an extra list of abbreviations. Such a list for Early Modern English texts may be found in data/emeabbreviations. |
abbreviations.main.abbreviations_url | Specifies the URL for an extra list of abbreviations to be used only for main text. |
abbreviations.side.abbreviations_url | Specifies the URL for an extra list of abbreviations to be used only for side text, e.g., paratext. |
adornedwordoutputter.class | Class which produces adorned word values. The following output classes are currently implemented in MorphAdorner. |
adorner.handle_xml | true to use the TEI XML handler, false to use the ordinary text handler. |
adorner.lemmatization.ignorelexiconentries | true to ignore lemma definitions in the current lexicon file when generating output lemmata, and use only the current lemmatizer. false to look at the lemma definitions in the lexicon first, and use the lemmatizer only when there is no lemma definition in the lexicon. |
adorner.output.end_of_sentence_flag | true to output an end of sentence flag for each adorned word, false to not generate this flag. The attribute value is set to "1" when a word ends a sentence and "0" otherwise. |
adorner.output.end_of_sentence_flag_attribute | The name of the XML word attribute for the end of sentence flag. The default value is eos. |
adorner.output.kwic | true to output keyword in context (kwic) entries for each adorned word, false to not generate these entries. |
adorner.output.kwic.width | The number of characters of kwic text to output. 80 is a typical value, which is split between the left and right kwic text. |
adorner.output.kwic_left_attribute | The name of the XML word attribute for the kwic text appearing before a word. The default value is kl. |
adorner.output.kwic_right_attribute | The name of the XML word attribute for the kwic text appearing after a word. The default value is kr. |
adorner.output.lemma | true to output the lemma for an adorned word, false otherwise. |
adorner.output.lemma_attribute | The name of the XML word attribute for the lemmata of an adorned word. The default value is lem. |
adorner.output.original_token | true to output the original word token for an adorned word, false otherwise. |
adorner.output.original_token_attribute | The name of the XML word attribute for the original word token of an adorned word. The default value is tok. |
adorner.output.part_of_speech | true to output the part of speech for an adorned word, false otherwise. |
adorner.output.part_of_speech_attribute | The name of the XML word attribute for the part of speech of an adorned word. The default value is pos. |
adorner.output.running_word_numbers | true to output the word numbers for adorned words as continuously ascending values. false to restart the word numbers over for each sentence. |
adorner.output.sentence_number | true to output the sentence number for an adorned word, false otherwise. |
adorner.output.sentence_number_attribute | The name of the XML word attribute for the sentence number for an adorned word. The default value is sn. |
adorner.output.spelling | true to output the spelling for an adorned word, false otherwise. |
adorner.output.spelling_attribute | The name of the XML word attribute for the spelling for an adorned word. The default value is spe. |
adorner.output.standard_spelling | true to output the standard spelling for an adorned word, false otherwise. |
adorner.output.standard_spelling_attribute | The name of the XML word attribute for the standard spelling for an adorned word. The default value is reg. |
adorner.output.word_number | true to output the word number for an adorned word, false otherwise. |
adorner.output.word_number_attribute | The name of the XML word attribute for the word number for an adorned word. The default value is wn. |
adorner.output.word_ordinal | true to output the word ordinal for an adorned word, false otherwise. |
adorner.output.word_ordinal_attribute | The name of the XML word attribute for the word ordinal for an adorned word. The default value is ord. |
corpus.name | The name of the corpus for this configuration. Usually a short string such as "ncf" for "nineteenth century fiction." Used by the MorphAdorner server when displaying the available configurations. The server ignores MorphAdorner configurations which do not have the corpus.name set. |
corpus.description | Longer description of the corpus for this configuration. Used by the MorphAdorner server when displaying the available configurations. The server ignores MorphAdorner configurations which do not have the corpus.description set. |
initialspellingstandardizer.class | The initial spelling standardizer class. This is used when guessing parts of speech for words not present in the lexicon. NoopSpellingStandardizer, the default, leaves spellings unstandardized when guessing parts of speech. |
lexicon.suffix_lexicon | The file containing the suffix lexicon. For the standard MorphAdorner release, the lexicon files appear in the data/ subdirectory. The 19th century fiction suffix lexicon is data/ncfsuffixlexicon.lex and the Early Modern English suffix lexicon is data/emesuffixlexicon.lex. This value may be overridden on the MorphAdorner command line by the -u parameter. |
lexicon.word_lexicon | The file containing the word lexicon. For the standard MorphAdorner release, the lexicon files appear in the data/ subdirectory. The 19th century fiction word lexicon is data/ncfwordlexicon.lex and the Early Modern English word lexicon is data/emewordlexicon.lex. This value may be overridden on the MorphAdorner command line by the -l parameter. |
morphadornerxmlwriter.class | The class for writing adorned XML files. DefaultMorphAdornerXMLWriter is the default. This should not be changed unless you implement a new XML writer class. |
namestandardizer.class | The proper name standardizer class. DefaultNameStandardizer is the default, which implements the scheme described in Standardizing Proper Names. The NoopNameStandardizer class leaves names unstandardized. The EEBOSimpleNameStandardizer class corrects a handful of names when processing early modern English texts. |
partofspeechguesser.check_possessives | true to check for possessive endings when guessing the part of speech for an unknown word, false otherwise. The default setting is false, which is also the recommended setting. |
partofspeechguesser.class | The part of speech guesser class, which tries to determine the most likely parts of speech for an unknown word. DefaultPartOfSpeechGuesser is the default, and is designed for English words. |
partofspeechguesser.try_standard_spellings | true to use standard spellings when guessing the parts of speech for unknown words, false to use the original spellings only. The default setting is true. |
partofspeechretagger.class | The class which corrects the initial part of speech tagging. The IRetagger class applies a short list of fixup rules to improve the tagging of I tokens. The NoopRetagger class leaves the original tagging unchanged. The PronounRetagger class applies a short list of fixup rules to improve the tagging of pronouns. The DefaultPartOfSpeechRetagger is the same as IRetagger. |
partofspeechtagger.class | The class which performs part of speech tagging. The default is TrigramTagger, which is a hidden Markov model based trigram tagger. This is the workhorse tagger in MorphAdorner. Other taggers, mostly experimental, include: |
partofspeechtagger.transition_matrix | The file containing the tag transition probability matrix data. For the standard MorphAdorner release, these files appear in the data/ subdirectory. The 19th century fiction transition matrix file is data/ncftransmat.mat and the Early Modern English transition matrix file is data/emetransmat.mat. This value may be overridden on the MorphAdorner command line by the -t parameter. |
pretokenizer.class | The class which applies any pretokenization corrections to the text to prepare it for initial token extraction. The default is DefaultPreTokenizer which ensures that characters which should always be separate tokens are surrounded by whitespace. In general this class should always be used. The EEBOPreTokenizer was written to correct the text for EEBO texts before those texts were modified by Abbott to conform to TEI-Analytics standards. |
posttokenizer.class | The class which applies any tokenization corrections to the initial token extraction. The default is DefaultPostTokenizer. The EEBOPostTokenizer was written to correct tokens extracted from EEBO texts before those texts were modified by Abbott to conform to TEI-Analytics standards. |
sentencesplitter.class | The class which determines sentence boundaries. ICU4JBreakIteratorSentenceSplitter uses an ICU4J BreakIterator to identify candidate sentences. Several heuristics are used to correct the initial sentence identification for English sentences. The DefaultSentenceSplitter is the same as ICU4JBreakIteratorSentenceSplitter. |
spelling.spelling_pairs | The spelling data file which maps variant spellings to standard spellings. For the standard MorphAdorner release, these files appear in the data/ subdirectory. The 19th century fiction spelling map file is data/ncfmergedspellingpairs.tab and the Early Modern English spelling map file is data/ememergedspellingpairs.tab. This value may be overridden on the MorphAdorner command line by the -a parameter. |
spelling.spelling_pairs_by_word_class | The spelling data file which maps variant spellings to standard spellings by word class. For the standard MorphAdorner release, these files appear in the data/ subdirectory. The spelling map by word class file used for all periods is data/spellingsbywordclass.txt . This value may be overridden on the MorphAdorner command line by the -w parameter. |
spelling.standard_spellings | The spelling data file which list standard spellings. For the standard MorphAdorner release, this file is data/standardspellings.txt . This value may be overridden on the MorphAdorner command line by the -s parameter. |
spellingmapper.class | The spelling mapper class which maps spellings from one dialect to another. The USToBritishSpellingMapper maps United States spellings to British spellings, while BritishToUSSpellingMapper maps British spellings to United States spellings. |
spellingstandardizer.class | The class which maps variant spellings to standard spellings. The DefaultSpellingStandardizer class is the ExtendedSimpleSpellingStandardizer which uses spelling maps along with a few simple heuristics to find standard spellings given a variant spelling. The SimpleSpellingStandardizer class only uses spelling maps. The ExtendedSearchSpellingStandardizer implements the full scheme discussed at Spelling Standardization Process which can lead to exotically erroneous standard spellings in some cases. |
textinputter.class | The class which reads input text for adornment. The DefaultTextInputter class is the URLTextInputter which reads utf-8 text from a URL. The SimpleXMLTextInputter reads utf-8 text from a TEI or EEBO XML file. The DiskBasedXMLTextInputter also reads utf-8 text from a TEI or EEBO XML file, but divides the file into smaller sections which are stored in temporary disk files and adorned separately. This is useful for working with large XML input files. The FirstTokenURLTextInputter reads only the first token in each line from a URL. |
wordlists.use_latin_word_list | true to use an extended list of Latin words when adding part of speech tags to words, false to not use the extended list. |
wordtokenizer.class | Class which splits a sentence into word tokens. DefaultWordTokenizer is the default and is suitable for English text. |
xml.adorn_existing_xml_files | true to adorn XML files with an existing adorned version in the output directory, false to skip adornment for existing files. true is the default value. When set true and an existing adorned file exists, a versioned output file name is created to avoid overwriting the previous adorned version. For example, if the file "aaa.xml" is to be adorned, and the adorned version "aaa.xml" already exists in the output directory, then the file "aaa-001.xml" is created. If "aaa-001.xml" already exists, "aaa-002.xml" is created, and so on. |
xml.close_sentence_at_end_of_hard_tag | true to force a sentence to close at the end of a hard tag, false to allow a sentence to cross hard tags. In many literary texts sentences do cross hard tag (usually paragraph) boundaries, so this setting should be set false. |
xml.close_sentence_at_end_of_jump_tag | true to force a sentence to close at the end of a jump tag, false to allow a sentence to cross jump tags. This setting should generally be set to true. |
xml.disallow_word_elements_in | Specifies the XML elements in which to disallow generated word elements. Example: xml.disallow_word_elements_in=figDesc sic |
xml.field_delimiters | Field delimiters for adorned word output. The default is the Ascii tab character \t . This should not generally be changed. |
xml.fix_gap_tags | true to fix <gap> tags in XML texts, false to leave them alone. In general, if the input texts are in TEI-Analytics format, this setting should be false. |
xml.fix_orig_tags | true to fix <orig> tags in XML texts, false to leave them alone. In general, if the input texts are in TEI-Analytics format, this setting should be false. |
xml.fix_split_words | true to fix split words in XML texts, false to leave them alone. The match patterns are regular expressions specified by the settings xml.fix_split_words.match1, xml.fix_split_words.match2, etc. The corresponding corrections are specified by the settings xml.fix_split_words.replace1, xml.fix_split_words.replace2, etc. These patterns may actually be used for more general purposes than splitting or joining words. Examples of these settings may be found in the eme.properties settings file in the MorphAdorner release. |
xml.id.attribute | The name of the XML word ID attribute. The default value is xml:id. |
xml.id.spacing | This setting gives the spacing between ID values. For example, an increment of 10 spaces reading_context_order or wordinblock values by 10. This allows new values to be interpolated for editing purposes. The default value is 10. |
xml.id.type | Word IDs start with the work identifier, taken from the file name of the work. reading_context_order appends integer values whose order gives the reading context order defined by the classification of hard, soft, and jump tags. word_within_page_block appends two integer values in the form pageblocknumber-wordinblock, where pageblocknumber is the ordinal of the current |
xml.ignore_tag_case | true to ignore the case of XML tags when processing them, false to consider different tag case significant. The default is true. |
xml.jump_tags | The list of XML jump tags, separated by blanks. MorphAdorner uses the following jump tags for the default TEI-Analytics XML input files. |
xml.log | true to enable extended logging, false otherwise. The default is false. |
xml.output_nonredundant_attributes_only | true to emit only non-redundant word tag attributes, false to emit all word attributes. A word attribute is redundant if its value can be determined from the data enclosed by the |
xml.output_nonredundant_eos_attribute | true to emit only non-redundant eos attributes, false to emit all eos attributes. By default MorphAdorner emits all eos values even if redundant, assuming that the adorner.output.end_of_sentence_flag is true, and the xml.use_pc_to_mark_end_of_sentence is false. |
xml.output_nonredundant_part_attribute | true to emit only non-redundant part attributes, false to emit all part attributes. By default MorphAdorner emits all part values even if redundant. |
xml.output_nonredundant_token_attribute | true to emit only non-redundant token attributes, false to emit all token attributes. A redundant token attribute specifies the same text as the data enclosed by the |
xml.output_pseudo_page_boundary_milestones | true to emit XML pseudopage boundary milestone elements, false to not emit these milestones. |
xml.output_whitespace_elements | true to emit whitespace elements (e.g., |
xml.pseudo_page_container_div_types | The list of XML tags which close a pseudopage, separated by blanks. MorphAdorner uses the following soft tags for the default TEI-Analytics XML input files. |
xml.pseudo_page_size | The maximum length in words of a pseudopage. The default is 300 words. |
xml.soft_tags | The list of XML soft tags, separated by blanks. MorphAdorner uses the following soft tags for the default TEI-Analytics XML input files. |
xml.surround_marker | The marker character used internally for surrounding distinct segments of text. Default is Unicode character \ue501 . This should not be changed. |
xml.tokenlabel.emit | true to emit a token label which contains an image number for the current page, a letter for the current column on the page, and the word number within the column multiplied by the label spacing. This is used when adorning Text Creation Partnership texts to relate words to the source page images. |
xml.tokenlabel.attribute | The token label attribute name. The default is n. |
xml.token.label.spacing | Increment value for generating token labels. The default is 10. |
xml.token.label.prependworkname | Set to true to prepend the work name to the token label. The default is false; the work name will not be prepended to the token label. |
xml.use_pc_to_mark_end_of_sentence | true to add a unit="sentence" attribute to mark the end of a sentence. This is the default in MorphAdorner v2 (in v1, the eos attribute was used instead). |
xml.word_delimiters | Output word delimiters for adorned word output. The default is Ascii \r\n . This should not be changed. |
xml.word_tag_name | The name of the XML tag which is used to mark an adorned output word. The default is w. |
xml.xml_schema | The name of the default schema used for parsing an XML file when none appears in the XML text. For MorphAdorner, the default is the TEI-Analytics schema which appears at http://morphadorner.northwestern.edu/schemata/TEIAnalytics.rng . |
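The versioned output naming described for xml.adorn_existing_xml_files can be sketched in Python as follows. This is only an illustration of the naming pattern given in the table ("aaa.xml", then "aaa-001.xml", then "aaa-002.xml"), not MorphAdorner's actual implementation. The function name `versioned_output_name` is hypothetical, and the three-digit zero padding is inferred from the example names.

```python
import os


def versioned_output_name(file_name, existing_names):
    """Pick an output name that does not collide with existing files.

    file_name      -- name of the file to be adorned, e.g. "aaa.xml"
    existing_names -- set of names already present in the output directory
    """
    if file_name not in existing_names:
        return file_name
    stem, ext = os.path.splitext(file_name)
    version = 1
    while True:
        # Assumption: versions are written as three zero-padded digits.
        candidate = f"{stem}-{version:03d}{ext}"
        if candidate not in existing_names:
            return candidate
        version += 1


print(versioned_output_name("aaa.xml", {"aaa.xml", "aaa-001.xml"}))
```

With xml.adorn_existing_xml_files set to false, the file would simply be skipped instead, so no versioned name would be needed.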
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive Evanston, IL 60208. |