MorphAdorner: Tokenizing an XML Text

If you just want to tokenize XML texts rather than fully adorn them, select the batch file or script that most nearly matches your type of corpus, and add the "-k" command line option to the MorphAdorner invocation. This causes MorphAdorner to emit only the tokenized text: the <w> and <pc> elements with their word IDs, along with the whitespace marker <c> elements. The other word-level adornments are not output. See MorphAdorner Command Line Syntax for details on the MorphAdorner command line.

As an example, let's modify the adornncf script so that it only tokenizes XML files rather than fully adorning them.

#!/bin/sh
java -Xmx1024m -Xss1m -cp .:bin/:dist/*:lib/* \
        edu.northwestern.at.morphadorner.MorphAdorner \
        -p ncf.properties \
        -l data/ncflexicon.lex \
        -t data/ncftransmat.mat \
        -u data/ncfsuffixlexicon.lex \
        -a data/ncfmergedspellingpairs.tab \
        -s data/standardspellings.txt \
        -w data/spellingsbywordclass.txt \
        -o $1 \
        $2 $3 $4 $5 $6 $7 $8 $9

All we need to do is to add the "-k" command line parameter and save the script to a new file, say, tokenizencf:

#!/bin/sh
java -Xmx1024m -Xss1m -cp .:bin/:dist/*:lib/* \
        edu.northwestern.at.morphadorner.MorphAdorner \
        -k \
        -p ncf.properties \
        -l data/ncflexicon.lex \
        -t data/ncftransmat.mat \
        -u data/ncfsuffixlexicon.lex \
        -a data/ncfmergedspellingpairs.tab \
        -s data/standardspellings.txt \
        -w data/spellingsbywordclass.txt \
        -o $1 \
        $2 $3 $4 $5 $6 $7 $8 $9

You can modify the Windows batch file similarly.

java -Xmx1024m -Xss1m -cp bin\;dist\*;lib\*; ^
        edu.northwestern.at.morphadorner.MorphAdorner ^
        -k ^
        -p ncf.properties ^
        -l data/ncflexicon.lex ^
        -t data/ncftransmat.mat ^
        -u data/ncfsuffixlexicon.lex ^
        -a data/ncfmergedspellingpairs.tab ^
        -s data/standardspellings.txt ^
        -w data/spellingsbywordclass.txt ^
        -o %1 ^
        %2 %3 %4 %5 %6 %7 %8 %9

For the Unix script, remember to make it executable.

chmod 755 tokenizencf

You can now tokenize one or more TEI XML files by invoking tokenizencf:

./tokenizencf /mytokenizedfiles /myteifiles/*.xml

MorphAdorner tokenizes the TEI XML files in the directory /myteifiles and writes the tokenized versions to the directory /mytokenizedfiles.

The equivalent Windows command is:

tokenizencf \mytokenizedfiles \myteifiles\*.xml
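Note that the scripts as listed forward only the positional parameters $2 through $9 to MorphAdorner, so a wildcard that expands to more than eight files will have the extra files dropped. One way around this on Unix is to call the script repeatedly with at most eight files each time. The following is a minimal sketch, not part of MorphAdorner: the function name tokenize_in_batches is hypothetical, and it assumes a find and xargs that support -print0 and -0 (the GNU and BSD versions do).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MorphAdorner): run a tokenizing
# script over every .xml file in a directory, eight files per invocation,
# because the script forwards only $2..$9 to MorphAdorner.
tokenize_in_batches() {
    script=$1    # path to the script, e.g. ./tokenizencf
    outdir=$2    # output directory, passed to the script as $1
    indir=$3     # directory containing the TEI XML files

    # -print0 / -0 keep file names with spaces intact;
    # -n 8 caps each invocation at eight input files.
    find "$indir" -type f -name '*.xml' -print0 |
        xargs -0 -n 8 "$script" "$outdir"
}
```

With this function defined, tokenize_in_batches ./tokenizencf /mytokenizedfiles /myteifiles processes the whole directory. Unlike the wildcard form, find also descends into subdirectories of the input directory.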

Here is a brief sample of tokenized TEI XML text.

              <l>
                <w xml:id="K135834_000-000990">Or</w>
                <c> </c>
                <w xml:id="K135834_000-001000">those</w>
                <c> </c>
                <w xml:id="K135834_000-001010">whom</w>
                <c> </c>
                <w xml:id="K135834_000-001020">choice</w>
                <c> </c>
                <w xml:id="K135834_000-001030">and</w>
                <c> </c>
                <w xml:id="K135834_000-001040">common</w>
                <c> </c>
                <w xml:id="K135834_000-001050">good</w>
                <c> </c>
                <w xml:id="K135834_000-001060">ordain</w>
                <pc xml:id="K135834_000-001070" unit="sentence">.</pc>
              </l>
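
As a quick sanity check on tokenized output like the sample above, you can count the word and punctuation elements in a file. This is only a rough sketch: the function name count_tokens is hypothetical, it uses plain text matching rather than an XML parser, and it assumes each <w> and <pc> start tag carries an attribute on the same line as the tag name, as in the sample. It also relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper: count <w> and <pc> start tags in a tokenized file.
# Text matching only; this does not parse the XML.
count_tokens() {
    # grep -o prints each match on its own line; wc -l counts them.
    words=$(grep -o '<w ' "$1" | wc -l)
    puncts=$(grep -o '<pc ' "$1" | wc -l)
    # $(( ... )) strips the padding some wc implementations add.
    echo "words=$((words)) punctuation=$((puncts))"
}
```

For the sample line shown above, this would report eight words and one punctuation mark.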

You can later fully adorn the tokenized files by passing them as input to MorphAdorner with a full adornment script such as adornncf.

The tokenized format is useful if you wish to edit the tokenized texts before performing full adornment. See Processing Text Creation Partnership Files for an example of how this proved useful when processing those files.
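
The two passes can be sketched as a small shell function. This is an illustration under stated assumptions rather than a supported workflow: the function name tokenize_then_adorn and the TOKENIZE and ADORN override variables are hypothetical, the three directories are placeholders, and it assumes the tokenized files keep their .xml file names in the output directory. As listed, the scripts forward at most eight input files per invocation, so the sketch suits small sets of files.

```shell
#!/bin/sh
# Hypothetical helper: tokenize TEI XML files, then fully adorn the
# tokenized output. TOKENIZE and ADORN may be set to override the
# default script paths.
tokenize_then_adorn() {
    teidir=$1      # directory of TEI XML input files
    tokdir=$2      # directory to receive the tokenized files
    adorndir=$3    # directory to receive the fully adorned files

    # Creating the output directories up front is a precaution.
    mkdir -p "$tokdir" "$adorndir" || return 1

    # Pass 1: tokenize only (the script passes -k to MorphAdorner).
    "${TOKENIZE:-./tokenizencf}" "$tokdir" "$teidir"/*.xml || return 1

    # Pass 2: full adornment, with the tokenized files as input.
    "${ADORN:-./adornncf}" "$adorndir" "$tokdir"/*.xml
}
```

If you want to edit the tokenized files before adornment, run the two script invocations separately instead and make your edits between them.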
