Tokenizing an XML Text

If you only want to tokenize XML texts rather than fully adorn them, select the batch file or script that most nearly matches your type of corpus and add the "-k" command line option to the MorphAdorner invocation. This causes MorphAdorner to emit only the tokenized text: the <w> and <pc> elements with their word IDs, along with the whitespace marker <c> elements. The other word-level adornments are not output. See MorphAdorner Command Line Syntax for details on the MorphAdorner command line.

As an example, let's modify the adornncf script so that it only tokenizes XML files rather than fully adorning them.

#!/bin/sh
java -Xmx1024m -Xss1m -cp .:bin/:dist/*:lib/* \
        edu.northwestern.at.morphadorner.MorphAdorner \
        -p ncf.properties \
        -l data/ncflexicon.lex \
        -t data/ncftransmat.mat \
        -u data/ncfsuffixlexicon.lex \
        -a data/ncfmergedspellingpairs.tab \
        -s data/standardspellings.txt \
        -w data/spellingsbywordclass.txt \
        -o $1 \
        $2 $3 $4 $5 $6 $7 $8 $9

All we need to do is to add the "-k" command line parameter and save the script to a new file, say, tokenizencf:

#!/bin/sh
java -Xmx1024m -Xss1m -cp .:bin/:dist/*:lib/* \
        edu.northwestern.at.morphadorner.MorphAdorner \
        -k \
        -p ncf.properties \
        -l data/ncflexicon.lex \
        -t data/ncftransmat.mat \
        -u data/ncfsuffixlexicon.lex \
        -a data/ncfmergedspellingpairs.tab \
        -s data/standardspellings.txt \
        -w data/spellingsbywordclass.txt \
        -o $1 \
        $2 $3 $4 $5 $6 $7 $8 $9

You can modify the Windows batch file similarly.

java -Xmx1024m -Xss1m -cp bin\;dist\*;lib\*; ^
        edu.northwestern.at.morphadorner.MorphAdorner ^
        -k ^
        -p ncf.properties ^
        -l data/ncflexicon.lex ^
        -t data/ncftransmat.mat ^
        -u data/ncfsuffixlexicon.lex ^
        -a data/ncfmergedspellingpairs.tab ^
        -s data/standardspellings.txt ^
        -w data/spellingsbywordclass.txt ^
        -o %1 ^
        %2 %3 %4 %5 %6 %7 %8 %9

For the Unix script, remember to make it executable.

chmod 755 tokenizencf

You can now tokenize one or more TEI XML files by invoking tokenizencf:

./tokenizencf /mytokenizedfiles /myteifiles/*.xml

MorphAdorner tokenizes all the TEI XML files in the directory /myteifiles and writes the tokenized versions to the directory /mytokenizedfiles.

The equivalent Windows command is:

tokenizencf \mytokenizedfiles \myteifiles\*.xml
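
One caveat for the Unix script: as written, it forwards only the positional parameters $2 through $9, so when the shell expands the wildcard to more than eight file names, the ninth and later names never reach MorphAdorner. The following is a minimal sketch of one way around this, assuming you would rather call the script repeatedly than change it. The function name tokenize_in_batches is hypothetical and not part of MorphAdorner; it simply runs the given script with the output directory and at most eight files per call.

```shell
#!/bin/sh
# Hypothetical helper (not part of MorphAdorner): call a tokenizing script
# repeatedly, passing the output directory plus at most eight input files
# each time, so no file falls outside the script's $2..$9 window.
#
# Usage: tokenize_in_batches SCRIPT OUTPUT_DIR FILE...
tokenize_in_batches() {
    tokenizer=$1
    outdir=$2
    shift 2
    while [ "$#" -gt 0 ]; do
        if [ "$#" -ge 8 ]; then
            # A full batch of eight files.
            "$tokenizer" "$outdir" "$1" "$2" "$3" "$4" "$5" "$6" "$7" "$8"
            shift 8
        else
            # The final, shorter batch.
            "$tokenizer" "$outdir" "$@"
            shift "$#"
        fi
    done
}
```

After defining the function, a call such as tokenize_in_batches ./tokenizencf /mytokenizedfiles /myteifiles/*.xml would process the whole directory. Each batch starts a fresh Java process, so this is likely to be slower than a single run over the same files.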

Here is a brief sample of tokenized TEI XML text.

              <l>
                <w xml:id="K135834_000-000990">Or</w>
                <c> </c>
                <w xml:id="K135834_000-001000">those</w>
                <c> </c>
                <w xml:id="K135834_000-001010">whom</w>
                <c> </c>
                <w xml:id="K135834_000-001020">choice</w>
                <c> </c>
                <w xml:id="K135834_000-001030">and</w>
                <c> </c>
                <w xml:id="K135834_000-001040">common</w>
                <c> </c>
                <w xml:id="K135834_000-001050">good</w>
                <c> </c>
                <w xml:id="K135834_000-001060">ordain</w>
                <pc xml:id="K135834_000-001070" unit="sentence">.</pc>
              </l>
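
Because each token sits in its own <w> or <pc> element, simple line-oriented tools can pull out the token text for a quick check. The sketch below assumes the layout shown above, with one element per line. It is not a real XML parser: it does not decode character entities or handle elements that span lines. The function name list_tokens is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the text of each <w> and <pc> element,
# one token per line. Assumes one element per input line, as in the
# sample above. Reads the named files, or standard input if none given.
list_tokens() {
    sed -n \
        -e 's/.*<w [^>]*>\([^<]*\)<\/w>.*/\1/p' \
        -e 's/.*<pc [^>]*>\([^<]*\)<\/pc>.*/\1/p' \
        "$@"
}
```

For example, list_tokens /mytokenizedfiles/sample.xml | wc -l would give a rough token count for a file, where sample.xml stands in for one of your own tokenized files.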

You can later fully adorn the tokenized files by passing them as input to MorphAdorner using, for example, adornncf.

The tokenized format is useful if you wish to edit the tokenized texts before performing full adornment. See Processing Text Creation Partnership Files for an example of how this proved useful when processing those files.
