Northwestern University Information Technology
MorphAdorner V2.0
If you just want to tokenize XML texts rather than fully adorn them, select the batch file or script which most nearly matches the type of corpus you have, and add the "-k" command line option to the MorphAdorner invocation. This causes MorphAdorner to emit only the tokenized text: that is, the <w> and <pc> elements with word IDs included, along with the whitespace marker <c> elements. The other word-level adornments are not output. See MorphAdorner Command Line Syntax for details on the MorphAdorner command line.
As an example, let's modify the adornncf script so that it only tokenizes XML files rather than fully adorning them.
#!/bin/sh
java -Xmx1024m -Xss1m -cp .:bin/:dist/*:lib/* \
  edu.northwestern.at.morphadorner.MorphAdorner \
  -p ncf.properties \
  -l data/ncflexicon.lex \
  -t data/ncftransmat.mat \
  -u data/ncfsuffixlexicon.lex \
  -a data/ncfmergedspellingpairs.tab \
  -s data/standardspellings.txt \
  -w data/spellingsbywordclass.txt \
  -o $1 \
  $2 $3 $4 $5 $6 $7 $8 $9
All we need to do is to add the "-k" command line parameter and save the script to a new file, say, tokenizencf:
#!/bin/sh
java -Xmx1024m -Xss1m -cp .:bin/:dist/*:lib/* \
  edu.northwestern.at.morphadorner.MorphAdorner \
  -k \
  -p ncf.properties \
  -l data/ncflexicon.lex \
  -t data/ncftransmat.mat \
  -u data/ncfsuffixlexicon.lex \
  -a data/ncfmergedspellingpairs.tab \
  -s data/standardspellings.txt \
  -w data/spellingsbywordclass.txt \
  -o $1 \
  $2 $3 $4 $5 $6 $7 $8 $9
You can modify the Windows batch file similarly.
java -Xmx1024m -Xss1m -cp bin\;dist\*;lib\*; ^
  edu.northwestern.at.morphadorner.MorphAdorner ^
  -k -p ncf.properties ^
  -l data/ncflexicon.lex ^
  -t data/ncftransmat.mat ^
  -u data/ncfsuffixlexicon.lex ^
  -a data/ncfmergedspellingpairs.tab ^
  -s data/standardspellings.txt ^
  -w data/spellingsbywordclass.txt ^
  -o %1 ^
  %2 %3 %4 %5 %6 %7 %8 %9
For the Unix script, remember to make it executable.
chmod 755 tokenizencf
You can now tokenize one or more TEI XML files by invoking tokenizencf:
./tokenizencf /mytokenizedfiles /myteifiles/*.xml
MorphAdorner tokenizes all the TEI XML files in the directory /myteifiles and writes the tokenized versions to the directory /mytokenizedfiles.
The equivalent Windows command is:
tokenizencf \mytokenizedfiles \myteifiles\*.xml
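One detail of the Unix script is worth keeping in mind: it forwards only $2 through $9 to MorphAdorner, so when the shell expands a wildcard such as /myteifiles/*.xml, at most eight file names reach MorphAdorner per invocation. The following is a minimal sketch of one way to work around that by running the script in batches of eight. The function name tokenize_in_batches and the TOKENIZER variable are hypothetical, and the sketch assumes a find and xargs that support -maxdepth, -print0, and -0 (as the GNU and BSD versions do).

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner.
# tokenize_in_batches OUT_DIR IN_DIR
# Runs the tokenizing script once per group of up to eight XML files,
# because the script only forwards $2..$9 to MorphAdorner.
tokenize_in_batches() {
    out_dir=$1
    in_dir=$2

    # TOKENIZER defaults to the tokenizencf script created above;
    # it can be overridden (for example with "echo") to preview the batches.
    find "$in_dir" -maxdepth 1 -name '*.xml' -print0 |
        xargs -0 -n 8 "${TOKENIZER:-./tokenizencf}" "$out_dir"
}
```

For example, tokenize_in_batches /mytokenizedfiles /myteifiles would invoke ./tokenizencf once for every eight files in /myteifiles. Each invocation starts a new Java virtual machine, so this trades some startup time for complete coverage of the directory.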
Here is a brief sample of tokenized TEI XML text.
<l>
  <w xml:id="K135834_000-000990">Or</w>
  <c> </c>
  <w xml:id="K135834_000-001000">those</w>
  <c> </c>
  <w xml:id="K135834_000-001010">whom</w>
  <c> </c>
  <w xml:id="K135834_000-001020">choice</w>
  <c> </c>
  <w xml:id="K135834_000-001030">and</w>
  <c> </c>
  <w xml:id="K135834_000-001040">common</w>
  <c> </c>
  <w xml:id="K135834_000-001050">good</w>
  <c> </c>
  <w xml:id="K135834_000-001060">ordain</w>
  <pc xml:id="K135834_000-001070" unit="sentence">.</pc>
</l>
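As a quick sanity check on tokenized output, you can count the token elements in a file. The sketch below is only illustrative: the function name count_tokens is hypothetical, and it relies on plain text matching rather than a real XML parser, so it assumes that <w> and <pc> elements always carry attributes (as in the sample above) and that whitespace markers are written as <c> with no attributes.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner.
# count_tokens FILE
# Prints how many <w>, <pc>, and <c> start tags appear in FILE.
count_tokens() {
    file=$1

    # grep -o prints each match on its own line, so wc -l counts matches.
    # '<w ' and '<pc ' require a following space, i.e. a tag with attributes.
    words=$(grep -o '<w ' "$file" | wc -l)
    puncts=$(grep -o '<pc ' "$file" | wc -l)
    spaces=$(grep -o '<c>' "$file" | wc -l)

    # $(( )) strips the padding some wc implementations add to the count.
    echo "words=$((words)) punctuation=$((puncts)) whitespace=$((spaces))"
}
```

For a file containing only the sample line above, count_tokens would print words=8 punctuation=1 whitespace=7.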
You can later fully adorn the tokenized files by passing them to MorphAdorner as input, for example with adornncf.
The tokenized format is useful if you wish to edit the tokenized texts before performing full adornment. See Processing Text Creation Partnership Files for an example of how this proved useful when processing those files.
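The tokenize, edit, then adorn workflow can be sketched as a small wrapper. This is only an illustration under stated assumptions: the function names run and two_pass and the DRY_RUN variable are hypothetical, and the sketch assumes the tokenizencf and adornncf scripts are in the current directory and take the output directory as their first argument, as shown earlier.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of MorphAdorner.

# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# two_pass TEI_DIR TOKENIZED_DIR ADORNED_DIR
two_pass() {
    tei_dir=$1
    tokenized_dir=$2
    adorned_dir=$3

    # Pass 1: tokenize only (tokenizencf adds the -k option).
    run ./tokenizencf "$tokenized_dir" "$tei_dir"/*.xml || return 1

    # Pause so the tokenized files can be edited by hand.
    if [ -z "${DRY_RUN:-}" ]; then
        printf 'Edit the files in %s, then press Enter: ' "$tokenized_dir"
        read -r reply
    fi

    # Pass 2: fully adorn the (possibly edited) tokenized files.
    run ./adornncf "$adorned_dir" "$tokenized_dir"/*.xml
}
```

Setting DRY_RUN=1 before calling, say, two_pass /myteifiles /mytokenizedfiles /myadornedfiles prints the two commands without running them, which is a cheap way to check the paths first. The directory /myadornedfiles is a placeholder.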
Academic Technologies and Research Services, NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.