NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
Adorning A Text

The MorphAdorner distribution comes packaged with Windows batch files and Unix/Linux script files to execute MorphAdorner for texts from several corpora. You may use these batch files as a basis for developing scripts to adorn other collections of texts.

The Linux/Unix scripts assume that the "java" command invokes the Java 1.6 (or later) run time environment. The standard Oracle and OpenJDK releases work fine. Other Java implementations may not work with MorphAdorner.
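Before running the scripts, it can help to confirm which Java release the "java" command actually invokes. The following is a minimal sketch, not part of the MorphAdorner distribution: the function names java_major_version and installed_java_version are hypothetical, and the parsing assumes the usual version string formats ("1.6.0_45" in the older numbering scheme, "11.0.2" in the newer one).

```shell
# java_major_version VERSION_STRING  (hypothetical helper, not part of MorphAdorner)
# Prints the major Java version for a version string.
# Old scheme "1.6.0_45": the major version is the second field (6).
# New scheme "11.0.2":   the major version is the first field (11).
java_major_version() {
    case $1 in
        1.*)
            v=${1#1.}                        # drop the leading "1."
            printf '%s\n' "${v%%[!0-9]*}"    # keep the leading digits only
            ;;
        *)
            printf '%s\n' "${1%%[!0-9]*}"    # keep the leading digits only
            ;;
    esac
}

# installed_java_version  (hypothetical helper)
# Prints the quoted version string reported by "java -version".
# Java writes this report to standard error, hence the 2>&1.
installed_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}
```

With these definitions, a test such as [ "$(java_major_version "$(installed_java_version)")" -ge 6 ] succeeds when the installed Java is release 1.6 or later.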

  1. The adorndocsouth.bat Windows batch file and the adorndocsouth Unix shell script execute MorphAdorner using data files suitable for adorning nineteenth century English language texts from the Documenting the American South collection. The texts must be encoded in TEI (Text Encoding Initiative) format using the utf-8 character set.

  2. The adornece.bat Windows batch file and the adornece Unix shell script execute MorphAdorner using data files suitable for adorning eighteenth century English language texts. The texts must be encoded in TEI (Text Encoding Initiative) format using the utf-8 character set.

  3. The adornncf.bat Windows batch file and the adornncf Unix shell script execute MorphAdorner using data files suitable for adorning nineteenth century English language fiction texts. The texts must be encoded in TEI (Text Encoding Initiative) format using the utf-8 character set.

  4. The adornncfa.bat Windows batch file and the adornncfa Unix shell script execute MorphAdorner using data files suitable for adorning nineteenth century English language fiction texts in which apostrophes are completely distinguished from left and right single quotes (e.g., the standard Unicode curly quote characters for left and right single quote are used, and the usual apostrophe character is reserved for actual apostrophes). The texts must be encoded in TEI (Text Encoding Initiative) format using the utf-8 character set.

  5. The adornecco.bat Windows batch file and the adornecco Unix shell script execute MorphAdorner using data files suitable for adorning eighteenth century English language texts. The texts must be encoded in TEI (Text Encoding Initiative) format using the utf-8 character set.

  6. The adorneme.bat Windows batch file and the adorneme Unix shell script execute MorphAdorner using data files suitable for adorning early modern English language texts. The texts must be encoded in TEI (Text Encoding Initiative) format or the EEBO/TCP format using the utf-8 character set.

  7. The adornplainemetext.bat Windows batch file and the adornplainemetext Unix shell script execute MorphAdorner using the early modern English data files. The input texts must be plain text files (no XML markup) encoded using the utf-8 character set. Plain ASCII files qualify, since ASCII is a subset of utf-8.

  8. The adornplaintext.bat Windows batch file and the adornplaintext Unix shell script execute MorphAdorner using the nineteenth century fiction data files. The input texts must be plain text files (no XML markup) encoded using the utf-8 character set. Plain ASCII files qualify, since ASCII is a subset of utf-8.

  9. The adornwright.bat Windows batch file and the adornwright Unix shell script execute MorphAdorner using data files suitable for adorning nineteenth century texts from the Wright fiction archive. This script is probably suitable for other American texts of the nineteenth century. The texts must be encoded in TEI (Text Encoding Initiative) format using the utf-8 character set.

The Unix shell scripts should work with little or no modification under Mac OS X.

For example, to adorn a nineteenth century fiction text on a Windows system, open a command line prompt and move to the MorphAdorner installation directory. Then type the following command:

adornncf \outputdir \inputdir\mytext.xml

where \outputdir specifies the name of a directory into which to write the adorned xml output, and \inputdir\mytext.xml specifies the file name of the text to adorn. The output file name will be the same as the input file name. However, if a file of that name already exists in the output directory, a "versioned" file name will be created to avoid overwriting the existing file. For example, should the file "mytext.xml" already exist in the output directory, the output file name will be changed to "mytext-001.xml". More generally, the three digit version number starts at "001" and is incremented as necessary to produce a non-existing file name.
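The versioning rule can be sketched as a small shell function. This is an illustration of the rule as described here, not MorphAdorner's own code: the function name next_output_name is hypothetical, and the behavior past version "999" is not specified in this document.

```shell
# next_output_name DIR NAME  (hypothetical helper, not part of MorphAdorner)
# Prints NAME if DIR/NAME does not exist. Otherwise prints the first
# "stem-NNN.ext" name, counting up from 001, that does not exist in DIR.
next_output_name() {
    dir=$1
    name=$2
    case $name in
        *.*) stem=${name%.*}; ext=.${name##*.} ;;   # split off the extension
        *)   stem=$name; ext= ;;                    # no extension present
    esac
    candidate=$name
    n=0
    while [ -e "$dir/$candidate" ]; do
        n=$((n + 1))
        candidate=$(printf '%s-%03d%s' "$stem" "$n" "$ext")
    done
    printf '%s\n' "$candidate"
}
```

For an output directory that already holds mytext.xml and mytext-001.xml, next_output_name /outputdir mytext.xml prints mytext-002.xml.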

Alternatively, you can tell MorphAdorner not to readorn any text that already has a matching adorned version in the current output directory. See the description of the xml.adorn_existing_xml_files configuration setting for more details.

You may specify more than one file to adorn, and you may specify wildcards to match more than one file. For example:

adornncf \outputdir \inputdir\*.xml

adorns all the files with the extension .xml in the directory \inputdir.

On a Unix/Linux/Mac OS X system, open a terminal window, move to the MorphAdorner installation directory, and type the following command:

./adornncf /outputdir /inputdir/mytext.xml

Don't forget to mark the adornncf script file as executable before using it. On most Unix/Linux systems you can use the chmod command to do this:

chmod 755 adornncf

If you know for certain that the text you wish to adorn distinguishes the use of the apostrophe character (') from left and right single quotes (Unicode characters 0x2018 and 0x2019 respectively), you may use adornncfa instead of adornncf.
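As a first check, you can test whether a text contains the curly single quote characters at all. The sketch below is only a heuristic and the function name has_curly_single_quotes is hypothetical: a file with no curly quotes cannot be distinguishing them from apostrophes, but a file that does contain them may still use U+2019 as an apostrophe, so a positive result does not by itself justify adornncfa.

```shell
# has_curly_single_quotes FILE  (hypothetical helper, not part of MorphAdorner)
# Succeeds if FILE contains U+2018 or U+2019 encoded as utf-8.
has_curly_single_quotes() {
    left=$(printf '\342\200\230')     # U+2018, utf-8 bytes E2 80 98
    right=$(printf '\342\200\231')    # U+2019, utf-8 bytes E2 80 99
    # LC_ALL=C makes grep compare raw bytes regardless of the locale.
    LC_ALL=C grep -q -e "$left" -e "$right" "$1"
}
```

For example, has_curly_single_quotes /inputdir/mytext.xml || echo "use adornncf" reports texts for which adornncf is the appropriate choice.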

To adorn an early modern English text, substitute adorneme for adornncf in the command line. To adorn plain text using the nineteenth century fiction data files, substitute adornplaintext for adornncf in the command line.

MorphAdorner writes a log of its activities to standard system output, which is usually the display. You may redirect standard output to a file in the usual fashion. For example, under Windows, to redirect the MorphAdorner log output to a disk file, type:

adornncf \outputdir \inputdir\mytext.xml >myoutput.lis

where myoutput.lis is the name of the file to which to redirect MorphAdorner's logging output. If you have the tee utility installed, you can write the output to a file and watch it on your screen at the same time:

adornncf \outputdir \inputdir\mytext.xml | tee myoutput.lis

The tee utility is provided by default on most Unix/Linux and Mac OS X systems. It is not a standard part of Microsoft Windows operating systems, but third party Windows implementations are available. You may download a Windows implementation of tee as tee.zip. Use your favorite unzip program to extract tee.exe from tee.zip, and place tee.exe in the MorphAdorner installation directory.

Java OutOfMemory Errors

Each of the batch and script files above invokes MorphAdorner with a Java virtual machine size of 1024 megabytes. This means your computer needs at least one gigabyte of memory. The 1024 megabyte size is sufficient for adorning texts containing up to a quarter of a million words or so. Longer texts may require a larger Java virtual machine memory allocation. If you see the error message

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

in the MorphAdorner output log, you need to specify a larger heap space setting to Java. MorphAdorner is a memory intensive program, especially when adorning large XML encoded texts.
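If you redirected the log to a file, you can search it for this message rather than reading it by eye. A minimal sketch, assuming the log was saved as described earlier; the function name log_has_heap_error is hypothetical.

```shell
# log_has_heap_error LOGFILE  (hypothetical helper, not part of MorphAdorner)
# Succeeds if LOGFILE contains the Java heap space error message.
# -F treats the pattern as a fixed string, so the dots match literally.
log_has_heap_error() {
    grep -qF 'java.lang.OutOfMemoryError: Java heap space' "$1"
}
```

For example, log_has_heap_error myoutput.lis && echo "increase the heap size" flags a run that needs a larger setting.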

If your system has more than a gigabyte of memory installed, you can raise the Java virtual machine size by modifying the value of the java -Xmx1024m parameter in the batch file or script you are using. To specify a larger heap size, e.g., 1,500 megabytes, change java -Xmx1024m to java -Xmx1500m . However, a 32 bit operating system may not allow a heap size of two gigabytes or more, even when your system has more than two gigabytes of memory. You will need to experiment with different heap size settings to find the maximum your particular system allows.
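On Unix/Linux systems the change can also be made with sed instead of a text editor. The sketch below assumes only that the script contains a literal -Xmx setting such as -Xmx1024m, as described above; the function name with_heap_size is hypothetical, and the copy is written to standard output so the original script is left untouched.

```shell
# with_heap_size SCRIPT SIZE  (hypothetical helper, not part of MorphAdorner)
# Prints a copy of SCRIPT in which every -Xmx setting is replaced by -XmxSIZE.
# SIZE uses the usual Java notation, for example 1500m or 8g.
with_heap_size() {
    sed "s/-Xmx[0-9][0-9]*[kKmMgG]\{0,1\}/-Xmx$2/g" "$1"
}
```

For example, with_heap_size adornncf 1500m > adornncf1500 followed by chmod 755 adornncf1500 produces an executable copy of the script that requests a 1,500 megabyte heap.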

For large texts containing millions of words you may need to run MorphAdorner on a system with a 64 bit version of Java. For example, we found that several of the longest texts in the EEBO collection required a virtual machine size of several gigabytes, e.g., java -Xmx8g for an eight gigabyte size.

If you encounter the OutOfMemory error when running a MorphAdorner utility program, you can modify the heap size setting in the batch file or script for that program as well.
