|
The MorphAdorner distribution comes packaged with Windows batch files
and Unix/Linux script files to execute MorphAdorner for the texts
contained in the Monk collection. You may use these batch files as a basis
for developing scripts to adorn other collections of texts.
The Linux/Unix scripts assume that
the "java" command invokes the standard Sun Java runtime environment,
not the GNU Java runtime. MorphAdorner does not run under the GNU Java
runtime environment. MorphAdorner requires Sun Java v1.5 or later.
The adorndocsouth.bat Windows batch file
and the adorndocsouth Unix shell script
execute MorphAdorner using data files suitable for adorning
nineteenth century English language texts from the
Documenting the American South collection. The
texts must be encoded in TEI (Text Encoding Initiative) format
using the utf-8 character set.
The adornncf.bat Windows batch file
and the adornncf Unix shell script
execute MorphAdorner using data files suitable for adorning
nineteenth century English language fiction texts. The
texts must be encoded in TEI (Text Encoding Initiative) format
using the utf-8 character set.
The adornncfa.bat Windows batch file
and the adornncfa Unix shell script
execute MorphAdorner using data files suitable for adorning
nineteenth century English language fiction texts in which
apostrophes are completely distinguished from left and right
single quotes (e.g., the standard Unicode curly quote
characters for left and right single quote are used, and
the usual apostrophe character is reserved for actual
apostrophes). The
texts must be encoded in TEI (Text Encoding Initiative) format
using the utf-8 character set.
The adornecco.bat Windows batch file
and the adornecco Unix shell script
execute MorphAdorner using data files suitable for adorning
eighteenth century English language texts. The
texts must be encoded in TEI (Text Encoding Initiative) format
using the utf-8 character set.
The adorneme.bat Windows batch file
and the adorneme Unix shell script
execute MorphAdorner using data files suitable for adorning
early modern English language texts. The
texts must be encoded in TEI (Text Encoding Initiative) format
or the EEBO/TCP format using the utf-8 character set.
The adornplainemetext.bat Windows batch file
and the adornplainemetext Unix shell script
execute MorphAdorner using the early modern English data files.
The input texts must be plain text files (plain ASCII included)
encoded using the utf-8 character set.
The adornplaintext.bat Windows batch file
and the adornplaintext Unix shell script
execute MorphAdorner using the nineteenth century
fiction data files. The input texts must be plain text files
(plain ASCII included) encoded using the utf-8 character set.
The adornwright.bat Windows batch file
and the adornwright Unix shell script
execute MorphAdorner using data files suitable for adorning
nineteenth century texts from the Wright fiction archive. This
script is probably suitable for other American texts of the
nineteenth century. The texts must be encoded in TEI
(Text Encoding Initiative) format using the utf-8 character set.
The Unix shell scripts should work with little or no modification under
Mac OSX.
For example, to adorn a nineteenth century fiction text on a Windows system,
open a command line prompt and move to the MorphAdorner
installation directory. Then type the following command:
adornncf \outputdir \inputdir\mytext.xml
where
\outputdir
specifies the name of a directory into which to write the adorned
xml output, and \inputdir\mytext.xml specifies the
file name of the text to adorn. The output file name will be the same as
the input file name. However, if a file of that name already exists in
the output directory, a "versioned" file name will be created to avoid
overwriting the existing file. For example, should the file "mytext.xml"
already exist in the output directory, the output file name will be changed
to "mytext-001.xml". More generally, the three digit version number
starts at "001" and is incremented as necessary to produce a non-existing
file name.
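The naming rule described above can be sketched as a small shell function. This is only an illustration of the rule as stated in this section, not MorphAdorner's own code. The function name next_output_name is hypothetical, and the sketch assumes the input file name has an extension.

```shell
#!/bin/sh
# Illustration only: mimics the "versioned" output naming rule described
# in the text. next_output_name is a hypothetical helper, not part of
# MorphAdorner. Assumes the file name has an extension (e.g. ".xml").
next_output_name() {
    outdir=$1
    file=$2
    base=${file%.*}     # name without the extension, e.g. "mytext"
    ext=${file##*.}     # extension, e.g. "xml"
    candidate=$file
    n=1
    # Keep trying -001, -002, ... until the name is unused in outdir.
    while [ -e "$outdir/$candidate" ]; do
        candidate=$(printf '%s-%03d.%s' "$base" "$n" "$ext")
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}
```

Calling next_output_name /outputdir mytext.xml prints the name the rule would pick for that directory, which can help when predicting where a rerun will write its output.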
Alternatively, you can tell MorphAdorner not to readorn texts that
already have a matching adorned version in the current output directory.
See the description of the xml.adorn_existing_xml_files
configuration setting
for more details.
You may specify more than one file to adorn, and you may specify
wildcards to match more than one file. For example:
adornncf \outputdir \inputdir\*.xml
adorns all the files with the extension .xml
in the directory \inputdir.
On a Unix/Linux/Mac OSX system, open a terminal window, move to
the MorphAdorner installation directory, and type the following
command:
./adornncf /outputdir /inputdir/mytext.xml
Don't forget to mark the adornncf script file as
executable before using it. On most Unix/Linux systems you can
use the chmod command to do this:
chmod 755 adornncf
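If you plan to use several of the scripts, a loop can mark them all executable in one step. The following is a minimal sketch. It assumes the Unix scripts all have names beginning with "adorn" and sit in one directory, and the helper name mark_adorn_scripts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: marks every adorn*
# script in a directory as executable, skipping Windows .bat files.
mark_adorn_scripts() {
    dir=$1
    for f in "$dir"/adorn*; do
        case $f in
            *.bat) continue ;;      # Windows batch files need no chmod
        esac
        if [ -f "$f" ]; then
            chmod 755 "$f"
        fi
    done
}
```

From the MorphAdorner installation directory you would call it as mark_adorn_scripts . (with a dot for the current directory).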
If you know for certain that the text you wish to adorn distinguishes
the use of the apostrophe character (') from left and right single
quotes (Unicode characters 0x2018 and 0x2019 respectively), you may
use adornncfa instead of adornncf.
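A quick first check is whether the text contains any curly single quotes at all. The sketch below searches a utf-8 file for the byte sequences of U+2018 and U+2019. The helper name has_curly_single_quotes is hypothetical. A match does not prove that apostrophes are consistently distinguished, since U+2019 is often used as an apostrophe too, so inspect the text before relying on adornncfa.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: succeeds if the file
# contains a left (U+2018) or right (U+2019) single quotation mark.
has_curly_single_quotes() {
    # utf-8 byte sequences for U+2018 and U+2019, written in octal
    lsq=$(printf '\342\200\230')
    rsq=$(printf '\342\200\231')
    # LC_ALL=C makes grep compare raw bytes regardless of locale
    LC_ALL=C grep -q -e "$lsq" -e "$rsq" "$1"
}
```

For example: if has_curly_single_quotes /inputdir/mytext.xml; then echo "curly quotes present"; fi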
To adorn an early modern English text, substitute
adorneme
for adornncf in the command line. To adorn plain text
using the nineteenth century fiction data files, substitute
adornplaintext for adornncf
in the command line.
MorphAdorner writes a log of its activities to standard output,
which is usually the display. You may redirect standard output to a
file in the usual fashion. For example, under Windows, to redirect the
MorphAdorner log output to a disk file, type:
adornncf \outputdir \inputdir\mytext.xml >myoutput.lis
where myoutput.lis is the
name of the file to which to redirect MorphAdorner's logging output.
If you have the tee utility installed, you can redirect
the output to a file and watch the output displayed to your screen
at the same time:
adornncf \outputdir \inputdir\mytext.xml | tee myoutput.lis
The tee utility is usually provided by default on most
Unix/Linux and Mac OSX systems. The tee utility
is not provided as a standard part of Microsoft Windows operating
systems. Third party Windows implementations are available. You may
download a Windows implementation of tee as
tee.zip.
Use your favorite unzip program to extract
tee.exe from tee.zip. Place
tee.exe in the MorphAdorner installation directory.
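On Unix/Linux and Mac OSX the same logging pattern can be wrapped in a small function. This is a sketch, and the helper name run_with_log is hypothetical. It also folds standard error into the log with 2>&1, on the assumption that you want Java error messages such as uncaught exceptions saved alongside the normal log output.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: runs a command, shows
# its output on the screen, and saves a copy to a log file.
run_with_log() {
    logfile=$1
    shift
    # 2>&1 sends standard error into the same pipe as standard output
    "$@" 2>&1 | tee "$logfile"
}

# Example (paths are placeholders, as elsewhere in this section):
#   run_with_log myoutput.lis ./adornncf /outputdir /inputdir/mytext.xml
```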
Java OutOfMemory Errors
Each of the batch and script files above invokes MorphAdorner with a
maximum Java heap size of 720 megabytes, so your machine should have
at least one gigabyte of memory. The 720 megabyte
size is sufficient for adorning texts containing up to a quarter of
a million words or so. Longer texts may require a larger Java heap
allocation. If you see the error message
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
in the MorphAdorner output log, you need to give Java a
larger heap size. MorphAdorner is a memory intensive
program, especially when adorning large XML encoded texts.
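If you saved the log to a file, you can search it for this message instead of reading it by eye. A minimal sketch follows. The helper name log_has_heap_error is hypothetical, and it assumes the log file actually captured the error message.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: succeeds if the saved
# log contains the Java heap space error quoted in the text.
log_has_heap_error() {
    grep -q 'java.lang.OutOfMemoryError: Java heap space' "$1"
}

# Example:
#   if log_has_heap_error myoutput.lis; then
#       echo "Heap too small: raise the -Xmx value and rerun."
#   fi
```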
If your system has more than a gigabyte of
memory installed, you can raise the maximum Java heap size by modifying
the value of the java -Xmx720m parameter in the
batch file or script you are using. For example, to specify a
heap size of 1,500 megabytes, change
java -Xmx720m to
java -Xmx1500m .
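On Unix/Linux systems the change can also be made with sed rather than a text editor. The sketch below assumes the script contains a single -Xmx setting written in the usual form (digits followed by a k, m, or g suffix). The helper name set_heap_size is hypothetical. It keeps a backup copy with a .bak suffix and rewrites the script in place, so the script keeps its executable permission.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: replaces the -Xmx value
# in a script, keeping the original as <script>.bak.
set_heap_size() {
    script=$1
    size=$2             # e.g. 1500m or 8g
    cp "$script" "$script.bak" &&
        sed "s/-Xmx[0-9][0-9]*[kKmMgG]/-Xmx$size/" "$script.bak" > "$script"
}

# Example:
#   set_heap_size adornncf 1500m
```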
However, on a 32 bit operating system you may not be able to request
a heap size that large, even when your system has more than two
gigabytes of memory installed.
You will need to experiment with different heap size settings to find the
maximum your particular system allows.
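One way to experiment without running a full adornment is to ask Java to start with a given -Xmx value and print its version. If the Java virtual machine cannot reserve that much heap, it exits with an error. The sketch below wraps this check. The helper name heap_size_accepted is hypothetical, and a passing check only shows that Java started, not that the whole heap will be usable during a long run.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: succeeds if Java starts
# with the requested maximum heap size. The optional second argument
# names the java command to use (defaults to "java").
heap_size_accepted() {
    size=$1             # e.g. 1500m
    java_cmd=${2:-java}
    "$java_cmd" "-Xmx$size" -version >/dev/null 2>&1
}

# Example: try a few sizes and report which ones Java accepts.
#   for s in 1000m 1500m 2000m; do
#       if heap_size_accepted "$s"; then echo "$s ok"; else echo "$s refused"; fi
#   done
```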
For large texts containing millions of words you may need
to run MorphAdorner on a system with a 64 bit version of Java. For
example, we found that several of the longest texts in the EEBO collection
required a heap size of several gigabytes, e.g.,
java -Xmx8g for an eight gigabyte size.
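To see whether the installed Java is a 64 bit version, you can inspect the output of java -version, which is written to standard error. The sketch below assumes a Sun HotSpot or OpenJDK style runtime whose version banner contains the string "64-Bit". Other Java runtimes may word the banner differently. The helper name java_is_64bit is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of MorphAdorner: succeeds if the version
# banner of the given java command (default "java") mentions "64-Bit".
# Assumes a HotSpot/OpenJDK style banner; other runtimes may differ.
java_is_64bit() {
    java_cmd=${1:-java}
    # java -version prints to standard error, so merge it into the pipe
    "$java_cmd" -version 2>&1 | grep -q '64-Bit'
}

# Example:
#   if java_is_64bit; then echo "64 bit Java found"; fi
```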
If you encounter the OutOfMemory error when running a
MorphAdorner utility program, you can modify the heap size setting in
the batch file or script for that program as well.
|