public class AdornedToSketch
extends java.lang.Object
AdornedToSketch converts adorned TEI XML files to the verticalized format required as input to the Sketch or NoSketch corpus search engines.
Usage:
java edu.northwestern.at.morphadorner.tools.adornedtosketch.AdornedToSketch sketchinput.txt corpusname adorned1.xml adorned2.xml ...
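where sketchinput.txt is the output file that receives the verticalized text, corpusname is the corpus name, and adorned1.xml adorned2.xml ... are the adorned TEI XML input files.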
Known flaw: AdornedToSketch does not generate the "glue" elements which bind punctuation marks to word tokens. Searching the corpus still works fine in the Sketch or NoSketch engine, but the punctuation marks are displayed detached from any token to which they would normally be attached.
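For orientation, the verticalized format puts one token per line with tab-separated attributes and marks structures such as documents and sentences with XML-like tags. The Sketch Engine format uses a `<g/>` line as the glue element between two tokens that should be displayed without intervening whitespace. The following is a minimal sketch of that layout, not the code AdornedToSketch uses: the class `VerticalSketch`, the `Token` type, the `glueBefore` flag, the choice of attribute columns (word, part of speech, lemma), and the sample tag values are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class VerticalSketch {

    /** Hypothetical token type; the three attribute columns are an assumption. */
    public static final class Token {
        final String word;
        final String pos;
        final String lemma;
        /** True when this token should attach to the previous one without a space. */
        final boolean glueBefore;

        public Token(String word, String pos, String lemma, boolean glueBefore) {
            this.word = word;
            this.pos = pos;
            this.lemma = lemma;
            this.glueBefore = glueBefore;
        }
    }

    /** Renders one document as vertical text: one token per line, tab-separated attributes. */
    public static String toVertical(String docId, List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        // docId is written as-is; a real converter would escape XML special characters.
        sb.append("<doc id=\"").append(docId).append("\">\n");
        sb.append("<s>\n");
        for (Token t : tokens) {
            if (t.glueBefore) {
                // Glue element: display the next token attached to the previous one.
                sb.append("<g/>\n");
            }
            sb.append(t.word).append('\t')
              .append(t.pos).append('\t')
              .append(t.lemma).append('\n');
        }
        sb.append("</s>\n");
        sb.append("</doc>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder tokens and tag values, chosen only to show the layout.
        List<Token> tokens = Arrays.asList(
            new Token("Hello", "uh", "hello", false),
            new Token(",", ",", ",", true),
            new Token("world", "n1", "world", false));
        System.out.print(toVertical("adorned1", tokens));
    }
}
```

Without the `<g/>` line, the comma in this sample would be displayed as a separate token with whitespace before it, which is the behavior the known flaw describes.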
Modifier and Type | Field and Description |
---|---|
protected static java.lang.String | corpusName: Corpus name. |
protected static int | currentDocNumber: Current document. |
protected static int | docsToProcess: Number of documents to process. |
protected static int | INITPARAMS: Number of parameters before the input file specifications. |
protected static java.lang.String | inputDirectory: Input directory. |
protected static java.lang.String | outputFile: Output file name. |
protected static java.io.PrintStream | outputFileStream: Output file stream. |
protected static java.io.PrintStream | printStream: Wrapper for printStream to allow UTF-8 output. |
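The usage line and the fields above suggest an argument layout of output file, corpus name, then input files, with output written through a UTF-8 print stream. The sketch below illustrates that reading only. The class `ArgsSketch`, its method names, and the value 2 for the parameter count are assumptions; the source does not state the value of INITPARAMS or show how the class opens its streams.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class ArgsSketch {

    /** Assumed count of leading parameters (output file, corpus name). */
    static final int INIT_PARAMS = 2;

    /** Returns the input file names that follow the leading parameters. */
    public static String[] inputFiles(String[] args) {
        if (args.length <= INIT_PARAMS) {
            return new String[0];
        }
        return Arrays.copyOfRange(args, INIT_PARAMS, args.length);
    }

    /** Wraps a stream so that text is encoded as UTF-8 regardless of the platform default. */
    public static PrintStream utf8Stream(OutputStream out)
            throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Sample arguments taken from the usage line.
        String[] demo = {"sketchinput.txt", "corpusname", "adorned1.xml", "adorned2.xml"};

        // An in-memory buffer stands in for the output file.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream ps = utf8Stream(buffer);
        ps.println("output file: " + demo[0]);
        ps.println("corpus name: " + demo[1]);
        for (String file : inputFiles(demo)) {
            ps.println("input file: " + file);
        }
        ps.flush();

        System.out.print(buffer.toString("UTF-8"));
    }
}
```

Naming the encoding explicitly matters for corpora containing non-ASCII characters, since a `PrintStream` created without a charset uses the platform default.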
Constructor and Description |
---|
AdornedToSketch() |
Modifier and Type | Method and Description |
---|---|
protected static boolean | initialize(java.lang.String[] args): Initialize. |
static void | main(java.lang.String[] args): Main program. |
protected static int | processFiles(java.lang.String[] args): Process files. |
protected static void | processOneFile(java.lang.String xmlFileName): Process one file. |
protected static java.lang.String[] | splitPath(java.lang.String path): Split word path into separate tags. |
protected static java.lang.String[] | splitPathFull(java.lang.String path): Split word path into separate tags. |
protected static void | terminate(int filesProcessed, long processingTime): Terminate. |
protected static int docsToProcess
protected static int currentDocNumber
protected static java.lang.String inputDirectory
protected static java.lang.String outputFile
protected static java.io.PrintStream outputFileStream
protected static java.io.PrintStream printStream
protected static java.lang.String corpusName
protected static final int INITPARAMS
public static void main(java.lang.String[] args)
Parameters:
args - Program parameters.
protected static boolean initialize(java.lang.String[] args) throws java.lang.Exception
Throws:
java.lang.Exception
protected static java.lang.String[] splitPathFull(java.lang.String path)
Parameters:
path - The word path.
protected static java.lang.String[] splitPath(java.lang.String path)
Parameters:
path - The word path.
protected static void processOneFile(java.lang.String xmlFileName)
Parameters:
xmlFileName - Adorned XML file name to reformat for CWB.
protected static int processFiles(java.lang.String[] args) throws java.lang.Exception
Throws:
java.lang.Exception
protected static void terminate(int filesProcessed, long processingTime)
Parameters:
filesProcessed - Number of files processed.
processingTime - Processing time in seconds.