Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



Northwestern
MorphAdorner
  Correcting Quote Marks
 
 

FixXMLQuotes attempts to convert straight double quotes (ASCII/Unicode 34) into "curly" left and right double quotes (Unicode 8220 and 8221, respectively). It also attempts to convert straight single quotes (ASCII/Unicode 39) into "curly" left and right single quotes (Unicode 8216 and 8217, respectively) and to distinguish these from single quotes used as apostrophes. FixXMLQuotes makes mistakes, so its output should be reviewed and corrected manually. It accepts XML files in TEI format as input.
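To make the conversion concrete, here is a minimal Python sketch of one naive way to choose between the code points named above. It is not the algorithm FixXMLQuotes uses (that is not described here); the function name `curl_quotes` and the "opening after whitespace or a bracket" rule are assumptions made for illustration only.

```python
# Code points named in the text above.
LDQ, RDQ = "\u201c", "\u201d"  # left/right double quote (8220, 8221)
LSQ, RSQ = "\u2018", "\u2019"  # left/right single quote (8216, 8217)


def curl_quotes(text: str) -> str:
    """Naive heuristic for illustration; NOT the FixXMLQuotes algorithm."""
    out = []
    for i, ch in enumerate(text):
        if ch not in "\"'":
            out.append(ch)
            continue
        # Look at the previous *converted* character and the next raw one.
        prev = out[-1] if out else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        # Assumption: a quote opens after whitespace, a bracket, or another
        # opening quote; otherwise it closes.
        opening = prev.isspace() or prev in "([{" + LDQ + LSQ
        if ch == '"':
            out.append(LDQ if opening else RDQ)
        elif prev.isalnum() and nxt.isalnum():
            out.append(RSQ)  # apostrophe inside a word, as in "don't"
        else:
            out.append(LSQ if opening else RSQ)
    return "".join(out)


print(curl_quotes('"Don\'t go," she said.'))
```

A heuristic like this also shows why manual review is needed: a leading apostrophe, as in 'Twas, is taken for an opening single quote.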

Usage:

fixxmlquotes softtags.txt jumptags.txt outputdirectory input1.xml input2.xml ...

where

  • softtags.txt specifies a text file containing a list of soft XML tags, one per line. A sample is included as part of the MorphAdorner distribution.
  • jumptags.txt specifies a text file containing a list of jump XML tags, one per line. A sample is included as part of the MorphAdorner distribution.
  • outputdirectory specifies the output directory to receive the XML files with quote marks fixed.
  • input*.xml specifies the input TEI XML files.

For each of the input XML files, FixXMLQuotes attempts to correct the quotes and writes a corrected XML file of the same name in the specified output directory.
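Because each output file takes the name of its input, two inputs from different directories that share a file name would presumably map to the same output path. The sketch below, in Python, plans the output paths ahead of a run and flags such collisions. The helper `plan_outputs` is hypothetical, and the assumption that only the base file name is kept is an inference from the description above, not documented behavior.

```python
from pathlib import Path


def plan_outputs(output_dir, inputs):
    """Map each input path to the output path it would receive.

    Assumption: the output keeps only the input's base file name.
    Raises ValueError if two inputs would land on the same output file.
    """
    planned = {}
    seen = set()
    for name in inputs:
        target = Path(output_dir) / Path(name).name
        if target in seen:
            raise ValueError(f"{name} would overwrite {target}")
        seen.add(target)
        planned[name] = target
    return planned


for source, target in plan_outputs("fixed", ["a/one.xml", "b/two.xml"]).items():
    print(source, "->", target)
```

Running a check like this before invoking fixxmlquotes is optional; it only guards against one input silently replacing another's result.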

The companion FixQuotes program applies the same approach to correcting quote marks, but to plain text files instead of XML files.

Usage:

fixquotes input.txt output.txt

where

  • input.txt specifies the input text file with quote marks to correct.
  • output.txt specifies the output text file with quote marks fixed.

At best, fixxmlquotes and fixquotes convert 90% of the quotes correctly. The remainder must be corrected manually.
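As an aid to that manual pass, the following Python sketch lists the position of every straight quote still present in a piece of plain text. It assumes that at least some unresolved quotes are left as straight quotes, which is not stated above; quotes curled in the wrong direction would not be found this way. The function name `remaining_straight_quotes` is hypothetical. For XML output the markup itself would have to be skipped first, since attribute values are legitimately delimited by straight quotes.

```python
STRAIGHT = {'"', "'"}  # Unicode 34 and 39


def remaining_straight_quotes(text):
    """Return (line, column, character) for each straight quote in text.

    Line and column numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in STRAIGHT:
                hits.append((lineno, col, ch))
    return hits


sample = "\u201cFine,\u201d he said.\nIt\"s odd."
for lineno, col, ch in remaining_straight_quotes(sample):
    print(f"line {lineno}, column {col}: {ch}")
```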

 

Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Wed Apr 01 23:38:46 2009   © 2007, 2008 Northwestern University