NU IT
Northwestern University Information Technology
MorphAdorner Northwestern
 
MorphAdorner Server Services: Word Tokenizer Service

Service name: wordtokenizer
Service description: Split text into words and punctuation.
HTTP methods allowed: GET, POST, OPTIONS
POST accepts as input: application/x-www-form-urlencoded
HTTP return codes:
    200: service succeeded
    400: service failed with an error

Query parameters

    corpusConfig: Corpus configuration name. The standard distribution provides ece (Eighteenth Century English), eme (Early Modern English), and ncf (Nineteenth Century Fiction).
    media: Result format. One of json, xml, html, or text.
    text: Text to be processed.
    includeInputText: Set to true to include the input text in the output, or false to omit it.
    langCode: ISO language code, two or three characters long. The default is en (English). Choose *** Detect *** to have the server try to determine the language from the text provided; in the sample form below, that option submits an empty langCode value.
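
As a rough illustration of how these parameters fit together, the following Python sketch builds a form-encoded POST request for the service without sending it. The base URL is a placeholder and the function name build_tokenizer_request is hypothetical; only the parameter names, the wordtokenizer path, and the content type come from this page.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder address (an assumption): replace with your own MorphAdorner server.
BASE_URL = "http://example.org/morphadorner"


def build_tokenizer_request(text, corpus_config="ncf", media="json",
                            include_input_text=True, lang_code="en",
                            base_url=BASE_URL):
    """Build (but do not send) a POST request for the wordtokenizer service."""
    params = {
        "text": text,
        "corpusConfig": corpus_config,
        "media": media,
        "includeInputText": "true" if include_input_text else "false",
        "langCode": lang_code,
    }
    # The service accepts application/x-www-form-urlencoded input on POST.
    body = urlencode(params).encode("utf-8")
    return Request(
        base_url.rstrip("/") + "/wordtokenizer",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_tokenizer_request("Mary had a little lamb.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would send it. A 400 response would surface there as urllib.error.HTTPError, which the caller can catch to report the failure.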

Sample POST form

<form accept-charset="UTF-8" method="post" action="wordtokenizer"
      target="_blank"
      name="wordtokenizer">
<table cellpadding="0" cellspacing="5">
<tr>
<td><strong>Text:</strong></td>
<td colspan="2">
<textarea name="text" rows="15" cols="76"></textarea>
</td>
</tr>
<tr>
<td valign="top">
<strong>
Lexicon:</strong>
</td>
<td>
<input type="radio" name="corpusConfig" value="eme">Early Modern English</input><br />
<input type="radio" name="corpusConfig" value="ece">Eighteenth Century English</input><br />
<input type="radio" name="corpusConfig" value="ncf" checked="checked">Nineteenth Century Fiction</input>
</td>
</tr>
<tr>
<td><strong>Language:</strong></td>
<td>
<select name="langCode">
<option value="en" selected="selected">English</option>
<option value="">*** Detect ***</option>
<option value="af">Afrikaans</option>
<option value="ak">Akan</option>
<option value="sq">Albanian</option>
<option value="am">Amharic</option>
<option value="ar">Arabic</option>
<option value="hy">Armenian</option>
<option value="as">Assamese</option>
<option value="az">Azerbaijani</option>
<option value="bm">Bambara</option>
<option value="bas">Basa</option>
<option value="eu">Basque</option>
<option value="be">Belarusian</option>
<option value="bem">Bemba</option>
<option value="bn">Bengali</option>
<option value="bs">Bosnian</option>
<option value="br">Breton</option>
<option value="bg">Bulgarian</option>
<option value="my">Burmese</option>
<option value="ca">Catalan</option>
<option value="chr">Cherokee</option>
<option value="zh">Chinese</option>
<option value="kw">Cornish</option>
<option value="hr">Croatian</option>
<option value="cs">Czech</option>
<option value="da">Danish</option>
<option value="dua">Duala</option>
<option value="nl">Dutch</option>
<option value="eo">Esperanto</option>
<option value="et">Estonian</option>
<option value="ee">Ewe</option>
<option value="ewo">Ewondo</option>
<option value="fo">Faroese</option>
<option value="fil">Filipino</option>
<option value="fi">Finnish</option>
<option value="fr">French</option>
<option value="ff">Fulah</option>
<option value="gl">Gallegan</option>
<option value="lg">Ganda</option>
<option value="ka">Georgian</option>
<option value="de">German</option>
<option value="el">Greek</option>
<option value="kl">Greenlandic</option>
<option value="gu">Gujarati</option>
<option value="ha">Hausa</option>
<option value="haw">Hawaiian</option>
<option value="iw">Hebrew</option>
<option value="hi">Hindi</option>
<option value="hu">Hungarian</option>
<option value="is">Icelandic</option>
<option value="ig">Igbo</option>
<option value="in">Indonesian</option>
<option value="ga">Irish</option>
<option value="it">Italian</option>
<option value="ja">Japanese</option>
<option value="kab">Kabyle</option>
<option value="kam">Kamba</option>
<option value="kn">Kannada</option>
<option value="kk">Kazakh</option>
<option value="km">Khmer</option>
<option value="ki">Kikuyu</option>
<option value="rw">Kinyarwanda</option>
<option value="kok">Konkani</option>
<option value="ko">Korean</option>
<option value="lv">Latvian</option>
<option value="ln">Lingala</option>
<option value="lt">Lithuanian</option>
<option value="lu">Luba-Katanga</option>
<option value="mk">Macedonian</option>
<option value="mg">Malagasy</option>
<option value="ms">Malay</option>
<option value="ml">Malayalam</option>
<option value="mt">Maltese</option>
<option value="gv">Manx</option>
<option value="mr">Marathi</option>
<option value="mas">Masai</option>
<option value="ne">Nepali</option>
<option value="nd">North Ndebele</option>
<option value="nb">Norwegian Bokmål</option>
<option value="nn">Norwegian Nynorsk</option>
<option value="nyn">Nyankole</option>
<option value="or">Oriya</option>
<option value="om">Oromo</option>
<option value="pa">Panjabi</option>
<option value="fa">Persian</option>
<option value="pl">Polish</option>
<option value="pt">Portuguese</option>
<option value="ps">Pushto</option>
<option value="rm">Raeto-Romance</option>
<option value="ro">Romanian</option>
<option value="rn">Rundi</option>
<option value="ru">Russian</option>
<option value="sg">Sango</option>
<option value="sr">Serbian</option>
<option value="sn">Shona</option>
<option value="ii">Sichuan Yi</option>
<option value="si">Sinhalese</option>
<option value="sk">Slovak</option>
<option value="sl">Slovenian</option>
<option value="so">Somali</option>
<option value="es">Spanish</option>
<option value="sw">Swahili</option>
<option value="sv">Swedish</option>
<option value="gsw">Swiss German</option>
<option value="ta">Tamil</option>
<option value="te">Telugu</option>
<option value="th">Thai</option>
<option value="bo">Tibetan</option>
<option value="ti">Tigrinya</option>
<option value="to">Tonga</option>
<option value="tr">Turkish</option>
<option value="uk">Ukrainian</option>
<option value="ur">Urdu</option>
<option value="uz">Uzbek</option>
<option value="vai">Vai</option>
<option value="vi">Vietnamese</option>
<option value="cy">Welsh</option>
<option value="yo">Yoruba</option>
<option value="zu">Zulu</option>
</select>
</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>
<input type="checkbox" name="includeInputText" value="true"
       checked="checked"/>
Include input text in results
</td>
</tr>
<tr>
<td>
&nbsp;
</td>
<td>
&nbsp;
</td>
</tr>
<tr>
<td valign="top">
<strong>Results format:</strong>
</td>
<td>
<input type="radio" name="media" value="json">JSON format</input><br />
<input type="radio" name="media" value="xml" checked="checked">XML format</input><br />
<input type="radio" name="media" value="html">HTML format</input><br />
<input type="radio" name="media" value="text">Text format</input>
</td>
</tr>
<tr>
<td>
&nbsp;
</td>
<td>
&nbsp;
</td>
</tr>
<tr>
<td colspan="2">
<input type="submit" name="tokenizer" value="Tokenize" />
</td>
</tr>
</table>
</form>

Output

Here we tokenize the first two sentences of Sarah Hale's poem "Mary had a little lamb."

Mary had a little lamb,
whose fleece was white as snow.
And everywhere that Mary went,
the lamb was sure to go.

The JSON and XML WordTokenizerResult echo the input text, the ISO language code langCode, and the corpusConfig. The sentences container wraps a sequence of sentence entries, each of which represents a single parsed sentence from the input text. Each sentence contains a sequence of token entries representing the words and punctuation in the sentence.

The HTML and text versions provide displayable versions of the tokenized sentences. Their summary line counts punctuation marks as words: the 26 "words" reported for this example are 22 words plus 4 punctuation marks.

JSON output

{
  "WordTokenizerResult": {
    "text": "Mary had a little lamb,  whose fleece was white as snow.  And everywhere that Mary went,  the lamb was sure to go.",
    "langCode": "en",
    "corpusConfig": "ncf",
    "sentences": [
      {
        "sentence": [
          {
            "token": [
              "Mary",
              "had",
              "a",
              "little",
              "lamb",
              ",",
              "whose",
              "fleece",
              "was",
              "white",
              "as",
              "snow",
              "."
            ]
          },
          {
            "token": [
              "And",
              "everywhere",
              "that",
              "Mary",
              "went",
              ",",
              "the",
              "lamb",
              "was",
              "sure",
              "to",
              "go",
              "."
            ]
          }
        ]
      }
    ]
  }
}
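
The nesting shown above (sentences, then sentence, then token) can be unpacked with a few lines of Python. This is a minimal sketch that assumes the response always has the shape of the sample; the function name sentences_from_json is hypothetical and the embedded sample is a shortened copy of the output above.

```python
import json

# Shortened copy of the sample response above (tokens trimmed for brevity).
SAMPLE_JSON = """
{
  "WordTokenizerResult": {
    "langCode": "en",
    "corpusConfig": "ncf",
    "sentences": [
      {"sentence": [
        {"token": ["Mary", "had", "a", "little", "lamb", ","]},
        {"token": ["And", "everywhere", "that", "Mary", "went", ","]}
      ]}
    ]
  }
}
"""


def sentences_from_json(payload):
    """Return a list of token lists, one list per sentence."""
    result = json.loads(payload)["WordTokenizerResult"]
    return [
        entry["token"]
        for container in result["sentences"]   # outer "sentences" array
        for entry in container["sentence"]     # one entry per parsed sentence
    ]


for number, tokens in enumerate(sentences_from_json(SAMPLE_JSON), start=1):
    print(number, tokens)
```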

XML output

<WordTokenizerResult>
    <text>Mary had a little lamb,  whose fleece was white as snow.  And everywhere that Mary went,  the lamb was sure to go.</text>
    <langCode>en</langCode>
    <corpusConfig>ncf</corpusConfig>
    <sentences>
        <sentence>
            <token>Mary</token>
            <token>had</token>
            <token>a</token>
            <token>little</token>
            <token>lamb</token>
            <token>,</token>
            <token>whose</token>
            <token>fleece</token>
            <token>was</token>
            <token>white</token>
            <token>as</token>
            <token>snow</token>
            <token>.</token>
        </sentence>
        <sentence>
            <token>And</token>
            <token>everywhere</token>
            <token>that</token>
            <token>Mary</token>
            <token>went</token>
            <token>,</token>
            <token>the</token>
            <token>lamb</token>
            <token>was</token>
            <token>sure</token>
            <token>to</token>
            <token>go</token>
            <token>.</token>
        </sentence>
    </sentences>
</WordTokenizerResult>
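
The XML form can be read in much the same way. The sketch below uses Python's standard xml.etree.ElementTree module and assumes the element names shown in the sample; the function name sentences_from_xml is hypothetical and the embedded document is a shortened copy of the output above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample response above (tokens trimmed for brevity).
SAMPLE_XML = """<WordTokenizerResult>
    <langCode>en</langCode>
    <corpusConfig>ncf</corpusConfig>
    <sentences>
        <sentence>
            <token>Mary</token>
            <token>had</token>
            <token>a</token>
            <token>little</token>
            <token>lamb</token>
            <token>,</token>
        </sentence>
        <sentence>
            <token>And</token>
            <token>everywhere</token>
            <token>that</token>
            <token>Mary</token>
            <token>went</token>
            <token>,</token>
        </sentence>
    </sentences>
</WordTokenizerResult>"""


def sentences_from_xml(payload):
    """Return a list of token lists, one list per <sentence> element."""
    root = ET.fromstring(payload)
    return [
        [token.text for token in sentence.findall("token")]
        for sentence in root.findall("./sentences/sentence")
    ]


root = ET.fromstring(SAMPLE_XML)
print(root.findtext("langCode"), root.findtext("corpusConfig"))
print(sentences_from_xml(SAMPLE_XML))
```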

HTML output (source)

<h3>26 words in 2 sentences found.</h3>
<table border="0">
<tr>
<th align="left">S#</th>
<th align="left">W#</th>
<th align="left">Token</th>
<th align="left">Type</th>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">Mary</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">2</td>
<td valign="top" align="left">had</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">a</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">little</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">5</td>
<td valign="top" align="left">lamb</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">6</td>
<td valign="top" align="left">,</td>
<td valign="top" align="left">punctuation</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">7</td>
<td valign="top" align="left">whose</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">8</td>
<td valign="top" align="left">fleece</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">9</td>
<td valign="top" align="left">was</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">10</td>
<td valign="top" align="left">white</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">11</td>
<td valign="top" align="left">as</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">snow</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">13</td>
<td valign="top" align="left">.</td>
<td valign="top" align="left">punctuation</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">And</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">2</td>
<td valign="top" align="left">everywhere</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">that</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">4</td>
<td valign="top" align="left">Mary</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">5</td>
<td valign="top" align="left">went</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">6</td>
<td valign="top" align="left">,</td>
<td valign="top" align="left">punctuation</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">7</td>
<td valign="top" align="left">the</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">8</td>
<td valign="top" align="left">lamb</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">9</td>
<td valign="top" align="left">was</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">10</td>
<td valign="top" align="left">sure</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">11</td>
<td valign="top" align="left">to</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">12</td>
<td valign="top" align="left">go</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>2</strong></td>
<td valign="top" align="left">13</td>
<td valign="top" align="left">.</td>
<td valign="top" align="left">punctuation</td>
</tr>
</table>
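
If only the HTML form is available, the table rows can still be recovered programmatically. The following sketch uses Python's standard html.parser module and assumes the table layout shown above (one header row of th cells, then four td cells per token); the class name TokenTableParser is hypothetical. The JSON or XML formats are likely the easier choice for machine processing.

```python
from html.parser import HTMLParser


class TokenTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # The header row holds only <th> cells, so its list stays empty.
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <strong> is appended to the open cell.
        if self._in_cell:
            self._row[-1] += data


# Shortened copy of the HTML output above.
SAMPLE_HTML = """<table border="0">
<tr>
<th align="left">S#</th>
<th align="left">W#</th>
<th align="left">Token</th>
<th align="left">Type</th>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">Mary</td>
<td valign="top" align="left">token</td>
</tr>
<tr>
<td valign="top" align="left"><strong>1</strong></td>
<td valign="top" align="left">6</td>
<td valign="top" align="left">,</td>
<td valign="top" align="left">punctuation</td>
</tr>
</table>"""

parser = TokenTableParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
```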

HTML output (display)

26 words in 2 sentences found.

S# W# Token Type
1 1 Mary token
1 2 had token
1 3 a token
1 4 little token
1 5 lamb token
1 6 , punctuation
1 7 whose token
1 8 fleece token
1 9 was token
1 10 white token
1 11 as token
1 12 snow token
1 13 . punctuation
2 1 And token
2 2 everywhere token
2 3 that token
2 4 Mary token
2 5 went token
2 6 , punctuation
2 7 the token
2 8 lamb token
2 9 was token
2 10 sure token
2 11 to token
2 12 go token
2 13 . punctuation

Text output

26 words in 2 sentences found.
S#	W#	Token	Type
1	1	Mary	token
1	2	had	token
1	3	a	token
1	4	little	token
1	5	lamb	token
1	6	,	punctuation
1	7	whose	token
1	8	fleece	token
1	9	was	token
1	10	white	token
1	11	as	token
1	12	snow	token
1	13	.	punctuation
2	1	And	token
2	2	everywhere	token
2	3	that	token
2	4	Mary	token
2	5	went	token
2	6	,	punctuation
2	7	the	token
2	8	lamb	token
2	9	was	token
2	10	sure	token
2	11	to	token
2	12	go	token
2	13	.	punctuation
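
The text format is a summary line, a header line, and one tab-separated row per token, so it can be split without any special library. The sketch below assumes that layout; the function name parse_text_output is hypothetical and the sample is a shortened copy of the output above.

```python
# Shortened copy of the text output above; fields are separated by tabs.
SAMPLE_TEXT = (
    "26 words in 2 sentences found.\n"
    "S#\tW#\tToken\tType\n"
    "1\t1\tMary\ttoken\n"
    "1\t6\t,\tpunctuation\n"
    "2\t13\t.\tpunctuation\n"
)


def parse_text_output(payload):
    """Return the summary line and a list of (sentence, word, token, type) rows."""
    lines = payload.splitlines()
    summary = lines[0]
    records = []
    for line in lines[2:]:  # skip the summary line and the column header
        if not line.strip():
            continue
        sentence_no, word_no, token, kind = line.split("\t")
        records.append((int(sentence_no), int(word_no), token, kind))
    return summary, records


summary, records = parse_text_output(SAMPLE_TEXT)
print(summary)
for record in records:
    print(record)
```

A plain split on tabs is used here rather than the csv module with default settings, since a token consisting of a double quote could otherwise be misread as the start of a quoted field.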