Northwestern University Information Technology |
MorphAdorner V2.0 |
Service name: | wordtokenizer |
Service description: | Split text into words and punctuation. |
HTTP methods allowed: | GET, POST, OPTIONS |
POST accepts as input: | application/x-www-form-urlencoded |
HTTP return codes: | 200: service succeeded; 400: service failed with an error |
Query parameters: |
corpusConfig | Corpus configuration name. In the standard distribution these are eme (Early Modern English), ece (Eighteenth Century English), and ncf (Nineteenth Century Fiction). |
media | Result format. One of json, xml, html, or text. |
text | Text to be processed. |
includeInputText | true to include the input text in the output, false to omit it. |
langCode | ISO language code, two or three characters long. The default is en (English). Choose *** Detect *** to have the server try to determine the language from the text provided; in the form below, that option submits an empty langCode value. |
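As a rough illustration of how the parameters above fit together, the following Python sketch builds (but does not send) a POST request for the service. The base URL is a hypothetical placeholder, not an address given on this page, and the helper name `build_tokenizer_request` is invented for the example; only the parameter names and the `application/x-www-form-urlencoded` content type come from the table above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical server address (an assumption): replace with the base URL
# of the MorphAdorner server you are actually using.
BASE_URL = "http://localhost:8182"


def build_tokenizer_request(text, corpus_config="ncf", media="json",
                            include_input_text=True, lang_code="en"):
    """Build a POST request for the wordtokenizer service without sending it."""
    params = {
        "text": text,
        "corpusConfig": corpus_config,
        "media": media,
        "includeInputText": "true" if include_input_text else "false",
        "langCode": lang_code,
    }
    # The service accepts form-encoded POST bodies.
    body = urlencode(params).encode("utf-8")
    return Request(
        BASE_URL + "/wordtokenizer",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_tokenizer_request("Mary had a little lamb.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; a 200 response should carry the result in the chosen media format, and a 400 response indicates that the service failed.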
<form accept-charset="UTF-8" method="post" action="wordtokenizer" target="_blank" name="wordtokenizer"> <table cellpadding="0" cellspacing="5"> <tr> <td><strong>Text:</strong></td> <td colspan="2"> <textarea name="text" rows="15" cols="76"></textarea> </td> </tr> <tr> <td valign="top"> <strong> Lexicon:</strong> </td> <td> <input type="radio" name="corpusConfig" value="eme">Early Modern English</input><br /> <input type="radio" name="corpusConfig" value="ece">Eighteenth Century English</input><br /> <input type="radio" name="corpusConfig" value="ncf" checked="checked">Nineteenth Century Fiction</input> </td> </tr> <tr> <td><strong>Language:</strong></td> <td> <select name="langCode"> <option value="en" selected="selected">English</option> <option value="">*** Detect ***</option> <option value="af">Afrikaans</option> <option value="ak">Akan</option> <option value="sq">Albanian</option> <option value="am">Amharic</option> <option value="ar">Arabic</option> <option value="hy">Armenian</option> <option value="as">Assamese</option> <option value="az">Azerbaijani</option> <option value="bm">Bambara</option> <option value="bas">Basa</option> <option value="eu">Basque</option> <option value="be">Belarusian</option> <option value="bem">Bemba</option> <option value="bn">Bengali</option> <option value="bs">Bosnian</option> <option value="br">Breton</option> <option value="bg">Bulgarian</option> <option value="my">Burmese</option> <option value="ca">Catalan</option> <option value="chr">Cherokee</option> <option value="zh">Chinese</option> <option value="kw">Cornish</option> <option value="hr">Croatian</option> <option value="cs">Czech</option> <option value="da">Danish</option> <option value="dua">Duala</option> <option value="nl">Dutch</option> <option value="eo">Esperanto</option> <option value="et">Estonian</option> <option value="ee">Ewe</option> <option value="ewo">Ewondo</option> <option value="fo">Faroese</option> <option value="fil">Filipino</option> <option 
value="fi">Finnish</option> <option value="fr">French</option> <option value="ff">Fulah</option> <option value="gl">Gallegan</option> <option value="lg">Ganda</option> <option value="ka">Georgian</option> <option value="de">German</option> <option value="el">Greek</option> <option value="kl">Greenlandic</option> <option value="gu">Gujarati</option> <option value="ha">Hausa</option> <option value="haw">Hawaiian</option> <option value="iw">Hebrew</option> <option value="hi">Hindi</option> <option value="hu">Hungarian</option> <option value="is">Icelandic</option> <option value="ig">Igbo</option> <option value="in">Indonesian</option> <option value="ga">Irish</option> <option value="it">Italian</option> <option value="ja">Japanese</option> <option value="kab">Kabyle</option> <option value="kam">Kamba</option> <option value="kn">Kannada</option> <option value="kk">Kazakh</option> <option value="km">Khmer</option> <option value="ki">Kikuyu</option> <option value="rw">Kinyarwanda</option> <option value="kok">Konkani</option> <option value="ko">Korean</option> <option value="lv">Latvian</option> <option value="ln">Lingala</option> <option value="lt">Lithuanian</option> <option value="lu">Luba-Katanga</option> <option value="mk">Macedonian</option> <option value="mg">Malagasy</option> <option value="ms">Malay</option> <option value="ml">Malayalam</option> <option value="mt">Maltese</option> <option value="gv">Manx</option> <option value="mr">Marathi</option> <option value="mas">Masai</option> <option value="ne">Nepali</option> <option value="nd">North Ndebele</option> <option value="nb">Norwegian Bokmål</option> <option value="nn">Norwegian Nynorsk</option> <option value="nyn">Nyankole</option> <option value="or">Oriya</option> <option value="om">Oromo</option> <option value="pa">Panjabi</option> <option value="fa">Persian</option> <option value="pl">Polish</option> <option value="pt">Portuguese</option> <option value="ps">Pushto</option> <option 
value="rm">Raeto-Romance</option> <option value="ro">Romanian</option> <option value="rn">Rundi</option> <option value="ru">Russian</option> <option value="sg">Sango</option> <option value="sr">Serbian</option> <option value="sn">Shona</option> <option value="ii">Sichuan Yi</option> <option value="si">Sinhalese</option> <option value="sk">Slovak</option> <option value="sl">Slovenian</option> <option value="so">Somali</option> <option value="es">Spanish</option> <option value="sw">Swahili</option> <option value="sv">Swedish</option> <option value="gsw">Swiss German</option> <option value="ta">Tamil</option> <option value="te">Telugu</option> <option value="th">Thai</option> <option value="bo">Tibetan</option> <option value="ti">Tigrinya</option> <option value="to">Tonga</option> <option value="tr">Turkish</option> <option value="uk">Ukrainian</option> <option value="ur">Urdu</option> <option value="uz">Uzbek</option> <option value="vai">Vai</option> <option value="vi">Vietnamese</option> <option value="cy">Welsh</option> <option value="yo">Yoruba</option> <option value="zu">Zulu</option> </select> </td> </tr> <tr> <td> </td> <td> <input type="checkbox" name="includeInputText" value="true" checked="checked"/> Include input text in results </td> </tr> <tr> <td> </td> <td> </td> </tr> <tr> <td valign="top"> <strong>Results format:</strong> </td> <td> <input type="radio" name="media" value="json">JSON format</input><br /> <input type="radio" name="media" value="xml" checked="checked">XML format</input><br /> <input type="radio" name="media" value="html">HTML format</input><br /> <input type="radio" name="media" value="text">Text format</input> </td> </tr> <tr> <td> </td> <td> </td> </tr> <tr> <td colspan="2"> <input type="submit" name="tokenizer" value="Tokenize" /> </td> </tr> </table> </form>
Here we tokenize the first two sentences of Sarah Hale's poem "Mary had a little lamb."
Mary had a little lamb,
whose fleece was white as snow.
And everywhere that Mary went,
the lamb was sure to go.
The JSON and XML WordTokenizerResult outputs echo the input text (when includeInputText is true), the ISO language code langCode, and the corpusConfig. The sentences container wraps a sequence of sentence entries, each of which represents a single parsed sentence from the input text. Each sentence contains a sequence of token entries representing the words and punctuation in the sentence. The HTML and text versions provide displayable versions of the tokenized sentences.
{
  "WordTokenizerResult": {
    "text": "Mary had a little lamb, whose fleece was white as snow. And everywhere that Mary went, the lamb was sure to go.",
    "langCode": "en",
    "corpusConfig": "ncf",
    "sentences": [
      {
        "sentence": [
          {
            "token": [ "Mary", "had", "a", "little", "lamb", ",", "whose", "fleece", "was", "white", "as", "snow", "." ]
          },
          {
            "token": [ "And", "everywhere", "that", "Mary", "went", ",", "the", "lamb", "was", "sure", "to", "go", "." ]
          }
        ]
      }
    ]
  }
}
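The nesting in the JSON result is easy to misread: in the sample above, `sentences` is a list holding one object, whose `sentence` list holds one object per sentence, each with a `token` list. A minimal Python sketch for flattening that into one token list per sentence might look like this. The function name `sentences_from_json` is made up for the example, and the sketch assumes every response follows the same shape as the sample.

```python
import json

# Sample response copied from the JSON output shown above.
SAMPLE_JSON = """
{ "WordTokenizerResult": {
    "text": "Mary had a little lamb, whose fleece was white as snow. And everywhere that Mary went, the lamb was sure to go.",
    "langCode": "en",
    "corpusConfig": "ncf",
    "sentences": [ { "sentence": [
      { "token": [ "Mary", "had", "a", "little", "lamb", ",", "whose",
                   "fleece", "was", "white", "as", "snow", "." ] },
      { "token": [ "And", "everywhere", "that", "Mary", "went", ",", "the",
                   "lamb", "was", "sure", "to", "go", "." ] }
    ] } ]
} }
"""


def sentences_from_json(payload):
    """Return a list of token lists, one per sentence (hypothetical helper)."""
    result = json.loads(payload)["WordTokenizerResult"]
    sentences = []
    for container in result["sentences"]:
        for entry in container["sentence"]:
            sentences.append(entry["token"])
    return sentences


sentences = sentences_from_json(SAMPLE_JSON)
for number, tokens in enumerate(sentences, start=1):
    print(number, len(tokens), " ".join(tokens))
```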
<WordTokenizerResult>
  <text>Mary had a little lamb, whose fleece was white as snow. And everywhere that Mary went, the lamb was sure to go.</text>
  <langCode>en</langCode>
  <corpusConfig>ncf</corpusConfig>
  <sentences>
    <sentence>
      <token>Mary</token> <token>had</token> <token>a</token> <token>little</token> <token>lamb</token> <token>,</token> <token>whose</token> <token>fleece</token> <token>was</token> <token>white</token> <token>as</token> <token>snow</token> <token>.</token>
    </sentence>
    <sentence>
      <token>And</token> <token>everywhere</token> <token>that</token> <token>Mary</token> <token>went</token> <token>,</token> <token>the</token> <token>lamb</token> <token>was</token> <token>sure</token> <token>to</token> <token>go</token> <token>.</token>
    </sentence>
  </sentences>
</WordTokenizerResult>
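The XML result can be read in much the same way. The sketch below uses Python's standard `xml.etree.ElementTree` module on the sample response above; the helper name `sentences_from_xml` is invented for illustration, and the sketch assumes the element names shown in the sample (`sentences`, `sentence`, `token`) are stable across responses.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the XML output shown above.
SAMPLE_XML = """
<WordTokenizerResult>
  <text>Mary had a little lamb, whose fleece was white as snow. And everywhere that Mary went, the lamb was sure to go.</text>
  <langCode>en</langCode>
  <corpusConfig>ncf</corpusConfig>
  <sentences>
    <sentence>
      <token>Mary</token> <token>had</token> <token>a</token>
      <token>little</token> <token>lamb</token> <token>,</token>
      <token>whose</token> <token>fleece</token> <token>was</token>
      <token>white</token> <token>as</token> <token>snow</token>
      <token>.</token>
    </sentence>
    <sentence>
      <token>And</token> <token>everywhere</token> <token>that</token>
      <token>Mary</token> <token>went</token> <token>,</token>
      <token>the</token> <token>lamb</token> <token>was</token>
      <token>sure</token> <token>to</token> <token>go</token>
      <token>.</token>
    </sentence>
  </sentences>
</WordTokenizerResult>
"""


def sentences_from_xml(payload):
    """Return (langCode, list of token lists) from an XML result (hypothetical helper)."""
    root = ET.fromstring(payload.strip())
    lang_code = root.findtext("langCode")
    sentences = [
        [token.text for token in sentence.findall("token")]
        for sentence in root.findall("./sentences/sentence")
    ]
    return lang_code, sentences


lang_code, xml_sentences = sentences_from_xml(SAMPLE_XML)
print(lang_code, [len(tokens) for tokens in xml_sentences])
```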
<h3>26 words in 2 sentences found.</h3> <table border="0"> <tr> <th align="left">S#</th> <th align="left">W#</th> <th align="left">Token</th> <th align="left">Type</th> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">1</td> <td valign="top" align="left">Mary</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">2</td> <td valign="top" align="left">had</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">3</td> <td valign="top" align="left">a</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">4</td> <td valign="top" align="left">little</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">5</td> <td valign="top" align="left">lamb</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">6</td> <td valign="top" align="left">,</td> <td valign="top" align="left">punctuation</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">7</td> <td valign="top" align="left">whose</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">8</td> <td valign="top" align="left">fleece</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">9</td> <td valign="top" align="left">was</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">10</td> <td valign="top" align="left">white</td> <td valign="top" 
align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">11</td> <td valign="top" align="left">as</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">12</td> <td valign="top" align="left">snow</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>1</strong></td> <td valign="top" align="left">13</td> <td valign="top" align="left">.</td> <td valign="top" align="left">punctuation</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">1</td> <td valign="top" align="left">And</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">2</td> <td valign="top" align="left">everywhere</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">3</td> <td valign="top" align="left">that</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">4</td> <td valign="top" align="left">Mary</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">5</td> <td valign="top" align="left">went</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">6</td> <td valign="top" align="left">,</td> <td valign="top" align="left">punctuation</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">7</td> <td valign="top" align="left">the</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">8</td> <td valign="top" 
align="left">lamb</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">9</td> <td valign="top" align="left">was</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">10</td> <td valign="top" align="left">sure</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">11</td> <td valign="top" align="left">to</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">12</td> <td valign="top" align="left">go</td> <td valign="top" align="left">token</td> </tr> <tr> <td valign="top" align="left"><strong>2</strong></td> <td valign="top" align="left">13</td> <td valign="top" align="left">.</td> <td valign="top" align="left">punctuation</td> </tr> </table>
S# | W# | Token | Type |
---|---|---|---|
1 | 1 | Mary | token |
1 | 2 | had | token |
1 | 3 | a | token |
1 | 4 | little | token |
1 | 5 | lamb | token |
1 | 6 | , | punctuation |
1 | 7 | whose | token |
1 | 8 | fleece | token |
1 | 9 | was | token |
1 | 10 | white | token |
1 | 11 | as | token |
1 | 12 | snow | token |
1 | 13 | . | punctuation |
2 | 1 | And | token |
2 | 2 | everywhere | token |
2 | 3 | that | token |
2 | 4 | Mary | token |
2 | 5 | went | token |
2 | 6 | , | punctuation |
2 | 7 | the | token |
2 | 8 | lamb | token |
2 | 9 | was | token |
2 | 10 | sure | token |
2 | 11 | to | token |
2 | 12 | go | token |
2 | 13 | . | punctuation |
26 words in 2 sentences found.
S# W# Token Type
1 1 Mary token
1 2 had token
1 3 a token
1 4 little token
1 5 lamb token
1 6 , punctuation
1 7 whose token
1 8 fleece token
1 9 was token
1 10 white token
1 11 as token
1 12 snow token
1 13 . punctuation
2 1 And token
2 2 everywhere token
2 3 that token
2 4 Mary token
2 5 went token
2 6 , punctuation
2 7 the token
2 8 lamb token
2 9 was token
2 10 sure token
2 11 to token
2 12 go token
2 13 . punctuation
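If only the text format is available, the rows can still be recovered programmatically. The following Python sketch assumes the service emits one row per line with whitespace-separated columns (sentence number, word number, token, type); that layout is inferred from the sample above rather than documented here, and the helper name `parse_text_rows` is hypothetical.

```python
from collections import Counter

# Abridged sample in the assumed one-row-per-line layout.
SAMPLE_TEXT = """26 words in 2 sentences found.
S# W# Token Type
1 1 Mary token
1 2 had token
1 6 , punctuation
1 13 . punctuation
2 1 And token
2 13 . punctuation
"""


def parse_text_rows(report):
    """Return (sentence, word, token, type) tuples, skipping non-data lines."""
    rows = []
    for line in report.splitlines():
        fields = line.split()
        # Data rows have exactly four fields and start with two integers;
        # the summary line and the column header fail this check.
        if len(fields) == 4 and fields[0].isdigit() and fields[1].isdigit():
            rows.append((int(fields[0]), int(fields[1]), fields[2], fields[3]))
    return rows


rows = parse_text_rows(SAMPLE_TEXT)
types_per_sentence = Counter((sentence, kind) for sentence, _, _, kind in rows)
print(rows)
print(types_per_sentence)
```

Because the split is on whitespace, this sketch would not handle a token that itself contains a space; the JSON or XML formats are a safer choice for machine processing.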
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.