Northwestern University Information Technology
PunktAbbreviationDetector finds abbreviations in a set of untagged UTF-8 encoded texts using the Punkt algorithm of Tibor Kiss and Jan Strunk.
The Punkt algorithm adapts collocation extraction methodology to the problem of determining when a period-terminated token is an abbreviation. For each token ending with a period, Punkt counts how often the token occurs with and without the trailing period. A token that appears with a period significantly more often than without one is a candidate abbreviation. Additional heuristics then refine the selection.
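The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the MorphAdorner implementation: the function name `candidate_abbreviations`, the `min_ratio` and `min_count` parameters, and the simple ratio threshold are all assumptions made for this example. The ratio threshold is a stand-in for the statistical test and the additional heuristics that Punkt itself applies.

```python
from collections import Counter


def candidate_abbreviations(text, min_ratio=0.9, min_count=2):
    """Return tokens seen with a trailing period far more often than without.

    Hypothetical helper for illustration only. A plain ratio threshold
    replaces the statistical test used by the real Punkt algorithm.
    """
    with_period = Counter()
    without_period = Counter()

    for raw in text.split():
        # Strip surrounding punctuation, but keep any trailing period.
        token = raw.strip(",;:!?\"'()[]").lower()
        if not token:
            continue
        if token.endswith("."):
            stem = token.rstrip(".")
            if stem:
                with_period[stem] += 1
        else:
            without_period[token] += 1

    candidates = set()
    for stem, n_with in with_period.items():
        total = n_with + without_period[stem]
        # Require enough evidence and a strong preference for the period.
        if n_with >= min_count and n_with / total >= min_ratio:
            candidates.add(stem)
    return candidates


sample = ("Dr. Smith met Dr. Jones. The dog ran. "
          "A dog barked at Dr. Lee while the dog slept.")
print(candidate_abbreviations(sample))
```

In this made-up sample, "dr" always carries a period and is selected, while sentence-final words such as "ran" occur with a period only once and fall below the assumed `min_count`.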
This algorithm works well for English and other Western European languages. Its main weakness is that it fails when the texts being analyzed contain many instances of genuine abbreviations written without the terminating period. Biblical references in early modern English texts are a good example: abbreviated Biblical book names often do not end with a period, so they will typically not be recognized as abbreviations.
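A small sketch may help show why missing periods defeat the counts. The two sample strings below are invented for illustration, and `period_share` is a hypothetical helper, not part of MorphAdorner. It reports the fraction of a token's occurrences that carry a trailing period, which is the kind of evidence the algorithm depends on.

```python
def period_share(text, stem):
    """Fraction of occurrences of `stem` that end with a period.

    Hypothetical helper for illustration only.
    """
    tokens = [t.strip(",;:()").lower() for t in text.split()]
    with_period = sum(1 for t in tokens if t == stem + ".")
    without_period = sum(1 for t in tokens if t == stem)
    total = with_period + without_period
    return with_period / total if total else 0.0


# Invented samples: consistent periods versus mostly missing periods.
consistent = "See Gen. 1 and Gen. 2, then Gen. 3."
inconsistent = "See Gen 1 and Gen. 2, then Gen 3 and Gen 4."

print(period_share(consistent, "gen"))
print(period_share(inconsistent, "gen"))
```

When the abbreviation is usually written without its period, the share drops well below any reasonable threshold, so the token looks like an ordinary word that occasionally ends a sentence.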
Usage:

punktabbreviationdetector isolangcode abbrevs.txt text1.txt text2.txt ...
Kiss, Tibor and Strunk, Jan (2006). Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive, Evanston, IL 60208.