MorphAdorner Northwestern
 
Extracting Abbreviations Using Punkt

PunktAbbreviationDetector finds abbreviations in a set of untagged, UTF-8 encoded texts using the Punkt algorithm of Tibor Kiss and Jan Strunk (2006).

The Punkt algorithm adapts collocation extraction methodology to the problem of determining whether a period-terminated token is an abbreviation. For each token that ends with a period, Punkt counts how often the token occurs with and without the trailing period. A token that occurs with the period significantly more often than chance would predict, and rarely or never without it, is a candidate abbreviation. Additional heuristics refine the selection; in Kiss and Strunk's formulation these favor short tokens and tokens that contain internal periods.
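The counting step described above can be sketched in a few lines of Python. This is a simplified illustration, not the Punkt algorithm itself and not MorphAdorner's implementation: it replaces Punkt's statistical test with a plain ratio threshold, and the function name `candidate_abbreviations` and the parameters `min_count` and `min_ratio` are hypothetical names chosen for this example.

```python
from collections import Counter


def candidate_abbreviations(text, min_count=2, min_ratio=0.9):
    """Return tokens that carry a trailing period in most of their occurrences.

    Assumption: a simple ratio threshold stands in for Punkt's
    statistical test; the default thresholds are illustrative only.
    """
    with_period = Counter()
    without_period = Counter()

    for token in text.split():
        # Drop surrounding punctuation other than the period itself.
        token = token.strip(",;:!?()[]\"'")
        if not token:
            continue
        if token.endswith("."):
            stem = token[:-1]
            if stem:
                with_period[stem] += 1
        else:
            without_period[token] += 1

    candidates = []
    for stem, n_with in with_period.items():
        total = n_with + without_period[stem]
        # Keep tokens seen often enough, almost always with the period.
        if n_with >= min_count and n_with / total >= min_ratio:
            candidates.append(stem + ".")
    return sorted(candidates)


sample = (
    "Dr. Smith met Dr. Jones at noon. "
    "The doctor saw Mr. Brown and Mr. Green at noon today."
)
print(candidate_abbreviations(sample))  # ['Dr.', 'Mr.']
```

In the sample, "noon." and "today." each end a sentence only once, so they fall below `min_count`, while "Dr." and "Mr." never occur without the period and are kept.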

This algorithm works well for English and other Western European languages. Its main weakness is that it fails when the texts being analyzed contain many genuine abbreviations written without the terminating period. Biblical references in early modern English texts are a good example: abbreviated Biblical book names often do not end with a period, so they are typically not recognized as abbreviations.
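A small Python sketch makes the weakness concrete. The token list is invented for illustration and the helper `period_share` is a hypothetical name; it computes only the share of occurrences that carry a period, which is a rough proxy for the evidence Punkt relies on rather than its actual test.

```python
def period_share(tokens, stem):
    """Fraction of occurrences of `stem` that carry a trailing period."""
    with_period = tokens.count(stem + ".")
    without_period = tokens.count(stem)
    total = with_period + without_period
    return with_period / total if total else 0.0


# Invented citations: "Gen" is mostly written without a period,
# "Matt." always with one.
tokens = "see Gen 1 and Gen 2 also Gen. 3 but Matt. 5 and Matt. 6".split()

print(round(period_share(tokens, "Gen"), 2))   # 0.33
print(round(period_share(tokens, "Matt"), 2))  # 1.0
```

Because "Gen" appears without a period in two of its three occurrences, the evidence that it is an abbreviation is weak, whereas "Matt." looks like a clear candidate.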

Usage:

punktabbreviationdetector isolangcode abbrevs.txt text1.txt text2.txt ...

where

  • isolangcode specifies the two- or three-character ISO code of the language in which the texts to be analyzed are written.
  • abbrevs.txt specifies the name of the output file to receive the abbreviations extracted from the texts.
  • text1.txt text2.txt ... specify the names of the UTF-8 encoded text files from which to extract potential abbreviations.

Reference

Kiss, Tibor and Strunk, Jan (2006). Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
