MorphAdorner: Extracting Abbreviations Using PUNKT

PunktAbbreviationDetector finds abbreviations in a set of untagged, UTF-8 encoded texts using the Punkt algorithm of Tibor Kiss and Jan Strunk (2006).

The Punkt algorithm adapts collocation extraction methodology to the problem of determining when a period-terminated token is an abbreviation. For each token ending with a period, Punkt counts how often the token occurs with and without the trailing period. When the token appears with a period significantly more often than without, it becomes a candidate abbreviation. Additional heuristics then refine the selection.
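The counting step described above can be sketched in a few lines of Python. This is a simplified illustration, not MorphAdorner's implementation: the function name candidate_abbreviations and the min_ratio and min_count thresholds are assumptions made for this example, and a plain proportion stands in for the log-likelihood scoring and additional heuristics of the actual Punkt algorithm.

```python
from collections import Counter


def candidate_abbreviations(tokens, min_ratio=0.9, min_count=2):
    """Return tokens seen with a trailing period far more often than without.

    Simplified stand-in: a plain proportion replaces Punkt's statistical
    scoring. Both thresholds are illustrative assumptions.
    """
    with_period = Counter()
    without_period = Counter()
    for tok in tokens:
        low = tok.lower()
        if low.endswith(".") and len(low) > 1:
            with_period[low[:-1]] += 1   # count the stem seen with a period
        else:
            without_period[low] += 1     # count the bare token

    candidates = []
    for stem, n_with in with_period.items():
        total = n_with + without_period[stem]
        if n_with >= min_count and n_with / total >= min_ratio:
            candidates.append(stem + ".")
    return sorted(candidates)


text = ("Dr. Smith met Dr. Jones at the park. The park was quiet. "
        "Mr. Lee saw the park too. Mr. Lee left.")
print(candidate_abbreviations(text.split()))  # ['dr.', 'mr.']
```

Here "park." is rejected because "park" also occurs without a period, while "Dr." and "Mr." never do. Whitespace tokenization and lowercasing are shortcuts taken to keep the sketch short.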

This algorithm works well for English and other Western European languages. Its main weakness is that it fails when the texts being analyzed contain many instances of genuine abbreviations written without the terminating period. Biblical references in early modern English texts are a good example: abbreviated Biblical book names often lack the final period, so they are typically not recognized as abbreviations.
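A small worked example may make this weakness concrete. The token lists below are invented for illustration and are not drawn from any real corpus, and period_share is a hypothetical helper, not part of MorphAdorner. It computes the fraction of a token's occurrences that carry a trailing period, which is the kind of evidence a count-based detector relies on.

```python
def period_share(tokens, stem):
    """Fraction of occurrences of `stem` that carry a trailing period."""
    with_period = sum(1 for t in tokens if t == stem + ".")
    without_period = sum(1 for t in tokens if t == stem)
    total = with_period + without_period
    return with_period / total if total else 0.0


# Hypothetical tokens from Biblical references (invented data).
early = ["Gen", "3", "Gen.", "12", "Gen", "22", "Gen", "1"]
modern = ["Gen.", "3", "Gen.", "12", "Gen.", "22", "Gen.", "1"]

print(period_share(early, "Gen"))   # 0.25: mostly written without a period
print(period_share(modern, "Gen"))  # 1.0: always written with a period
```

With only a quarter of its occurrences period-terminated, "Gen" in the first list offers little evidence of being an abbreviation, even though a human reader would recognize it as one.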

Usage:

punktabbreviationdetector isolangcode abbrevs.txt text1.txt text2.txt ...

where

isolangcode specifies the two- or three-character ISO language code of the language in which the texts to be analyzed are written.
abbrevs.txt specifies the name of the output file to receive the abbreviations extracted from the texts.
text1.txt text2.txt ... specify the names of the UTF-8 encoded text files from which to extract potential abbreviations.

Reference

Kiss, Tibor and Strunk, Jan (2006). Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
