Comparing String Counts

CompareStringCounts compares two columnar files containing strings and the counts of those strings.

Usage:

comparestringcounts analysis.tab reference.tab

where

  • analysis.tab is an input tab-separated file of strings and counts for an analysis text.
  • reference.tab is an input tab-separated file of strings and counts for a reference text.

The analysis.tab and reference.tab files contain strings and counts of those strings compiled from two texts or corpora. Both files contain two tab-separated columns. The first column is a string. The second column contains the count of the number of times that string occurred in the associated text.
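As an illustration of the two-column layout described above, the following Python sketch parses such data into a dictionary. It is not part of MorphAdorner; the function name read_counts and the sample rows are hypothetical, and summing the counts of a repeated string is an assumption rather than documented behavior.

```python
import csv
import io

def read_counts(lines):
    """Parse 'string<TAB>count' lines into a dict of string -> count.

    Assumption: if a string appears on several lines, its counts are summed.
    """
    counts = {}
    # QUOTE_NONE keeps quotation marks inside strings as literal characters.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[0]] = counts.get(row[0], 0) + int(row[1])
    return counts

# Hypothetical sample data, not taken from any real text.
sample = io.StringIO("the\t120\nwhale\t15\nsea\t9\n")
analysis_counts = read_counts(sample)
```

For a real file, the same function could be given an open file object, for example open("analysis.tab", encoding="utf-8", newline="").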

The output contains seven tab-separated columns, sorted in descending order by log-likelihood value. One line of output appears for each string in the analysis text.

  1. The first column contains the string. This may be a spelling, a lemma, a part of speech, a spelling bigram, or any other string of interest.
  2. The second column contains a "+" when the string is overused in the analysis text with respect to the reference text, a "-" when the string is underused, and a blank when the string is used the same amount in both texts.
  3. The third column contains Dunning's log-likelihood value.
  4. The fourth column shows the relative frequency of occurrence of the string in the analysis text as fractional parts per ten thousand.
  5. The fifth column shows the relative frequency of occurrence of the string in the reference text as fractional parts per ten thousand.
  6. The sixth column shows the number of times the string occurred in the analysis text.
  7. The seventh column shows the number of times the string occurred in the reference text.
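The second, fourth, and fifth columns can be sketched in Python as follows. This is a minimal illustration under the assumption that overuse and underuse are decided by comparing relative frequencies; the function names are hypothetical and do not come from MorphAdorner. Here a and b are the counts of the string in the analysis and reference texts, and c and d are the total word counts of those texts.

```python
def relative_frequency(count, total, per=10_000):
    """Occurrences per `per` words (parts per ten thousand by default)."""
    return per * count / total if total else 0.0

def usage_marker(a, b, c, d):
    """Return '+', '-', or ' ' by comparing a/c with b/d.

    Cross-multiplying (a*d versus b*c) keeps the comparison in integers,
    so equal rates are detected exactly. Assumes c and d are positive.
    """
    if a * d > b * c:
        return "+"  # relatively more frequent in the analysis text
    if a * d < b * c:
        return "-"  # relatively less frequent in the analysis text
    return " "      # same relative frequency in both texts
```

For example, a string occurring 15 times in a 30,000-word analysis text has a relative frequency of 5 per ten thousand.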

These results are written to standard output, which can be redirected to a file. A brief summary of the analysis is written to standard error.

Statistical Background

Comparisons tell you whether there is more of this here or less of that there. Knowing that individual word forms occur more or less often in one text than in another may help characterize some generic differences between those texts. Statistics on how often the words occur add rigor and provide a framework for judging whether the observed differences are likely or unlikely to have occurred by chance, and so whether they deserve further attention and interpretation.

Log-likelihood for comparing texts

CompareStringCounts allows you to compare the frequencies of word occurrences in two texts and obtain a statistical measure of the significance of the differences. CompareStringCounts uses the log-likelihood ratio G2, also known as Dunning's log-likelihood, as a measure of difference. To compute G2, CompareStringCounts constructs a two-by-two contingency table of frequencies for each word.

                            Analysis Text   Reference Text   Total
Count of word form          a               b                a+b
Count of other word forms   c-a             d-b              c+d-a-b
Total                       c               d                c+d

The value of "a" is the number of times the word occurs in the analysis text. The value of "b" is the number of times the word occurs in the reference text. The value of "c" is the total number of words in the analysis text. The value of "d" is the total number of words in the reference text.

Given this contingency table, CompareStringCounts calculates the log-likelihood ratio statistic G2 to assess the size and significance of the difference in a word's frequency of use in the two texts. The log-likelihood ratio measures the discrepancy of the observed word frequencies from the values we would expect to see if the word frequencies (by percentage) were the same in the two texts. The larger the discrepancy, the larger the value of G2, and the more statistically significant the difference between the word frequencies in the texts. Simply put, the log-likelihood value tells us how much more likely it is that the frequencies are different than that they are the same.

The log-likelihood value is computed as the sum over all terms of the form "O * ln(O/E)" where "O" is the observed value of a contingency table entry, "E" is the expected value under a model of homogeneity for frequencies for the two texts, and "ln" is the natural log. If the observed value is zero, we ignore that table entry in computing the total. CompareStringCounts calculates the log-likelihood value G2 for each two-by-two contingency table as follows.

E1=c*(a+b)/(c+d)
E2=d*(a+b)/(c+d)
G2=2*((a*ln(a/E1)) + (b*ln(b/E2)))
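The formulas above translate directly into a few lines of Python. This is a sketch of the calculation as stated here, not MorphAdorner's own source code; the function name log_likelihood is hypothetical. It follows the rule given earlier of skipping a term whose observed value is zero, and it assumes c + d is greater than zero.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c analysis words and b times in d reference words."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference text
    total = 0.0
    if a > 0:                   # a zero observed count contributes nothing
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2.0 * total
```

When a word has the same relative frequency in both texts, the observed and expected counts coincide and the function returns 0. The value grows as the observed counts move away from the expected ones.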

To determine the statistical significance of G2, we refer the G2 value to the chi-square distribution with one degree of freedom. The significance value tells you how often a G2 at least as large as the one CompareStringCounts computed would occur by chance. For example, a log-likelihood value of 6.63 should occur by chance only about one time in a hundred. This means the significance of a G2 value of 6.63 is 0.01.
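The significance value can be computed without a statistics library, because the upper-tail probability of a chi-square variate with one degree of freedom equals erfc(sqrt(x/2)). The sketch below uses that identity; the function name g2_significance is hypothetical, and whether CompareStringCounts computes the value this way is not stated here.

```python
import math

def g2_significance(g2):
    """Probability that a chi-square variate with 1 degree of freedom exceeds g2."""
    g2 = max(g2, 0.0)  # guard against tiny negative values from rounding
    return math.erfc(math.sqrt(g2 / 2.0))
```

Applied to the example in the text, g2_significance(6.63) comes out at approximately 0.01.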

References

Ted Dunning's paper discusses the use of the log-likelihood test for general textual analysis.

  • Dunning, Ted. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, Volume 19, number 1, pp. 61-74.

Rayson and Garside discuss the use of the log-likelihood test for comparing corpora.

  • Rayson, P. and Garside, R. 2000. Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000). 1-8 October 2000, Hong Kong.
