Northwestern University Information Technology 
MorphAdorner V2.0
CompareStringCounts compares two columnar files containing strings, such as spellings or part of speech tags, and their counts.
Usage:
comparestringcounts analysis.tab reference.tab
where the analysis.tab and reference.tab files contain strings and counts of those strings compiled from two texts or corpora. Both files contain two tab-separated columns. The first column is a string. The second column is the number of times that string occurred in the associated text.
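As a rough illustration of the input format described above, the following Python sketch parses two-column, tab-separated lines into a dictionary of counts. The function name `read_string_counts` and the sample data are hypothetical, and the handling of blank lines and repeated strings is an assumption, not a description of how CompareStringCounts itself reads its input. Summing the count column to get a total word count is likewise an assumption.

```python
import io

def read_string_counts(lines):
    """Parse 'string<TAB>count' lines into a dict of string -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are skipped
        string, count = line.split("\t")
        # assumption: repeated strings have their counts added together
        counts[string] = counts.get(string, 0) + int(count)
    return counts

# Hypothetical sample standing in for an analysis.tab file.
sample = io.StringIO("the\t120\nwhale\t7\nsea\t15\n")
analysis_counts = read_string_counts(sample)

# Assumption: the total number of words is the sum of the count column.
total_words = sum(analysis_counts.values())
```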
The output contains seven tab-separated columns, sorted in descending order by log-likelihood value. One line of output appears for each string in the analysis text.
These results are written to standard output, which can be redirected to a file. A brief summary of the analysis is written to standard error.
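The ordering and the split between the two output streams can be sketched in Python as follows. This is only an illustration: the function `write_results` is hypothetical, it emits just two columns (string and score) instead of the seven that CompareStringCounts produces, and the wording of the summary line is invented.

```python
import io
import sys

def write_results(scored, out=sys.stdout, err=sys.stderr):
    """Write (string, score) pairs in descending score order.

    Results go to `out` as tab-separated lines; a one-line summary
    goes to `err`, so the two can be redirected separately.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    for string, score in ranked:
        out.write(f"{string}\t{score:.2f}\n")
    err.write(f"{len(ranked)} strings compared\n")
    return ranked

# In-memory streams stand in for standard output and standard error.
out, err = io.StringIO(), io.StringIO()
ranked = write_results([("sea", 1.2), ("whale", 13.86), ("the", 0.0)], out, err)
```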
Comparisons tell you whether there is more of this here or less of that there. Knowing that individual word forms in one text occur more or less often than in another text may help characterize some generic differences between those texts. Statistics on how often the words occur add rigor and provide a framework for judging whether the observed differences are likely or unlikely to have occurred by chance, and so deserve further attention and interpretation.
CompareStringCounts allows you to compare the frequencies of word occurrences in two texts and obtain a statistical measure of the significance of the differences. CompareStringCounts uses the log-likelihood ratio G^{2}, also known as Dunning's log-likelihood, as a measure of difference. To compute G^{2}, CompareStringCounts constructs a two-by-two contingency table of frequencies for each word.
                            Analysis Text   Reference Text   Total
Count of word form          a               b                a+b
Count of other word forms   c-a             d-b              c+d-a-b
Total                       c               d                c+d
The value of "a" is the number of times the word occurs in the analysis text. The value of "b" is the number of times the word occurs in the reference text. The value of "c" is the total number of words in the analysis text. The value of "d" is the total number of words in the reference text.
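To make the table concrete, here is a small Python sketch that fills in every cell from the four values a, b, c, and d defined above. The function name `contingency_table` and the example numbers are hypothetical; the "other word forms" row is obtained by subtracting the word's own counts from the text totals.

```python
def contingency_table(a, b, c, d):
    """Return the table as rows of [analysis, reference, total].

    a, b: count of the word in the analysis / reference text
    c, d: total number of words in the analysis / reference text
    """
    return [
        [a, b, a + b],                  # count of word form
        [c - a, d - b, c + d - a - b],  # count of other word forms
        [c, d, c + d],                  # totals
    ]

# Hypothetical counts: the word occurs 7 times in a 142-word analysis
# text and 3 times in a 200-word reference text.
table = contingency_table(7, 3, 142, 200)
```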
Given this contingency table, CompareStringCounts calculates the log-likelihood ratio statistic G^{2} to assess the size and significance of the difference of a word's frequency of use in the two texts. The log-likelihood ratio measures the discrepancy of the observed word frequencies from the values which we would expect to see if the word frequencies (by percentage) were the same in the two texts. The larger the discrepancy, the larger the value of G^{2}, and the more statistically significant the difference between the word frequencies in the texts. Simply put, the log-likelihood value tells us how much more likely it is that the frequencies are different than that they are the same.
The log-likelihood value is computed as the sum over all terms of the form "O * ln(O/E)" where "O" is the observed value of a contingency table entry, "E" is the expected value under a model of homogeneity for frequencies for the two texts, and "ln" is the natural log. If the observed value is zero, we ignore that table entry in computing the total. CompareStringCounts calculates the log-likelihood value G^{2} for each two-by-two contingency table as follows.
E1=c*(a+b)/(c+d)
E2=d*(a+b)/(c+d)
G^{2} = 2*((a*ln(a/E1)) + (b*ln(b/E2)))
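The formulas above translate fairly directly into Python. The sketch below is an illustration only: the function name `log_likelihood` is hypothetical, and the code is not taken from the CompareStringCounts source. It follows the rule stated earlier that a term with an observed value of zero is left out of the sum.

```python
import math

def log_likelihood(a, b, c, d):
    """Compute G^2 from the word counts (a, b) and text totals (c, d)."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference text
    total = 0.0
    if a > 0:  # skip the term when the observed count is zero
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2.0 * total

# Same relative frequency in both texts: no discrepancy.
same = log_likelihood(10, 10, 100, 100)

# Word occurs 10 times in the analysis text and never in the reference
# text; E1 = 5, so G^2 = 2 * 10 * ln(10/5) = 20 * ln(2).
different = log_likelihood(10, 0, 100, 100)
```

When the word occurs at the same rate in both texts, the observed and expected counts coincide and G^{2} is zero. The value grows as the observed counts move away from the expected ones.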
To determine the statistical significance of G^{2}, we refer the G^{2} value to the chi-square distribution with one degree of freedom. The significance value tells you how often a G^{2} as large as the one CompareStringCounts computed could occur by chance. For example, a log-likelihood value of 6.63 should occur by chance only about one time in a hundred. This means the significance of a G^{2} value of 6.63 is 0.01.
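For one degree of freedom, the upper tail probability of the chi-square distribution can be written with the complementary error function as erfc(sqrt(G^{2}/2)), which needs only the Python standard library. The sketch below uses that identity; the function name `g2_significance` is hypothetical, and this is not necessarily how CompareStringCounts computes the value.

```python
import math

def g2_significance(g2):
    """Upper tail probability of chi-square with one degree of freedom."""
    g2 = max(g2, 0.0)  # guard against tiny negative values from rounding
    return math.erfc(math.sqrt(g2 / 2.0))

p_one_percent = g2_significance(6.63)   # close to 0.01
p_five_percent = g2_significance(3.84)  # close to 0.05
p_no_difference = g2_significance(0.0)  # exactly 1.0
```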
Ted Dunning's paper discusses the use of the log-likelihood test for general textual analysis.
Dunning, Ted. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, Volume 19, number 1, pp. 61-74.
Rayson and Garside discuss the use of the log-likelihood test for comparing corpora.
Rayson, P. and Garside, R. 2000. Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000). 1-8 October 2000, Hong Kong.
Academic Technologies and Research Services,
NU Library 2East, 1970 Campus Drive Evanston, IL 60208. 
