CompareStringCounts
compares the counts of strings, such as spellings or part-of-speech
tags, in two columnar files.
Usage:
comparestringcounts analysis.tab reference.tab
where
- analysis.tab is an input tab-separated file of strings and counts
for an analysis text.
- reference.tab is an input tab-separated file of strings and counts
for a reference text.
The analysis.tab and reference.tab files contain strings and
counts of those strings compiled from two texts or corpora.
Both files contain two tab-separated columns.
The first column is a string.
The second column contains the count of the number
of times that string occurred in the associated text.
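As an illustration of the input format described above, the following is a minimal Python sketch that reads two-column, tab-separated lines into a dictionary. The function name read_counts and the sample data are hypothetical, and summing the counts of a repeated string is an assumption, not documented behavior of CompareStringCounts.

```python
import io


def read_counts(lines):
    """Parse 'string<TAB>count' lines into a dict mapping string -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        string, count = line.split("\t")
        # Assumption: if a string appears more than once, sum its counts.
        counts[string] = counts.get(string, 0) + int(count)
    return counts


# Made-up sample standing in for the contents of analysis.tab.
sample = io.StringIO("the\t120\nwhale\t15\n")
analysis_counts = read_counts(sample)
```

The same function could be applied to an open file handle for reference.tab, since it only needs an iterable of lines.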
The output contains seven tab-separated columns,
sorted in descending order by log-likelihood value.
One line of output appears for each string in the
analysis text.
- The first column contains the string. This may be
a spelling, a lemma, a part of speech, a spelling bigram,
or any other string of interest.
- The second column contains a "+" when the string is overused in
the analysis text with respect to the reference text, a "-" when the
string is underused, and a blank when the string is used the same
amount in both texts.
- The third column contains Dunning's log-likelihood value.
- The fourth column shows the relative frequency of occurrence of the string
in the analysis text as fractional parts per ten thousand.
- The fifth column shows the relative frequency of occurrence of the string in
the reference text as fractional parts per ten thousand.
- The sixth column shows the number of times the string
occurred in the analysis text.
- The seventh column shows the number of times the string
occurred in the reference text.
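The second, fourth, and fifth columns can be sketched in Python as follows. This is a sketch under stated assumptions, not the program's actual code: the function names are hypothetical, and it assumes that overuse and underuse are decided by comparing the relative frequencies of the string in the two texts.

```python
def usage_marker(a, b, c, d):
    """Return '+', '-', or ' ' for a string seen a times in an analysis
    text of c words and b times in a reference text of d words."""
    # Assumption: compare relative frequencies a/c and b/d.
    # Cross-multiplying keeps the comparison in exact integer arithmetic.
    left, right = a * d, b * c
    if left > right:
        return "+"
    if left < right:
        return "-"
    return " "


def per_ten_thousand(count, total):
    """Relative frequency expressed as parts per ten thousand."""
    return 10000.0 * count / total


# Made-up counts: 15 of 1000 words versus 10 of 2000 words.
marker = usage_marker(15, 10, 1000, 2000)
analysis_freq = per_ten_thousand(15, 1000)
reference_freq = per_ten_thousand(10, 2000)
```

With these made-up counts the string occurs 150 times per ten thousand words in the analysis text and 50 times per ten thousand in the reference text, so it is marked as overused.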
These results are written to standard output, which can be
redirected to a file. A brief summary of the analysis is written
to standard error.
Statistical Background
Comparisons tell you whether there is more of this
here or less of that there. Knowing that individual word forms in one
text occur more or less often than in another text may help
characterize some generic differences between those texts. Statistics
on how often the words occur add rigor and provide a framework for
judging whether the observed differences are likely or unlikely to
have occurred by chance, and so deserve further attention and
interpretation.
Log-likelihood for comparing texts
CompareStringCounts allows you to compare the
frequencies of word occurrences in two texts and obtain a statistical
measure of the significance of the differences. CompareStringCounts
uses the log-likelihood ratio G2, also
known as Dunning's Log-Likelihood, as a measure of difference. To
compute G2, CompareStringCounts constructs a two-by-two
contingency table of frequencies for each word.
|                           | Analysis Text | Reference Text | Total   |
|---------------------------|---------------|----------------|---------|
| Count of word form        | a             | b              | a+b     |
| Count of other word forms | c-a           | d-b            | c+d-a-b |
| Total                     | c             | d              | c+d     |
The value of "a" is the number of times
the word occurs in the analysis text. The value of "b" is
the number of times the word occurs in the reference text. The value
of "c" is the total number of words in the analysis text.
The value of "d" is the total number of words in the
reference text.
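To make the definitions of a, b, c, and d concrete, here is a small Python sketch that builds the two-by-two table with its marginal totals. The function name contingency_table and the example numbers are hypothetical.

```python
def contingency_table(a, b, c, d):
    """Return the table rows as [analysis, reference, total] lists.

    a, b: occurrences of the word in the analysis and reference texts.
    c, d: total number of words in the analysis and reference texts.
    """
    return [
        [a, b, a + b],                  # count of word form
        [c - a, d - b, c + d - a - b],  # count of other word forms
        [c, d, c + d],                  # column totals
    ]


# Made-up example: a word seen 15 times in 1000 words and
# 10 times in 2000 words.
table = contingency_table(15, 10, 1000, 2000)
```

Each of the first two rows sums to its total column, and the first two rows add up to the last row, which is a quick consistency check on the inputs.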
Given this contingency table, CompareStringCounts
calculates the log-likelihood ratio statistic G2
to assess the size and significance of the difference of a word's
frequency of use in the two texts. The log-likelihood ratio measures
the discrepancy of the observed word frequencies from the values
which we would expect to see if the word frequencies (by percentage)
were the same in the two texts. The larger the discrepancy, the
larger the value of G2, and the more statistically
significant the difference between the word frequencies in the texts.
Simply put, the log-likelihood value tells us how much more likely it
is that the frequencies are different than that they are the same.
The log-likelihood value is computed as twice the sum of
terms of the form "O * ln(O/E)" where "O" is
the observed value of a contingency table entry, "E" is the
expected value under a model of homogeneity for frequencies for the
two texts, and "ln" is the natural log. If the observed
value is zero, we ignore that table entry in computing the total.
CompareStringCounts calculates the log-likelihood value G2
for each two-by-two contingency table as follows.
E1 = c*(a+b)/(c+d)
E2 = d*(a+b)/(c+d)
G2 = 2*((a*ln(a/E1)) + (b*ln(b/E2)))
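The formulas for E1, E2, and G2 translate directly into code. The following Python function is a minimal sketch of that calculation, not the program's own source; the name log_likelihood and the example counts are hypothetical. Terms with an observed count of zero are skipped, as described in the text.

```python
import math


def log_likelihood(a, b, c, d):
    """Compute G2 from word counts a, b and text sizes c, d."""
    # Expected counts if the word were equally frequent in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    # Skip entries whose observed count is zero.
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2.0 * total


# Made-up example: 15 of 1000 words versus 10 of 2000 words.
g2 = log_likelihood(15, 10, 1000, 2000)  # approximately 7.42
```

When the relative frequencies are identical, for example 10 of 1000 words and 20 of 2000 words, the observed and expected counts agree and G2 is zero.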
To determine the statistical significance of G2,
we refer the G2 value to the chi-square distribution with
one degree of freedom. The significance value tells you how often a
G2 as large as the one CompareStringCounts computed could
occur by chance. For example, a log-likelihood value of 6.63 should
occur by chance only about one in a hundred times. This means the
significance of a G2 value of 6.63 is 0.01.
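For one degree of freedom, the upper-tail probability of the chi-square distribution has a closed form in terms of the complementary error function, so the significance can be sketched in Python with the standard library alone. The function name chi_square_p_value is hypothetical, and this is an illustration of the lookup rather than the method CompareStringCounts itself uses.

```python
import math


def chi_square_p_value(g2):
    """Upper-tail probability of a chi-square variable with one degree
    of freedom, i.e. the chance of a value at least as large as g2."""
    # For 1 degree of freedom: P(X >= x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(g2 / 2.0))


p = chi_square_p_value(6.63)  # close to 0.01
```

The same function gives a value close to 0.05 for a G2 of 3.84, the other commonly quoted threshold for one degree of freedom.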
References
Ted Dunning's paper discusses the use of the
log-likelihood test for general textual analysis.
Dunning, Ted. 1993. Accurate Methods for the
Statistics of Surprise and Coincidence. Computational
Linguistics, Volume 19, number 1, pp. 61-74.
Rayson and Garside discuss the use of the
log-likelihood test for comparing corpora.
Rayson, P. and Garside, R. 2000. Comparing
corpora using frequency profiling. In Proceedings of the
workshop on Comparing Corpora, held in conjunction with the
38th annual meeting of the Association for Computational Linguistics
(ACL 2000). 1-8 October 2000, Hong Kong.