edu.northwestern.at.utils.corpuslinguistics.textsegmenter.c99
Class C99

java.lang.Object
  extended by edu.northwestern.at.utils.corpuslinguistics.textsegmenter.c99.C99

public class C99
extends java.lang.Object

Choi's C99 algorithm for linear text segmentation

Author:
Freddy Choi, Philip R. Burns. Modified for integration in MorphAdorner.

Use of this code is free for academic, educational, research, and other non-profit uses only.


Nested Class Summary
protected static class C99.Region
          Text segment region.
 
Constructor Summary
C99()
           
 
Method Summary
protected static int[] boundaries(double[][] m, int n)
          Find density maximizing boundaries for regions in a similarity matrix.
protected static ContextVector[] normalize(java.lang.String[][] document, ContextVector tf, StopWords stopWords, Stemmer stemmer)
          Produce stem frequency tables for a tokenized document.
protected static ContextVector[] normalize(java.lang.String[][] document, StopWords stopWords, Stemmer stemmer)
          Produce stem frequency tables for a tokenized document.
protected static double[][] rank(double[][] f, int maskSize)
          Apply hard ranking to matrix using a mask.
static java.lang.String[][][] segment(java.lang.String[][] document, int n, int s, StopWords stopWords, Stemmer stemmer)
          Segment document into coherent topic segments.
static java.lang.String[][][] segmentW(java.lang.String[][] document, int n, int s, StopWords stopWords, Stemmer stemmer)
          Segment document into coherent topic segments.
protected static double[][] similarity(ContextVector[] v)
          Given context vectors, compute the similarity matrix.
protected static double[][] similarity(ContextVector[] v, EntropyVector entropy)
          Given context vectors, compute the similarity matrix.
protected static java.lang.String[][][] split(java.lang.String[][] text, int[] boundaries)
          Split text into segment blocks given topic boundaries.
protected static double[][] sum(double[][] rankMatrix)
          Compute sum of rank matrix.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

C99

public C99()
Method Detail

boundaries

protected static int[] boundaries(double[][] m,
                                  int n)
Find density maximizing boundaries for regions in a similarity matrix.

Parameters:
m - Similarity matrix.
n - Number of regions to find. If n = -1, the algorithm will determine the number of regions.
Returns:
Boundaries of regions in selection order.
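
The idea of a density maximizing boundary can be illustrated with a single split. The sketch below is not the C99 implementation; it assumes density means the sum of the values inside the square regions along the diagonal divided by the total area of those regions, and it only finds one boundary. The class name DensitySplitSketch and the methods regionSum and bestSplit are hypothetical.

```java
final class DensitySplitSketch {

    /** Sum of m[i][j] for i and j in [from, to), i.e. one square region on the diagonal. */
    static double regionSum(double[][] m, int from, int to) {
        double total = 0.0;
        for (int i = from; i < to; i++) {
            for (int j = from; j < to; j++) {
                total += m[i][j];
            }
        }
        return total;
    }

    /**
     * Return the boundary b (1 <= b < n) that maximizes the assumed density
     * (sum inside both regions) / (area of both regions), or -1 if n < 2.
     */
    static int bestSplit(double[][] m) {
        int n = m.length;
        int best = -1;
        double bestDensity = Double.NEGATIVE_INFINITY;
        for (int b = 1; b < n; b++) {
            double inside = regionSum(m, 0, b) + regionSum(m, b, n);
            double area = (double) b * b + (double) (n - b) * (n - b);
            double density = inside / area;
            if (density > bestDensity) {
                bestDensity = density;
                best = b;
            }
        }
        return best;
    }
}
```

Repeating such a split on the resulting regions, and recording each chosen boundary as it is found, would yield boundaries in selection order rather than in document order.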

normalize

protected static ContextVector[] normalize(java.lang.String[][] document,
                                           StopWords stopWords,
                                           Stemmer stemmer)
Produce stem frequency tables for a tokenized document.

Parameters:
document - Tokenized document.
stopWords - Stop words.
stemmer - Stemmer.
Returns:
Context vectors of stem frequencies.

normalize

protected static ContextVector[] normalize(java.lang.String[][] document,
                                           ContextVector tf,
                                           StopWords stopWords,
                                           Stemmer stemmer)
Produce stem frequency tables for a tokenized document.

Parameters:
document - Tokenized document.
tf - Term frequencies in document.
stopWords - Stop words.
stemmer - Stemmer.
Returns:
Context vectors of stem frequencies.
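
As a rough illustration of what a stem frequency table per text block looks like, the sketch below uses plain Java collections in place of the MorphAdorner types. The stand-ins are assumptions: a Set<String> replaces StopWords, a UnaryOperator<String> replaces Stemmer, and a Map<String, Integer> replaces ContextVector. Lowercasing tokens before the stop word check is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

final class NormalizeSketch {

    /**
     * Build one stem frequency table per text block.
     * document[i] is the token list of block i.
     */
    static List<Map<String, Integer>> normalize(
            String[][] document,
            Set<String> stopWords,
            UnaryOperator<String> stemmer) {
        List<Map<String, Integer>> tables = new ArrayList<>();
        for (String[] block : document) {
            Map<String, Integer> frequencies = new HashMap<>();
            for (String token : block) {
                String word = token.toLowerCase(Locale.ROOT);
                if (stopWords.contains(word)) {
                    continue; // stop words contribute nothing to the table
                }
                // Count the stem rather than the surface form.
                frequencies.merge(stemmer.apply(word), 1, Integer::sum);
            }
            tables.add(frequencies);
        }
        return tables;
    }
}
```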

rank

protected static double[][] rank(double[][] f,
                                 int maskSize)
Apply hard ranking to matrix using a mask.

Hard ranking replaces each matrix element with the proportion of neighboring elements, within a maskSize x maskSize mask, whose values it exceeds.

Parameters:
f - Matrix to which to apply hard ranking.
maskSize - Mask size.
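
The hard ranking described above can be sketched in a few lines. This is an illustration under stated assumptions, not the C99 source: the mask is assumed to be centered on the element being ranked, cells of the mask that fall outside the matrix are ignored, and "exceeds" is taken as strictly greater than. The class name HardRankSketch is hypothetical.

```java
final class HardRankSketch {

    /**
     * Replace each element with the fraction of its in-bounds neighbors,
     * inside a maskSize x maskSize window centered on it, that it strictly exceeds.
     */
    static double[][] rank(double[][] f, int maskSize) {
        int rows = f.length;
        int half = maskSize / 2;
        double[][] ranked = new double[rows][];
        for (int i = 0; i < rows; i++) {
            ranked[i] = new double[f[i].length];
            for (int j = 0; j < f[i].length; j++) {
                int exceeded = 0;
                int neighbors = 0;
                int top = Math.max(0, i - half);
                int bottom = Math.min(rows - 1, i + half);
                for (int y = top; y <= bottom; y++) {
                    int left = Math.max(0, j - half);
                    int right = Math.min(f[y].length - 1, j + half);
                    for (int x = left; x <= right; x++) {
                        if (y == i && x == j) {
                            continue; // the element is not its own neighbor
                        }
                        neighbors++;
                        if (f[i][j] > f[y][x]) {
                            exceeded++;
                        }
                    }
                }
                ranked[i][j] = neighbors == 0 ? 0.0 : (double) exceeded / neighbors;
            }
        }
        return ranked;
    }
}
```

Because each output value depends only on the local ordering of values, the ranked matrix is insensitive to the absolute scale of the input similarities.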


segment

public static java.lang.String[][][] segment(java.lang.String[][] document,
                                             int n,
                                             int s,
                                             StopWords stopWords,
                                             Stemmer stemmer)
Segment document into coherent topic segments.

Parameters:
document - Document text as list of elementary text blocks.
n - Number of topic segments desired. Set n = -1 to have the algorithm select the number of topic segments by monitoring the rate of increase in segment density.
s - Size of ranking mask. Must be an odd number >= 3.
stopWords - Stop words.
stemmer - Stemmer.
Returns:
Coherent topic segments.

segmentW

public static java.lang.String[][][] segmentW(java.lang.String[][] document,
                                              int n,
                                              int s,
                                              StopWords stopWords,
                                              Stemmer stemmer)
Segment document into coherent topic segments.

Parameters:
document - Document text as list of elementary text blocks.
n - Number of topic segments desired. Set n = -1 to have the algorithm select the number of topic segments by monitoring the rate of increase in segment density.
s - Size of ranking mask. Must be an odd number >= 3.
stopWords - Stop words.
stemmer - Stemmer.
Returns:
Coherent topic segments.

similarity

protected static double[][] similarity(ContextVector[] v)
Given context vectors, compute the similarity matrix.

Parameters:
v - Context vectors.
Returns:
Similarity matrix.

similarity

protected static double[][] similarity(ContextVector[] v,
                                       EntropyVector entropy)
Given context vectors, compute the similarity matrix.

Parameters:
v - Context vectors.
entropy - Entropy vector.
Returns:
Similarity matrix.
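
To make the shape of a similarity matrix concrete, the sketch below computes pairwise cosine similarity between frequency tables. The choice of cosine similarity is an assumption (this page does not name the measure), a Map<String, Integer> stands in for ContextVector, and the entropy-weighted overload is not covered. The class name CosineSimilaritySketch is hypothetical.

```java
import java.util.List;
import java.util.Map;

final class CosineSimilaritySketch {

    /** Cosine of the angle between two sparse frequency vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            double x = entry.getValue();
            normA += x * x;
            Integer y = b.get(entry.getKey());
            if (y != null) {
                dot += x * y;
            }
        }
        for (int y : b.values()) {
            normB += (double) y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Symmetric matrix whose entry [i][j] is the similarity of blocks i and j. */
    static double[][] similarity(List<Map<String, Integer>> v) {
        int n = v.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(v.get(i), v.get(j));
                matrix[i][j] = s;
                matrix[j][i] = s;
            }
        }
        return matrix;
    }
}
```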

split

protected static java.lang.String[][][] split(java.lang.String[][] text,
                                              int[] boundaries)
Split text into segment blocks given topic boundaries.

Parameters:
text - Source text.
boundaries - Boundaries.
Returns:
Topic segments.
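
A minimal sketch of splitting a tokenized text at topic boundaries follows. It rests on assumptions this page does not confirm: each boundary is taken to be the index of the first block of a new segment, boundaries are distinct and lie strictly between 0 and text.length, and a copy of the boundaries is sorted first because they may arrive in selection order. The class name SplitSketch is hypothetical.

```java
import java.util.Arrays;

final class SplitSketch {

    /**
     * Cut text into boundaries.length + 1 consecutive segments.
     * Each segment is a run of text blocks; each block is a token array.
     */
    static String[][][] split(String[][] text, int[] boundaries) {
        int[] sorted = boundaries.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        String[][][] segments = new String[sorted.length + 1][][];
        int start = 0;
        for (int k = 0; k < sorted.length; k++) {
            segments[k] = Arrays.copyOfRange(text, start, sorted[k]);
            start = sorted[k];
        }
        // The last segment runs from the final boundary to the end of the text.
        segments[sorted.length] = Arrays.copyOfRange(text, start, text.length);
        return segments;
    }
}
```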

sum

protected static double[][] sum(double[][] rankMatrix)
Compute sum of rank matrix.

Parameters:
rankMatrix - Rank matrix.
Returns:
Sum of rank matrix.