Palenque

Needles in a Giga-Stack

The ancient city of Palenque is both grand and mysterious. Some of the most fascinating ancient ruins of Mexico can be found here.


Our project is called Needles in a Giga-Stack. The problem consists of taking an arbitrarily large body of text (the GigaWord Corpus), extracting terms whose lengths fall between a given minimum and maximum, counting how many times each unique term repeats, and then ranking the terms to produce a top-R list. R, M, and N are supplied by the user. The ranking is based on the TF*IDF measure. For a given term X (which may consist of one or more words), TF*IDF(X) is defined as follows:

 

TF(X) = the number of times the term X occurs in the GigaWord Corpus

 

IDF(X) = log(D/DF(X)), where D is the total number of articles in the GigaWord Corpus, and DF(X) is the total number of articles that contain one or more occurrences of the term X. Assume that the log is base 2.

TF*IDF(X) = TF(X) * IDF(X)
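The definitions above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function names `extract_terms` and `top_terms`, the parameters `min_len`, `max_len`, and `r`, the whitespace tokenization, and the lowercasing are all assumptions made for the example, and it holds every count in memory, which would not scale to the full GigaWord Corpus.

```python
import heapq
import math
from collections import Counter


def extract_terms(tokens, min_len, max_len):
    """Yield every run of min_len..max_len consecutive tokens as a tuple."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def top_terms(articles, min_len, max_len, r):
    """Return the r highest-scoring (term, TF*IDF) pairs.

    `articles` is a list of strings, one per article (a hypothetical
    stand-in for however the corpus is actually read).
    """
    tf = Counter()  # TF(X): occurrences of X across the whole corpus
    df = Counter()  # DF(X): number of articles containing X at least once
    for text in articles:
        # Assumption: lowercase and split on whitespace to get words.
        tokens = text.lower().split()
        terms = list(extract_terms(tokens, min_len, max_len))
        tf.update(terms)
        df.update(set(terms))  # each article counts once per term

    d = len(articles)  # D: total number of articles
    scores = {t: tf[t] * math.log2(d / df[t]) for t in tf}
    return heapq.nlargest(r, scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    corpus = ["cat cat cat dog", "dog bird", "dog fish", "dog"]
    for term, score in top_terms(corpus, 1, 1, 3):
        print(" ".join(term), score)
```

In this toy corpus "cat" occurs three times in one of four articles, so its score is 3 * log2(4/1) = 6, while "dog" appears in every article and scores 0 because log2(4/4) = 0.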

To contact us:

E-mail: Josh <clark617 at d.umn.edu>

           Anthony <mill3206 at d.umn.edu>

           Robert <flai0014 at d.umn.edu>
