16 messages in org.apache.lucene.mahout-user
Subject: Centroid calculations with sparse vectors
From: Shashikant Kore (
Date: May 27, 2009 6:19:01 am


To calculate the centroid (say, in Canopy clustering) of a set of sparse vectors, all the non-zero weights for each term are added and then divided by the cardinality of the vector. The result is treated as the average weight of that term across all the vectors.

I have sparse vectors with a cardinality of 50,000+, but each vector has only a couple of hundred terms. When the centroid is calculated, only a few hundred documents with non-zero term weights contribute to the total weight of each term, but since that total is divided by the cardinality (50,000), the final weight is minuscule. As a result, small documents are marked as closer to the centroid because they have fewer terms in them. The clusters don't look "right."
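To make the effect concrete, here is a minimal sketch of the calculation as described above. It assumes Euclidean distance (the message does not say which distance measure is in use) and represents sparse vectors as plain Python dicts. The names `scaled_sum_centroid` and `euclidean` and the toy documents are hypothetical and are not Mahout code.

```python
import math

def scaled_sum_centroid(vectors, divisor):
    """Sum each term's weights over all vectors, then divide by `divisor`."""
    totals = {}
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: total / divisor for term, total in totals.items()}

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

# Toy data (assumption): two longer documents that share terms,
# and one short document that shares no terms with them.
docs = [
    {"apple": 1.0, "pear": 1.0, "plum": 1.0, "fig": 1.0},
    {"apple": 1.0, "pear": 1.0, "plum": 1.0, "kiwi": 1.0},
    {"lime": 1.0},
]

# Dividing by the cardinality (50,000) pushes every centroid weight toward zero.
centroid = scaled_sum_centroid(docs, divisor=50_000)

# With a near-zero centroid, each distance is roughly the document's own norm,
# so the shortest document comes out closest regardless of its content.
distances = [euclidean(d, centroid) for d in docs]
```

In this sketch the one-term document ends up nearest to the centroid even though it has nothing in common with the other two, which matches the symptom described above.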

I am wondering whether the term weights of the centroid should be calculated by considering only the non-zero elements. That is, if a term occurs in 10 vectors, then the weight of that term in the centroid is the average of those 10 weight values. I couldn't locate any literature that specifically discusses sparse vectors in centroid calculation. Any pointers are appreciated.
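The proposal can be sketched as follows, again with sparse vectors as plain dicts. The function name `nonzero_mean_centroid` and the sample data are hypothetical; this is only an illustration of the idea, not an existing Mahout routine.

```python
def nonzero_mean_centroid(vectors):
    """Average each term over only the vectors in which it is non-zero."""
    totals, counts = {}, {}
    for vec in vectors:
        for term, weight in vec.items():
            if weight != 0.0:
                totals[term] = totals.get(term, 0.0) + weight
                counts[term] = counts.get(term, 0) + 1
    return {term: totals[term] / counts[term] for term in totals}

# Toy data (assumption): "apple" occurs in two vectors, the others in one each.
vectors = [
    {"apple": 2.0, "pear": 4.0},
    {"apple": 4.0},
    {"plum": 3.0},
]

centroid = nonzero_mean_centroid(vectors)
# "apple" is averaged over the 2 vectors that contain it: (2.0 + 4.0) / 2 = 3.0
```

One consequence worth keeping in mind: under this rule a term that occurs in a single vector keeps its full weight in the centroid, so rare terms count as much as common ones.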

Thanks, --shashi