To calculate the centroid (say, in Canopy clustering) of a set of
sparse vectors, all the non-zero weights are added up for each term and
the sum is then divided by the cardinality of the vector. The result is
the average weight of a term across all the vectors.
I have sparse vectors with a cardinality of 50,000+, but each vector has
only a couple of hundred terms. While calculating the centroid, for
each term, only a few hundred documents with non-zero term weights
contribute to the total weight, but since the total is divided by the
cardinality (50,000), the final weight is minuscule. This results in
small documents being marked as closer to the centroid, as they have
fewer terms in them. The clusters don't look "right."
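To make the effect concrete, here is a minimal sketch. It assumes a Euclidean distance measure (the actual measure is not stated above) and uses made-up term names and weights. When every centroid weight is close to zero, the distance from a document to the centroid is roughly the document's own norm, so documents with fewer terms come out closer.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors stored as {term: weight} dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical centroid whose weights are tiny because each sum was divided by 50,000.
centroid = {"t%d" % i: 1e-4 for i in range(300)}

# Two hypothetical documents with unit weights: one short, one long.
short_doc = {"t%d" % i: 1.0 for i in range(10)}
long_doc = {"t%d" % i: 1.0 for i in range(200)}

d_short = euclidean_distance(short_doc, centroid)
d_long = euclidean_distance(long_doc, centroid)

# Both distances are dominated by the document's own norm,
# so the short document ends up closer to the centroid.
print(d_short, d_long)
```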
I am wondering whether the term weights of the centroid should be
calculated by considering only the non-zero elements. That is, if a term
occurs in 10 vectors, then the weight of the term in the centroid is the
average of these 10 weight values. I couldn't locate any literature that
specifically talks about the case of sparse vectors in centroid
calculation. Any pointers are appreciated.
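The two averaging schemes being compared can be sketched as follows. This is only an illustration under stated assumptions: sparse vectors are plain {term: weight} dicts, the function names are hypothetical, and neither function is taken from any clustering library.

```python
from collections import defaultdict

def centroid_fixed_divisor(vectors, divisor):
    """Sum each term's weights over all vectors, then divide by one fixed divisor."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: total / divisor for term, total in totals.items()}

def centroid_nonzero_mean(vectors):
    """Average each term's weight over only the vectors where it is non-zero."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for vec in vectors:
        for term, weight in vec.items():
            if weight != 0:
                totals[term] += weight
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

# Made-up documents for illustration only.
docs = [
    {"apple": 2.0, "pear": 4.0},
    {"apple": 4.0},
    {"plum": 3.0},
]

# Divisor of 50,000 mirrors the cardinality described above.
by_cardinality = centroid_fixed_divisor(docs, 50000)
by_nonzero = centroid_nonzero_mean(docs)

print(by_cardinality)
print(by_nonzero)
```

With the fixed divisor, every centroid weight shrinks toward zero. With the non-zero mean, "apple" gets the average of its two observed weights, but the result no longer reflects how many documents actually contain each term.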