From | Sent On | Attachments |
---|---|---|
Shashikant Kore | May 27, 2009 6:19 am | |
Jeff Eastman | May 27, 2009 7:29 pm | |
Shashikant Kore | May 27, 2009 11:52 pm | |
Ted Dunning | May 28, 2009 12:00 am | |
Sean Owen | May 28, 2009 12:07 am | |
Ted Dunning | May 28, 2009 12:13 am | |
Ted Dunning | May 28, 2009 12:18 am | |
Sean Owen | May 28, 2009 12:24 am | |
Shashikant Kore | May 28, 2009 12:30 am | |
Jeff Eastman | May 28, 2009 5:51 am | |
Ted Dunning | May 28, 2009 8:44 am | |
Shashikant Kore | May 28, 2009 10:56 pm | |
Jeff Eastman | May 29, 2009 7:36 am | |
Ted Dunning | May 29, 2009 12:30 pm | |
Shashikant Kore | Jun 1, 2009 6:12 am | |
Ted Dunning | Jun 1, 2009 11:41 am | |

Subject: | Centroid calculations with sparse vectors | |
---|---|---|

From: | Shashikant Kore (shas...@gmail.com) | |

Date: | May 27, 2009 6:19:01 am | |

List: | org.apache.lucene.mahout-user |

Hi,

To calculate the centroid (say, in Canopy clustering) of a set of sparse vectors, all the non-zero weights for each term are added and then divided by the cardinality of the vector. This gives the average weight of a term across all the vectors.

I have sparse vectors with a cardinality of 50,000+, but each vector has only a couple of hundred terms. While calculating the centroid, for each term, only a few hundred documents with non-zero term weights contribute to the total weight, but since the total is divided by the cardinality (50,000), the final weight is minuscule. This results in small documents being marked as closer to the centroid, because they have fewer terms in them. The clusters don't look "right."
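To make the scaling effect concrete, here is a minimal Python sketch of the computation as described above. It is not Mahout's code: the dict-based vector representation, the function name `centroid_with_divisor`, and the toy documents are all assumptions made for illustration. The second call, which divides by the number of vectors, is included only for comparison.

```python
from collections import defaultdict

def centroid_with_divisor(vectors, divisor):
    """Sum each term's weights over the sparse vectors, then divide by `divisor`.

    `vectors` is a list of dicts mapping term -> non-zero weight
    (a hypothetical stand-in for a sparse vector class).
    """
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: total / divisor for term, total in totals.items()}

# Three toy documents; terms that are absent have an implicit weight of zero.
docs = [
    {"apple": 2.0, "pear": 4.0},
    {"apple": 4.0},
    {"plum": 6.0},
]

# Dividing by the vector cardinality (50,000 here) shrinks every weight.
by_cardinality = centroid_with_divisor(docs, divisor=50_000)

# Dividing by the number of vectors instead, shown only for comparison.
by_count = centroid_with_divisor(docs, divisor=len(docs))

print(by_cardinality["apple"], by_count["apple"])
```

With these toy inputs the weight for "apple" differs by more than four orders of magnitude between the two divisors, which is the kind of shrinkage described in the paragraph above.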

I am wondering whether the term weights of the centroid should be calculated by considering only the non-zero elements. That is, if a term occurs in 10 vectors, then the weight of that term in the centroid is the average of those 10 weight values. I couldn't locate any literature that specifically discusses sparse vectors in centroid calculation. Any pointers are appreciated.
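The alternative proposed in the paragraph above can be sketched as follows. This is only an illustration of the idea under stated assumptions, not an existing Mahout routine: the function name `centroid_nonzero_average` and the dict-based sparse vectors are hypothetical.

```python
from collections import defaultdict

def centroid_nonzero_average(vectors):
    """Average each term's weight over only the vectors where it is non-zero."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for vec in vectors:
        for term, weight in vec.items():
            if weight != 0.0:  # skip explicitly stored zeros, if any
                sums[term] += weight
                counts[term] += 1
    return {term: sums[term] / counts[term] for term in sums}

# Toy documents as term -> weight dicts (hypothetical data).
docs = [
    {"apple": 2.0, "pear": 4.0},
    {"apple": 4.0},
    {"plum": 6.0},
]

centroid = centroid_nonzero_average(docs)
print(centroid)
```

Note that under this scheme a term that appears in a single vector keeps its full weight in the centroid, exactly as much as a term that appears in every vector with that weight, so the result is no longer the arithmetic mean of the vectors.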

Thanks, --shashi