| From | Sent On | Attachments |
|---|---|---|
| Stefan Henß | Feb 22, 2011 8:03 pm | |
| Stefan Henß | Feb 22, 2011 9:14 pm | |
| Bruce Dou | Feb 22, 2011 9:25 pm | |
| Sean Owen | Feb 23, 2011 12:28 am | |
| Isabel Drost | Feb 23, 2011 5:07 am | |
| Ted Dunning | Feb 23, 2011 9:09 am | |
| Ted Dunning | Feb 23, 2011 9:34 am | |
| Stefan Henß | Feb 23, 2011 10:52 pm | |
| Bruce Dou | Feb 23, 2011 11:11 pm | |
| Stefan Henß | Feb 23, 2011 11:57 pm | |
| Stefan Henß | Feb 24, 2011 12:36 am | |
| Stefan Henß | Mar 7, 2011 2:51 am | |
| Stefan Henß | Jun 9, 2011 1:16 pm | |
| Lance Norskog | Jun 10, 2011 6:16 pm | |
| Subject: | Automatically extracted Mahout FAQs | |
|---|---|---|
| From: | Stefan Henß (stef...@googlemail.com) | |
| Date: | Feb 22, 2011 9:14:35 pm | |
| List: | org.apache.lucene.mahout-user | |
Hi everybody,
I'm currently doing research for my bachelor thesis on how to automatically extract FAQs from unstructured data.
For this I've built a system that automatically performs the following steps:

- Load thousands of conversations from forums and mailing lists (ignoring any categories that already exist there).
- Build a categorization based solely on the conversations' texts (by clustering).
- Pick the best-modelled categories as the basis for one FAQ each.
- For each question (the first entry in a conversation), find the best reply among its answers.
- Select the most relevant and well-formatted question/answer pairs for each FAQ.

Most of the steps rely almost entirely on the output of the categorization step, which is obtained with a latent Dirichlet allocation (LDA) model.
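To make the role of the topic model more concrete, here is a minimal sketch in Python, not the system described in this message. It fits LDA with a small collapsed Gibbs sampler and then picks, for a question, the reply whose topic mixture is closest to the question's. The function names (`tokenize`, `lda_gibbs`, `cosine`, `best_reply`), the hyperparameters `alpha` and `beta`, and the cosine-similarity rule for choosing a reply are all assumptions made for illustration; the message does not say how the best reply is actually selected.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic tokens longer than 2 chars."""
    return [w for w in text.lower().split() if w.isalpha() and len(w) > 2]


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, hypothetical helper).

    docs is a list of token lists. Returns one topic mixture per document,
    each a list of num_topics probabilities that sums to 1.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]   # occurrences of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic mixtures.
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        thetas.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return thetas


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_reply(question_theta, reply_thetas):
    """Index of the reply whose topic mixture is most similar to the question's."""
    return max(
        range(len(reply_thetas)),
        key=lambda i: cosine(question_theta, reply_thetas[i]),
    )
```

In use, one would tokenize every message, call `lda_gibbs` on the whole collection, and pass the mixture of a conversation's first message together with the mixtures of its replies to `best_reply`. A real system would presumably also weigh formatting and relevance, as the last step in the list above suggests.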
For the evaluation part, I'd like to ask you to have a look at one or two FAQs and perhaps comment on how well the questions match the FAQ's title, how relevant they are, etc.
Here's the direct link to the Mahout FAQs: http://faqcluster.com/mahout-data
(There are some other interesting FAQs as well at http://faqcluster.com/)
Thanks for your help
Stefan





