|Stefan Henß||Feb 22, 2011 8:03 pm|
|Stefan Henß||Feb 22, 2011 9:14 pm|
|Bruce Dou||Feb 22, 2011 9:25 pm|
|Sean Owen||Feb 23, 2011 12:28 am|
|Isabel Drost||Feb 23, 2011 5:07 am|
|Ted Dunning||Feb 23, 2011 9:09 am|
|Ted Dunning||Feb 23, 2011 9:34 am|
|Stefan Henß||Feb 23, 2011 10:52 pm|
|Bruce Dou||Feb 23, 2011 11:11 pm|
|Stefan Henß||Feb 23, 2011 11:57 pm|
|Stefan Henß||Feb 24, 2011 12:36 am|
|Stefan Henß||Mar 7, 2011 2:51 am|
|Stefan Henß||Jun 9, 2011 1:16 pm|
|Lance Norskog||Jun 10, 2011 6:16 pm|
|Subject:||Re: Automatically extracted Mahout FAQs|
|From:||Stefan Henß (stef...@googlemail.com)|
|Date:||Feb 24, 2011 12:36:59 am|
On 24.02.2011 08:11, Bruce Dou wrote:
On Thu, Feb 24, 2011 at 2:52 PM, Stefan Henß <stef...@googlemail.com> wrote:
Currently the answer selection is quite simple. We assume that a sophisticated answer makes firm use of the domain's terminology. So if someone has a high density of terms like "mahout", "hadoop", "clustering", "svn", "classifier" in their response to a Mahout-related question, we hope they know what they are talking about and give straight pointers to solutions etc. A model of the domain's terminology is given by the cluster (bag of words) the question/answer was assigned to, so what we basically do is calculate the cosine similarity between that bag of words and the answer. High similarity means hopefully sophisticated. Of course there are some smaller additions, like decreasing the score if the reply comes from the same user who asked the question, but the similarity measure is the core idea.
I do not think the best answer can be found this way.
For example: Q. How to install Hadoop on Linux? A. <commands list>
The answer may not include the terms from the question; since the answer builds on the question, those terms are often omitted.
The terms are not compared with the terms in the question but with the FAQ the question is assigned to. If such a question is frequently asked, terms like "hadoop", "install", "linux", as well as terms from the answers such as "bin", "conf", ..., should have a high weight for that FAQ. A plain list of commands would then score very high due to the density of important keywords (as there is no noise etc.).
But I agree, there may still be (much?) better approaches to answer selection. As this is done as a bachelor thesis, though, time is limited and the focus is set, i.e. on how well one approach (LDA) is applicable to the whole task of FAQ extraction.
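To make the scoring idea above concrete, here is a minimal sketch assuming simple term-frequency bags of words; the tokenizer, the cluster term weights and the self-reply penalty value are illustrative assumptions, not the thesis code:

    import math
    import re
    from collections import Counter

    def tokenize(text):
        """Very rough tokenizer; a real system would also remove stopwords etc."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def cosine(a, b):
        """Cosine similarity between two term-frequency Counters."""
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def score_answer(answer_text, cluster_terms, same_author=False, penalty=0.5):
        """Score an answer by its similarity to the FAQ cluster's bag of words.

        cluster_terms: Counter of term weights for the cluster the question was
        assigned to. The 0.5 self-reply penalty is a made-up value.
        """
        score = cosine(Counter(tokenize(answer_text)), cluster_terms)
        return score * penalty if same_author else score

Picking the reply with the highest score would then give the candidate answer for the FAQ entry.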
For the categorization we use the question as well as all replies to it as one single document, which we then give to the clustering algorithm. Assuming the replies are not spam or similar, this gives a far more precise characterization of the terminology, simply due to the amount of text.
For categorization, the problem will be: do we define the terms ourselves, or generate them from the questions' content? If we generate them, there will be lots of noise.
Sure, there is a lot of noise and it's important to remove it (stopwords, detecting stack traces etc.). We have already observed quite stupid categories due to noise. But one of the questions of this research is how to automatically extract and present important information from a very large data set, and human-defined terms would already require this information to be known to some extent (and would also be time-consuming). That's why we are trying the fully automatic approach :)
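One way such noise filtering could look is sketched below; the stack-trace heuristic and the tiny stopword list are illustrative assumptions, not what the thesis actually does:

    import re

    # A tiny illustrative stopword list; a real one would be much larger.
    STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}

    # Lines that look like Java stack-trace frames,
    # e.g. "at org.apache.mahout.Foo.bar(Foo.java:42)"
    STACK_FRAME = re.compile(r"^\s*at\s+[\w.$]+\(.*\)\s*$")

    def strip_noise(text):
        """Drop stack-trace lines and stopwords before clustering/scoring."""
        lines = [l for l in text.splitlines() if not STACK_FRAME.match(l)]
        tokens = re.findall(r"[a-z0-9]+", " ".join(lines).lower())
        return [t for t in tokens if t not in STOPWORDS]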
On 23.02.2011 06:26, Bruce Dou wrote:
How do you find which answer is the best or most relevant? How do you do the categorization? Based on the terms in the question?
On Wed, Feb 23, 2011 at 1:15 PM, Stefan Henß <stef...@googlemail.com> wrote:
I'm currently doing research for my bachelor thesis on how to automatically extract FAQs from unstructured data.
For this I've built a system that automatically performs the following:
- Load thousands of conversations from forums and mailing lists (ignoring the categories there).
- Build a categorization based solely on the conversations' texts (by clustering).
- Pick the best-modelled categories as the basis for one FAQ each.
- For each question (the first entry in a conversation), find the best reply among its answers.
- Select the most relevant and well-formatted question/answer pairs for each FAQ.
Most of the steps rely almost completely on the data from the categorization step, which is obtained using the latent Dirichlet allocation (LDA) model.
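For the clustering step, a rough equivalent using scikit-learn's LDA implementation could look like the sketch below; the vectorizer settings, topic count and function name are assumptions, not the pipeline the thesis actually uses:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def cluster_conversations(conversations, n_topics=50):
        """conversations: list of strings, each one question plus all its replies
        concatenated into a single document (as described above)."""
        vectorizer = CountVectorizer(stop_words="english", max_df=0.5, min_df=2)
        term_matrix = vectorizer.fit_transform(conversations)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        doc_topics = lda.fit_transform(term_matrix)
        # Assign each conversation to its dominant topic; each topic is then a
        # candidate FAQ with a bag of characteristic terms.
        assignments = doc_topics.argmax(axis=1)
        return assignments, lda, vectorizer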
For the evaluation part I'd like to ask you to have a look at one or two FAQs and maybe give some comments on how well the questions match the FAQ's title, how relevant they are, etc.
Here's the direct link to the Mahout FAQs: http://faqcluster.com/mahout-data
(There are some other interesting FAQs as well at http://faqcluster.com/)
Thanks for your help