Subject: Re: Automatically extracted Mahout FAQs
From: Ted Dunning (ted.@gmail.com)
Date: Feb 23, 2011 9:34:27 am
List: org.apache.lucene.mahout-user

This is very nice work!

If you have achieved this level of accuracy without direct editing, then this is very impressive. In reading through the Mahout and Math questions, I noted a few issues with quoting and a few complete failures, but the good answers were very good. I think that the quoting issues could be improved by looking at the degree of string matching relative to the previous items in the thread. Small n-grams are very effective for this and avoid the need for full edit distance calculations. For the failed cases, even a small amount of community feedback would suffice to knock out the bad answers. I think that the ratio of high-quality answers to low-quality answers is definitely high enough to make the FAQs worth looking at. If the ratio were reversed, I think users would not find it worth the time to look.
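To make the n-gram suggestion concrete, here is a minimal sketch of how quoted lines might be detected by character trigram overlap with earlier messages in the thread. The function names, the trigram size, and the 0.8 threshold are all illustrative assumptions, not anything taken from the FAQ extraction system itself.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def quoted_fraction(line, earlier_messages, n=3):
    """Fraction of the line's n-grams that already occur in earlier messages."""
    grams = char_ngrams(line, n)
    if not grams:
        return 0.0
    seen = set()
    for message in earlier_messages:
        seen |= char_ngrams(message, n)
    return len(grams & seen) / len(grams)


def strip_quotes(reply, earlier_messages, threshold=0.8, n=3):
    """Drop reply lines that look copied from earlier messages.

    The threshold is a hypothetical tuning value: lines whose n-gram overlap
    with the earlier messages reaches it are treated as quoted text.
    """
    kept = []
    for line in reply.splitlines():
        if line.strip() and quoted_fraction(line, earlier_messages, n) >= threshold:
            continue
        kept.append(line)
    return "\n".join(kept)


earlier = ["How do I run k-means on my own data in Mahout?"]
reply = (
    "> How do I run k-means on my own data in Mahout?\n"
    "Convert your data to sequence files first."
)
print(strip_quotes(reply, earlier))
```

Because the comparison is set overlap rather than alignment, it tolerates quote markers, rewrapped lines, and small edits without computing an edit distance for every pair of lines.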

I do note that very few questions have been answered compared to the number that I have seen go by on the mailing list. Is that because you are being very cautious about keeping precision high?

Finally, some questions:

a) do you use any sort of measure to determine how well written the questions and answers are?

b) is this a dead-end school project or do you plan to continue with it?

On Tue, Feb 22, 2011 at 9:15 PM, Stefan Henß <stef@googlemail.com> wrote:

Hi everybody,

I'm currently doing research for my bachelor thesis on how to automatically extract FAQs from unstructured data.

For this I've built a system that automatically performs the following:

- Load thousands of conversations from forums and mailing lists (don't mind the categories there).
- Build a categorization based solely on the conversations' texts (by clustering).
- Pick the best modelled categories as the basis for one FAQ each.
- For each question (first entry in a conversation) find the best reply from its answers.
- Select the most relevant and well formatted question/answer pairs for each FAQ.

Most of the steps rely almost completely on the data from the categorization step, which is obtained using the latent Dirichlet allocation (LDA) model.

For the evaluation part, I'd like to ask you to have a look at one or two FAQs and maybe give some comments on how well the questions matched the FAQ's title, how relevant they were, etc.

Here's the direct link to the Mahout FAQs: http://faqcluster.com/mahout-data

(There are some other interesting FAQs as well at http://faqcluster.com/)

Thanks for your help