I'm currently doing research for my bachelor's thesis on how to
automatically extract FAQs from unstructured data.
For this I've built a system that automatically performs the following steps:
- Load thousands of conversations from forums and mailing lists (ignoring
the categories that already exist there).
- Build categorization solely based on the conversation's texts (by
- Pick the best-modelled categories as the basis for one FAQ each.
- For each question (the first entry in a conversation), find the best reply
among its answers.
- Select the most relevant and well-formatted question/answer pairs for
Most of these steps rely almost entirely on the output of the
categorization step, which is obtained using latent Dirichlet allocation (LDA).
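
To give a rough idea of what the categorization step does, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is not the code of my system: the function names (`lda_gibbs`, `top_words`), the hyperparameters (`alpha`, `beta`, `iterations`) and the four toy conversations are all made up for illustration, and a real run would use a proper library and real tokenization.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling.

    Returns (theta, topic_word, vocab): theta[d][k] is the estimated share
    of topic k in document d, topic_word[k][v] counts how often vocabulary
    word v was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Toy conversations (hypothetical, already tokenized).
conversations = [
    "install package error dependency install".split(),
    "dependency error package missing install".split(),
    "printer driver paper printer queue".split(),
    "queue printer paper jam driver".split(),
]

theta, topic_word, vocab = lda_gibbs(conversations, num_topics=2)
# Each conversation is filed under its dominant topic (its "category").
categories = [max(range(2), key=lambda k: row[k]) for row in theta]
keywords = top_words(topic_word, vocab)
```

Here `theta` is the per-conversation topic mixture, `categories` assigns each conversation to its strongest topic, and `keywords` lists the words that characterize each topic. Later steps could use `theta` as their input, for example to judge how well a category is modelled or how closely a reply matches its question.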
For the evaluation part, I'd like to ask you to have a look at one or
two FAQs and maybe comment on how well the questions match
the FAQ's title, how relevant they are, etc.