13 messages in org.apache.lucene.mahout-dev: classifier architecture needed
Subject: classifier architecture needed
From: Ted Dunning
Date: Jun 21, 2010 11:12:47 am

We are now beginning to have lots of classifiers in Mahout. The naive Bayes, complementary naive Bayes, and random forest grandfathers have been joined by my recent SGD code and Zhao Zhendong's prolific set of approaches for logistic regression and SVM variants.

All of these implementations have similar characteristics, and virtually none of them are interoperable.

Even worse, the model produced by a clustering system is really just like a model produced by a classifier, so we should count the clustering code as well, which increases the number of sources of incompatible classifiers even more. Altogether, we probably have a dozen ways of building classifiers.

I would like to start a discussion about a framework into which we can fit all of these approaches, in much the same way that the recommendations code has such nice pluggable properties.

As I see it, the opportunities for commonality (aka our current deficiencies) include:

- original input format reading

-- The naive Bayes code uses an ad hoc format similar to what Jason Rennie used for 20 Newsgroups. This code uses Lucene 3.0 style analyzers.

-- Zhao uses something a lot like SVMLight input format

-- The SGD code looks at CSV data

-- Drew wrote some Avro document code

-- Lucene has been used as a source of vectors for clustering

My summary here is that the Lucene analyzers look like they could be used very effectively for our purposes. We would need to write AttributeFilters that do two kinds of vectorization (random projection and dictionary-based). We also should have four standard input format parsers as examples (CSV, SVMLight, Vowpal Wabbit, and the current naive Bayes format).

We need something simple and general that subsumes all of these input use cases.
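
To make the idea of a shared input layer concrete, here is a minimal sketch in which every format parser produces the same record type. All names here (`Example`, `RecordParser`, `SvmLightParser`, `CsvParser`) are hypothetical and are not existing Mahout classes. The CSV parser assumes numeric fields with no quoting, and the Lucene analyzer integration discussed above is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical common record: a target label plus named numeric features.
final class Example {
    final String label;
    final Map<String, Double> features;

    Example(String label, Map<String, Double> features) {
        this.label = label;
        this.features = features;
    }
}

// Hypothetical interface; one implementation per input format.
interface RecordParser {
    Example parse(String line);
}

// SVMLight-style line: "<label> <index>:<value> ... # optional comment"
final class SvmLightParser implements RecordParser {
    @Override
    public Example parse(String line) {
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] tokens = body.split("\\s+");
        Map<String, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            features.put(tokens[i].substring(0, colon),
                         Double.parseDouble(tokens[i].substring(colon + 1)));
        }
        return new Example(tokens[0], features);
    }
}

// Plain CSV with a header line. Assumes no quoting and numeric
// values in every column except the label column.
final class CsvParser implements RecordParser {
    private final String[] header;
    private final int labelColumn;

    CsvParser(String headerLine, String labelName) {
        header = headerLine.split(",", -1);
        int found = -1;
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equals(labelName)) {
                found = i;
            }
        }
        if (found < 0) {
            throw new IllegalArgumentException("no column named " + labelName);
        }
        labelColumn = found;
    }

    @Override
    public Example parse(String line) {
        String[] fields = line.split(",", -1);
        Map<String, Double> features = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            if (i != labelColumn) {
                features.put(header[i].trim(), Double.parseDouble(fields[i].trim()));
            }
        }
        return new Example(fields[labelColumn].trim(), features);
    }
}
```

With something like this, a learner only ever sees `Example` objects, so adding a Vowpal Wabbit or naive Bayes format parser would not touch any training code.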

- conversion to vectors

-- SGD introduced a form of random projection

-- Naive Bayes has some dictionary-based conversions

-- Other stuff does this or that

This should be subsumed into the AttributeFilters that I mentioned above. We really just need random projection and Salton-style vector space models. Clearly, we should allow direct input of vectors as well, in case the user is producing them for us.
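
The two vectorization styles can be contrasted in a short sketch. The class `Vectorizers` and its methods are hypothetical names. The hashed encoder is a simple feature-hashing stand-in for the random-projection idea; whether it matches what the SGD code actually does is an assumption. The dictionary encoder uses raw term counts rather than any tf-idf weighting.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Vectorizers {
    private Vectorizers() {}

    // Dictionary-based: each previously unseen term is assigned the next
    // free index, so the dictionary grows with the vocabulary and must be
    // kept alongside the model. Returns a sparse index -> count map.
    static Map<Integer, Double> dictionaryEncode(List<String> terms,
                                                 Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String term : terms) {
            Integer index = dictionary.get(term);
            if (index == null) {
                index = dictionary.size();
                dictionary.put(term, index);
            }
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    // Hashed: each term is mapped straight to one of `dimension` slots by
    // its hash code. No dictionary is needed, at the cost of collisions.
    static double[] hashedEncode(List<String> terms, int dimension) {
        double[] vector = new double[dimension];
        for (String term : terms) {
            vector[Math.floorMod(term.hashCode(), dimension)] += 1.0;
        }
        return vector;
    }
}
```

The trade-off is the usual one: the dictionary version keeps features interpretable but needs a pass over the data (or a growing shared map), while the hashed version has a fixed size up front and needs no shared state.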

- command line option processing

We really need a simple way to integrate all of the input processing options into both new and old code.

- model storage

It would be lovely if we could instantiate a model from a stored form without even knowing what kind of learning produced the model. All of the classifiers and clustering algorithms should put out something that can be instantiated this way. I used Gson in the SGD code and found it pretty congenial, but I didn't encode the class of the classifier, nor did I provide a classifier abstract class. I don't know what k-means or Canopy clustering produce, nor random forests or naive Bayes, but I am sure that all of them are highly specific to the particular kind of model.

I don't know what is best here, but we definitely need something more common than what we have.
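
One possible shape for this, offered only as a sketch: store the implementing class name next to the parameters, and let a generic loader instantiate whatever class is named. The names `Model`, `LinearModel`, and `ModelStore` are hypothetical, and `java.util.Properties` is used here instead of Gson purely to keep the example dependency-free. The loader assumes every model class has a no-argument constructor and that stored files come from a trusted source, since it instantiates whatever class the file names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical common abstraction that every learner's output would implement.
interface Model {
    double score(double[] features);
    void save(Properties out);
    void load(Properties in);
}

// One concrete model type: a plain weight vector scored by dot product.
final class LinearModel implements Model {
    private double[] weights = new double[0];

    LinearModel() {}

    LinearModel(double[] weights) {
        this.weights = weights.clone();
    }

    @Override
    public double score(double[] features) {
        double sum = 0.0;
        int n = Math.min(weights.length, features.length);
        for (int i = 0; i < n; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    @Override
    public void save(Properties out) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < weights.length; i++) {
            if (i > 0) {
                joined.append(',');
            }
            joined.append(weights[i]);
        }
        out.setProperty("weights", joined.toString());
    }

    @Override
    public void load(Properties in) {
        String joined = in.getProperty("weights", "");
        String[] parts = joined.isEmpty() ? new String[0] : joined.split(",");
        weights = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            weights[i] = Double.parseDouble(parts[i]);
        }
    }
}

final class ModelStore {
    private ModelStore() {}

    // Writes the model's parameters plus the name of its implementing class.
    static String write(Model model) {
        Properties p = new Properties();
        model.save(p);
        p.setProperty("class", model.getClass().getName());
        StringWriter w = new StringWriter();
        try {
            p.store(w, null);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return w.toString();
    }

    // Rebuilds a model without the caller knowing its concrete type.
    static Model read(String stored) {
        try {
            Properties p = new Properties();
            p.load(new StringReader(stored));
            Class<? extends Model> type =
                Class.forName(p.getProperty("class")).asSubclass(Model.class);
            Model model = type.getDeclaredConstructor().newInstance();
            model.load(p);
            return model;
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load model", e);
        }
    }
}
```

A caller would then write `Model m = ModelStore.read(text)` and call `m.score(...)` regardless of whether the stored form came from SGD, naive Bayes, or a clustering run, provided each of those supplied its own `Model` implementation.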

What do others think?