

![]() | Start a set with this search |
![]() | Include this search in one of my sets |
![]() | Exclude this search from one of my sets |
![]() | Permalink to these results Paste this link in email or IM: |
| Atom feed for tracking future search results Paste this URL into your reader: |
3 messages in org.apache.lucene.mahout-devGSoC Evolutionary Algorithm Proposal| From | Sent On | Attachments |
|---|---|---|
| deneche abdelhakim | Mar 27, 2008 6:17 am | |
| Grant Ingersoll | Mar 28, 2008 4:19 am | |
| Isabel Drost | Mar 28, 2008 11:13 am |

![]() | Permalink for this message Paste this link in email or IM: |
![]() | Permalink for this thread Paste this link in email or IM: |
| Atom feed for this thread Paste this URL into your reader: |
| Subject: | GSoC Evolutionary Algorithm Proposal | Actions |
|---|---|---|
| From: | deneche abdelhakim (a_de...@yahoo.fr) | |
| Date: | Mar 27, 2008 6:17:50 am | |
| List: | org.apache.lucene.mahout-dev | |
I've written my proposal, and because I could no more change it after I submit
it to GSoc, I first post it here
if someone have some suggestions you are welcome.
I will wait until saturday morning to post it to the GSoC
************************************************************************************** Application for Summer of Code 2008 Mahout Project
Deneche Abdel Hakim
Codename Mahout.GA
I. Synopsis
I will add a genetic algorithm (GA) for binary classification on large datasets
to the Mahout project. To gain time I will use an existing framework for genetic
algorithms WatchMaker [WatchMaker] with an Apache Software License. I will also
add a parallelized measure that indicates the quality of classification rules on
a given dataset, this measure will be available independently of the GA. And if
I have enough time I will make the GA more generic and apply it on a different
problem (multiclass classification).
II. Project
A GA works by evolving a population of individuals toward a desired goal. To get
a satisfying solution, the GA needs to run thousands of iterations with hundreds
of individuals. For each iteration and individual the fitness is calculated, it
indicates the closeness of that individual to the desired solution. The main
advantage of GAs is there ability to find solution of problems given only a
fitness measure (and of course a sufficient CPU power), this is particularly
helpful when the problem is complex and no mathematical solution is available.
My primary goal is to implement the GA described in [GA]. It uses a fitness
function that is easy to implement and can benefit from the Map-Reduce strategy
to exploit distributed computing (when the training dataset is very large). It
will be available as ready to use tool (Mahout.GA) that discovers binary
classification rules for any given dataset. Concretely, the main program will
launch the GA using WatchMaker, each time the GA needs to evaluate the fitness
of the population it calls a specific class given by us, this class will
configure and launch a Hadoop Job on a distributed cluster.
My secondary goal is to make Mahout.GA problem independent, thus allowing us to
use it for different problems such as multiclass classification, optimization,
clustering. This will be done by implementing a ready to use generic fitness
function for WatchMaker that calls internally Hadoop. As a proof of concept I
will use it for multiclass classification (if I don't run out of time of
course!).
III. Profit for Mahout
1.The GA will be integrated with Mahout as a ready to use rule discovering tool
for binary classification;
2.Explore the integration of existing frameworks with Mahout, for example how to
design the program in a way that the framework libraries will not be needed in
the slave nodes (technically its feasible, but I still need to learn how to do
it);
3.The parallelized fitness function can be used independently of Mahout.GA. It’s
a good measure of the quality of binary classification rules;
4.Simplify the process of using Mahout.GA for other problems. The user will
still need to design the solutions representation and to implement a fitness
function, but all the Hadoop stuff should be hidden or at least made simpler;
5.Apply the generalized Mahout.GA to multiclass classification and write a
corresponding tutorial that explains how to use Mahout.GA to solve new problems.
IV. Success Criteria
Main goals
1.Implement the parallelized fitness function described in [GA] and validate
its results on a small dataset;
2.Implement Mahout.GA for binary classification rule discovery. A simpler (not
parallelized) version of this algorithm should also be implemented to
validate the results of Mahout.GA;
Secondary goals
1.Allow the parallelized fitness function to be used independently of
Mahout.GA;
2.Use Mahout.GA on a different problem (multiclass classification) and write a
corresponding tutorial.
V. Roadmap
[April, 14: accepted students known]
1.Familiarize myself with Hadoop
Modify one of the examples of Hadoop to simulate an iterative process. For
each iteration, a new Job is executed with different parameters, and its results
are imported back by the program.
2.Implement the GA without parallelism
a.Start by implementing the tutorial example that comes with WatchMaker;
b.Implement my own Individual and Fitness function classes;
c.Validate the algorithm using a small dataset, and find the parameters that
will give acceptable results.
3.Prepare whatever I may need in the development period
[May, 26 coding starts]
4.Implement the parallelized fitness function
a.Use Hadoop Map-Reduce to implement it [2 weeks];
b.Validate it on a small dataset [1 week].
5.Implement Mahout.GA
a.Write an intermediary component between WatchMaker and the parallelized
fitness function. This component takes a population, configures and launches a
Job, waits for its end, then returns the calculated fitness values [2 weeks];
b.Validate Mahout.GA by comparing its results with the GA without
parallelism [1 week].
[July, 7-14 mid term evaluation]
6.Generic Mahout.GA
a.Identify the components that are problem dependant, and make them less
dependant of Hadoop as much as possible [2 weeks];
b.Implement the components for the multiclass classification problem and
validate Mahout.GA on a given dataset [2 week];
c.Write a tutorial that explains how to use Mahout.GA to solve new problems
(in this case the multiclass classification problem) [in parallel with 5.b].
[August, 11 suggested pencil 'down' date]
Clean the code and arrange the documentation.
[August, 18 final evaluations]
Note that this plan may change given my interaction with my Mentor and the
Mahout community.
VI. Biography
I am a PhD student at the University Mentouri of Constantine. My primary
research goal is a framework to help build Intelligent Adaptive Systems. I am
still on my first year, and there is a good chance that I will be working on
Distributed Evolutionary Algorithms for the next three years.
For the purpose of my Master, I worked on Artificial Immune Systems. I applied
them to handwritten digits recognition [PatternRecognition] and Muliple Sequence
Alignement (bioinformatics) [BioInformatics]. I also built a feature selection
operator for Yale (but for lack of time I never published it), and participated
in an internship at the LIFL laboratory (Lille, France), where I implemented
several operators for a C++ evolutionary computation framework [ParadisEO].
In parallel to my Master, I worked as a freelance programmer for my University.
I developed a Java scholar management system using Eclipse, TortoiseSVN and many
open source libraries. I gained a good experience on project management (how to
make a realistic plan and stick to it) and open source development (how to
choose a good open source library, use it, and work around known bugs).
VII. References
[GA] Bojarczuk CC, Lopes HS, and Freitas AA. "Discovering comprehensible
classification rules using genetic programming: a case study in a medical
domain". Proc. Genetic and Evolutionary Computation Conference GECCO99, 953-958.
Orlando, FL, USA, July 1999.
[WatchMaker] https://watchmaker.dev.java.net/
[PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an
Artificial Immune System with a Clustering Method for Effective Pattern
Recognition", International Conference on Machine Intelligence ICMI’05, pp.
782-787, Tunis 2005.
[BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by Immune
Artificial System", ACS/IEEE International Conference on Computer Systems and
Applications AICCSA’07, Jordan 2007.
[ParadisEO]
http://paradiseo.gforge.inria.fr/index.php?n=Paradiseo.Home?from=Main.HomePage
----------------------------------------------------------------------------------------------
This proposal is inspired from the excellent one of Konstantin Kafer
[http://drupal.org/files/application.pdf]
*********************************************************************************************************
_____________________________________________________________________________
Envoyez avec Yahoo! Mail. Capacité de stockage illimitée pour vos emails.
http://mail.yahoo.fr







