Subject:Re: [sandbox] New sandbox component
From:Bruno P. Kinoshita (brun@yahoo.com.br)
Date:Oct 27, 2014 4:32:54 am
List:org.apache.commons.dev

Hi Benedikt!

> Just let me know if you need help with the bootstrapping of the new project.

Yes, please :)

> Maybe we should even announce this on announce@. There may be other projects
> interested in a library like this (for example Apache Tika [1])

Good idea! Should we drop a note there once the project has been created, or
after we already have some code in there?

Thanks!
Bruno

From: Benedikt Ritter <brit@apache.org>
To: Commons Developers List <de@commons.apache.org>; Bruno P. Kinoshita <brun@yahoo.com.br>
Sent: Monday, October 27, 2014 5:45 AM
Subject: Re: [sandbox] New sandbox component

No objections from my side. I think this is a good idea. Just let me know if you
need help with the bootstrapping of the new project. Maybe we should even
announce this on announce@. There may be other projects interested in a library
like this (for example Apache Tika [1]).

Benedikt

[1] http://tika.apache.org/

Hello all,

At the moment I'm working with data matching and record linkage, and had to port
some existing string comparison algorithms found in several open source projects
(fuzzy-search-tools, simmetrics, lingpipe, [lang], [codec]).

At that time I noticed LANG-591 [1], which suggests a more complex Levenshtein
distance algorithm. There are several other algorithms too (Damerau-Levenshtein,
Jaro, Jaro-Winkler, Jaccard, bitap, q-gram, Soundex, Metaphone). Instead of
trying to put them all in, say, [lang], I'd like to experiment with a new [text]
component in the sandbox, if there are no objections.

I will take a look at the existing code and its license, but most of these
algorithms have good wiki pages with pseudocode available, as well as academic
papers.

Maybe this component could be useful for other projects like [lang], Lucene,
larsga/Duke, and Talend Open Studio. And even though my initial use case for
this would be string comparison, I think it could support other use cases too.

Thoughts on this? Anyone else interested in such a component?

Thanks!
Bruno

[1] https://issues.apache.org/jira/browse/LANG-591