17 messages in org.apache.lucene.java-user
Thread: Re: Wikia search goes live today

From                 Sent On
Lukas Vlcek          Jan 7, 2008 4:48 am
Grant Ingersoll      Jan 7, 2008 5:13 am
Grant Ingersoll      Jan 7, 2008 8:21 am
Otis Gospodnetic     Jan 7, 2008 2:14 pm
Lukas Vlcek          Jan 7, 2008 11:48 pm
Lukas Vlcek          Jan 7, 2008 11:54 pm
Grant Ingersoll      Jan 8, 2008 4:46 am
Mike Klaas           Jan 8, 2008 11:59 am
Dennis Kubes         Jan 8, 2008 12:09 pm
Michael Stoppelman   Jan 8, 2008 12:11 pm
Lukas Vlcek          Jan 8, 2008 12:15 pm
Andrzej Bialecki     Jan 8, 2008 12:23 pm
Ryan McKinley        Jan 8, 2008 12:31 pm
Lukas Vlcek          Jan 8, 2008 12:36 pm
Lukas Vlcek          Jan 8, 2008 12:38 pm
Andrzej Bialecki     Jan 8, 2008 2:23 pm
Dennis Kubes         Jan 8, 2008 2:53 pm
Subject: Re: Wikia search goes live today
From: Dennis Kubes (kub@apache.org)
Date: Jan 8, 2008 2:53:12 pm
List: org.apache.lucene.java-user

Sorry about not responding to this before now, been a little busy :).

For those of you who don't know me, I am a committer on the Nutch project. I have been working with Wikia since early July and more actively since the beginning of November. Before Wikia I helped start another search engine based on Nutch called Visvo.com.

For the record, yes, Search Wikia is using and will be supporting Nutch/Hadoop/Lucene/Solr/HBase. It is the intention of Search Wikia to help develop these projects and their communities. We have no intention of keeping the changes we make "proprietary". Everything that Search Wikia develops (barring any user or personal data) will be considered open source and freely available. Any improvements made to the Apache projects will be immediately donated back to the community through the respective project.

Making search open and transparent is not just limited to source code. It is our intention to make the Search Wikia data freely open and available as well. This means that people will be able to download the crawl data, link data, content shards, and completed indexes. Also, the social networking functionality, named foowi, will become its own open source project (probably with an Apache license), and will be available to download, use, and improve.

And Search Wikia is not alone in this. Visvo.com, in coordination with Wikia, will be releasing all of its data and source code improvements to the community under an OSI-approved license, including a Python framework for managing Hadoop configurations on distributed machines, automating the fetching and indexing process, and managing search shards.

In terms of the Nutch logo: there are two standard Nutch installations and index farms at the following URLs. One is an index hosted at the ISC and the other is Visvo's open index. The ISC index has approximately 35M pages, while Visvo's index has a little over 50M pages.

http://search.isc.swlabs.org
http://open-index.visvo.com

The main Search Wikia site is hosted in a secure underground hosting facility in a bunker in Iowa (http://usshc.com/) and makes calls to these indexes. So when it shows cached pages and explain plans, those requests go to their respective indexes.

Both indexes are available for search by either browser-based or Web 2.0-based clients. We are currently using NUTCH-594 to serve results from these indexes in both XML and JSON formats. An example request searching for java would be:


http://search.isc.swlabs.org/nutchsearch?query=java&hitsPerSite=1&lang=en&hitsPerPage=10&type=json
http://open-index.visvo.com/nutchsearch?query=java&hitsPerSite=1&lang=en&hitsPerPage=10&type=json
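For anyone who wants to script against these endpoints, here is a minimal Python sketch that rebuilds the request URLs above. The parameter names and values are copied from the example requests; the function name `build_search_url` and its defaults are my own assumptions, not part of NUTCH-594, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Hosts copied from the message above; whether they are reachable is not checked here.
ISC_INDEX = "http://search.isc.swlabs.org"
VISVO_INDEX = "http://open-index.visvo.com"


def build_search_url(base, query, hits_per_site=1, lang="en",
                     hits_per_page=10, result_type="json"):
    """Build a /nutchsearch URL (hypothetical helper, not a Nutch API).

    Parameter names and their order mirror the example requests in the message.
    """
    params = [
        ("query", query),
        ("hitsPerSite", hits_per_site),
        ("lang", lang),
        ("hitsPerPage", hits_per_page),
        ("type", result_type),
    ]
    # urlencode escapes the query text, so multi-word queries are safe to pass.
    return f"{base}/nutchsearch?{urlencode(params)}"


if __name__ == "__main__":
    for base in (ISC_INDEX, VISVO_INDEX):
        print(build_search_url(base, "java"))
```

Keeping the parameters in an ordered list makes the output match the example requests exactly, which is handy when comparing against server logs.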

So we are busy working on getting the data available for download. Hopefully we should have a site set up within the next day or so. If anybody has any questions or would like to get some specific data, feel free to send me an email.

Dennis Kubes

Lukas Vlcek wrote:

I should note that this technique is probably not easily applicable to the current Lucene scoring mechanism without additional development.

After checking the Lucene API of ParallelReader, it seems that the star score could be stored in a different index which shares the same identifiers for the documents. Such an index could be small (partitioned into many small indices?) so the updates can be fast. Is that what you meant, Andrzej? ;-)
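To make the idea above concrete, here is a rough conceptual sketch in plain Python. It does not use Lucene or the ParallelReader API; the class names, the term-frequency scoring, and the `star_weight` formula are all assumptions made purely for illustration. It shows a large main index and a small star index that share document identifiers, so that stars can be updated without touching the main content.

```python
class MainIndex:
    """Large, rarely rebuilt index: doc id -> text (illustrative stand-in)."""

    def __init__(self, docs):
        self.docs = dict(docs)

    def search(self, term):
        """Return {doc_id: term frequency} for documents containing the term."""
        hits = {}
        for doc_id, text in self.docs.items():
            tf = text.lower().split().count(term.lower())
            if tf:
                hits[doc_id] = float(tf)
        return hits


class StarIndex:
    """Small side index keyed by the same doc ids; cheap to update."""

    def __init__(self):
        self.stars = {}

    def add_star(self, doc_id):
        self.stars[doc_id] = self.stars.get(doc_id, 0) + 1

    def get(self, doc_id):
        return self.stars.get(doc_id, 0)


def search_with_stars(main, stars, term, star_weight=0.5):
    """Rank hits by text score plus weighted star count (arbitrary formula)."""
    scored = {
        doc_id: score + star_weight * stars.get(doc_id)
        for doc_id, score in main.search(term).items()
    }
    # Highest combined score first; ties broken by doc id for stable output.
    return sorted(scored, key=lambda d: (-scored[d], d))


main = MainIndex({1: "java search engine", 2: "java java tutorial", 3: "python tutorial"})
stars = StarIndex()
before = search_with_stars(main, stars, "java")
for _ in range(3):
    stars.add_star(1)  # only the small star index changes
after = search_with_stars(main, stars, "java")
```

In this toy run, document 2 ranks first before any stars are added and document 1 ranks first afterwards, while the main index is never rebuilt. How well this maps onto real Lucene scoring is exactly the open question raised above.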

Anyway, I remember a different technique, which I once mentioned on the Lucene mailing list, inspired by the book Programming Collective Intelligence <http://www.oreilly.com/catalog/9780596529321/>. The idea is to store the score (maybe I should call it user preference) not in the index but in a neural net. One useful side effect is that this technique could score reasonably even documents without any stars (meaning documents "similar" to highly starred documents could score better even if they haven't been starred by any user yet).

Regards, Lukas

On 1/8/08, Andrzej Bialecki <ab@getopt.org> wrote:

Lukas Vlcek wrote:

So starring will be accommodated only during the indexing phase. Does that mean it will be a pretty static value, not a dynamically changing variable... correct? In other words, if I add my stars to some document, it won't affect the scoring immediately but only after an indexing cycle. Correct?

(I'm not involved in Wikia development). There are some ways to go about it even in the pure Lucene-land, so that the updates are fast without reindexing the main content. Hint: ParallelReader.