5 messages in org.apache.incubator.lucene-net-user
Re: Faceting in Lucene.Net

From              Sent On                 Attachments
Soormasher Singh  Dec 16, 2007 10:33 am
Jokin Cuadrado    Dec 18, 2007 1:43 am
Soormasher Singh  Dec 19, 2007 8:08 am
Jokin Cuadrado    Dec 19, 2007 9:04 am    .zip
Soormasher Singh  Dec 19, 2007 3:34 pm
Subject: Re: Faceting in Lucene.Net
From: Jokin Cuadrado (joki@gmail.com)
Date: Dec 19, 2007 9:04:34 am
List: org.apache.incubator.lucene-net-user
Attachments: SimpleFacets.zip - 2k

first of all, take a look at http://wiki.apache.org/lucene-java/BasicsOfPerformance

Some things that I have noticed:

- You are opening an IndexReader for every request. You should have a shared IndexReader: term vectors and QueryFilter results, for example, are cached in it, so if you reopen the IndexReader for every request you have to go back to the index to rebuild them.

- For categories with many distinct values, or searches with small result sets, you should use the collector approach. I attached a file; it's a custom translation to C# of the ideas behind faceted search in Solr. The usage is simple: once you have built the query, call (category is the field to facet on):

SimpleFacets.facet(query, lucene_searcher, "category", MaxResults)

It will return a collection of value - count entries. If you set MaxResults the collection will be limited to that size; if not, it will have one entry per category. The code may contain conditions that are specific to our index, so you might have to tweak it a bit. As you can see in the code, I left the class in the namespace Lucene.Net.Util, so you have to reference or import it.
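The first point can be sketched as a small holder that creates its instance once and hands the same one to every request. This is only an illustration under assumptions: SharedInstance, Value and Invalidate are hypothetical names that are not part of Lucene.Net or of the attached file, and the sketch does not close the old instance when it is replaced.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of Lucene.Net): keeps one shared instance,
// for example an IndexSearcher, and builds it only on first use.
public sealed class SharedInstance<T> where T : class
{
    private readonly Func<T> _factory;
    private Lazy<T> _current;

    public SharedInstance(Func<T> factory)
    {
        if (factory == null) throw new ArgumentNullException(nameof(factory));
        _factory = factory;
        _current = new Lazy<T>(_factory, LazyThreadSafetyMode.ExecutionAndPublication);
    }

    // Every caller gets the same instance until Invalidate() is called.
    public T Value
    {
        get { return Volatile.Read(ref _current).Value; }
    }

    // Call after the index has changed; the next access to Value builds a
    // fresh instance. Closing the previous instance is left to the caller.
    public void Invalidate()
    {
        Volatile.Write(ref _current,
            new Lazy<T>(_factory, LazyThreadSafetyMode.ExecutionAndPublication));
    }
}
```

With the code from the question, the factory passed to the constructor would be something like () => new IndexSearcher(indexloc), stored in a static field, with Invalidate() called whenever the index is updated or rebuilt.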

On Dec 19, 2007 5:09 PM, Soormasher Singh <soor@yahoo.com> wrote:

Thanks a lot for your response. My index isn't big: it has only around 100k to 120k documents at any given time. But it does get updated roughly every 2 hours (new documents are added). Then every night, the entire index is rebuilt to exclude the deleted documents.

I've tried both the approaches you mentioned, but the performance appears rather slow. Without faceting, I can do a search on this index (including some math calculations) in around 40 to 80 ms. When I include faceting for categories that are predefined (3 different fields with 2 or 3 distinct values), the query time jumps quite a bit, to around 200 ms.

So my typical query would be a boolean query with at least 4 queries, with faceting over 3 'static' fields (with 2 or 3 distinct values) and 2 'dynamic' fields (with thousands of distinct values). When I do faceting with a field that has tens of thousands of distinct values, the query time jumps drastically to over 1 second.

Here are some snippets of code:

With smaller categories:

SortedList sl = new SortedList();
string indexloc = ConfigurationManager.AppSettings["DocIndexLoc"];
IndexSearcher searcher = new IndexSearcher(indexloc);

foreach (string s in SearchFilters.SourceTypes())
{
    TermQuery tq = new TermQuery(new Term("SourceType", s));
    Filter f = new QueryFilter(tq);

    sl.Add(s, searcher.Search(this.bq, f).Length());
}

return sl;

With bigger categories:

IndexReader reader = searcher.GetIndexReader();
QueryFilter baseQueryFilter = new QueryFilter(this.bq);
BitArray baseBitSet = baseQueryFilter.Bits(reader);

if (Cache["cities"] == null)
{
    Cache["cities"] = Utilities.TopCities();
}

SortedList sl = new SortedList();

foreach (string s in (ArrayList)Application["cities"])
{
    TermQuery tq = new TermQuery(new Term("city", s));
    Filter f = new QueryFilter(tq);
    BitArray baCity = f.Bits(reader);

    baCity.And(baseBitSet);
    // do the cardinality function here
}

Am I doing something that is not so efficient? Any suggestions on boosting performance?

Thanks a lot for your help!

----- Original Message ----
From: Jokin Cuadrado <joki@gmail.com>
To: luce@incubator.apache.org
Sent: Tuesday, December 18, 2007 2:44:12 AM
Subject: Re: Faceting in Lucene.Net

Could you be more explicit about your needs? How many documents your index has, how many distinct categories there are, and the average number of hits per search would be enough to suggest an approach.

In my case I made a custom collector that counts the hits in every category, using a FieldCache entry to get each document's category efficiently instead of calling hit.getDocument (a performance killer).

This works better if your searches return small result sets and you have many categories.
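The counting step of that collector approach can be sketched without any Lucene.Net types. In this sketch valuesByDoc stands in for the per-document array that a field cache would supply, and hitDocIds for the document ids that a hit collector would receive; FacetCountSketch and CountByCategory are hypothetical names, not the code from my project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: count hits per category value.
public static class FacetCountSketch
{
    // hitDocIds:   document ids matched by the search (assumed input).
    // valuesByDoc: category value for each document id, indexed by id,
    //              standing in for a field cache array (assumed input).
    public static Dictionary<string, int> CountByCategory(
        IEnumerable<int> hitDocIds, string[] valuesByDoc)
    {
        if (hitDocIds == null) throw new ArgumentNullException(nameof(hitDocIds));
        if (valuesByDoc == null) throw new ArgumentNullException(nameof(valuesByDoc));

        var counts = new Dictionary<string, int>();
        foreach (int doc in hitDocIds)
        {
            // One array lookup per hit; no stored document is loaded.
            string value = valuesByDoc[doc];
            if (value == null) continue; // document has no value for the field

            int current;
            counts.TryGetValue(value, out current); // current is 0 if absent
            counts[value] = current + 1;
        }
        return counts;
    }
}
```

The work is proportional to the number of hits rather than to the number of categories, which is why this suits small result sets over fields with many distinct values.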

If you don't have many terms and your searches return many results, you can use QueryFilter.Bits to get the masks, AND them, and count the number of set bits in the result. The drawback is that the .NET implementation of BitArray doesn't have an efficient method for counting the set bits (cardinality in Java), but you could get one from the BitVector class in Lucene.Net (you must use your own implementation of a bit array, or use reflection to access the backing Int32 array m_array and count over it).

Here is the function to get the number of bits set to one in a BitArray:

Private Shared _bitsSetArray256 As Byte() = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8}

''' <summary>
''' return the number of bits on bitarray set to one
''' </summary>
''' <remarks></remarks>
Private Function Cardinality(ByVal bits As BitArray) As Int32
    Dim arr As UInt32()
    arr = bits.GetType().GetField("m_array", Reflection.BindingFlags.NonPublic Or Reflection.BindingFlags.Instance).GetValue(bits)
    Dim _count As Int32 = 0
    For i As Int32 = 0 To arr.Length - 1
        _count += _bitsSetArray256(arr(i) And &HFF) + _
                  _bitsSetArray256((arr(i) >> 8) And &HFF) + _
                  _bitsSetArray256((arr(i) >> 16) And &HFF) + _
                  _bitsSetArray256(arr(i) >> 24)
    Next i
    Return _count
End Function
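For C# callers, a rough equivalent is sketched below. It is a variation on the function above, not a translation of it: it copies the words out with BitArray.CopyTo instead of reading m_array through reflection, and it uses a parallel bit count instead of the 256-entry lookup table. BitArrayCardinality and IntersectionCount are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections;

// Hypothetical sketch: count set bits in a BitArray without reflection.
public static class BitArrayCardinality
{
    public static int Cardinality(BitArray bits)
    {
        if (bits == null) throw new ArgumentNullException(nameof(bits));

        // Copy the bits into 32-bit words through the public API.
        int[] words = new int[(bits.Length + 31) / 32];
        bits.CopyTo(words, 0);

        // Clear any padding bits beyond Length in the last word.
        int extra = bits.Length % 32;
        if (extra != 0)
        {
            words[words.Length - 1] &= (1 << extra) - 1;
        }

        int count = 0;
        foreach (int word in words)
        {
            uint v = unchecked((uint)word);
            // Parallel bit count: sum bits in pairs, nibbles, then bytes.
            v = v - ((v >> 1) & 0x55555555u);
            v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
            v = (v + (v >> 4)) & 0x0F0F0F0Fu;
            count += (int)(unchecked(v * 0x01010101u) >> 24);
        }
        return count;
    }

    // ANDs two masks of equal length and counts the result. The category
    // mask is cloned first so that a cached mask is not modified.
    public static int IntersectionCount(BitArray baseBits, BitArray categoryBits)
    {
        if (baseBits == null) throw new ArgumentNullException(nameof(baseBits));
        if (categoryBits == null) throw new ArgumentNullException(nameof(categoryBits));

        BitArray copy = new BitArray(categoryBits);
        copy.And(baseBits);
        return Cardinality(copy);
    }
}
```

The copy costs one extra allocation per call; whether that or the reflection access is cheaper on a given index would have to be measured.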

On Dec 16, 2007 7:33 PM, Soormasher Singh <soor@yahoo.com> wrote:

Hello All

I'm trying to use Lucene.Net for faceting (category counting and search refinement). I've not been able to find any examples of this using Lucene.Net. I've tried to use the approach used in Solr, but the performance hasn't been the greatest.

Can anyone please help me with this? Any code/examples of anyone using Lucene.Net for category counting/faceting?

Thanks a bunch!

