Subject: Re: Faceting in Lucene.Net
From: Soormasher Singh (soor@yahoo.com)
Date: Dec 19, 2007 3:34:58 pm
List: org.apache.incubator.lucene-net-user

Jokin,

Can't thank you enough... I implemented the changes you suggested, along with the Solr-style faceting using your class! My initial tests show an order-of-magnitude improvement in performance. I'll implement the rest of the changes and report back on the query timings.

Thanks again!

----- Original Message ----
From: Jokin Cuadrado <joki@gmail.com>
To: luce@incubator.apache.org
Sent: Wednesday, December 19, 2007 10:04:57 AM
Subject: Re: Faceting in Lucene.Net

First of all, take a look at http://wiki.apache.org/lucene-java/BasicsOfPerformance

Some things I have noticed:

- You are opening an IndexReader for every request. You should use a shared IndexReader: term vectors and QueryFilter results, for example, are cached in it, so if you reopen the IndexReader for every request, they have to be rebuilt from the index each time.

- For categories with many different values, or searches with small result sets, you should use the collector approach. I attached a file: it is a custom translation to C# of the ideas behind faceted search in Solr. Usage is simple. Once you have built the query, call (category is the field to facet on):

    SimpleFacets.facet(query, lucene_searcher, "category", MaxResults)

It will return a collection of value/count entries. If you set MaxResults, the collection is limited to that many entries; if not, it contains one entry per category. The code may contain conditions that are specific to our index, so you might have to tweak it a bit. As you can see in the code, I left the class in the namespace Lucene.Net.Util, so you have to reference or import it.
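
The first point above (sharing one reader or searcher across requests) could be sketched roughly as follows. This is only an illustration: SharedInstance is a hypothetical helper that is not part of Lucene.Net or of the attached file, and it assumes the shared object is safe to use from several requests at once.

```csharp
using System;

// Hypothetical helper (not part of Lucene.Net): builds an expensive object
// once and hands the same instance to every caller until it is invalidated.
public sealed class SharedInstance<T> where T : class
{
    private readonly Func<T> _factory;
    private readonly object _gate = new object();
    private T _current;

    public SharedInstance(Func<T> factory)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
    }

    // Returns the cached instance, creating it on first use.
    public T Get()
    {
        lock (_gate)
        {
            if (_current == null)
            {
                _current = _factory();
            }
            return _current;
        }
    }

    // Drops the cached instance so the next Get() builds a fresh one,
    // for example after the index has been updated. It does not close
    // or dispose the old instance; that is left to the caller.
    public void Invalidate()
    {
        lock (_gate)
        {
            _current = null;
        }
    }
}
```

With the code from this thread, the factory would presumably be something like () => new IndexSearcher(indexloc), held in a static field, with Invalidate() called after each index update.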

On Dec 19, 2007 5:09 PM, Soormasher Singh <soor@yahoo.com> wrote:

Thanks a lot for your response. My index isn't big: it has only around 100k to 120k documents at any given time. But it does get updated roughly every 2 hours (new documents are added). Then every night, the entire index is rebuilt to exclude the deleted documents.

I've tried both the approaches you mentioned, but the performance appears rather slow. Without faceting, I can do a search on this index (including some math calculations) in around 40 to 80 ms.

When I include faceting for categories that are predefined (3 different fields with 2 or 3 distinct values), the query time jumps quite a bit, to around 200 ms.

So my typical query would be a Boolean query with at least 4 sub-queries, with faceting over 3 'static' fields (with 2 or 3 distinct values) and 2 'dynamic' fields (with thousands of distinct values).

When I do faceting with a field that has tens of thousands of distinct values, the query time jumps drastically, to over 1 second. Here are some snippets of code:

With smaller categories:

    SortedList sl = new SortedList();
    string indexloc = ConfigurationManager.AppSettings["DocIndexLoc"];
    IndexSearcher searcher = new IndexSearcher(indexloc);

    foreach (string s in SearchFilters.SourceTypes())
    {
        TermQuery tq = new TermQuery(new Term("SourceType", s));
        Filter f = new QueryFilter(tq);

        sl.Add(s, searcher.Search(this.bq, f).Length());
    }

    return sl;

With bigger categories:

    IndexReader reader = searcher.GetIndexReader();
    QueryFilter baseQueryFilter = new QueryFilter(this.bq);
    BitArray baseBitSet = baseQueryFilter.Bits(reader);

    if (Cache["cities"] == null)
    {
        Cache["cities"] = Utilities.TopCities();
    }

    SortedList sl = new SortedList();

    foreach (string s in (ArrayList)Application["cities"])
    {
        TermQuery tq = new TermQuery(new Term("city", s));
        Filter f = new QueryFilter(tq);
        BitArray baCity = f.Bits(reader);

        baCity.And(baseBitSet);
        // do the cardinality function here
    }

Am I doing something that is not so efficient? Any suggestions on boosting performance?

Thanks a lot for your help!

----- Original Message ----
From: Jokin Cuadrado <joki@gmail.com>
To: luce@incubator.apache.org
Sent: Tuesday, December 18, 2007 2:44:12 AM
Subject: Re: Faceting in Lucene.Net

Could you be more explicit about your needs? Knowing how many documents your index has, how many different categories there are, and the average number of hits per search would be enough to suggest an approach.

In my case I wrote a custom collector that counts the hits in every category, using a FieldCache entry to get each document's value efficiently instead of calling hit.getDocument (a performance killer).

This works better if your searches return small result sets and you have many categories.
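
A minimal sketch of the counting step behind that collector approach, written without any Lucene.Net types so it stands alone. The names FacetCounter, matchingDocIds and valuesByDoc are hypothetical: in a real collector the document ids would arrive one by one from the search, and the per-document value array would come from the FieldCache. This is not the code in the attached file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FacetCounter
{
    // valuesByDoc[docId] holds the facet field's value for that document
    // (null when the document has no value). Looking it up is an array
    // access, so no stored document has to be loaded per hit.
    public static List<KeyValuePair<string, int>> Count(
        IEnumerable<int> matchingDocIds, string[] valuesByDoc, int maxResults)
    {
        var counts = new Dictionary<string, int>();
        foreach (int docId in matchingDocIds)
        {
            string value = valuesByDoc[docId];
            if (value == null)
            {
                continue;
            }
            counts.TryGetValue(value, out int current);
            counts[value] = current + 1;
        }

        // Highest counts first; ties are broken by value for stable output.
        IEnumerable<KeyValuePair<string, int>> ordered = counts
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal);

        // A non-positive maxResults means "return every category".
        if (maxResults > 0)
        {
            ordered = ordered.Take(maxResults);
        }
        return ordered.ToList();
    }
}
```

The cost grows with the number of hits rather than with the number of distinct category values, which is why this suits small result sets over fields with many values.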

If you do not have many terms and your searches return many results, you can use QueryFilter.Bits to get the bit masks, AND them, and count the number of set bits in the result. The drawback is that the .NET implementation of BitArray has no efficient method for counting the set bits (cardinality in Java), but you could borrow one from the BitVector class in Lucene.Net (you must either use your own implementation of BitArray, or use reflection to access the backing Int32 array m_array and count over it).

Here is the function to get the number of set bits in a BitArray:

    Private Shared _bitsSetArray256 As Byte() = { _
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, _
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, _
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, _
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, _
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, _
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, _
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, _
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, _
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, _
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, _
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, _
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, _
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, _
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, _
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, _
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8}

    ''' <summary>
    ''' return the number of bits on bitarray set to one
    ''' </summary>
    ''' <remarks></remarks>
    Private Function Cardinality(ByVal bits As BitArray) As Int32
        Dim arr As UInt32()
        arr = bits.GetType().GetField("m_array", Reflection.BindingFlags.NonPublic Or Reflection.BindingFlags.Instance).GetValue(bits)
        Dim _count As Int32 = 0
        For i As Int32 = 0 To arr.Length - 1
            _count += _bitsSetArray256(arr(i) And &HFF) + _
                      _bitsSetArray256((arr(i) >> 8) And &HFF) + _
                      _bitsSetArray256((arr(i) >> 16) And &HFF) + _
                      _bitsSetArray256(arr(i) >> 24)
        Next i
        Return _count
    End Function
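
For readers working in C#, the AND-and-count idea can also be sketched without reflection. The version below is an alternative under stated assumptions, not a translation of the function above: it copies the bits out through the public BitArray.CopyTo method and counts them with a clear-lowest-set-bit loop instead of the lookup table. The class name BitArrayFacets and its method names are made up for illustration.

```csharp
using System;
using System.Collections;

public static class BitArrayFacets
{
    // Counts the bits set to one. BitArray.CopyTo accepts an int[] target,
    // so the private m_array field does not have to be read via reflection.
    public static int Cardinality(BitArray bits)
    {
        int[] words = new int[(bits.Length + 31) / 32];
        bits.CopyTo(words, 0);

        int extraBits = bits.Length % 32; // bits used in the last word, 0 = all 32
        int count = 0;
        for (int i = 0; i < words.Length; i++)
        {
            uint v = unchecked((uint)words[i]);
            if (i == words.Length - 1 && extraBits != 0)
            {
                // Ignore padding bits beyond Length in the last word.
                v &= (1u << extraBits) - 1u;
            }
            while (v != 0)
            {
                v &= v - 1; // clears the lowest set bit
                count++;
            }
        }
        return count;
    }

    // Number of documents present in both bit sets. And() modifies the
    // instance it is called on, so it is applied to a copy here.
    // Both arrays must have the same Length, otherwise And() throws.
    public static int IntersectionCount(BitArray baseBits, BitArray categoryBits)
    {
        BitArray copy = new BitArray(categoryBits);
        copy.And(baseBits);
        return Cardinality(copy);
    }
}
```

Copying allocates a temporary array on every call, so for very large indexes the reflection or lookup-table route above may still be faster; that trade-off has not been measured here.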

On Dec 16, 2007 7:33 PM, Soormasher Singh <soor@yahoo.com> wrote:

Hello All

I'm trying to use Lucene.Net for faceting (category counting and search refinement). I've not been able to find any examples of this using Lucene.Net. I've tried to use the approach used in Solr, but the performance hasn't been the greatest.

Can anyone please help me with this? Any code/examples of anyone using Lucene.Net for category counting/faceting?

Thanks a bunch!

