| Subject: | Re: boosts for unstemmed matches (was Re: If you could have one feature in Lucene...) | |
|---|---|---|
| From: | Avi Rosenschein (aros...@gmail.com) | |
| Date: | Feb 24, 2010 1:38:12 pm | |
| List: | org.apache.lucene.java-user | |
On Wed, Feb 24, 2010 at 11:20 PM, Aaron Lav <as...@pobox.com> wrote:

> On Wed, Feb 24, 2010 at 10:18:27PM +0200, Avi Rosenschein wrote:
>
>> On Wed, Feb 24, 2010 at 3:42 PM, Grant Ingersoll <gsin...@apache.org> wrote:
>>
>>> What would it be?
>>
>> For scoring to take into account the non-analyzed token stream.
>>
>> That is, if a field is analyzed (stemmed, lowercased, maybe even stop words removed), that is fine for indexing. But tokens in the query that match the original form could still get a higher score than those that match only after analysis.
>
> You can get some of that effect by indexing stemmed and unstemmed forms, and letting IDF boost unstemmed results. (I picked this idea up from http://lingpipe-blog.com/2007/03/21/to-stem-or-not-to-stem/)

This is not quite the same (either in relevance or efficiency). I would like the infrastructure for this to be built into Lucene, so that queries and scorers could take advantage of it.

>> Also, this might allow a flexible, run-time decision about which analyzers to apply. For example, I might want stemming turned on for normal search, but not for a PhraseQuery.
>
> That's harder - different field names for the different analyses might work, but not for run-time decisions. I think the way Sun's Minion does it is morphologically-based query expansion (see http://blogs.sun.com/searchguy/entry/lightweight_morphology_vs_stemming), and you might be able to implement that via query rewriting.

Again, rather than forcing me to store a separate field for every possible type of query I might want to build, Lucene should be able to efficiently store the original information in a form conducive to use at query time.
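
For readers following the thread, the two-field workaround suggested above (index both stemmed and unstemmed forms and let IDF favor the unstemmed match) can be sketched outside Lucene. The following is a toy in plain Python, not Lucene code: `toy_stem`, `TwoFieldIndex`, the `exact_boost` parameter, and the smoothed IDF formula are all assumptions made for illustration, and the suffix stripper is only a stand-in for a real stemmer.

```python
import math
from collections import defaultdict


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


class TwoFieldIndex:
    """Hypothetical index that stores each token in an 'exact' and a 'stemmed' field."""

    def __init__(self, docs):
        self.num_docs = len(docs)
        # postings[field][term] -> {doc_id: term frequency}
        self.postings = {"exact": defaultdict(dict), "stemmed": defaultdict(dict)}
        for doc_id, text in enumerate(docs):
            for tok in tokenize(text):
                for field, term in (("exact", tok), ("stemmed", toy_stem(tok))):
                    freqs = self.postings[field][term]
                    freqs[doc_id] = freqs.get(doc_id, 0) + 1

    def idf(self, field, term):
        """Simple smoothed IDF (an assumption, not Lucene's formula)."""
        df = len(self.postings[field].get(term, {}))
        return math.log(1 + self.num_docs / (1 + df))

    def search(self, query, exact_boost=1.0):
        """Sum tf * idf over both fields; exact_boost scales the unstemmed field."""
        scores = defaultdict(float)
        for tok in tokenize(query):
            clauses = (("exact", tok, exact_boost), ("stemmed", toy_stem(tok), 1.0))
            for field, term, boost in clauses:
                for doc_id, tf in self.postings[field].get(term, {}).items():
                    scores[doc_id] += boost * tf * self.idf(field, term)
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    index = TwoFieldIndex(
        ["query boosting explained", "boosted fields", "boost the title"]
    )
    for doc_id, score in index.search("boosting"):
        print(doc_id, round(score, 3))
```

In this toy, a query for "boosting" matches all three documents through the stemmed field, but only the first document also matches in the exact field, where the term is rarer and so carries a higher IDF. Setting `exact_boost=0.0` at query time ignores the exact field entirely, which hints at the run-time choice discussed above, though it still costs a second field at index time, which is the objection raised in the reply.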





