Subject: Re: boosts for unstemmed matches (was Re: If you could have one feature in Lucene...)
From: Avi Rosenschein (aros@gmail.com)
Date: Feb 24, 2010 1:38:12 pm
List: org.apache.lucene.java-user

On Wed, Feb 24, 2010 at 11:20 PM, Aaron Lav <as@pobox.com> wrote:

On Wed, Feb 24, 2010 at 10:18:27PM +0200, Avi Rosenschein wrote:

On Wed, Feb 24, 2010 at 3:42 PM, Grant Ingersoll <gsin@apache.org> wrote:

What would it be?

For scoring to take into account the non-analyzed token stream.

That is, if a field is analyzed (stemmed, lowercased, maybe even stop words removed), that is fine for indexing. But tokens in the query matching the original form could still get a higher score than those that only match when analyzed.

You can get some of that effect by indexing stemmed and unstemmed forms, and letting IDF boost unstemmed results. (I picked this idea up from http://lingpipe-blog.com/2007/03/21/to-stem-or-not-to-stem/)
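[Editor's sketch] The two-field idea above can be illustrated with a toy in-memory index. This is not Lucene code: `toy_stem`, `build_index`, `idf`, and `search` are hypothetical names invented for this illustration, the suffix stripper is a crude stand-in for a real stemmer, and the IDF formula is a simple textbook-style one chosen only to show the effect. The point is that the exact form of a term usually occurs in fewer documents than its stem, so its IDF is higher and an exact match adds more to the score.

```python
import math
from collections import defaultdict


def toy_stem(token):
    # Crude suffix stripper; a stand-in for a real stemmer (assumption).
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    # Two "fields" per document: the exact tokens and their stemmed forms.
    index = {"exact": defaultdict(set), "stemmed": defaultdict(set)}
    for doc_id, text in enumerate(docs):
        for tok in text.lower().split():
            index["exact"][tok].add(doc_id)
            index["stemmed"][toy_stem(tok)].add(doc_id)
    return index


def idf(n_docs, doc_freq):
    # Simple inverse document frequency: rarer terms weigh more.
    return 1.0 + math.log(n_docs / (doc_freq + 1.0))


def search(index, n_docs, query):
    # Query both fields; a document matching the exact form collects
    # the (usually larger) exact-field IDF on top of the stemmed one.
    scores = defaultdict(float)
    for tok in query.lower().split():
        for field, term in (("exact", tok), ("stemmed", toy_stem(tok))):
            postings = index[field].get(term, set())
            for doc_id in postings:
                scores[doc_id] += idf(n_docs, len(postings))
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = ["walking shoes", "he walks daily", "a long walk", "cats sleep"]
index = build_index(docs)
results = search(index, len(docs), "walking")
```

In this toy run, the document containing the literal token "walking" ranks first, while the documents that match only through the stem "walk" tie behind it. The cost is the one raised in this thread: every analysis variant needs its own field at index time.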

This is not quite the same (either in relevance or efficiency). I would like the infrastructure for this to be built into Lucene, so that queries and scorers could take advantage of it.

Also, this might allow a flexible, run-time decision about which analyzers to apply. For example, I might want stemming turned on for normal search, but not for a PhraseQuery.

That's harder - different field names for the different analyses might work, but not for run-time decisions. I think the way Sun's Minion does it is morphologically-based query expansion (see http://blogs.sun.com/searchguy/entry/lightweight_morphology_vs_stemming), and you might be able to implement that via query rewriting.
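[Editor's sketch] Query-time expansion via rewriting, as suggested above, might look roughly like the following toy. It assumes the index holds only the original, unstemmed tokens. Nothing here is Minion's or Lucene's actual API: `variants`, `rewrite`, and `score` are hypothetical helpers, the variant generator is deliberately naive, and the boost value of 2.0 is arbitrary. Each query term becomes a clause of alternative surface forms in which the form the user typed carries the higher boost; phrase queries are left unexpanded, which is the kind of run-time decision asked for earlier in the thread.

```python
def variants(term):
    # Hypothetical lightweight morphology: derive a base form, then
    # generate a few common English inflections from it.
    base = term[:-1] if term.endswith("s") and len(term) > 3 else term
    return {base, base + "s", base + "ing", base + "ed"}


def rewrite(query_terms, is_phrase=False, exact_boost=2.0):
    # Each clause is a list of (surface form, boost) alternatives.
    if is_phrase:
        # Phrase queries keep only the forms as typed.
        return [[(t, exact_boost)] for t in query_terms]
    rewritten = []
    for t in query_terms:
        clause = [(t, exact_boost)]
        clause += [(v, 1.0) for v in sorted(variants(t)) if v != t]
        rewritten.append(clause)
    return rewritten


def score(doc_tokens, rewritten):
    # For each clause, take the best-boosted form present in the document.
    tokens = set(doc_tokens)
    total = 0.0
    for clause in rewritten:
        total += max((b for form, b in clause if form in tokens), default=0.0)
    return total


expanded = rewrite(["walks"])
exact_hit = score(["he", "walks", "daily"], expanded)
variant_hit = score(["a", "long", "walk"], expanded)
phrase_only = score(["a", "long", "walk"], rewrite(["walks"], is_phrase=True))
```

Here a document containing "walks" scores higher than one containing only "walk", and the phrase-style rewrite does not match the variant at all. The trade-off is larger queries at search time in exchange for a single unstemmed field at index time.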

Again, rather than forcing me to store a separate field for every possible type of query I might want to build, Lucene should be able to efficiently store the original information in a form conducive to using at query time.