Subject: Re: Whither Query Norm?
From: Grant Ingersoll (gsin@apache.org)
Date: Nov 20, 2009 10:08:00 am
List: org.apache.lucene.java-dev

On Nov 20, 2009, at 11:19 AM, Jake Mannix wrote:

I should add in my $0.02 on whether to just get rid of queryNorm() altogether:

-1 from me, even though it's confusing, because having that call there
(somewhere, at least) allows you to actually compare scores across queries if
you do the extra work of properly normalizing the documents as well (at index
time).

Do you have some references on this? I'm interested in reading more on the
subject. I've never quite been sold on how it is meaningful to compare scores
and would like to read more opinions.

And for people who actually do machine-learning training of their per-field
query boosts, this is pretty critical.

-jake

On Fri, Nov 20, 2009 at 8:15 AM, Jake Mannix <jake@gmail.com> wrote:

The fact that Lucene Similarity is most decidedly *not* cosine similarity, but
strongly resembles it with the queryNorm() in there, means that I personally
would certainly like to see this called out, at least in the documentation.
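
To make the cosine comparison concrete, here is a minimal sketch in plain Java. It is not Lucene code; the class and method names are made up for illustration. It only shows why dividing by the query's vector length makes the score independent of how large the query weights (boosts) are, which is the property that resembles what queryNorm() contributes.

```java
public class CosineSketch {
    /** Dot product of two equal-length term-weight vectors. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Euclidean length: the square root of the sum of squared weights. */
    static double length(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    /** Cosine similarity: dot product divided by both vector lengths. */
    static double cosine(double[] query, double[] doc) {
        return dot(query, doc) / (length(query) * length(doc));
    }

    public static void main(String[] args) {
        double[] doc = {1.0, 2.0, 0.0};
        double[] query = {1.0, 1.0, 0.0};
        double[] boosted = {10.0, 10.0, 0.0}; // same direction, 10x the weights

        // The raw dot product scales with the query weights...
        System.out.println(dot(query, doc) + " vs " + dot(boosted, doc));
        // ...while the cosine does not, because the query length divides out.
        System.out.println(cosine(query, doc) + " vs " + cosine(boosted, doc));
    }
}
```

Note that the sketch also divides by the document length, which is the "properly normalizing the documents" part; a query-side norm alone does not give a true cosine.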

As for performance, is queryNorm() ever called in any loops? It's all set
up in the construction of the Weight, right? Which means that by the time
you're doing scoring, all the weighting factors are already factored into one?
What's the performance issue which would be saved here?
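
A rough sketch of the setup the question describes, in plain Java rather than actual Lucene classes (WeightSetupSketch, normCalls and score are hypothetical names): the norm is computed once when the weight object is built and folded into each term weight, so the per-document scoring path never calls it again.

```java
public class WeightSetupSketch {
    /** Counts how often the norm is computed, just to make the point visible. */
    static int normCalls = 0;

    /** 1 / sqrt(sum of squared weights), the calculation discussed in this thread. */
    static float queryNorm(float sumOfSquaredWeights) {
        normCalls++;
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    private final float[] normalizedWeights;

    /** All normalization work happens here, once per query. */
    WeightSetupSketch(float[] rawWeights) {
        float sumSq = 0f;
        for (float w : rawWeights) {
            sumSq += w * w;
        }
        float norm = queryNorm(sumSq);
        normalizedWeights = new float[rawWeights.length];
        for (int i = 0; i < rawWeights.length; i++) {
            normalizedWeights[i] = rawWeights[i] * norm;
        }
    }

    /** Per-document scoring: only multiplies and adds, no norm computation. */
    float score(float[] docTermFactors) {
        float s = 0f;
        for (int i = 0; i < normalizedWeights.length; i++) {
            s += normalizedWeights[i] * docTermFactors[i];
        }
        return s;
    }

    public static void main(String[] args) {
        WeightSetupSketch weight = new WeightSetupSketch(new float[] {3f, 4f});
        for (int doc = 0; doc < 1000; doc++) {
            weight.score(new float[] {1f, 1f});
        }
        System.out.println("norm computed " + normCalls + " time(s)");
    }
}
```

If scoring is structured this way, the cost of the norm is paid per query, not per document.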

-jake

On Fri, Nov 20, 2009 at 7:56 AM, Grant Ingersoll <gsin@apache.org> wrote:

For a long time now, we've been telling people not to compare scores across
queries, yet we maintain the queryNorm() code as an attempt to do this and the
javadocs even promote it. I'm in the process of researching this some more
(references welcomed), but wanted to hear what people think about it here. I
haven't profiled it just yet, but it seems like a good chunk of wasted
computation to me (loops, divisions and square roots). At a minimum, I think we
might be able to refactor the callback mechanism for it just as we did for the
collectors, such that we push off the actual calculation of the sum of squares
into Similarity, instead of just doing 1/sqrt(sumSqs). That way, when people
want to override queryNorm() to return 1, they are saving more than just the
1/sqrt calculation. I haven't tested it yet, but wanted to find out what others
think.

Thoughts?
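
The refactoring idea above could look roughly like the following sketch. This is plain Java with hypothetical names (NormPolicy, DefaultNorm, NoNorm), not the existing Similarity API: the point is only that if the policy object owns the sum-of-squares loop as well as the 1/sqrt step, an implementation that returns 1 skips both.

```java
public class QueryNormRefactorSketch {
    /** Hypothetical callback: the policy owns the whole norm calculation. */
    interface NormPolicy {
        float queryNorm(float[] rawWeights);
    }

    /** Default behaviour: loop over the weights, then take 1/sqrt(sumSqs). */
    static final class DefaultNorm implements NormPolicy {
        @Override
        public float queryNorm(float[] rawWeights) {
            float sumSqs = 0f;
            for (float w : rawWeights) {
                sumSqs += w * w;
            }
            return (float) (1.0 / Math.sqrt(sumSqs));
        }
    }

    /** Opt-out: returns 1 without touching the weights at all. */
    static final class NoNorm implements NormPolicy {
        @Override
        public float queryNorm(float[] rawWeights) {
            return 1f;
        }
    }

    public static void main(String[] args) {
        float[] weights = {3f, 4f};
        System.out.println(new DefaultNorm().queryNorm(weights));
        System.out.println(new NoNorm().queryNorm(weights));
    }
}
```

By contrast, when the caller computes the sum of squares and only hands the result to queryNorm(), overriding it to return 1 saves just the final division and square root.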