Subject:Re: Whither Query Norm?
From:Otis Gospodnetic (otis@yahoo.com)
Date:Nov 24, 2009 9:31:26 pm
List:org.apache.lucene.java-dev

Hello,

Regarding that monstrous term->idf map: could it be used to adjust the scores in the scenario described at
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations ?
Say a map like that were created periodically for each shard and
distributed to all other nodes (so that in the end each node has all the maps locally).
Couldn't the local scorer in the Solr instance (or in a distributed Lucene setup)
consult the idfs for the relevant terms in all those maps and adjust the scores of
its local hits before returning results?
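
To make the idea concrete, here is a minimal sketch of the merging step each node would need. It assumes each shard publishes raw document frequencies plus its document count instead of finished idf values, since raw counts are the simplest thing to sum across shards. The names GlobalIdf, ShardStats, and merge are hypothetical, and the idf formula is the classic 1 + ln(numDocs / (docFreq + 1)) used by Lucene's DefaultSimilarity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: combines per-shard statistics into one global idf map.
class GlobalIdf {

    // Hypothetical container for what one shard would publish.
    static final class ShardStats {
        final long numDocs;               // documents in this shard
        final Map<String, Long> docFreq;  // term -> documents containing it

        ShardStats(long numDocs, Map<String, Long> docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    // Sums document counts and document frequencies across shards,
    // then derives one idf per term from the combined totals.
    static Map<String, Float> merge(List<ShardStats> shards) {
        long totalDocs = 0;
        Map<String, Long> totalDf = new HashMap<>();
        for (ShardStats shard : shards) {
            totalDocs += shard.numDocs;
            for (Map.Entry<String, Long> e : shard.docFreq.entrySet()) {
                totalDf.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        Map<String, Float> idf = new HashMap<>();
        for (Map.Entry<String, Long> e : totalDf.entrySet()) {
            // Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
            double value = 1.0 + Math.log((double) totalDocs / (e.getValue() + 1));
            idf.put(e.getKey(), (float) value);
        }
        return idf;
    }
}
```

Every node that runs merge over the same set of shard statistics ends up with the same idf for a given term, which is what would make scores from different shards comparable.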

Otis

From: Jake Mannix <jake@gmail.com>
To: java@lucene.apache.org
Sent: Fri, November 20, 2009 7:49:34 PM
Subject: Re: Whither Query Norm?

On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <mark@gmail.com> wrote:

Mark Miller wrote:

Okay - I guess that somewhat makes sense - you can calculate the magnitude of the doc vectors at index time. How is that impossible with incremental indexing, though? Isn't it just expensive? It seems somewhat expensive in the non-incremental case as well - you're just eating the cost at index time rather than query time - though couldn't the same be done incrementally? The information is all there in either case.

Ok, I think I see what you were imagining I was doing: you take the current state of the index as gospel for idf (when the index is already large, this is a good approximation), and look up these factors at index time - this means grabbing docFreq(Term) for each term in my document, and yes, this would be very expensive, I'd imagine. I've done it by pulling a monstrous (the most common 1-million terms, say) Map<String, Float> (effectively) outside of Lucene entirely, which gives term idfs, and housing this in memory so that computing field norms for cosine is a very fast operation at index time.

Doing it like this is hard from scratch, but is fine incrementally, because I've basically fixed idf using some previous corpus (and update the idfMap every once in a while, in cases where it doesn't change much). This has the effect of also providing a global notion of idf in a distributed corpus.
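
A minimal sketch of what such an index-time lookup could look like, not the actual code described here. It assumes a document is reduced to a term-frequency map, that each term is weighted as raw tf * idf, and that terms missing from the map get a fallback idf. The names FixedIdfNorm, cosineNorm, and defaultIdf are hypothetical.

```java
import java.util.Map;

// Hypothetical helper: cosine field norm from a fixed, in-memory idf map.
class FixedIdfNorm {
    private final Map<String, Float> idfMap; // term -> idf from a previous corpus
    private final float defaultIdf;          // assumed fallback for unseen terms

    FixedIdfNorm(Map<String, Float> idfMap, float defaultIdf) {
        this.idfMap = idfMap;
        this.defaultIdf = defaultIdf;
    }

    // Returns 1 / ||d||, where d has one component tf * idf per term.
    // Only in-memory lookups are needed, so no docFreq(Term) calls
    // against the index happen at index time.
    float cosineNorm(Map<String, Integer> termFreqs) {
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Float known = idfMap.get(e.getKey());
            float idf = (known != null) ? known : defaultIdf;
            double weight = e.getValue() * idf;
            sumOfSquares += weight * weight;
        }
        // An empty document has no direction, so return 0 instead of dividing by 0.
        return sumOfSquares == 0.0 ? 0f : (float) (1.0 / Math.sqrt(sumOfSquares));
    }
}
```

Because the idf values are frozen between refreshes of the map, a norm computed for a document added today is consistent with norms computed earlier, which is why this works incrementally.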

-jake