Subject: Re: Whither Query Norm?
From: Jake Mannix (jake@gmail.com)
Date: Nov 20, 2009 4:49:11 pm
List: org.apache.lucene.java-dev

On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <mark@gmail.com> wrote:

> Okay - I guess that somewhat makes sense - you can calculate the magnitude of the doc vectors at index time. How is that impossible with incremental indexing though? Isn't it just expensive? Seems somewhat expensive in the non-incremental case as well - you're just eating it at index time rather than query time - though the same could be done for incremental? The information is all there in either case.

Ok, I think I see what you were imagining I was doing: you take the current state of the index as gospel for idf (when the index is already large, this is a good approximation) and look up these factors at index time. That means grabbing docFreq(Term) for each term in my document, and yes, I'd imagine that would be very expensive. What I've actually done is keep a monstrous Map<String, Float> (effectively) of term idfs, covering say the most common 1 million terms, entirely outside of Lucene and held in memory, so that computing field norms for cosine is a very fast operation at index time.
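To make the index-time step concrete, here is a minimal sketch of computing a cosine length norm from an in-memory idf map. It is not taken from any real codebase: the class name CosineNormSketch, the method cosineNorm, and the DEFAULT_IDF fallback for terms missing from the map are all assumptions made for illustration, and the weighting (raw tf times idf) is just one plausible choice.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CosineNormSketch {
    // Hypothetical fallback idf for terms that are not in the precomputed map.
    static final float DEFAULT_IDF = 1.0f;

    /**
     * Returns 1 / ||d||, where the document vector d has one component
     * tf(t) * idf(t) per distinct term t. Only in-memory lookups are needed,
     * so no docFreq calls against the index happen at index time.
     */
    static float cosineNorm(List<String> tokens, Map<String, Float> idfMap) {
        // Count raw term frequencies for this document (or field).
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        // Accumulate the squared tf-idf weights.
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            float idf = idfMap.getOrDefault(e.getKey(), DEFAULT_IDF);
            double weight = e.getValue() * idf;
            sumSquares += weight * weight;
        }
        // An empty document gets a norm of 0 rather than dividing by zero.
        return sumSquares == 0.0 ? 0.0f : (float) (1.0 / Math.sqrt(sumSquares));
    }
}
```

The returned value would then be stored as the field's norm in whatever way the indexing code allows; that wiring is deliberately left out here.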

Doing it like this is hard from scratch, since there is no earlier corpus to take the idfs from, but it is fine incrementally, because I've basically fixed idf using some previous corpus (and I update the idfMap every once in a while, which is enough when the idfs don't change much between updates). This also has the effect of providing a global notion of idf across a distributed corpus.
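A rough sketch of the "fix idf from a previous corpus" step follows. The names FrozenIdfSketch and buildIdfMap are hypothetical, the input is assumed to be a snapshot of per-term document frequencies taken from some earlier corpus, and the formula 1 + ln(N / (df + 1)) is an assumption chosen to resemble Lucene's default idf rather than anything stated in this thread.

```java
import java.util.HashMap;
import java.util.Map;

public class FrozenIdfSketch {
    /**
     * Builds an idf map from a snapshot of document frequencies.
     * docFreqs maps term -> number of documents containing it;
     * numDocs is the total document count of the snapshot corpus.
     */
    static Map<String, Float> buildIdfMap(Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Float> idfMap = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            // Assumed formula: 1 + ln(N / (df + 1)).
            double idf = Math.log(numDocs / (double) (e.getValue() + 1)) + 1.0;
            idfMap.put(e.getKey(), (float) idf);
        }
        return idfMap;
    }
}
```

Rebuilding this map occasionally and shipping the same copy to every node is what would give each shard the same idf values, regardless of its local term statistics.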

-jake