Subject: Re: Whither Query Norm?
From: Mark Miller (mark@gmail.com)
Date: Nov 20, 2009 4:51:23 pm
List: org.apache.lucene.java-dev

Okay - my fault - I'm not really talking in terms of Lucene. Though even there I consider it possible. You'd just have to like, rewrite it :) And it would likely be pretty slow.

Jake Mannix wrote:

On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <mark@gmail.com> wrote:

Mark Miller wrote:

> > it looks expensive to me to do both of them properly.

Okay - I guess that somewhat makes sense - you can calculate the magnitude of the doc vectors at index time. How is that impossible with incremental indexing, though? Isn't it just expensive? It seems somewhat expensive in the non-incremental case as well - you're just eating the cost at index time rather than query time - though couldn't the same be done for incremental indexing? The information is all there in either case.

The expense, if you have the idfs of all terms in the vocabulary (keep them in the form of idf^2 for efficiency at index time), is pretty trivial, isn't it? If you have a document with 1000 terms, it's maybe 3000 floating point operations, all CPU actions, in memory, no disk seeks.
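
To make the arithmetic concrete, here is a minimal sketch of that per-document computation. It is not Lucene code: the class name DocNormSketch, the method weightedNorm, and the idfSquared table are all hypothetical, and it assumes raw term frequencies and an idf^2 value already known for every term in the document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of Lucene: the length of a tf-idf document
// vector, computed from a precomputed table of idf^2 values.
final class DocNormSketch {

    /**
     * Returns sqrt( sum over terms of tf^2 * idf^2 ).
     * termFreqs maps each term in the document to its raw frequency (assumption).
     * idfSquared is an externally supplied term -> idf^2 table (assumption).
     */
    static double weightedNorm(Map<String, Integer> termFreqs,
                               Map<String, Double> idfSquared) {
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            Double idfSq = idfSquared.get(entry.getKey());
            if (idfSq == null) {
                // The hard part from the discussion: the idf must already be known.
                throw new IllegalArgumentException(
                        "no idf known for term: " + entry.getKey());
            }
            double tf = entry.getValue();
            // Two multiplies and one add per term, all in memory.
            sumOfSquares += tf * tf * idfSq;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = new HashMap<>();
        termFreqs.put("lucene", 3);
        termFreqs.put("norm", 4);

        Map<String, Double> idfSquared = new HashMap<>();
        idfSquared.put("lucene", 1.0);
        idfSquared.put("norm", 1.0);

        System.out.println(weightedNorm(termFreqs, idfSquared));
    }
}
```

The loop does roughly three floating point operations per distinct term, which is where a figure like 3000 operations for a 1000-term document comes from. The exception path marks the real dependency: the table has to exist before the first document is indexed.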

What it does require is knowing, even when you have no documents on disk yet, what the idfs of the terms in the first few documents are. Where would you get this in Lucene if you haven't externalized some notion of idf?

-jake
