Subject: Re: Whither Query Norm?
From: Jake Mannix (jake...@gmail.com)
Date: Nov 20, 2009 5:10:53 pm
Back to Grant's original question, for a second...
On Fri, Nov 20, 2009 at 1:59 PM, Grant Ingersoll <gsin...@apache.org> wrote:
This makes sense mathematically, assuming scores are comparable. What I would like to get at is why anyone thinks scores are comparable across queries to begin with. I agree it is beneficial in some cases (as you described) if they are. Probably a question suited for an academic IR list...
Well, without getting into the academic IR which I'm not really qualified to argue about, what is wrong with comparing two queries by saying that a document which "perfectly" matches a query should score 1.0, and scale with respect to that?
Maybe it's better to turn the question around: can you give examples of two queries where it clearly *doesn't* make sense to compare scores? Imagine we're doing pure, properly normalized tf-idf cosine scoring (not default Lucene scoring) on a couple of different fields at once. Then whenever a sub-query exactly equals the field it hits (or the field is that query repeated some number of times), the score for that sub-query will be 1.0. When the match isn't perfect, the score goes down.

Sub-queries hitting longer fields (ones that aren't pathologically made up of repetitions of a small set of terms) will in general have even their best scores be very low compared to the best scores on short fields (this is true for Lucene as well, of course), but that makes sense: if you query with a very small set of terms (as is usually done, unless you're doing a MoreLikeThis kind of query) and you find a match in the "title" field which is exactly what you were looking for, that field match is far and away better than anything you could get in a body match.
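To make that concrete, here's a toy sketch (not Lucene code; the idf values, terms, and fields are all made up for illustration) of properly normalized tf-idf cosine scoring, where a field that exactly equals the query scores 1.0 and a long body field scores much lower:

```python
import math
from collections import Counter

def tfidf_vector(terms, idf):
    """Raw tf-idf weights for a bag of terms (idf table is hypothetical)."""
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 1.0) for t in tf}

def cosine(q, d):
    """Cosine similarity of two sparse tf-idf vectors; always in [0, 1]."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

idf = {"whither": 3.0, "query": 1.5, "norm": 2.0, "lucene": 2.5}  # made-up idf values

query = tfidf_vector(["whither", "query", "norm"], idf)
title = tfidf_vector(["whither", "query", "norm"], idf)                 # field == query
body  = tfidf_vector(["query", "norm", "lucene"] * 10 + ["whither"], idf)  # long field

print(cosine(query, title))  # ~1.0: a "perfect" match
print(cosine(query, body))   # much lower, even though every query term appears
```

The point is only the normalization: because both vectors are length-normalized, 1.0 is a fixed ceiling meaning "the field is the query," regardless of which query you ran.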
To put it more simply: if you really do have cosine similarity (or Jaccard/Tanimoto or something like that, if you don't care about idf for some reason), then query scores are normalized relative to "how close did I get to *perfectly* matching my query" - 1.0 means you found your query in the corpus, and less than that means some fractional proximity. That is an absolute measure, comparable across queries.
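The same fixed ceiling shows up with set-based Jaccard (one of the idf-free alternatives mentioned above); this is a minimal sketch with made-up term sets:

```python
def jaccard(a, b):
    """Set-based Jaccard similarity: 1.0 iff the two term sets are identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

q = {"whither", "query", "norm"}
print(jaccard(q, {"whither", "query", "norm"}))          # 1.0: perfect match
print(jaccard(q, {"query", "norm", "lucene", "score"}))  # 2/5 = 0.4
```

Again, 1.0 means the same thing no matter which query produced it, which is what makes the scores comparable in an absolute sense.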
Of course, then you ask: in reality, in web and enterprise search, documents are big and queries are small, so you never really find documents which are perfect matches. If the best match for q1, out of your whole corpus, is 0.1 for doc1, and the best match for q2 is 0.25 for doc2, is it really true that the best match for the second query is "better" than the best match for the first?

I've typically tried to remain agnostic on that front, and instead ask the related question: if the user (or really, a sampling of many users) queried for (q1 OR q2), and assuming for simplicity that q1 didn't match any of the good hits for q2 and vice versa, does the user (i.e. your gold-standard training set) say that the best result is doc1, or doc2? If it's doc1, then you'd better have found a way to boost q1's score contribution higher than q2's, right? Is this wrong (in the theoretical sense)?
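Using the hypothetical numbers above (best hit 0.1 for q1 on doc1, 0.25 for q2 on doc2), here's one simple way to picture that: if the gold-standard set says doc1 is the better answer to (q1 OR q2), then some per-sub-query boost, learned from the training data, has to lift q1's contribution past q2's. The combination rule (boosted max) and the boost value 3.0 are assumptions for illustration, not anything Lucene does by default:

```python
def disjunction_score(sub_scores, boosts):
    """Score a doc against (q1 OR q2) as the boosted max over sub-query scores."""
    return max(boosts[q] * s for q, s in sub_scores.items())

doc1 = {"q1": 0.1, "q2": 0.0}   # best hit for q1, irrelevant to q2
doc2 = {"q1": 0.0, "q2": 0.25}  # best hit for q2, irrelevant to q1

unboosted = {"q1": 1.0, "q2": 1.0}
learned   = {"q1": 3.0, "q2": 1.0}  # hypothetical boost fit to the training set

print(disjunction_score(doc1, unboosted), disjunction_score(doc2, unboosted))  # doc2 wins
print(disjunction_score(doc1, learned), disjunction_score(doc2, learned))      # doc1 wins
```

Whether the raw 0.1-vs-0.25 comparison was "meaningful" never has to be settled; the boosts absorb whatever cross-query calibration the training data demands.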