

7 messages in org.apache.lucene.java-dev: RE: Include BM25 in Lucene?

| From | Sent On | Attachments |
|---|---|---|
| J.Zhu | Oct 17, 2006 2:50 am | |
| Grant Ingersoll | Oct 17, 2006 3:56 am | |
| J.Zhu | Oct 17, 2006 3:58 am | |
| Vic Bancroft | Oct 17, 2006 5:43 am | |
| J.Zhu | Oct 17, 2006 9:02 am | |
| Chuck Williams | Oct 17, 2006 12:41 pm | |
| Vic Bancroft | Oct 19, 2006 5:27 am |

| Subject: | RE: Include BM25 in Lucene? | |
|---|---|---|
| From: | J.Zhu (J.Z...@open.ac.uk) | |
| Date: | Oct 17, 2006 9:02:38 am | |
| List: | org.apache.lucene.java-dev | |
Hi, Vic,
Unfortunately, BM25 uses IDF as well, so splitting documents across machines will affect it too. How about storing the IDF statistics as global data shared by all the machines that serve the search?
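To make the global-statistics idea concrete, here is a minimal sketch, not taken from Lucene or from the thread. The names `GlobalStatsSketch`, `ShardStats`, `merge` and `idf` are hypothetical, and the IDF form log(N / n) is an assumption chosen for illustration. Each machine reports its document count and per-term document frequencies, and the merged totals are used for scoring everywhere.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge per-machine statistics so every machine scores
// with the same collection-wide IDF. Not part of the Lucene API.
final class GlobalStatsSketch {
    private GlobalStatsSketch() {}

    /** Statistics reported by one index partition (or the merged totals). */
    static final class ShardStats {
        final long docCount;               // number of documents in the partition
        final Map<String, Long> docFreq;   // term -> number of documents containing it

        ShardStats(long docCount, Map<String, Long> docFreq) {
            this.docCount = docCount;
            this.docFreq = docFreq;
        }
    }

    /** Sums document counts and per-term document frequencies over all partitions. */
    static ShardStats merge(List<ShardStats> shards) {
        long totalDocs = 0;
        Map<String, Long> totalDocFreq = new HashMap<>();
        for (ShardStats shard : shards) {
            totalDocs += shard.docCount;
            for (Map.Entry<String, Long> e : shard.docFreq.entrySet()) {
                totalDocFreq.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return new ShardStats(totalDocs, totalDocFreq);
    }

    /** Assumed IDF form: log(N / n); returns 0 for terms that occur nowhere. */
    static double idf(ShardStats stats, String term) {
        long n = stats.docFreq.getOrDefault(term, 0L);
        if (n == 0) {
            return 0.0;
        }
        return Math.log((double) stats.docCount / (double) n);
    }
}
```

With per-machine statistics, a term that is common on one machine and rare on another would get two different IDF values; with the merged statistics, both machines compute the same value, so scores stay comparable when results are combined.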
The equation of BM25 is clearly stated in Robertson's paper "Simple, proven approaches to text retrieval" (http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-356.pdf) as follows.
CW(i,j) = [ CFW(i) * TF(i,j) * (K1 + 1) ] / [ K1 * ( (1 - b) + b * NDL(j) ) + TF(i,j) ]

where CFW(i) is the collection frequency weight of term i, TF(i,j) is the term frequency of term i in document j, NDL(j) is the normalized document length of document j, and K1 and b are tuning constants. The details are in the paper.
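As a rough illustration of the equation above, here is a small sketch in Java. The class and method names are hypothetical, not from Lucene. It assumes CFW(i) = log(N / n), with N documents in the collection and n of them containing term i, and NDL(j) = DL(j) / (average DL); please check both against the paper. The values K1 = 2 and b = 0.75 in `main` are only illustrative.

```java
// Hypothetical sketch of the combined weight CW(i,j); not Lucene code.
final class Bm25Sketch {
    private Bm25Sketch() {}

    /** Assumed form of the collection frequency weight: CFW(i) = log(N / n). */
    static double collectionFrequencyWeight(long totalDocs, long docsWithTerm) {
        if (totalDocs <= 0 || docsWithTerm <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return Math.log((double) totalDocs / (double) docsWithTerm);
    }

    /** Assumed form of the normalized document length: NDL(j) = DL(j) / average DL. */
    static double normalizedDocLength(double docLength, double avgDocLength) {
        return docLength / avgDocLength;
    }

    /**
     * CW(i,j) = [CFW(i) * TF(i,j) * (K1 + 1)] / [K1 * ((1 - b) + b * NDL(j)) + TF(i,j)]
     */
    static double combinedWeight(double cfw, double tf, double ndl, double k1, double b) {
        double numerator = cfw * tf * (k1 + 1.0);
        double denominator = k1 * ((1.0 - b) + b * ndl) + tf;
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 1000 documents, 10 containing the term,
        // a 120-token document in a collection averaging 100 tokens.
        double cfw = collectionFrequencyWeight(1000, 10);
        double ndl = normalizedDocLength(120, 100);
        System.out.println(combinedWeight(cfw, 3, ndl, 2.0, 0.75));
    }
}
```

Note that with b = 0 the document length drops out of the denominator, and with TF(i,j) = 0 the weight is 0, so only matching terms contribute to a document's score.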
The Univ. of Amsterdam provides a downloadable language-modelling variant of Lucene. Their language model is not BM25 but is quite similar in nature. It is available at: http://ilps.science.uva.nl/Resources/#lm-lucen
I have worked on their version a bit. They have created new classes: TermQueryLanguageModel, TermScorerLanguageModel, IndexSearcherLanguageModel, LanguageModelIndexReader, etc. I think their work can serve as a basis.
Jianhan
-----Original Message-----
From: Vic Bancroft [mailto:banc...@america.net]
Sent: 17 October 2006 13:44
To: java...@lucene.apache.org; J.Z...@open.ac.uk
Subject: Re: Include BM25 in Lucene?
J.Zhu wrote:
If I would like to contribute, what should I do? I am not a good Java developer myself though. Can I work with someone also interested?
In some of my group's usage of Lucene over large document collections, we have split the documents across several machines. This has led to a concern about whether the inverse document frequency was appropriate, since the score seems to be dependent on the partitioning of documents over indexing hosts. We have not formulated an experiment to determine whether it seriously affects our results, though it has been discussed.
If someone could elaborate how BM25 or some DFR algorithm would differ from what (TF/IDF) is implemented in lucene, I would be willing to help translate that into java as an indexing/searching option . . .
more, l8r, v
-- "The future is here. It's just not evenly distributed yet." -- William Gibson, quoted by Whitfield Diffie







