Subject: Re: Unique doc ids
From: Yonik Seeley (yon@apache.org)
Date: Jan 24, 2008 5:29:45 am
List: org.apache.lucene.java-dev

On Jan 24, 2008 5:47 AM, Michael McCandless <luc@mikemccandless.com> wrote:

Yonik Seeley wrote:

On Jan 23, 2008 6:34 AM, Michael McCandless <luc@mikemccandless.com> wrote:

    writer.freezeDocIDs();
    try {
      get docIDs from somewhere & call writer.deleteByDocID
    } finally {
      writer.unfreezeDocIDs();
    }

Interesting idea, but would require the IndexWriter to flush the buffered docs so an IndexReader could be created from them. (or would require the existence of an UnflushedDocumentsIndexReader)

True.

Actually, an UnflushedDocumentsIndexReader would not be hard!

DocumentsWriter already has an IndexInput (ByteSliceReader) that can read the postings for a single term from the RAM buffer (this is used when flushing the segment). I think it'd be straightforward to get TermEnum/TermDocs/TermPositions iterators on the buffered docs. Norms are already stored as byte arrays in memory. FieldInfos is already available. The stored fields & term vectors are already flushed to the directory so they could be read normally.

Hmm, buffered delete terms are tricky. I guess freezeDocIDs would have to flush deleted terms (and queries, if we add that) before making a reader accessible,

If we buffer queries, that would seem to take care of 99% of the use cases that need an IndexReader, right? A custom query could get ids from an index however it wanted.

though, the cost is shared because the readers need to be opened anyway (so the app can find docIDs).

So maybe this approach becomes this:

    // Returns a "point in time" frozen view of index...
    IndexReader reader = writer.getReader();
    try {
      <get docIDs from reader, delete by docID>
    } finally {
      writer.releaseReader();
    }

?

We may even be able to implement this w/o actually freezing the writer, ie, still allowing add/updateDocument calls to proceed. Merging could certainly still proceed. This way you could at any time ask a writer for a "point in time" reader, independent of what else you are doing with the writer. This would require, on flushing, that writer goes and swaps in a "real" segment reader, limited to a specified docID, for any point in time readers that are open.
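To make the "limited to a specified docID" idea concrete, here is a rough sketch on a toy in-memory document list rather than real Lucene classes. PointInTimeView and SnapshotToyWriter are hypothetical names invented for illustration; the sketch assumes an append-only doc list and ignores threading, merging, and the swap to a "real" segment reader on flush.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical read-only view, capped at the maxDoc seen when it was created. */
class PointInTimeView {
    private final List<String> docs; // shared with the writer; append-only
    private final int limit;         // docIDs >= limit are invisible to this view

    PointInTimeView(List<String> docs, int limit) {
        this.docs = docs;
        this.limit = limit;
    }

    int maxDoc() { return limit; }

    String document(int docID) {
        if (docID < 0 || docID >= limit) {
            throw new IndexOutOfBoundsException("docID " + docID + " not visible");
        }
        return docs.get(docID);
    }
}

/** Toy writer (not Lucene's IndexWriter) that keeps accepting adds while views are open. */
class SnapshotToyWriter {
    private final List<String> docs = new ArrayList<>();

    void addDocument(String doc) { docs.add(doc); }

    /** Returns a view frozen at the current maxDoc; later adds do not show up in it. */
    PointInTimeView getReader() { return new PointInTimeView(docs, docs.size()); }
}
```

Documents added after getReader() get docIDs at or above the view's limit, so the view never sees them even though the writer keeps going.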

Wow... sounds complex.

If we went that route, we'd need to expose methods in IndexWriter to let you get reader(s), and, to then delete by docID.

Right... I had envisioned a callback that was called after a new segment was created/flushed that passed IndexReader[]. In an environment of mixed deletes and adds, it would avoid slowing down the indexing part by limiting where the deletes happen.
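A rough sketch of what such a callback could look like, using toy stand-ins instead of real Lucene classes. FlushListener, ToySegmentReader, and CallbackToyWriter are hypothetical names, not an existing or proposed Lucene API; the sketch only shows deletes being confined to the moment right after a flush.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a reader over one flushed segment. */
class ToySegmentReader {
    final List<String> ids;   // one "id" value per doc; docID = position
    final boolean[] deleted;  // docID -> deleted flag

    ToySegmentReader(List<String> ids) {
        this.ids = new ArrayList<>(ids);
        this.deleted = new boolean[ids.size()];
    }

    int maxDoc() { return ids.size(); }

    void deleteDocument(int docID) { deleted[docID] = true; }
}

/** Hypothetical callback, invoked after a new segment has been flushed. */
interface FlushListener {
    void onFlush(ToySegmentReader[] readers);
}

/** Toy writer: buffers docs in RAM and notifies the listener on every flush. */
class CallbackToyWriter {
    private final List<ToySegmentReader> segments = new ArrayList<>();
    private final List<String> buffer = new ArrayList<>();
    private final FlushListener listener;

    CallbackToyWriter(FlushListener listener) { this.listener = listener; }

    void addDocument(String id) { buffer.add(id); }

    void flush() {
        if (buffer.isEmpty()) {
            return; // nothing buffered, no new segment, no callback
        }
        segments.add(new ToySegmentReader(buffer));
        buffer.clear();
        // Deletes happen only here, so addDocument is never slowed down by them.
        listener.onFlush(segments.toArray(new ToySegmentReader[0]));
    }
}
```

The application looks up docIDs and deletes them inside onFlush, so the indexing path itself never has to open a reader.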

This would certainly be less work :) I guess the question is how severely are we limiting the application by requiring that you can only do deletes when IW decides to flush, or, by forcing the application to flush when it wants to do deletes.

Seems like more work, rather than limiting... "when" really isn't as important as long as it's before a new external IndexReader is opened for searching.

It does put a little more burden on the user, but a slightly harder (but more powerful / more efficient) API is preferable since easier APIs can always be built on top (but not vice-versa).

True, though emulating the easier API on top of the "you get to delete only when IW flushes" means you are forcing a flush, right?

I was thinking via buffering (the same way term deletes are handled now). You keep track of maxDoc() at the time of the delete and defer it until later.
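A minimal sketch of that buffering, on a toy in-memory index rather than the real IndexWriter. BufferingToyWriter and its methods are hypothetical names for illustration; each doc is reduced to a single term, and each buffered delete remembers maxDoc() at the time it was requested so that documents added later are not affected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy index: each doc is a single term string; docID = position in the list. */
class BufferingToyWriter {
    private final List<String> docs = new ArrayList<>();      // docID -> term
    private final List<Boolean> deleted = new ArrayList<>();  // docID -> deleted flag
    // Buffered deletes: term -> maxDoc at the time the delete was requested.
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();

    int maxDoc() { return docs.size(); }

    void addDocument(String term) {
        docs.add(term);
        deleted.add(false);
    }

    /** Only records the delete; it will apply to docs with docID < current maxDoc. */
    void deleteByTerm(String term) {
        bufferedDeletes.put(term, maxDoc());
    }

    /** Applies the deferred deletes, e.g. before a new reader is opened for searching. */
    void flushDeletes() {
        for (Map.Entry<String, Integer> e : bufferedDeletes.entrySet()) {
            int docIDUpto = e.getValue();
            for (int docID = 0; docID < docIDUpto; docID++) {
                if (docs.get(docID).equals(e.getKey())) {
                    deleted.set(docID, true);
                }
            }
        }
        bufferedDeletes.clear();
    }

    boolean isDeleted(int docID) { return deleted.get(docID); }
}
```

Adding "a", deleting by term "a", then adding "a" again leaves only the first document marked deleted once flushDeletes() runs, because the second one has a docID at or above the recorded bound.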

-Yonik