|Shai Erera||Apr 25, 2010 4:32 am|
|Michael McCandless||May 5, 2010 8:54 am|
|Shai Erera||May 5, 2010 10:17 am|
|Babak Farhang||May 8, 2010 11:39 pm|
|Shai Erera||May 8, 2010 11:49 pm|
|Babak Farhang||May 9, 2010 12:31 am|
|Shai Erera||May 9, 2010 4:38 am|
|Michael McCandless||May 10, 2010 12:43 am|
|Shai Erera||May 10, 2010 1:04 am|
|Michael McCandless||May 10, 2010 1:40 am|
|Babak Farhang||May 10, 2010 9:22 pm|
|Shai Erera||May 10, 2010 9:26 pm|
|Grant Ingersoll||May 11, 2010 12:40 pm|
|Babak Farhang||May 12, 2010 2:27 am|
|Michael McCandless||May 12, 2010 2:55 am|
|Jan Høydahl / Cominvent||Oct 7, 2010 1:59 am|
|Shai Erera||Oct 7, 2010 2:06 am|
|Andrzej Bialecki||May 23, 2011 9:07 am|
|Shai Erera||May 23, 2011 11:24 am|
|Michael McCandless||May 23, 2011 12:03 pm|
|Andrzej Bialecki||May 23, 2011 12:45 pm|
|Subject:||Re: Incremental Field Updates|
|From:||Shai Erera (ser...@gmail.com)|
|Date:||May 10, 2010 1:04:53 am|
That's an interesting scenario Mike.
Previously, I only handled boolean-like terms, as the scenarios we were asked to support involved just those types of terms. Obviously, when the approach allows for more, more scenarios pop to mind :).
I think we may still be able to resolve that case, but it becomes much more complicated. My design approach of adding the +/- affected the entire posting element, whereas the scenario you describe affects the positions of the posting element. This calls for a more complicated design and solution.
My take on it is that if someone wants to update the catch-all field, then reindexing the document may not be such a bad idea anyway. The purpose of those incremental updates is to cope w/ high frequency of updates, which usually happen on metadata fields, and not title.
But since one could add the 'tags' to the catch-all field as well, it brings us to the same point - how do I remove the positions of term X that relate to the tag X and not the potentially original term X that existed in the document?
This is a very advanced case (and interesting). I don't want to hold up the discussion on it, but want to make sure we do not deviate from getting the more simpler cases in first. Depending on the API, this might be very easy to solve, but might also complicate matters. Maybe, for a incr-field-updates-v1, we can do without it?
On Mon, May 10, 2010 at 10:43 AM, Michael McCandless < luc...@mikemccandless.com> wrote:
I think another example would be the catch-all field.
EG say my app concatenates the title, abstract and body of a document into the catch-all field.
But now I want to change just the title.
I think in theory (assuming we can work out an intuitive user-level API exposure of this...), on changing the title, we could incrementally re-index the catch-all field, to remove the old title and add the new one. Right?
But: would this approach work on the positions too?
EG say term X appears 4 times in the doc -- positions 1, 7, 22, 53. I've now replaced a part of this field, affecting only this term's occurrence at position 53, so that maybe that occurrence is now at 57. Are we able to handle this?
One possible low-level API (I think Doug originally suggested this) might be to allow .remove() calls on the enums (like many Java collection support from their iterators) -- eg as you are stepping through the DocsEnum, you could call .remove() to complete disassociate that term from that doc, or in DocsAndPositionsEnum you could call .removePosition to remove a specific position.
We would still need a higher level user API... maybe you'd provide a "text to remove" and a "text to add" for each field, pre-init'd with position/offset for the analyzer (if the field is analyzed)? I guess we'd make this a new attr on Field. Ie, Field normally contains just "text/tokens to add to the index", but we could also include a "text/tokens to remove". Or, each Field instance is marked as "to be added" vs "to be removed", and you have to add 2 Field instances to subtract then add.
On Sun, May 9, 2010 at 7:38 AM, Shai Erera <ser...@gmail.com> wrote:
When I update a field, I want to update all of it, not just part of it. No?
Well ... might be. But the most common case for us are fields to which we want to add data or remove. For example ACLs - you don't want to replace the entire field w/ the document, but simply to add/remove access for certain people/groups. Same goes for "social" fields, like tags, ratings, bookmarks etc. - the granularity of the update is to associate/dissociate a particular value w/ the field + doc, and not update the entire field.
On Sun, May 9, 2010 at 10:31 AM, Babak Farhang <farh...@gmail.com>
No, actually, you can update index-only fields also. It all depends on the operation that you ask to do on the fields.
I love the level of control this provides, but again, I was talking at the user level.
If you want to e.g. remove an entire field w/ such update operation, then it becomes more expensive
That's the typical usage scenario, I imagine. When I update a field, I want to update all of it, not just part of it. No?
(The lower level semantics of twiddling with the postings is poorly understood by the typical user..)
We didn't face such a scenario though, and I think it's probably a rare one?
As rare as any time you want to update an indexed-only field. Not a serious limitation (but perhaps worth noting?) Perhaps at the API level, you provide an updateIndexedOnlyField that takes the old value as well as the new value for the field.
Anyway, I think your approach would be a great addition to the toolkit. Would love to see it even in rough cut form :))
On Sun, May 9, 2010 at 12:49 AM, Shai Erera <ser...@gmail.com> wrote:
No, actually, you can update index-only fields also. It all depends on the operation that you ask to do on the fields. For example, if the query to execute is something like "update all documents w/ tags:ibm -> remove terms t1, t2, t3 and add terms t4, t5", then the result of such request would dissociate t1-3 from those docs that answer tags:ibm and associate them w/ t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the following posting updates will be done: t1: -1, -4, -10 t2: -1, -4, -10 t3: -1, -4, -10 t4, 1, 4, 10 t5, 1, 4, 10 (in addition to whatever other updates that are associated with those postings).
At search time, if you search for "t1 OR t2", then the regular t1 and t2 postings will be merged on-the-fly w/ the updated ones to remove docs 1, 4, 10.
If you want to e.g. remove an entire field w/ such update operation, then it becomes more expensive, but in general you'd need to iterate over the field's terms and dissociate the documents from all the terms. We didn't face such a scenario though, and I think it's probably a rare one?
On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <farh...@gmail.com>
I think this is an interesting approach. I can see how I could [incrementally] update a stored, indexed field this way, but I don't understand how I could update an indexed-only field. Here's why: for a stored (and indexed) field, I can always determine what terms to remove ('-') from the postings, but for an indexed-only field I'd have no [practical] way to know..
So under this approach, I'm thinking at a user level, a Lucene field would be updateable only if it's at least stored.
Is that right?
On Wed, May 5, 2010 at 11:17 AM, Shai Erera <ser...@gmail.com>
Yes Mike - I don't know yet if two MPs will be used, one for the stacked segments and one for the general segments (which will include the stacked ones as well) .. feels like one MP should be enough, but this can be decided on as we progress.
This approach allows you to update every term in every already indexed field, as well as add terms to already indexed fields ... and add totally new fields, with lots of text in them. So e.g. there are two neat use cases that come to mind: 1) Allow users to rate search results, favor them etc. 2) Or even comment them, I think Google offers the 2nd. Both translate into updating the search result's already indexed document w/ the new rating, comment etc. w/o needing to reindex the document.
I will try to find perf results numbers. It's been long time ago, hope all the documents are still where they were :). Indeed, for terms like ACLs, it means that each query had to merge dozens of postings lists. For that I implemented an alternative solution, which uses a payload-like structure that registers for each document the list of ACLs that are associated with it (not as strings, it was more efficient). Then, if the query included dozens of such terms, I created a filter-like object which for every matching document by the query checked if it matches the ACLs list of the document. This is usually slower, because the ACLs themselves don't drive the query, which means more matches will be found. That was a tradeoff which one could configure based on the number of such terms in the query, the number of updated terms etc.
But in essence you're right - if the solution is generic to cover any term, we cannot use such payload-based feature. We might need to merge the stacked segments more frequently. This is pending perf testing though.
On Wed, May 5, 2010 at 6:54 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
Catching up here :)
This is great stuff Shai -- I like the notion of "negative" postings, that "subtract" docs from previous generations as you iterate them.
And I like the term "stacked segments". This fits very well with Lucene's write-once approach, ie, a writer can at any time stack a new segment (writes to new files) "over" an old segment, like the layers in photoshop. A reader merges all stacks on-the-fly when reading.
And the merge policy now picks from 2 dimensions right? Ie it may want to simply consolidate stacks on an old segment but otherwise not merge that segment (eg for very large segments that have accumulated updates), and normal merging would of course consolidate all stacks for all segments merged.
Wouldn't this approach conceivably allow you to alter single terms within a single field (we'd have to figure out how we'd expose the API for this)? EG if I appended some text to an already-indexed field? (In addition to adding a new field to an already indexed doc, or replacing an indexed field on a previously indexed doc).
Did you have any hard perf numbers? Merge sorting N streams is surprisingly costly... we may need/want to have a reader pre-merge (using up RAM) any "long tail" of stacked segments as long as they are small enough...
This sounds great!!
On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <ser...@gmail.com> wrote:
WARNING: following email is a bit long, but I think is worth the reading :)
I would like to describe my implementation of incremental field updates in Juru (the former search library I've worked on in IBM). I will later discuss how I think it can be implemented in Lucene.
The motivation/requirement came from a doc management system which used Juru as its search component. The system included document libraries where users could create files and upload documents. A user could belong to any number of libraries and complex ACLs model was used (down to the level of the file). ACLs and Folders were modeled as categories in the index (boolean-like terms). Files and folders could be moved around and access to a library/folder/file could be granted/revoked at any given time. Therefore, such updates usually affected hundreds (and thousands) of documents. Overall, the index managed millions of documents, tens of thousands of libraries and hundreds of thousands of ACLs (large organization :)). To get a rough understanding on the number of such updates - every several minutes, tens of thousands of documents were updated due to such changes only (in addition to the regular content updates).
We were asked to support requests in the following form: "update all docs that match <criteria> --> do <operation>" where: * <criteria> was "a single doc", "docs belonging to a category" and "docs belonging to a set of categories". * <operation> was "add categories NEW" (NEW might not even exist in the index yet, or already associated w/ the document), "remove categories OLD" (irregardless if the docs were already associated w/ OLD or not) and "remove all OLD and add all NEW".
The solution I've implemented to support these requests turned out to actually allow you to update every term (!) in the index: suppose that you have a table, which recorded tuples like <docid, term, +/->. The record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm", and the record <1, "hp", '-'> means that the same document is not associated w/ the word "hp". Then, you could very easily ask for all documents that are assoicated w/ "hp", and the result would not include doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the document received.
I've kept two types of postings in the index: regular and updates. Taking the above examples, "ibm" regular posting looked like <"ibm", 1, 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2, -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for "hp".
During search time, when a query with the word "ibm" was submitted, I create a virtual posting which reads from both the regular and the updates, and merges them on the fly according to the +/- signs. Since both postings are sorted in ascending order, the merge is very efficient, and query time is hardly affected.
Those postings are merged from time to time in a process that is similar to how Lucene works today, which keeps the update postings relatively small and manageable.
Now here comes the fun part - how I think it can be implemented in Lucene !
To be honest, this sat on my TODO list for a very long time and only a couple of months ago I figured out how to implement it in Lucene. The main difficulty I had was around the difference between the write-once policy in Juru and Lucene - in Lucene, once a segment is written, it cannot be changed. BUT, I've only recently realized that this isn't exactly true, because deleted docs do change existing segments. The deletes are kept in a separate file to the segment (.del) and have their own generation. Deletes, as I understood then, and Grant helped me term them better, can be defined as "Stacked Segments" - they add data to a segment, which from time to time are integrated into the segment (unlike Photoshop Layers, but my understanding of Photoshop is limited). And the Lucene engine knows how to combine the two, giving precedence to the deletes.
By introducing an "Updates Stacked Segment", we can encode postings w/ the '+'/'-' signs, and when TermDocs/Positions is requested, we can create a variation which merges the two lists. When segments are merged, the updates will be merged into the regular postings (just like deletes) and thus will be gone. In addition, this plays very nicely with readers that are currently reading the index, as well as we can have generations for the updates - really like deletes !
I think that Lucene's architecture allows for such a solution very cleanly and nicely (and I believe flex makes it even easier). We can (later, after you've digested the idea) discuss whether this should be built into the current IW, or an extension like UpdateableIW. The API I've been thinking about should really be like deletes, allowing to update docs based on Term or Query. I defer the API discussion for later for now.
As for stored fields, this was a real challenge to support in Juru, but I think that w/ "Stacked Segments" and Lucene's architecture, this should be much easier - adding stacked stored fields ...
As you've noticed, the update postings are not DGap encoded, and sign needs to be preserved. While I haven't implemented it in Juru, I think that perhaps this can be improved by keeping the '-' and '+' lists separated. We will need to register somewhere which came before which because order matters a lot here (and I'm not talking about concurrency - simple update instructions order). I have some idea how this can be achieved, but I refrain from describing it now, to not make this email even longer :).
I've mentioned that this approach can be applied to any term and not just categories under some circumstances. Basically, as soon as you update a term, its DF is no longer true, unless you are able to take the updates into account. We can defer the discussion on that, but clearly for many fields, incrementally update them should not affect precision, as they're not used for that type of scoring ... Maybe, by keeping separate '+' and '-' lists we can compute statistics precisely. And I haven't given much thought yet to how this and Mike's flex scoring will be integrated.
BTW, a word on Parallel Indexing - the two are completely orthogonal. Once PI is introduced, one can index all the updateable fields in a dedicated slice, for perhaps improving search performance for slices that are not updateable (not involving code which attempts to read and merge update and regular lists on the fly). Also, incremental field updates support all of PI's scenarios, even though some will be done more efficiently w/ PI. But this too is a matter for a separate discussion :).
That's it ! I believe I've given you all the details I have about the approach and high level proposed solution for Lucene. Perhaps some details slipped my mind, but if you ask the right questions, I'm sure I'll be able to answer them :). I would like to emphasize that since this was already implemented (in Juru) - this is more than just a "I think this approach can work" proposal ...
I would appreciate your comments on this. I would like to start implementing it soon, and so as a first step, please share your comments on the overall approach. I'll then write a more detailed description on how I think to impl it in Lucene (been spending some time on that), and we can have more detailed (and fun) discussions on the low level details.
On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang < farh...@gmail.com> wrote:
Good point. I meant the model at the document level: i.e. what milestones does a document go through in its life cycle. Today:
created --> deleted
With incremental updates:
created --> update1 --> update2 --> deleted
I think what I'm trying to say is that this second threaded sequence of state changes seems intuitively more fragile under concurrent scenarios. So for example, in a lock-free design, the system would also have to anticipate the following sequence of events:
created --> update1 --> deleted --> update2
and consider update2 a null op. I'm imagining there are other cases that I can't think of..
On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
write once, plus the option to the app to keep multiple commit points around (by customizing the deletion policy).
Actually order of operations / commits very much matters in Lucene today.
Deletions are not idempotent: if you add a doc w/ term X, delete by term X, add a new doc with term X... that's very different than if you moved the delete op to the end. Ie the deletion only applies to the docs added before it.
On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <farh...@gmail.com> wrote:
Sure. Because of the write once principle. But at some cost (duplicated data). I was just agreeing that it would not be a good idea to bake in version-ing by keeping the layers around forever in a merged index; I wasn't keying in on transactions per se.
Speaking of transactions: I'm not sure if we should worry about this much yet, but with "updates" the order of the transaction commits seems important. I think commit order is less important today in Lucene because its model supports only 2 types of events: document creation--which only happens once, and document deletion, which is idempotent. What do you think? Will commits have to be ordered if we introduce updates? Or does the onus of maintaining order fall on the application?
On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <farh...@gmail.com> wrote:
I think they get merged in by the merger, ideally in the background.
That sounds sensible. (In other words, we wont concern ourselves with roll backs--something possible while a "layer" is still around.)
Actually roll backs would still be very possible even if layers are merged.
Ie, one could keep multiple commits around, and the older commits would still be referring to the old postings + layers, keeping them alive.
Lucene would still be transactional with such an approach.