| From | Sent On | Attachments |
|---|---|---|
| Gary William Flake | May 13, 2011 9:08 am |
| Subject: | forked riaksearch for better perf | |
|---|---|---|
| From: | Gary William Flake (ga...@flake.org) | |
| Date: | May 13, 2011 9:08:33 am | |
| List: | com.basho.lists.riak-users | |
Our team has a fork of the riaksearch source that we believe adds a very simple
but highly desirable improvement. We've reached out to Basho directly with the
idea but, as they say, code talks. The fork can be found at
https://github.com/gpascale/riak_search and we will be issuing a pull request
today for it. Moreover, we have a single line fork of riak-js as well located
at https://github.com/gpascale/riak-js.
In our use case, we need to access documents from a search index in time based
order (not relevance). Moreover, we need to paginate over these results. As
some of you know, bug 867 - https://issues.basho.com/show_bug.cgi?id=867 -
prevents sort, start, and rows from being correctly applied. The current
blessed work around is to write a M/R job to handle it. We found, however, that
the degradation in performance for having to do a M/R for each query was
problematic.
For very typical scenarios where you need to jump to a slice of results that are
ordered by something besides relevance, we are seeing a 10-100x performance gain
per query within our fork. We believe the technique we've implemented can be
used in several other examples. For example:
1. You need a bucket to act like an enormous container or queue. With our fork
you can quickly ask for the most (or least) recent N items, process them, then
remove them.
2. Your data is hierarchical (like employees in a company) and you need search
results to be ordered by seniority within the hierarchy.
3. You are building a twitter clone and you want search of tweets to show the
most recent.
4. Your data has a "natural" order such as price or amount that is always
preferred to TFIDF based relevance.
If you have one of these scenarios, I would encourage you to read further. In
the rest of this note I'll explain (A) why bug 867 is a show stopper for you,
(B) a representation hack that you can use to speed up your M/R calls if you
want to work around this with the current builds of riaksearch, and (C) show you
how to use our fork to see the most significant performance gains.
(A). Bug 867
Suppose you have a query that will normally return 300 results. You desire the
results to be ordered by a key within your results. Moreover, you want the
middle 100 results after the sort has been applied. The semantics of the SOLR
API for riaksearch are such that start=100&rows=100&sort=KEY is supposed to do
the job.
Bug 867 prevents this from working because riaksearch performs the slice and
sort in the wrong order, yielding you 100 results, but the 100 that are in the
middle of the results if they are sorted by relevance and THEN reordered by KEY
afterwards.
Fixing this bug is non-trivial because the most straightforward implementation
requires one to retrieve all of the documents in order to perform the sort by
KEY, whereas the relevance sort only needs the information in the the index.
The work around for this issue is to do something like the following (shown in
riak-js):
db.addSearch(bucketName, query) .map('Riak.mapValuesJson') .reduce('Riak.reduceSort', 'function(a,b){return a.key-b.key;}') .reduce('Riak.reduceSlice', [start, start + count]) .run(callback);
Here, you are seeding a M/R phase with the search results, pulling out the
records for all results, doing the sort as a reduce with your own comparison
function, then doing the slice.
For a query that returns a few thousand results, you can expect such a M/R job
to take a couple of seconds on typical hardware, which is hardly acceptable for
interactive applications.
(B). A representation hack
The M/R phase above suffers in that it has to retrieve the documents for all
valid results, not just the ones that are part of the final sliced result set.
We can improve upon this by putting our sort key as a prefix to the document ID.
Care must be taken to properly pad the keys so that they are of fixed length.
Moreover, you'll want your synthetic document keys to have something unique as
well. We set our document iDs to be the time stamp of object creation (properly
padded) plus a GUID. Thus, each document is unique, but sorts over keys do what
we want. With this change, we can rewrite the above M/R to:
db.addSearch(bucketName, query) .map('function (v) { return [v.key]; }') .reduce('Riak.reduceSort', 'function(a,b){return (a)-(b)}') .reduce('Riak.reduceSlice', [start, start + count]) .run(callback);
The difference is that our new map() does not pull out all of the documents.
Instead, this M/R completes with with a list of documents that are in the final
result (which we may still need to retrieve for our application).
We typically see 3-10x performance gains over the previous method, but this is
highly dependent on typical result sizes as well as typical document sizes.
(C). Our fork of Riak search.
The change we've introduced today takes the idea behind these synthetic keys,
but puts all of the work back into the search core, thereby eliminating the M/R.
Where the search core performs a sort by relevance (prior to doing the slice
specified by the SOLR arguments), we replace this with a sort by key. That's
the entire essence of the change.
By avoiding the M/R in it's entirety, we now often see performance gains of
3-10x over the previous method and, therefore, 10-100x gains over the most
simplistic approach described first.
We've also introduced an option called "presort" which can be be set to "key" or
"score". The former triggers our new behavior. The latter preserves the
original behavior. Leaving the option unspecified is the same as setting it to
"score", so our fork should not introduce any breaking changes.
Anyhow, these have turned out to be essential changes for us, which is why we
wanted to make them available to others.
Best, -- GWF
Founder, Clipboard, Inc.
_______________________________________________ riak-users mailing list riak...@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com





