On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao <jun...@almaden.ibm.com> wrote:
1. In addition to OrderPreservingPartitioner, it would be useful to support
MapReduce on RandomPartitioned Cassandra as well. We have a rough prototype
that sort of works at the moment. The difficulty with the random partitioner
is that it's a bit hard to generate the splits. In our prototype, we simply
map each row to a split. This is ok for fat rows (e.g., a row includes all
info for a user), but may be too fine-grained for other cases. Another
possibility is to generate a split that corresponds to a set of rows in a
hash range (instead of a key range). This requires some new APIs in
-1 on adding new APIs to pound a square peg into a round hole.
Like range queries, Hadoop splits only really make sense on OPP.
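
For concreteness, the hash-range split idea from point 1 could look roughly
like the sketch below. It assumes keys are hashed with MD5 into a 128-bit
space, which may not match the partitioner's actual token range, and the
names hash_ranges, key_hash, and split_for_key are made up for illustration;
they are not Cassandra or Hadoop APIs.

```python
import hashlib

# Assumption for this sketch: hashes are 128-bit MD5 values.
HASH_SPACE = 2 ** 128


def hash_ranges(num_splits):
    """Divide [0, HASH_SPACE) into num_splits contiguous half-open ranges."""
    bounds = [HASH_SPACE * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def key_hash(key):
    """Hash a row key to an integer in [0, HASH_SPACE)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def split_for_key(key, ranges):
    """Return the index of the hash range (split) that contains the key."""
    h = key_hash(key)
    for i, (lo, hi) in enumerate(ranges):
        if lo <= h < hi:
            return i
    raise ValueError("hash outside of all ranges")


if __name__ == "__main__":
    ranges = hash_ranges(4)
    for key in ["alice", "bob", "carol"]:
        print(key, "-> split", split_for_key(key, ranges))
```

Each split then covers a set of rows rather than a single row, which is the
coarser granularity point 1 is after. The rows in a split are not contiguous
in key order, though, which is the square-peg part.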
2. For better performance, in the future, it would be useful to expose and
exploit data locality in Cassandra so that a map task is executed on a
Cassandra node that owns the data locally. A related issue is
https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
encapsulation, but it's worth thinking about. Google's DFS and Bigtable
both expose certain locality info for better performance.
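
To illustrate what exposing locality in point 2 might mean: if the token
ring were visible to the job, each split could carry a preferred host. The
sketch below assumes the ring is a sorted list of (token, host) pairs and
that a node owns the tokens up to and including its own, wrapping around at
the end of the ring. The function name owner_of and the ring layout are
hypothetical, not an existing Cassandra interface.

```python
import bisect


def owner_of(token, ring):
    """Return the host owning `token`.

    `ring` is a list of (node_token, host) pairs sorted by node_token.
    Assumption: a node owns (previous_node_token, node_token], and tokens
    past the last node wrap around to the first node.
    """
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token)  # first node with node_token >= token
    return ring[i % len(ring)][1]


if __name__ == "__main__":
    ring = [(100, "node-a"), (200, "node-b"), (300, "node-c")]
    for token in (50, 150, 200, 301):
        print(token, "->", owner_of(token, ring))
```

A scheduler could use such a lookup as a hint for where to run the map task
for a split, falling back to any node when the preferred one is busy.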
That's why I'd like to ship Hadoop integration out of the box, instead
of adding APIs that should really be internal-use only for an external