| From | Sent On | Attachments |
|---|---|---|
| Jeff Hodges | Jul 24, 2009 1:23 am | |
| Jun Rao | Jul 24, 2009 9:07 am | |
| Jonathan Ellis | Jul 24, 2009 9:59 am | |
| Jeff Hodges | Jul 28, 2009 11:37 pm | |
| Jeff Hodges | Jul 28, 2009 11:47 pm | |
| Jonathan Ellis | Jul 30, 2009 2:09 pm | |
| Jun Rao | Jul 31, 2009 8:51 am | |
| Jonathan Ellis | Jul 31, 2009 8:57 am | |
| Jeff Hodges | Aug 17, 2009 2:24 am | |

| Subject: | Re: hadoop tasks reading from cassandra | |
|---|---|---|
| From: | Jonathan Ellis (jbel...@gmail.com) | |
| Date: | Jul 24, 2009 9:59:57 am | |
| List: | org.apache.incubator.cassandra-dev | |
On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao <jun...@almaden.ibm.com> wrote:
> 1. In addition to OrderPreservingPartitioner, it would be useful to support MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype that sort of works at this moment. The difficulty with the random partitioner is that it's a bit hard to generate the splits. In our prototype, we simply map each row to a split. This is OK for fat rows (e.g., a row includes all info for a user), but may be too fine-grained for other cases. Another possibility is to generate a split that corresponds to a set of rows in a hash range (instead of a key range). This requires some new APIs in Cassandra.
-1 on adding new APIs to pound a square peg into a round hole.
Like range queries, Hadoop splits only really make sense on OPP.
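To make the hash-range alternative from point 1 concrete, here is a minimal sketch of cutting a hash space into contiguous ranges and assigning each row key to one of them. It is an illustration only, not code from the prototype mentioned in this thread: the 128-bit MD5 token space and the names `token_for`, `hash_range_splits`, and `split_index` are assumptions made for the example and are not claimed to match Cassandra's RandomPartitioner.

```python
import hashlib
from typing import List, Tuple

# Assumption: tokens are 128-bit MD5 digests read as unsigned integers.
TOKEN_SPACE = 2 ** 128


def token_for(key: str) -> int:
    """Hash a row key into the token space (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def hash_range_splits(num_splits: int) -> List[Tuple[int, int]]:
    """Cut the token space into contiguous half-open ranges [start, end)."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    bounds = [TOKEN_SPACE * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_index(key: str, splits: List[Tuple[int, int]]) -> int:
    """Return the index of the split whose hash range contains the key."""
    token = token_for(key)
    for i, (start, end) in enumerate(splits):
        if start <= token < end:
            return i
    raise AssertionError("splits do not cover the token space")


splits = hash_range_splits(4)
buckets = [split_index(f"user{i}", splits) for i in range(1000)]
```

With a uniform hash, each split receives roughly the same number of rows regardless of how the keys themselves are distributed, which is what makes it coarser than one split per row. The catch raised above still applies: a reader can only scan such a range if the store can enumerate rows by hash range.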
> 2. For better performance, in the future, it would be useful to expose and exploit data locality in Cassandra so that a map task is executed on a Cassandra node that owns the data locally. A related issue is https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks encapsulation, but it's worth thinking about. Google's DFS and Bigtable both expose certain locality info for better performance.
That's why I'd like to ship Hadoop integration out of the box, instead of adding APIs that should really be internal-use only just to serve an external Hadoop layer.
-Jonathan
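As an illustration of the locality idea in point 2, the sketch below looks up which node owns a token on a ring, and which nodes a split's range touches, so that a scheduler could prefer those hosts. It is a simplified model rather than Cassandra's implementation: it assumes each node with token T owns the range (previous token, T], ignores replication, and handles only splits that do not wrap around the ring. The names `owner_of` and `split_hosts` and the sample ring are hypothetical.

```python
import bisect
from typing import Dict, List


def owner_of(token: int, ring: Dict[int, str]) -> str:
    """Return the host owning `token`; `ring` maps node token -> host name.

    Assumption: the node with token T owns (previous token, T], and the
    node with the smallest token also owns everything past the largest one.
    """
    tokens = sorted(ring)
    # First node token that is >= the requested token, wrapping to index 0.
    i = bisect.bisect_left(tokens, token)
    return ring[tokens[i % len(tokens)]]


def split_hosts(start: int, end: int, ring: Dict[int, str]) -> List[str]:
    """Hosts owning any part of the non-wrapping range (start, end]."""
    if start >= end:
        raise ValueError("wrapping or empty ranges are not handled here")
    hosts = [ring[t] for t in sorted(ring) if start < t < end]
    hosts.append(owner_of(end, ring))
    # Remove duplicates while keeping ring order.
    return list(dict.fromkeys(hosts))


ring = {100: "node-a", 200: "node-b", 300: "node-c"}
local = split_hosts(120, 180, ring)      # range held by a single node
spanning = split_hosts(150, 250, ring)   # range crossing a node boundary
```

A split that maps to a single host can be scheduled there and read locally; one that spans several hosts is a hint that the split boundaries should be aligned with node tokens.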





