Subject: Re: hadoop tasks reading from cassandra
From: Jonathan Ellis (jbel@gmail.com)
Date: Jul 24, 2009 9:59:57 am
List: org.apache.incubator.cassandra-dev

On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao <jun@almaden.ibm.com> wrote:

> 1. In addition to OrderPreservingPartitioner, it would be useful to support MapReduce on randomly partitioned Cassandra as well. We have a rough prototype that sort of works at the moment. The difficulty with the random partitioner is that it's a bit hard to generate the splits. In our prototype, we simply map each row to a split. This is OK for fat rows (e.g., a row that includes all info for a user), but may be too fine-grained for other cases. Another possibility is to generate a split that corresponds to a set of rows in a hash range (instead of a key range). This requires some new APIs in Cassandra.

-1 on adding new APIs to pound a square peg into a round hole.

Like range queries, Hadoop splits only really make sense on OPP (OrderPreservingPartitioner).
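
For readers following the thread, the hash-range idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from Cassandra or from the prototype mentioned: the token space is modeled as 0..2**127 with MD5-derived tokens, and the names `TOKEN_SPACE`, `token_for`, `hash_range_splits`, and `split_index` are hypothetical.

```python
import hashlib

# Assumption: hash tokens live in [0, 2**127). This models a random
# partitioner's token space for illustration only.
TOKEN_SPACE = 2 ** 127


def token_for(key: str) -> int:
    """Map a row key to a token by hashing it (MD5 assumed here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def hash_range_splits(num_splits: int):
    """Cut the token space into contiguous half-open ranges [start, end)."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    bounds = [TOKEN_SPACE * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_index(token: int, splits) -> int:
    """Return the index of the split whose range contains the token."""
    for i, (start, end) in enumerate(splits):
        if start <= token < end:
            return i
    raise ValueError("token outside the token space")


if __name__ == "__main__":
    splits = hash_range_splits(4)
    for key in ["alice", "bob", "carol"]:
        print(key, "-> split", split_index(token_for(key), splits))
```

Each split here covers many rows, which avoids the one-row-per-split granularity of the prototype, but the rows in a split are adjacent only in hash order, not in key order. Serving such a split would need a way to scan by token range, which is the new API under debate.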

> 2. For better performance, in the future, it would be useful to expose and exploit data locality in Cassandra so that a map task is executed on a Cassandra node that owns the data locally. A related issue is https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks encapsulation, but it's worth thinking about. Google's DFS and Bigtable both expose certain locality info for better performance.

That's why I'd like to ship Hadoop integration out of the box, instead of adding APIs that should really be internal-use only just to serve an external Hadoop layer.

-Jonathan