6 messages in org.apache.helix.user: Re: HDFS read load distribution using...
From, Sent On
Shekhar Bansal, Jun 19, 2017 7:24 am
kishore g, Jun 19, 2017 7:45 am
Shekhar Bansal, Jun 19, 2017 9:14 am
kishore g, Jun 19, 2017 11:11 am
Shirshanka Das, Jun 19, 2017 12:29 pm
Tejeswar Das, Jun 27, 2017 11:35 am
Subject: Re: HDFS read load distribution using helix
From: Shekhar Bansal
Date: Jun 19, 2017 9:14:12 am

Thanks a lot, Kishore. I think I can treat the HDFS directory as the resource and
the modulo of each filename's hash as the task. Is there a better way of doing
this in Helix?

Thanks,
Shekhar
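
A minimal sketch of the hash-modulo idea described above, in plain Java with no Helix dependency. The class name `FilePartitioner`, its method names, and the partition count are hypothetical; the sketch only shows how a file name could be mapped to one of a fixed number of partitions so that whichever instance owns a partition knows which files to read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps HDFS file names to partition indexes.
class FilePartitioner {

    // Returns a partition index in [0, numPartitions) for the given file name.
    static int partitionFor(String fileName, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // String.hashCode() can be negative; floorMod keeps the result non-negative.
        return Math.floorMod(fileName.hashCode(), numPartitions);
    }

    // Returns the files from the directory listing that belong to one partition.
    static List<String> filesForPartition(List<String> fileNames, int partition, int numPartitions) {
        List<String> owned = new ArrayList<>();
        for (String name : fileNames) {
            if (partitionFor(name, numPartitions) == partition) {
                owned.add(name);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        // Example file names are made up for illustration.
        List<String> listing = Arrays.asList("part-00000", "part-00001", "part-00002", "part-00003");
        int numPartitions = 3;
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + " -> " + filesForPartition(listing, p, numPartitions));
        }
    }
}
```

Because `String.hashCode()` is specified by the Java language, every instance computes the same partition for the same file name, so no coordination is needed beyond knowing which partitions an instance owns.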

On Monday, June 19, 2017 8:15 PM, kishore g wrote:

- Currently, Helix ensures even distribution of partitions within a resource,
not across resources. Is it possible for you to add the tasks as part of the same
resource?

- 2 & 3: Yes, you can start the controller as part of your process. But since
you said you launch this on Kubernetes every 5 minutes, I suggest keeping the
controller and ZooKeeper running all the time. Controllers are lightweight, and
you can get away with an entry-level container spec. It's OK to launch
Helix Participants every 5 minutes.

You should consider using the Helix Task Framework. It provides all the
functionality you need.
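
To illustrate why balancing within a resource but not across resources matters, here is a toy model in plain Java. It is not Helix's rebalancer and uses no Helix API; the class `DistributionSketch` and its placement rule (each resource is placed independently, with partition i going to instance i modulo the instance count) are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: each resource is placed on its own, ignoring other resources.
class DistributionSketch {

    // Counts how many partitions each instance ends up owning.
    static Map<String, Integer> ownerCounts(List<String> instances, int numResources, int partitionsPerResource) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String instance : instances) {
            counts.put(instance, 0);
        }
        for (int r = 0; r < numResources; r++) {
            for (int p = 0; p < partitionsPerResource; p++) {
                // Assumed rule: placement restarts from the first instance for every resource.
                String owner = instances.get(p % instances.size());
                counts.merge(owner, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> instances = Arrays.asList("node1", "node2", "node3");
        // Six resources with one partition each: everything lands on node1.
        System.out.println(ownerCounts(instances, 6, 1)); // prints {node1=6, node2=0, node3=0}
        // One resource with six partitions: the load is spread evenly.
        System.out.println(ownerCounts(instances, 1, 6)); // prints {node1=2, node2=2, node3=2}
    }
}
```

Under this assumed rule, one resource per file reproduces the "node1 owns everything" symptom from the original question, while a single resource with many partitions spreads evenly, which is the motivation for putting all tasks in the same resource.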

On Mon, Jun 19, 2017 at 7:24 AM, Shekhar Bansal wrote:

I have a standalone (containerised) Java app. It reads data from HDFS, does some
transformations, and writes the data to remote storage. I want to make it scalable by
launching multiple instances of this Java app. My problem is how to assign tasks
among these instances. Can Helix solve this problem? If yes, can you please help
me with the following:

- I followed the Helix quickstart example and created one resource per file, but
node1 was assigned master for all resources. Is that because of the simple
StateModelDefinition used in the quickstart example, am I using it the wrong
way, or is it a limitation of Helix?

- I want to avoid running a separate controller process. If I start a
controller as part of setup, will Helix be able to elect a master controller (in
standalone mode)? Is it advisable to run tens of controllers in distributed
- I schedule my app every five minutes using a Kubernetes cron job. Is it advisable
to use Helix for such short-lived processes?