Subject: Recovery Issues
From: Thomas Jungblut (thom@googlemail.com)
Date: Mar 10, 2012 1:11:15 am
List: org.apache.incubator.hama-dev

I guess we have to split out some issues needed for checkpoint recovery.

In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks

And I guess we can simply make a rule: if a task fails inside our barrier sync method (we have a double barrier, so that means after enterBarrier() and before leaveBarrier()), we have to do a global recovery. Otherwise we can just do a single task rollback.
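
To make the rule concrete, here is a minimal sketch of the decision, not an actual implementation. All names in it (TaskPhase, RecoveryAction, RecoveryPolicy, decide) are hypothetical and not part of the Hama API; only enterBarrier() and leaveBarrier() refer to our real sync methods. It assumes we can tell whether the failed task was between those two calls.

```java
// Hypothetical types for illustration only; none of these exist in Hama.
enum TaskPhase {
    OUTSIDE_BARRIER, // task failed before enterBarrier() or after leaveBarrier()
    INSIDE_BARRIER   // task failed after enterBarrier() and before leaveBarrier()
}

enum RecoveryAction {
    SINGLE_TASK_ROLLBACK, // roll back only the failed task
    GLOBAL_ROLLBACK       // roll back all tasks
}

final class RecoveryPolicy {
    private RecoveryPolicy() {
    }

    // Applies the proposed rule: a failure inside the double barrier forces
    // a global recovery, any other failure needs only a single task rollback.
    static RecoveryAction decide(TaskPhase phaseAtFailure) {
        if (phaseAtFailure == TaskPhase.INSIDE_BARRIER) {
            return RecoveryAction.GLOBAL_ROLLBACK;
        }
        return RecoveryAction.SINGLE_TASK_ROLLBACK;
    }
}
```

How the phase of a failed task would actually be detected is left open here; that is probably one of the issues to file.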

For those asking why we can't just always do a global rollback: it is too costly, and we do not need it in every case. But we do need it when a task fails inside the barrier (between enter and leave), because a single rolled-back task can't trip the enterBarrier() barrier on its own.

Anything I have forgotten?