| From | Sent On | Attachments |
|---|---|---|
| Thomas Jungblut | Mar 10, 2012 1:11 am | |
| Suraj Menon | Mar 12, 2012 1:00 am | |
| Thomas Jungblut | Mar 12, 2012 1:33 am | |
| Edward J. Yoon | Mar 14, 2012 12:29 am | |
| Chia-Hung Lin | Mar 14, 2012 5:59 am | |
| Suraj Menon | Mar 14, 2012 11:20 am | |
| Thomas Jungblut | Mar 14, 2012 11:58 am | |
| Suraj Menon | Mar 14, 2012 12:05 pm | |
| Edward J. Yoon | Mar 14, 2012 2:19 pm | |
| Suraj Menon | Mar 16, 2012 5:32 am | |
| Thomas Jungblut | Mar 16, 2012 4:17 pm | |
| Subject: | Recovery Issues | |
|---|---|---|
| From: | Thomas Jungblut (thom...@googlemail.com) | |
| Date: | Mar 10, 2012 1:11:15 am | |
| List: | org.apache.incubator.hama-dev | |
I guess we have to split out some issues needed for checkpoint recovery.
In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks
And I guess we can simply make a rule: if a task fails inside our barrier sync method (we have a double barrier, so that means after enterBarrier() and before leaveBarrier()), we have to do a global recovery. Otherwise we can just do a single task rollback.
For those asking why we can't just always do a global rollback: it is too costly, and we really do not need it in every case. But we do need it when a task fails inside the barrier (between enter and leave), because a single rolled-back task can't trip the enterBarrier() barrier.
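To make the proposed rule concrete, here is a minimal sketch of the decision in Java. It is not Hama code: the class `RecoveryPolicy`, the enums `TaskPhase` and `RecoveryType`, and the method `decide` are all hypothetical names made up for illustration. It assumes the framework can tell whether the failed task was between enterBarrier() and leaveBarrier() at the time of failure.

```java
// Hypothetical sketch of the recovery rule described above; none of these
// names exist in Hama, they only illustrate the decision.
class RecoveryPolicy {

    /** Where a task was in the superstep when it failed (assumed to be known). */
    enum TaskPhase {
        COMPUTING,   // before enterBarrier() or after leaveBarrier()
        IN_BARRIER   // after enterBarrier() and before leaveBarrier()
    }

    /** The two types of recovery from the list above. */
    enum RecoveryType {
        SINGLE_TASK, // roll back only the failed task
        GLOBAL       // roll back all tasks
    }

    /**
     * A failure inside the double barrier forces a global recovery,
     * because a single rolled-back task cannot trip the enter barrier.
     * Any other failure only needs a single task rollback.
     */
    static RecoveryType decide(TaskPhase phaseAtFailure) {
        return phaseAtFailure == TaskPhase.IN_BARRIER
                ? RecoveryType.GLOBAL
                : RecoveryType.SINGLE_TASK;
    }

    public static void main(String[] args) {
        System.out.println(decide(TaskPhase.COMPUTING));  // prints SINGLE_TASK
        System.out.println(decide(TaskPhase.IN_BARRIER)); // prints GLOBAL
    }
}
```

The sketch only covers the choice between the two recovery types. How the failure phase is detected and how the rollback itself is carried out are left open here.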
Anything I have forgotten?
-- Thomas Jungblut Berlin <thom...@gmail.com>





