I guess we have to split off some sub-issues needed for checkpoint recovery.
In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks
And I guess we can simply make a rule:
If a task fails inside our barrier sync method (since we have a double
barrier, that means after enterBarrier() and before leaveBarrier()), we have
to do a global rollback.
Else we can just do a single task rollback.
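
To make the rule concrete, here is a minimal sketch in Java. The class, enum and method names (RecoveryPolicy, BarrierPhase, RecoveryType, decide) are made up for illustration and are not existing API; it also assumes the framework can tell whether the failed task was between enterBarrier() and leaveBarrier() at the time of failure.

```java
// Hypothetical sketch of the recovery rule; none of these names are existing API.
final class RecoveryPolicy {

    // Where the task was when it failed (assumed to be known to the framework).
    enum BarrierPhase {
        OUTSIDE_BARRIER, // before enterBarrier() or after leaveBarrier()
        INSIDE_BARRIER   // after enterBarrier() and before leaveBarrier()
    }

    // The two types of recovery discussed above.
    enum RecoveryType {
        SINGLE_TASK_ROLLBACK,
        GLOBAL_ROLLBACK
    }

    private RecoveryPolicy() {
        // static helper only, no instances
    }

    // Failure inside the barrier sync forces a global rollback;
    // any other failure only rolls back the failed task.
    static RecoveryType decide(BarrierPhase phaseAtFailure) {
        if (phaseAtFailure == BarrierPhase.INSIDE_BARRIER) {
            return RecoveryType.GLOBAL_ROLLBACK;
        }
        return RecoveryType.SINGLE_TASK_ROLLBACK;
    }
}
```

So a failure reported with INSIDE_BARRIER would trigger the expensive global path, and everything else stays on the cheap single task path.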
For those asking why we can't just always do a global rollback: it is too
costly and we do not need it in every case.
But we need it in the case where a task fails inside the barrier (between
enter and leave), because a single rolled-back task can't trip the