5 messages in org.apache.hadoop.zookeeper-user
Thread: Re: Watchers & error handling

From            Sent On
Alexis Midon    Jun 12, 2010 10:07 pm
Alexis Midon    Jun 24, 2010 1:25 pm
Patrick Hunt    Jun 25, 2010 12:36 pm
Alexis Midon    Jun 25, 2010 2:47 pm
Patrick Hunt    Jun 25, 2010 3:33 pm

Subject: Re: Watchers & error handling
From: Patrick Hunt
Date: Jun 25, 2010 3:33:11 pm

On 06/25/2010 02:47 PM, Alexis Midon wrote:

1. Session events, i.e. type-None events, are sent to all outstanding watch handlers. So if you do get(path, watcherX), both the default listener and watcherX will receive the session events.

That's true. This enables the watcher to handle the case (for example) when the client has become disconnected from the cluster. Per-operation watchers were specifically added to support the "zk library" case - where more than a single consumer would be using the client connection. This makes it a lot easier to add libraries dependent on zk.

2. Watchers are one-time triggers; however, session events do NOT remove a watcher. In other words, if we're listening for a NodeCreated event and a disconnection occurs, we will eventually be notified of a Disconnected event, then a SyncConnected event, and finally a NodeCreated event, without having to set any new watcher.


3. If the invocation of a (synchronous or asynchronous) method fails, the watcher is not set. For instance, if getChildren("/foo", mywatcher) fails because the client is disconnected, mywatcher won't be notified of future events.

Correct, a watch is only valid if the operation was successful.
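
To make points 1-3 concrete, here is a minimal sketch of those rules as a toy in-memory model. It is not the ZooKeeper client: WatchModel, RecordingWatcher and every method name below are hypothetical, and events are plain strings. It only illustrates that session events reach every registered watcher without consuming it, that a path event fires a watcher once, and that a failed call registers nothing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the watch semantics discussed above.
// All names are hypothetical; this is not the ZooKeeper client API.
class WatchModel {
    /** Records every event a watcher receives, in order. */
    static class RecordingWatcher {
        final List<String> events = new ArrayList<>();
        void process(String event) { events.add(event); }
    }

    // path -> watchers waiting for a one-time event on that path
    private final Map<String, Set<RecordingWatcher>> pathWatches = new LinkedHashMap<>();
    private final RecordingWatcher defaultWatcher;
    private boolean connected = true;

    WatchModel(RecordingWatcher defaultWatcher) {
        this.defaultWatcher = defaultWatcher;
    }

    /** Point 3: the watch is registered only if the operation succeeds. */
    boolean exists(String path, RecordingWatcher watcher) {
        if (!connected) {
            return false; // operation failed, so no watch is set
        }
        pathWatches.computeIfAbsent(path, p -> new LinkedHashSet<>()).add(watcher);
        return true;
    }

    /** Points 1 and 2: session events reach every watcher and remove none. */
    void sessionEvent(String state) {
        connected = state.equals("SyncConnected");
        Set<RecordingWatcher> targets = new LinkedHashSet<>();
        targets.add(defaultWatcher);
        pathWatches.values().forEach(targets::addAll);
        targets.forEach(w -> w.process(state));
    }

    /** A path event fires each watcher registered on that path once, then drops it. */
    void pathEvent(String type, String path) {
        Set<RecordingWatcher> fired = pathWatches.remove(path);
        if (fired != null) {
            fired.forEach(w -> w.process(type + ":" + path));
        }
    }
}
```

In this model, a watcher registered on "/foo" before a disconnect sees Disconnected, SyncConnected and then NodeCreated:/foo, while a watcher passed to a call made during the disconnect sees nothing.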

I apologize in advance if I'm stating the obvious but the differences between "path" events and "session" events were not clear to me.

No, this is great. Feel free to enter a JIRA if this is not clear enough.

This (3.1.1) is a pretty old version of the docs, I'd suggest that you look at the most recent before entering JIRAs:



On Fri, Jun 25, 2010 at 12:36 PM, Patrick Hunt wrote:

On 06/12/2010 10:07 PM, Alexis Midon wrote:

I implemented queues and locks on top of ZooKeeper, and I'm pretty happy so far. Thanks for the nice work. Tests look good. So good that we can focus on exception/error handling and I got a couple of questions.

#1. Regarding the use of the default watcher. A ZooKeeper instance has a default watcher, most operations can also specify a watcher. When both are set, does the operation watcher override the default watcher?

If you use get(path, bool), then the default watcher is notified; if you use get(path, watcherX), then only "watcherX" is notified.

or will both watchers be invoked? if so in which order? Does each watcher receive all the types of event?

no, both watchers are not invoked.

I had a look at the code, and my understanding is that the default watcher will always receive the type-NONE events, even if an "operation" watcher is set. No guarantee on the order of invocation though. Could you confirm and/or complete please?

The watcher gets both state change notifications and watch events. You can register multiple watchers for the same path (including the default); there is no guarantee on ordering at all.

#2 After a connection loss, the client will eventually reconnect to the ZK cluster, so I guess I can keep using the same client instance. But are there cases where it is necessary to re-instantiate a ZooKeeper client? As a first recovery strategy, is it OK to always recreate a client so that any ephemeral node previously owned disappears?

If the session has expired, you need to recreate the session object (the same applies if you explicitly closed it).

Yes, this is a fine strategy if your application domain "fits". If you have a very expensive "recovery" or "bootstrap" process then recreating the session on every disconnect would be a bad idea.
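
As a rough sketch of the choice described above, the decision can be written as a small function. The State and Action enums, and the recreateOnDisconnect flag, are hypothetical names invented for this illustration; they are not types from the ZooKeeper client, and the right policy still depends on how expensive your own recovery is.

```java
// Hypothetical names throughout; a sketch of the recovery choice, not a client API.
class SessionRecovery {
    enum State { SYNC_CONNECTED, DISCONNECTED, EXPIRED, CLOSED }
    enum Action { CONTINUE, WAIT_FOR_RECONNECT, CREATE_NEW_CLIENT }

    /**
     * @param state                the session state the application observed
     * @param recreateOnDisconnect true for the "always recreate" strategy, which
     *                             only fits applications with a cheap bootstrap
     */
    static Action actionFor(State state, boolean recreateOnDisconnect) {
        switch (state) {
            case SYNC_CONNECTED:
                return Action.CONTINUE;
            case DISCONNECTED:
                // The client reconnects on its own, so waiting is the default.
                return recreateOnDisconnect
                        ? Action.CREATE_NEW_CLIENT
                        : Action.WAIT_FOR_RECONNECT;
            case EXPIRED:
            case CLOSED:
                // An expired or explicitly closed session cannot be reused.
                return Action.CREATE_NEW_CLIENT;
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }
}
```

With recreateOnDisconnect set to false, only an expired or closed session leads to a new client, which is the cheaper option when bootstrap is expensive.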

The case I struggle with is the following: Let's say I've acquired a lock (i.e. an ephemeral locknode is created). Some application logic failed due to a connection loss. At this stage I'd like to give up/roll back. Here I would typically throw an exception, the lock being released in a finally. But I can't release the lock since the connection is down. Later the client eventually reconnects, the session didn't expire so the locknode still exists. Now no one else can acquire this lock until my session expires.

Yes, you are reading the situation correctly. In this case you either have to take the easy route - close the session and create a new one (again, if your app domain supports this) or your client needs to check if the lock is still being held (it's still the owner) when it's eventually reconnected. You can verify this for an ephemeral node by looking at the "ephemeralOwner" field of the Stat object. If this matches your session id then you are the owner and still hold the lock. This is a bit tricky to get right though, so in some cases clients just close the session and recreate.
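
The ownership check described above boils down to one comparison. The helper below is a hypothetical sketch, not part of the client: it takes the two ids as plain arguments so it can run on its own. In a real client they would presumably come from Stat.getEphemeralOwner() on the lock node and ZooKeeper.getSessionId() on the handle, and a null owner here stands for "the lock node no longer exists".

```java
// Hypothetical helper; the arguments stand in for values read from the client.
class LockOwnership {
    /**
     * @param ephemeralOwner owner session id taken from the lock node's Stat,
     *                       or null if the node no longer exists
     *                       (the field is 0 for a non-ephemeral node)
     * @param sessionId      id of the session that has just reconnected
     * @return true if this session still owns the lock node
     */
    static boolean stillHoldsLock(Long ephemeralOwner, long sessionId) {
        if (ephemeralOwner == null || ephemeralOwner == 0L) {
            return false; // node is gone, or it is not an ephemeral node
        }
        return ephemeralOwner == sessionId;
    }
}
```

After a reconnect, a client that gave up its work during the disconnect could run this check and, if it returns true, delete the lock node so that others can acquire the lock without waiting for the session to expire.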

#3. could you describe the recommended actions for each exception code?

This is highly dependent on your application requirements. See above for my general information. Feel free to ask more questions.