I got similar results on linux/x86. However, on linux/arm the default version was faster, and I believe it uses CVM_FASTLOCK_MICROLOCKS.
If you have it allocate a new object on each iteration, you'll find that the time difference becomes much smaller. So it seems that the effort to inflate an object monitor is fairly big, but once inflated, the inflated object monitor is faster than an uninflated re-entrant fastlock record.
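For reference, here is a rough sketch of the kind of comparison I mean. It is not your actual test case: the class name LockBench, the method names, and the iteration count are all made up for illustration, and I'm assuming the test re-enters the lock on a single object. One loop re-enters a lock on the same object every time, the other allocates a fresh object per iteration so that any inflation cost is paid on every pass. Timings from something this simple are only indicative, and a JIT may optimize some of the locking away.

```java
public class LockBench {
    // Re-enters the lock on one shared object each iteration.
    // If the monitor is inflated, that should happen at most once here.
    public static int sameObject(int iterations) {
        Object lock = new Object();
        int count = 0;
        for (int i = 0; i < iterations; i++) {
            synchronized (lock) {
                synchronized (lock) { // re-entrant acquisition
                    count++;
                }
            }
        }
        return count;
    }

    // Allocates a new object per iteration, so any cost of setting up
    // (or inflating) the lock is paid again on every pass.
    public static int newObjectEachTime(int iterations) {
        int count = 0;
        for (int i = 0; i < iterations; i++) {
            Object lock = new Object();
            synchronized (lock) {
                synchronized (lock) { // re-entrant acquisition
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1000000; // hypothetical iteration count

        long t0 = System.currentTimeMillis();
        int a = sameObject(n);
        long t1 = System.currentTimeMillis();
        int b = newObjectEachTime(n);
        long t2 = System.currentTimeMillis();

        System.out.println("same object:     " + (t1 - t0) + " ms, " + a + " iterations");
        System.out.println("new object each: " + (t2 - t1) + " ms, " + b + " iterations");
    }
}
```

Run it once on the default build and once on the CVM_FASTLOCK_NONE build, and compare how much the gap between the two builds changes from the first loop to the second. The second loop also includes allocation cost, so compare across builds rather than between the two loops.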
Whether you have fastlocks or use system mutexes, the locking code needs to do something similar (basically bookkeeping on some sort of lock record). I think if you have a system where you always use a heavyweight inflated monitor, then you can possibly do better when dealing with contended or re-entered locks. However, uncontended locks, or ones that are not re-entered, will perform much slower this way.
Another way of looking at this: if you always assume the worst, and the worst never happens, then you pay a performance price for having assumed it. However, if the worst always happens, then you'll see gains. With CVM_FASTLOCK_NONE, you are assuming the worst locking cases, and your test case is an example of one.
[Message sent by forum member 'cjplummer' (cjplummer)]