Tuesday, August 18, 2009

Hunting bugs

The last several weeks was a bit nervous. Our customers reported several similar bugs and we had absolutely no idea of where they come from. Fortunately now the cause of bugs seems to be settled (though yet to be solved) thanks to extensive logging (the thing I talked about in the first blog post).

I will talk here using some domain specific terms like Driver entity & getStatus() operation but I think it should not make understanding harder (if you have different opinion, write a comment and I'll try to explain if something is unclear)

The bug pattern was (as we discovered it):
  • Driver performed closeOrder() operation which should have changed his status + status of some order. Driver must have been set FREE status, Order must have been set ARCHIVED status.
  • Concurrently driver's mobile client executed background request getStatus() (which is executed once in half a minute automatically). This should have set only LAST_POLLING flag on Driver entity
This resulted in a very strange combination. Driver remained BUSY while Order was successfully set to ARCHIVED status which means that system was put into inconsistent state. As we first thought getStatus() command must not have any relation to the fact that driver somehow remains in (or is put back to) BUSY status (because it only changes 1 unrelated attribute LAST_POLLING). But we were wrong because of 2 things:

The first one is more general one: transaction isolation level (READ COMMITTED) was insufficient for transactions performed by operations mentioned. This made possible race condition between 2 threads executing different transactions. As a result thread executing getStatus() was able to
  • See driver in BUSY status (because it managed to read driver's state before closeOrder() operation
  • Write driver's status (because it managed to commit after! closeOrder() committed
But now one can ask: how does operation which only should have written LAST_POLLING managed to overwrite driver's status? This is because of the second thing which is related to the default Hibernate settings:
By default Hibernate for any given entity executes updates involving not only fields changed during Hibernate session but all entity fields! This allows it to prepare update operations on application startup. But this also makes application more fragile to race conditions. So I want to warn everyone about this property of the default Hibernate's strategy. It can be changed by setting dynamicUpdates entity attribute to true (either in annotations or in xml descriptors)

Setting dynamicUpdates to true only lowers the chance of race condition between different threads but not eliminates it. So now we are working hard on making possible Hibernate optimistic locking for Driver entity. The main difficulty here is to make some common operation be able to recover from OptimisticLockException (recovery means be able to try the same sequence of actions again)

No comments:

Post a Comment