Saturday, December 5, 2015

Everybody has a plan until they get punched in the face

“A failure” can be the result of a software or hardware problem. A fault-tolerant design aims to let the system keep operating properly when a failure happens. In this post, I’ll talk about a way to treat one specific cause of failure, a timeout, using a really cool open source framework named Polly.
We prepare for failures: we add “try catch” blocks and check “if(something == null)” every now and then. We think we’ve got it covered and that we are fully protected.
As Mike Tyson said – that’s our plan. However, we all know that something “unpredictable” will always go wrong, and then we’ll be punched in the face.
Let’s say that you have a DB and you query it to get some data – user’s navigation history for example.
One of your users clicked to get his navigation history and waited for almost a minute (yes – that’s a long time) to get an error that crashes the application!
(BTW, this is not a great error page. Read here about the importance of error pages that make your user smile.)
You guessed right – that was a timeout to the DB. Why did it happen? Who cares? It made your user wait and eventually it crashed the application. This is unacceptable!
How can we handle it?
There are two problems in this scenario: the first is that the application crashed and the second is that the user waited a minute for a response.
Let’s start from the crash. The navigation history is a cool and useful feature but it’s not the core of your system. The core feature is the navigation itself. You have to identify the core features of your application and make them work, even if some other features fail.
In this example, the application should have displayed an error message saying something like “History is currently unavailable – try to remember where you went” and still let the user use its core features.
Now let’s deal with the fact that the user had to wait a whole minute before seeing the error. That’s a horrible user experience. A timeout can happen for many reasons in any sort of client-server communication (no more available connections, waiting for a really long operation to finish, or simply a stuck component).
The bottom line is that someday it will definitely happen to you! Furthermore, it will make all your users wait until they receive the error. A timeout may occur just once, due to a race condition that only a single user hits. However, sometimes the timeout will persist until you manually fix the problem, which can take a while.
You can lower the timeout limit to a second or even 100ms, and indeed no user will wait longer than 100ms. But every user will still wait 100ms and eventually fail, while choking your servers.
Your goal is to prevent users from waiting at all when you know they will definitely receive a timeout. In other words, if a system component is timing out, you don’t want the application to try to communicate with it. You want it to fail fast.


Fail Fast

Fail fast is a concept in fault-tolerant systems: stop normal flows and operations as soon as a failure is likely to occur. Such a design adds checkpoints before an operation executes, which check whether the operation is “healthy”. Using “health indicators”, the system knows if a certain operation will most likely fail. If an operation isn’t healthy, there’s no need to execute it, and we’d rather call a fallback behavior.
This way, we can give users a better experience than waiting for an error.
Let’s examine our case: the application shouldn’t try to execute the query unless the connection to the DB is “healthy”. So the system will check the “health indicator” before the execution, and if it returns false then it should immediately turn to the fallback and display the error page without even trying to connect to the DB.
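The check-then-fallback flow could be sketched in C# roughly like this. The `FailFast` helper and its parameter names are hypothetical, invented for illustration; the health indicator is assumed to be nothing more than a delegate returning true or false.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class FailFast
{
    // Runs the operation only when the health indicator reports "healthy".
    // Otherwise the fallback is returned immediately and the operation
    // (for example, a DB query) is never attempted.
    public static T Execute<T>(Func<bool> isHealthy, Func<T> operation, Func<T> fallback)
    {
        if (!isHealthy())
        {
            return fallback();
        }

        return operation();
    }
}
```

A caller would pass the DB query as `operation` and the “history is unavailable” response as `fallback`, so an unhealthy connection costs the user no waiting time at all.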
You can take it even further: if such health indicators exist, why not check them at application startup? If there’s a problem with the feature, let’s disable it and not even show it to the user. Perhaps it’s better to make the feature “disappear” than to expose the user to a poor experience. That, of course, is a pure business decision (it might be a really poor experience if the messages tab in Facebook suddenly “disappears”).
Allowing your system to disable features automatically must be done extremely carefully. You don’t want to disable a feature for all of your users if, for some reason, a timeout has occurred just once. If it happened 10 times in a row, then most likely there is a real problem, and only then should the feature be disabled.
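One way to express the “10 times in a row” rule is a health indicator that counts consecutive failures and resets on any success. This is only a sketch under my own assumptions: the class name `ConsecutiveFailureIndicator` and its members are hypothetical, and it deliberately leaves out how the feature gets re-enabled.

```csharp
using System;
using System.Threading;

// Hypothetical health indicator: unhealthy only after N failures in a row.
public sealed class ConsecutiveFailureIndicator
{
    private readonly int _threshold;
    private int _consecutiveFailures;

    public ConsecutiveFailureIndicator(int threshold)
    {
        if (threshold <= 0) throw new ArgumentOutOfRangeException(nameof(threshold));
        _threshold = threshold;
    }

    // Healthy as long as fewer than `threshold` failures happened in a row.
    public bool IsHealthy => Volatile.Read(ref _consecutiveFailures) < _threshold;

    // Any success breaks the streak, so a single stray timeout is forgotten.
    public void RecordSuccess() => Interlocked.Exchange(ref _consecutiveFailures, 0);

    public void RecordFailure() => Interlocked.Increment(ref _consecutiveFailures);
}
```

Once the indicator turns unhealthy, nothing in this sketch makes it healthy again unless something calls `RecordSuccess`, for instance a periodic probe. A real design needs to decide when to give the feature another chance.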
 

Polly

Polly is a really cool and useful open source library that helps you create a fault-tolerant system using policies for handling exceptions and creating fallback solutions. Let’s see how we can use Polly to handle our timeout scenario and allow the application to fail fast.

NavigationHistoryController has a dependency on the database where the navigation history is stored.
It has a single public method, which provides the history by calling the INavigationHistoryProvider.
We have a try-catch surrounding the provider call. As we know, this catch will eventually handle the timeout exception, but it won’t fail fast.
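A sketch of what such a controller might look like. Only the names NavigationHistoryController, INavigationHistoryProvider and Get come from the description; the method signatures, the `GetHistory` name and the `HistoryResult` type are my assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type: either the history items or an error message.
public class HistoryResult
{
    public IList<string> Items { get; set; } = new List<string>();
    public string Error { get; set; }
}

public interface INavigationHistoryProvider
{
    // Assumed signature; the text only names the method "Get".
    IList<string> Get(int userId);
}

public class NavigationHistoryController
{
    private readonly INavigationHistoryProvider _provider;

    public NavigationHistoryController(INavigationHistoryProvider provider)
    {
        _provider = provider;
    }

    public HistoryResult GetHistory(int userId)
    {
        try
        {
            return new HistoryResult { Items = _provider.Get(userId) };
        }
        catch (Exception)
        {
            // The application survives, but on a DB timeout this line is only
            // reached after the full timeout has elapsed, on every single request.
            return new HistoryResult { Error = "History is currently unavailable" };
        }
    }
}
```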

We have created a circuit breaker policy on the timeout exception (SQL error code -2), setting the count to 10 and the break duration to 10 minutes. In other words, the policy trips if the timeout SqlException occurs 10 times in a row, and for the next 10 minutes no call to the provider is executed. A BrokenCircuitException is thrown instead, allowing us to quickly fall back and return an error without waiting for the timeout.
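Roughly, such a policy could be wired into the controller as sketched below, using Polly's `Policy.Handle<T>(...)`, `CircuitBreaker(...)` and `Execute(...)` calls. The `HistoryResult` type, the `GetHistory` name, the `Get` signature, the static field and the `System.Data.SqlClient` namespace (newer projects may use `Microsoft.Data.SqlClient`) are assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using Polly;
using Polly.CircuitBreaker;

// Hypothetical result type: either the history items or an error message.
public class HistoryResult
{
    public IList<string> Items { get; set; } = new List<string>();
    public string Error { get; set; }
}

public interface INavigationHistoryProvider
{
    IList<string> Get(int userId); // assumed signature
}

public class NavigationHistoryController
{
    private const int SqlTimeoutErrorNumber = -2;
    private const string Unavailable = "History is currently unavailable";

    // Static so that all controller instances share one breaker state
    // (an assumption: a breaker created per request would never trip).
    private static readonly Policy TimeoutBreaker = Policy
        .Handle<SqlException>(ex => ex.Number == SqlTimeoutErrorNumber)
        .CircuitBreaker(10, TimeSpan.FromMinutes(10));

    private readonly INavigationHistoryProvider _provider;

    public NavigationHistoryController(INavigationHistoryProvider provider)
    {
        _provider = provider;
    }

    public HistoryResult GetHistory(int userId)
    {
        try
        {
            var items = TimeoutBreaker.Execute(() => _provider.Get(userId));
            return new HistoryResult { Items = items };
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open: the call was rejected without touching the DB.
            return new HistoryResult { Error = Unavailable };
        }
        catch (SqlException)
        {
            // One of the failures that happened before the circuit opened.
            return new HistoryResult { Error = Unavailable };
        }
    }
}
```

The policy rethrows the original SqlException for the failures that lead up to the break, which is why the sketch catches both exception types.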

 

Let’s test it

We’ll mock the NavigationHistoryProvider and raise an exception on every single call to its “Get” method.
We’ll call it 10 times and then another time to check if the policy was activated.
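A framework-free sketch of that check. SqlException has no public constructor, so this version simulates the failure with `TimeoutException` and builds a policy for that type instead; the `CircuitBreakerCheck` class and the inline failing delegate (standing in for a mocked provider) are hypothetical.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

public static class CircuitBreakerCheck
{
    // Returns true when the 11th call is rejected with BrokenCircuitException.
    // providerCalls reports how many times the fake provider was actually invoked.
    public static bool EleventhCallFailsFast(out int providerCalls)
    {
        int calls = 0;

        Policy breaker = Policy
            .Handle<TimeoutException>()
            .CircuitBreaker(10, TimeSpan.FromMinutes(10));

        // Stand-in for the mocked provider's Get: fails on every call.
        Func<string> failingGet = () =>
        {
            calls++;
            throw new TimeoutException("simulated DB timeout");
        };

        // Ten failing calls; each one surfaces the original exception.
        for (int i = 0; i < 10; i++)
        {
            try { breaker.Execute(failingGet); }
            catch (TimeoutException) { }
        }

        // The circuit should now be open, so this call must not reach the provider.
        bool failedFast = false;
        try { breaker.Execute(failingGet); }
        catch (BrokenCircuitException) { failedFast = true; }

        providerCalls = calls;
        return failedFast;
    }
}
```

If the breaker behaves as described, the method returns true and reports exactly 10 provider calls: the 11th call never reaches the provider.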

Works like a charm. The code is available on GitHub.
 

Conclusion

The applications that we build will have failures. We must recognize that things won’t always fail nicely. We must be prepared for all the possible scenarios and edge cases. The sooner we start thinking about them in our development process – the better.
I encourage you to think about it from the beginning. Start using FDD – Fail-Driven Development.


Best of luck.

5 comments:

  1. Your best one yet!
    Loved the concept and the open source library, hope I'll have the chance of using it soon.

  2. Well written.
    You got me to read the blog post about error messages too.
    I'm jealous that you have the patience to write articles like these.
    Thanks

  3. updated in 2016 or future versions in 2017 ?