Implementing Windows Azure Retry Logic

Windows Azure will automatically repair itself. Can your service? In this post I’m going to show you a simple way to make your service a little more resilient by adding retry logic.

Transient Datacenter Conditions

When you deal with external services of any type, there are times when the service might not respond. This could be due to any number of reasons, from network connectivity problems and service throttling to hardware failure. Windows Azure is designed to withstand these failures, not by avoiding them, but by taking corrective action when they occur: it heals itself automatically. These conditions are sometimes referred to as transient conditions because they are typically short-lived.

As an example, SQL Azure can give you connection errors and throttling errors, Windows Azure Storage can give you timeout and throttling errors, and Service Bus has ServerBusy and MessagingCommunication exceptions.

Any other external dependency will likely have similar conditions. Without defensive coding for these transient conditions, your app will suffer unnecessary outages. Fortunately, the problem can be easily resolved.

Retry Logic

Handling these conditions is usually as easy as repeating the operation after a short delay.
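As a rough illustration of "repeat the operation after a short delay", here is a minimal sketch that does not depend on any Azure library. The names SimpleRetry, Execute and isTransient are hypothetical and exist only for this example; which exceptions count as transient is an assumption you would have to make for each service.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of any Azure SDK.
public static class SimpleRetry
{
    // Runs the action once, then retries up to maxRetries more times
    // when isTransient says the failure is worth retrying.
    public static void Execute(Action action, int maxRetries, TimeSpan delay,
                               Func<Exception, bool> isTransient)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (Exception ex)
            {
                // Give up on non-transient errors or when retries are exhausted.
                if (attempt >= maxRetries || !isTransient(ex)) throw;

                // Transient failure: wait briefly, then try again.
                Thread.Sleep(delay);
            }
        }
    }
}

public static class RetryDemo
{
    public static void Main()
    {
        int calls = 0;

        // Simulated operation that times out twice before succeeding.
        SimpleRetry.Execute(
            () => { if (++calls < 3) throw new TimeoutException("simulated"); },
            5,
            TimeSpan.FromMilliseconds(10),
            ex => ex is TimeoutException);

        Console.WriteLine("Succeeded after {0} calls", calls); // Succeeded after 3 calls
    }
}
```

The predicate keeps the helper from retrying errors that will never succeed, such as a bad argument, which would only delay the failure.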

The Windows Azure Storage Client Library that ships with the SDK already has retry behavior, but you need to switch it on. You can do this on any storage client by setting its RetryPolicy property.

SQL Azure doesn’t provide a default retry mechanism out of the box, since it uses the SQL Server client libraries. Service Bus also doesn’t provide a retry mechanism.

Transient Fault Handling Framework

To provide an easy way to handle this, the Windows Azure Customer Advisory Team has developed a Transient Fault Handling Framework. The framework provides a number of ways to handle specific SQL Azure, Storage, Service Bus and Cache conditions.

The most interesting aspect to me, however, is the ExecuteAction and ExecuteAction<T> methods. These methods let you wrap any user code in a retry block. Example:

var policy = new RetryPolicy<SqlAzureTransientErrorDetectionStrategy>(MaxRetries,
                 TimeSpan.FromMilliseconds(DelayMs));
policy.ExecuteAction(() => someObject.DoSomething());

Retry Pattern

What is great about these methods is that they let you use the decorator pattern to add retry logic to your service. This, of course, assumes you built your service with extensibility in mind.

In my example I have a UriRepository which is defined by the IUriRepository interface. I have a SQLAzureUriRepository that implements the interface. This class, however, contains no retry logic. Instead I implemented a RetryUriRepository that also implements IUriRepository. RetryUriRepository lets you specify, via constructor injection, which IUriRepository to retry.

Here is a snippet of the RetryUriRepository:

public class RetryUriRepository : IUriRepository
{
    private readonly IUriRepository _uriRepository;
    private const int MaxRetries = 10;
    private const double DelayMs = 2000;

    public RetryUriRepository(IUriRepository uriRepository)
    {
        _uriRepository = uriRepository;
    }

    public void InsertUri(string shortUri, string longUri, string ipAddress)
    {
        // Wrap the call to the inner repository in the retry policy.
        var policy = GetRetryPolicy();
        policy.ExecuteAction(() => _uriRepository.InsertUri(shortUri, longUri, ipAddress));
    }

    private static RetryPolicy GetRetryPolicy()
    {
        // Retry up to MaxRetries times, waiting DelayMs between attempts.
        return new RetryPolicy<SqlAzureTransientErrorDetectionStrategy>(MaxRetries,
                     TimeSpan.FromMilliseconds(DelayMs));
    }
}

Using the supplied framework might be overkill, but it should give you an idea of how to implement retry logic in your service.
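If the framework is more than you need, the same decorator idea can be sketched with plain C#. Everything below is an assumption made for illustration: the one-method IUriRepository is a cut-down stand-in for the real interface, FlakyUriRepository simulates a store that times out twice, and treating TimeoutException as the only transient error is a simplification.

```csharp
using System;
using System.Threading;

// Hypothetical, cut-down version of the repository contract.
public interface IUriRepository
{
    void InsertUri(string shortUri, string longUri, string ipAddress);
}

// Test double: fails with a timeout on the first two calls, then succeeds.
public class FlakyUriRepository : IUriRepository
{
    public int Calls { get; private set; }

    public void InsertUri(string shortUri, string longUri, string ipAddress)
    {
        Calls++;
        if (Calls < 3) throw new TimeoutException("simulated transient failure");
    }
}

// Decorator: implements the same interface and adds retries around
// whichever repository is injected through the constructor.
public class SimpleRetryUriRepository : IUriRepository
{
    private readonly IUriRepository _inner;
    private readonly int _maxRetries;
    private readonly TimeSpan _delay;

    public SimpleRetryUriRepository(IUriRepository inner, int maxRetries, TimeSpan delay)
    {
        _inner = inner;
        _maxRetries = maxRetries;
        _delay = delay;
    }

    public void InsertUri(string shortUri, string longUri, string ipAddress)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                _inner.InsertUri(shortUri, longUri, ipAddress);
                return;
            }
            catch (TimeoutException)
            {
                // Rethrow once the retry budget is used up.
                if (attempt >= _maxRetries) throw;
                Thread.Sleep(_delay);
            }
        }
    }
}

public static class DecoratorDemo
{
    public static void Main()
    {
        var flaky = new FlakyUriRepository();
        IUriRepository repository =
            new SimpleRetryUriRepository(flaky, 5, TimeSpan.FromMilliseconds(10));

        // The caller sees a single successful insert.
        repository.InsertUri("abc", "http://example.com/a/long/path", "127.0.0.1");
        Console.WriteLine(flaky.Calls); // 3
    }
}
```

Because callers depend only on the interface, the retry behavior can be added or removed where the repository is constructed, without touching the SQL Azure implementation.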

Too Much Retry

Things get interesting when the number of retries increases. This typically indicates either a longer-lasting error condition or that you are overloading the services you are consuming. The most likely cause, and the only one we can do anything about, is the latter. The more throttling that goes on, the more retries. The more retries, the less throughput. The less throughput, the slower the response time. Poor response time = disgruntled users (and executives).

Don’t be tempted to turn off the retry logic when this happens. This will just make the problem much worse. About the only solution when dealing with overloading a service is to either scale that service out, or attempt to delay the processing using a queue/worker pattern.
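To notice this kind of overload early, it helps to count how often retries happen. The sketch below is one possible way to do that, not something taken from the framework: CountingRetry and RetryCount are hypothetical names, and the counter is only an in-process number that you would still need to publish to whatever monitoring you use.

```csharp
using System;
using System.Threading;

// Hypothetical helper that retries and keeps a running total of retries.
public static class CountingRetry
{
    private static long _retryCount;

    // Total retries performed since the process started.
    public static long RetryCount
    {
        get { return Interlocked.Read(ref _retryCount); }
    }

    public static void Execute(Action action, int maxRetries, TimeSpan delay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (TimeoutException)
            {
                if (attempt >= maxRetries) throw;

                // Count the retry in a thread-safe way before waiting.
                Interlocked.Increment(ref _retryCount);
                Thread.Sleep(delay);
            }
        }
    }
}

public static class MonitorDemo
{
    public static void Main()
    {
        int calls = 0;

        // Simulated operation that times out twice before succeeding.
        CountingRetry.Execute(
            () => { if (++calls < 3) throw new TimeoutException("simulated"); },
            5,
            TimeSpan.FromMilliseconds(10));

        Console.WriteLine("Retries so far: {0}", CountingRetry.RetryCount); // Retries so far: 2
    }
}
```

A steadily rising count under normal load is the signal to scale out or move work onto a queue, rather than to remove the retries.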


Implementing retry logic is critical if you want your service to keep running. Monitoring the frequency of these retries can be a good indicator you are starting to experience scale issues. Don’t turn your retry logic off to handle scale issues.