Spring WebFlux: Simple Retry Strategies With WebClient

It's often the case with REST requests that you don't get a good response on the first try. Fortunately, you can implement a retry strategy with the Spring WebFlux WebClient API.

So how do you do it? There are multiple ways.

And in this guide, I'll show you the easiest ways to make it happen.

Alternatively, you can go straight to the source code on GitHub.

Or just hang out here and read the explanations.

Full disclosure here: I've updated the contact service to Spring Boot 2.3.8 since the last guide. That means I've had to add some dependencies to the POM file as the folks at Spring can't leave well enough alone. Be sure to check it out if you're experiencing problems.

The Business Requirements

Your boss Smithers walks into your office with a stern look on his face.

"Once is not enough!" he says.

"Sometimes, when you call those downstream services, you need to try again if the service request fails," he continues. "You can't just go with a one-and-done strategy! What if the service comes back up after a few seconds?!?"

He leaves your office, but you hear him repeatedly shouting "More than once!" as he walks down the hallway.

Picking Up Where You Left Off

Fortunately, you've already got the guts of a WebClient call that handles retrieving a response from a downstream service and transforming it into a Plain Old Java Object (POJO).

Now you're going to update that call so that it performs retries.

And, as I mentioned above, you can do it multiple ways.

Let's start with the very easiest. Update the fetchUser() method in UserService as follows:

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    .retry() // resubscribe on any error, with no upper limit
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

See that retry() line in there? That's your retry.

However, that particular way of handling a retry has some disadvantages. One of those disadvantages is that it will retry forever.

Yeah. That's probably not what you want.

So let's look at another option.

Putting a Limit on Retries

Maybe instead of retrying infinitely, you might just want to retry three (3) times. You can do that as follows:

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    .retry(3) // up to 3 retries after the initial attempt
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

Yep. Just put a 3 inside the retry() method.

Your application will now retry up to three (3) times, so it makes at most four (4) attempts in total at contacting the downstream service before giving up. When it does give up, you'll get the last error that the service returned.
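If the counting feels slippery, here's a minimal plain-Java sketch of it. It isn't Reactor code, and the RetryCountSketch class and its callWithRetries() method are names I made up for illustration. It only shows that three retries means one initial attempt plus up to three more.

```java
import java.util.function.Supplier;

// Hypothetical illustration only: mimics the counting behind retry(n).
// This is NOT how Reactor implements retries.
final class RetryCountSketch {

    // How many times the task ran during the most recent callWithRetries() call.
    static int attempts;

    // Runs the task once, then up to maxRetries more times while it keeps throwing.
    static <T> T callWithRetries(Supplier<T> task, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }

        attempts = 0;
        RuntimeException last = null;

        for (int i = 0; i <= maxRetries; i++) {
            attempts++;
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the latest failure
            }
        }

        // Retries exhausted: surface the latest error.
        throw last;
    }
}
```

With a task that always fails and maxRetries set to 3, the task runs four times before the last error is rethrown.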

Just keep in mind: there's no delay there between retries. It just keeps retrying as quickly as possible.

That can make matters worse if the system is being overwhelmed. So you might need something more sophisticated.

Also, you'll retry for everything. Even a bad request.

It's safe to say that no matter how many times you retry a bad request, it's still going to fail.

So you need more.

Enter retryWhen()

The code above uses the very simple retry() method to specify a fixed number of retries before giving up. But you can go a little deeper than that with a full-blown retry strategy.

To do that, use retryWhen() instead of retry().

It used to be the case that you could use retryWhen() with a Function. But that's been deprecated.

Now you use it with a Retry object.

Retry is an abstract class that handles retry strategies. It lives in the reactor.util.retry package. The easiest way to work with it is to use one of its static methods, which return one of its child classes: RetrySpec or RetryBackoffSpec.

So, for example:

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    .retryWhen(Retry.max(3)) // Retry is reactor.util.retry.Retry
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

That's going to give you three (3) retries as in the previous example.

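BUT there's a big difference here. When you go with retryWhen() instead of the more simplistic retry(), the application doesn't just hand you the last error when the retries are exhausted. It throws its own exception, with the last error tucked inside as the cause.

That exception class name is unsurprisingly RetryExhaustedException. It's an internal Reactor class that extends IllegalStateException, so you can't catch it by name. You can check for it with Exceptions.isRetryExhausted() instead.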

And that's important because if you put together some cute strategy for handling errors, that strategy probably isn't going to handle that exception.

So you'll get a massive stacktrace in your log.

Also: that strategy suffers from the same limitation as retry(3). You probably need something that puts a delay between requests.

Fix It With a Delay

Let me introduce you to a new method Retry.fixedDelay(). Here it is in action:

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    // up to 3 retries, waiting 5 seconds before each one
                    .retryWhen(Retry.fixedDelay(3, Duration.ofSeconds(5)))
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

Okay so here's what's happening with that Retry.fixedDelay() line.

The first parameter is the number of retries. You've already seen that in action.

The second parameter is fairly intuitive: it's a Duration that sets the delay before each retry. In this case, it's five (5) seconds.

So the application will retry three (3) times at most after the initial request. And it will pause five (5) seconds before each retry.

Okay, you're getting warmer. But still not quite there yet.

Back Off!

You need to back off, dude.

No, seriously, you need to use Retry.backoff().

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    // up to 3 retries, with a growing delay that starts at 5 seconds
                    .retryWhen(Retry.backoff(3, Duration.ofSeconds(5)))
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

Literally the only thing that changed from the last code block to this one is the name of the method after Retry. It's now backoff() instead of fixedDelay().

When you opt for a backoff instead of a fixed delay strategy, you're telling the application to wait a little longer each time.

So whereas fixedDelay() used the same delay of five (5) seconds between each request, backoff() uses a minimum delay of five (5) seconds between each request.

In other words, when a request fails, and then it waits five seconds to try again, and then that request fails, the application essentially tells itself: "Hey, maybe I didn't wait long enough between those first two requests. I'll wait a little longer this time."
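Here's a rough plain-Java sketch of that idea. It assumes the delay doubles with every retry and never goes past a ceiling, which is how I understand Reactor's backoff to behave before any jitter is applied. BackoffScheduleSketch and delays() are hypothetical names, not part of any library.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: the un-jittered delays of a doubling backoff.
// Assumption: each delay is twice the previous one, capped at maxBackoff.
final class BackoffScheduleSketch {

    // Returns the delay to wait before each retry, in order.
    static List<Duration> delays(int maxRetries, Duration minBackoff, Duration maxBackoff) {
        List<Duration> result = new ArrayList<>();
        Duration next = minBackoff;

        for (int retry = 0; retry < maxRetries; retry++) {
            // Never wait longer than the ceiling.
            Duration capped = next.compareTo(maxBackoff) > 0 ? maxBackoff : next;
            result.add(capped);

            // Double for the next round, but stop growing once the ceiling is hit.
            next = capped.compareTo(maxBackoff) < 0 ? capped.multipliedBy(2) : maxBackoff;
        }

        return result;
    }
}
```

For three retries, a five-second minimum, and a generous ceiling, that works out to delays of 5, 10, and 20 seconds.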

That's what's happening with backoff(). And for your environment, that solution may very well be good enough.

But what happens if you have dozens or even hundreds of clients all trying to access the downstream application at the same time? They'll all use the same delay algorithm and you're still putting a strain on the system.

Fortunately, there's a solution for that.

Giving You the Jitters

Now it's time to learn about something called jitter.

Essentially, jitter randomizes the time between retries so that if you have several clients all using backoff() they won't all attempt retries at the exact same time.

And you don't have to do anything to make it happen. Retry.backoff() takes care of it for you.
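If you're curious what that randomization looks like, here's a simplified plain-Java sketch. It assumes a jitter factor of 0.5 (which, as far as I know, is the default that Retry.backoff() uses), and it isn't Reactor's actual implementation. JitterSketch and withJitter() are made-up names.

```java
import java.time.Duration;
import java.util.Random;

// Hypothetical illustration of jitter. A simplified model, not Reactor's code.
final class JitterSketch {

    // Shifts baseDelay by a random amount of at most jitterFactor * baseDelay
    // in either direction, without ever dropping below minBackoff.
    static Duration withJitter(Duration baseDelay, Duration minBackoff,
                               double jitterFactor, Random random) {
        if (jitterFactor < 0.0 || jitterFactor > 1.0) {
            throw new IllegalArgumentException("jitterFactor must be between 0 and 1");
        }

        long baseMillis = baseDelay.toMillis();
        long spread = Math.round(baseMillis * jitterFactor);

        // Random offset between -spread and +spread.
        long offset = (long) ((random.nextDouble() * 2.0 - 1.0) * spread);

        long jittered = Math.max(minBackoff.toMillis(), baseMillis + offset);
        return Duration.ofMillis(jittered);
    }
}
```

With a 10-second base delay and a factor of 0.5, every client picks its own delay somewhere between 5 and 15 seconds, so they stop stampeding in lockstep. If the default doesn't suit you, RetryBackoffSpec has a jitter() method that takes a factor between 0 and 1.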

Now you've got a solution with randomized, progressive delays between each attempt.

But you still have a problem. It's retrying for all errors.

As a rule of thumb, you should only retry for 500-level errors. Those are the "service unavailable" or "internal server error" codes.

Fortunately, you can take care of that.

Another Filter

Take a look at your WebClientFilter class. Add a new method.

    public static boolean is5xxException(Throwable ex) {
        // Only a ServiceException carries a status code; anything else is not eligible.
        if (ex instanceof ServiceException) {
            int statusCode = ((ServiceException) ex).getStatusCode();
            return statusCode >= 500 && statusCode <= 599;
        }

        return false;
    }

All that does is check for a 500-level exception. It's in the filter class because that's where all your other filter code is located so it makes perfect sense to put it there.

Note that the code uses the ServiceException class you created in a previous guide. That custom class includes a field that records the response status code.
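If you want to sanity-check that predicate without starting Spring, here's a self-contained sketch. The ServiceException below is only a stand-in, and I'm assuming the real one looks roughly like this: a message plus a status code. RetryEligibilitySketch is a made-up name.

```java
// Stand-in for the ServiceException from the earlier guide.
// Assumption: it stores a message and the HTTP status code.
class ServiceException extends RuntimeException {

    private final int statusCode;

    public ServiceException(String message, int statusCode) {
        super(message);
        this.statusCode = statusCode;
    }

    public int getStatusCode() {
        return statusCode;
    }
}

// Hypothetical holder class so the predicate can be tried outside of Spring.
final class RetryEligibilitySketch {

    // True only for a ServiceException whose status code is in the 500-599 range.
    public static boolean is5xxException(Throwable ex) {
        if (ex instanceof ServiceException) {
            int statusCode = ((ServiceException) ex).getStatusCode();
            return statusCode >= 500 && statusCode <= 599;
        }
        return false;
    }
}
```

So is5xxException() says yes to a 503, and no to a 400 or to any exception that isn't a ServiceException. That last part matters: the filter only helps if the error that reaches retryWhen() really is a ServiceException.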

Now you can use that filter like this:

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    .retryWhen(Retry.backoff(3, Duration.ofSeconds(5))
                            // only retry errors that the filter flags as 500-level
                            .filter(ex -> WebClientFilter.is5xxException(ex)))
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

Pay close attention to that .filter() method chained onto Retry.backoff() inside retryWhen(). That's where the magic happens.

That will ensure the code only retries 500-level errors.

Now it's time to get rid of that stacktrace that happens once the maximum number of retries is exhausted.

Stopping the Stack

You don't need that exception stacktrace cluttering your logs. And you know Smithers won't go for it either so now you need to catch that exception and do something with it.

Fortunately, you've already got a ServiceException class that you can use to handle such errors.

So update your code to look like this:

    public SalesOwner fetchUser(String bearerToken) {
        try {
            SalesOwner salesOwner = userClient.get()
                    .uri("/user/me")
                    .header(HttpHeaders.AUTHORIZATION, bearerToken)
                    .retrieve()
                    .bodyToMono(SalesOwner.class)
                    .retryWhen(Retry.backoff(3, Duration.ofSeconds(5))
                            .filter(ex -> WebClientFilter.is5xxException(ex))
                            // swap the framework's exception for your own once retries run out
                            .onRetryExhaustedThrow((retryBackoffSpec, retrySignal) ->
                                    new ServiceException("Max retry attempts reached", HttpStatus.SERVICE_UNAVAILABLE.value())))
                    .block();

            LOG.debug("User is " + salesOwner);

            return salesOwner;
        } catch (WebClientResponseException we) {
            throw new ServiceException(we.getMessage(), we.getRawStatusCode());
        }
    }

And there it is. Take a look at that onRetryExhaustedThrow() method. That's where you'll intercept the exception thrown by the framework and instead throw your own custom exception.

Then you can handle it as you see fit.
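To see how the filter and the exhaustion handler play together, here's a plain-Java sketch of the same policy with the delays left out. It's not how Reactor implements it. RetryPolicySketch, its call() method, and the stand-in ServiceException are all assumptions for illustration.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Stand-in for the guide's ServiceException (assumed shape: message + status code).
class ServiceException extends RuntimeException {

    private final int statusCode;

    public ServiceException(String message, int statusCode) {
        super(message);
        this.statusCode = statusCode;
    }

    public int getStatusCode() {
        return statusCode;
    }
}

// Hypothetical illustration of "retry only some errors, then throw your own exception".
final class RetryPolicySketch {

    // How many times the task ran during the most recent call().
    static int attempts;

    static <T> T call(Supplier<T> task, int maxRetries, Predicate<Throwable> retryable) {
        attempts = 0;

        for (int retry = 0; ; retry++) {
            attempts++;
            try {
                return task.get();
            } catch (RuntimeException e) {
                // Filtered out: fail right away with the original error.
                if (!retryable.test(e)) {
                    throw e;
                }
                // Out of retries: throw the custom exception instead.
                if (retry >= maxRetries) {
                    throw new ServiceException("Max retry attempts reached", 503);
                }
                // A real policy would wait here (backoff plus jitter) before looping.
            }
        }
    }
}
```

A 400 fails right away after a single attempt with the original exception. A 503 that never recovers is tried four times (one initial attempt plus three retries) and then surfaces as the "Max retry attempts reached" exception.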

Wrapping It Up

Congrats! You now know quite a bit about how to handle retries with WebClient.

Now take everything that you've learned and use it to suit your own business requirements. Change the number of retries and delays. Update how you handle exceptions.

And, as always, be sure to take a look at the source code.

Have fun!

Photo by Andrea Piacquadio from Pexels