Building Resilience Into Your Critical Applications

Resilience is the bedrock of all IT systems. No matter how powerful a technology is, if it’s not resilient, it won’t deliver business value. Your customers, partners, and internal end-users won’t consistently receive the performance they expect from the applications they consume.

The keys to building resilient applications are the patterns you use to build your service architecture. Done right, these patterns contain faults and preserve overall functionality even when a single service fails.

In this article, we present four types of patterns you can apply to your development methodology to make your critical applications resilient:

  • Isolation
  • Loose Coupling
  • Latency Control
  • Caching

By applying these patterns, you can ensure your applications support your business processes, helping your end-users work more productively and your partners and customers transact more business.

Isolation Resilience Patterns

Systems must protect themselves from malicious and malformed calls and return values. It’s crucial to validate incoming data, checking types, values, and formats against the rules your interface defines, before acting on it.
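As a minimal sketch of boundary validation in Python (the field names and ranges are illustrative assumptions, not a prescription):

    def validate_order_request(payload: dict) -> dict:
        """Reject malformed input before it reaches business logic."""
        # order_id and quantity are hypothetical fields for illustration.
        if not isinstance(payload.get("order_id"), str):
            raise ValueError("order_id must be a string")
        quantity = payload.get("quantity")
        if not isinstance(quantity, int) or not 1 <= quantity <= 1000:
            raise ValueError("quantity must be an integer between 1 and 1000")
        # Return only the fields you validated, dropping anything unexpected.
        return {"order_id": payload["order_id"], "quantity": quantity}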

Another important isolation pattern is the bulkhead design, which ensures one fault won’t bring down the rest of an application. It is particularly helpful in avoiding cases where processes fail and cause a large number of requests to build up.

Those requests may fail in parallel, and if unconstrained, the failures can consume ever more CPU, threads, and memory, degrading application performance or causing a crash. A faulted system can also create a logjam of requests that consumes all the compute resources in the environment.
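A minimal bulkhead sketch in Python, assuming a thread-based service; the concurrency limit is an illustrative value you would tune per dependency:

    import threading

    class Bulkhead:
        """Caps concurrent calls to one dependency so a slow or failing
        dependency cannot exhaust the threads shared by the whole app."""

        def __init__(self, max_concurrent: int):
            self._slots = threading.BoundedSemaphore(max_concurrent)

        def call(self, fn, *args, **kwargs):
            # Fail fast instead of queuing when the compartment is full.
            if not self._slots.acquire(blocking=False):
                raise RuntimeError("bulkhead full: rejecting call")
            try:
                return fn(*args, **kwargs)
            finally:
                self._slots.release()

Giving each downstream dependency its own compartment means a slow dependency can only exhaust its own slots, not the resources of the whole application.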

Loose Coupling Resilience Patterns

In a loosely coupled system, service components depend on few or none of the internal definitions of other service components and are less constrained by platforms, development languages, operating systems, and development environments. In addition to service components, loose coupling can be applied to data, classes, and interfaces.

Four types of autonomy promote loose coupling:

  • Reference autonomy
  • Time autonomy
  • Format autonomy
  • Platform autonomy

Another attribute of loosely coupled designs is relaxed temporal constraints. Your system does not need to be strictly consistent at every moment; consistency is expected over time. Strict consistency requires tight coupling, a pattern that is common in database management systems. Loose coupling lets you achieve a state of basic availability and eventual consistency.

For your REST APIs, you can also consider idempotency, which ensures that repeating the same call produces the same result. Then there’s self-containment, which isolates functions within independent systems. This avoids creating monolithic applications that keep growing to the point where they are difficult to maintain.
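As a sketch of the idempotency idea, assuming the client sends an idempotency key with each request (a real service would persist results durably rather than in memory):

    # Repeated calls with the same key return the stored result instead of
    # re-executing the side effect. The dict stands in for a durable store.
    _results: dict[str, dict] = {}

    def create_payment(idempotency_key: str, amount: int) -> dict:
        if idempotency_key in _results:
            return _results[idempotency_key]      # replay: no double charge
        result = {"status": "charged", "amount": amount}  # the real side effect
        _results[idempotency_key] = result
        return result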

Latency Control Resilience Patterns

Almost every application relies on timeouts to make sure calls don’t hang indefinitely and leave end-users with those annoying spinning wheels that never go away. As you set up your timeouts, allow for slower responses, but set limits so users aren’t left waiting for responses that will never come.
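For example, with Python’s third-party requests library you can set separate connect and read timeouts; the endpoint and values here are hypothetical:

    import requests  # assumes the requests library is installed

    def fetch_profile(user_id: str) -> dict:
        # A tight connect timeout catches unreachable hosts quickly; the read
        # timeout allows slower responses but still bounds the user's wait.
        response = requests.get(
            f"https://api.example.com/users/{user_id}",  # hypothetical endpoint
            timeout=(2, 10),  # (connect seconds, read seconds)
        )
        response.raise_for_status()
        return response.json()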

While timeouts limit the duration of operations and keep them from hanging, it’s difficult to predict a single value that will suit every operation. To complement timeouts, you can use a circuit breaker, which is based on the premise that failing fast is better than making users wait. By protecting a faulting system from overload, you help it recover faster.

Circuit breakers have three main states:

  • Closed—the circuit starts in <Closed> mode and executes actions while measuring their faults and successes. If the faults exceed a configured threshold, the circuit trips: the exception from the action that tripped it is rethrown, and the circuit switches to the <Open> state.
  • Open—while the circuit is <Open>, any action placed through the policy is not executed. Instead, the call fails fast with a broken circuit exception, which contains the last exception, the one that caused the circuit to break. The circuit remains <Open> for the configured break duration. After that duration, when the next action is placed through the circuit (or the circuit state is queried), the circuit transitions to <Half-Open>.
  • Half-Open—when <Half-Open>, the next action through the circuit is treated as a trial to determine the circuit’s health. While the trial is in progress, other attempts are rejected, throwing back a broken circuit exception.

During the <Half-Open> trial, if a handled exception is received, the system rethrows it and transitions the circuit back to <Open>, where it remains for another configured break duration. If a successful result is received, the circuit transitions back to <Closed>. If an unhandled exception is received, the circuit remains <Half-Open>.
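Here’s a minimal sketch of these three states in Python. It’s a single-threaded illustration with assumed threshold and break values; a production breaker would add locking and time-windowed fault counting:

    import time

    class CircuitBreaker:
        """Minimal three-state circuit breaker; thresholds are illustrative."""

        def __init__(self, fault_threshold=5, break_seconds=30.0):
            self.fault_threshold = fault_threshold
            self.break_seconds = break_seconds
            self.state = "closed"
            self.faults = 0
            self.opened_at = 0.0
            self.last_exception = None

        def call(self, fn, *args, **kwargs):
            if self.state == "open":
                if time.monotonic() - self.opened_at < self.break_seconds:
                    # Fail fast, carrying the exception that broke the circuit.
                    raise RuntimeError("circuit open") from self.last_exception
                self.state = "half_open"  # break elapsed: permit a trial call
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                self.faults += 1
                self.last_exception = exc
                if self.state == "half_open" or self.faults >= self.fault_threshold:
                    self.state = "open"   # trip, or re-trip after a failed trial
                    self.opened_at = time.monotonic()
                raise
            self.faults = 0
            self.state = "closed"         # success returns the circuit to closed
            return result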

Other latency control design patterns to consider include bounded queues, which limit the request queue size for highly utilized resources and so avoid the latency caused by an ever-growing backlog.
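A bounded queue sketch using Python’s standard library; the queue size is an illustrative assumption:

    import queue

    # Producers are rejected once the queue is full, so the backlog
    # (and the latency it causes) stays capped.
    work = queue.Queue(maxsize=100)

    def submit(task) -> bool:
        try:
            work.put_nowait(task)   # raises queue.Full instead of blocking
            return True
        except queue.Full:
            return False            # caller can fail fast or retry later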

Load shedding helps you control execution when too many requests are started at once, which can consume extra compute resources, cause processes to crash, and block all other requests from being served. The rate limit applied by load shedding solves this by capping how many resources the system will consume to fulfill tasks.
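One common way to apply such a rate limit is a token bucket; here is a minimal single-threaded sketch, with illustrative rate and burst values:

    import time

    class TokenBucket:
        """Load-shedding sketch: requests beyond the sustained rate are dropped."""

        def __init__(self, rate_per_sec: float = 50.0, burst: int = 100):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the elapsed time, up to the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False   # shed this request rather than queue it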

If you’re building apps in the cloud, you can also tailor systems to scale automatically. To do this, the load situation must be known, which means measuring the load your users generate and the resources each instance consumes.
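As a back-of-the-envelope sketch of that calculation, assuming you have load-tested a single instance to establish its capacity (the headroom figure is a hypothetical safety margin):

    import math

    def desired_instances(requests_per_sec: float,
                          capacity_per_instance: float,
                          headroom: float = 0.25) -> int:
        """Provision for the measured load plus headroom.
        capacity_per_instance must come from load testing."""
        needed = requests_per_sec * (1 + headroom) / capacity_per_instance
        return max(1, math.ceil(needed))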

Caching Resilience Patterns

Caching is one of the easiest ways to increase system performance and speed up recurring access to information in a data store. But since cached data may not always be consistent with the underlying data store, keep cached data as fresh as possible and design the application to recognize when data is stale.
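One simple way to recognize stale data is a time-to-live (TTL) check; a minimal sketch, with an illustrative 60-second freshness window:

    import time

    _cache: dict[str, tuple[float, object]] = {}
    TTL_SECONDS = 60  # illustrative freshness window

    def put_cached(key: str, value) -> None:
        _cache[key] = (time.monotonic(), value)

    def get_cached(key: str):
        entry = _cache.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > TTL_SECONDS:
            del _cache[key]     # treat as stale; caller must refetch
            return None
        return value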

The key benefits of caching include greater scalability, reduced load on downstream services, and fewer incidents of degraded performance. But caching can also negatively affect application availability, so consider your approach to these caching aspects:

  • How long you will store cached data
  • When you will evict data
  • Cache priming
  • In-memory caching

There are several strategy patterns you can follow to handle these processes. These include Cache-Aside, Read-Through, Write-Through, Write-Around, and Write-Back caches.
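As an illustration, here’s a minimal Cache-Aside sketch, the simplest of these strategies; load_from_database is a hypothetical stand-in for your real data store:

    def load_from_database(product_id: str) -> dict:
        # Hypothetical stand-in for a real data-store query.
        return {"id": product_id, "name": "example"}

    _product_cache: dict[str, dict] = {}

    def get_product(product_id: str) -> dict:
        value = _product_cache.get(product_id)      # 1. check the cache
        if value is None:
            value = load_from_database(product_id)  # 2. miss: hit the store
            _product_cache[product_id] = value      # 3. populate the cache
        return value

In Cache-Aside, the application, not the cache, owns the logic for misses and population, which keeps the cache itself simple and easy to replace.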

A Better Position to Deal With Application Failures

To increase the resilience of your applications, focus on building components that tolerate faults within their own scope as well as the failures of the components they integrate with. While traditional approaches have emphasized increasing uptime, the modern resiliency-pattern approach strives to reduce recovery time and overall downtime.

Application failures are inevitable at some point. But by taking the resiliency approach, you will be better positioned to deal with failures and avoid having applications unavailable for long periods of time.

For more information on how to make your applications more resilient, contact us.