In many real-world scenarios, executing a business process requires multiple sequential API calls.
While these API calls are typically handled independently, from a business logic perspective they all belong to a single workflow.

For example, in an airline ticket purchase flow, the following steps usually take place:

  1. In the first API call, the list of available flights along with their prices is retrieved and shown to the user.
  2. In the second API call, the user reserves one of the selected flights.
  3. In the third API call, the payment is processed and the purchase is finalized.
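The three-step flow above can be sketched as a small state machine for an order. The state names and transitions here are illustrative, not part of any specific API:

```python
# Illustrative state machine for the ticket purchase flow.
# State names are assumptions made for this sketch.
TRANSITIONS = {
    None: "QUOTED",        # step 1: flights and prices shown to the user
    "QUOTED": "RESERVED",  # step 2: the user reserves a selected flight
    "RESERVED": "PAID",    # step 3: payment is processed and finalized
}

def advance(current_state):
    # Move the order to the next state, rejecting invalid transitions.
    if current_state not in TRANSITIONS:
        raise ValueError(f"no transition from {current_state}")
    return TRANSITIONS[current_state]

state = None
for _ in range(3):       # walk through all three API calls
    state = advance(state)
```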

The key point is that if the flight price changes between these steps, the user should still be able to continue the purchase based on the information shown in the first step, including the price.

In other words, the system must preserve process state across multiple independent requests.

The Challenge in a Horizontally Scaled Architecture

Now consider a system that is implemented using horizontal scaling. In this type of architecture, each API call may be handled by a different service instance.

If each instance is connected only to its own local or dedicated database, there is a real risk that information from previous steps will not be available when subsequent requests are processed.

As a result, a multi-step workflow can easily fail or end up in an inconsistent state.

Common Approaches

In general, there are two well-known approaches to addressing this problem:

  1. Sticky Session

    A mechanism that ensures the load balancer routes every request from a given user, from the second API call onward, to the instance that handled the first call.
  2. Shared State

    A mechanism that allows all instances to access the API call history and the latest state of the process (Order State).

At first glance, Sticky Session may seem like a simple way to manage state in horizontally scaled systems, but it comes with notable limitations. It tightly couples a user to a specific instance, so if that instance fails, the user's entire workflow is disrupted. It also hinders even load distribution and limits true scalability. That said, for simple or short-lived workflows, or for low-traffic systems with minimal state requirements, it can serve as a temporary or transitional solution. For these reasons, modern, scalable, cloud-native architectures usually favor Shared State as the more robust and flexible option.

👉 In this article, the focus is on the second approach, as it is generally considered more durable, flexible, and reliable in modern cloud-native systems.

Using Shared Storage to Persist State

In this scenario, the most effective approach is to store process state and API-related data in a shared and neutral data store—one that is independent of each instance’s local database and accessible to all instances.

Within this architecture, NoSQL databases are a natural fit.
When low latency and high throughput are the primary concerns, Redis is a very common and effective choice for storing temporary state.

The general pattern looks like this:

  • On each API call, the process state is stored or updated in Redis.
  • Once the final step is completed successfully (for example, payment), the data is persisted to the main database.
  • Finally, the temporary state is removed from Redis
    (or, if historical data is important, it is stored in the primary database).
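The pattern above can be sketched roughly as follows. This is a minimal illustration: a plain dict stands in for Redis and another for the main database, and the function names (`save_state`, `finalize`) are assumptions made for this sketch, not a specific library's API.

```python
import json

cache = {}    # stand-in for Redis (temporary process state)
main_db = {}  # stand-in for the primary database

def save_state(order_id, state):
    # On each API call: store or update the process state in the cache.
    cache[f"order:{order_id}"] = json.dumps(state)

def finalize(order_id):
    # Final step: persist to the main database first, then remove
    # the temporary state from the cache.
    key = f"order:{order_id}"
    state = json.loads(cache[key])
    main_db[order_id] = state   # durable persistence comes first
    del cache[key]              # temporary state is cleaned up afterwards
    return state

save_state("42", {"status": "RESERVED", "price": 3200000})
save_state("42", {"status": "PAID", "price": 3200000})
final = finalize("42")
```

Note the ordering in `finalize`: the durable write happens before the cache cleanup, so a crash between the two steps leaves redundant (but recoverable) data rather than lost data.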

Concurrency Management

When data stored in a shared database such as Redis can be read or updated concurrently from multiple execution paths, concurrency management becomes critical.

In these situations, using Optimistic Locking to prevent inconsistencies during state transitions is a recommended and effective strategy.

In this architecture, Optimistic Locking is typically implemented at the cache layer.
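A sketch of optimistic locking at the cache layer: each update succeeds only if the version it originally read is still current. A dict stands in for Redis here; in a real deployment the read-check-write sequence itself would need to be made atomic server-side (for example with WATCH/MULTI or a Lua script).

```python
import json

cache = {"order:42": json.dumps({"status": "RESERVED", "version": 3})}

def update_state(key, expected_version, changes):
    # Optimistic locking: apply changes only if no one else has
    # updated the state since we read version `expected_version`.
    state = json.loads(cache[key])
    if state["version"] != expected_version:
        return False            # a concurrent update won; caller must re-read
    state.update(changes)
    state["version"] += 1       # bump the version on every successful write
    cache[key] = json.dumps(state)
    return True

ok = update_state("order:42", 3, {"status": "PAID"})         # version matches
stale = update_state("order:42", 3, {"status": "CANCELLED"}) # version is now 4
```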

Practical Considerations When Using Redis / NoSQL for State Management

Although Redis and other NoSQL databases provide an effective way to store process state, building a truly production-ready solution requires careful attention to several practical considerations.

1) Defining TTL for States

States created during a multi-step workflow usually have a limited lifespan.
For example, an airline ticket reservation might only remain valid for 10 to 15 minutes.

For this reason, an appropriate TTL (Time To Live) should be defined when storing state in Redis in order to:

  • Prevent unused data from accumulating
  • Automatically clean up abandoned or incomplete states
  • Keep Redis memory usage under control

In practice, TTL acts as an important guardrail in stateful systems.

2) Idempotency in API Calls

In distributed systems, repeated requests are a natural occurrence. Client retries, network timeouts, or instance failures can all result in the same request being sent more than once.

To avoid duplicate processing, each sensitive API call—especially reservation and payment—must be designed to be idempotent.

A common pattern includes:

  • Generating a unique idempotency key for each request
  • Storing this key alongside the process state in Redis
  • Validating the key before applying any state changes

This approach prevents duplicate execution and data inconsistencies.
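The pattern above can be sketched as follows. A Python set stands in for the idempotency keys stored in Redis (in production this could be a `SET key value NX` write); the function names are illustrative.

```python
seen_keys = set()        # stand-in for idempotency keys stored in Redis
payments_processed = []  # record of side effects actually performed

def process_payment(order_id, idempotency_key):
    # Validate the key before applying any state change.
    if idempotency_key in seen_keys:
        return "duplicate-ignored"   # retry detected: no side effects
    seen_keys.add(idempotency_key)
    payments_processed.append(order_id)  # the real work happens once
    return "processed"

first = process_payment("42", "a8f1-9c23")
retry = process_payment("42", "a8f1-9c23")  # client retry with the same key
```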

3) Failure Scenarios and Redis Reliability

Redis is primarily an in-memory data store, which makes careful consideration of failure scenarios essential.

For example:

  • What happens if Redis becomes temporarily unavailable?
  • What if payment succeeds but persisting the data to the main database fails?

To reduce these risks:

  • Redis should be deployed in a clustered or replicated configuration.
  • Critical operations should be designed to be recoverable.
  • A state should only be considered “successful” once the data has been safely persisted in the primary database.

4) Atomicity in State Updates

In some steps, multiple related changes must be applied to the state at the same time. For example:

  • Updating the order status
  • Recording a timestamp
  • Storing the payment result

To avoid partial or inconsistent states, Redis transactions or Lua scripts can be used to execute these changes atomically, ensuring the state always remains valid.
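One way to picture this guarantee: the full new value is composed first and published with a single write, so readers never observe a partially updated state. The sketch below uses a dict as a stand-in for Redis; when the update also depends on the current server-side value, Redis MULTI/EXEC or a Lua script provides the same all-or-nothing property on the server.

```python
import json

cache = {"order:42": json.dumps({"status": "RESERVED"})}

def apply_payment_result(key, paid_at, payment_ref):
    state = json.loads(cache[key])
    # All three related changes are composed locally...
    state["status"] = "PAID"
    state["updatedAt"] = paid_at
    state["paymentRef"] = payment_ref
    # ...and published with one write, so no partial state is visible.
    cache[key] = json.dumps(state)

apply_payment_result("order:42", "2025-01-10T12:45:00Z", "pay-001")
result = json.loads(cache["order:42"])
```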

5) Choosing Between Redis and Other NoSQL Databases

Redis is an excellent choice for fast, temporary, and short-lived state.
However, if:

  • More complex queries are required
  • Data volume is large
  • Or state must be retained for long periods

Databases such as MongoDB, DynamoDB, or Cassandra may be a better fit.

In many large-scale systems, a combination of these approaches is used:

  • Redis for active process state
  • A primary database or durable NoSQL store for final persistence and historical data

A Simple Example of State Implementation in Redis

Key: order:{orderId}

Value:
{
  "status": "RESERVED",
  "flightId": "IR-452",
  "price": 3200000,
  "version": 3,
  "idempotencyKey": "a8f1-9c23",
  "updatedAt": "2025-01-10T12:45:00Z"
}

TTL: 15 minutes
  • version is used for Optimistic Locking
  • Each update is only valid if the version has not changed
  • idempotencyKey prevents duplicate execution of a step
  • In practice, version checks, idempotency validation, and state updates must be performed as a fully atomic operation. Redis transactions or Lua scripts can be used to avoid partial or inconsistent states.
  • In large distributed systems, a Correlation ID is often stored alongside the state to enable end-to-end request tracing and simpler debugging in failure scenarios.
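The bullet points above can be combined into one sketch: version check, idempotency validation, and the state update performed as a single indivisible step. The function below mimics what one server-side Redis Lua script would do; keeping only the latest idempotency key is a simplification, and all names are illustrative.

```python
import json

cache = {"order:42": json.dumps({
    "status": "RESERVED", "version": 3, "idempotencyKey": "a8f1-9c23"})}

def atomic_update(key, expected_version, idempotency_key, changes):
    # In production this whole function would run as one Lua script,
    # so no other client can interleave between the checks and the write.
    state = json.loads(cache[key])
    if state["version"] != expected_version:
        return "version-conflict"          # optimistic-locking check failed
    if state["idempotencyKey"] == idempotency_key:
        return "duplicate"                 # this step was already applied
    state.update(changes)
    state["version"] += 1
    state["idempotencyKey"] = idempotency_key
    cache[key] = json.dumps(state)
    return "applied"

r1 = atomic_update("order:42", 3, "b2c4-7d81", {"status": "PAID"})
r2 = atomic_update("order:42", 4, "b2c4-7d81", {"status": "PAID"})  # retry
r3 = atomic_update("order:42", 3, "c9e0-1f55", {"status": "CANCELLED"})
```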

👉 Ultimately, choosing Redis or NoSQL for implementing stateful services is not just a technical decision.
It requires careful consideration of factors such as TTL, idempotency, failure handling, and atomicity.
Addressing these concerns ensures that a solution that looks correct on paper is also reliable, scalable, and production-ready in practice.