Over the past four years, I have worked as a QA Specialist with teams transitioning from monolithic systems to distributed microservice architectures. After years of working with monolithic systems, typically riddled with coupling and testability issues, I longed for an alternative. Like many engineers, I started this journey in pursuit of the testability, continuous delivery, and scalability advertised by microservices proponents.
Through first-hand experience, I have learnt that a successful implementation of this architecture can live up to these expectations, but not before presenting its own set of unique challenges. In this series of blog posts I hope to explore some of the most notable challenges and the lessons I have learnt from overcoming them.
I believe that being aware of these challenges in the early stages of a transition to microservices is of great value to those embarking on this journey. It is worth mentioning that each challenge can be overcome in different ways. In these blog posts I will be sharing lessons learnt from my personal experience: what worked and what did not work for the teams I have worked with. Think of the solutions highlighted here as inspiration for solving the same or similar problems. If they resonate with your situation, try them out and build upon them.
Given the broad nature of the topic, each blog post will focus on a specific challenge. The challenge I chose to explore in this post is the complexity of interactions between services. While interactions between services also existed in monolithic architectures, microservices greatly amplify the number of interactions that must be managed.
Consumers and Providers
When speaking about service interactions, regardless of the communication method chosen, the involved parties can always be classified as either Consumers or Providers.
In the case of request-response interactions such as REST, the Consumer is the service sending the request whilst the Provider is the service being called, returning a response to the Consumer. The Consumer then parses the received response to extract the relevant information. On the other hand, when dealing with message-based systems, the Provider publishes a message that is in turn consumed and parsed by one or more Consumers.
Interactions are only successful if both the Consumer and the Provider adhere to an established contract. The contract encapsulates any information that must be consistent between the Consumer and the Provider for an interaction to be successful. A central element of any contract is the structure of data being transmitted.
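To make the idea of a contract concrete, here is a minimal sketch in Python. The service, field names, and types are hypothetical; the point is that the contract is the overlap between what the Provider promises and what the Consumer relies on, and that extra fields are not part of it.

```python
# A minimal illustration of a contract between a Consumer and a Provider.
# All names and fields are hypothetical.

# What the Provider returns for a hypothetical GET /orders/{id}:
provider_response = {
    "id": "order-123",
    "status": "SHIPPED",
    "total": 49.99,
    "currency": "EUR",
}

# The Consumer's side of the contract: the fields it actually depends on,
# with the types it expects.
consumer_expectations = {"id": str, "status": str, "total": float}

def satisfies_contract(response: dict, expectations: dict) -> bool:
    """True if every field the Consumer relies on is present with the
    expected type. Fields the Consumer does not use are ignored."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in expectations.items()
    )

print(satisfies_contract(provider_response, consumer_expectations))  # True
```

Note that the Provider's extra `currency` field does not affect the check: the contract only covers data both parties care about.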
As the number of services grows, the number of potential service interactions grows much faster: with n services there can be up to n(n-1) directed interactions. The sheer number of interactions makes it increasingly difficult to keep track of them and to ensure that a change in a Provider does not break any of its Consumers.
Now that we have identified the challenge, the rest of this blog post focuses on a number of practices, principles, and techniques that can be used to overcome it.
Early on, we found out the hard way that breaking changes in contracts should be avoided like the plague. A breaking change in a contract is any change in the Provider’s API that causes one or more of its Consumers to fail. Common breaking changes include renaming or removing fields in the communicated data, changing HTTP status codes in responses, and changing the topic name used to publish messages.
Contracts should instead be kept backwards compatible. For example, replacing separate name and surname fields with a single full name field will break Consumers that rely on the removed fields. To maintain backwards compatibility, the Provider should simply add the full name field without removing any of the existing ones.
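The replace-versus-add distinction can be sketched in a few lines. The field names are hypothetical, and the "Consumer" is reduced to a function that reads the payload:

```python
# Breaking change: the separate fields are replaced, removing data
# that existing Consumers rely on.
breaking_response = {"full_name": "Ada Lovelace"}

# Backwards-compatible change: the new field is added alongside the
# existing ones, so old Consumers keep working unchanged.
compatible_response = {
    "name": "Ada",
    "surname": "Lovelace",
    "full_name": "Ada Lovelace",
}

def legacy_consumer(payload: dict) -> str:
    """An existing Consumer that still reads the separate fields."""
    return f"{payload['name']} {payload['surname']}"

print(legacy_consumer(compatible_response))  # Ada Lovelace
# legacy_consumer(breaking_response) would raise KeyError: 'name'
```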
To minimize breaking changes, we design our services to follow Postel’s Law, which states: be conservative in what you do, be liberal in what you accept from others. A concrete application of this principle is that Consumers validate only the subset of the data provided by the Provider that they require to operate. Providers, on the other hand, aim to adhere to the specification and be as consistent as possible; for example, they stick to standards in the metadata they provide and the letter casing they use. My experience has taught me that this approach goes a long way in preventing breaking changes that could easily be avoided.
On a practical level, there will be cases that demand a breaking change from the Provider to fulfill the expectations of a new Consumer. To meet this requirement without breaking existing Consumers, API Versioning is required. Introducing API Versioning at the start of a project ensures that such scenarios are catered for. There are different strategies and tools through which API Versioning can be implemented. For REST APIs, versioning is typically done by adding a reference to the version number in the URI, in custom headers, or in custom media types. This can be done either at the service level or at the resource level for better granularity. Choosing among these strategies involves trade-offs between different qualities of the system; I recommend identifying these trade-offs and discussing which qualities are most desirable within your context.

For message-based APIs, versioning is a bit more complex. A common approach is to publish messages with the new version to a new topic, so that each Consumer reads from the topic whose data is structured in the format it accepts. Another option is to use a data serialization system that supports schema evolution, such as Apache Avro. This, however, requires all Consumers to commit to that serialization system, which can introduce a considerable level of complexity.
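As an illustration of URI-based versioning, here is a library-free sketch in which two versions of a hypothetical customers resource are served side by side, so the breaking v2 shape can ship without touching v1 Consumers. The paths and handlers are invented for the example:

```python
# URI-based API versioning, reduced to a routing table.
# Paths and response shapes are hypothetical.

def get_customer_v1(customer_id: str) -> dict:
    """Original contract: separate name fields."""
    return {"name": "Ada", "surname": "Lovelace"}

def get_customer_v2(customer_id: str) -> dict:
    """New contract for new Consumers: a single full_name field."""
    return {"full_name": "Ada Lovelace"}

# Both versions remain routable, so existing Consumers keep calling /v1.
ROUTES = {
    "/v1/customers": get_customer_v1,
    "/v2/customers": get_customer_v2,
}

def dispatch(path: str, customer_id: str) -> dict:
    return ROUTES[path](customer_id)

print(dispatch("/v1/customers", "42"))  # {'name': 'Ada', 'surname': 'Lovelace'}
print(dispatch("/v2/customers", "42"))  # {'full_name': 'Ada Lovelace'}
```

In a real service the routing table is your web framework's router, but the trade-off is the same: every entry kept in the table is a version the Provider must continue to maintain.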
API Versioning enables us to manage changes in contracts without breaking Consumers. Yet it introduces overhead for Providers, which must maintain multiple functioning API versions.
Contract Testing to the rescue
Our experience has shown that not all potentially breaking API changes, such as removing a field, actually affect a Consumer. However, when the team working on the Provider lacks visibility into whether a change would break any Consumer, they typically resort to creating a new API version using one of the aforementioned strategies. The lack of visibility into which Consumers use which contract version also makes it increasingly difficult to safely deprecate older API versions. Upon identifying this issue, it became clear that we needed an automated way to verify whether a change in a contract would break any of the existing Consumers. This led us to introduce a new form of testing, Consumer-Driven Contract testing (CDC testing), using a tool called Pact.
In CDC testing, each Consumer defines the expectations it has of its Providers. These expectations are expressed as executable contracts called Pacts and cover only the subset of the data offered by the Provider that the Consumer actually makes use of. Whenever a Consumer’s expectations change, a new version of the contract is published to a central repository called the Pact Broker. Every time a Provider is updated, the Pacts of all its Consumers are retrieved from the Broker and executed to ensure that the new version of the Provider still meets those expectations. By integrating both of these processes into our CI pipeline, we are able to make changes to Providers with more confidence: whenever a change in a Provider would break a Consumer, the pipeline alerts us immediately and fails the build before the new version of the service is ever deployed. At that point, the team owning the Provider coordinates with the teams responsible for the affected Consumers to ensure they are updated.
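Pact’s real workflow involves a mock service on the Consumer side and provider verification against the Broker, which I won’t reproduce here. As a library-free illustration of the verification step only, the sketch below treats each Pact as a recorded request plus the response fields that Consumer relies on, and replays all of them against the Provider. Everything here (service names, fields, the broker payload shape) is simplified and hypothetical, not the actual Pact API:

```python
# Simplified illustration of Consumer-Driven Contract verification
# (NOT the real Pact API). Each pact records the request a Consumer
# makes and the response fields that Consumer depends on.

# A simplified stand-in for what a Pact Broker might hand back:
pacts = [
    {"consumer": "billing", "request": {"path": "/orders/1"},
     "expected_fields": {"id", "total"}},
    {"consumer": "shipping", "request": {"path": "/orders/1"},
     "expected_fields": {"id", "status"}},
]

def provider_handler(path: str) -> dict:
    """The new version of the Provider, under verification."""
    return {"id": "1", "status": "SHIPPED", "total": 49.99}

def verify(pacts: list, handler) -> list:
    """Replay every Consumer's pact against the Provider and collect
    any expectations it no longer satisfies."""
    failures = []
    for pact in pacts:
        response = handler(pact["request"]["path"])
        missing = pact["expected_fields"] - response.keys()
        if missing:
            failures.append((pact["consumer"], missing))
    return failures  # an empty list means no Consumer is broken

print(verify(pacts, provider_handler))  # []
```

In CI, a non-empty result would fail the Provider’s build, which is exactly the early warning described above: the break surfaces before deployment, naming the affected Consumer.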
CDC testing has the added benefit that, when a contract change is made unintentionally due to some oversight, we are notified immediately and the change can easily be reverted.
Above all, CDC testing gives us the flexibility to make changes to our contracts that are traditionally considered breaking, as long as they do not break any existing Consumers. It is worth noting that this is only possible if all Consumers of a Provider are known and publish their expectations as Pacts, which is typically achievable for inter-service communication within a single system.
By consistently applying these techniques we have managed to arrive at a state in which service interactions are manageable. While we are able to confidently make changes to our contracts without breaking Consumers, we are aware that a risk is still present. Quoting John Allspaw:
While any increase in confidence in the system’s resilience is positive, it’s still just that: an increase, not a completion of perfect confidence. Any complex system can (and will) fail in surprising ways.
When such unexpected failures do happen, we strive to measure their impact and identify their root cause. This makes it possible for us to continue iterating on our processes and tooling without over-compensating for said risk.