In large, complex systems, we typically integrate with several components, often from other teams. What happens when something goes wrong? It may be an incident, or a problem while building an integration caused by a misconfigured downstream dependency. One pattern I've seen and encouraged is 'find whose fault it is ASAP', e.g. by adding heartbeats or monitoring on dependencies. This isn't to blame others, but to resolve issues quickly; the faster we can find the root cause, the faster we can fix it.
However, a common anti-pattern can arise in this setup, where engineers explain problems to their users by saying 'it's not our service, it's this other service we depend on…', and leave it at that. Why is this a problem? Let's say, hypothetically, that I am a Microsoft Office user and Office stops working. The root cause may be some infrastructure the Office team uses (e.g. maybe an Azure issue). Still, as an Office user, I expect to be able to contact Office support (or check the Office status website) for an explanation and updates on when service will be restored; I do not expect Office support to say 'oh it's an Azure problem, go raise a ticket with that team'.
Luckily, this is typically an intra-company problem between internal teams (though I have been passed between departments when trying to fix my mobile phone service...). Nevertheless, being told, as a service consumer, 'Oh, it's our dependency that's at fault, go and talk to them' is a terrible (internal) user experience. If I am consuming a service, I should not need to understand or trace through a network of service dependencies to resolve issues, as this defeats the purpose of microservices, where teams can focus on their domains.
The answer is simply 'taking ownership of the problem'. The root cause may not be in your system, but your direct users have a problem, so it is now your problem to solve. But what does this mean? It is equally an anti-pattern for you to dig into another team's codebase to find and fix issues.
Being a concierge
The simplest solution is to adopt the attitude of a concierge. A concierge will take your request and follow up on your behalf. In this case, the concierge should chase down the team that owns the direct dependency at fault and make sure that team understands the severity of the problem, so that it responds with the appropriate urgency. Concierges are also responsible for reporting back to their users with regular updates. This model applies both to incidents and to consumers building integrations with your service; it should be up to you as a service owner to ensure downstream dependencies are set up appropriately.
Open and transparent communications
In more complex systems with deep graphs of dependencies, having several layers of concierge is inefficient, especially when addressing problems during incidents. A better approach is open and transparent communication across the organisation. As a concierge, you should still initiate requests on behalf of your users, e.g. by sharing the impact on your users in incident channels, and then point your users to the direct channels they can choose to follow (e.g. Slack channels, Jira tickets). It is still your responsibility to keep your users updated, but you can likely send your own updates less often.
Architecting to control dependencies
You may wonder, "Isn't a deep graph of dependencies a problem?" Yes, it is! The one thing I remember from my university Systems Theory course is that, mathematically, the more sequential dependencies we have, the lower our overall system reliability. Unfortunately, with microservices in large organisations, we are incentivised both to leverage as much as possible from others, as it lets us focus on a smaller domain (read: fewer things to build and maintain), and to maximise our own consumers (read: max return on investment).
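To put rough numbers on that reliability claim, here is a minimal sketch. It assumes every dependency in the chain must be up for a request to succeed and that failures are independent, which real systems only approximate; the function names and the 99.9% figure are illustrative, not taken from any particular system.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the overall availability is the
    product of the individual availabilities.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability, hours_per_year=24 * 365):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * hours_per_year


# Hypothetical chain: five sequential dependencies, each 99.9% available.
chain = [0.999] * 5
overall = series_availability(chain)

print(f"single service: {downtime_hours_per_year(0.999):.1f} h/year down")
print(f"chain of five:  {overall:.4%} available, "
      f"{downtime_hours_per_year(overall):.1f} h/year down")
```

Under these assumptions, five 'three nines' services in sequence deliver only about 99.5% overall, i.e. roughly five times the expected downtime of any one of them.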
Instead, we must be more selective when introducing dependencies for our system:
Delegate to dependencies only if that functionality is truly generic and not specific to your domain. The urge to save effort by using every available capability from other teams has to be weighed against whether we need customisation, in which case we should build it ourselves. Generally speaking, it is better to have complete control while exploring 'product market fit' for your system. Infrastructure primitives are usually a no-brainer to leverage, but components further up the stack should be treated with a more sceptical eye.
Choose dependencies that are at least as reliable as your system needs to be. While it is possible to work around less reliable dependencies (e.g. with fallback logic or caching), doing so adds integration cost and complexity, reducing the value of introducing the dependency.
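To give a feel for the extra code that a workaround like fallback logic and caching brings, here is a minimal sketch of a wrapper that serves the last known good value when a dependency call fails. Everything in it (the `CachedFallback` class, the `fetch` callable, the staleness limit) is hypothetical and only illustrates the idea; a production version would also need cache eviction, metrics, and a decision about which errors are safe to mask.

```python
import time


class CachedFallback:
    """Wraps a call to a dependency and falls back to the last good value.

    `fetch` is any callable taking a key and returning a value; it stands
    in for a call to a less reliable downstream service.
    """

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, time the value was stored)

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is None:
                raise  # nothing to fall back to
            value, stored_at = cached
            if self._clock() - stored_at > self._max_stale:
                raise  # cached value is too old to serve
            return value
        # Successful call: remember the value for future fallbacks.
        self._cache[key] = (value, self._clock())
        return value
```

Even this small version forces choices the dependency was supposed to spare us: how stale is acceptable, and what to do when there is nothing cached at all.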
Summary: Your users' problems are your problems
We are all busy people, so any opportunity to delegate to others, including our users, can be tempting. Also, when things go wrong, the natural human response is to deflect blame, either by offering reasons (or excuses) or by pointing at other people. These behaviours don't serve our users, internal or external. Instead, your users' problems are yours, even if you are not at fault. This post is about taking ownership of problems in a scalable way. Your users will appreciate it!