What software engineers can learn from IPv4 to IPv6 migration
Building a v2 system isn't always worth it.
Engineers love to tinker with and improve systems. We accept that the first version of a system won't be perfect, and then we want to create a second system before the first one becomes unmaintainable. We're obviously not silly enough to just rewrite a system. Instead, we try to re-model the system's domain based on what we have learned and on new requirements. We'll change the interfaces and core concepts to be more extensible, applying some Domain Driven Design techniques we've read about. This second version isn't just a rewrite; it is a fundamentally different (and better) system!
How often have we seen the new system struggle to meet expectations? After the initial hype, adoption by internal consumers or end-users falters, so much so that we never replace the original system, leaving us maintaining two systems for a long time. In the worst case, the new system is the one we throw away.
I recently stumbled across the article "IPv6 may already be irrelevant argues APNIC chief scientist", which pointed me to an extensive blog post on the IPv4 to IPv6 migration by Geoff Huston, the chief scientist at APNIC, the authority responsible for managing IP addresses in the Asia Pacific region. The challenges the Internet has faced seem eerily familiar to what I've seen with system rewrites. In this post, I share some lessons from the article and how they relate to software engineering in general.
Summary of the IPv4 → IPv6 story
Introduced in the early 1980s, IPv4 is the de facto internetworking layer protocol, responsible for globally addressing devices and routing traffic between them.
The growth of the Internet was not guaranteed, and devices were limited in their network, memory, and computing capacity. IP addresses were 32-bit numbers, giving a theoretical maximum of ~4.3 billion addresses.
It became apparent in the early 90s that we would run out of IP(v4) addresses. A plan was created consisting of three phases:
Classless Interdomain Routing (CIDR) - The original specification grouped addresses into five classes with different prefix lengths, and organisations were allocated a prefix. The class structure meant there were many unusable addresses. For example, Class A prefixes had ~16.7 million addresses each, far more than any organisation would likely need, so most of those addresses were wasted. CIDR was a short-term step that removed IP address classes to optimise the use of the existing address space (the sketch after this list illustrates the arithmetic).
Network Address Translation (NAT) - a medium-term 'hack' where a router would translate non-globally unique IP addresses used by 'internal' devices to public ones accessible on the Internet. This effectively traded off end-to-end device addressability (a goal of the Internet stack) for significantly more runway to get to a long-term solution.
IPv6 - the long-term solution with a 128-bit address space, but also addressing many of the pitfalls of IPv4, including:
A simplified header to reduce the computing power needed to process and route traffic (for those so inclined, the visual difference between IPv4 and IPv6 headers is stark).
In-built support for security (encryption and signing), instead of relying on an add-on protocol suite (IPSec) as IPv4 did.
Improved 'quality of service' support for the growing volume of streaming media, with an explicit 20-bit flow label instead of the handful of type-of-service bits IPv4 used to denote important packets.
Generally improved extensibility via chained extension headers.
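To make the address-space arithmetic and the CIDR fix concrete, here is a small sketch using Python's standard-library ipaddress module (the prefixes are illustrative, not real allocations):

```python
import ipaddress

# IPv4 addresses are 32 bits; IPv6 addresses are 128 bits.
print(2 ** 32)   # 4294967296 -- the ~4.3 billion IPv4 ceiling
print(2 ** 128)  # ~3.4e38 -- effectively inexhaustible

# Classful addressing: a Class A allocation was a fixed /8 prefix.
class_a = ipaddress.ip_network("10.0.0.0/8")
print(class_a.num_addresses)  # 16777216 -- far more than one organisation needs

# CIDR: the same organisation can get a right-sized prefix instead,
# e.g. a /22 if it only needs ~1000 addresses.
cidr_block = ipaddress.ip_network("10.0.0.0/22")
print(cidr_block.num_addresses)  # 1024

# NAT (the medium-term 'hack') is, at heart, a translation table that
# maps (private address, port) pairs onto a single public address.
nat_table = {("192.168.1.10", 51000): ("203.0.113.5", 61000)}
```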
One stat from Huston's post shocked me: more than 25 years after IPv6 was standardised, and years after we officially ran out of IPv4 addresses, still only ~50% of Internet traffic in the US uses IPv6!
Lesson 1: Quality of benefits over quantity
One theme I see in Huston's post is whether end users cared enough about the benefits of IPv6 to justify the migration effort. I see two aspects to this:
Primary vs secondary benefits - The primary benefit is the main selling point of the new system and must be big enough compared to the existing setup for users to justify moving to it. Secondary benefits are bonuses; stocking up on secondary benefits will not overcome a weak primary benefit. The primary benefit of IPv6 was solving the address space exhaustion problem. Unfortunately, there were already short- and medium-term solutions to this problem. IPv6 had plenty of secondary benefits relating to effectively fixing papercuts in IPv4, but they do not carry the same weight, even if there are many.
Who are the actual beneficiaries? - Expanding the number of available addresses clearly benefits the general public who pay for Internet services across our increasing number of connected devices. This concept can be easily explained and quantified even to laypeople. The other benefits of IPv6 are primarily valued by engineers operating in the Internetworking layer of the stack, far below where end users interact with the technology. This means the people paying for the migration don't care about these added benefits.
When I look back at other system rewrite or 'new system' projects that have struggled, there are clear parallels and, hence, smells to avoid:
Trying to justify projects by adding 'more benefits' to the pitch. In addition to solving X, we'll also solve Y and Z. We can (incorrectly) think that our design is better because it solves more problems, even if it requires more build and/or migration effort.
Explaining the benefits in terms of what the implementing team cares about, not the (end-)user or the person holding the purse strings.
These anti-patterns can lead to unnecessary projects being presented to leadership, important projects not getting off the ground, project teams prioritising the wrong problem, and/or successful projects not showing value to others even if they met their goals.
Solving many problems at once is a Good Thing, but we can't lose sight of the primary benefit. Focus on a concise and quantifiable goal targeting an end-user problem.
Lesson 2: "Agile" plans and acceptable technical debt
It was interesting to see that even a large, distributed organisation (the IETF) devised an incremental plan to solve the address exhaustion problem. We have also stopped at an interim point in the overall plan, because we've effectively solved the main problem of address exhaustion even though we have yet to reach the end. This sounds 'agile': incrementally solving problems and doing just enough to justify the investment, which should be a Good Thing.
It is difficult for engineers who have developed a long-term plan and 'north star' to accept that we may never reach the ideal state. That leftover work will forever be 'technical debt'. Most successful rewrites result in new and old systems operating concurrently for a long time. Organisations will likely need IPv4/IPv6 dual stacks for the foreseeable future, resulting in long-lived "keep the lights on" work (KTLO) to maintain both.
How do we learn to accept that we will land where we land on a long-term plan? Huston's post talks about pragmatically defining the end of the IPv6 transition as "IPv4 is no longer necessary" rather than completely eliminating IPv4. We should take this approach of defining pragmatic endpoints for our long-term projects. At each logical phase of a project, we should ask, "Have we delivered the primary benefit?" and "What is the cost to users and the owning team if we stop here?".
In particular, who is on the hook to maintain multiple system versions? Is it the user needing multiple integrations or the owning team maintaining numerous independent systems? I would argue that the former should never be acceptable. Instead, we must push harder until we get to a point where the burden is no longer on the user but on the owning team (e.g. all users are either fully on the old or new APIs and never a mix, with some translation layers maintained by the owning team).
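As a minimal sketch of what such an owning-team translation layer can look like (all names here are hypothetical, not from Huston's post): the old interface stays alive as a thin facade over the new system, so no user ever has to integrate with both at once.

```python
class NewBillingService:
    """The v2 system, with its reshaped domain model (hypothetical)."""

    def charge(self, account_id: str, amount_cents: int) -> str:
        # The real v2 implementation would live here.
        return f"charged {amount_cents} cents to {account_id}"


class LegacyBillingFacade:
    """Presents the old v1 interface, translating every call to v2.

    Maintained by the owning team, not by users: v1 callers keep their
    existing integration until they are migrated wholesale, and are
    never forced to straddle both APIs at once.
    """

    def __init__(self, new_service: NewBillingService) -> None:
        self._new = new_service

    def bill_customer(self, customer_number: int, dollars: float) -> str:
        # Translate v1 concepts (customer numbers, dollar floats)
        # into v2 concepts (account ids, integer cents).
        account_id = f"acct-{customer_number}"
        amount_cents = round(dollars * 100)
        return self._new.charge(account_id, amount_cents)


facade = LegacyBillingFacade(NewBillingService())
print(facade.bill_customer(42, 19.99))  # a v1 call, served by v2
```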
One interesting lesson from the IPv6 story is that we should define, at the start of a project, an acceptable end state framed in terms of the primary user problem.
Lesson 3: Defining principles in terms of user benefits
When designing a new, refined domain model, we typically build a sense of what makes our new model 'more elegant' than the old one. Maybe the domain is more decomposed, allowing new concepts to be modelled from core primitives. Perhaps we have explicitly defined some principles that we will follow to ensure the system can be maintained in the long term. We can use this to justify why our new version is necessary. But are we just tricking ourselves?
I found it interesting that, in the IPv6 story, NAT was considered a 'hack'. But why is it a hack? NATs violate a core principle of end-to-end addressability in internetworking protocols, but do end users care? In fact, with VPNs, users are actively trying to break the end-to-end nature of the Internet. And we can always add layers on top to restore end-to-end guarantees (e.g. TLS).
My takeaway is that we must be willing to challenge our principles and definitions of elegance. This is not to say our principles are wrong, but they should be based on what the end user cares about, and we should recognise that what users care about changes over time.
Lesson 4: Existing systems are more extensible than we realise
One big reason for re-architecting a system is to enable future extensibility. When we decompose a domain (and/or a monolith), in theory, we have smaller components that we can recombine in new ways to gain extensibility for new use cases. We also introduce new interfaces to make it easier to develop new components in future.
IPv6 is designed with extensibility in mind. Features like IPSec or Quality of Service integrate nicely into IPv6 rather than being bolted on. However, IPv4 was extended to support these features; IPSec headers were added as 'inner headers', and improved quality of service was supported by having more network bandwidth and end-to-end flow labelling at higher layers. While it is certainly less elegant than in IPv6, it works!
Why was this possible? Because the internetworking stack was already extensible! The layers were already present, and the protocols already had well-defined boundaries; indeed, IPv6 was designed such that upper layers generally wouldn't notice that IPv4 had been replaced.
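Here is a toy illustration of that layering argument (hypothetical structures, not real packet formats): each layer sees only the payload handed to it, so a new header can be slotted in, or the network layer swapped out entirely, without the layers above noticing.

```python
def tcp_segment(payload: bytes) -> dict:
    return {"layer": "tcp", "payload": payload}

def ipsec_wrapped(inner: dict) -> dict:
    # Security bolted onto IPv4 as an 'inner header': wrap the existing
    # transport segment rather than redesigning the network layer.
    return {"layer": "ipsec", "payload": inner}

def ipv4_packet(inner: dict) -> dict:
    return {"layer": "ipv4", "payload": inner}

def ipv6_packet(inner: dict) -> dict:
    return {"layer": "ipv6", "payload": inner}

segment = tcp_segment(b"GET / HTTP/1.1")

# Extending the old system: IPSec slotted in beneath IPv4's header...
extended_v4 = ipv4_packet(ipsec_wrapped(segment))
# ...or replacing the network layer entirely: the transport layer's
# output is identical either way.
native_v6 = ipv6_packet(segment)
```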
There is value in adding interfaces to existing systems to enable some extensibility through adding more layers on top, but complete re-architecture and re-modelling are rarely necessary.
Lesson 5: Upfront design for a migration
The IPv6 story shows how reluctant users are to migrate their systems, even with an impending emergency. Sure, the short- and medium-term solutions required some migration effort to upgrade and configure routers across the Internet, but IPv6 adoption was a significant step up. In addition to learning IPv6 (and its now complicated-looking hexadecimal address notation), there was a raft of replacement protocols to go along with it. Organisations had to run both IPv4 and IPv6 as a 'dual stack' during the migration period, and they were also dependent on everyone else on the Internet adopting IPv6 (although 'tunnelling' techniques were introduced to bridge the gap).
Successful migrations I've seen or been involved with have designed the migration's user experience from the start. This includes:
Identifying cohorts of users and designing experiences for each cohort. What motivates each cohort to migrate, and what changes must they consider? For example, if we are migrating end-user accounts between systems, we can divide users into enterprise and individual cohorts; enterprise users have admins who may see manageability improvements as a major benefit, and hence may be easier to start with. Enterprise users can be broken down further by single sign-on vs password-based logins. Understanding each cohort may identify additional requirements for our systems and models. It will also help quantify the required migration effort and challenge whether the migration is worthwhile. (A sketch of cohort-based planning follows this list.)
Designing for a forced migration. At some point, you'll hit users who won't migrate themselves. A strategy for forcing migrations gives us the best chance of removing a legacy system. There are two cohorts to consider:
Those who don't see benefits and hence don't want to spend effort - This needs some mandate to force the migration. For company-internal migrations, this may be a company-wide mandate from the top. For end-user migrations, mandates may come from regulations or to address security or privacy risks.
Those who are no longer active users - Are you going to offboard these users or migrate their data so they can come back later? Or is it cheap enough to leave the old system running for these users?
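To make the cohorting idea concrete, here is a minimal sketch of cohort-driven migration planning, assuming a hypothetical user record with the fields shown (none of this comes from the IPv6 story itself):

```python
from dataclasses import dataclass

@dataclass
class User:
    is_enterprise: bool
    uses_sso: bool
    active: bool

def cohort(user: User) -> str:
    """Assign each user to a migration cohort."""
    if not user.active:
        return "inactive"             # offboard, archive, or leave on v1?
    if user.is_enterprise and user.uses_sso:
        return "enterprise-sso"       # admins value manageability; start here
    if user.is_enterprise:
        return "enterprise-password"  # extra credential-migration work
    return "individual"               # weakest motivation; plan a mandate

# Each cohort gets an explicit strategy, including the forced-migration
# backstop for users who see no benefit themselves.
strategy = {
    "enterprise-sso": "self-serve migration with admin tooling",
    "enterprise-password": "assisted migration with credential reset",
    "individual": "deadline plus mandate (security/privacy rationale)",
    "inactive": "archive data; migrate lazily if they return",
}

print(strategy[cohort(User(is_enterprise=True, uses_sso=True, active=True))])
```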
To be clear, we may not build all migration experiences upfront (it is totally valid to start with net-new users first). Still, we need to have thought enough about how we will migrate all users to (a) justify that we should even start and (b) better understand at what stage of a long project we should be willing to stop.
Summary - Think hard about a significant re-architecture project
I found Huston's post fascinating, both out of nostalgia (it reminded me of my days as a network engineer) and because of the parallels to so many projects that I have seen struggle (why we need to learn these lessons ourselves rather than from others' mistakes is perhaps a topic for another post). My big takeaway is that it all comes down to truly understanding and focussing on the primary user problem we're trying to solve. Once we lose sight of that, our natural engineering tinkering instincts can take over, and we start seeking unnecessary perfection (and become disillusioned when we don't reach it).