“Do not go gentle into that good night
Old age should burn and rage at close of day
Rage, rage against the dying of the light”
Dylan Thomas wrote these words as he contemplated his father’s last days and hours, urging him to fight for just a little more life, even as death came ever nearer. The poem from which these lines come is partly a celebration of the tenacity of the human spirit, continuing to rage and strive for life, even at its end.
I wonder what he would have made of today’s n-tier application availability. Our critical applications, upon which our businesses absolutely depend. Why do they fail so readily? Why are they so fragile? Why don’t they grasp life more firmly?
The human organism has evolved over millions of years to withstand, and adapt to, sustained environmental bombardment, both physical and emotional. The n-tier application has, of course, not been granted this period of maturation. But we have made matters worse by disrupting its short evolution and, as a result, we’ve unintentionally birthed ever more elaborate, sensitive and refined creatures that effectively have no immune system, and no heart for the fight.
In the world of compromise in which we must operate, we have, wittingly or not, emphasised features over architectural robustness, scale over resiliency, flexibility over availability. In evolutionary terms, we’ve broken the control feedback between life experience and adaptation through federated IT models that put the teams that run applications far from the people that write them. We’ve used the data that tells us how these applications behave to inform us of their recent death, not to give us insight into how to engineer them to live.
We continuously monitor application infrastructures much as the prison officer continuously observes the prisoner on suicide watch. But in both cases the results are often discouraging.
Until we engineer our applications to fight for life, to rage against failure, we’ll continue to be disappointed by their performance and availability. But today, most development teams are ambivalent about the longevity of their applications, and don’t really embrace resiliency and recovery design at all.
Here’s an example. Most critical financial services applications are deployed with some kind of “fault-tolerant” parallel infrastructure. So how come they’re almost never tolerant to faults? It’s because the designers and developers knew they had to tick a box for fault tolerance, so they specified a standby infrastructure, but they didn’t sweat out the details of how applications really fail or degrade or how operations teams have to react to these situations.
In real life, operations teams don’t switch over an entire 50-server infrastructure to a fallback site just because something somewhere in the primary infrastructure is running slowly. It’s too scary a proposition, and, until they know the cause of the slowdown, who’s to say that the same problem won’t recur within the fallback infrastructure too?
So instead, while they attempt to puzzle out the reason for slowdown, the application limps or falls, the fault tolerant infrastructure remaining idle.
If we want applications to be genuinely resilient, we need to engineer them at a finer granularity than this, to have paranoia enough for effective self-monitoring, to have intelligence and resources enough to divert themselves from failure as soon as it is starts to emerge.
The reason that this rarely gets done is because we regard that task as a theoretical computing science problem, almost in the realms of artificial intelligence. This makes it seem too difficult. We should instead treat it as a practical engineering challenge, which exploits the mass of data on past failures and implements practical mitigations to automatically deal with them in the later applications that we build.
IT Analytics provides an essential guide here, in converting raw data about past degradations and failures into information and insight about how those failures developed within the distributed application.
We should provide this information to our development teams and task them with building practical recovery scenarios for the common failures and degradations of the past. What failed in previous incarnations will fail again in later versions, or in new applications, in much the same way as it did before. In the meantime, for those prior applications, can we only will them to stay alive for just a little longer?
And you my father, there on the sad height
Do not go gentle into that good night
Rage, rage against the dying of the light