The first Queen Elizabeth bathed once every two weeks. Her clothing contained contraptions to collect head lice as they dropped from beneath her wig. Outside in the city, residents routinely emptied human waste from second floor windows onto the streets below, past the abundant rats and on into the town’s water supplies, where diphtheria, cholera and typhus flourished.
Doctors responded to fever by bleeding patients, to Black Death by applying poultices made of dried toad, to mental illness by repeated infusions of lamb’s blood until the patient died, and worst of all, to ear infections by placing a small onion in the sufferer’s ear.
The routine use of torture and mutilation-based punishments for minor crimes was at its highest level in the whole of recorded British history. For more serious offences, braying crowds gathered in town squares to watch dismemberment performed by amputation saw, which was more effective in administering pain than the more “humane” axe.
This was normality. All of this was normal. They called it the Golden Age.
__
People are pretty good at detecting changes in their lives, mostly regarding them with suspicion or fear. We generally prefer what we’re used to: normality, in whichever form we were born into. We find it difficult to appraise normality objectively, or to envisage a better one. We can’t easily perceive the relationship between the ills that visit us and the environment that sends them. Normality is too big, too complicated to ponder, and we’re generally too busy living our lives within it to consider how we might personally improve it.
In an IT context, identifying emerging changes from the normal operating range of application infrastructure remains one of the most effective ways of pre-empting outages and degradations.
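As a rough illustration of what "change from the normal operating range" can mean in practice, here is a minimal sketch in Python. It assumes a simple rolling baseline and a standard-deviation threshold; the function name `deviations_from_baseline`, the `window` and `threshold` parameters and the sample CPU figures are all hypothetical, chosen only for this example.

```python
from statistics import mean, stdev


def deviations_from_baseline(samples, window=12, threshold=3.0):
    """Return the indices of samples that sit more than `threshold`
    standard deviations away from the mean of the preceding `window`
    samples. The window and threshold are illustrative assumptions."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: treat any change at all as a deviation.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical CPU utilisation readings (percent), one per interval.
cpu = [41, 43, 40, 42, 44, 41, 43, 42, 40, 41, 43, 42, 42, 41, 78, 43]
print(deviations_from_baseline(cpu))  # the spike at index 14 is flagged
```

Note what a detector like this cannot see: a server that sits at 95% all day, every day, never deviates from its own baseline and so is never flagged.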
But what about that normal operating profile? How healthy is that? And how often does a dysfunctional normality contribute to outage and degradation?
Operational IT people, when asked about a system’s potentially dysfunctional behaviour, often say things like, “it always does that on a Friday,” or “it frequently spikes for an hour or so and then drops back,” or “those servers have never been optimally load balanced, so it’s ok,” and so on.
People frequently make these statements proudly. Ironically, they feel this shows that they are “experts” in the particular application under discussion, even though they don’t know what causes these patterns.
The implication in these statements is that, because it usually happens or because it has happened before, we don’t have to worry about it, that it’s not worth investigating or improving.
This is an arbitrary and blind conclusion that certainly lays a direct path to service outage.
The best time to perform root-cause analysis is before an application has failed, systematically investigating and actioning each of these potentially dysfunctional norms in our applications before they conspire to provoke a failure.
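One way to start is simply to list the norms worth questioning. The sketch below, in Python, audits the normal profile itself rather than departures from it. It assumes two hypothetical checks: servers that habitually carry far more than the pool average, and weekdays that are habitually far busier than a typical day. The function names, the 1.5 limits and the sample data are assumptions made for illustration, not recommended values.

```python
from collections import defaultdict
from statistics import mean, median


def unbalanced_servers(load_by_server, limit=1.5):
    """Return servers whose typical load exceeds `limit` times the
    average typical load across the pool."""
    typical = {name: mean(values) for name, values in load_by_server.items()}
    pool_average = mean(list(typical.values()))
    return sorted(n for n, t in typical.items() if t > limit * pool_average)


def recurring_peaks(samples, factor=1.5):
    """Return weekdays whose average load exceeds `factor` times the
    median of all weekday averages. `samples` is (weekday, value) pairs."""
    by_day = defaultdict(list)
    for day, value in samples:
        by_day[day].append(value)
    day_means = {day: mean(values) for day, values in by_day.items()}
    typical_day = median(list(day_means.values()))
    return sorted(d for d, m in day_means.items() if m > factor * typical_day)


# Hypothetical data: "never been optimally load balanced".
pool = {"web1": [30, 32, 31], "web2": [29, 31, 30], "web3": [88, 90, 92]}
print(unbalanced_servers(pool))

# Hypothetical data: "it always does that on a Friday".
week = [("Mon", 40), ("Tue", 42), ("Wed", 41), ("Thu", 43), ("Fri", 95),
        ("Mon", 41), ("Tue", 40), ("Wed", 42), ("Thu", 44), ("Fri", 97)]
print(recurring_peaks(week))
```

Each name this returns is not an alert but a question: why is that normal, and what would it take to fix it?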
Why don’t we? Perhaps your organisation is only geared to react to outages and doesn’t have a conduit for preventative investigation and action. Perhaps it just seems too hard to span the organisational boundaries of a federated IT organisation, except during crisis, to eradicate such issues. Maybe they don’t seem as important as the outages they ultimately cause.
But I think the underlying reason is that we’re just too comfortable with, and complacent about, normality. Perhaps busy people just need to believe that normal is benign, good enough. Perhaps we too will call this period in our IT history “The Golden Age”.