For years, companies pushed the development of their IT systems out of their core business processes: the business of a retail shop was retail, not software. Reality kicked in very quickly with companies such as Amazon and Alibaba, which can partially attribute their success to keeping their IT systems at the core of the business.
A few years ago, companies used to outsource their entire IT systems, trying to push the complexity away from the main business in the same way that they outsource the maintenance of their offices. This worked for quite a long time because release cycles were long enough (a couple of releases a year) to sustain a complex chain of change management: a release was a big-bang event where everything was measured to the millimeter, with little to no tolerance for failure.
Usually, the life cycle for such projects is very similar to what is shown in the following diagram:
This model is traditionally known as waterfall (you can see its shape), and it is borrowed from traditional industrial pipelines where things happen in very well-defined order and stages. In the very beginning of the software industry, engineers tried to retrofit the practices from the traditional industry to software, which, while a good idea, has some drawbacks:
- Old problems are brought to a new field
- The advantages of software being intangible are negated
With waterfall, we have a big problem: nothing moves quickly. No matter how much effort is put into the process, it is designed for enormous software components that are released a few times a year or even once a year. If you try to apply this model to smaller software components, it is going to fail due to the number of actors involved in it. It is more than likely that the person who captures the requirements won't be involved in the development of the application and, for sure, won't know anything about the deployment.
I remember that when I was a kid, we used to play a game called the crazy phone. Someone would make up a story with plenty of details and write it down on paper. This person read the story to another person, who had to capture as much as possible and do the same to the next person, up until we reached the end of the number of people playing this game. After four people, it was almost guaranteed that the story wouldn't look anywhere close to the initial one, but there was a more worrying detail: after the first person, the story would never be the same. Details would be removed and invented, but things would surely be different.
This exact game is what we are replicating in the waterfall model: the people working on the requirements create a story that is told to the developers, who create another story that is told to QA, so that QA can test that the delivered software product matches a story that has passed through at least two pairs of hands before reaching them.
As you can see, this is bound to be a disaster. But hold on, what can we do to fix it? If we look at the traditional industry, we'll see that they never get their designs wrong or, at least, the error rate is very small. The reason for that (in my opinion) is that they are building tangible things, such as a car or a nuclear reactor, which can easily be inspected and, believe it or not, are usually simpler than a software project. If you drive a car, after a few minutes, you will be able to spot problems with the engine, but if you start using a new version of some software, it might take a few years to spot security problems or even functional problems.
In software, we tried to ease this problem by creating very concise and complex diagrams using Unified Modeling Language (UML) so that we capture the single source of truth and we can always go back to it to solve problems or validate our artifacts. Even though this is a better approach, it is not exempt from problems:
- Some details are hard to capture in diagrams
- Business stakeholders do not understand UML
- Creating diagrams requires time
In particular, the fact that the business stakeholders do not understand UML is the big problem here. After the requirements are captured, changing them or even raising questions at lower levels (development, operations, and so on) requires involving several people, and at least one of them (the business stakeholder) does not understand the language in which the requirements were captured. This wouldn't be a problem if the project requirements were spot on from the first iteration, but in how many projects have you been involved where the requirements were static? The answer is none.
Once we have made it clear that we have a communication problem, bugs are expected to arise during our process. Either a misalignment with the requirements or even the requirements being wrong usually leads to a defect that could prevent us from deploying the application to production and delay everything.
In waterfall, fixing a bug becomes increasingly difficult with every step we take. For example, fixing a bug in the requirements phase is very straightforward: just update the diagrams/documentation, and we are done. If the same bug is caught by a QA engineer in the verification phase, we need to:
- Update the documents/diagrams
- Create a new version of the application
- Deploy the new version to the QA environment
If the bug is caught in production, you can imagine how many steps are involved in fixing it, not to mention the stress, particularly if the bug compromises the revenue of your company.
A few years ago, I used to work in a company where the production rollout steps were written in a Microsoft Word document, command by command, along with an explanation of each one:
- Copy this file there:
cp a.tar b.tar
- Restart the server xyz with the command:
sudo service my-server restart
This was in addition to a long list of actions to take to release a new version. This happened because it was a fairly big company that had commoditized its IT department, and even though their business was based on an IT product, they did not embed IT in the core of their business.
As you can see, this is a very risky situation. Even though the developer who created the version and the deployment document was there, someone was deploying a new WAR (a Java web application packed in a file) in a production machine, following the instructions blindly. I remember asking one day: if this guy is executing the commands without questioning them, why don’t we just write a script that we run in production? It was too risky, they said.
They were right about it: risk is something that we want to reduce when deploying a new version of the software that is being used by some hundred thousand people on a single day. In fairness, risk is what pushed us to do the deployment at 4 A.M. instead of doing it during business hours.
The problem I see with this is that the way to mitigate the risk (deploying at 4 A.M., when no one is buying our product) creates what we call, in IT, a single point of failure: the deployment is an all-or-nothing event that is massively constrained by time, as at 8 A.M. the traffic in the app usually went from two visits per hour to thousands per minute, with around 9 A.M. being the busiest period of the day.
That said, there were two possible outcomes from the rollout: either the new software got deployed or it did not. This causes stress to the people involved, and the last thing you want is stressed people playing with the systems of a multi-million business.
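Going back to the question of scripting the rollout, the following is a minimal sketch of what such a script could have looked like. It assumes that the two commands quoted from the Word document were the whole procedure (the real list was much longer), and it reuses the placeholder names from that document (a.tar, b.tar, and my-server). The DRY_RUN switch and the run and rollout helpers are hypothetical names introduced here for illustration; they were not part of the original process.

```shell
#!/bin/sh
# Sketch of the documented rollout steps as a script.
# a.tar, b.tar and my-server are the placeholders used in the text.
set -eu

# Safe by default: commands are only printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# The two steps quoted from the Word document, in the same order.
rollout() {
  run cp a.tar b.tar                   # "Copy this file there"
  run sudo service my-server restart   # "Restart the server"
}

rollout
```

With the default settings, the script only prints the commands it would execute, so the steps can be reviewed before anyone runs them for real with DRY_RUN=0. Because of set -e, a failed copy stops the script before the server is restarted, which is precisely the kind of check that a person following instructions blindly at 4 A.M. might skip.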
Let’s take a look at the maths behind a manual deployment, such as the one from earlier:
- Remove the old version of the app (the WAR file)
- Copy the new version of the app (the WAR file)
- Update properties in configuration files
This describes the steps involved in releasing a new version of the software in a single machine. The full company system had a few machines, so the process would have to be repeated a number of times, but let's keep it simple; assume that we are only rolling out to a single server.
Now a simple question: what is the overall failure rate in the process?
We naturally tend to think that the probability of failure in a chained process, such as the preceding list of instructions, is equal to the largest failure probability of any single step in the chain: 5%. That is not true. In fact, it is a very dangerous cognitive bias: we usually make very risky decisions due to a false perception of low risk.
Let's use the math to calculate the probability of success and, from it, the probability of failure:
The preceding list is a list of dependent events. We cannot execute step number 6 if step 4 failed, so the whole process succeeds only if every step succeeds. If P(Ai) is the probability that step i succeeds and P(T) is the probability that the whole process succeeds, the formula that we are going to apply is the following one:
P(T) = P(A1) * P(A2) * … * P(An)
This leads to the following calculation:
P(T) = (99.5/100) * (99.5/100) * (98/100) * (98/100) * (95/100) * (95/100) * (99.5/100) = 0.8538
We are going to be successful only 85.38% of the time. Translated to deployments, this means that we are going to have problems 14.62% of the time, or roughly 1 out of every 7 times that we wake up at 4 A.M. to release a new version of our application. But there is a bigger problem: what if we have a bug that no one noticed during the production testing that happened just after the release? The answer to this question is simple and painful: the company would need to take down the full system to roll back to a previous version, which could lead to loss of revenue and customers.
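The arithmetic above can be double-checked with a short shell sketch. The per-step success rates are copied from the calculation in the text; the overall_success helper is a hypothetical name introduced here, and awk is used only because the shell itself has no floating-point arithmetic.

```shell
#!/bin/sh
# Per-step success rates (in percent), copied from the calculation in the text.
rates="99.5 99.5 98 98 95 95 99.5"

# Multiply the per-step success probabilities to get the overall success rate.
overall_success() {
  echo "$1" | awk '{ p = 1; for (i = 1; i <= NF; i++) p *= $i / 100; printf "%.4f\n", p }'
}

success="$(overall_success "$rates")"
# The failure probability is the complement of the success probability.
failure="$(echo "$success" | awk '{ printf "%.4f\n", 1 - $1 }')"

echo "success: $success"
echo "failure: $failure"
```

Changing a single rate in the list shows how quickly the overall figure degrades: every extra manual step multiplies in another factor below 1, so the chain is always less reliable than its weakest step.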