There is a famous quote often attributed to Henry Ford, the founder of the Ford Motor Company:
“If I had asked people what they wanted, they would have said faster horses.”
This is what happened with the traditional system administrator role: people were trying to solve the wrong problem.
By the wrong problem, I mean the lack of proper tools to automate interventions in production systems: tools that avoid human error (which is more common than you may think). That lack also breaks the continuity of communication across the processes of your company.
Initially, DevOps was the intersection of development and operations as well as QA. The DevOps engineer is supposed to do everything and be totally involved in the SDLC (software development life cycle), solving the communication problems that are present in traditional release management. This is ideal and, in my opinion, is what a full stack engineer should do: end-to-end software development, from requirement capture to deployments and maintenance.
Nowadays, this definition has been bent to the point where a DevOps engineer is basically a systems engineer using a set of tools to automate the infrastructure of a company. There is nothing wrong with this definition of DevOps, but keep in mind that we are losing a significant competitive advantage: the end-to-end view of the system. In general, I would not call this actor a DevOps engineer but a site reliability engineer (SRE). Site reliability engineering is a term introduced by Google a few years back, as sometimes (prominently in big companies) it is not possible to provide a single engineer with the level of access required to execute DevOps. We will talk more about this role in the next section, SRE model.
In my opinion, DevOps is a philosophy more than a set of tools or a procedure: having your engineers exposed to the full life cycle of your product requires a lot of discipline but gives you an enormous amount of control over what is being built. If the engineers understand the problem, they will solve it; it is what they are good at.
In the last few years, we have gone through a revolution in IT: it spread from pure IT companies to all sectors: retail, banking, finance, and so on. This has led to a number of small companies called start-ups, which are basically groups of individuals who had an idea, executed it, and went to market in order to sell the product or the service, usually to a global market. Companies such as Amazon or Alibaba, not to mention Google, Apple, Stripe, or even Spotify, have gone from the garage of one of the owners to big companies employing thousands of people.
One thing these companies have in common is what sparked them in the first place: corporate inefficiency. The bigger the company, the longer it takes to complete simple tasks.
Example of corporate inefficiency graph
This phenomenon creates a market of its own, with a demand that cannot be satisfied by traditional products. In order to provide a more agile service, these start-ups need to be cost-effective. It is okay for a big bank to spend millions on its currency exchange platform, but if you are a small company making your way through, your only chance against a big bank is to cut costs through automation and better processes. This is a big driver for small companies to adopt better ways of doing things, as every day that passes is one day closer to running out of cash, but there is an even bigger driver for adopting DevOps tools: failure.
Failure is a natural factor for the development of any system. No matter how much effort we put in, failure is always there, and at some point, it is going to happen.
Usually, companies are quite focused on removing failure, but there is an unwritten rule that keeps them from succeeding, the 80-20 rule:
- It takes 20% of time to achieve 80% of your goals. The remaining 20% will take 80% of your time.
Spending a huge amount of time on avoiding failure is bound to fail, but luckily, there is another solution: quick recovery.
Up until now, in my work experience, I have only seen one company asking "what can we do if this fails at 4 A.M.?" instead of "what else can we do to prevent this system from failing?", and believe me, it is a lot easier (especially with modern tools) to create a recovery system than to make sure that our systems won't go down.
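To put some numbers on the comparison above, here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recovery. The figures (one failure every 500 hours, 4 hours to recover) are assumptions picked purely for illustration, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative baseline (assumed numbers): fails every 500 h, 4 h to recover.
baseline = availability(mtbf_hours=500.0, mttr_hours=4.0)

# Option A: invest in avoiding failure, so failures are half as frequent.
fewer_failures = availability(mtbf_hours=1000.0, mttr_hours=4.0)

# Option B: invest in quick recovery, so each outage lasts half as long.
faster_recovery = availability(mtbf_hours=500.0, mttr_hours=2.0)

print(f"baseline:        {baseline:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

Under these assumptions, halving the recovery time buys the same availability as doubling the time between failures. The argument here is that, with modern tools, the former is usually the cheaper of the two to achieve.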
All these events (automation and failure management) led to the development of modern automation tools that enabled our engineers to:
- Automate infrastructure and software
- Recover from errors quickly
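As a rough illustration of the second point in the list above, the following sketch shows the idea of automated recovery in its simplest form: check whether something is healthy and, if it is not, run a recovery action and check again. The function `recover_if_unhealthy` and the in-memory `service` are hypothetical names invented for this example; a real setup would delegate this to the tooling of your platform rather than to hand-written code.

```python
import time
from typing import Callable


def recover_if_unhealthy(
    is_healthy: Callable[[], bool],
    recover: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Run `recover` after each failed health check, up to `max_attempts` times.

    Returns True as soon as the health check passes, False otherwise.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        recover()
        time.sleep(wait_seconds)  # give the system time to come back
    return is_healthy()


# Hypothetical stand-in for a real service: it starts down, a restart fixes it.
service = {"up": False, "restarts": 0}


def check() -> bool:
    return service["up"]


def restart() -> None:
    service["restarts"] += 1
    service["up"] = True


recovered = recover_if_unhealthy(check, restart)
print(f"recovered={recovered}, restarts={service['restarts']}")
```

The point of the sketch is the shape of the loop, not the details: nobody needs to be awake for the check or the restart to happen.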
DevOps fits perfectly into the small company world (start-ups): a few individuals who can access everything and execute the commands that they need to make changes in the system quickly. These ecosystems are where DevOps shines.
This level of access in traditional development models in big companies is a no-go. It can be an impediment even at a legal level if your system is dealing with highly confidential data, where you need to get your employees security clearance from the government in order to grant them access to the data.
It can also be convenient for the company to keep a traditional development team that delivers products to a group of engineers that runs them but works closely with the developers so that communication is not an issue.
SREs also use DevOps tools, but they usually focus more on building and running a middleware cluster (Kubernetes, Docker Swarm, and so on) that provides uniformity and a common language, so that the developers are abstracted from the infrastructure. The developers don't even need to know which hardware the cluster is deployed on; they just need to create the descriptors for the applications that they will deploy in the cluster, in an access-controlled and automated manner that ensures the security policies are followed.
SRE is a discipline of its own, and Google has published a free ebook about it, which can be found at https://landing.google.com/sre/book.html.
I would recommend that you read it, as it offers a fairly interesting point of view.