12 Principles for Building and Managing Kubernetes and Cloud Infrastructure
Building a resilient and reliable Kubernetes cloud infrastructure requires more than getting your clusters up and running with a fancy provisioning tool. Solid infrastructure design is a sequence of sound architecture decisions and experienced implementation. Luckily, many organizations and experts have gone down this path and shared their experiences.
I believe there is no single recipe for success; however, there are patterns and principles we have learned from failures and successes. Here, I summarize these core principles as a reference for cloud experts and decision-makers.
Simplification
Kubernetes is not a simple system to set up or operate. Its value is in reducing the complexity of managing large-scale workloads in a world where applications can scale to serve millions of users, and where cloud-native and microservices architectures are the approach of choice for many modern systems.
When it comes to infrastructure creation and management, we should not add yet another layer of complexity: the infrastructure is meant to be seamless and transparent to the products, and the primary focus should remain on the product, not the infrastructure.
Here comes the simplification principle, which does not mean applying trivial solutions but simplifying complex ones. It leads us to decisions such as deploying fewer clusters and avoiding multi-region and multi-cloud architectures, as long as we do not have a solid use case to justify them.
The simplification principle also applies to the infrastructure features and services we deploy to a Kubernetes cluster. It can be very attractive to deploy extra services (gold-plating the cluster) in the hope of making it powerful and feature-rich. In practice, this ends up complicating operations and decreasing platform reliability.
Cloud Managed Services
Although cloud managed services can appear pricier than self-managed ones, they are still the preferred option: in almost every scenario, a managed service is more efficient and reliable than its self-managed counterpart. This applies to managed Kubernetes services such as Google Kubernetes Engine (GKE), Azure Kubernetes Service (AKS), and Elastic Kubernetes Service (EKS), and it goes beyond Kubernetes to other infrastructure services such as databases, object stores, messaging queues, and caches. A cloud managed service may occasionally be less customizable or more expensive up front than a self-managed one, but you should still default to it once you account for the operational cost and overhead accumulated over the product's life cycle.
Standardization
Having a set of standards reduces the friction of teams aligning and working together, eases the scaling of processes, improves overall quality, and increases productivity. This becomes essential for companies and teams planning to use Kubernetes in production.
Your set of standards should cover processes, such as operations runbooks and playbooks, as well as technology choices, such as using containers, Kubernetes, and standard tools across teams.
These tools should share preferred characteristics: open source but battle-tested in production, supporting and promoting infrastructure as code and immutability, cloud-agnostic, and simple to use and deploy with minimal supporting infrastructure (think Ansible and Terraform).
The same principle applies to the technology stack and tools we choose: unifying and standardizing them across teams has proven more efficient than maintaining an inhomogeneous toolset that becomes hard to manage. Even when one of those tools is the best fit for a specific use case, the benefits of simplicity usually outweigh it.
Immutability
Immutability is an infrastructure provisioning principle in which we replace system components on each deployment instead of updating them in place. We create immutable components from images or declarative code, so we can build, test, and validate them and get the same predictable results every time. Docker images and AWS EC2 AMIs are examples of this concept.
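The idea can be sketched in a few lines of Python: instead of patching a running server in place, every rollout builds brand-new instances from a versioned image. The `Server` type and `rollout` helper here are illustrative assumptions, not any real tool's API.

```python
from dataclasses import dataclass

# frozen=True makes instances immutable: once built, a Server cannot be
# patched in place -- mirroring an AMI or container image.
@dataclass(frozen=True)
class Server:
    image: str      # e.g. an AMI ID or a container image tag
    version: str

def rollout(fleet: list[Server], new_image: str, new_version: str) -> list[Server]:
    """Replace every server with a fresh one built from the new image,
    rather than mutating the existing ones."""
    return [Server(image=new_image, version=new_version) for _ in fleet]

fleet = [Server("app:1.0", "1.0") for _ in range(3)]
fleet_v2 = rollout(fleet, "app:2.0", "2.0")

print(all(s.version == "2.0" for s in fleet_v2))  # True: a whole new fleet
print(all(s.version == "1.0" for s in fleet))     # True: old fleet untouched
```

Because every deployment starts from the same image, each environment is a predictable, reproducible copy rather than a drifted snowflake.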
Immutability leads to adopting the mentality of operating Kubernetes clusters as cattle instead of individual pets.
Everything as Code
This almost goes without saying, as it is one of the best-known industry standards and best practices for modern infrastructure and DevOps teams. The recommended approach is to use declarative infrastructure as code (IaC) and configuration as code (CaC) tools to build and operate cloud and Kubernetes infrastructure, regardless of the time your team invests in writing this code instead of getting a quick start with an imperative tool.
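The essence of the declarative model can be shown with a toy example: you describe the desired end state, and the tool computes the actions needed to reach it (much like a Terraform plan). The resource names and the `plan` function below are hypothetical, purely for illustration.

```python
# Declarative IaC in miniature: diff desired state against actual state
# and derive the create/update/delete actions, instead of scripting
# imperative steps by hand.

def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compute the actions that converge actual state to desired state."""
    return {
        "create": sorted(set(desired) - set(actual)),
        "update": sorted(r for r in desired.keys() & actual.keys()
                         if desired[r] != actual[r]),
        "delete": sorted(set(actual) - set(desired)),
    }

desired = {"vpc": {"cidr": "10.0.0.0/16"}, "cluster": {"nodes": 3}}
actual  = {"cluster": {"nodes": 2}, "old-db": {"size": "small"}}

print(plan(desired, actual))
# {'create': ['vpc'], 'update': ['cluster'], 'delete': ['old-db']}
```

The payoff is that the code, not a person's memory, is the complete description of the infrastructure, so it can be reviewed, versioned, and re-applied safely.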
Automation
We live in the era of software automation: we tend to automate everything because it is more efficient and easier to manage and scale. With Kubernetes, we need to take automation a level further. Kubernetes automates the management of the container life cycle, and it brings advanced automation concepts such as operators and GitOps, which are efficient and with which you can literally automate the automation.
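Operators are built on the control-loop pattern: observe the actual state, compare it with the desired state, and act until they converge. The sketch below simulates one reconcile pass over an in-memory "cluster" dict; it is a simplified illustration of the pattern, not real controller code.

```python
# The reconcile loop behind Kubernetes controllers and operators, reduced
# to its skeleton: converge actual state toward desired state.

def reconcile(desired_replicas: int, cluster: dict) -> list[str]:
    """One reconcile pass: add or remove replicas to match the desired count."""
    actions = []
    while cluster["replicas"] < desired_replicas:
        cluster["replicas"] += 1
        actions.append("scale-up")
    while cluster["replicas"] > desired_replicas:
        cluster["replicas"] -= 1
        actions.append("scale-down")
    return actions

cluster = {"replicas": 1}
print(reconcile(3, cluster))   # ['scale-up', 'scale-up']
print(cluster["replicas"])     # 3 -- actual state now matches desired state
```

A real operator runs this loop continuously against the Kubernetes API, which is what lets the system self-heal without human intervention.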
Source Of Truth
Having a single source of truth is a cornerstone of, and an enabler for, modern cloud infrastructure management and configuration. Source code control systems such as Git are the standard choice for this role, as they can store and version the infrastructure code. Keeping a dedicated infrastructure Git repository, separate from the product code, is a common best practice.
Design For Availability
Kubernetes is a key enabler for high availability of both the infrastructure and the applications, and having high availability as a design pillar from day 0 is critical to getting the full power of Kubernetes. Consider high availability at every design level: start by choosing a multi-zone or multi-region architecture, continue through the Kubernetes layer by designing multi-master clusters, and finish with application HA by designing the product itself to support high availability (usually through stateless services) and deploying multiple replicas of those services.
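Some back-of-the-envelope math shows why replicas matter so much: if each replica is available with probability a, the service is down only when all n replicas fail at once, so availability is 1 - (1 - a)^n. This assumes independent failures, which separate zones only approximate in practice.

```python
# Combined availability of n independent replicas, each available with
# probability a: the service fails only if every replica fails.

def availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

for n in (1, 2, 3):
    print(n, round(availability(0.99, n), 6))
# 1 0.99       -- "two nines"
# 2 0.9999     -- "four nines"
# 3 0.999999   -- "six nines"
```

Each added replica multiplies the remaining failure probability by (1 - a), which is why going from one replica to two is such a dramatic reliability win.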
Cloud-Agnostic
Being cloud-agnostic means you can run your workloads on any cloud with minimal vendor lock-in, but be careful not to get obsessed with the idea, and do not make it a goal in its own right. Docker and Kubernetes make cloud-agnostic platforms possible, though not without challenges. The concept also applies to the tools and technologies you select, such as Terraform versus CloudFormation.
Design For Scaling
The public cloud, with its elasticity, solved a problem that had always hindered business continuity for online services: scaling the infrastructure on demand. Cloud elasticity gave small businesses the same infrastructure luxury that used to be reserved for giant tech companies.
However, coping with ever-increasing scaling needs in real time remains a challenge. Containers made it possible to deploy and scale apps in seconds rather than minutes, which puts pressure on Kubernetes and the underlying infrastructure layers to support such massive scaling capabilities.
You need to plan scaling requirements for the future to support business expansion and continuity. Capacity planning raises questions such as whether to use a single large cluster or multiple smaller ones, how to manage infrastructure cost, what the best node sizes are, and what the most efficient resource utilization looks like. These questions need answers before creating the cluster, and they remain ongoing concerns during its operation.
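A first-pass answer to the node-sizing question can be estimated with simple arithmetic: sum the pod resource requests, reserve headroom for system overhead, and divide by node capacity. The function and all numbers below are illustrative assumptions; a real scheduler bin-packs pod by pod, so treat this only as a lower-bound estimate.

```python
import math

# Rough capacity planning: how many nodes of a given size does a set of
# pod CPU/memory requests need, leaving headroom for system daemons?

def nodes_needed(pods: list[dict], node_cpu: float, node_mem_gib: float,
                 headroom: float = 0.2) -> int:
    usable_cpu = node_cpu * (1 - headroom)       # capacity left for workloads
    usable_mem = node_mem_gib * (1 - headroom)
    total_cpu = sum(p["cpu"] for p in pods)
    total_mem = sum(p["mem_gib"] for p in pods)
    # The binding dimension (CPU or memory) decides the node count.
    return max(math.ceil(total_cpu / usable_cpu),
               math.ceil(total_mem / usable_mem))

pods = [{"cpu": 0.5, "mem_gib": 1.0}] * 40   # 20 vCPU, 40 GiB requested
print(nodes_needed(pods, node_cpu=8, node_mem_gib=32))  # 4 (CPU-bound here)
```

Rerunning the estimate with different node shapes is a quick way to compare the "few large nodes" versus "many small nodes" trade-off before committing to a cluster design.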
Plan For Failures
Many characteristics of distributed systems apply to Kubernetes and the microservices running on top of it, specifically fault tolerance and resiliency: we expect failures, and we plan for the failure of system components.
When designing a Kubernetes cluster, you have to design it to survive outages and failures, usually by adopting high-availability concepts. You also have to intentionally test and mitigate failures in your infrastructure and systems, using techniques such as chaos engineering, disaster recovery automation, infrastructure testing, and complete infrastructure CD and IaC.
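A chaos experiment, at its core, injects a failure and checks that the service survives. The sketch below simulates this entirely in memory with made-up pod names; real tooling (such as Chaos Mesh or Litmus) would kill actual pods in a cluster instead.

```python
import random

# Minimal chaos-engineering sketch: "kill" a random replica and verify
# the service still answers. Everything here is an in-memory simulation.

def serve(replicas: set[str]) -> bool:
    """The service is up as long as at least one replica is healthy."""
    return len(replicas) > 0

def chaos_experiment(replicas: set[str], rng: random.Random) -> bool:
    victim = rng.choice(sorted(replicas))   # pick a random replica to kill
    replicas.discard(victim)
    return serve(replicas)                  # did the service survive?

rng = random.Random(42)                     # seeded for reproducible runs
print(chaos_experiment({"pod-a", "pod-b", "pod-c"}, rng))  # True: replicas remain
print(chaos_experiment({"pod-a"}, rng))                    # False: single replica is a SPOF
```

The value of such experiments is turning the assumption "we survive a pod failure" into a test that runs regularly, so regressions in resilience are caught before a real outage does.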
Operational Excellence
Companies usually underestimate the effort required to operate containers in production: what to expect on day 2 and beyond, and how to prepare for outages, cluster upgrades, backups, performance tuning, resource utilization, and cost control. At that stage, companies need to figure out how to deliver changes continuously to a growing number of production and non-production environments. Without proper operations practices, this creates bottlenecks that slow business growth and, moreover, leads to unreliable systems that cannot fulfill customers' expectations. We have witnessed Kubernetes production rollouts that started successfully but eventually fell apart because of weak operations practices.
Learn about designing and building production-ready infrastructure in my new book about Kubernetes: 292 pages full of best practices, insights, and hands-on guidance to help you successfully build and manage your Kubernetes infrastructure. https://www.amazon.com/dp/1800202458/