Kubernetes Production Readiness

Aly Saleh
Sep 11, 2021 · 7 min read

“Your offering is production-ready when it exceeds customer expectations in a way that allows for business growth.” – Carter Morgan, Developer Advocate, Google

Production readiness is the goal every organization needs its Kubernetes infrastructure to reach so that it can feel confident about running its apps in the cloud. While there is no definitive definition for the production-readiness buzzword, it generally means a cluster capable of serving production workloads and live traffic reliably and securely. We can extend this definition further, but what many experts agree on is that you need to fulfill a minimum set of requirements before you mark your Kubernetes cluster as “production-ready”.

I categorized these readiness requirements according to the Kubernetes cluster layers illustrated in the following diagram:

Figure 1 — Kubernetes infrastructure layers

This diagram describes the typical layers of Kubernetes infrastructure. There are six layers: the public or private cloud infrastructure; the cloud IaaS; the Kubernetes cluster; the core cluster services; the cluster supporting services; and finally, the application layer.

The production-readiness checklist

I have organized the items into one checklist that maps to the corresponding infrastructure layers. Each category represents a group of design and implementation concerns that you need to consider while building your cluster infrastructure.

Cluster infrastructure

The following checklist items cover the production-readiness requirements on the cluster level:

  • Highly available control plane: You can achieve this by running the control plane components on three or more nodes. Another recommended best practice is to deploy the Kubernetes master components and etcd on two separate node groups. This is generally done to ease etcd operations, such as upgrades and backups, and to decrease the blast radius of control plane failures.

Also, for large Kubernetes clusters, this allows etcd to get proper resource allocation by running it on certain node types that fulfill its extensive I/O needs. Finally, avoid deploying pods to the control plane nodes.

  • Highly available node groups: You can achieve this by running one or more groups of worker nodes, each with three or more instances. If you run these worker groups with a public cloud provider, you should deploy them within an auto-scaling group and across different availability zones.

Another essential requirement for worker high availability is to deploy the Kubernetes Cluster Autoscaler, which enables worker nodes to scale up and down horizontally based on cluster utilization.

  • Shared storage management solution: You should consider using a shared storage management solution to persist and manage stateful apps’ data. There are plenty of choices, either open-source or commercial, such as AWS Elastic Block Store (EBS), Elastic File System (EFS), Google Persistent Disk, Azure Disk Storage, Rook, Ceph, and Portworx. There is no single right choice among them; it depends on your application’s use case and requirements.
  • Infrastructure observability stack: Collecting logs and metrics on the infrastructure level for nodes, network, storage, and other infrastructure components is essential for monitoring a cluster’s infrastructure, for gaining useful insights into its performance and utilization, and for troubleshooting outages.

You should deploy monitoring and alerting stacks, such as Node Exporter, Prometheus, and Grafana, and deploy a central logging stack, such as ELK (Elasticsearch, Logstash, and Kibana). Alternatively, you can consider a complete commercial solution, such as Datadog, New Relic, AppDynamics, and so on.
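To illustrate the shared-storage item above, here is a minimal sketch of dynamic provisioning with a StorageClass and a PersistentVolumeClaim. It assumes the AWS EBS CSI driver is installed in the cluster; the resource names are hypothetical:

```yaml
# StorageClass backed by the AWS EBS CSI driver (assumes the driver is installed)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-encrypted          # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp2
  encrypted: "true"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
# A stateful app claims storage through a PVC; the volume is provisioned on demand
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data               # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2-encrypted
  resources:
    requests:
      storage: 20Gi
```

`WaitForFirstConsumer` delays provisioning until a pod is scheduled, which keeps the volume in the same availability zone as the node that will mount it.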

Fulfilling the previous requirements will ensure the production readiness of the cluster infrastructure. Later in this book, we will show you in more detail how to achieve each of these requirements through infrastructure design, Kubernetes configuration tuning, and third-party tools usage.

Cluster services

The following checklist items cover the production-readiness requirements on the cluster services level:

  • Control cluster access: Kubernetes offers authentication and authorization choices and lets the cluster admin configure them according to their needs. As a best practice, you should ensure that the authentication and authorization configuration is tuned and in place. Integrate with an external authentication provider, such as LDAP, OpenID Connect (OIDC), or AWS IAM, to authenticate the cluster’s users.

For authorization, you need to configure the cluster to enable Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), or webhook authorization.

  • Hardening the default pod security policy: Pod security policy (PSP) is a Kubernetes resource that ensures a pod meets specific requirements before it gets created.

As a best practice, we recommend that you restrict privileged pods to the kube-system namespace. For all other namespaces that host your application pods, we recommend assigning a restrictive default PSP.

  • Cluster policies and rules: Rules and policy enforcement are essential for every Kubernetes cluster. This is true for both a small single-tenant cluster and a large multi-tenant one. Kubernetes introduces native objects for this purpose, such as pod security policies, network policies, resource limits, and quotas.

For custom rule enforcement, you may deploy a policy engine such as OPA Gatekeeper. This enables you to enforce rules such as: pods must have resource limits in place, namespaces must have specific labels, images must come from known repositories, and many others.

  • Fine-tune the cluster DNS: Running a DNS for Kubernetes clusters is essential for name resolution and service connectivity. Managed Kubernetes comes with a cluster DNS pre-deployed, usually CoreDNS. For self-managed clusters, you should consider deploying CoreDNS too. As a best practice, you should fine-tune CoreDNS to minimize errors and failure rates, optimize performance, and adjust caching and resolution times.
  • Restricted network policies: By default, Kubernetes allows all traffic between pods inside a single cluster. This behaviour is insecure in a multi-tenant cluster. As a best practice, you should enable network policies in your cluster and create a default deny-all policy to block all traffic among pods, then create network policies with less restrictive ingress/egress rules to allow traffic between specific pods wherever it is needed.
  • Security checks and conformance: Securing a Kubernetes cluster is non-negotiable. There are a lot of security configurations to enable and tune in a cluster, which can get tricky for cluster admins. Luckily, there are tools that scan the cluster configuration to assess whether it is secure and meets minimum security requirements. You should automate running security scanning tools, such as kube-scan for security configuration scanning, kube-bench for security benchmarking, and Sonobuoy to run the standard Kubernetes conformance tests against the cluster.
  • Backup and restore: As with any system, Kubernetes could fail, so you should have a proper backup and restore process in place. You should consider tools to back up data, snapshot the cluster control plane, or back up the etcd database.
  • Observability for the cluster components: Monitoring and central logging are essential for Kubernetes components such as control-plane, kubelet, container runtime, and more. You should deploy monitoring and alerting stacks such as Node Exporter, Prometheus, and Grafana, and deploy a central logging stack, such as EFK (Elasticsearch, Fluentd, and Kibana).
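As an illustration of the cluster-access item above, here is a minimal RBAC sketch: a namespaced Role granting read-only access to pods, bound to a hypothetical group that your external identity provider (OIDC, LDAP, and so on) maps users into. Names and namespace are illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader             # hypothetical name
  namespace: staging           # hypothetical namespace
rules:
- apiGroups: [""]              # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: staging
subjects:
- kind: Group
  name: dev-team               # hypothetical group from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```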
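For the policies-and-rules item, the native objects mentioned can be sketched as a ResourceQuota capping a namespace’s total usage plus a LimitRange supplying per-container defaults. The namespace and all values are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: staging           # hypothetical namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: staging
spec:
  limits:
  - type: Container
    default:                   # applied as limits when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:            # applied as requests when a container sets none
      cpu: 100m
      memory: 128Mi
```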
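The default deny-all policy described in the network-policies item can be expressed as a NetworkPolicy with an empty pod selector (matching every pod in the namespace) and no ingress or egress rules:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: staging           # hypothetical; apply one per tenant namespace
spec:
  podSelector: {}              # empty selector matches all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Note that NetworkPolicy objects only take effect if your CNI plugin (Calico, Cilium, and so on) enforces them.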
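As an example of CoreDNS tuning, the `cache` plugin TTL in the Corefile (stored in the `coredns` ConfigMap in `kube-system`) is a common knob. This sketch shows a typical default Corefile with the cache TTL raised; the exact defaults vary by distribution:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 60               # raise the cache TTL from the usual 30s
        loop
        reload
        loadbalance
    }
```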

Fulfilling the previous requirements will ensure the production readiness of the cluster services. Later in this book, we will show you in more detail how to achieve each of these requirements through Kubernetes configuration tuning and third-party tools usage.

Apps and deployments

The following checklist items cover the production-readiness requirements on the apps and deployments level:

  • Image quality and vulnerability scanning: An image that runs a low-quality app, or one built from poor-quality specs, can harm the reliability of the cluster and the other apps running on it. The same goes for images with security vulnerabilities. For that reason, you should run a pipeline that scans images deployed to the cluster for security vulnerabilities and deviations from quality standards.
  • Network ingress controller: By default, you can expose Kubernetes services outside the cluster using load balancers and node ports. However, the majority of apps have advanced routing requirements, and deploying an ingress controller such as the NGINX Ingress Controller is the de facto solution that you should include in your cluster.
  • Certificates and secrets management: Secrets and TLS certificates are commonly used by modern apps. Kubernetes comes with a built-in Secret object that eases the creation and management of secrets and certificates inside the cluster. In addition, you can extend it by deploying third-party services, such as Sealed Secrets for encrypted secrets and Cert-Manager to automate certificate issuance from providers such as Let’s Encrypt or Vault.
  • Applications observability: You should make use of Kubernetes’ built-in monitoring capabilities, such as defining readiness and liveness probes for the pods. Besides that, you should deploy a central logging stack for the applications’ pods. Deploy a black-box monitoring solution or use a managed service to monitor your apps’ endpoints. Finally, consider using application performance monitoring solutions, such as New Relic APM, Datadog APM, AppDynamics APM, and more.
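A minimal Ingress for the routing item above, assuming the NGINX Ingress Controller is deployed; the hostname, Service name, and TLS secret are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx      # matches the deployed NGINX Ingress Controller
  rules:
  - host: app.example.com      # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-svc      # hypothetical Service
            port:
              number: 80
  tls:
  - hosts: ["app.example.com"]
    secretName: web-tls        # e.g. a certificate issued by Cert-Manager
```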
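The readiness and liveness probes mentioned above are declared per container. A sketch, assuming the app exposes a `/healthz` endpoint on port 8080 (both hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: app
    image: registry.example.com/web:1.0   # hypothetical image
    ports:
    - containerPort: 8080
    readinessProbe:            # gates the pod in and out of Service endpoints
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:             # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

A failing readiness probe only removes the pod from load balancing, while a failing liveness probe restarts the container, so the two should not share overly aggressive thresholds.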

Fulfilling the previous requirements will ensure the production readiness of the apps and deployments. Later in this book, we will show you in more detail how to achieve each of these requirements through Kubernetes configuration tuning and third-party tool usage.

Learn about designing and building production-ready infrastructure in my new book about Kubernetes: 292 pages full of best practices, insights, and hands-on guidance to help you successfully build and manage your Kubernetes infrastructure. https://www.amazon.com/dp/1800202458/
