The kubelet can be instructed to reserve a certain amount of resources for the operating system and for Kubernetes system components (such as the kubelet itself and the container runtime).
Reserved resources are subtracted from the node's allocatable resources. This improves scheduling and makes resource allocation/usage more transparent.
See the official documentation for how to reserve resources.
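A minimal sketch of how this might look in the kubelet configuration file (the file passed to the kubelet via --config; the values are placeholders you would tune per node type):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:            # reserved for OS daemons (sshd, systemd, ...)
  cpu: 500m
  memory: 512Mi
kubeReserved:              # reserved for Kubernetes components (kubelet, container runtime)
  cpu: 500m
  memory: 512Mi
evictionHard:              # optional: hard eviction thresholds also reduce allocatable memory
  memory.available: 200Mi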
https://itnext.io/on-grpc-load-balancing-683257c5b7b3 https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/
If the content of an image changes, the image tag must change too.
Common strategies are to use the Git commit hash or the CI build ID as part of the image tag. Can be combined with semantic versioning.
For example: v1.0.1-bfeda01f
Avoid the latest tag, even for simple applications.
Requires installation of an Ingress controller.
TODO: integrate with current "Set pod disruption budgets" in "Fault tolerance" (application development)
For example, Azure Active Directory, AWS IAM.
Authentication is done with a bearer token, and the API server validates the token with the external service.
TODO: is this a best practice?
ChartMuseum, Artifactory
TODO: integrate with current "Network policies"
TODO: integrate with current "Pod security policies" (governance)
- Using PodSecurityPolicies requires the PodSecurityPolicy admission controller to be enabled, but in most Kubernetes deployments it is not enabled. As soon as the PodSecurityPolicy admission controller is enabled, you need appropriate PodSecurityPolicy resources to allow any Pods to be created.
- You also need to grant "use" access to the created PodSecurityPolicies to the service account of the workload or the controller of the workload (you can use the system:serviceaccounts group, which comprises all controller service accounts); a sketch of such a binding follows below.
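A minimal sketch of such a binding (the PodSecurityPolicy name and resource names are placeholders):
# ClusterRole that allows "use" of a specific PodSecurityPolicy
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-user
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["restricted"]      # the PodSecurityPolicy to allow
    verbs: ["use"]
---
# Bind the ClusterRole to all service accounts via the system:serviceaccounts group
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-all-serviceaccounts
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted-user
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:serviceaccounts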
TODO: integrate with current "Custom policies" (governance)
Only allow Kubernetes resources (of any kind) that comply with the defined policies to be applied to the cluster.
- Open Policy Agent (OPA): policy engine
- Gatekeeper:
  - Validating admission control webhook
  - Kubernetes Operator for installing, configuring, and managing Open Policy Agent policies
Example policies that can be implemented with Gatekeeper:
- Services must not be exposed publicly on the internet
- Allow containers only from trusted container registries (a sketch of this policy follows the list)
- All containers must have resource limits
- Ingress hostnames must not overlap
- Ingresses must use only HTTPS
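As an illustrative sketch of the trusted-registries policy above (the template and Rego are adapted from Gatekeeper's policy library; names, API versions, and the registry prefix are placeholders):
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          satisfied := [good | repo = input.parameters.repos[_]; good = startswith(container.image, repo)]
          not any(satisfied)
          msg := sprintf("container image %v does not come from an allowed registry %v", [container.image, input.parameters.repos])
        }
---
# A Constraint that applies the template to Pods, allowing a single registry prefix (illustrative)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: pods-from-company-registry
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "registry.example.com/"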
Use storage services outside the cluster (e.g. Amazon DynamoDB, Amazon S3).
Because this name is often the default in Helm charts.
Many stateful applications (e.g. databases) have operators that make it easier to run and manage them reliably.
Recommended set of admission controllers to be enabled: NamespaceLifecycle, LimitRanger, ServiceAccount, DefaultStorageClass, DefaultTolerationSeconds, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, Priority, ResourceQuota, PodSecurityPolicy
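As a sketch of how this could look on a kubeadm-style control plane, where the API server runs as a static Pod (the manifest path, image tag, and the omitted flags are assumptions):
# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout; excerpt, other flags omitted)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
    - name: kube-apiserver
      image: k8s.gcr.io/kube-apiserver:v1.18.0   # illustrative image/version
      command:
        - kube-apiserver
        - --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,Priority,ResourceQuota,PodSecurityPolicy
        # ... other flags ...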
If you use multiple mutating admission webhooks, make sure that they don't modify the same fields of a resource.
They would undo each other's actions. Furthermore, the order in which admission webhooks run is undefined.
If you create a mutating admission webhook, also create a validating admission webhook that validates the mutations.
Use the namespaceSelector field in the MutatingWebhookConfiguration/ValidatingWebhookConfiguration resource (a sketch follows below).
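A minimal sketch of a ValidatingWebhookConfiguration that is limited to namespaces carrying a certain label (the webhook name, backing Service, and label are placeholders):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook
webhooks:
  - name: pods.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: policy-webhook          # Service fronting the webhook server (placeholder)
        namespace: policy-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE", "UPDATE"]
    namespaceSelector:                # only namespaces with this label are subject to the webhook
      matchLabels:
        policy-webhook: enabled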
Restrict "create" on MutatingWebhookConfiguration/ValidatingWebhookConfiguration resources.
Do not manually maintain copies of the same YAML files for the different environments. Use a templating system like Helm, kustomize, Kapitan, ytt.
Works by defining the common structure of the YAML files once and defining values to substitute for each environment.
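As an illustrative sketch of the kustomize flavour of this approach (directory layout and names are assumptions; newer kustomize versions prefer the patches field over patchesStrategicMerge):
# base/kustomization.yaml — the common structure shared by all environments
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/production/kustomization.yaml — production-specific values layered on top of the base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: production
patchesStrategicMerge:
  - replica-count.yaml   # e.g. bumps spec.replicas for production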
TODO: integrate with current "Logging" (application development) and "Logging setup" (cluster configuration).
This allows for the use of node agent based log aggregation systems rather than sidecar container based ones.
Everything a container writes to stdout or stderr is saved by the container runtime at a known location. This means that an agent running on each worker node can collect the logs of all the containers on that node and send them to the central log store.
If the container writes the logs to a file instead, the log file is local to the container and can't be accessed by a node agent. In this case, each Pod would require a sidecar container that collects these logs and sends them to the central log store.
A node agent (for example, deployed as a DaemonSet) is more efficient and easier to manage than a sidecar container for every Pod.
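A minimal sketch of such a node agent deployed as a DaemonSet (image, version, and namespace are assumptions; a real deployment also needs a Fluent Bit configuration and an output pointing at your log store, plus RBAC for Kubernetes metadata enrichment):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:1.5          # illustrative image/version
          volumeMounts:
            - name: varlog
              mountPath: /var/log               # container logs live under /var/log/containers and /var/log/pods
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log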
Cluster components and applications log to their containers, which means that logs are distributed all across the cluster. This makes it difficult to inspect the logs (probably of multiple components) to troubleshoot a problem.
A log aggregation system collects all these logs and saves them at a central place in the cluster so that you can inspect them at a single place.
Some log aggregation tools include: EFK stack (Elasticsearch, Fluentd, Kibana), Loki, DataDog, Sumo Logic, Sysdig, GCP Stackdriver, Azure Monitor, AWS CloudWatch
The operational overhead of running your own log aggregation system can become quite large. This is because for running a log aggregation system, you need to take care of things like persistent storage, backups, archival, log rotation, etc.
With a hosted log aggregation solution, all these things are taken care of for you and there are fewer things that can go wrong.
Some hosted log aggregation systems include: DataDog, Sumo Logic, Sysdig, GCP Stackdriver, Azure Monitor, AWS CloudWatch
How long to retain logs in the log store and what to do with them afterwards?
As a general rule, retaining logs for 30-45 days in the log store is a reasonable value. After that, if the logs should still be available, you can move them to a cost-efficient archival storage (e.g. Amazon Glacier).
Take into account any regulatory and internal compliance requirements of your organisation.
Your applications are not the only components that produce logs in your cluster — all the cluster components do as well, and you should collect their logs too.
Here are some of the cluster components whose logs you should also have in your log aggregation system:
- All nodes: kubelet, container runtime
- Master nodes: API server, scheduler, controller manager
- Kubernetes audit logs (all requests to the API server), if auditing is enabled
TODO: what kind of advice to give in this item?
Monitoring is the collection and aggregation of measurements from different components of your cluster. It is extremely important for gaining insight into the internals of your cluster, assessing its health, detecting and troubleshooting problems, and even preventing them before they occur.
Monitoring consists of two parts:
- Components make measurements and expose them as metrics
- The monitoring system periodically collects these metrics
Some monitoring systems include:
- Self-hosted: Prometheus
- Managed: DataDog, Sumo Logic, Sysdig, Google Stackdriver, Azure Monitor, Azure Monitor for Containers, AWS CloudWatch, AWS Container Insights
What should you monitor?
Exactly what metrics to collect depends on the components in your cluster (i.e. what metrics they expose).
There are some general guidelines as to the types of metrics to collect:
- Infrastructure (e.g. nodes): USE metrics — Utilisation, Saturation, Errors
- Applications: RED metrics — Rate, Errors, Duration
You could run the monitoring system in the production cluster itself. However, if there is a problem with the cluster, the monitoring system might be affected as well (and you might need the monitoring system to troubleshoot the problem with the cluster).
To avoid this, you can create a dedicated cluster that only runs management tools (such as the monitoring system) for all your other clusters.
Omit alerts that don't require a human to take action right away. Too many unimportant alerts lead to "alert fatigue" and cause important alerts to be ignored.
The core of your alerts should be on incidents that negatively affect the service level that you promised to your customers.
All incidents that don't merit an alert (because they don't require immediate human intervention, or they don't affect the customer experience) should be handled automatically (invest in automation).
TODO: integrate with current "Configuration and secrets"
Configuration should be maintained outside the application code.
This has several benefits. First, changing the configuration does not require recompiling the application. Second, the configuration can be updated when the application is running. Third, the same code can be used in different environments.
In Kubernetes, the configuration can be saved in ConfigMaps, which can then be mounted into containers as volumes or passed in as environment variables.
Save only non-sensitive configuration in ConfigMaps. For sensitive information (such as credentials), use the Secret resource.
Secrets are similar to ConfigMaps but have some special semantics to protect their content (e.g. content of Secrets is not displayed in some kubectl outputs)
The content of Secret resources should be mounted into containers as volumes rather than passed in as environment variables.
This prevents the secret values from appearing in the command that was used to start the container, which may be inspected by individuals who shouldn't have access to the secret values.
- Injected environment variables are always present and may end up as artifacts in logs for the entire system.
- Secrets should therefore be mounted as volumes rather than exposed as environment variables; that way, they're only available to the desired process/container, not the whole Pod (a sketch follows below).
See tweet
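A minimal sketch of mounting a Secret as a volume (Secret name, keys, values, and mount path are placeholders):
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  username: admin              # placeholder values
  password: s3cr3t
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx             # illustrative image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/db-credentials   # each key becomes a file under this path
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials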
For example, instead of naming a ConfigMap just config, name it config-v1. Whenever you update the configuration in the ConfigMap, also update the version (for example, config-v2). Then, update all references to the ConfigMap (for example, in a Deployment) to the new version (e.g. config-v2).
This causes all Pods to be restarted, which ensures that the new configuration is indeed loaded into the Pods, regardless of whether the ConfigMap is mounted as a volume or passed in as environment variables, and, in the former case, regardless of whether the app watches the configuration file for changes.
You can automate this process with a CD system.
If you don't delete the previous versions of the ConfigMap (e.g. config-v1), you can easily roll back to a previous configuration.
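A minimal sketch of the versioned-ConfigMap approach (names, keys, and the image are placeholders); bumping the reference from config-v1 to config-v2 rolls all Pods of the Deployment:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-v2              # bump the suffix on every configuration change
data:
  app.properties: |
    log.level=info
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: nginx          # illustrative image
          volumeMounts:
            - name: config
              mountPath: /etc/app
      volumes:
        - name: config
          configMap:
            name: config-v2     # changing this reference restarts all Pods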
TODO: integrate with current "Role-Based Access Control (RBAC) policies" (governance)
If a Pod needs to access the Kubernetes API, tailor an RBAC role that allows exactly those operations that the Pod has to do (and nothing more), assign it to a new service account, and assign this service account to the Pod (a sketch follows below).
Don't reuse an existing service account for the Pod that might have an associated role with more permissions than the Pod needs.
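A minimal sketch (namespace, names, and the allowed operations are placeholders; here the Pod only needs read access to Pods in its own namespace):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # exactly the operations the Pod needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
subjects:
  - kind: ServiceAccount
    name: pod-reader
    namespace: default
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: default
spec:
  serviceAccountName: pod-reader      # run the Pod with the tailored service account
  containers:
    - name: app
      image: nginx                    # illustrative image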
TODO: integrate with current "Tagging resources"
What?
The Kubernetes documentation recommends a set of common labels to be applied to all resource objects.
Why?
Having a set of common labels allows third-party tools to interoperate with the resources in your cluster. Furthermore, a common set of labels facilitates and standardises manual management of the resources.
How?
The recommended labels are:
- app.kubernetes.io/name: the name of the application
- app.kubernetes.io/instance: a unique name identifying the instance of an application
- app.kubernetes.io/version: the current version of the application (e.g., a semantic version, revision hash, etc.)
- app.kubernetes.io/component: the component within the architecture
- app.kubernetes.io/part-of: the name of a higher level application this one is part of
- app.kubernetes.io/managed-by: the tool being used to manage the operation of an application
Here is an example of applying these labels to a StatefulSet resource:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-abcxyz
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
spec:
  # ...
  template:
    metadata:
      labels:
        app.kubernetes.io/name: mysql
        app.kubernetes.io/instance: mysql-abcxzy
        app.kubernetes.io/version: "5.7.21"
        app.kubernetes.io/component: database
        app.kubernetes.io/part-of: wordpress
        app.kubernetes.io/managed-by: helm
    spec:
      # ...
Note that depending on the application, only some of these labels may be defined (for example, app.kubernetes.io/name and app.kubernetes.io/instance), but they should be applied to all resource objects.
References
E.g. if the Pod specification changes, update only the Pod and app versions, but not the container image version, etc.
Resources that comprise an entire app (e.g. Deployment, StatefulSet) should have a release label.
This label should change every time the application is deployed (the release should be updated even if the version of the app stays the same).
TODO: integrate with current "Resource utilisation"
What?
You can set a resource request (CPU and memory) for each container of a Pod in the Pod specification.
Why?
The request for a resource is the minimum amount of resource the container needs for running. It influences the scheduling decision of the Kubernetes scheduler (the scheduler schedules a pod only to a node that has enough free resources to accommodate the requests of all the containers of the pod).
By default, no resource requests are set, which means that the scheduler schedules a pod to any node, no matter how many free resources it has.
Setting requests for memory and CPU for each container of a Pod ensures that the Pod is able to run properly on the node it is scheduled to.
How?
Set the pod.spec.containers[].resources.requests field of a Pod specification:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  # ...
  template:
    # ...
    spec:
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              memory: 100Mi
              cpu: 100m
References
The limit for a resource is the maximum amount of the resource that the container is allowed to use (you can think of it as an upper cap for resource usage bursts). What happens when a container reaches the limit depends on the type of resource:
- For memory, the container is killed and restarted
- For CPU, the container is throttled (TODO: what does "throttled" mean exactly?)
By default, no resource limits are set, which means that there's no limit on the amount of resources a pod may use on a node.
Setting resource limits prevents Pods from monopolising all the resources on a node.
Given a value for the resource request, the value for the resource limit determines the Quality of Service (QoS) class of the Pod (a sketch of a Guaranteed Pod follows the list below):
- Best-effort: no requests and limits, lowest priority, Pod is killed first
- Burstable: requests are set and limits (if set) are higher than requests, medium priority, killed if no Best-effort Pods exist
- Guaranteed: limits equal requests, highest priority, killed only if no Best-effort and Burstable Pods exist
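A minimal sketch of a Pod whose container sets limits equal to requests and would therefore be placed in the Guaranteed QoS class (image and values are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
    - name: app
      image: nginx               # illustrative image
      resources:
        requests:
          memory: 256Mi
          cpu: 250m
        limits:                  # limits equal requests => Guaranteed QoS
          memory: 256Mi
          cpu: 250m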
TODO: how does this affect the QoS class? Where's the source that this is a best practice?
TODO: integrate with current "Health checks"
TODO: integrate with current "Graceful shutdown" (rename it "Application lifecycle")
This is an alternative to listening for the SIGTERM signal.
As a best practice to use this, I would recommend using kube2iam -- this allows you to control which IAM roles a pod is allowed to assume based on namespace, but also allows the pod to see itself as having the role vs requiring it to assume such a role.
In our setup, we use kube2iam and assign nodes their IAM role; this IAM role is allowed to assume other roles. With kube2iam, we assign the role through a pod annotation, so the pod sees itself as having that role. Additionally, we use namespace restrictions to control which roles a pod is allowed to assume.
See #7
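A sketch of what this looks like with kube2iam, to the best of my knowledge (account ID, role names, and namespace are placeholders; check the kube2iam documentation for the exact annotation format and the namespace-restrictions flag):
# Namespace annotation restricting which roles pods in this namespace may assume
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  annotations:
    iam.amazonaws.com/allowed-roles: |
      ["arn:aws:iam::123456789012:role/payments-*"]
---
# Pod annotation assigning the IAM role; the pod sees itself as having this role
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
  annotations:
    iam.amazonaws.com/role: arn:aws:iam::123456789012:role/payments-api
spec:
  containers:
    - name: app
      image: nginx               # illustrative image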