CGROUPS
- cpu – guarantees a minimum and limits the maximum number of "CPU shares", so that no process is starved of CPU time.
- cpuacct – generates reports on CPU usage; accounts for the CPU time consumed by a process.
- cpuset – allows you to pin a process to specific cores. For example, it specifies that only certain processes have access to a certain core.
- memory – monitors and limits the amount of memory used by a process.
- blkio – sets limits on reading from and writing to block devices.
- cgroup v2 is the next version of the Linux cgroup API. cgroup v2 provides a unified control system with enhanced resource management capabilities.
- Java 15+ can use cgroup v2. Applications (on Java 15+) can be configured to use the container's quotas rather than all the resources available on the Kubernetes node (see the sketch after this list).
- Supported by k8s.
- Enhanced resource allocation management and isolation across multiple resources
- Unified accounting for different types of memory allocations (network memory, kernel memory, etc).
- The kubelet automatically detects that the OS is running on cgroup v2 and behaves accordingly, with no additional configuration required.
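A minimal sketch of a container-aware Java workload (the pod name, image, and flag value are assumptions, not from the source): the JVM reads the container's cgroup limits, and MaxRAMPercentage sizes the heap relative to the container's memory limit rather than the node's RAM.
#Example: container-aware Java heap sizing
apiVersion: v1
kind: Pod
metadata:
  name: java-app                      # hypothetical name
spec:
  containers:
  - name: app
    image: eclipse-temurin:17-jre     # any Java 15+ runtime image; this one is an assumption
    env:
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:MaxRAMPercentage=75.0"   # heap = 75% of the 512Mi limit, not of node RAM
    resources:
      limits:
        memory: "512Mi"
        cpu: "1"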
CAPABILITIES
Permissions for a process to make certain system calls. Only a few dozen of them exist, e.g.:
- CAP_CHOWN – permission to change the UID and GID of a file
- CAP_KILL – permission to send signals (SIGTERM, SIGKILL, etc.)
- CAP_NET_BIND_SERVICE – permission to bind to ports with numbers below 1024
- and so on (an example follows below).
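A minimal sketch of managing capabilities per container via securityContext (pod and container names are made up): drop everything, then add back only NET_BIND_SERVICE so a non-root process can bind to port 80.
#Example: container capabilities
apiVersion: v1
kind: Pod
metadata:
  name: cap-demo            # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    securityContext:
      capabilities:
        drop: ["ALL"]                 # start with no capabilities
        add: ["NET_BIND_SERVICE"]     # allow binding to ports below 1024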
REQUESTS AND LIMITS
- Requests – a guaranteed amount of resources (if the node does not have enough free resources, the scheduler does not place the Pod on that node).
- Limits – the maximum amount of a resource. Nothing is guaranteed, i.e. the total of all limits can exceed the entire namespace quota; for example, you can set 999 trillion cores.
- If you set only limits, then requests = limits automatically.
- If you set only requests, then no limits are applied.
#Container resources example
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
        ephemeral-storage: "2Gi"
      limits:
        memory: "128Mi"
        cpu: "500m"
        ephemeral-storage: "4Gi"
How CPU requests work
- The CPU request is converted into cgroup "CPU shares" (in cgroup v1, 1 core = 1024 shares) and determines the container's share of CPU time when the node's CPUs are contended.
How CPU limits work
- cfs_period_us – the time period within which quota usage is accounted. Equals 100000 µs (100 ms).
- cfs_quota_us – the allowed amount of CPU time, in µs, per period.
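A sketch of how a limit turns into a CFS quota (pod name is made up): cpu: "500m" gives cfs_quota_us = 0.5 × 100000 = 50000, i.e. at most 50 ms of CPU time in every 100 ms period; beyond that the container is throttled.
#Example: CPU limit to CFS quota mapping
apiVersion: v1
kind: Pod
metadata:
  name: cfs-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      limits:
        cpu: "500m"         # cfs_quota_us = 50000 per cfs_period_us = 100000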
CPU Management Policy
vim /etc/systemd/system/kubelet.service
--cpu-manager-policy=static \
--kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi \
--system-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi \
- Allows you to assign dedicated cores to containers (via cpuset).
- Works only if the pod has the Guaranteed QoS class.
- The CPU request value must be an integer (as in the sketch below).
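A minimal sketch of a pod eligible for dedicated cores under the static policy (names are made up): requests equal limits, which yields the Guaranteed QoS class, and the CPU value is an integer.
#Example: pod eligible for exclusive cores
apiVersion: v1
kind: Pod
metadata:
  name: pinned-app          # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"            # integer, as the static policy requires
        memory: "1Gi"
      limits:
        cpu: "2"            # equal to requests => Guaranteed QoS
        memory: "1Gi"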
The role of the K8s scheduler in quota distribution
- Filtering – the scheduler selects suitable nodes. NodeResourcesFit is the scheduler plugin that checks node resources: it determines which nodes have enough resources for the Pod; some resources can be configured not to be checked.
- Scoring – evaluates the suitable nodes and selects the best one. Scoring strategies:
- LeastAllocated (default) – prefers the least utilized node.
- MostAllocated – prefers the most utilized node (bin-packing).
- RequestedToCapacityRatio – scores nodes by the ratio of requested resources to capacity, with configurable weights.
#Example to use scoringStrategy
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        type: MostAllocated
    name: NodeResourcesFit
Storage Resource Quota
- requests.storage – across all persistent volume claims, the sum of storage requests cannot exceed this value.
- persistentvolumeclaims – the total number of PersistentVolumeClaims that can exist in the namespace.
- <storage-class-name>.storageclass.storage.k8s.io/requests.storage – across all persistent volume claims associated with <storage-class-name>, the sum of storage requests cannot exceed this value.
- <storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims – across all persistent volume claims associated with <storage-class-name>, the total number of persistent volume claims that can exist in the namespace.
gold.storageclass.storage.k8s.io/requests.storage: 500Gi
bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
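A sketch of a ResourceQuota combining these keys (the quota name and the values other than the gold/bronze ones above are made up):
#Example: storage ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota       # hypothetical name
spec:
  hard:
    requests.storage: 600Gi
    persistentvolumeclaims: "10"
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi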
Ephemeral storage
- requests.ephemeral-storage – across all pods in the namespace, the sum of local ephemeral storage requests cannot exceed this value. The amount of free space that must be available on the node when the container is launched.
- limits.ephemeral-storage – across all pods in the namespace, the sum of local ephemeral storage limits cannot exceed this value. The maximum amount of ephemeral storage available to the pod.
- ephemeral-storage – same as requests.ephemeral-storage. Covers emptyDir volumes (except tmpfs-backed ones), container logs, and writable container layers. Since this storage is shared, if one container exhausts it, it runs out for all containers on the node.
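A sketch of a namespace quota for ephemeral storage (name and values are made up):
#Example: ephemeral storage ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-quota     # hypothetical name
spec:
  hard:
    requests.ephemeral-storage: 10Gi
    limits.ephemeral-storage: 20Gi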
Quite obscure quotas
- count/<resource>.<group> – the maximum number of resources of this type in the namespace.
- count/widgets.example.com – an example for the widgets custom resource from the example.com API group.
Typical object counts (a quota sketch follows after this list):
- count/persistentvolumeclaims
- count/services
- count/secrets
- count/configmaps
- count/replicationcontrollers
- count/deployments.apps
- count/replicasets.apps
- count/statefulsets.apps
- count/jobs.batch
- count/cronjobs.batch
- Protection against errors – for example, the default limit is 110 pods per node.
- Prevents bad practices.
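A sketch of an object count quota (name and values are made up):
#Example: object count ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts       # hypothetical name
spec:
  hard:
    count/configmaps: "10"
    count/secrets: "20"
    count/deployments.apps: "5"
    count/cronjobs.batch: "3"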
PID limits
- A classic shell fork bomb that exhausts the process table unless PID limits are in place:
:(){ :|:& };:
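A sketch of capping processes per pod through the kubelet configuration (the limit value is an illustrative assumption):
#Example: kubelet PID limit
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096          # maximum number of processes per pod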
Quotas for extended resources
- Extended resources – any resource configured by the cluster operator that comes from outside; k8s knows nothing about it and does not manage it in any way.
- Node level – tied to a node, i.e. each node has some amount of the resource. Often managed by a Device Plugin.
- Cluster level – shared by the entire cluster.
#correct:
requests.nvidia.com/gpu: "4"
#incorrect:
limits.nvidia.com/gpu: "4"
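A sketch of quotaing an extended resource (quota name is made up); only the requests.* form is accepted for extended resources, since their requests and limits must be equal anyway:
#Example: extended resource ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota           # hypothetical name
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # limits.nvidia.com/gpu would be rejected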
Network quotas and network bandwidth
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        # Ingress bandwidth
        kubernetes.io/ingress-bandwidth: 100M
        # Egress bandwidth
        kubernetes.io/egress-bandwidth: 1G
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
- Limited via a CNI bandwidth plugin – this is not a quota in the usual k8s sense.
- Configured via pod annotations; implemented with a Token Bucket Filter.
Some other shared resources
- inodes – ephemeral container storage usually shares a single file system, so inodes are a shared resource.
- dentry cache – a file system cache that stores the relationships between files and the directories that contain them.
Conclusion
- Reduces the influence of containers on each other.
- Provides cluster stability.
- Ensures predictability of container performance.