Understanding Linux Container Scheduling

Docker containers are often thought of as complete shippable software packages that run as lightweight VMs. While this can be a very convenient view, it is important to understand how containers are really implemented in Linux. This article explores how Docker containers are implemented using Linux control groups (cgroups) and how containers can affect the performance of systems like the JVM.

All containers running on a host ultimately share the same kernel and resources. In fact, Docker containers are not even a first-class concept in Linux; a container is just a group of processes that belong to a combination of Linux namespaces and control groups (cgroups). System resources such as CPU, memory, disk, and network bandwidth can be restricted by these cgroups, providing mechanisms for resource isolation. Namespaces are then used to limit a process's visibility into the rest of the system through the ipc, mnt, net, pid, user, cgroup, and uts namespaces. Note that the cgroup namespace only limits a process's view of the cgroup hierarchy; cgroups themselves are not namespaces.

Any process not explicitly assigned to a cgroup is automatically included in the root cgroup. On CentOS the root cgroup and any children are mounted as a mutable filesystem at /sys/fs/cgroup (check with mount if you are on a different Linux distribution). A user with sufficient privileges can easily create cgroups, modify them, or move tasks to them using basic shell commands or the higher-level utilities provided by the libcgroup-tools package. Of particular interest are the cpu and cpuacct cgroup subsystems, which are mounted together at /sys/fs/cgroup/cpu,cpuacct. The symlink /sys/fs/cgroup/cpu can also be used for simplicity. The cpuacct subsystem is very simple, as it solely collects CPU runtime information, while the cpu subsystem schedules CPU access for each cgroup using either the Completely Fair Scheduler (CFS) or the Real-Time Scheduler (RT). We can ignore the real-time scheduler, as it is not used by default on Linux or by Docker.
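Because the hierarchy is an ordinary filesystem, creating a cgroup and attaching a task to it takes little more than a directory creation and a file write. The following is a minimal sketch in Python, assuming a cgroups v1 layout and sufficient privileges; the helper name move_to_cgroup, the group name demo, and the root parameter are illustrative and not part of any library.

```python
import os


def move_to_cgroup(task_id, group, root="/sys/fs/cgroup/cpu"):
    """Create the cgroup `group` under `root` (if needed) and attach a task.

    Assumes cgroups v1, where each cgroup directory exposes a `tasks` file.
    Returns the path of the cgroup directory.
    """
    path = os.path.join(root, group)
    # On a real cgroup filesystem the kernel populates the new directory
    # with control files (tasks, cpu.shares, ...) as soon as it is created.
    os.makedirs(path, exist_ok=True)
    # Writing an ID to `tasks` attaches the thread with that ID; for a
    # single-threaded process this moves the whole process.
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(task_id))
    return path
```

Run as root, a call such as move_to_cgroup(os.getpid(), "demo") would be expected to place the calling process in /sys/fs/cgroup/cpu/demo.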

When we run the Docker container image quay.io/klynch/java-simple-http-server, the Docker daemon creates a container and spawns a single Java process within it. The Docker daemon assigns the container the unique identifier 31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267, which is used later to label and identify the various components that comprise the container. This identifier has no real significance to the kernel, however.

By default, Docker creates a pid namespace for this container, isolating the process from other namespaces; the Java process is attached to this new pid namespace before execution and is assigned PID 1 by the Linux kernel. However, this process is not entirely isolated from other processes on the system. Because PID namespaces are nested, every namespace except for the initial root namespace has a parent namespace. A process running in a namespace can see all processes of child pid namespaces. This means that a process running in the root namespace, such as our shell, can see all processes running on the system, regardless of namespace. In our example, we can see that the java process has the PID 30968. We can also see the cgroups and namespaces our process is assigned to:

 

# cat /proc/30968/cgroup
11:cpuacct,cpu:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
10:net_prio,net_cls:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
9:freezer:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
8:memory:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
7:pids:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
6:perf_event:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
5:devices:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
4:blkio:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
3:cpuset:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
2:hugetlb:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267
1:name=systemd:/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267

# ls -l /proc/30968/ns/*
lrwxrwxrwx 1 root root 0 May  7 14:16 ipc -> ipc:[4026532461]
lrwxrwxrwx 1 root root 0 May  7 14:16 mnt -> mnt:[4026532459]
lrwxrwxrwx 1 root root 0 May  7 15:41 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 May  7 14:16 pid -> pid:[4026532462]
lrwxrwxrwx 1 root root 0 May  7 15:41 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 May  7 14:16 uts -> uts:[4026532460]
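Each line of /proc/<pid>/cgroup has the form hierarchy-id:controllers:path, so the container a process belongs to can be recovered with a few lines of parsing. The sketch below is illustrative only: the function names are made up for this example, and it assumes the /docker/<id> path layout seen in the output above, which may differ under other cgroup drivers.

```python
def parse_proc_cgroup(text):
    """Map each cgroup v1 controller name to the cgroup path of a process.

    `text` is the content of /proc/<pid>/cgroup, one
    `hierarchy-id:controllers:path` entry per line.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so that colons inside the path are preserved.
        _, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            result[controller] = path
    return result


def docker_container_id(text):
    """Return the container ID from the cpu controller path, or None.

    Assumes the `/docker/<id>` layout shown in the transcript above.
    """
    path = parse_proc_cgroup(text).get("cpu", "")
    if path.startswith("/docker/"):
        return path[len("/docker/"):]
    return None
```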

We can also verify the container of a process by grepping for our pid in the file /sys/fs/cgroup/cpu,cpuacct/docker/31dbff344530a41c11875346e3a2eb3c1630466173cf9f1e8bca498182c0c267/tasks or by running the command systemd-cgls and searching for the process in question. However, this does not tell us which PID our process is mapped to inside of the container! That mapping is exposed by the NSpid field of the process status file. Unfortunately, this field was introduced in a kernel patch (merged upstream in Linux 4.1) that has not yet been backported to the CentOS 7.3 kernel. In practice, however, it is usually simple to identify the appropriate process inside of a container. On a kernel that reports the field, the following command shows us that our process maps to PID 1 inside of its namespace.

# grep NSpid /proc/30968/status
NSpid:  30968    1
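The NSpid line lists one PID per nested pid namespace, starting with the outermost. A small sketch of reading it programmatically follows; parse_nspid is a hypothetical helper name, and the function simply returns an empty list on kernels that do not report the field.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first.

    `status_text` is the content of /proc/<pid>/status. Returns an empty
    list if the kernel does not report the NSpid field.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []
```

For the process above, the last element of the returned list would be the PID as seen from inside the container.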

We can verify that the view from inside of our process's namespaces is a little different. We can use the docker exec command to run an interactive shell, provided our container image includes a binary for that shell. For most cases this command is a much simpler solution than the nsenter utility. After running docker exec, we see a shell prompt that shares the same namespaces as our java process, including the pid namespace.

# docker exec -it java-http bash

# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  5.8 18.6 4680796 724080 ?      Ssl  05:10  41:51 java SimpleHTTPServer

Likewise, the cgroup namespace is restricted to the container's cgroups, further isolating our process from other processes running on the system.

# ls -l /sys/fs/cgroup/cpuacct,cpu
-rw-r--r-- 1 root root 0 May  7 05:10 cgroup.clone_children
--w--w--w- 1 root root 0 May  7 05:10 cgroup.event_control
-rw-r--r-- 1 root root 0 May  7 17:04 cgroup.procs
-rw-r--r-- 1 root root 0 May  7 05:10 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 May  7 05:10 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 May  7 05:10 cpu.rt_period_us
-rw-r--r-- 1 root root 0 May  7 05:10 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 May  7 05:10 cpu.shares
-r--r--r-- 1 root root 0 May  7 05:10 cpu.stat
-r--r--r-- 1 root root 0 May  7 05:10 cpuacct.stat
-rw-r--r-- 1 root root 0 May  7 05:10 cpuacct.usage
-r--r--r-- 1 root root 0 May  7 05:10 cpuacct.usage_percpu
-rw-r--r-- 1 root root 0 May  7 05:10 notify_on_release
-rw-r--r-- 1 root root 0 May  7 05:10 tasks

Although cpu and cpuacct provide different capabilities and are implemented by different kernel subsystems, they appear here in a single directory. This is due to a complexity of cgroups that allows a task to be assigned to different groups but permits each subsystem to be attached to only a single hierarchy. Because of the close relationship between the cpu and cpuacct subsystems, they are mounted together in one hierarchy.
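As an illustration of what the cpuacct files are useful for: cpuacct.usage reports the cumulative CPU time consumed by the cgroup in nanoseconds, so two samples taken a known interval apart give the average number of cores in use. The helper below is a sketch with a made-up name, and it takes the two readings as arguments rather than reading the file itself.

```python
def cores_used(usage_start_ns, usage_end_ns, interval_s):
    """Average number of cores used between two cpuacct.usage samples.

    Both samples are cumulative CPU time in nanoseconds; `interval_s` is
    the wall-clock time between them in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (usage_end_ns - usage_start_ns) / 1e9 / interval_s
```

A cgroup that accumulated 2 seconds of CPU time over a 1 second interval was therefore keeping two cores busy on average.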

Scheduling

When we think of containers as lightweight VMs, it is natural to think of resources in discrete terms, such as the number of processors. However, just as a hypervisor must schedule its VMs onto real hardware, the Linux kernel must do the same with processes. Scheduling cgroups in CFS, as with most process schedulers, requires us to think in terms of time slices instead of processor counts. The cpu cgroup subsystem limits how processes are scheduled on the system by the CFS and can be tuned to support relative minimum resources as well as hard ceilings that cap processes from using more resources than provisioned. The two classes of tunables behave differently and may appear confusing at first.

CPU Shares

CPU shares provide tasks in a cgroup with a relative amount of CPU time, providing an opportunity for the tasks to run. The file cpu.shares defines the number of shares allocated to the cgroup. The fraction of CPU time allocated to a given cgroup is its number of shares divided by the total number of shares held by all cgroups at the same level. This proportional allocation is calculated for each level in the cgroup hierarchy. In CentOS, this begins with the root / cgroup with 1024 shares and 100% of CPU resources. The root cgroup is typically limited to a small number of critical userland and kernel processes and the initial systemd process. The rest of the resources are then offered equally amongst the groups /system.slice (system services), /user.slice (user processes), and /docker (Docker containers), each with an equal weight of 1024 by default.

On minimal CentOS installations, we can typically ignore the impact of system services and user processes. This allows the scheduler to offer nearly all of the CPU time to the /docker group, divided in proportion to each container's shares. If there are 3 containers with weights of 2048, 1024, and 1024 on a four-core system, the first cgroup will be allocated the equivalent of 2 cores (2048 of 4096 shares, or half of the CPU time), and the two remaining cgroups will each be given the equivalent of 1 core. If all of the tasks in a cgroup are idle and not waiting to run, any unused shares are placed in a global pool for other tasks to consume. Thus, if there is only a single task in the first cgroup, which can occupy at most one core at a time, the remainder of that cgroup's allocation is placed back into the global pool.
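The proportional arithmetic can be written down in a few lines. This is only a sketch of the calculation for sibling cgroups that are all busy, not of how CFS itself is implemented; the function name share_allocation is made up for the example.

```python
def share_allocation(shares, cores):
    """Core-equivalents each sibling cgroup receives when all are busy.

    `shares` holds the cpu.shares value of every cgroup at one level of
    the hierarchy; `cores` is the number of cores available to that level.
    """
    total = sum(shares)
    if total <= 0:
        raise ValueError("total shares must be positive")
    return [cores * s / total for s in shares]


# The three containers from the text on a four-core host.
print(share_allocation([2048, 1024, 1024], 4))  # [2.0, 1.0, 1.0]
```

When some cgroups are idle, the busy ones can receive more than these figures, since shares only determine the split under contention.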

CPU Quotas

While CPU shares are unable to guarantee a minimum amount of CPU time without complete control of the system, it is much easier to enforce a hard limit on the CPU time allocated to processes. CPU bandwidth control for CFS was introduced to prevent tasks from exceeding the total CPU time allocated to a given cgroup. By default, quotas are disabled for a cgroup, with cpu.cfs_quota_us set to -1. If enabled, CFS quotas permit a group to execute for up to cpu.cfs_quota_us microseconds within each period of cpu.cfs_period_us microseconds (100ms by default). If a group's tasks are unconstrained, they are permitted to use as many unused resources as are available on the host. By adjusting a cgroup's quota relative to the period, we can effectively assign entire cores to a group! A quota of 100ms allows the tasks in that group to run for a total of 100ms during each 100ms window, the equivalent of one core; a quota of 200ms is the equivalent of two cores.
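Expressed as code, the relationship between quota, period, and cores is a single division. The helper below is a hypothetical sketch; its default period assumes the 100ms value mentioned above.

```python
def quota_in_cores(quota_us, period_us=100_000):
    """Core-equivalents permitted by a CFS quota, or None if unlimited.

    `quota_us` and `period_us` correspond to the values of
    cpu.cfs_quota_us and cpu.cfs_period_us, both in microseconds.
    """
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    if quota_us < 0:
        # A negative quota (-1 by default) means no limit is enforced.
        return None
    return quota_us / period_us
```

For example, a quota of 50000 with the default period corresponds to half a core, and 200000 to two cores.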

If two tasks are executing in the same cgroup on different cores, each task's runtime counts against the quota. If the entire quota is exhausted while there are still tasks waiting to execute, the group is throttled until the next period, even if the host has unused processor resources. The number of periods in which a cgroup has been throttled and the accumulated throttled time in nanoseconds are reported in the cpu.stat file as the nr_throttled and throttled_time statistics. Likewise, if enough tasks are left in the waiting state for long enough, we may see the load average increase. Performance tools like Netflix Vector can help easily identify throttled containers. The systemd-cgtop utility can also be used to show how many resources are being consumed by each cgroup.
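A quick way to judge how often a cgroup hits its quota is to compare nr_throttled with nr_periods, a third counter in the cgroups v1 cpu.stat file that records how many enforcement periods have elapsed. The sketch below parses the file's "name value" lines; the function names are illustrative, and the numbers in the usage lines are made-up sample data rather than measurements.

```python
def parse_cpu_stat(text):
    """Parse the `name value` lines of a cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of elapsed periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical sample content, for illustration only.
sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 1500000000\n"
print(throttled_fraction(parse_cpu_stat(sample)))  # 0.25
```

A fraction that stays high over time is a hint that the quota is too small for the workload.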

When scheduling containers using quotas, it is important that the processes are given an appropriate window of time to execute. If a cgroup is consistently being throttled, it is likely not being allocated enough resources. This is particularly true when running complex systems like the JVM, which makes many assumptions about the system it is running on. Because the JVM is still able to see the number of cores on the host, it sizes the number of GC threads according to the number of physical cores, regardless of its quota limit. This can lead to disastrous consequences when running the JVM on a 64-core machine but limiting it to the equivalent of 2 cores, as the many GC threads quickly exhaust the quota and the resulting throttling causes longer than expected application pauses. Additionally, the use of Java 8 parallel stream features can cause similar issues. By sizing the number of threads using the JVM flags -XX:ParallelGCThreads, -XX:ConcGCThreads, and -Djava.util.concurrent.ForkJoinPool.common.parallelism, we can prevent many unnecessary pauses. The Fabric8 Java base images detect the cgroup settings and automatically configure the JVM with appropriate values.
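One possible way to derive values for these flags from the cgroup settings is sketched below. This is not how the Fabric8 images do it; the function name is hypothetical, and using the same thread count for all three flags (the quota rounded up to whole cores, capped at the host core count) is a simplifying assumption rather than a tuning recommendation.

```python
import math


def jvm_thread_flags(quota_us, period_us, host_cores):
    """Build JVM flags that size thread pools to a CFS quota.

    Assumption: every pool gets the same size, namely the quota rounded
    up to whole cores, at least 1 and at most `host_cores`.
    """
    if quota_us < 0:
        # No quota is enforced, so fall back to the host core count.
        n = host_cores
    else:
        n = min(host_cores, max(1, math.ceil(quota_us / period_us)))
    return [
        "-XX:ParallelGCThreads=%d" % n,
        "-XX:ConcGCThreads=%d" % n,
        "-Djava.util.concurrent.ForkJoinPool.common.parallelism=%d" % n,
    ]
```

For the 64-core host limited to the equivalent of 2 cores, jvm_thread_flags(200000, 100000, 64) yields all three flags set to 2.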

Conclusion

In this post we have looked at how a Docker container is assigned resources and scheduled using Linux cgroups. Because resource requirements are highly variable, it is typically not possible to partition resources predictably. However, cgroups allow us to partition the resources sanely and easily schedule our container-based processes using the Completely Fair Scheduler. While this post focused on cgroups v1, it is important to know that this will be changing in the future. A second version of cgroups was introduced in the 4.5 kernel to simplify the complexities of the first version. Notably, replacing the multiple hierarchies with a single unified hierarchy introduces a model that is simpler to implement and understand. The scheduling features, however, are still being worked out and will likely not be introduced into the RHEL kernel for a very long time.