Containers and Virtualization

November 2018

This post looks at the container and virtualization landscape as of 2018 - a year in which containerization, either directly through Docker or Kubernetes, or indirectly through AWS Lambda and the like, has skyrocketed in popularity.

Containers provide isolation, well-defined communication channels for security, and a convenient way to package an application (i.e., a set of binaries, libraries, and script source code files) along with its dependencies (eg. a set of modules and operating system packages and config files). This makes for a reliable unit that you can count on to work, even years into the future when the software in the container is out of date and unmaintained. The software may be insecure and buggy but at least it can be counted on to work.

Unlike virtual machines, containers do not guarantee is security. In particular, when attempting to run untrusted code, containers typically do not provide security guarantees or on-the fly control of resource utilization. A single exploit could potentially result in a container breakout. An 'untrusted' script for this article is anything from a third party, AWS Lambda or Google Cloud Functions routinely run untrusted code, and do it safely.

Containers also do not provide a good management interface (eg. to change memory limits or number of virtual CPUs) which is crucial for long-running code in containers, eg. servers.

Lightweight VMs attempt to bridge this gap by providing fast and stripped down virtual machines that are easy to use and secure.

Before proceeding, it may be helpful to see a technology review of common virtualization technology. Skip this section and proceed directly to Lightweight VMs if you're already familiar with these terms.

Technology Overview - Containers

Docker

The project that first popularised the container concept.

Kubernetes

Google's container orchestration system to automate the deployment, scaling, and management of containerised applications. This works with Docker contaieners or other containers. Docker Swarm is another example of an orchestration system.

OCI Initiative

The Open Container Initiative is an effort to create open standards for container formats and runtime, and is backed by the Linux Foundation. OCI has produced an image-spec for definining container image formats, as well as a runtime format for defining the runtime. The idea is that any OCI compliant container would work with any OCI compliant runtime.

The runtime spec defines how to run a filesystem bundle unpacked on the disk. It defines the configuration files, root filesystem, and other conventions.

runC

runC is the reference container runtime developed by OCI. This allows for spawning and running containers, and interacting with containers - as long as they are built according to the OCI runtime-spec as described above. runC deals with low level Linux capabilities like cgroups, namespaces, and the like. A runC instance is created to spawn each container, and once spawned it exits. In the case of Docker, containerd-shim handles the actual running of the container once runC has started it.

Note that given an OCI runtime-spec compliant filesystem bundle, runc is all you need to run it - you don't need dockerd, docker, or containerd. Of course, this also means managing the whole system yourself.

Containerd

This is a higher-level daemon that controls runc, and manages the container lifecyle for the host - from image transfer and storage to container execution and storage. It manages networking for containers, pushing and pulling images (but not creating them, that falls to Dockerd), and execution. It exposes gRPC endpoints.

Dockerd

This is the highest level component of the system, as illustrated below. This provides the ux features of Docker. It allows for building images, volume management, and logging.

The relationship between components in the Docker ecosystem

Technology Overview - Virtualization

Hypervisor

This is what creates and runs virtual machines. They can be native (running directly on the host's hardware) or hosted (running on a conventional operating system, eg. VirtualBox/QEMU).

Guest and Host

The guest is the operating system being emulated, and the host is the operating system that provides the virtualization.

KVM

KVM is a hypervisor project implemented as a Linux kernel module. It turns a host Linux system into a native hypervisor.

Xen

A native hypervisor that runs as a host operating system.

QEMU

QEMU is a hosted hypervisor (i.e., unlike a native hypervisor it runs on top of a host operating system). It is a full system emulator that can emulate differnt CPUs and hardware, using binary translation to convert opcodes from the guest to the host. It can also model hardware devices in software to provide a virtual environment. Of course, since this is all done in software, emulation is extremely slow. This makes it possible, for example, to run ARM software on an AMD machine.

However, if KVM is available QEMU can make use of it to run at near native speed.

KVMTool

This is a more lightweight, stripped, and focused qemu alternative that only supports running Linux machines on Linux hosts with both vm and host running the same architecture. It should be more performant than QEMU.

Libguestfs

Libguestfs is a set of tools that allow you to access and modify virtual machine disk images.

Lightweight VMs

Lightweight VMs are stripped down virtual machines with an emphasis on management and speed.

Most lightweight VMs do not have the ease of use of Docker. For example, Docker allows one to:

Easily open a shell into a running container
Edit mounted files from outside the container
Build and develop easily with a Dockerfile
Add in custom components at build time (maybe compile custom software in the Dockerfile)
Easily manage volumes and mounted filesystems

The best supported MicroVMs include:

Firecracker

Firecracker was released as open source by AWS in November 2018, and is what powers AWS Lambda and AWS Fargate. It is a lightweight VM that approaches a container in terms of performance while also providing mechanisms to limit resources and provides better security guarantees.

However, this is considerably complicated to set up and run. It is nothing like the Dockerfile in terms of ease of use, we will likely have to package our own disk image for use with this product, as well as a separate kernel to boot with. Tooling is likely to develop over the coming months.

It provides a management API over REST, allowing you to set memory limits, virtual CPU count, etc. on the fly.

This is a complete solution that works directly with KVM.

This provides excellent security, is very fast, and provides a useful and fine-grained management API. However, it is exceedingly hard to use - without the convenience of something like Dockerd to simplify creating and running images, you are left to manage the filesystem images and kernel yourself - you need an ext4 filesystem image to boot the vm.

Kata Containers

This is about two years old, but has lately become very polished and very compelling. Note that setting this up and running it is a considerable commitment - this is not a simple project to manage and run.

This combines Intel ClearContainers (which is comparable to Docker) with runV (described below).

With all components (kata-agent, kata-runtime, etc.) this is a complete solution that works with KVM or Xen and even Qemu without KVM if necessary.

Kata provides excellent security, is very fast, and works with OCI compliant images like Docker images. However, it is not very easy to use, and does not provide a management API like Firecracker does.

gVisor

gVisor is is a sandbox that runs in userspace. It is a userspace kernel that implements a substantial portion of the Linux system surface. The sandbox sits between the container (eg. Docker) and the host operating system to provide better security. This limits the host kernel surface available to the container while intercepting all system calls made by the container.

This is easy to use, and it works with Docker images. However, becasue of the kernel API reimplementation not all dockerised applications are guaranteed to work, and it has no management API.

runV

Like runC, runV is a container runtime. Unlike runC, runV uses virtualization when running containers. This allows you to run OCI containers (eg. Docker containers) as virtual machines. It requires either QEMU/KVM, Kvmtool/KVM, or Xen.

This provides excellent security, is very fast, and works with Docker images - but it has no management API, and is not well supported (Kata runtime seems to have more interest) and does not seem to work with current versions of the Docker engine.

That concludes the overview of containers and virtualization, and it is likely more good tools will become available in 2019 particularly around Firecracker and Kata, so this will be an interesting space to watch.

Vnaik.com