Sunday, May 22, 2016

Practice Guide for LXD - Canonical's OpenSource Container HyperVisor [Part-II]

An LXD container on a single host is just like "chroot on steroids". LXD's main goal is to provide an experience similar to virtual machines and hypervisors, minus the hardware virtualization. My previous post "Layman's Guide for LXD - Canonical's OpenSource Container Hypervisor [Part-I]" provides an introduction to containers and LXD.

This article focuses on implementing and practising various tasks related to LXD. Beginners who are unfamiliar with containers can refer to my previous post. Those who already have a rough idea of containers and LXD will find this post helpful for learning the configuration and management of LXD.


The main components of LXD are listed below.

  • lxd (system-wide daemon)
  • lxc (command line client)
  • nova-compute-lxd (OpenStack Nova plugin)
  • rootfs (root filesystem)
  • lxcfs (FUSE filesystem)


To understand how containers can fit your requirements, you first need to understand how they work.

A container is basically a Linux process that thinks it is the only process running and knows only what it has been told about. When assigned an IP address it behaves like an independent, isolated virtual machine running on the host kernel, identifiable across the network. A running container shares host resources in a dynamic and cooperative way.

LXD containers, like any other containers, share the host kernel. What makes LXD an OS container is the /sbin/init process - "the mother of all processes". When an LXD container starts, the /sbin/init process inside it starts and spawns the other processes in the container's process space. This makes the LXD container look like an isolated machine, while all system calls and device access are still handled by the host kernel.


Among all the features of LXD listed in my previous post, I shall briefly explain some of them here.


A container snapshot is similar to a container, except that it cannot be modified; it can only be renamed, restored or destroyed. Snapshots combined with CRIU allow us to take stateful snapshots of running containers and roll back the CPU / memory state to the point when the snapshot was taken.


Containers are created from images. Each image contains a copy of a Linux root filesystem, which is what makes image-based containers so lightweight and compact. LXD uses these pre-built filesystem images from remote image servers to create containers. Remote images are cached locally by the LXD daemon and expire after a default period of 10 days if unused; cached images can also be kept in sync with the remote servers automatically.

By default, LXD comes pre-configured with three remote image servers that can be accessed over the internet.
  • ubuntu: provides stable Ubuntu images.
  • ubuntu-daily: provides daily builds of Ubuntu.
  • images: provides images for a number of Linux distributions built using upstream LXC templates. This is a community-run image server.
The "images:" remote server uses LXC protocol whereas ubuntu: and ubuntu-daily: remote image servers use simplestreams protocol. A simplestreams protocol is read-only, image-only protocol to get image information and import images from public image servers. It uses JSON to describe a list of products and files related to those products. It is used by various other tools like Openstack, Juju, MaaS to find, download and mirror system images.


Profiles are just like templates in which we can define container configuration keys and devices. These configurations can then be applied to any number of containers at a time. Multiple profiles can be applied to a single container; when the same key is defined in more than one profile, the profile applied last overrides the earlier ones.

By default, LXD comes pre-configured with two profiles:
  • default: This profile is automatically applied to all freshly created containers, unless an alternate profile or list of profiles is provided by the user. This profile defines the eth0 network device for the container.
  • docker: This profile enables container nesting, adds a few devices and loads the kernel modules required to run Docker containers on top of LXD (see the launch example below).
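
For example, a container can be created with both profiles applied at launch time (a minimal sketch; the container name is arbitrary):
$ lxc launch ubuntu:16.04 docker-host -p default -p docker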

Remote Mechanism

The remote mechanism lets the command line client interact with remote image servers and remote LXD servers, for example to copy or move containers between multiple LXD hosts.
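
As an illustration, once a second LXD server has been added as a remote (as shown later in this post) and the source container is stopped, it can be copied across with a command like the following (the remote name lxdserver1 is assumed to be already configured):
$ lxc copy container1 lxdserver1:container1-copy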

LXD command line client comes with following pre-configured remotes:
  • local: This is the default remote. It talks to the LXD daemon over a UNIX socket.
  • ubuntu: Image server that provides stable Ubuntu images.
  • ubuntu-daily: Image server that provides daily builds of Ubuntu.
  • images: A community-run public image server for use by LXC and LXD.


What makes LXD user-friendly is its use of a new configuration language that hides the low-level work behind common administrative tasks - for example, attaching a host device to a container without having to look up its major and minor numbers. LXD was also designed to be as secure as possible while allowing multiple Linux distributions to run inside it. The main security features provided by the LXC library are as follows (a small configuration example follows the list):
  • Kernel namespaces: Provide the isolation mechanism for containers.
  • Seccomp: Provides a filtering mechanism to block potentially dangerous system calls.
  • AppArmor: Provides restrictions on cross-container communication such as mounts, file access, sockets and ptrace.
  • Capabilities: Prevent containers from performing privileged operations such as loading kernel modules or modifying the system time.
  • CGroups: Provide resource usage restrictions.
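
By default LXD containers are unprivileged. If a fully trusted workload really needs it, a container can be switched to privileged mode with the security.privileged key (shown here only as an example; unprivileged containers remain the recommended default):
$ lxc config set container1 security.privileged true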

REST API

LXD 2.0 ships with the stable 1.0 API. The REST API is the only communication channel between the LXD client and the daemon.
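
For example, assuming the Ubuntu package layout (socket at /var/lib/lxd/unix.socket) and a curl build with UNIX socket support (curl 7.40 or newer), the API can be queried directly:
$ curl --unix-socket /var/lib/lxd/unix.socket lxd/1.0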

LXD Project API Clients

Third-Party API Clients

Command Line Client

The LXD command line client provides a user-friendly interface to manage and use LXD containers. However, managing thousands of containers across multiple LXD servers with the command line client alone is impractical. For such cases OpenStack provides the "nova-lxd" plugin, which manages LXD containers just like any other virtual machine.


Installing LXD is pretty simple on Ubuntu. I am using Ubuntu 16.04 Desktop for LXD installation and testing. Ubuntu 16.04 Server comes pre-installed with LXD 2.0 and Docker 1.10, and both can also be installed very easily on desktops.

Installing LXD on Ubuntu 16.04 Desktop

$ sudo apt-get install lxd lxd-* criu python-setuptools zfsutils-linux bridge-utils

The above command installs the lxd, lxd-client and lxd-tools packages (along with CRIU, ZFS and bridge utilities). One can verify that the packages are installed by running the command below.

$ sudo dpkg -l | grep lxd

Before the installation finishes, it displays a message that a default LXD bridge "lxdbr0" is shipped, unconfigured, with the LXD package. The lxdbr0 bridge is used to route traffic to the containers, either through proxied HTTP or SSH, for user convenience. Earlier versions of LXD used lxcbr0 as the default bridge, but LXD is gradually dropping its dependency on LXC. The command below can be used to configure the lxdbr0 bridge explicitly.
$ sudo dpkg-reconfigure -p medium lxd


The command line tool can be used to configure all of LXD and its components, such as the storage backend and the bridge. The "lxd init" tool interactively configures the various LXD components and their functions.
$ sudo lxd init

Canonical recommends using the ZFS filesystem for LXD. ZFS supports advanced features like per-container disk quotas, snapshot / restore, live migration and instant container creation. LXD also supports other backends such as btrfs, LVM and a simple directory.
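
If ZFS was chosen during "lxd init", the pool and the datasets LXD creates for containers can be inspected with the standard ZFS tooling (assuming zfsutils-linux is installed, as in the apt command above):
$ sudo zfs list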


LXD provides a very user-friendly command line interface to manage containers. One can perform activities like create, delete, copy, restart, snapshot and restore, among many others.

Creating a container with the command shown below is very easy; it creates a container from the best supported Ubuntu image on the ubuntu: image server, gives it a random name and starts it.
$ lxc launch ubuntu:

Creating a container using the latest stable image of Ubuntu 12.04, with a random name, and starting it.
$ lxc launch ubuntu:12.04

Creating a container using the latest stable image of Ubuntu 16.04, naming it "container0" and starting it.
$ lxc launch ubuntu:16.04 container0

To create a container using the CentOS 7 64-bit image, name it "container1" and start it, we first have to search the "images:" remote image server and copy the required alias name.
$ lxc image list images: | grep centos | grep amd
$ lxc launch images:centos/7/amd64 container1

Creating a container using the OpenSuSE 13.2 64-bit image, naming it "container2", without starting it.
$ lxc init images:opensuse/13.2/amd64 container2

Remote image server "ubuntu-daily" can be used to create a container using latest 
development release of Ubuntu.
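
A sketch, assuming the default ubuntu-daily: remote; here the release is given as the alias and the container name is arbitrary:
$ lxc launch ubuntu-daily:16.04 xenial-daily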

Listing containers
$ lxc list

Query detailed information of a particular container
$ lxc info container1

Start, stop, stop forcibly and restart containers
$ lxc start container1
$ lxc stop container1
$ lxc stop container1 --force
$ lxc restart container1

Stateful stop
Containers start from scratch after a reboot. To preserve the running state across reboots, a container needs to be stopped statefully. With the help of CRIU, the container's runtime state is written to disk before shutting it down; the next time the container starts, it restores the state previously written to disk.
$ lxc stop container1 --stateful

Pause containers
Paused containers do not get CPU time but are still visible and continue to use memory.
$ lxc pause container1

Deletion and forceful deletion of containers
$ lxc delete container1
$ lxc delete container1 --force

Renaming Containers
Just like the Linux move command renames a file or directory, containers can also be renamed. A running container cannot be renamed. Renaming a container does not change its MAC address.
$ lxc move container1 new-container

Configuring Containers

Container settings, such as startup behaviour, resource limits and device pass-through options, can be altered on live containers. LXD supports various device types: disk devices (physical disk, partition, block/character device), network devices (physical interface, bridged, macvlan, p2p) and none. The none type is used to stop inheritance of devices from profiles.
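
For example, a startup-related key can be set on a live container; the boot.autostart key below marks the container to start automatically when the LXD daemon starts (a minimal sketch):
$ lxc config set container1 boot.autostart true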


Profiles store container configuration. Any number of profiles can be applied to a container, but they are applied in the order they are specified, so the last profile always overrides the previous ones. By default, LXD is preconfigured with the "default" profile, which comes with one network device connected to LXD's default bridge "lxdbr0". Any newly created container has the "default" profile applied.

Listing profiles
$ lxc profile list

Viewing default profile content
$ lxc profile show default

Editing default profile
$ lxc profile edit default

Applying a list of profiles to a container
$ lxc profile apply container1 <profile1>,<profile2>,<profile3>...

Editing the configuration of a single container
$ lxc config edit container1

Adding a network device to container1
$ lxc config device add container1 eth1 nic nictype=bridged parent=lxdbr0

Listing the device configuration of container1
$ lxc config device list container1

Viewing container1 configuration
$ lxc config show container1

Listed above are a few examples of basic commands. There are many more options that can be used with these commands; a complete list of configuration parameters is available in the LXD configuration documentation.

Executing Commands

Commands executed through LXD always run as the container's root user.

Getting a shell inside the container
$ lxc exec container1 bash
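
A one-off command can also be run without opening a shell; everything after "--" is executed inside the container (a simple example, assuming container1 is the CentOS container created earlier):
$ lxc exec container1 -- cat /etc/redhat-release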

File Transfers

LXD can read from and write to the container's filesystem directly.

Pulling a file from container1
$ lxc file pull container1 /etc/redhat-release ~

Reading a file from container1
$ lxc file pull container1 /etc/redhat-release -

Pushing a file to container1
$ lxc file push /etc/myfile container1/

Editing a file on container1
$ lxc file edit container1/etc/hosts


Snapshots preserve the point-in-time state of a container, including the container's filesystem, devices and configuration and, if the --stateful flag is used, its running state. A stateful snapshot can only be taken of a running container, whereas a stateless snapshot can also be taken of a stopped container.

Creating container1 stateless snapshot
$ lxc snapshot container1

Creating container1 stateful snapshot with name c1s1
$ lxc snapshot container1 --stateful c1s1

Listing snapshots
The number of snapshots created per container can be listed using the command below.
$ lxc list

Detailed snapshot information for container1, such as snapshot names and whether they are stateless or stateful, can be obtained by executing the command below.
$ lxc info container1

Restoring snapshot
$ lxc restore container1 c1s1

Renaming snapshot
$ lxc move container1/c1s1 container1/c1s1-new

Creating a container using snapshot
$ lxc copy container1/c1s1 container-c1s1

Deleting snapshot
$ lxc delete container1/c1s1


Cloning or copying a container is a much faster way to create containers if the requirement permits. Cloning a container resets the MAC address of the clone and does not copy the snapshots of the parent container.
$ lxc copy container1 container1-copy


LXD allows dynamic and efficient management of resources, like setting memory quotas, limiting CPU, setting I/O priorities and limiting disk usage. Resource allocation can be done per container as well as globally through profiles. All limits can be changed on live containers, where they take effect immediately. In the example below, the first command sets a limit on a per-container basis, whereas the second sets the limit globally using a profile.
$ lxc config set <container> <key> <value>
$ lxc profile set <profile> <key> <value>

Disk Limits

Unlike virtual machines, containers don't reserve resources; they only allow us to limit them. Currently, disk limits can be implemented only if the ZFS or btrfs filesystem is in use.

CPU Limits

CPU limits can be configured in the following ways.
  • Limiting the number of CPUs: Assigning a particular number of CPUs restricts the container to that many cores and no more. LXD load-balances the workload among the physical cores as containers start and stop. For example, we can allow the container below to use only 2 cores and LXD will pick and rebalance which physical cores it runs on as required.
        $ lxc config set container1 limits.cpu 2
  • Limiting to a particular set of CPUs: Pinning the container to specific cores; load-balancing does not happen here. For example, we can allow only cores 1 to 4 to be used by the container.
        $ lxc config set container1 limits.cpu 1,2,3,4

        Pinning CPU core ranges
        $ lxc config set container1 limits.cpu 0-2,7,8
  • Limiting the CPU usage percentage: Containers can be limited to a particular percentage of CPU time when the system is under load, even though they can see all the cores. For example, a container can run freely when the system is not busy, but LXD can limit its CPU usage to 40% when several containers are competing for CPU.
        $ lxc config set container1 limits.cpu.allowance 40%
  • Limiting CPU time: The container can be limited to a fixed slice of CPU time even when the system is idle, while still seeing all the cores. For example, we can limit the container to only 50ms out of every 200ms interval of CPU time.
        $ lxc config set container1 limits.cpu.allowance 50ms/200ms

The first two properties can be combined with the last two to achieve more complex CPU resource allocation; for example, LXD makes it possible to give a container 4 cores while also capping it at 50ms of CPU time per interval. We can also prioritize usage when containers compete for the same resource. The example below sets a priority of 50; a value of 0 gives the container the lowest priority of all.
$ lxc config set container1 limits.cpu.priority 50

The command below helps to verify the parameters set above.
$ lxc exec container1 -- cat /proc/cpuinfo | grep ^process

Memory Limits

LXD can also limit memory usage in various ways that are pretty simple to use.
  • Limiting memory to a particular amount of RAM. For example, limiting the container to 512MB of RAM.
        $ lxc config set container1 limits.memory 512MB
  • Limiting memory to a percentage of total RAM. For example, limiting the container to 40% of the total RAM.
        $ lxc config set container1 limits.memory 40%
  • Limiting swap usage: A container can be configured to turn swap usage on / off, and can also be told to swap its memory out to disk first, on a priority basis. By default swap is enabled for all containers.
        $ lxc config set container1 limits.memory.swap false
  • Setting soft limits: Memory limits are hard by default. With a soft limit, a container can use as much memory as it wants as long as the system has memory to spare; as soon as something more important needs that memory, the container cannot allocate anything beyond its soft limit.
        $ lxc config set container1 limits.memory.enforce soft

Network I/O Limits

There are two types of network limits that can be applied to containers.

  • Network interface limits: The "bridged" and "p2p" interface types can be given maximum bit/s limits.
        $ lxc config device set container1 eth0 limits.ingress 100Mbit
        $ lxc config device set container1 eth0 limits.egress 100Mbit

  • Global network limits: Sets a priority used to decide which container gets the bandwidth when the network interface the containers share is saturated with traffic.
        $ lxc config set container1 limits.network.priority 50

Block I/O Limits

Either the ZFS or btrfs filesystem is required to set disk size limits.
$ lxc config device set container1 root size 30GB

Limiting the root device read / write speed
$ lxc config device set container1 root limits.read 40MB
$ lxc config device set container1 root limits.write 20MB

Limiting root device read / write IOps
$ lxc config device set container1 root limits.read 30Iops
$ lxc config device set container1 root limits.write 20Iops

Assigning a priority to container1 for disk I/O when under load
$ lxc config set container1 limits.disk.priority 50

To monitor the current resource usage (memory, disk & network) by container1
$ lxc info container1

Sharing a directory on the host machine with a container
$ lxc config device add container1 shared-path disk source=<source-directory-on-host> path=<destination-directory-in-container>


By default, LXD does not listen on the network. To make it listen on the network, the following parameters can be set:
$ lxc config set core.https_address [::]
$ lxc config set core.trust_password <some-password>

The first parameter tells LXD to bind all addresses on port 8443. The second creates a trust password used when contacting that server remotely. These are set to allow communication between multiple LXD hosts. Any other LXD host can then add this LXD server using the command below.
$ lxc remote add lxdserver1 <IP-Address>

Doing so will prompt for the password that we set earlier. One can now communicate with the LXD server and access its containers. For example, the command below will update the OS in container "container1" on the LXD server "lxdserver1".
$ lxc exec lxdserver1:container1 -- yum update

Proxy Configuration
Setups that require an HTTP(S) proxy to reach the outside world can set the configuration below.
$ lxc config set core.proxy_http <proxy-address>
$ lxc config set core.proxy_https <proxy-address>
$ lxc config set core.proxy_ignore_hosts <local-image-server>

Any communication initiated by LXD will use the proxy server except for the local image server.


When a container is created from a remote image, LXD downloads the image (by its full hash, short hash or alias) into its image store, marks it as cached and records its origin.

Importing Images

From Remote Image Servers to Local Image Store

LXD can simply cache an image locally by copying the remote image into the local image store, without creating a container from it. The example below copies the Ubuntu 14.04 image into the local image store.
$ lxc image copy ubuntu:14.04 local:

We can also set an alias for the fingerprint that will be generated for the new image. Specifying an alias is an easy way to refer to the image later.
$ lxc image copy ubuntu:14.04 local: --alias ubuntu1404

It is also possible to copy the aliases that are already set on the remote image server. LXD can keep the local copy updated, just like cached images, if the "--auto-update" flag is specified while importing the image.
$ lxc image copy images:centos/6/amd64 local: --copy-aliases --auto-update

Later we can create a container using these local images.
$ lxc launch centos/6/amd64 c2-centos6

From Tarballs to Local Image Store

Alternatively, containers can also be created from images that are built as tarballs. Two tarballs are needed: an LXD metadata tarball and a root filesystem tarball. The example below imports an image using both tarballs and assigns the alias "imported-ubuntu".
$ lxc image import meta.tar.xz rootfs.tar.xz --alias imported-ubuntu

From URL to Local Image Store

LXD also allows importing images from a web server in order to create containers. The image is pulled using its URL and stored in the local image store.
$ lxc image import <LXD-image-URL> --alias opensuse132-amd64

Exporting Images

Images stored in the local image store can also be exported as tarballs.
$ lxc image export <fingerprint / alias>

Exporting an image creates two tarballs: a metadata tarball containing the metadata bits that LXD uses, and a filesystem tarball containing the root filesystem used to bootstrap new containers.

Creating & Publishing Images

  • Creating Images using Containers
        To create an image from a container, stop the container and publish its image to the local image store; new containers can then be created from the published image.
$ lxc publish container1 --alias new-c1s1

A snapshot of a container can also be used to create images.
$ lxc publish container1/c1s1 --alias new-snap-image
  • Creating Images Manually
        1. Generate the container filesystem for ubuntu using debootstrap.
        2. Make a compressed tarball of the generated filesystem.
        3. Write a metadata yaml file for the container.
        4. Make a tarball of metadata.yaml file.

Sample metadata.yaml file (the template target paths shown here are typical examples)

architecture: "i686"
creation_date: 1458040200
properties:
 architecture: "i686"
 description: "Ubuntu 12.04 LTS server (20160315)"
 os: "ubuntu"
 release: "precise"
templates:
 /var/lib/cloud/seed/nocloud-net/meta-data:
  when:
   - start
  template: cloud-init-meta.tpl
 /var/lib/cloud/seed/nocloud-net/user-data:
  when:
   - start
  template: cloud-init-user.tpl
  properties:
   default: |
    #cloud-config
    {}
 /var/lib/cloud/seed/nocloud-net/vendor-data:
  when:
   - start
  template: cloud-init-vendor.tpl
  properties:
   default: |
    #cloud-config
    {}
 /etc/init/console.override:
  when:
   - create
  template: upstart-override.tpl
 /etc/init/tty1.override:
  when:
   - create
  template: upstart-override.tpl
 /etc/init/tty2.override:
  when:
   - create
  template: upstart-override.tpl
 /etc/init/tty3.override:
  when:
   - create
  template: upstart-override.tpl
 /etc/init/tty4.override:
  when:
   - create
  template: upstart-override.tpl

          5. Import both tarballs as a single LXD image.
            $ lxc image import <meta.tar.gz> <rootfs.tar.gz> --alias imported-container

LXD is very likely going to deprecate the "lxd-images import" functionality, since the image servers handle this task much more efficiently.

By default, every LXD daemon also acts as an image server, and every created image is private, i.e. only trusted clients can pull those private images. To serve public images, the LXD server must be listening on the network. Below are the steps to make an LXD server listen on the network and serve as a public image server.
1. Bind all addresses on port 8443 to enable remote connections to the LXD daemon.
    $ lxc config set core.https_address "[::]:8443"

2. Add public image server in the client machines.
    $ lxc remote add <public-image-server> <IP-Address> --public

Adding a remote server as public provides an authentication-less connection between client and server. Still, images marked as private on the public image server cannot be accessed by the client. Images can be marked as public / private using the "lxc image edit" command described below.

Listing available remote image servers
$ lxc remote list

List images in images: remote server
$ lxc image list images:

List images using filters
$ lxc image list amd64
$ lxc image list os=ubuntu

Get detailed information of an image
$ lxc image info ubuntu

Editing images
We can edit image properties such as the auto-update and public flags.
$ lxc image edit <fingerprint / alias>

Deleting images
$ lxc image delete <fingerprint / alias>

By default, a cached image is removed automatically after 10 days of not being used. Also, every 6 hours the LXD daemon checks for new updates and automatically updates the images that are cached locally. The commands below, with the relevant parameters, help to configure and tune these defaults.

$ lxc config set images.remote_cache_expiry <no-of-days>
$ lxc config set images.auto_update_interval <no-of-hours>
$ lxc config set images.auto_update_cached <false>

The last parameter, when set to false, restricts automatic updates to only those images that were imported with the "--auto-update" flag, rather than all cached images.


Suppose you are a developer working on a project: you have built some shiny new application / stack / language that you do not fully trust yet, or you want to run a database that you do not want installed on your system and running all the time, completely isolated on your development server and active only while you work on the project. In multi-container environments LXD also serves other purposes, such as load balancing. LXD is surely the best bet.


LXD as an OS container, with its unique features like complete isolation, light weight, fast deployment and a very user-friendly interface, is very cool and will surely make you give it a try.


Stephane Graber's blog post series on LXD 2.0
Ubuntu's official documentation on LXD


Saturday, May 07, 2016

Layman's Guide for LXD - Canonical's OpenSource Container HyperVisor [Part-I]


Certainly, "container" is the new buzz word among techies. Everyone is talking about Docker, LXC and LXD. The continous need to reduce costs, optimize performance, as well as maintain the data availability combined with data integrity has been the most prominent need for most of the organizations leading to convergence of various virtualization concepts to develop an efficient model called, "Containerization". Containerization of resources helps in utilizing the resources more efficiently than the other virtualization techniques like hypervisors. Even though there are a few limitations to it, but still there is a wide spread acceptance in production environments too. Let us see why and how this is achieved.

What are Hypervisors?

Hypervisor-based virtualization technologies have been around for a long time. Today they can be installed and deployed on any laptop or server, helping to reduce costs and optimize performance wherever required. A hypervisor, or "virtual machine manager", is software that uses hardware virtualization to emulate computing environments for different operating systems sharing a single hardware machine. It is like a thin layer between the hardware and the operating systems, used mainly to create virtual machines. Each guest operating system installed on it is assigned a part of the host's processor, memory, storage and network, as allocated by the user. Hypervisors provide a full virtualization mechanism that emulates the hardware in such a way that we can run any operating system on top of any other. Each virtual machine has its own kernel, and the resources allocated to it are statically fixed. This provides a high level of isolation between the host and guest machines.

Types of Hypervisors

There are mainly two types of hypervisors.

1. Bare metal Hypervisor or native - Type1

These hypervisors run directly on the host's hardware to manage the guest operating systems and to control the hardware, which is why they are commonly referred to as bare metal hypervisors.

Example: VMware ESXi, vSphere Hypervisor, Microsoft Hyper-V, Citrix XenServer, Oracle VM, KVM, IBM z/VM

2. OS based or Embedded or hosted - Type2

These hypervisors run like any other program on the host operating system. They abstract the guest operating system from the host operating system. Example: VMware Workstation, VMware Player, Oracle VirtualBox, KVM, QEMU, Parallels Workstation (discontinued).

Note: KVM is categorized as both Type1 and Type2 Hypervisor.

What is a container?

A container is another method of virtualization in which the kernel of an operating system allows multiple isolated user-space instances. These instances are also called software containers, lightervisors, virtualization engines, jails or zones, depending on the operating system vendor. Containers are based on the shared operating system concept: instead of virtualizing the hardware, they run on top of a single Linux instance on the same hardware.
The operating system's kernel provides resource management with the help of namespaces combined with chroots, to limit the impact of one container's activities on another and to facilitate complete isolation of the environments. This isolation mechanism helps to ensure security and hardware independence.
Containers use the operating system's normal system call interface, so there is no need to run inside an intermediate virtual machine. They are skinnier, more lightweight and more portable than virtual machines. The only limitation of this technique is that it cannot host a guest operating system different from the host operating system: a container on a Linux host can only run Linux guests, not Windows, although the guest distribution may differ from the host's. Containers are believed to run at close to bare metal speeds, and a single host can theoretically run 6000 containers and 12000 bind mounts of root filesystem directories. A container can even be configured to act as a router.

Example: LXC, LXD, Docker, rkt, OpenVZ, Virtuozzo, Spoon, Solaris Containers/Zones, AIX WPARs, HP-UX Containers, FreeBSD Jails.

Difference between container and hypervisor

Hypervisors are usually resource hungry because they emulate virtual hardware, which makes them slow and incurs significant performance overhead. Due to this higher resource consumption they can host only a limited number of virtual machines, and each virtual machine requires a separate, independent kernel instance to run. Containers, on the other hand, do not emulate hardware; they are deployed from the host operating system and share the host OS kernel. This makes them faster, with quicker startup and shutdown. The enhanced sharing makes containers leaner, more lightweight and smaller than hypervisor guests, because the kernel sees containers simply as resources to be managed.

For example, when container1 and container2 open the same file, the host kernel opens the file and puts the pages into the kernel page cache; these pages are then handed to both containers. In the case of VMs, the pages are first created and cached in the host kernel, and then the same process is repeated in both the VM1 and VM2 kernels. Because hypervisors cannot share pages the way containers can, there end up being three separate copies: one each in the page cache of the host, VM1 and VM2.
This example shows how the advanced sharing in containers lets them consume fewer resources and run in far greater numbers than hypervisor guests.

Types of Containers

Containers can be categorized into two types depending on the use case, as follows.

1. Full System Containers

Full system containers, or OS containers, share the kernel of the host operating system but provide user-space isolation, i.e. the host partitions its resources into isolated user-space instances. OS containers are comparable to hypervisor guests or virtual machines: we can install different applications and libraries just like on any operating system running in a virtual machine. A full system container runs an init process and therefore supports multiple processes and services, just like any Linux OS. Using full system containers we can assign static and routable IPs, use multiple network devices, edit the /etc/hosts file - basically, OS containers can do almost anything a virtual machine can.

Example: LXC/LXD, OpenVZ, Oracle Solaris Zones, FreeBSD Jails

2. Application Containers

Application containers also share the host operating system kernel, just like OS containers, but they are designed to run a single process (or process tree), i.e. one application. Application containers do not fit all use cases: they provide limited control over services and configuration files, so the admin tasks needed to guarantee SLAs - logging, SSH, cron, networking activities like setting a static IP, modifying system files such as /etc/hosts, and monitoring with the usual system tools - are complicated to perform and a nightmare to manage. For example, to build a LAMP stack we would need three containers that consume services from each other: an Apache container, a MySQL container and a PHP container. These containers are ephemeral, stateless and minimal; the idea behind application containers is to reduce a container as much as possible to a single process that can be efficiently managed by Docker.

Example: Docker, rkt

What is LXD?

LXD, "Linux Container Daemon" pronounced "Lex-Dee", is a container based hypervisor sponsored by Canonical. It is a full system container just like VMware Workstation/ESX and VirtualBox, that runs a full Linux distribution and is built on top of LXC. LXD uses LXC API in the background for container management and REST API on top to provide a friendlier user interface. Hence, LXD is just a value-added extension and successor to LXC. Due to it's unique sharing features and being lightweight, it is also called "lightervisor" - world's fastest hypervisor.
Both LXC and LXD are developed by Stephane GraberSerge Hallyn from Ubuntu and Canonical. LXC was initially released on 6th August, 2008. Version 1.0 of LXC was the first stable version released on 20th February, 2014 which has a LTS support. It is actively developed at released Ubuntu 16.04 comes with inbuilt LXD 2.0 which is an Apache2 Licensed opensource project written in Go Language and has a Long Term Support release with 5 years support commitment from upstream, ending on 1st June, 2021. There is no LXD 1.0 version release because LXD was a successor to LXC 1.0, and hence LXD 2.0 was released.

LXD Features

1. Secure
LXD facilitates the use of unprivileged containers, in which the root user inside the container is mapped to an unprivileged user on the host, giving better security for multi-tenant workloads. It also supports resource restrictions.
2. Scalable
Containers scale well: they can be deployed on a single laptop or across a large number of compute nodes.
3. Interactive
LXD interacts with the user through its REST API and a very user-friendly command line interface.
4. Live Migration
LXD also supports live migration of containers - the ability to move a running container from one host to another without shutting it down.
5. CRIU (Checkpoint/Restore In Userspace)
With CRIU, LXD supports checkpointing a container at a particular point in time and restoring it later from that point. Online (stateful) snapshots have also been introduced in the latest version.

Difference Between LXD and Docker

1. LXD is an OS container (system container) whereas Docker is an application container.
2. LXD uses liblxc as its principal runtime, whereas Docker has recently started using its own libcontainer. Earlier, Docker also used liblxc.
3. LXD wraps up a full userspace image, as opposed to Docker, which holds an application in a self-contained filesystem that is ephemeral and stateless.
4. LXD runs an init process that is the parent of all processes and services, whereas Docker runs a single process per application in the container.
5. LXD specializes in deploying full-system containers that behave like virtual machines, whereas Docker specializes in deploying applications.
6. Administrative activities can be easily managed in LXD, whereas with Docker it can be complicated to configure, tune and monitor the system.
7. LXD works just like any other hypervisor, only with a shared kernel, whereas in Docker only the outer layer is writable and all the inner layers are read-only - it can best be compared to an onion.

Lightervisor and Hypervisors Coexistence

Yes. Since containers cannot run Windows on top of a Linux host, hypervisors make sure that Windows can still run alongside. On the other side, Docker can scale applications out massively while LXD carries standard Linux workloads. Hence it is possible for a hypervisor guest to contain LXD, and for LXD to host multiple Docker containers simultaneously.

The future is evolving into a world where various virtualization technologies converge, and as containers lead the virtualization space, LXD is steadily attracting users.

LXD is "FREE" and that is the power of opensource innovation.

Friday, April 22, 2016

Life Cycle of MapReduce Job

Here, I will explain what happens behind the scenes of the job execution process in Hadoop MapReduce, or MRv1 (MapReduce version 1), from the time a user fires a job to the time the job is executed on the slave nodes.

MapReduce is a "programming model/software framework" designed to process large amount of data in parallel by dividing the job into a number of independent data local tasks. The term data locality is one of the most important concepts of HDFS/MapReduce, since it helps in drastically reducing the network usage. Data locality means "bringing the compute to data" or moving the algorithm to the datanodes for data processing. It is very cost effective, rather than moving data to the algorithm which is generally found in traditional HPC clusters.

Components of Hadoop MapReduce

1. Client: Acts as the user interface for submitting jobs and collects various status information.
2. Jobtracker: Responsible for scheduling jobs, dividing each job into map and reduce tasks, distributing them to datanodes, task failure recovery, monitoring jobs and tracking job status.
3. Tasktracker: Runs the map and reduce tasks and manages the intermediate outputs.

MapReduce Job Life Cycle


Generally a MapReduce program executes in three stages: the map stage, the shuffle stage and the reduce stage. The first phase of MapReduce is called mapping. A MapReduce job is submitted to the jobtracker by a user sitting on a client machine. The job contains the job configuration, which specifies the map, combine and reduce functions, as well as the input split information and the output directory path.
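
For illustration, a job submission from the client typically looks like the following sketch (the examples jar, HDFS paths and property value here are placeholders, not taken from this article):
$ hadoop jar hadoop-examples.jar wordcount -D mapred.reduce.tasks=4 /user/hadoop/input /user/hadoop/output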

The InputFormat class calls the getSplits() function to compute the input splits. The splits are computed from the input files that the user has loaded into HDFS; an ideal input split size is one filesystem block. The input split information is then retrieved by the jobscheduler, which selects the input data from HDFS for the map function with the help of the InputFormat class.

The tasktrackers on the datanodes periodically communicate with the jobtracker using heartbeat signals to convey their availability. The jobscheduler uses data locality and rack-awareness, letting the jobtracker assign map tasks to the nearest available tasktrackers through their heartbeat return values. If a datanode fails, the tasks are assigned to the nearest datanode holding a replica of the input split. This intelligent placement of data blocks, and processing them according to the availability and proximity of datanodes/tasktrackers, is achieved by Hadoop's own techniques - data locality and rack awareness - making HDFS/MapReduce unique in its own kind.

The map tasks run on the tasktrackers and datanodes assigned to them, and their outputs are written to local disks. Sort and shuffle are then performed on the output data in order to transfer the map outputs to the respective reducers as input; this is known as the shuffle/sort stage. In this phase the intermediate key/value pairs are exchanged between datanodes so that all values with the same key are sent to a single reducer.

In the reduce phase, the shuffled and sorted output is provided as input to the reduce tasks. The reduce function is invoked on each key to produce the final, sorted output. The output from each reducer is written to a separate file (part-00000, part-00001 and so on) in HDFS. No two map or reduce tasks communicate with each other. In a typical MapReduce program, roughly 20% of the work is done by the mappers in the map phase, whereas the other 80% is done by the reducers in the reduce phase.

1. The client prepares the job for submission and hands it off to the jobtracker.
2. Jobtracker schedules the job and tasktrackers are assigned map tasks.
3. Each tasktracker runs map tasks and updates the progress status of the tasks to the jobtracker periodically.
4. Jobtracker assigns reduce tasks to the tasktrackers as soon as the map outputs are available.
5. The tasktracker runs reduce tasks and updates the progress status of the tasks to the jobtracker periodically.

Stages of MapReduce Job Life Cycle

Job Submission

1. The user submits a job to the client.
2. Client checks the output specifications of the job. If the output directory path is not specified or if the output directory already exists, then it will throw an error to the MapReduce program.
3. Client computes the input splits. If the input directory path is not specified then it will throw an error to the MapReduce program.
4. Client copies job.jar, job.xml and the input split information into HDFS. The job.jar file is copied with a default replication factor of 10 so that there are ample copies for the tasktrackers to access; this is controlled by the mapred.submit.replication property.
5. Client tells the jobtracker that the job is ready to be submitted for execution.

Job Initialization

1. Jobtracker takes the job and puts it into an internal queue from where the jobscheduler  will pick it up.
2. Jobscheduler retrieves the input splits from the HDFS which the client had computed earlier.
3. Jobscheduler assigns a map task for each input split. The number of reduce tasks is controlled by the mapred.reduce.tasks property.
4. The tasks are given task ids at this point.

Task Assignment

1. Before assigning a task to the tasktracker, the jobtracker must first choose a job to select a task from.
2. The tasktracker communicates with the jobtracker by periodically sending a heartbeat signal to the jobtracker, to tell the jobtracker that the tasktracker is alive and it's availability for the new job. If the tasktracker is available, then the jobtracker assigns the tasktracker a new task through the heartbeat signal return value.
3. The tasktracker has a fixed number of map and reduce slots. The map slots are filled before the reduce slots.
4. For each map task, the jobscheduler takes into account the network location of the tasktracker and picks a map task that is closest to the input split.

Task Execution

1. Tasktracker copies the job.jar file to its local filesystem.
2. Tasktracker creates a new local directory and un-jars the contents of job.jar into it.
3. Tasktracker launches a TaskRunner instance to run the task.
4. TaskRunner runs the task inside a child JVM so that bugs in the user-defined map and reduce code do not affect the tasktracker.
5. The child process communicates with the parent process periodically to report the status of the job task.

Job Completion & Progress Updates

1. As map tasks complete successfully, they notify their parent tasktracker of the status update which in turn notifies the jobtracker.
2. These notifications are transmitted over the heartbeat communication mechanism. These statuses change over the course of the job.
3. Mappers and reducers on child jvm report to the tasktracker periodically and set a flag to report a task status change.
4. When the jobtracker receives a notification that the last task of the job is completed, it changes the status of the job to "successful".
5. Finally, jobtracker combines all the updates from tasktrackers to provide a global view of job progress status.


1. The jobtracker cleans up it's working state for the job only after confirming that all the reduce tasks are completed successfully and instructs the tasktrackers to do the same.
2. The cleanup activity involves the deletion of intermediate outputs and other similar housekeeping tasks.

Note: The jobtracker alone is responsible for scheduling jobs, dividing a job into map and reduce tasks, distributing the tasks to datanodes, task failure recovery, monitoring jobs and tracking job status. The jobscheduler must therefore not be confused with a separate MapReduce daemon; it is a component of the jobtracker.

Shuffle/Sort Phase

MapReduce is the heart of Hadoop, and the shuffle/sort phase is one of the most expensive parts of MapReduce execution, where the actual "magic" happens. The process by which mappers partition and sort their outputs for the respective reducers, and the data is then shuffled to the intended reducers to be collected and grouped by key, is known as the shuffle/sort phase. It begins after the first map task completes. Other map tasks may still be running on their datanodes, but the exchange of intermediate map outputs to the respective reducers already starts; hence it is not necessary for all map tasks to complete before any reduce task can begin. The grouped keys are, however, only processed by the reducers after all map outputs have been received.

Map Phase (Preparation Phase)

In the map phase, mappers run on unsorted key/value pairs and generate zero or more output key/value pairs for each input pair. When a map task starts producing output, it writes the output to a circular memory buffer assigned to it. The default size of this buffer is 100MB and is regulated by the io.sort.mb property.

Partition Phase
When the contents of the buffer reach a certain threshold, a background thread starts to divide the data into partitions and spill it to the local disks. The default spill threshold is 80% of the buffer (80MB of the default 100MB) and is controlled by the io.sort.spill.percent property. The number of partitions depends on the number of reducers specified; the number of reduce tasks is defined by the mapred.reduce.tasks property. Each partition contains multiple key/value pairs, and the partitioner decides which reducer will get which key/value pair.

Sort Phase
Each partition has a set of intermediate keys that are automatically sorted by Hadoop; this is also known as the in-memory sort.

Combine Phase
This is an optional phase, also known as the mini-reduce phase. Combiners combine key/value pairs with the same key on a single node, and each combiner may run zero or more times. This phase produces a more compact, sorted map output so that less data needs to be transferred and written to local disk; hence, combiners do their work before the data is spilled to disk. Since the reduce phase does not parallelize work to the same degree as the map phase, it is comparatively slow; combiners help to optimize and speed up the job by drastically reducing the total bandwidth required by the shuffle phase, performing some of the work that the reduce phase would otherwise have to do later.

Before each spill is written to disk, it is often a good idea to compress the map output so that it is written faster, consumes less disk space and reduces the amount of data to be transferred to the reducers. Compression of map output is optional and is not enabled by default; setting the mapred.compress.map.output property to true enables it.
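
For example, compression can be enabled per job from the command line (a sketch; the jar, HDFS paths and codec are placeholders, and the chosen codec must be available on the cluster):
$ hadoop jar hadoop-examples.jar wordcount -D mapred.compress.map.output=true -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec /user/hadoop/input /user/hadoop/output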

Merge Phase
The spill is written to disk (mapred.local.dir). A new spill file is created every time the buffer reaches the spill threshold, so by the time the map task has generated its last output record there can be several spill files from a single map task.
Before the task finishes, the spill files on the local disk are merged into a single, partitioned and sorted output file. The io.sort.factor property controls the maximum number of spill files (streams) that can be merged at once.
Note: Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time the map blocks until the spill is complete.

Reduce Side (Calculation Phase)

Shuffle Phase / Copy Phase
A thread in the reduce task periodically asks the jobtracker for map output locations until it has retrieved them all. Map tasks may finish at different times, but the reduce tasks start copying their outputs as soon as each map task completes; the map outputs are not deleted by the tasktracker as soon as the first reducer has retrieved them. MapReduce ensures that the input to the reducers is sorted by key, so all values of the same key are always reduced together regardless of which mapper they came from. The map side therefore also takes part in the shuffle: the intermediate data is copied to the right reducers, with the help of the partitioners, over HTTP.
The map outputs are copied into the reduce tasktracker's memory if they are small enough; this buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property. Otherwise they are copied to disk. Map outputs that were compressed on the map side are decompressed so that they can be merged in the later stages. When the buffer reaches a threshold size (mapred.job.shuffle.merge.percent) or a threshold number of map outputs (mapred.inmem.merge.threshold), the map outputs are merged and spilled to disk.

Sort/Merge Phase
The copied spills are merged into a single sorted set of key/value pairs. MapReduce does not rely on very large buffer sizes; instead it concentrates on smaller disk spills and on parallelizing spilling and fetching in order to obtain better reduce times.

Reduce Phase
Finally, the sorted and merged files are fed into the reduce functions to produce the final output, which is written directly to HDFS. The first replica of each output block is written to the local disk of the node running the reducer.

MapReduce v1 had a single jobtracker to manage all the tasktrackers and the whole queue of jobs, which later proved to be a bottleneck. The inherent delay and latency in the job submission process led to the development of alternative solutions like Facebook's Corona and Yahoo's YARN (Yet Another Resource Negotiator).

Note: I have also prepared a brief overview of the above article. Please share your views and thoughts in the comments section below. Anything that I might have missed, or any suggestions, are all welcome.

Thanks for reading.

Sources: Hadoop - The Definitive Guide

Tags: Apache Hadoop Cloudera Hadoop MapReduceV1 MRV1 Overview Data flow Mechanism in MapReduce Data Processing in MapReduce Flow of Data in MapReduce Internal Flow of MapReduce MapReduce Data Processing MapReduce Model of Data Processing MapReduce Working MapReduce Anatomy Lifecycle of MapReduce Job MapReduce Job Working