This is a short^H^H^H^H^H long mail to introduce / walk through some
recent developments in libvirt to support native Linux hosted
container virtualization using the kernel capabilities the people
on this list have been adding in recent releases. We've been working
on this for a few months now, but not really publicised it before
now, and I figure the people working on container virt extensions
for Linux might be interested in how it is being used.

For those who aren't familiar with libvirt, it provides a stable API
for managing virtualization hosts and their guests. It started with
a Xen driver, and over time has evolved to add support for QEMU, KVM,
OpenVZ and, most recently of all, a driver we're calling "LXC", short
for "LinuX Containers". The key point is that no matter what
hypervisor you are using, there is a consistent set of APIs and a
standardized configuration format for userspace management
applications in the host (and remote secure RPC to the host).

The LXC driver is the result of a combined effort from a number of
people in the libvirt community; most notably Dave Leskovec
contributed the original code, and Dan Smith now leads development,
along with my own contributions to its architecture to better
integrate it with libvirt.

We have a couple of goals in this work. Overall, libvirt wants to be
the de facto standard, open source management API for all
virtualization platforms, and native Linux virtualization
capabilities are a strong focus. The LXC driver is attempting to
provide a general purpose management solution for two container
virt use cases:

  - Application workload isolation

  - Virtual private servers

In the first use case we want to provide the ability to run an
application in the primary host OS with partial restrictions on its
resource / service access. It will still run with the same root
directory as the host OS, but its filesystem namespace may have
some additional private mount points present. It may have a
private network namespace to restrict its connectivity, and it
will ultimately have restrictions on its resource usage (e.g.
memory, CPU time, CPU affinity, I/O bandwidth).

In the second use case, we want to provide a completely virtualized
operating system in the container (running the host kernel of
course), akin to the capabilities of OpenVZ / Linux-VServer. The
container will have a totally private root filesystem, a private
networking namespace, whatever other namespace isolation the
kernel provides, and again resource restrictions. Some people
like to think of this as 'a better chroot than chroot'.

In terms of technical implementation, at its core is direct usage
of the new clone() flags. By default all containers get created
with CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWUSER, and
CLONE_NEWIPC. If private network config was requested they also
get CLONE_NEWNET.

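As a rough sketch of what this boils down to (illustrative only, not
the actual libvirt code, and with error handling mostly omitted):

      #define _GNU_SOURCE
      #include <sched.h>
      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #define STACK_SIZE (1024 * 1024)

      static int container_child(void *arg)
      {
          /* We are now PID 1 inside new PID/mount/UTS/user/IPC
           * (and optionally network) namespaces; set up mounts,
           * then exec the container's 'init' process */
          execl("/sbin/init", "init", (char *)NULL);
          return 1;
      }

      int main(void)
      {
          char *stack = malloc(STACK_SIZE);
          int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
                      CLONE_NEWUSER | CLONE_NEWIPC; /* + CLONE_NEWNET */

          /* The child starts in the new namespaces; the stack grows
           * down, so pass the top of the allocated buffer */
          pid_t child = clone(container_child, stack + STACK_SIZE,
                              flags | SIGCHLD, NULL);
          if (child < 0) {
              perror("clone");
              return 1;
          }

          /* When the container's 'init' exits, the container is gone */
          waitpid(child, NULL, 0);
          return 0;
      }
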
For the workload isolation case, after creating the container we
just add a number of filesystem mounts in the container's private
FS namespace. In the VPS case, we'll do a pivot_root() onto the
new root directory, and then add any extra filesystem mounts the
container config requested.

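The pivot_root() step is conceptually along the following lines
(again just a simplified sketch with no error handling - the real
logic lives in src/lxc_container.c, and the '.oldroot' name here is
purely illustrative):

      #include <sys/mount.h>
      #include <sys/stat.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      static void enter_new_root(const char *new_root)
      {
          /* pivot_root() needs the new root to be a mount point,
           * so bind mount the directory onto itself first */
          mount(new_root, new_root, NULL, MS_BIND, NULL);
          chdir(new_root);

          /* The old root is parked under the new root so it can
           * be detached afterwards */
          mkdir(".oldroot", 0700);
          syscall(SYS_pivot_root, ".", ".oldroot");

          chdir("/");
          umount2("/.oldroot", MNT_DETACH);
          rmdir("/.oldroot");
      }
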
The stdin/out/err of the process leader in the container is bound
to the slave end of a pseudo-TTY, with libvirt owning the master end
so it can provide a virtual text console into the guest container.
Once the basic container setup is complete, libvirt execs the
so-called 'init' process. Things are thus set up such that when the
'init' process exits, the container is terminated / cleaned up.

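The console plumbing amounts to something like this (an illustrative
sketch of the standard pseudo-TTY calls, not the exact libvirt code):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdlib.h>
      #include <unistd.h>

      /* Allocate a PTY pair; the master fd stays with libvirt, and
       * the slave fd is dup2()'d over stdin/out/err (in the real
       * code this happens in the container child just before it
       * execs the 'init' process) */
      static int open_container_console(int *masterfd)
      {
          int slavefd;

          *masterfd = posix_openpt(O_RDWR | O_NOCTTY);
          grantpt(*masterfd);
          unlockpt(*masterfd);

          slavefd = open(ptsname(*masterfd), O_RDWR);

          dup2(slavefd, STDIN_FILENO);
          dup2(slavefd, STDOUT_FILENO);
          dup2(slavefd, STDERR_FILENO);

          return slavefd;
      }
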
On the host side, the libvirt LXC driver creates what we call a
'controller' process for each container. This is done with a small
binary, /usr/libexec/libvirt_lxc. This is the process which owns the
master end of the pseudo-TTY, along with a second pseudo-TTY pair.
When the host admin wants to interact with the container, they use
the command 'virsh console CONTAINER-NAME'. The LXC controller
process takes care of forwarding I/O between the two slave PTYs,
one slave opened by virsh console, the other being the container's
stdin/out/err. If you kill the controller, then the container
also dies. Basically you can think of the libvirt_lxc controller
as serving the equivalent purpose to the 'qemu' command for full
machine virtualization - it provides the interface between host
and guest, in this case just the container setup and access to the
text console - perhaps more in the future.

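The forwarding itself is little more than a relay loop over the two
master fds the controller holds, roughly like this (a sketch only;
the real code is in src/lxc_controller.c):

      #include <poll.h>
      #include <unistd.h>

      /* Copy data between the container's PTY and the PTY that
       * 'virsh console' attaches to. No EOF/error handling shown. */
      static void forward_console(int container_fd, int console_fd)
      {
          struct pollfd fds[2] = {
              { container_fd, POLLIN, 0 },
              { console_fd,   POLLIN, 0 },
          };
          char buf[1024];
          ssize_t got;

          for (;;) {
              poll(fds, 2, -1);

              if (fds[0].revents & POLLIN) {
                  got = read(container_fd, buf, sizeof(buf));
                  if (got <= 0)
                      break;
                  write(console_fd, buf, got);
              }
              if (fds[1].revents & POLLIN) {
                  got = read(console_fd, buf, sizeof(buf));
                  if (got <= 0)
                      break;
                  write(container_fd, buf, got);
              }
          }
      }
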
For networking, libvirt provides two core concepts:

  - Shared physical device. A bridge containing one of your
    physical network interfaces on the host, along with one or
    more of the guest vnet interfaces. So the container appears
    as if it's directly on the LAN.

  - Virtual network. A bridge containing only guest vnet
    interfaces, and NO physical device from the host. iptables
    and forwarding provide routed (+ optionally NATed)
    connectivity to the LAN for guests.

The latter use case is particularly useful for machines without
a permanent wired ethernet - e.g. laptops using wifi - as it lets
guests talk to each other even when there's no active host network.
Both of these network setups are fully supported in the LXC driver
in the presence of a suitably new host kernel.

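As an aside, a virtual network is itself defined with a small XML
document and managed with virsh net-define / net-start; a minimal
NAT-style example (the name, bridge and addresses below are purely
illustrative) looks like:

      # cat > mynet.xml <<EOF
      <network>
        <name>mynet</name>
        <bridge name='virbr1'/>
        <forward mode='nat'/>
        <ip address='192.168.130.1' netmask='255.255.255.0'>
          <dhcp>
            <range start='192.168.130.2' end='192.168.130.254'/>
          </dhcp>
        </ip>
      </network>
      EOF
      # virsh net-define mynet.xml
      # virsh net-start mynet

A guest is then connected to it by using <interface type='network'>
with <source network='mynet'/> in its domain XML.
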
That's a 100ft overview, and the current functionality is working
quite well from an architectural/technical point of view, but there
is plenty more work we still need to do to provide a system which
is mature enough for real world production deployment.

  - Integration with cgroups. Although I talked about resource
    restrictions, we've not implemented any of this yet. In the
    most immediate timeframe we want to use cgroups' device
    ACL support to prevent the container from having any ability
    to access device nodes other than the usual suspects of
    /dev/{null,full,zero,console}, and possibly /dev/urandom
    (see the sketch after this list). The other important one is
    to provide a memory cap across the entire container. CPU based
    resource control is lower priority at the moment.

  - Efficient query of resource utilization. We need to be able
    to get the cumulative CPU time of all the processes inside
    the container, without having to iterate over every PID's
    /proc/$PID/stat file. I'm not sure how we'll do this yet.
    We want this data summed across all CPUs, and per-CPU.

  - devpts virtualization. libvirt currently just bind mounts the
    host's /dev/pts into the container. Clearly this isn't a
    serious implementation. We've been monitoring the devpts
    namespace patches, and these look like they will provide the
    capabilities we need for the full virtual private server use
    case.

  - network sysfs virtualization. libvirt can't currently use the
    CLONE_NEWNET flag on most Linux distros, since the currently
    released kernel has this capability conflicting with SYSFS in
    Kconfig. Again, we're looking forward to seeing this addressed
    in the next kernel release.

  - UID/GID virtualization. While we spawn all containers as root,
    applications inside the container may switch to unprivileged
    UIDs. We don't (necessarily) want users in the host with
    equivalent UIDs to be able to kill processes inside the
    container. It would also be desirable to allow unprivileged
    users to create containers without needing root on the host,
    while allowing them to be root & any other user inside their
    container. I'm not aware whether anyone's working on this kind
    of thing yet?

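To give a flavour of the cgroups integration mentioned in the first
item above, the kind of host-side setup we have in mind looks
roughly like this (a hand-run sketch; the mount point, group name
and exact device whitelist are illustrative and depend on the host
kernel config, and $CONTAINER_INIT_PID stands for the host-side PID
of the container's 'init'):

      # mount -t cgroup -o devices,memory cgroup /cgroup
      # mkdir /cgroup/mycontainer
      # echo $CONTAINER_INIT_PID > /cgroup/mycontainer/tasks

      # echo a > /cgroup/mycontainer/devices.deny
      # echo 'c 1:3 rwm' > /cgroup/mycontainer/devices.allow  # /dev/null
      # echo 'c 1:5 rwm' > /cgroup/mycontainer/devices.allow  # /dev/zero
      # echo 'c 1:7 rwm' > /cgroup/mycontainer/devices.allow  # /dev/full
      # echo 'c 5:1 rwm' > /cgroup/mycontainer/devices.allow  # /dev/console
      # echo 'c 1:9 rwm' > /cgroup/mycontainer/devices.allow  # /dev/urandom

      # echo 500M > /cgroup/mycontainer/memory.limit_in_bytes
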
There are probably more things Dan Smith is thinking of, but that
list is a good starting point.

Finally, a 30 second overview of actually using LXC with libvirt
to create a simple VPS using busybox in its root fs...

  - Create a simple chroot environment using busybox

      mkdir /root/mycontainer
      mkdir /root/mycontainer/bin
      mkdir /root/mycontainer/sbin
      cp /sbin/busybox /root/mycontainer/sbin
      for cmd in sh ls chdir chmod rm cat vi
      do
        ln -s ../sbin/busybox /root/mycontainer/bin/$cmd
      done
      cat > /root/mycontainer/sbin/init <<EOF
      #!/sbin/busybox
      sh
      EOF
      chmod 755 /root/mycontainer/sbin/init

  - Create a simple libvirt configuration file for the
    container, defining the root filesystem, the network
    connection (bridged to br0 in this case), and the
    path to the 'init' binary (defaults to /sbin/init if
    omitted)

      # cat > mycontainer.xml <<EOF
      <domain type='lxc'>
        <name>mycontainer</name>
        <memory>500000</memory>
        <os>
          <type>exe</type>
          <init>/sbin/init</init>
        </os>
        <devices>
          <filesystem type='mount'>
            <source dir='/root/mycontainer'/>
            <target dir='/'/>
          </filesystem>
          <interface type='bridge'>
            <source bridge='br0'/>
            <mac address='00:11:22:34:34:34'/>
          </interface>
          <console type='pty'/>
        </devices>
      </domain>
      EOF

  - Load the configuration into libvirt

      # virsh --connect lxc:/// define mycontainer.xml
      # virsh --connect lxc:/// list --inactive
       Id Name                 State
      ----------------------------------
        - mycontainer          shutdown


  - Start the VM and query some information about it

      # virsh --connect lxc:/// start mycontainer
      # virsh --connect lxc:/// list
       Id Name                 State
      ----------------------------------
      28407 mycontainer          running

      # virsh --connect lxc:/// dominfo mycontainer
      Id:             28407
      Name:           mycontainer
      UUID:           8369f1ac-7e46-e869-4ca5-759d51478066
      OS Type:        exe
      State:          running
      CPU(s):         1
      Max memory:     500000 kB
      Used memory:    500000 kB

    NB. the CPU/memory info here is not enforced yet.

  - Interact with the container

      # virsh --connect lxc:/// console mycontainer

    NB, Ctrl+] to exit when done

  - Query the live config - e.g. to discover which PTY its
    console is connected to

      # virsh --connect lxc:/// dumpxml mycontainer
      <domain type='lxc' id='28407'>
        <name>mycontainer</name>
        <uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid>
        <memory>500000</memory>
        <currentMemory>500000</currentMemory>
        <vcpu>1</vcpu>
        <os>
          <type arch='i686'>exe</type>
          <init>/sbin/init</init>
        </os>
        <clock offset='utc'/>
        <on_poweroff>destroy</on_poweroff>
        <on_reboot>restart</on_reboot>
        <on_crash>destroy</on_crash>
        <devices>
          <filesystem type='mount'>
            <source dir='/root/mycontainer'/>
            <target dir='/'/>
          </filesystem>
          <console type='pty' tty='/dev/pts/22'>
            <source path='/dev/pts/22'/>
            <target port='0'/>
          </console>
        </devices>
      </domain>

  - Shut down the container

      # virsh --connect lxc:/// destroy mycontainer

There is lots more I could say, but hopefully this serves as
a useful introduction to the LXC work in libvirt and how it
is making use of the kernel's container-based virtualization
support. For those interested in finding out more, all the
source is in the libvirt CVS repo, the files being those
named src/lxc_conf.c, src/lxc_container.c, src/lxc_controller.c
and src/lxc_driver.c.

   http://libvirt.org/downloads.html

or via the GIT mirror of our CVS repo

   git clone git://git.et.redhat.com/libvirt.git

Regards,
Daniel