Call For Testing BSD Certification Group

Hands-on bhyve

http://cft.lv/4

#FreeBSD #bhyve #Virtualization

April 25th, 2012

Version 4.0

© Michael Dexter

The History and Architecture of the BSD Hypervisor

bhyve, the BSD Hypervisor, was unveiled at BSDCan 2011 by FreeBSD developers neel@ and grehan@.

bhyve Overview

bhyve is a type 2 hypervisor for FreeBSD and PC-BSD that is similar to Linux KVM and consists of the vmm.ko kernel module, a few support utilities and a library. Because these are all loadable external components, they can be easily packaged and installed on an unmodified host. A bhyve guest must currently be built with a few FreeBSD-specific shims, which expedited development, but the code is fundamentally portable. With a little help, bhyve could support unmodified guests and be ported to other operating systems thanks to its simple design and permissive license.

Hardware Requirements

bhyve depends on Intel's Virtualization Technology (VT-x) as found in "Nehalem" or later processors, and specifically on Extended Page Tables (EPT). bhyve optionally supports Direct Device Attach (VT-d) for PCI pass-through of storage and network devices. VT-x and EPT can be found on Intel Core i3, i5 and i7 processors, the Pentium G6950 and select Xeon processors. Of these, only the i3 lacks VT-d support. Intel is good about listing VT-x and VT-d support for given processors on its web site but is unfortunately not as clear about EPT. The most reliable way to verify VT-x and EPT support on a given system is to look for the VMX and POPCNT (Pop Count) features in its dmesg output. Some systems may disable VT-x in the BIOS, and while POPCNT does not directly confirm EPT support, the two features are usually, if not always, available together. In the words of an Intel rep at the recent Supercomputing conference, "We added POPCNT for the NSA", and he confirmed that one could theoretically probe for EPT support.

The NYC*BUG dmesg Database is quite useful for referencing candidate systems and the system I used for this article reports:

Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,
CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0x17bae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,AVX> AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM> AMD Features2=0x1<LAHF>
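The dmesg check described above can be scripted. What follows is a minimal sketch, not part of bhyve: the function names has_cpu_feature and check_bhyve_cpu are hypothetical, and as noted above, POPCNT only suggests EPT support rather than confirming it.

```sh
#!/bin/sh
# Hypothetical helpers (not part of bhyve) for scanning dmesg-style text.

# has_cpu_feature FEATURE FILE
# Succeeds if FEATURE appears as a whole entry in a <...> feature list,
# including lists that wrap across lines as in the sample above.
has_cpu_feature() {
    grep -Eq "(^|[<,])$1([,>]|\$)" "$2"
}

# check_bhyve_cpu FILE
# Prints one line for each feature this article suggests watching for.
check_bhyve_cpu() {
    for feature in VMX POPCNT; do
        if has_cpu_feature "$feature" "$1"; then
            echo "$feature: present"
        else
            echo "$feature: missing"
        fi
    done
}
```

On a FreeBSD host the boot-time messages are kept in /var/run/dmesg.boot, so check_bhyve_cpu /var/run/dmesg.boot would be a typical invocation.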

bhyve support for AMD processors with AMD-V and Rapid Virtualization Indexing (RVI, formerly known as Nested Page Tables) is under development.

Key Components

bhyve consists of a few key components:

bhyve host components:

"/usr/sbin/bhyve", the user-space sequencer and I/O emulation
/usr/src/usr.sbin/bhyve/*

"/usr/sbin/bhyveload", the user-space FreeBSD loader that can load the
kernel and metadata inside a bhyve-based virtual machine
/usr/src/usr.sbin/bhyveload/*

"/usr/sbin/vmmctl", a utility to dump hypervisor register state
/usr/src/usr.sbin/vmmctl/*

"/usr/lib/libvmmapi.a, /usr/lib/libvmmapi.so, /usr/lib/libvmmapi.so.5, /usr/lib/libvmmapi_p.a"
The front-end to the vmm.ko chardev interface
/usr/src/lib/libvmmapi/*

"/boot/kernel/vmm.ko" Kernel module for VT-x, VT-d and hypervisor control
/usr/src/sys/modules/vmm/*
/usr/src/sys/amd64/vmm/*
/usr/src/sys/amd64/include/vmm*


bhyve guest kernel components:

The BIOS MPTable in-memory structures used to get APIC IDs, etc.
/usr/src/sys/x86/x86/mptable.c
/usr/src/sys/x86/x86/mptable_pci.c

BVM, the "BSD Virtual Machine" Console
/usr/src/sys/dev/bvm/*

The modified local_apic.c to enable x2apic support for performance
/usr/src/sys/x86/x86/local_apic.c

The modified mp_machdep.c to allow CPUs without 'unrestricted guest'
support to bypass the real-mode bootstrap when running under bhyve
/usr/src/sys/amd64/amd64/mp_machdep.c

The BHYVE kernel configuration file
/usr/src/sys/amd64/conf/BHYVE

device                  bvmconsole
device                  mptable

mptable will be obsoleted by ACPI support.
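Before building a custom guest kernel, it can help to confirm that the configuration file carries both device lines shown above. This is a small sketch under that assumption only; the function names are hypothetical and the check knows nothing about the rest of the BHYVE configuration.

```sh
#!/bin/sh
# Hypothetical helpers for inspecting a kernel configuration file.

# kernconf_has_device FILE DEVICE
# Succeeds if FILE contains an uncommented "device DEVICE" line.
kernconf_has_device() {
    grep -Eq "^[[:space:]]*device[[:space:]]+$2([[:space:]]|#|\$)" "$1"
}

# check_guest_kernconf FILE
# Reports on the two devices named in this article and returns
# non-zero if either is missing.
check_guest_kernconf() {
    rc=0
    for dev in bvmconsole mptable; do
        if kernconf_has_device "$1" "$dev"; then
            echo "ok: device $dev"
        else
            echo "missing: device $dev"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_guest_kernconf /usr/src/sys/amd64/conf/BHYVE.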


bhyve guests rely on the following external components:

The traditional FreeBSD tunnel software network interface
/usr/src/sys/modules/if_tap/*

The VirtIO modules (imported to FreeBSD 10.0-CURRENT)
/usr/src/sys/dev/virtio/*
/usr/src/sys/modules/virtio/*

/usr/src/sys/amd64/vmm/intel/vmx.c does most of the heavy lifting of the hypervisor and is worth studying.

The Script

I have consolidated all of my knowledge of bhyve configuration into a single menu-driven script that offers the following steps:

1. Add the subversion and binutils Packages
2. Configure the Host's /boot/loader.conf (reboot required)
3. Retrieve bhyve Sources and Set-up the Working Directory
4. Build bhyve Host Components and Package
5. Add the bhyve Package (pkg_add /usr/src-bhyve/bhyve_package225757.tar)
6. Build bhyve Guest
7. Clean Up /mnt/ /dev/md0 /usr/obj/ and /usr/src-bhyve
8. Delete the bhyve Package (pkg_delete bhyve-0.0.1r225757)
9. Exit

Each step is a shell function and I have made it as linear as possible for easy comprehension and modification: simply add an exit anywhere that you are having trouble. This script will not modify /usr/src/ but rather union mount it on a working directory that will be populated with the bhyve sources via remote svn checkout or export. The /usr/src/ environment is needed to build the bhyve components but when the union mount is unmounted, the working directory remains with only the bhyve-specific sources and built binaries.
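The structure described above, one shell function per menu step plus a union mount over the working directory, can be sketched as follows. This is not bhyve-menu.sh itself: the function names, the DRY_RUN switch, the mapping of the mount to steps 3 and 7, and the choice of unionfs with the "below" option are all assumptions made for illustration.

```sh
#!/bin/sh
# Skeleton only; the real bhyve-menu.sh may be organized differently.
SRC_DIR=/usr/src            # host sources, left unmodified
WORK_DIR=/usr/src-bhyve     # working directory named in the menu above

# run CMD...: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: unionfs with /usr/src as the lower layer, so that new
# files are written to the working directory only.
mount_worktree() {
    run mount -t unionfs -o below "$SRC_DIR" "$WORK_DIR"
}

unmount_worktree() {
    run umount "$WORK_DIR"
}

# menu_dispatch CHOICE: map a menu number to its step function.
# Only the steps that involve the union mount are sketched here.
menu_dispatch() {
    case "$1" in
        3) mount_worktree ;;
        7) unmount_worktree ;;
        9) return 0 ;;
        *) echo "step $1 is not part of this sketch" >&2
           return 1 ;;
    esac
}
```

Running with DRY_RUN=1 prints each command instead of executing it, which is a convenient way to read through the steps before touching the host.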

This script will build a package of the bhyve host components, which you can optionally use with Neel's downloadable bhyve guest.

Download: bhyve-menu.sh

PC-BSD 9.0 users should be able to use this script without modification using sudo but I have yet to test it myself.

If everything goes according to plan, you can exit the script and follow the boot instructions. The resulting system boot should look like:

Wait until 20 seconds after boot for networking to work
errno = 22
Consoles: userboot  

FreeBSD/amd64 User boot, Revision 1.1
(root@bhyve, Fri Mar 30 01:41:20 PDT 2012)
Loading /boot/defaults/loader.conf 
/boot//kernel/kernel text=0x41f64f data=0x57810+0x273590 syms=[0x8+0x73788+0x8+0x6af0b]
/boot//kernel/virtio.ko size 0x4bc0 at 0xbca000
/boot//kernel/if_vtnet.ko size 0xae10 at 0xbcf000
/boot//kernel/virtio_pci.ko size 0x57d8 at 0xbda000
/boot//kernel/virtio_blk.ko size 0x4f68 at 0xbe0000

  ______               ____   _____ _____  
 |  ____|             |  _ \ / ____|  __ \ 
 | |___ _ __ ___  ___ | |_) | (___ | |  | |
 |  ___| '__/ _ \/ _ \|  _ < \___ \| |  | |
 | |   | | |  __/  __/| |_) |____) | |__| |
 | |   | | |    |    ||     |      |      |
 |_|   |_|  \___|\___||____/|_____/|_____/    ```                        `
                                             s` `.....---.......--.```   -/
 ┌────────────Welcome to FreeBSD───────────┐ +o   .--`         /y:`      +.
 │                                         │  yo`:.            :o      `+-
 │  1. Boot [ENTER]                        │   y/               -/`   -o/
 │  2. [Esc]ape to loader prompt           │  .-                  ::/sy+:.
 │  3. Reboot                              │  /                     `--  /
 │                                         │ `:                          :`
 │  Options:                               │ `:                          :`
 │  4. Boot Safe [M]ode: NO                │  /                          /
 │  5. Boot [S]ingle User: NO              │  .-                        -.
 │  6. Boot [V]erbose: NO                  │   --                      -.
 │                                         │    `:`                  `:`
 │                                         │      .--             `--.
 │                                         │         .---.....----.
 └─────────────────────────────────────────┘
                                          
GDB: debug ports: bvm
GDB: current port: bvm
KDB: debugger backends: ddb gdb
KDB: current backend: ddb
Copyright (c) 1992-2012 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 9.0-RELEASE #0: Fri Mar 30 01:41:05 PDT 2012
    root@bhyve:/usr/obj/usr/src-bhyve/sys/BHYVE amd64
WARNING: WITNESS option enabled, expect reduced performance.
CPU: Intel(R) Core(TM) i5-2400S CPU @ 2.50GHz (2499.87-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x206a7  Family = 6  Model = 2a  Stepping = 7
  Features=0x8fabab7f<FPU,VME,DE,PSE,TSC,MSR,PAE,CX8,APIC,SEP,PGE,CMOV,PAT,PSE36,CLFLUSH,DTS,MMX,FXSR,SSE,SSE2,SS,PBE>
  Features2=0x97bae25f<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,SMX,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,AVX,HV>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 6442450944 (6144 MB)
avail memory = 2729897984 (2603 MB)
MPTable: <NETAPP   vFiler      >
Event timer "LAPIC" quality 400
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 2 package(s) x 1 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
pcib0 pcibus 0 on motherboard
pci0: <PCI bus> on pcib0
virtio_pci0: <VirtIO PCI Network adapter> port 0x2000-0x201f at device 1.0 on pci0
vtnet0: <VirtIO Networking Adapter> on virtio_pci0
virtio_pci0: host features: 0x18020 <Status,MrgRxBuf,MacAddress>
virtio_pci0: negotiated features: 0x18020 <Status,MrgRxBuf,MacAddress>
vtnet0: Ethernet address: 00:a0:98:f6:70:6c
virtio_pci1: <VirtIO PCI Block adapter> port 0x2040-0x207f at device 2.0 on pci0
vtblk0: <VirtIO Block Adapter> on virtio_pci1
virtio_pci1: host features: 0x10000004 <RingIndirect,MaxNumSegs>
virtio_pci1: negotiated features: 0x10000004 <RingIndirect,MaxNumSegs>
vtblk0: 400MB (819200 512 byte sectors)
cpu0 on motherboard
cpu1 on motherboard
isa0: <ISA bus> on motherboard
Timecounters tick every 10.000 msec
SMP: AP CPU #1 Launched!
Timecounter "TSC-low" frequency 9765121 Hz quality 1000
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from ufs:vtbd0 []...
warning: no time-of-day clock registered, system time will not be set accurately
Setting hostuuid: 837fa9d4-7a44-11e1-bef2-00a098f6706c.
Setting hostid: 0xa9e35c5b.
Entropy harvesting: interrupts ethernet point_to_point kickstart.
Starting file system checks:
/dev/vtbd0: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/vtbd0: clean, 29775 free (23 frags, 3719 blocks, 0.0% fragmentation)
Mounting local file systems:.
Setting hostname: bhyve-tap0.
vtnet0: link state changed to UP
Starting Network: lo0 vtnet0.
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=3<RXCSUM,TXCSUM>
	inet6 ::1 prefixlen 128 
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2 
	inet 127.0.0.1 netmask 0xff000000 
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
vtnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=80028<VLAN_MTU,JUMBO_MTU,LINKSTATE>
	ether 00:a0:98:f6:70:6c
	inet 192.168.1.151 netmask 0xffffff00 broadcast 192.168.1.255
	inet6 fe80::2a0:98ff:fef6:706c%vtnet0 prefixlen 64 tentative scopeid 0x1 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet 1000baseT <full-duplex>
	status: active
Starting devd.
add net default: gateway 192.168.1.1
add net ::ffff:0.0.0.0: gateway ::1
add net ::0.0.0.0: gateway ::1
add net fe80::: gateway ::1
add net ff02::: gateway ::1
Generating host.conf.
Creating and/or trimming log files.
Starting syslogd.
ELF ldconfig path: /lib /usr/lib /usr/lib/compat
32-bit compatibility ldconfig path: /usr/lib32
Clearing /tmp (X related).
Updating motd:.
Generating public/private rsa1 key pair.
Your identification has been saved in /etc/ssh/ssh_host_key.
Your public key has been saved in /etc/ssh/ssh_host_key.pub.
The key fingerprint is:
ab:9c:f9:9a:5b:b8:f4:91:97:9e:86:8a:b0:67:ba:41 root@bhyve-tap0
The key's randomart image is:
+--[RSA1 1024]----+
|                 |
|                 |
|                 |
|                 |
|  E     S        |
| .     . o .     |
|  o   o =.o      |
|   +o+ O.+..     |
|  +=. @=o.o      |
+-----------------+
Generating public/private dsa key pair.

...

Starting sshd.
Starting cron.
Starting background file system checks in 60 seconds.

Fri Mar 30 08:44:11 UTC 2012

FreeBSD/amd64 (bhyve-tap0) (console)

login: 

The result is a genuine FreeBSD system running the pared-down BHYVE kernel.

The 400M disk image leaves about 84M of space for experimentation.

Filesystem    Size    Used   Avail Capacity  Mounted on
/dev/vtbd0    393M    277M     84M    77%    /
devfs         1.0k    1.0k      0B   100%    /dev

Try your favorite software and benchmarks on your bhyve guest and explore its limits. Every step of my testing has generated as many questions as answers and bhyve clearly offers a lot to explore.

Some versions of VMware reportedly allow VT-x features to be passed through to a virtual machine, but unfortunately my notebook does not support EPT.

Troubleshooting

If the updated binutils package is not installed, you will see:

{standard input}: Assembler messages:
{standard input}:160: Error: no such instruction: 'invept -16(%rbp),%rax'

If hw.physmem="0x100000000" is not set in the host's /boot/loader.conf, you will see:

vm_setup_memory(lowmem): Cannot allocate memory
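A quick way to rule out this error is to read the setting back from the loader configuration. The sketch below assumes the tunable lives in /boot/loader.conf in the usual name="value" form; both function names are hypothetical. For reference, 0x100000000 bytes is 4096 MB.

```sh
#!/bin/sh
# Hypothetical helpers for reading name="value" pairs from loader.conf.

# loader_conf_get FILE NAME
# Prints the value of the last NAME="value" assignment in FILE,
# or nothing if NAME is not set.
loader_conf_get() {
    grep -E "^[[:space:]]*$2=" "$1" | tail -n 1 |
        sed -e 's/^[^=]*=//' -e 's/^"//' -e 's/"[[:space:]]*$//'
}

# check_physmem FILE
# Reports the hw.physmem setting and returns non-zero if it is absent.
check_physmem() {
    value=$(loader_conf_get "$1" hw.physmem)
    if [ -z "$value" ]; then
        echo "hw.physmem is not set in $1"
        return 1
    fi
    echo "hw.physmem=$value"
}
```

For example, check_physmem /boot/loader.conf. Remember that a change to this file only takes effect after a reboot.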

If the vmm.ko module is not loaded when you try to boot the guest, you will see:

vm_create: No such file or directory

If vmm.ko is mismatched with the host kernel, you will see:

kldload: can't load vmm: Exec format error

An incompatible system, such as a Celeron U2300 with the following features in dmesg, will fail when loading vmm.ko and therefore cannot boot a guest:

Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,
CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x400e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,XSAVE>
AMD Features=0x20100800<SYSCALL,NX,LM>
AMD Features2=0x1<LAHF>

kldload vmm.ko
vmx_init: processor does not support desired primary processor-based controls
module_register_init: MOD_LOAD (vmm, 0xffffffff816127a0, 0) error 22

Note that ifconfig tap0 up needs to be run after the guest has booted in order for networking to work. A 20 second delay is included in vmrun.sh to accommodate the time it takes to boot the guest kernel.
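The timing can be sketched as a small helper that waits and then brings the interface up. This is not taken from vmrun.sh; the function name and its optional delay argument are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper; vmrun.sh itself is not reproduced here.

# tap_up_after_boot IFACE [DELAY]
# Waits DELAY seconds (default 20, the delay mentioned above) so the
# guest kernel has time to boot, then brings IFACE up on the host.
tap_up_after_boot() {
    iface=$1
    delay=${2:-20}
    sleep "$delay"
    ifconfig "$iface" up
}
```

One way to use it is to start tap_up_after_boot tap0 in the background just before launching the guest in the foreground.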

Note that "myguest" is an arbitrary guest name and you must be careful to keep track of guest names and their memory allocations. A mismatch can result in:

vm_setup_memory(highmem): Cannot allocate memory

If you receive this message, it can be cleared up by running kldunload vmm followed by kldload vmm to reload the vmm kernel module.
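The module checks from this section can be wrapped in a few helpers. This is a sketch with hypothetical function names; it uses only kldstat, kldload and kldunload from the FreeBSD base system and assumes it is run as root on the host.

```sh
#!/bin/sh
# Hypothetical wrappers around the FreeBSD kld utilities.

# vmm_loaded: succeeds if the vmm module is currently loaded.
vmm_loaded() {
    kldstat -q -m vmm
}

# ensure_vmm: load vmm only if it is not already present.
ensure_vmm() {
    vmm_loaded || kldload vmm
}

# reload_vmm: unload vmm if needed, then load it again.
reload_vmm() {
    if vmm_loaded; then
        kldunload vmm || return 1
    fi
    kldload vmm
}
```

Calling ensure_vmm before booting a guest avoids the "vm_create: No such file or directory" error, and reload_vmm performs the unload and load sequence described above.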

bhyve Guest Layout

Building bhyve guest images is much like building FreeBSD Jails or Xen guests and you can use many of your favorite Jail building techniques to configure and customize them. The script populates a disk image with a userland, configures it to your needs and surrounds it with a loader, kernel, and the appropriate kernel modules. I have followed the layout of Neel's downloadable image in /usr/bhyve-guest/:

/usr/guest/
           boot/kernel/     Containing loader, kernel and modules
           diskdev          Disk image for guest's / partition
           userboot.so      Required by the bhyve* utilities

/usr/guest/boot/
beastie.4th         kernel             menu.rc
brand.4th           loader.4th         screen.4th
check-password.4th  loader.conf        shortcuts.4th
color.4th           loader.help        support.4th
defaults            loader.rc          userboot.so
delay.4th           menu-commands.4th  version.4th
frames.4th          menu.4th

/usr/guest/boot/kernel/
if_vtnet.ko         virtio.ko          virtio_pci.ko
kernel              virtio_balloon.ko
mdroot              virtio_blk.ko

The /usr/guest/boot/kernel/ listing shows an mdroot device as per Neel's downloadable image layout. This is the memory-backed disk that he used; my approach does not use it.

As this is a test environment, I run echo "PermitRootLogin yes" >> /mnt/etc/ssh/sshd_config to allow root to ssh into the system.
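The disk image and sshd steps can be sketched as below. This is an assumption-laden outline rather than the code in bhyve-menu.sh: the function names and the DRY_RUN switch are hypothetical, the 400M size and the /mnt and /dev/md0 names are taken from this article, and the image is formatted as a whole-disk UFS file system to match the guest's root on vtbd0.

```sh
#!/bin/sh
# Sketch only; bhyve-menu.sh may perform these steps differently.
GUEST_DIR=/usr/guest

# run CMD...: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# create_guest_disk: make a 400M file-backed disk, put a UFS file
# system on it and mount it on /mnt so a userland can be installed.
create_guest_disk() {
    run truncate -s 400M "$GUEST_DIR/diskdev"
    run mdconfig -a -t vnode -f "$GUEST_DIR/diskdev" -u 0
    run newfs /dev/md0
    run mount /dev/md0 /mnt
}

# permit_root_login [FILE]
# Appends "PermitRootLogin yes" once; repeated calls add nothing.
permit_root_login() {
    conf=${1:-/mnt/etc/ssh/sshd_config}
    grep -q '^PermitRootLogin yes$' "$conf" 2>/dev/null ||
        echo "PermitRootLogin yes" >> "$conf"
}
```

Unlike a bare echo with >>, permit_root_login can be run again after a rebuild without stacking up duplicate lines. As in the text above, this setting is only appropriate for a test environment.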

bhyve Guest Images

I would like to thank NYC*BUG, the New York City *BSD User Group, for providing bandwidth to share images generated with bhyve-menu.sh at bhyve.org.

I would also like to thank Paul Schenkeveld for helping me debug this at AsiaBSDCon 2012 and let's all thank Neel and Peter for their hard work.

You can find more information and the original BSDCan presentation in the bhyve section of the FreeBSD Wiki.

I welcome your corrections and contributions.


Copyright © 2011 – 2014 Michael Dexter unless specified otherwise. Feedback and corrections welcome.

Happy hacking!