This Data Corruption Bug will Shock You
It's not common to have a severe bug in Ubuntu LTS, and it's even less common to have a severe data corruption bug! We are used to the idea that data is stored reliably and safely on our computers. However, for the past week, I have been pulling my hair out because of a data corruption issue, and was really surprised to discover the true cause.
I have been creating several Virtual Machines (VMs) to compartmentalize my essential network services, e.g. email is separated from contacts, and I decided to use virtualized QEMU instances on my Ubuntu 18.04 LTS system. I first provisioned a Windows Server VM, which installed flawlessly on a qcow2 image on an ext4 disk. However, I ran into issues when provisioning a Linux VM with a different virtual hard drive configuration. The libvirt disk definition for that VM was:
<disk type='file' device='disk'>
<driver name='qemu' type='raw' cache='none'/>
<source file='/dev/rust/client'/>
<target dev='sda' bus='sata'/>
</disk>
This disk image was stored on a ZVOL, i.e. a block device on a ZFS filesystem, and not as a file.
After installing Debian with disk encryption, I kept getting a kernel panic on first boot, but only when I used a long encryption key:
[ 0.799754] Unpacking initramfs...
[ 0.800970] Initramfs unpacking failed: junk in compressed archive
...
[ 0.936387] List of all partitions:
[ 0.937590] No filesystem could mount root, tried: [ 0.938747]
[ 0.939199] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 0.941199] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u2
[ 0.943290] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
...
[ 0.966656] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
The console line, Initramfs unpacking failed: junk in compressed archive, is the kernel saying that its boot file is corrupt! Initramfs is an image file that provides essential modules and drivers during system boot. I decided to mount the guest drive and extract the contents of initrd.img, but I received more error messages from gzip:
gzip: initrd.img-4.9.0-7-amd64: invalid compressed data--crc error
gzip: initrd.img-4.9.0-7-amd64: invalid compressed data--length error
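The same integrity check can be run without extracting anything. Below is a minimal Python sketch, roughly equivalent to what gzip does when it verifies the CRC and length trailer. The function name gzip_is_intact is my own, and the sketch assumes the image is a plain gzip stream; an initrd that uses another compressor, or that has an uncompressed archive prepended, would need different handling.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file, discard the output, and report whether it was valid.

    Hypothetical helper: returns (ok, message).
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Reading to the end forces the CRC32 and length trailer to be checked.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers a bad header or CRC mismatch (and a missing file),
        # EOFError a truncated stream, zlib.error damaged compressed data.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

Calling gzip_is_intact("initrd.img-4.9.0-7-amd64") on the copy taken from the guest drive should return False along with the reason if the file is damaged in the way the gzip output above suggests.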
At this point I gave up, thinking that the root cause was one of the following:
- a ZVOL/ZFS File System Bug (my first time using ZVOLs)
- a QEMU bug (suspicious update right before I used it)
- an initramfs-tools bug (I had a corrupt initrd.img file)
- a LUKS disk encryption bug (only happens to encrypted disks with long encryption keys!)
I even contemplated submitting a bug report to the initramfs-tools package. However, it turns out I was completely mistaken with my guesses, and that having the bug only occur with long encryption keys was a red herring. A Google search for "qemu corrupt", restricted to results from the past month, revealed a news article detailing the cause of the data corruption bug.
It turns out that my data corruption was due to a kernel bug! This bug only occurs when the virtual drive is set to cache='none' and located on a non-ext4 (mine was ZFS) file system. As in any accident, there was a long chain of events: QEMU cache='none' -> buggy system call on a ZVOL/ZFS file system -> corrupt disk write -> corrupt initramfs -> kernel panic on VM boot.
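To find out which VMs use the setting at the start of that chain, the libvirt definitions can be scanned for it. With cache='none', QEMU opens the backing storage with O_DIRECT and bypasses the host page cache. The following is a rough Python sketch, not a complete audit tool: the function name disks_with_cache_none is made up for illustration, and it assumes you feed it domain XML such as the output of virsh dumpxml for a VM.

```python
import xml.etree.ElementTree as ET


def disks_with_cache_none(domain_xml):
    """Return the source paths of all <disk> elements whose driver sets cache='none'.

    Hypothetical helper; accepts a full domain definition or a bare <disk> fragment.
    """
    root = ET.fromstring(domain_xml)
    hits = []
    for disk in root.iter("disk"):
        driver = disk.find("driver")
        if driver is None or driver.get("cache") != "none":
            continue
        source = disk.find("source")
        path = None
        if source is not None:
            # libvirt uses the 'file' attribute for file-backed disks
            # and 'dev' for disks defined as block devices.
            path = source.get("file") or source.get("dev")
        hits.append(path)
    return hits
```

Fed the disk definition shown earlier in this post, it reports /dev/rust/client. Whether a reported disk is actually at risk still depends on where it is stored and which kernel the host runs, which this sketch does not check.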
Even though I hoped that Linux (and by extension Linus!) would be infallible, and that such a serious data corruption bug would have been a package's fault, this bug is just one of many kernel bugs every year. This is the Ubuntu Bug Fix Report. The bug fix was actually released today and thankfully I did not lose any important data. In conclusion, as silly as it sounds, I learned that Linux isn't perfect.