Brokenstack? - Rescuing Your Instance From The Brink of Oblivion

January 5, 2017

Have you ever had OpenStack do something to your instance that put it in an unbootable state? Did YOU do something to your instance that put it into an unbootable state?

Modern IaaS wisdom teaches us to treat instances like "cattle": we should be able to blow any one of them away and replace it at any time. However, we still have dev environments, jump boxes, etc. that end up being treated as "pets". When these instances get in trouble, we panic.

In today’s story, we happen upon an OpenStack admin who decided to try migrating such an instance from one node to another to better distribute the memory load. That brings us to another axiom: Test OpenStack’s migrate feature on a test VM BEFORE attempting to move an instance you care about.

So imagine the ensuing panic when said migration failed with a 401 error. Gulp.

As with any other SNAFU involving nova-compute, we figure out which host the instance is running on and what its virsh instance name is:

# nova show 928907ae-4711-4863-9add-cff4f0ff161e
+--------------------------------------+-----------------------------------+
| Property                             | Value                             |
+--------------------------------------+-----------------------------------+
| OS-DCF:diskConfig                    | AUTO                              |
| OS-EXT-AZ:availability_zone          | nova                              |
| OS-EXT-SRV-ATTR:host                 | node-5.domain.tld                 |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | node-5.domain.tld                 |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000052                 |
...
+--------------------------------------+-----------------------------------+
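
If you find yourself doing this lookup a lot, the two fields we care about can be pulled out of that table with a little awk. This is only a sketch: get_property is a made-up helper name (not part of the nova client), and it assumes the two-column table layout shown above.

```shell
#!/usr/bin/env bash
# get_property is a hypothetical helper, not a nova client command.
# It reads `nova show` table output on stdin and prints the value
# of the single property named in $1.
get_property() {
    local prop="$1"
    awk -F'|' -v prop="$prop" '
        NF >= 3 {
            key = $2; val = $3
            # Trim the padding around each table cell.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == prop) { print val; exit }
        }'
}

# Example, run wherever the nova client is configured:
#   nova show 928907ae-4711-4863-9add-cff4f0ff161e | get_property OS-EXT-SRV-ATTR:host
#   nova show 928907ae-4711-4863-9add-cff4f0ff161e | get_property OS-EXT-SRV-ATTR:instance_name
```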

Then SSH directly to the compute node to see what KVM / QEMU’s view of the world is.

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 26    instance-00000056              running
 70    instance-00000052              shut down
...

Turns out, OpenStack didn’t delete the instance, but left the instance’s folder in a renamed state, like so:

# ls /var/lib/nova/instances
0ecaff2c-d73a-483f-97d4-3425faa8355e
928907ae-4711-4863-9add-cff4f0ff161e_resize
...
# ls -Alh /var/lib/nova/instances/928907ae-4711-4863-9add-cff4f0ff161e_resize
total 11G
-rw------- 1 root root  46K Jan  5 15:03 console.log
-rw-r--r-- 1 root root  11G Jan  5 15:16 disk
-rw-r--r-- 1 root root 410K Jun  1  2016 disk.config
-rw-r--r-- 1 nova nova  162 Jun  1  2016 disk.info
-rw-r--r-- 1 nova nova 2.9K Jan  5 10:22 libvirt.xml

So all we need to do is rename the directory so it no longer has the _resize suffix, then run:

# virsh start instance-00000052
Domain instance-00000052 started
# nova reset-state --active 928907ae-4711-4863-9add-cff4f0ff161e
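
The rename step mentioned above (the one that has to happen before virsh start) could look something like the sketch below. restore_instance_dir is a hypothetical helper, and the refusal to overwrite an existing directory is my own safety assumption rather than anything nova requires.

```shell
#!/usr/bin/env bash
# restore_instance_dir is a hypothetical helper. It renames
# <instances_dir>/<uuid>_resize back to <instances_dir>/<uuid>.
# Assumption: if the target already exists we stop and let a human look.
restore_instance_dir() {
    local instances_dir="$1" uuid="$2"
    local src="${instances_dir}/${uuid}_resize"
    local dst="${instances_dir}/${uuid}"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite: $dst" >&2
        return 1
    fi
    mv -- "$src" "$dst"
}

# Example, run as root on the compute node:
#   restore_instance_dir /var/lib/nova/instances 928907ae-4711-4863-9add-cff4f0ff161e
```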

All is well, right?

Give root password for maintenance (or type Control-D to continue):

Not yet. Looks like the OS has decided that something is amiss – possibly a corrupted root filesystem?!?! All you need to do is type the root password and… oh wait, this is a cloud image. You don’t KNOW the root password!

NOTE: It was brought to my attention after posting this that the next logical step is to use nova rescue, which essentially boots the instance from a separate rescue image with the original boot disk attached as a secondary device, so I can perform whatever operations I need. Try that first. If nova rescue does not work for your particular situation – read on.
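
For reference, a rescue round trip might be sketched as below. nova rescue and nova unrescue are real client commands; the run wrapper and the DRY_RUN variable are my own additions so you can see what would be executed before committing to it.

```shell
#!/usr/bin/env bash
# run is a hypothetical wrapper: with DRY_RUN=1 it prints the command
# instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Boot the instance into rescue mode.
rescue_instance() {
    local server="$1"
    run nova rescue "$server"
}

# Return the instance to its normal boot disk once repairs are done.
unrescue_instance() {
    local server="$1"
    run nova unrescue "$server"
}

# Example:
#   DRY_RUN=1 rescue_instance 928907ae-4711-4863-9add-cff4f0ff161e
```

While in rescue mode the original root disk typically shows up as a second disk inside the guest; which device name it gets depends on your image and hypervisor, so check with lsblk before running any repair tool.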

If this were your desktop, you’d simply pop the CentOS 7 DVD into the drive and attempt recovery. Let’s do that!

Back on the compute node, use virsh to add a CD-ROM drive to your instance:

# virsh
virsh # edit instance-00000052

Under <os>, make sure cdrom is listed as a boot device ahead of hd:

  <os>
    <type arch='x86_64' machine='pc-i440fx-2.0'>hvm</type>
    <boot dev='cdrom'/>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>

Next, add the following device under <devices>:

    <disk type='block' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>

Next, start the instance and attach the ISO:

# virsh start instance-00000052
# virsh attach-disk instance-00000052 /var/lib/nova/workspace/CentOS-7-x86_64-DVD-1611.iso hdc --type cdrom --mode readonly
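
Before rebooting, it doesn't hurt to confirm that the ISO really landed on hdc. virsh domblklist prints a target/source table for a domain; the sketch below just picks one row out of it. disk_source is a hypothetical helper, and it assumes the source path contains no spaces.

```shell
#!/usr/bin/env bash
# disk_source is a hypothetical helper. It reads `virsh domblklist`
# output on stdin and prints the source column for the target in $1.
# Assumption: source paths contain no whitespace.
disk_source() {
    local target="$1"
    awk -v t="$target" '$1 == t { print $2; exit }'
}

# Example, run on the compute node:
#   virsh domblklist instance-00000052 | disk_source hdc
```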

Then, you can actually go into Horizon, click the Console link for your instance, and operate the console from there. From the console, click Send CTRLALTDEL to restart your instance and boot from the ISO.

You may be tempted to finally say "YAY! I can finally fix the filesystem and boot my instance – almost there." Then some jerk keeps restarting your instance before you can run fsck or xfs_repair. That prankster is nova-compute. To tell him to "cut it out", simply reset the status on the instance after you hit Send CTRLALTDEL.

# nova reset-state --active 928907ae-4711-4863-9add-cff4f0ff161e

Do what you need to do: set the root password if that helps, restart your instance from the local disk, and fix what’s wrong.

The underlying problem that caused all of this seemed to be twofold: First, xfs_repair found that there were some errors in the root filesystem, and promptly fixed them. Also, I had a block device I was using for data storage that didn’t detach cleanly. In fact, early on in the process I went to Horizon and detached the block device when virsh start didn’t initially work, and planned on reattaching it when I determined all was well with the OS. However, during boot up, the OS was trying to mount said device per its /etc/fstab and it wasn’t apparent from what I was seeing at the console.
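
In hindsight, a quick look for fstab entries pointing at devices that no longer exist would have saved some head-scratching. Here is a rough sketch of that check. missing_fstab_devices is a hypothetical helper, and it only understands entries written as plain /dev/... paths; UUID= and LABEL= entries are skipped.

```shell
#!/usr/bin/env bash
# missing_fstab_devices is a hypothetical helper. It prints
# "<device> <mountpoint>" for every fstab entry whose device is a
# /dev/... path that does not currently exist.
# Assumption: UUID= and LABEL= entries are ignored by this sketch.
missing_fstab_devices() {
    local fstab="${1:-/etc/fstab}"
    local dev mnt rest
    # The extra test keeps a final line that lacks a trailing newline.
    while read -r dev mnt rest || [ -n "$dev" ]; do
        case "$dev" in
            ''|'#'*) continue ;;   # blank lines and comments
            /dev/*) [ -e "$dev" ] || echo "$dev $mnt" ;;
        esac
    done < "$fstab"
}

# Example, run inside the guest once you have a shell:
#   missing_fstab_devices /etc/fstab
```

For a data volume that may legitimately be absent, adding the nofail mount option to its fstab entry is one way to keep a missing device from holding up the boot.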

When finished, make sure you cleanly power down your instance, go back to the compute node, and use virsh to remove the changes you made to get the cdrom drive to work. Then, start your instance back up using the Horizon UI.

Also – You should probably reset that root password.

Anyway, this may have gone from Buffalo to New York by way of Chicago, but at least now we know what lengths we can go to if something goes south on an instance we care about (and ideally, we SHOULDN’T care about them).

Written by:
Jeremy R Budnack

Senior Cloud Engineer
