December 13th '10 23:49

Fun with unsigned long long overflows.

I added two disks to my Very Large Video Storage Array this weekend, because I had them lying around and I thought it would be fun. This machine is a Kirkwood-based ARMv5 box running Debian Squeeze (a Marvell OpenRD-Client, to be specific). The reshape would take the overall storage capacity from ~1.2TB to ~2.1TB. I added the disks, started the reshape, and went to bed.

A few hours later, something stupid happened to the electricity in my garage and all my stuff got turned off. I fixed that, got everything turned back on, and went to check on the reshape. The kernel was complaining that the reshape wasn't far enough along to be automatically restarted, which was scary and seemed untrue. I emailed the linux-raid mailing list and talked to the maintainer, Neil, a bit; he suspected an ARM-specific overflow in the kernel code that performs this check.
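For the curious, here is a rough sketch of how this kind of overflow can happen. It is not the actual kernel code, and I don't know which expression was at fault; the function names and the sectors-to-bytes calculation are made up for illustration. The idea is that `unsigned long` is only 32 bits wide on 32-bit ARM, so a product computed at that width wraps before it ever gets stored in an `unsigned long long`. Python integers never overflow, so the masks below simulate the fixed widths.

```python
# Illustrative only: simulates C integer widths with masks.
# None of these names come from the md driver or mdadm.

MASK32 = 0xFFFFFFFF            # unsigned long on 32-bit ARM
MASK64 = 0xFFFFFFFFFFFFFFFF    # unsigned long long

SECTOR_BYTES = 512


def bytes_narrow(sectors):
    """Multiply in 32 bits, then widen: the high bits are already gone."""
    product32 = (sectors * SECTOR_BYTES) & MASK32
    return product32 & MASK64


def bytes_wide(sectors):
    """Widen first, then multiply: the full value survives."""
    return (sectors * SECTOR_BYTES) & MASK64


if __name__ == "__main__":
    # Roughly the size of the reshaped array (~2.1TB, decimal).
    sectors = 2_100_000_000_000 // SECTOR_BYTES
    print("sector count fits in 32 bits:", sectors <= MASK32)
    print("32-bit multiply:", bytes_narrow(sectors))
    print("64-bit multiply:", bytes_wide(sectors))
```

The sector count itself fits in 32 bits, which is what makes this class of bug sneaky: nothing looks wrong until you multiply. On a machine where `unsigned long` is 64 bits (a 64-bit x86 box, for example), the same source code would compute the wide result.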

Neil was right; when I eventually got a sufficiently modern version of mdadm running on an x86 machine, I was able to reassemble the array and get it reshaping again. A few false starts later (I may have slightly unplugged the disks from the laptop while checking on it last night), the reshape completed successfully. I reattached the disks to the ARM machine and was able to extend the underlying ext4 filesystem without error. I've verified all the (important) data against checksums and backups, and there appears to be no lasting damage.

While this failure was due to the overall complexity of my storage system, it highlights an advantage of Linux software RAID: I was able to trivially move the disks to another machine to perform recovery and work around a hardware-specific bug.