Odd behaviour infortrend/LSI/2.4.31

At some point, a few disks went offline in one of our newsspoolers. The newsspooler runs Linux kernel 2.4.31 (never change a winning team), with two Infortrend 12-drive RAID boxen connected to an LSI (mpt) SCSI controller. Anyway, when Nagios told me that something was going on on that box, I logged in, took a look at dmesg and saw that some disks were gone. They indeed were. As I hadn’t seen this before on any of our boxes with this same configuration, I took a look at the array that the offline disks were connected to. The array reported nothing other than bad sectors on one of the disks.

As I didn’t want to screw up the filesystems on the offline disks, I then decided to reboot the RAID array. The reboot went nicely as always. After that, I unloaded the mptscsih and mptbase kernel modules and reloaded them. The disks showed up normally again.
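
For reference, the module reload can be sketched roughly like this. It's a minimal sketch, not what I actually typed: it assumes mptscsih depends on mptbase (so it has to be unloaded first), that nothing on the affected disks is still mounted, and that a Python 3 interpreter is around. The names `reload_commands` and `reload_modules` are made up for illustration, and by default it only prints the commands instead of running them.

```python
import subprocess

# Load order, dependencies first: mptscsih sits on top of mptbase.
MPT_MODULES = ["mptbase", "mptscsih"]


def reload_commands(modules):
    """Return the rmmod/modprobe commands needed to reload `modules`.

    `modules` is given in load order, so unloading walks it backwards.
    """
    unload = [["rmmod", name] for name in reversed(modules)]
    load = [["modprobe", name] for name in modules]
    return unload + load


def reload_modules(modules, dry_run=True):
    """Print the commands (dry run) or execute them as root."""
    for cmd in reload_commands(modules):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing command,
            # e.g. when a module is still in use.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    reload_modules(MPT_MODULES)
```

Run as-is it only lists the four commands in order; calling `reload_modules(MPT_MODULES, dry_run=False)` as root would actually execute them.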

One or two days later, the same thing happened on the same box, with disks on the same array. I didn’t repeat the same procedure however, but just reloaded the mpt modules. It worked, without even rebooting the array 😐

The bad thing is that I can’t decide where the problem lies. The RAID array was complaining about bad sectors on some disk (stupidly I forgot to check which one it was), and that might be causing trouble. However, bad sectors shouldn’t be a reason for disks to go offline. On the other hand there’s the RAID controller. Although I’ve seen more issues with the Fibre Channel LSI controllers, I haven’t seen any SCSI error messages from the kernel. Normally when there are SCSI woes, the module complains about it, and in all 4 (!) times it happened, there was nothing.

Anyway, we’re going to install a new box tomorrow morning and migrate the data from this box to the new one.

