Exchanging a defective drive in a software RAID

Setup: a server with software RAID1 and LVM. If you have configured mdadm monitoring correctly, one day you might receive an email like the following:

A Fail event had been detected on md device /dev/md/0.
It could be related to component device /dev/sda1.

This requires immediate action because one of your hard disks is probably failing and the affected arrays no longer have any redundancy. Check the state of the arrays first:

# cat /proc/mdstat
Personalities : [raid1] 
md1 : active raid1 sda2[0](F) sdb2[1]
2929739071 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sda1[0](F) sdb1[1]
524276 blocks super 1.2 [2/1] [_U]

unused devices: <none>
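
The (F) flags can also be picked out mechanically, which helps when a server has more than a couple of arrays. A minimal sketch, assuming the /proc/mdstat layout shown above; the function name list_failed_members is made up for this example:

```shell
#!/bin/sh
# list_failed_members [FILE]
# Print "array member" for every component flagged (F).
# FILE defaults to /proc/mdstat; pass a saved copy for offline checks.
list_failed_members() {
    awk '$2 == ":" && $3 == "active" {
        for (i = 4; i <= NF; i++) {
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # "sda2[0](F)" -> "sda2"
                print $1, dev
            }
        }
    }' "${1:-/proc/mdstat}"
}
```

For the output above this prints "md1 sda2" and "md0 sda1".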

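In this case /dev/sda is the defective device: both of its partitions are flagged (F) and each array is running on one of two members ([2/1]). For your support ticket you need to provide information that proves the device is broken. If the broken drive no longer reports its serial number, note the serial number of the healthy drive instead.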

# hdparm -i /dev/sda | grep SerialNo
HDIO_DRIVE_CMD(identify) failed: Input/output error
HDIO_GET_IDENTITY failed: No message of desired type

# hdparm -i /dev/sdb | grep SerialNo
Model=ST3000DM001-1CH166, FwRev=CC24, SerialNo=Z1F1PTCD

# smartctl -H /dev/sda

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Log Sense failed, IE page [scsi response fails sanity test]
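
If you want only the serial number for the ticket, it can be cut out of the hdparm line. A small sketch, assuming the "SerialNo=" field format shown above; serial_from_hdparm is a hypothetical helper name:

```shell
#!/bin/sh
# serial_from_hdparm: read `hdparm -i` output on stdin and print
# the value of the SerialNo= field (nothing if the field is absent).
serial_from_hdparm() {
    sed -n 's/.*SerialNo=\([^, ]*\).*/\1/p'
}

# Usage (needs root and a drive that still responds):
#   hdparm -i /dev/sdb | serial_from_hdparm
```

For a drive that fails the identify command, as /dev/sda does here, the function prints nothing.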

Next you need to remove the drive from the RAID:

# mdadm /dev/md0 -r /dev/sda1
# mdadm /dev/md1 -r /dev/sda2

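If removing fails with the message “mdadm: hot remove failed for /dev/sdXY: Device or resource busy”, mark the partition of the defective drive as failed first, then repeat the removal:

# mdadm --manage /dev/md1 --fail /dev/sda2

Make sure you name the partition on the broken drive (/dev/sda here), never the one on the healthy drive.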
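
The fail-then-remove sequence can be generated for every partition of the broken disk. This is a sketch only: print_removal_cmds is a hypothetical helper that prints the commands instead of running them, so you can review them before pasting them into a root shell. It assumes partitions are named like sda1 (disk name, optional "p", then a number).

```shell
#!/bin/sh
# print_removal_cmds DISK [FILE]
# For every array member that is a partition of DISK (e.g. "sda"),
# print the mdadm commands to mark it failed and remove it.
# FILE defaults to /proc/mdstat. Nothing is executed.
print_removal_cmds() {
    awk -v disk="$1" '$2 == ":" {
        for (i = 4; i <= NF; i++) {
            dev = $i
            sub(/\[.*/, "", dev)               # strip "[0](F)" suffix
            if (dev ~ ("^" disk "p?[0-9]+$")) {
                printf "mdadm --manage /dev/%s --fail /dev/%s\n", $1, dev
                printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, dev
            }
        }
    }' "${2:-/proc/mdstat}"
}

# Usage: print_removal_cmds sda
```

Printing rather than executing is a deliberate choice here: naming the wrong disk would take the last healthy member out of the array.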

Check mdstat again:

# cat /proc/mdstat

Personalities : [raid1] 
md1 : active raid1 sdb2[1]
2929739071 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sdb1[1]
524276 blocks super 1.2 [2/1] [_U]

unused devices: <none>

Now you have all the information you need to open your support ticket. Make sure the mail address you have stored in the Hetzner Robot is one you can reach even when your server is down.
After the drive has been replaced you might have to install GRUB again in order to boot your server. You can do this from the rescue system.

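First copy the partition table from the healthy drive (/dev/sdb) to the new drive (/dev/sda), then assign new random GUIDs to the new drive. Note that sgdisk -R takes the target first and the source second: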

# sgdisk -R /dev/sda /dev/sdb
# sgdisk -G /dev/sda
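
Because swapping the two arguments would overwrite the healthy drive's partition table, it can help to build the commands from named parameters. A minimal sketch, assuming GPT disks as above; print_gpt_clone_cmds is a hypothetical helper that only prints the commands:

```shell
#!/bin/sh
# print_gpt_clone_cmds NEW_DISK HEALTHY_DISK
# Print the sgdisk commands that copy the partition table from
# HEALTHY_DISK onto NEW_DISK and randomize the GUIDs on NEW_DISK.
# Nothing is executed.
print_gpt_clone_cmds() {
    new=$1
    healthy=$2
    if [ -z "$new" ] || [ -z "$healthy" ] || [ "$new" = "$healthy" ]; then
        echo "usage: print_gpt_clone_cmds NEW_DISK HEALTHY_DISK (must differ)" >&2
        return 1
    fi
    # sgdisk -R: target first, then source
    printf 'sgdisk -R %s %s\n' "$new" "$healthy"
    # new random disk and partition GUIDs on the new disk only
    printf 'sgdisk -G %s\n' "$new"
}

# Usage: print_gpt_clone_cmds /dev/sda /dev/sdb
```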

Mount your root filesystem once the rescue system has detected your existing RAID arrays and volume group. Then write GRUB to the new drive:

# mount /dev/vg0/root /mnt
# mount /dev/md0 /mnt/boot
# mount -o bind /dev /mnt/dev
# chroot /mnt
# mount -t proc proc /proc
# mount -t sysfs sys /sys
# grub-install --recheck /dev/sda
# sync
# exit
# umount /mnt/sys /mnt/proc /mnt/dev /mnt/boot
# umount /mnt
# reboot

If everything went well, you can now log into your system and add the new drive back to the RAID:

# mdadm /dev/md0 -a /dev/sda1
mdadm: added /dev/sda1
# mdadm /dev/md1 -a /dev/sda2
mdadm: added /dev/sda2

Check mdstat once more and you should see the resync in progress. The small md0 array is usually back in sync almost immediately, while the large one takes hours:

# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sda2[2] sdb2[1]
2929739071 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.2% (7515776/2929739071) finish=938.2min speed=51911K/sec

md0 : active raid1 sda1[2] sdb1[1]
524276 blocks super 1.2 [2/2] [UU]

unused devices: <none>
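
To keep an eye on the rebuild without reading the whole file, the progress line can be reduced to the essentials. A sketch, assuming the progress line format shown above; recovery_progress is a made-up name:

```shell
#!/bin/sh
# recovery_progress [FILE]
# Print "array percent eta" for every array that is currently
# recovering or resyncing. FILE defaults to /proc/mdstat.
recovery_progress() {
    awk '$2 == ":" { array = $1 }          # remember the current array
         $2 == "recovery" || $2 == "resync" {
             eta = $6
             sub(/^finish=/, "", eta)      # "finish=938.2min" -> "938.2min"
             print array, $4, eta
         }' "${1:-/proc/mdstat}"
}

# Usage: recovery_progress
#        watch -n 60 cat /proc/mdstat     # plain alternative
```

For the output above this prints "md1 0.2% 938.2min"; once all arrays are in sync it prints nothing.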

Based on http://wiki.hetzner.de/index.php/Festplattenaustausch_im_Software-RAID
