2006-07-29 -- more tests on new fileserver for ETS. Now, use of linux md raid in emergencies. Five disks (147GB 10krpm U320 SCSI), each with two 10000MB partitions at the start of the disk. Using a gentoo system installed on another (the first) disk, with linux md raid from a 2.6.16-gentoo-r13 kernel, the following raid grow/fail/spare etc. scenarios were worked through.

1) make a raid5 array of 4 partitions, format it with ext3, put files on it, checksum the files, remove one disk by stopping the array and removing the superblock from its partition, then rerun the array and check the files are still there!
# mdadm --create /dev/md0 --auto=yes -l 5 -n 4 /dev/sd[bcde]1
# mkfs.ext3 /dev/md0
# mount -t ext3 /dev/md0 /mnt/tmp/
# cp file? /mnt/tmp/
# md5sum /mnt/tmp/file?
93727fa64195aaf4c9c043f6a3b45905  /mnt/tmp/file1
e0ebe8f2dc7c51cb5565d3d42f8f8c81  /mnt/tmp/file2
7e14b1cdc730c13da524e280085e9817  /mnt/tmp/file3
8858b2ff1cf9a46eedb9b7f861fab02f  /mnt/tmp/file4
48f2f7d32c1d190406f9ef7b17bdc2ec  /mnt/tmp/fileb
# umount /dev/md0
# mdadm --stop /dev/md0
# mdadm --zero-superblock /dev/sdc1
# mdadm --assemble /dev/md0 --run /dev/sd[bcde]1
mdadm: /dev/sdc1 has no superblock - assembly aborted
# mdadm --assemble /dev/md0 --run /dev/sd[bde]1
mdadm: /dev/md0 has been started with 3 drives (out of 4)
# cat /proc/mdstat
md0 : active raid5 sdb1[0] sde1[3] sdd1[2]
      2963520 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
# mount -t ext3 /dev/md0 /mnt/tmp/
# md5sum /mnt/tmp/file?
(same results as above)

2) add another disk into the array
# mdadm --manage /dev/md0 --add /dev/sdf1
(wait a few seconds for the small array to be recovered)
# cat /proc/mdstat
md0 : active raid5 sdf1[1] sdb1[0] sde1[3] sdd1[2]
      2963520 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

3) with mdadm --monitor running to send emails of "changes in status" (one way the monitor might be started is sketched after step 7), use the mdadm commands for failing a component disk
# mdadm --manage /dev/md0 --fail /dev/sdb1
(and immediately got a nice email:)
This is an automatically generated mail message from mdadm running on penguin
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sdb1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid0] [raid5] [raid4] [raid1]
md0 : active raid5 sdf1[1] sdb1[4](F) sde1[3] sdd1[2]
      2963520 blocks level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
unused devices: <none>

4) restore the faulted drive to operation (--manage seems redundant)
# mdadm /dev/md0 --remove /dev/sdb1
# mdadm /dev/md0 --add /dev/sdb1
# cat /proc/mdstat
md0 : active raid5 sdb1[0] sdf1[1] sde1[3] sdd1[2]
      2963520 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

5) add a further device as a hot spare
# mdadm /dev/md0 --add /dev/sdc1
# cat /proc/mdstat
md0 : active raid5 sdc1[4](S) sdb1[0] sdf1[1] sde1[3] sdd1[2]
      2963520 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

6) have another failure: see how the spare copes
# mdadm /dev/md0 --fail /dev/sdd1
# cat /proc/mdstat
md0 : active raid5 sdc1[2] sdb1[0] sdf1[1] sde1[3] sdd1[4](F)
      2963520 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
all very nice! Check the data's actually there:
# md5sum /mnt/tmp/file?
(yes -- all tallies)

7) try to grow the array
# mdadm /dev/md0 --remove /dev/sdd1
# mdadm /dev/md0 --grow /dev/sdd1
mdadm: can only add devices to linear arrays
Ahhh!! The problem... increasing the array's size is a many-hour job, then: wipe and rebuild the array, then restore the data from backup. WORTH NOTING whether a hostraid card can do online growing instead.
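(aside on the monitor used in step 3: the log doesn't record how it was started. A minimal sketch of one plausible invocation -- the mail address and delay are assumptions, not taken from the log:)
# mdadm --monitor --daemonise --delay=60 --mail=root@penguin /dev/md0
(this daemonises, polls roughly every 60 seconds, and mails when it sees events such as Fail or DegradedArray on /dev/md0, which matches the message quoted in step 3)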
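(aside on step 7: since --grow can't reshape this raid5, the "many-hour job" would run roughly as below. This is only a sketch of the wipe/rebuild/restore route described above; the restore depends entirely on the backup tool in use, and /backup/md0.tar is a hypothetical archive name:)
# umount /mnt/tmp
# mdadm --stop /dev/md0
# mdadm --zero-superblock /dev/sd[bcdef]1
# mdadm --create /dev/md0 --auto=yes -l 5 -n 5 /dev/sd[bcdef]1
(the new, larger array resyncs in the background; it can be formatted and used meanwhile)
# mkfs.ext3 /dev/md0
# mount -t ext3 /dev/md0 /mnt/tmp/
# tar -C /mnt/tmp -xf /backup/md0.tar
(restore from whatever backup actually exists; the tar archive above is hypothetical)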
8) not really raid, but related: growing ext3fs.
# df .
/dev/md/1              2916920   2465504    303240  90% /mnt/tmp
# umount /mnt/tmp
# resize2fs -f /dev/md1
Resizing the filesystem on /dev/md1 to 987904 (4k) blocks.
The filesystem on /dev/md1 is now 987904 blocks long.
# mount -t ext3 /dev/md1 /mnt/tmp/
# df .
/dev/md/1              3888808   2465504   1225724  67% /mnt/tmp
# md5sum /mnt/tmp/file?
(again, these are ok)
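(footnote on step 8: the -f flag overrides resize2fs's safety checks; the more cautious offline sequence runs a forced fsck first. A minimal sketch with the same device names as above:)
# umount /mnt/tmp
# e2fsck -f /dev/md1
# resize2fs /dev/md1
# mount -t ext3 /dev/md1 /mnt/tmp/
(with no size given, resize2fs grows the filesystem to fill the underlying device)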