2010-05-12.

(The making of the new 'dint', with more space and even more resilience: this computer has to survive long periods without maintenance being available, and last year it had a single disk failure in its raid5 array of 4 500GB disks, followed half a year later by another failure; now we want double redundancy as well as more space. ext4 is also wanted, as ext3 has been such a trouble dealing with several-GB filesystems, giving many-second fsync and mkdir times in some cases.)

Now the 4 1TB disks are in a computer, each identically configured as:

  partition   approx. size   in array    purpose
  1           64MB           (none)      /boot
  2           10GB           /dev/md0    /  (system)
  3           15GB           /dev/md1    /home
  4           ~900GB         /dev/md2    /home/public

Note: no swap. Partly because it seems silly to put something so critical on a possibly failing disk, partly because a server with decent RAM scarcely needs it, and partly because it's cleaner not to have to resort to logical partitions.

The 3 raid arrays were built, each being RAID6 over 4 disks:

  mdadm --create /dev/md0 -c 16 -l 6 -n 4 /dev/sd[bc]2 missing missing
  mdadm --create /dev/md1 -c 16 -l 6 -n 4 /dev/sd[bc]3 missing missing
  mdadm --create /dev/md2 -c 64 -l 6 -n 4 /dev/sd[bc]4 missing missing

The component devices (partitions 2, 3 and 4 on every disk) were initially labelled as type 'fd' (Linux raid autodetect). The arrays were initially built with just two component devices each (the other two 'missing'), by putting two of the disks into another computer for partitioning and building. grub was installed onto each of these two disks, with that same disk's partition 1 given as grub's root. Each /boot partition (sd[abcd]1) was formatted as ext3 with the smaller inode size to make grub happy:

  mkfs.ext3 -I128 /dev/sda1     (then sdb1, etc.)

The reason for doing this on all the disks is to allow booting even if the first one fails and is removed.

An ext4 filesystem was used for each of the arrays, configured to match the raid chunk-sizes:

  mkfs.ext4 -L system -b 4096 -E stride=4,stripe-width=8   -m 2 /dev/md0
  mkfs.ext4 -L home   -b 4096 -E stride=4,stripe-width=8   -m 2 /dev/md1
  mkfs.ext4 -L public -b 4096 -E stride=16,stripe-width=32 -m 2 /dev/md2

(and when re-making 'public' on 2010-06-12, with 5 disks in total,

  mkfs.ext4 -L public -b 4096 -E stride=16,stripe-width=48 -m 2 /dev/md2  )

The options

  md_mod.start_dirty_degraded=1 md_mod.start_ro=1

were added to the kernel options in /boot/grub/grub.conf, to make the kernel accept starting an array with missing devices. The root device root=/dev/md0 was specified.

The first troubles (panic) were down to the cretinous omission to modify /etc/fstab from the values of the machine from which the system had been copied. Correcting this still didn't fix the panic about the root device and the lack of an 'init' executable. Nor did partitioning and building-in the other two disks (using a rescue cd). On a hunch, it was suspected that perhaps the kernel's idea of which array was md0, md1 or md2 differed from mine... All disks' partitions 3 & 4 were therefore set back to plain type 83 (Linux) rather than fd (Linux raid autodetect), so that the kernel should ignore them.
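(The type change itself isn't recorded in these notes; it can be done interactively with fdisk's 't' command on each disk, or non-interactively with the sfdisk of that era, roughly as below -- a sketch only, not the commands actually used:

  # sketch: set partitions 3 and 4 of every disk back to plain type 0x83,
  # so that in-kernel raid autodetect only picks up partition 2 (the root array)
  for d in a b c d; do
      sfdisk --change-id /dev/sd$d 3 83
      sfdisk --change-id /dev/sd$d 4 83
  done
)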
Then it worked, with each partition table looking like this:

  Disk /dev/sd[abcd]: 1000.2 GB, 1000204886016 bytes
  255 heads, 63 sectors/track, 121601 cylinders
  Units = cylinders of 16065 * 512 = 8225280 bytes
  Disk identifier: 0xbef429a6

     Device Boot      Start         End      Blocks   Id  System
  /dev/sda1               1           9       72260+  83  Linux
  /dev/sda2              10        1315    10490441+  fd  Linux raid autodetect
  /dev/sda3            1316        3274    15735661   83  Linux
  /dev/sda4            3275      121566   950172456+  83  Linux

In the newly started system, running with just the root array md0 assembled from its minimum of 2 devices, all the arrays were manually started and the other partitions (now available in the new computer with its 4 SATA sockets) were added to the arrays:

  mdadm --assemble /dev/md1 /dev/sda3 /dev/sdb3
  mdadm --assemble /dev/md2 /dev/sda4 /dev/sdb4
  mdadm --manage /dev/md0 --add /dev/sdc2 /dev/sdd2
  mdadm --manage /dev/md1 --add /dev/sdc3 /dev/sdd3
  mdadm --manage /dev/md2 --add /dev/sdc4 /dev/sdd4

The full set of component partitions was then specified in /etc/mdadm.conf, to let the init scripts assemble the arrays with the required names:

  # mdadm configuration file
  MAILADDR root
  DEVICE /dev/sd[abcd][234]
  ARRAY /dev/md0 devices=/dev/sda2,/dev/sdb2,/dev/sdc2,/dev/sdd2
  ARRAY /dev/md1 devices=/dev/sda3,/dev/sdb3,/dev/sdc3,/dev/sdd3
  ARRAY /dev/md2 devices=/dev/sda4,/dev/sdb4,/dev/sdc4,/dev/sdd4

After the several-hour rebuild to get all devices actively synced in all arrays, it looked like this:

  root@tempdint ~ # mdadm --detail /dev/md0
  /dev/md0:
          Version : 0.90
    Creation Time : Mon May 10 17:49:55 2010
       Raid Level : raid6
       Array Size : 20980736 (20.01 GiB 21.48 GB)
    Used Dev Size : 10490368 (10.00 GiB 10.74 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 0
      Persistence : Superblock is persistent

      Update Time : Thu May 13 18:10:07 2010
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 16K

             UUID : 206503fd:a54541e0:d7a6d8c7:9aeca122
           Events : 0.2604

      Number   Major   Minor   RaidDevice State
         0       8        2        0      active sync   /dev/sda2
         1       8       18        1      active sync   /dev/sdb2
         2       8       34        2      active sync   /dev/sdc2
         3       8       50        3      active sync   /dev/sdd2

  root@tempdint ~ # mdadm --detail /dev/md1
  /dev/md1:
          Version : 0.90
    Creation Time : Mon May 10 17:50:02 2010
       Raid Level : raid6
       Array Size : 31471104 (30.01 GiB 32.23 GB)
    Used Dev Size : 15735552 (15.01 GiB 16.11 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 1
      Persistence : Superblock is persistent

      Update Time : Thu May 13 16:56:19 2010
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 16K

             UUID : e87e43c2:2e616de1:d7a6d8c7:9aeca122
           Events : 0.531

      Number   Major   Minor   RaidDevice State
         0       8       19        0      active sync   /dev/sdb3
         1       8        3        1      active sync   /dev/sda3
         2       8       35        2      active sync   /dev/sdc3
         3       8       51        3      active sync   /dev/sdd3

  root@tempdint ~ # mdadm --detail /dev/md2
  /dev/md2:
          Version : 0.90
    Creation Time : Mon May 10 17:28:22 2010
       Raid Level : raid6
       Array Size : 1900344704 (1812.31 GiB 1945.95 GB)
    Used Dev Size : 950172352 (906.15 GiB 972.98 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 2
      Persistence : Superblock is persistent

      Update Time : Thu May 13 16:55:14 2010
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 64K

             UUID : 78af272f:8bd08df3:d7a6d8c7:9aeca122
           Events : 0.25295

      Number   Major   Minor   RaidDevice State
         0       8       20        0      active sync   /dev/sdb4
         1       8        4        1      active sync   /dev/sda4
         2       8       36        2      active sync   /dev/sdc4
         3       8       52        3      active sync   /dev/sdd4
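(The same state appears more tersely in /proc/mdstat, which is also where resync progress shows up while a rebuild such as the one above is running; for example -- an illustration, not taken from the notes above:

  cat /proc/mdstat                                   # per-array state, e.g. [UUUU], and resync % while rebuilding
  watch -n 60 cat /proc/mdstat                       # re-display it every minute
  echo 50000 > /proc/sys/dev/raid/speed_limit_min    # optionally raise the minimum resync speed (KB/s)
)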
/etc/fstab was set up to have the following (note that if sda is removed and the remaining disks don't end up with shifted names such that one of them becomes sda, /boot simply won't mount; that's still not a problem, just a warning during boot that a partition failed to mount):

  /dev/md0   /             ext4  noatime,data=ordered,journal_checksum,barrier=1,stripe=8,commit=4  0 1
  /dev/sda1  /boot         ext3  defaults  0 0
  /dev/md1   /home         ext4  errors=remount-ro,usrjquota=.aquota.user,jqfmt=vfsv0,user_xattr,acl,data=ordered,journal_checksum,barrier=1,stripe=8,commit=4  0 2
  /dev/md2   /home/public  ext4  errors=remount-ro,user_xattr,acl,data=ordered,journal_checksum,barrier=1,stripe=32,commit=6  0 2

It was then tested how well failures were tolerated. The point is to have a resilient system that can be left remotely with users who won't do much in the way of intricate fault-finding: any two disks should be able to fail and be removed, with the system still operating as normal. The loss of, for example, the first disk should not cause any trouble in booting. The kernel command-line options mentioned above (the md_mod.* ones) are an essential part of this, as they allow degraded arrays to be used.

Grub is a possible trouble: if disks are removed, do the others change their names (hd[0123]) on the next boot? (Yes -- within the running system one could use /dev/disk/by-id, but we want even grub to work right.) For each disk (hdN), with N=0,1,2,3 in grub's nomenclature, the following was run within the grub shell from the running system:

  root (hdN,0)
  setup (hdN)

to make each disk's MBR look to its own first partition for grub (but does this point to another disk if the disks all move?). It was found that removing the SATA lead to the first disk still resulted in a happy boot and mount; then removing the lead to the last disk as well /still/ permitted this.
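(A disk failure can also be exercised purely in software before pulling any cables; a sketch, with an arbitrarily chosen member device, not part of the test actually performed:

  mdadm --manage /dev/md2 --fail /dev/sdd4      # mark the member as faulty
  mdadm --manage /dev/md2 --remove /dev/sdd4    # take it out of the array
  # ... later, with the disk reconnected or replaced and re-partitioned:
  mdadm --manage /dev/md2 --add /dev/sdd4       # re-add it; a resync follows
)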