2010-05-12.

(The making of the new 'dint', with more space and even more resilience: this computer has to survive long periods without maintenance being available, and last year it had a single disk failure in its raid5 array of 4 500GB disks, followed half a year later by another failure; now we want double redundancy as well as more space. ext4 is also wanted, as ext3 has been such a trouble dealing with several-GB filesystems, giving many-second fsync and mkdir times in some cases.)

Now the 4 1TB disks are in a computer, each identically configured as:

  partition   approx. size   in array    purpose
  1           64MB           (none)      /boot
  2           10GB           /dev/md0    /  (system)
  3           15GB           /dev/md1    /home
  4           ~900GB         /dev/md2    /home/public

Note: no swap. Partly because it seems silly to put something so critical on a possibly failing disk, partly because a server with decent RAM scarcely needs it, and partly because it's cleaner not to have to resort to logical partitions.

The 3 raid arrays were built, each being RAID6 over 4 disks:

  mdadm --create /dev/md0 -c 16 -l 6 -n 4 /dev/sd[bc]2 missing missing
  mdadm --create /dev/md1 -c 16 -l 6 -n 4 /dev/sd[bc]3 missing missing
  mdadm --create /dev/md2 -c 64 -l 6 -n 4 /dev/sd[bc]4 missing missing

The component devices (partitions 2, 3 and 4 on every disk) were initially labelled as type 'fd' (Linux raid autodetect). The arrays were initially built with just two component devices each (the other two 'missing'), by putting two of the disks into another computer for partitioning and building. grub was installed onto each of these two disks, with that same disk's partition 1 given as grub's root. Each /boot partition (sd[abcd]1) was formatted as ext3 with the smaller inode size to make grub happy:

  mkfs.ext3 -I128 /dev/sda1     (then sdb1, etc.)

The reason for doing this on all the disks is to allow booting even if the first one fails and is removed.

An ext4 filesystem was used for each of the arrays, configured to match the raid chunk-sizes:

  mkfs.ext4 -L system -b 4096 -E stride=4,stripe-width=8   -m 2 /dev/md0
  mkfs.ext4 -L home   -b 4096 -E stride=4,stripe-width=8   -m 2 /dev/md1
  mkfs.ext4 -L public -b 4096 -E stride=16,stripe-width=32 -m 2 /dev/md2

(and when re-making 'public' on 2010-06-12, with 5 disks in total,

  mkfs.ext4 -L public -b 4096 -E stride=16,stripe-width=48 -m 2 /dev/md2  )

The options

  md_mod.start_dirty_degraded=1 md_mod.start_ro=1

were added to the kernel options in /boot/grub/grub.conf, to make the kernel accept starting an array with missing devices. The root device root=/dev/md0 was specified.

The first troubles (panic) were down to the cretinous omission to modify /etc/fstab from the values of the machine from which the system had been copied. Correcting this still didn't fix the panic about the root device and the lack of an 'init' executable. Nor did partitioning and building-in the other two disks (using a rescue cd). On a hunch, it was suspected that perhaps the kernel's idea of which array was md0, md1 or md2 differed from mine... All disks' partitions 3 & 4 were therefore set back to plain type 83 (Linux) rather than fd (Linux raid autodetect), so that the kernel should ignore them.
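(The type change itself isn't recorded in these notes; it can be done interactively with fdisk's 't' command on each disk, or non-interactively with the sfdisk of that era, roughly as below -- a sketch only, not the commands actually used:

  # sketch: set partitions 3 and 4 of every disk back to plain type 0x83,
  # so that in-kernel raid autodetect only picks up partition 2 (the root array)
  for d in a b c d; do
      sfdisk --change-id /dev/sd$d 3 83
      sfdisk --change-id /dev/sd$d 4 83
  done
)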
Then it worked, with each partition table looking like this:

  Disk /dev/sd[abcd]: 1000.2 GB, 1000204886016 bytes
  255 heads, 63 sectors/track, 121601 cylinders
  Units = cylinders of 16065 * 512 = 8225280 bytes
  Disk identifier: 0xbef429a6

     Device Boot      Start         End      Blocks   Id  System
  /dev/sda1               1           9       72260+  83  Linux
  /dev/sda2              10        1315    10490441+  fd  Linux raid autodetect
  /dev/sda3            1316        3274    15735661   83  Linux
  /dev/sda4            3275      121566   950172456+  83  Linux

In the newly started system, running with just the root array md0 assembled from its minimum of 2 devices, all the arrays were manually started and the other partitions (now available in the new computer with its 4 SATA sockets) were added to the arrays:

  mdadm --assemble /dev/md1 /dev/sda3 /dev/sdb3
  mdadm --assemble /dev/md2 /dev/sda4 /dev/sdb4
  mdadm --manage /dev/md0 --add /dev/sdc2 /dev/sdd2
  mdadm --manage /dev/md1 --add /dev/sdc3 /dev/sdd3
  mdadm --manage /dev/md2 --add /dev/sdc4 /dev/sdd4

The full set of component partitions was then specified in /etc/mdadm.conf, to let the init scripts assemble the arrays with the required names:

  # mdadm configuration file
  MAILADDR root
  DEVICE /dev/sd[abcd][234]
  ARRAY /dev/md0 devices=/dev/sda2,/dev/sdb2,/dev/sdc2,/dev/sdd2
  ARRAY /dev/md1 devices=/dev/sda3,/dev/sdb3,/dev/sdc3,/dev/sdd3
  ARRAY /dev/md2 devices=/dev/sda4,/dev/sdb4,/dev/sdc4,/dev/sdd4

After the several-hour rebuild to get all devices actively synced in all arrays, it looked like this:

  root@tempdint ~ # mdadm --detail /dev/md0
  /dev/md0:
          Version : 0.90
    Creation Time : Mon May 10 17:49:55 2010
       Raid Level : raid6
       Array Size : 20980736 (20.01 GiB 21.48 GB)
    Used Dev Size : 10490368 (10.00 GiB 10.74 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 0
      Persistence : Superblock is persistent

      Update Time : Thu May 13 18:10:07 2010
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 16K

             UUID : 206503fd:a54541e0:d7a6d8c7:9aeca122
           Events : 0.2604

      Number   Major   Minor   RaidDevice State
         0       8        2        0      active sync   /dev/sda2
         1       8       18        1      active sync   /dev/sdb2
         2       8       34        2      active sync   /dev/sdc2
         3       8       50        3      active sync   /dev/sdd2

  root@tempdint ~ # mdadm --detail /dev/md1
  /dev/md1:
          Version : 0.90
    Creation Time : Mon May 10 17:50:02 2010
       Raid Level : raid6
       Array Size : 31471104 (30.01 GiB 32.23 GB)
    Used Dev Size : 15735552 (15.01 GiB 16.11 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 1
      Persistence : Superblock is persistent

      Update Time : Thu May 13 16:56:19 2010
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 16K

             UUID : e87e43c2:2e616de1:d7a6d8c7:9aeca122
           Events : 0.531

      Number   Major   Minor   RaidDevice State
         0       8       19        0      active sync   /dev/sdb3
         1       8        3        1      active sync   /dev/sda3
         2       8       35        2      active sync   /dev/sdc3
         3       8       51        3      active sync   /dev/sdd3

  root@tempdint ~ # mdadm --detail /dev/md2
  /dev/md2:
          Version : 0.90
    Creation Time : Mon May 10 17:28:22 2010
       Raid Level : raid6
       Array Size : 1900344704 (1812.31 GiB 1945.95 GB)
    Used Dev Size : 950172352 (906.15 GiB 972.98 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 2
      Persistence : Superblock is persistent

      Update Time : Thu May 13 16:55:14 2010
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 64K

             UUID : 78af272f:8bd08df3:d7a6d8c7:9aeca122
           Events : 0.25295

      Number   Major   Minor   RaidDevice State
         0       8       20        0      active sync   /dev/sdb4
         1       8        4        1      active sync   /dev/sda4
         2       8       36        2      active sync   /dev/sdc4
         3       8       52        3      active sync   /dev/sdd4
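(The same state appears more tersely in /proc/mdstat, which is also where resync progress shows up while a rebuild such as the one above is running; for example -- an illustration, not taken from the notes above:

  cat /proc/mdstat                                   # per-array state, e.g. [UUUU], and resync % while rebuilding
  watch -n 60 cat /proc/mdstat                       # re-display it every minute
  echo 50000 > /proc/sys/dev/raid/speed_limit_min    # optionally raise the minimum resync speed (KB/s)
)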
/etc/fstab was set up to have the following (note that if sda is removed and the remaining disks don't end up with shifted names such that one of them becomes sda, /boot simply won't mount; that's still not a problem, just a warning during boot that a partition failed to mount):

  /dev/md0   /             ext4  noatime,data=ordered,journal_checksum,barrier=1,stripe=8,commit=4  0 1
  /dev/sda1  /boot         ext3  defaults  0 0
  /dev/md1   /home         ext4  errors=remount-ro,usrjquota=.aquota.user,jqfmt=vfsv0,user_xattr,acl,data=ordered,journal_checksum,barrier=1,stripe=8,commit=4  0 2
  /dev/md2   /home/public  ext4  errors=remount-ro,user_xattr,acl,data=ordered,journal_checksum,barrier=1,stripe=32,commit=6  0 2

It was then tested how well failures were tolerated. The point is to have a resilient system that can be left remotely with users who won't do much in the way of intricate fault-finding: any two disks should be able to fail and be removed, with the system still operating as normal. The loss of, for example, the first disk should not cause any trouble in booting. The kernel command-line options mentioned above (the md_mod.* ones) are an essential part of this, as they allow degraded arrays to be used.

Grub is a possible trouble: if disks are removed, do the others change their names (hd[0123]) on the next boot? (Yes -- within the running system one could use /dev/disk/by-id, but we want even grub to work right.) For each disk (hdN), with N=0,1,2,3 in grub's nomenclature, the following was run within the grub shell from the running system:

  root (hdN,0)
  setup (hdN)

to make each disk's MBR look to its own first partition for grub (but does this point to another disk if the disks all move?). It was found that removing the SATA lead to the first disk still resulted in a happy boot and mount; then removing the lead to the last disk as well /still/ permitted this.
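(A disk failure can also be exercised purely in software before pulling any cables; a sketch, with an arbitrarily chosen member device, not part of the test actually performed:

  mdadm --manage /dev/md2 --fail /dev/sdd4      # mark the member as faulty
  mdadm --manage /dev/md2 --remove /dev/sdd4    # take it out of the array
  # ... later, with the disk reconnected or replaced and re-partitioned:
  mdadm --manage /dev/md2 --add /dev/sdd4       # re-add it; a resync follows
)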