2010-05-04.

3 new Western Digital 1TB disks, with the warning about a 4096-byte format.

root@temp ~ # smartctl -i /dev/sdb
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WMAV50931376
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 4 01:59:53 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@temp ~ # smartctl -i /dev/sdc
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WMAV50928745
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 4 01:59:54 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@temp ~ # smartctl -i /dev/sdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WMAV50931151
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 4 01:59:55 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Used fdisk with the command 'u' immediately, to switch to sectors.
Chose 64 (multiple of 8) instead of the default 63, as the starting sector. The 'u' and the multiples of 8 apparently are needed because the kernel doesn't report the physical block size properly for the sake of userspace programs (or do they just not use it anyway?):

cat /sys/block/sdb/queue/physical_block_size
512

Tried a little testing, with reiserfs (not so time-consuming to mkfs as ext3):

fdisk u 64--end
mkfs.reiserfs /dev/sdb1
mount -t reiserfs -o notail /dev/sdb1 /mnt/tmp
cd /mnt/tmp
chown user .
bonnie++ -d . -c 4 -s 1000 -u user -g users

Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   268  95 85358  46 46841  24  1169  98 205552  36  2067  32
Latency                 540ms     545ms     210ms   39167us   12979us     156ms
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 27715  94 +++++ +++ 23670  94 27574  96 +++++ +++ 23543 100
Latency                 431us      56us   17625us     347us     112us     159us
1.93c,1.93c,temp,4,1272925286,1000M,,268,95,85358,46,46841,24,1169,98,205552,36,2067,32,16,,,,,27715,94,+++++,+++,23670,94,27574,96,+++++,+++,23543,100,540ms,545ms,210ms,39167us,12979us,156ms,431us,56us,17625us,347us,112us,159us

Unmount, re-do but starting at sector 63 instead of 64, i.e. not aligned suitably:

Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   303  99 46144  28 24846  13  1157  97 209638  37 210.3   3
Latency               65366us     866ms    3579ms   84496us   23928us     126ms
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 29227  95 +++++ +++ 23024  89 24312  82 +++++ +++ 21213  90
Latency                 169us      97us   71516us     437us     108us   73389us
1.93c,1.93c,temp,4,1272925468,1000M,,303,99,46144,28,24846,13,1157,97,209638,37,210.3,3,16,,,,,29227,95,+++++,+++,23024,89,24312,82,+++++,+++,21213,90,65366us,866ms,3579ms,84496us,23928us,126ms,169us,97us,71516us,437us,108us,73389us

So, about half the block write and rewrite speeds with 63 rather than 64 (46144 vs 85358 and 24846 vs 46841 K/sec), and about a tenth of the random seeks (210.3 vs 2067 /sec). Sequential reads are unaffected.

Set all three new disks to have a single partition, starting at sector 64, type fd (linux raid autodetect). Then make raid and ext3 fs, with suitably matched options:

# mdadm --create /dev/md0 --chunk=32 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: array /dev/md0 started.
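The gap between starting at sector 63 and sector 64 is what simple alignment arithmetic would predict, assuming the disks really do use 4096-byte physical sectors behind 512-byte logical ones (as the warning label suggests, and despite what /sys reports). A minimal sketch of that check follows; the helper name is_aligned is made up for illustration and nothing is read from the disk itself.

```shell
#!/bin/sh
# Sketch: is a partition's starting sector aligned to the physical sector size?
# ASSUMPTION: 512-byte logical sectors and 4096-byte physical sectors
# (the "4096-byte format" on the label); neither value is queried from hardware.
LOGICAL=512
PHYSICAL=4096

# is_aligned START_SECTOR  (hypothetical helper)
# Prints "aligned" if the partition's byte offset is a multiple of PHYSICAL.
is_aligned() {
    offset=$(( $1 * LOGICAL ))
    if [ $(( offset % PHYSICAL )) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

for start in 63 64; do
    echo "sector $start: byte offset $(( start * LOGICAL )), $(is_aligned "$start")"
done
```

Under those assumptions, sector 63 sits at byte 32256, which is not a multiple of 4096, so every 4096-byte filesystem block straddles two physical sectors. Sector 64 sits at byte 32768, exactly eight physical sectors in.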
# mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 96e02876:7bd62c71:f331e46f:85a886cd
  Creation Time : Tue May 4 01:22:10 2010
     Raid Level : raid5
  Used Dev Size : 976762432 (931.51 GiB 1000.20 GB)
     Array Size : 1953524864 (1863.03 GiB 2000.41 GB)
   Raid Devices : 3
  Total Devices : 4
Preferred Minor : 0

    Update Time : Tue May 4 01:22:10 2010
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1
       Checksum : 6b37bcb - correct
         Events : 1

         Layout : left-symmetric
     Chunk Size : 32K

      Number   Major   Minor   RaidDevice State
this     0       8       17        0      active sync   /dev/sdb1

   0     0       8       17        0      active sync   /dev/sdb1
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       0        0        2      faulty
   3     3       8       49        3      spare   /dev/sdd1

# time mkfs.ext3 -L home -b 4096 -E stride=8,stripe-width=16 -m 2 /dev/md0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=home
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
122101760 inodes, 488381216 blocks
9767624 blocks (2.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
real 9m54.595s
user 0m1.320s
sys 1m35.178s

mdadm --create /dev/md0 --chunk=16 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm --create /dev/md1 --chunk=64 --level=5 --raid-devices=3 /dev/sdb2 /dev/sdc2 /dev/sdd2
time mkfs.ext3 -L home -b 4096 -E stride=4,stripe-width=8 -m 2 /dev/md0
7s real
mount -t ext3 /dev/md0 /d

This was done with the rebuild (making raid) process still going on for md1. Now about 1s for 100mkdirs.

root@temp /d # bonnie++ -d . -c 4 -s 1000 -u user -g users
Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   314  95 77804  38 57694  28  1075  90 184195  38  1322  21
Latency                 200ms    1614ms     359ms   58775us   74513us   80605us
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 24360  71 +++++ +++ +++++ +++ 28004  74 +++++ +++ +++++ +++
Latency               14845us     444us     271us    9847us     102us      99us
1.93c,1.93c,temp,4,1272960542,1000M,,314,95,77804,38,57694,28,1075,90,184195,38,1322,21,16,,,,,24360,71,+++++,+++,+++++,+++,28004,74,+++++,+++,+++++,+++,200ms,1614ms,359ms,58775us,74513us,80605us,14845us,444us,271us,9847us,102us,99us

root@temp ~ # for d in /dev/sd[abcd] /dev/md0 ; do hdparm -tT $d ; hdparm -tT $d ; done

/dev/sda:
 Timing cached reads: 1788 MB in 2.00 seconds = 893.70 MB/sec
 Timing buffered disk reads: 226 MB in 3.01 seconds = 75.19 MB/sec

/dev/sda:
 Timing cached reads: 1742 MB in 2.00 seconds = 871.45 MB/sec
 Timing buffered disk reads: 226 MB in 3.01 seconds = 74.98 MB/sec

/dev/sdb:
 Timing cached reads: 1778 MB in 2.00 seconds = 889.40 MB/sec
 Timing buffered disk reads: 320 MB in 3.01 seconds = 106.39 MB/sec

/dev/sdb:
 Timing cached reads: 1794 MB in 2.00 seconds = 897.16 MB/sec
 Timing buffered disk reads: 314 MB in 3.01 seconds = 104.42 MB/sec

/dev/sdc:
 Timing cached reads: 1760 MB in 2.00 seconds = 879.62 MB/sec
 Timing buffered disk reads: 322 MB in 3.01 seconds = 106.99 MB/sec

/dev/sdc:
 Timing cached reads: 1780 MB in 2.00 seconds = 889.82 MB/sec
 Timing buffered disk reads: 318 MB in 3.01 seconds = 105.52 MB/sec

/dev/sdd:
 Timing cached reads: 1758 MB in 2.00 seconds = 878.92 MB/sec
 Timing buffered disk reads: 322 MB in 3.01 seconds = 106.86 MB/sec

/dev/sdd:
 Timing cached reads: 1802 MB in 2.00 seconds = 900.75 MB/sec
 Timing buffered disk reads: 320 MB in 3.00 seconds = 106.64 MB/sec

/dev/md0:
 Timing cached reads: 1754 MB in 2.00 seconds = 876.88 MB/sec
 Timing buffered disk reads: 506 MB in 3.00 seconds = 168.40 MB/sec

/dev/md0:
 Timing cached reads: 1798 MB in 2.00 seconds = 899.03 MB/sec
 Timing buffered disk reads: 500 MB in 3.00 seconds = 166.64 MB/sec

root@temp ~ # bonnie++ -d . -c 4 -s 1000 -u user -g users
Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   281  97 60054  37 40173  24   991  97 115120  24  1628  25
Latency                 161ms     432ms     206ms   42901us   97951us     226ms
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 18140  57 +++++ +++ 25849  56 18729  61 +++++ +++ 32239  70
Latency               14219us     483us     488us   26726us      37us    2089us
1.93c,1.93c,temp,4,1272968488,1000M,,281,97,60054,37,40173,24,991,97,115120,24,1628,25,16,,,,,18140,57,+++++,+++,25849,56,18729,61,+++++,+++,32239,70,161ms,432ms,206ms,42901us,97951us,226ms,14219us,483us,488us,26726us,37us,2089us

Then change the ext3 for reiserfs (mounted notail), and repeat:

Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   269  99 88078  42 42008  27  1260  98 158447  36  1838  43
Latency               75488us     157ms     816ms   49056us   84401us     119ms
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20467  73 +++++ +++ 18383  89 22463  80 +++++ +++ 20050  89
Latency                 137us     118us   24892us     875us     106us     252us
1.93c,1.93c,temp,4,1272957586,1000M,,269,99,88078,42,42008,27,1260,98,158447,36,1838,43,16,,,,,20467,73,+++++,+++,18383,89,22463,80,+++++,+++,20050,89,75488us,157ms,816ms,49056us,84401us,119ms,137us,118us,24892us,875us,106us,252us

root@temp /d # bonnie -d .
File './Bonnie.11147', size: 104857600
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          100 33046 99.5 329895 96.7 531594 91.4 35829 99.4 1692701 99.2 77737.8 194.4

root@temp /d # bonnie -d . -s 2000 -m temp_reiser_r5n3
File './Bonnie.11153', size: 2097152000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
temp_rei 2000 34149 97.7 91504 48.2 49780 29.4 30951 92.6 164055 48.9 359.5  2.6

root@temp /d # cat /dev/zero >f &
[1] 11221
root@temp /d # cat f >/dev/null &
[2] 11238
root@temp /d # ls -sh
total 1.1G
1.1G f
root@temp /d # time mkdir `seq -w 1 3000`

real 0m0.304s
user 0m0.032s
sys 0m0.116s

root@temp / # mkfs.ext3 /dev/md0

root@temp /d # bonnie++ -d . -c 4 -s 1000 -u user -g users
Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   311  93 83247  42 50258  28  1265  99 160530  31  1664  31
Latency                 558ms     369ms     138ms   32330us   52592us     122ms
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 15080  50 +++++ +++ 31524  70 16930  53 +++++ +++ 25929  56
Latency               25262us     457us     265us   23886us     151us     214us
1.93c,1.93c,temp,4,1272957896,1000M,,311,93,83247,42,50258,28,1265,99,160530,31,1664,31,16,,,,,15080,50,+++++,+++,31524,70,16930,53,+++++,+++,25929,56,558ms,369ms,138ms,32330us,52592us,122ms,25262us,457us,265us,23886us,151us,214us

root@temp /d # bonnie -d . -s 2000 -m temp_reiser_r5n3
File './Bonnie.11318', size: 2097152000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
temp_rei 2000 33248 98.7 91963 44.9 51580 24.4 32650 91.9 158280 31.9 359.2  1.6

Now ext3 and reiserfs are similar (ext3 was /worse/ when made with the 'proper' raid-related size options). ext3 is still amazingly slow for mkdir `seq -w 1 3000`: 1m30s--2m, as opposed to a second or so.

Tried some options from man mount:

mount -t ext3 -o data=ordered,barrier=1,commit=1 /dev/md0 /d

Fairly consistent 3.5s per 100 mkdirs. With a background read/write,

cat /dev/zero >f &
cat f >/dev/null &

running during the mkdirs: about 6s per 100 mkdirs, again quite consistently.

There are some webpages about mdraid+ext3+~2TB and many-second waits for mkdir. They suggest many directory entries in the parent as a cause (clearly not so here, but we can't yet claim to have seen a /single/ mkdir hanging for long).

-----------------------------------------------------------------------------------------

That was all a little unsatisfactory. In the end, ext4 was considered a good candidate, apparently having the main good points of ext3 for stability, but avoiding some troubles including multi-second waits with certain (data-)cautious settings.

The two-partition scheme, /home and /home/public, was kept anyway, even if ext4 might have made a single big partition work efficiently even with lots of mkdirs and so on: the /home can be focused on smaller files, with modest total size per user (perhaps ~2GB) and quotas to protect them from each other (perhaps 5GB?) within a filesystem of just some 30GB; the /home/public can be aimed at bigger files, without any quotas.
mdadm --create /dev/md0 --chunk=16 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
mkfs.ext4 -L public -b 4096 -E stride=4,stripe-width=8 /dev/md0
tune2fs -c 0 -i 0 -m 2 -e remount-ro -o user_xattr,acl /dev/md0

mdadm --create /dev/md1 --chunk=64 --level=5 --raid-devices=3 /dev/sdb2 /dev/sdc2 /dev/sdd2
mkfs.ext4 -L public -b 4096 -E stride=16,stripe-width=32 /dev/md1
tune2fs -c 0 -i 0 -m 2 -e remount-ro -o user_xattr,acl /dev/md1

bonnie++ -d . -c 4 -s 1000 -u user -g users
Version  1.93c      ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
temp          1000M   312  99 83775  29 73256  24  1170  97 209158  36  2237  33
Latency               50638us     230ms   53220us   33089us   90416us   57895us
Version  1.93c      ------Sequential Create------ --------Random Create--------
temp                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 18418  75 +++++ +++ 27908  70 19678  82 +++++ +++ 28178  72
Latency                 505us     413us     257us     496us      15us     199us
1.93c,1.93c,temp,4,1272992893,1000M,,312,99,83775,29,73256,24,1170,97,209158,36,2237,33,16,,,,,18418,75,+++++,+++,27908,70,19678,82,+++++,+++,28178,72,50638us,230ms,53220us,33089us,90416us,57895us,505us,413us,257us,496us,15us,199us

Entries in /etc/fstab:

/dev/md0 /home ext4 errors=remount-ro,usrquota,user_xattr,acl,data=ordered,journal_checksum,barrier=1,stripe=8 0 2
#/dev/md1 /home/public ext4 errors=remount-ro,user_xattr,acl,data=ordered,journal_checksum,barrier=1,stripe=32 0 2
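
The stride and stripe-width values given to mkfs in these notes follow the rule of thumb from the mke2fs man page: stride is the RAID chunk size counted in filesystem blocks, and stripe-width is stride times the number of data-bearing disks (for RAID5, one fewer than the number of members). A small sketch of that arithmetic follows; the function names are invented here for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch: derive mkfs -E stride/stripe-width values for an md RAID5 array.
#   stride       = chunk size / filesystem block size   (in fs blocks)
#   stripe-width = stride * data disks; RAID5 has (members - 1) data disks
# The function names below are hypothetical helpers, not real commands.

# raid5_stride CHUNK_KIB BLOCK_BYTES
raid5_stride() {
    echo $(( $1 * 1024 / $2 ))
}

# raid5_stripe_width CHUNK_KIB BLOCK_BYTES MEMBERS
raid5_stripe_width() {
    stride=$(raid5_stride "$1" "$2")
    echo $(( stride * ($3 - 1) ))
}

# The three chunk sizes tried in these notes, 4096-byte blocks, 3 members.
for chunk in 16 32 64; do
    echo "chunk=${chunk}K: stride=$(raid5_stride "$chunk" 4096),stripe-width=$(raid5_stripe_width "$chunk" 4096 3)"
done
```

For the three-disk arrays here this gives stride=4,stripe-width=8 for the 16K chunk, 8 and 16 for 32K, and 16 and 32 for 64K, matching the mkfs options used. The ext4 stripe= mount options (8 and 32) are the same stripe-width figures.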