2011-03-27. Ever since the newly compiled systems were installed on serv and box (at home) in September 2009, there has been some sort of trouble with what were once highly responsive NFS mounts of /home and /home/public (audio, video, print, etc). On the server these are RAID arrays of several (4, 5) newish disks, each capable of some 80MB/s continuous read speed, and performing very well locally on the server; pure reading from clients was also fine, and fast. The filesystem was initially ext3. In that first year the real annoyance was that a mkdir operation would sometimes take ages (seconds or tens of seconds). Several kernels were tried, along with various disk mount and NFS mount settings.

The array size was about 1.8TB from 2009-09; then, on changing the hardware in 2010-05 (new server motherboard, new disks giving a bigger array), there was a 2.7TB /home/public and a much smaller (50GB) /home. The ext4 filesystem was used, supposedly with better performance on bigger disks (and sometimes mentioned as avoiding this type of delay). Indeed the mkdir delay wasn't noticed as a problem after the change to ext4 (though perhaps splitting /home from the less frequently mkdir'ed /home/public was part of this?).

But still, attempts at continuous writing caused long hangs of the whole NFS mount from a Linux client. Copying photos from a camera to /home/public/scratch; copying some pictures between /home/$USER and /home/public/scratch/; doing video-file muxing; ... all were peppered with very predictable, /very/ disturbing delays in which every program trying to read or write those filesystems was just frozen, waiting, sometimes for as long as a minute. It always worked in the end. When logged in to the server itself everything was ok: no trouble listing the affected directories and filesystems. Use of FreeBSD (around autumn 2010) as the client suggested that the problem was in the Linux client: the BSD system showed no such problem.
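The mkdir delays described above are easy to put numbers on. The following is only a minimal sketch, not something that was used at the time: it times a handful of mkdir calls in a given directory and reports the worst case. The function names, the probe directory naming and the 1-second "slow" threshold are all assumptions made for illustration.

```python
import os
import sys
import tempfile
import time


def time_mkdirs(parent, count=20):
    """Create and remove `count` scratch directories under `parent`,
    returning the time in seconds that each mkdir call took."""
    timings = []
    for i in range(count):
        # Hypothetical probe name; the pid keeps concurrent runs apart.
        path = os.path.join(parent, f".mkdir-probe-{os.getpid()}-{i}")
        start = time.monotonic()
        os.mkdir(path)
        timings.append(time.monotonic() - start)
        os.rmdir(path)  # clean up; not included in the timing
    return timings


def summarise(timings, slow_threshold=1.0):
    """Return (worst, mean, number of calls slower than the threshold)."""
    worst = max(timings)
    mean = sum(timings) / len(timings)
    slow = sum(1 for t in timings if t > slow_threshold)
    return worst, mean, slow


if __name__ == "__main__":
    # Pass the NFS-mounted directory to test; falls back to the local
    # temp directory, which is only useful as a baseline.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    worst, mean, slow = summarise(time_mkdirs(target))
    print(f"{target}: worst {worst:.3f}s, mean {mean:.3f}s, {slow} slower than 1s")
```

Running it once against a directory on the NFS mount from the client, and once against the same directory while logged in to the server, would give a rough client/server comparison of the kind described in the text.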
On being annoyed again today (2011-03) this was one thing found:
http://ubuntuforums.org/archive/index.php/t-1478413.html
It was not very clear what problem was being solved there, but the interesting thing was that an NFS patch had been made that was seen not to be in serv's kernel yet. serv's linux-2.6.32 was thus upgraded to linux-2.6.38, but this apparently caused a lock-up of the whole /system/ (i.e. very extreme) within minutes (loaded again with the muxing operation), and then again soon after a reboot. A linux-2.6.37 (known to be good on several desktop systems) was used instead. One other difference was in the ext4 mount options on serv: data=writeback was used instead of the default data=ordered for the /home/public and /home arrays.

Then, everything worked beautifully from the same client (which hadn't even been turned off during all this fooling, but was just waiting for its NFS server to return...). A reboot with the same 2.6.37 kernel but with data=ordered was expected to revert to the bad state (upgrading kernels had become something of a forlorn effort), but in fact it still worked fine.

One is thus persuaded that:

* something to do with ext4/NFS between 2.6.32 and 2.6.37 has corrected a very annoying and long-standing problem that turned up sometime around 2009 (2.6.2[678] ?)
* although things like the BSD/Linux comparison suggested it was the client, it seems it was on the server side (even if needing particular client behaviour to stimulate it).

Never assume much... Check everything. Vary extra parameters even if not expected to do anything.

FINALLY! Linux network servers/clients can again be used as one would expect, instead of being a multi-year downright embarrassment.

----

Looking around a bit (Google on 'ext4 long delays nfs'), some further material was found about reducing long delays during heavy NFS load, but it was not clear which items were related to exactly this problem.
Something perhaps also of use, but presumably not yet included in the later kernel (2.6.37) that did actually work ok:
http://www.spinics.net/lists/linux-fsdevel/msg42637.html

Actual (permanent) freeze of system, after a certain amount of writing:
https://bugzilla.redhat.com/show_bug.cgi?id=531493

There are several other references (found via the ubuntu link in the main text, above) about NFS regressions that were fixed around 2.6.33 and its immediate descendants.
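Since the main text varied the ext4 journalling mode (data=ordered against data=writeback) between reboots, it may help to confirm which mode is really in effect before drawing conclusions. Below is a small sketch, under stated assumptions, that reads /proc/mounts-style text and reports the data= option of each ext4 mount. The function names are made up for illustration, and a mount with no data= option listed is assumed to be using the filesystem default, since some kernels do not list default options there.

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into
    (device, mountpoint, fstype, options) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, opts = fields[:4]
        entries.append((device, mountpoint, fstype, opts.split(",")))
    return entries


def ext4_data_modes(text):
    """Map each ext4 mount point to its data= mode, or None when no
    data= option is listed (assumed here to mean the default applies)."""
    modes = {}
    for _device, mountpoint, fstype, opts in parse_mounts(text):
        if fstype != "ext4":
            continue
        mode = None
        for opt in opts:
            if opt.startswith("data="):
                mode = opt[len("data="):]
        modes[mountpoint] = mode
    return modes


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        found = ext4_data_modes(f.read())
    for mountpoint, mode in sorted(found.items()):
        print(mountpoint, mode or "(no data= option listed)")
```

Run on the server after each reboot, this gives a record of which setting was actually active alongside the kernel version being tested.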