2007-10-30. The shell scripts of the `batching' set (made for PDfem simulations) run a `local_job_coordinator' for each CPU of each host, whose job is to find, assert ownership of, and run the next numbered job that isn't already taken. In order to avoid more than one host/CPU getting a job, the NFS mounts are done with no attribute caching, and the coordinator, in versions up to the current date, does the following:

  if [ -f JOBNUMBER.begun ]; then continue; fi
  touch JOBNUMBER.begun
  echo "`hostname`,$instance `date +'%F %H:%M:%S %Z'`" >JOBNUMBER.begun
  host_that_won="` head -n1 "$logs.begun" | cut -d' ' -f1 `"
  if [ "$host_that_won" != "`hostname`,$instance" ]; then continue; ...; fi

This was seen to fail in practice -- two instances of the script took the same job. This is not surprising: the third line has a definite bug, where I remember intending an append rather than a truncate (>>, not >). As it stands, the code is not much better than its first line alone. With an append there is a hope that, as long as the NFS server really is queried immediately (hence the mounts without attribute caching), the first line of the file identifies a single winner, which the later head/cut comparison then detects.

Some tests were made using several hosts with a shared NFS directory, with scripts on each host trying to acquire the rights to each number in a sequence of some 8000. With close timing, the use of >> for the initial write was seen to reduce the interference between scripts very much, but not to eliminate it. The best case achieved was 16 duplicates in 8192 with tight timing (a high chance of problems), compared with, for example, ~600 for the > case with quite loose timing, and ~6000 for the > case with tight timing!

-------------------------------------------------------------------------

The correction was made (>>), with the addition of a chmod a-w. A simple example, without the initial check for the existence of the file, was tested by running two instances on each of three hosts simultaneously, doing ~8000 values without any further delay such as a job to run:

  inst=1
  for i in `seq -w 1 8192` ; do
    echo "`hostname`_$inst, `date`" >>$i
    chmod a-w $i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" == "`hostname`_$inst" ]; then success="1"; else success="0"; fi
    echo "$i: $success" >>`hostname`_$inst.log
  done

The result logfiles were examined by:

  grep ': 1' *.log >success
  for n in `seq -w 0 8192` ; do
    c="`grep $n success | wc -l`"
    if [ $c -ne 1 ]; then echo "$n, $c times"; fi
  done

It was seen that each job was acquired by exactly one host,instance. (It later appeared that this was only because the host,instances did not start each number very tightly simultaneously -- see the results further down.) Then the original form from the script was meant to be tried for comparison, but this was badly done: the lack of an initial -f $i test caused inevitable overwriting.

  inst=1
  for i in `seq -w 1 8192` ; do
    touch $i
    echo "`hostname`_$inst, `date`" >$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then success="0"; else success="1"; fi
    echo "$i: $success" >>`hostname`_$inst.log
  done

Being run on 3 hosts, 2 instances each, this gave 6 cases of "success" for each job-number: all got all. That is itself interesting, as the timing can't have been very critical, or another instance's write would sometimes have fallen between one instance's write and its read.
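
An aside on the `no attribute caching' mounts mentioned at the top: the actual mount line isn't recorded in these notes, but on Linux the relevant NFS mount option is `noac', so the mount presumably looks something like the following (server name and paths are placeholders only):

  # hypothetical example: `noac' disables client-side attribute caching, so that
  # checks such as the -f existence test are revalidated against the server
  mount -t nfs -o noac fileserver:/export/jobs /mnt/jobs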
The initial existence test was added, leaving the same form as the original script:

  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    touch $i
    echo "`hostname`_$inst, `date`" >$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then success="0"; else success="1"; fi
    echo "$i: $success" >>`hostname`_$inst.log
  done

This resulted in a lot of duplicates: 596 double jobs and 3 triples. Then the original form modified only by >> instead of > was tried (i.e. no chmod a-w, but with the initial existence test):

  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >>$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then echo "$i: 0"; else echo "$i: 1"; fi
  done >`hostname`_$inst.log

This gave 22 cases of a double attempt at a job, with the double attempts quite evenly distributed. To give an even better chance of problems being seen, the same principle was tried with a slight delay on successful acquisition (so that one host,instance doesn't get ahead and stay ahead), and all the instances were started by waiting until a certain time rather than by trying to press Enter in 6 terminals at once:

  until [ `date +%S` == "00" ]; do true ; done
  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >>$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then
      echo "$i: 0"
    else
      echo "$i: 1"
      grep sbin /etc/passwd | sort -k2 | wc >/dev/null
    fi
  done >`hostname`_$inst.log

Looking at the number of host,instances that wrote to each file ($i) -- anything other than 1 is a failure of timing in the existence test, potentially recoverable by the winner test -- with

  for n in 0 1 2 3 4 5 6; do echo -n "$n: "; wc -l * | grep ' '$n' ' | wc -l ; done

gave:

  0: 0
  1: 319
  2: 1773
  3: 1280
  4: 2870
  5: 950
  6: 1000

which is good, as it suggests very close timing, i.e. plenty of chance of problems showing up. The result of the standard test for cases that slipped through was that 24 cases had a double attempt at running a job.

With a reinsertion of the chmod a-w statement, the same was tried again:

  until [ `date +%S` == "00" ]; do true ; done
  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >>$i
    chmod a-w $i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then
      echo "$i: 0"
    else
      echo "$i: 1"
      grep sbin /etc/passwd | sort -k2 | wc >/dev/null
    fi
  done >`hostname`_$inst.log

This gave 16 duplicate cases, not strongly distinguished from the case without chmod. This is not good! But, given the strong efforts to cause timing problems, and the relatively slow pace of, e.g., pdfem, it is probably not going to give a real-life error in many, many runs. Compare the case with > instead of >>, where there were ~600 multiple cases even without the tighter timing (without the tighter timing, the >> method behaved `perfectly').

Now the simple > case was tried, this time with tight timing:

  until [ `date +%S` == "00" ]; do true ; done
  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then
      echo "$i: 0"
    else
      echo "$i: 1"
      grep sbin /etc/passwd | sort -k2 | wc >/dev/null
    fi
  done >`hostname`_$inst.log

This gave 5744 multiple cases (!), broken down by multiplicity as:

  2: 3302
  3: 2112
  4: 315
  5: 14
  6: 0
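
For reference, a consolidated sketch of the acquisition step as these tests leave it -- append, then chmod a-w, then check the first line. This follows the JOBNUMBER/$instance naming of the paraphrase at the top and is only a sketch, not a verbatim extract from the coordinator; the touch is dropped on the assumption that the append itself creates the file.

  if [ -f JOBNUMBER.begun ]; then continue; fi
  # append rather than truncate, so the first line written survives
  echo "`hostname`,$instance `date +'%F %H:%M:%S %Z'`" >>JOBNUMBER.begun
  chmod a-w JOBNUMBER.begun
  # the first space-separated field of the first line names the winner
  host_that_won="`head -n1 JOBNUMBER.begun | cut -d' ' -f1`"
  if [ "$host_that_won" != "`hostname`,$instance" ]; then continue; fi
  # ...otherwise this host,instance owns the job and runs it

As the tests above show, this is a mitigation rather than a guarantee: under deliberately tight timing it still let through on the order of 16 duplicates in 8192.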