2007-10-30. The shell scripts of the `batching' set (made for PDfem simulations) run a `local_job_coordinator' for each CPU of each host, whose job is to find, assert ownership of, and run the next numbered job that isn't already taken. In order to avoid more than one host/CPU getting a job, the NFS mounts are done with no attribute caching, and the coordinator, in versions up to the current date, does the following:

  if [ -f JOBNUMBER.begun ]; then continue; fi
  touch JOBNUMBER.begun
  echo "`hostname`,$instance `date +'%F %H:%M:%S %Z'`" >JOBNUMBER.begun
  host_that_won="` head -n1 "$logs.begun" | cut -d' ' -f1 `"
  if [ "$host_that_won" != "`hostname`,$instance" ]; then continue; ...; fi

This was seen to fail in practice -- two instances of the script took the same job. This is not surprising: the third line has a definite bug, where I remember intending an append rather than a truncate (>>, not >). As it stands, the code is not much better than its first line alone. With an append there is a hope that, as long as the NFS server really is queried immediately (hence the mounts without attribute caching), the first line of the file identifies a single winner, which the later head/cut comparison then detects.

Some tests were made using several hosts with a shared NFS directory, with scripts on each host trying to acquire the rights to each number in a sequence of some 8000. With close timing, the use of >> for the initial write was seen to reduce the interference between scripts very much, but not to eliminate it. The best case achieved was 16 duplicates in 8192 with tight timing (a high chance of problems), compared with, for example, ~600 for the > case with quite loose timing, and ~6000 for the > case with tight timing!

-------------------------------------------------------------------------

The correction was made (>>), with the addition of a chmod a-w. A simple example, without the initial check for the existence of the file, was tested by running two instances on each of three hosts simultaneously, doing ~8000 values without any further delay such as a job to run:

  inst=1
  for i in `seq -w 1 8192` ; do
    echo "`hostname`_$inst, `date`" >>$i
    chmod a-w $i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" == "`hostname`_$inst" ]; then success="1"; else success="0"; fi
    echo "$i: $success" >>`hostname`_$inst.log
  done

The result logfiles were examined by:

  grep ': 1' *.log >success
  for n in `seq -w 0 8192` ; do
    c="`grep $n success | wc -l`"
    if [ $c -ne 1 ]; then echo "$n, $c times"; fi
  done

It was seen that each job was acquired by exactly one host,instance. (It later appeared that this was only because the host,instances did not start each number very tightly simultaneously -- see the results further down.) Then the original form from the script was meant to be tried for comparison, but this was badly done: the lack of an initial -f $i test caused inevitable overwriting.

  inst=1
  for i in `seq -w 1 8192` ; do
    touch $i
    echo "`hostname`_$inst, `date`" >$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then success="0"; else success="1"; fi
    echo "$i: $success" >>`hostname`_$inst.log
  done

Being run on 3 hosts, 2 instances each, this gave 6 cases of "success" for each job-number: all got all. That is itself interesting, as the timing can't have been very critical, or another instance's write would sometimes have fallen between one instance's write and its read.
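
An aside on the `no attribute caching' mounts mentioned at the top: the actual mount line isn't recorded in these notes, but on Linux the relevant NFS mount option is `noac', so the mount presumably looks something like the following (server name and paths are placeholders only):

  # hypothetical example: `noac' disables client-side attribute caching, so that
  # checks such as the -f existence test are revalidated against the server
  mount -t nfs -o noac fileserver:/export/jobs /mnt/jobs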
The initial existence test was added, leaving the same form as the original script:

  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    touch $i
    echo "`hostname`_$inst, `date`" >$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then success="0"; else success="1"; fi
    echo "$i: $success" >>`hostname`_$inst.log
  done

This resulted in a lot of duplicates: 596 double jobs and 3 triples. Then the original form modified only by >> instead of > was tried (i.e. no chmod a-w, but with the initial existence test):

  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >>$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then echo "$i: 0"; else echo "$i: 1"; fi
  done >`hostname`_$inst.log

This gave 22 cases of a double attempt at a job, with the double attempts quite evenly distributed. To give an even better chance of problems being seen, the same principle was tried with a slight delay on successful acquisition (so that one host,instance doesn't get ahead and stay ahead), and all the instances were started by waiting until a certain time rather than by trying to press Enter in 6 terminals at once:

  until [ `date +%S` == "00" ]; do true ; done
  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >>$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then
      echo "$i: 0"
    else
      echo "$i: 1"
      grep sbin /etc/passwd | sort -k2 | wc >/dev/null
    fi
  done >`hostname`_$inst.log

Looking at the number of host,instances that wrote to each file ($i) -- anything other than 1 is a failure of timing in the existence test, potentially recoverable by the winner test -- with

  for n in 0 1 2 3 4 5 6; do echo -n "$n: "; wc -l * | grep ' '$n' ' | wc -l ; done

gave:

  0: 0
  1: 319
  2: 1773
  3: 1280
  4: 2870
  5: 950
  6: 1000

which is good, as it suggests very close timing, i.e. plenty of chance of problems showing up. The result of the standard test for cases that slipped through was that 24 cases had a double attempt at running a job.

With a reinsertion of the chmod a-w statement, the same was tried again:

  until [ `date +%S` == "00" ]; do true ; done
  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >>$i
    chmod a-w $i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then
      echo "$i: 0"
    else
      echo "$i: 1"
      grep sbin /etc/passwd | sort -k2 | wc >/dev/null
    fi
  done >`hostname`_$inst.log

This gave 16 duplicate cases, not strongly distinguished from the case without chmod. This is not good! But, given the strong efforts to cause timing problems, and the relatively slow pace of, e.g., pdfem, it is probably not going to give a real-life error in many, many runs. Compare the case with > instead of >>, where there were ~600 multiple cases even without the tighter timing (without the tighter timing, the >> method behaved `perfectly').

Now the simple > case was tried, this time with tight timing:

  until [ `date +%S` == "00" ]; do true ; done
  inst=1
  for i in `seq -w 1 8192` ; do
    if [ -f $i ]; then continue; fi
    echo "`hostname`_$inst, `date`" >$i
    winner="`head -n1 $i | cut -d, -f1`"
    if [ "$winner" != "`hostname`_$inst" ]; then
      echo "$i: 0"
    else
      echo "$i: 1"
      grep sbin /etc/passwd | sort -k2 | wc >/dev/null
    fi
  done >`hostname`_$inst.log

This gave 5744 multiple cases (!), broken down by multiplicity as:

  2: 3302
  3: 2112
  4: 315
  5: 14
  6: 0
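
For reference, a consolidated sketch of the acquisition step as these tests leave it -- append, then chmod a-w, then check the first line. This follows the JOBNUMBER/$instance naming of the paraphrase at the top and is only a sketch, not a verbatim extract from the coordinator; the touch is dropped on the assumption that the append itself creates the file.

  if [ -f JOBNUMBER.begun ]; then continue; fi
  # append rather than truncate, so the first line written survives
  echo "`hostname`,$instance `date +'%F %H:%M:%S %Z'`" >>JOBNUMBER.begun
  chmod a-w JOBNUMBER.begun
  # the first space-separated field of the first line names the winner
  host_that_won="`head -n1 JOBNUMBER.begun | cut -d' ' -f1`"
  if [ "$host_that_won" != "`hostname`,$instance" ]; then continue; fi
  # ...otherwise this host,instance owns the job and runs it

As the tests above show, this is a mitigation rather than a guarantee: under deliberately tight timing it still let through on the order of 16 duplicates in 8192.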