Addendum: the quick fix.

Given an input jpg "in.jpg" containing single-column text (more args are
needed if the input is two pages per sheet etc. -- see later):

convert in.jpg tmp.ppm
unpaper tmp.ppm tmp_.ppm
convert tmp_.ppm tmp.tif
tesseract tmp.tif out -l eng

This produces a file "out.txt" of the OCR'd text. The commands are:
"convert" from ImageMagick, "unpaper" (a separate package of the same
name), and "tesseract" (an OCR program associated with Google).

Do similarly for pdf input, but instead of starting with

convert in.jpg tmp.ppm

use

pdftoppm <in.pdf >tmp.ppm

(pdftoppm is from the poppler-utils package). Put this in a loop for
whole documents, using e.g. pdftk in.pdf burst to separate a multi-page
pdf file into single pages.

--------------------------------------------------------------

2010-07-18. The original file.

http://www.linux.com/archive/articles/138511
"How to scan and OCR like a pro with open source tools" (2008)

# check what scanners are seen
scanimage -L

# show options for a particular scanner
scanimage --help --device 'snapscan:libusb:001:010'

# get an image (-l -t are distances [mm] from top left; -x -y are page sizes)
scanimage --device 'snapscan:libusb:001:010' --format=pnm --mode=Gray \
    --resolution 300 -l 10 -t 10 -x 100 -y 150 \
    --brightness -20 --contrast 15 >01.pnm

# this is suggested for splitting double pages, aligning, sorting out
# edges, etc.; see unpaper --help for the many details of rotation,
# 2=>1, 1=>2, etc.
unpaper 01.pnm 01_.pnm

# convert to a format suitable for tesseract
convert 01_.pnm 01.tif

That webpage used all of gocr, Ocrad, and Tesseract-OCR, but found gocr
and Ocrad to make many mistakes. I haven't tried Ocrad, but from the
website's samples it looks almost as awful as gocr -- and the website's
gocr output was possibly rather /better/ than the junk I've ever got out
of gocr myself. Tesseract was clearly vastly superior, so I'll ignore
the others here.
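The quick-fix pipeline above can be wrapped in a small reusable function.
This is only a sketch assuming convert, unpaper, and tesseract are on the
PATH; the DRY_RUN switch and the run helper are my own additions so the
sketch can be exercised without the tools installed:

```shell
#!/bin/bash
# Sketch of the quick-fix pipeline as a function.
# DRY_RUN=1 prints the commands instead of running them (my addition).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# ocr_image input-image output-basename  ->  writes output-basename.txt
ocr_image() {
    local in="$1" out="$2"
    run convert "$in" tmp.ppm        # ImageMagick: to PPM for unpaper
    run unpaper tmp.ppm tmp_.ppm     # deskew / clean up
    run convert tmp_.ppm tmp.tif     # TIFF input for tesseract
    run tesseract tmp.tif "$out" -l eng
}

# Example (dry run: prints the four pipeline commands without running them)
DRY_RUN=1 ocr_image in.jpg out
```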
# run simple tesseract
tesseract 01.tif 01 -l eng

# output was /excellent/; faultless for the 10 lines in the example
cat 01.txt

# That said, the output /is/ sensitive to the presence of images, extra
# lines, distorted and darkened text around the spine of the scanned book,
# etc. The proprietary programs would doubtless handle this rather better.
# But for many purposes, such as pdf sources, tables, and dvd /subtitles/,
# tesseract should be excellent. I always felt that gocr couldn't really
# be the 'best' available! (It takes more time to correct the junk it
# generates than it would to do the whole job manually!)

# Recipe for converting double-page pdf scans into a single text file:
# N, 2010-09-09.

#!/bin/bash
pdftk "input_file.pdf" burst
i=0
for f in pg_*.pdf
do
    F="${f%.pdf}"
    pdftoppm <"$f" >"$F.ppm"
    unpaper -l double -op 2 "$F.ppm" "${F}_%1d.ppm"
    for g in "${F}"_?.ppm
    do
        i=$((i+1))
        G="${g%.ppm}"
        echo "$i: $F, $G"
        convert "$g" "$G.tif"
        tesseract "$G.tif" "$G" -l eng
        ln -s "$G.txt" "$(printf "%03d" $i).txt"
    done
done
cat [0-9][0-9][0-9].txt >OCR_output.txt
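The script's file-name handling leans on bash parameter expansion and
printf zero-padding; the fragment below isolates just that logic, with
made-up file names for illustration:

```shell
#!/bin/bash
f="pg_0001.pdf"

# ${f//\.pdf/} replaces every occurrence of ".pdf"; ${f%.pdf} strips only
# a trailing ".pdf", which is safer if ".pdf" appears elsewhere in the name.
F="${f%.pdf}"
echo "$F"              # pg_0001

# unpaper's %1d output pattern yields pg_0001_1.ppm, pg_0001_2.ppm, ...
g="${F}_1.ppm"
G="${g%.ppm}"
echo "$G"              # pg_0001_1

# printf %03d zero-pads the running page counter for the symlink names,
# so the final  cat [0-9][0-9][0-9].txt  concatenates pages in order.
i=7
printf "%03d\n" "$i"   # 007
```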