Addendum: the quick fix.

Given an input jpg "in.jpg" containing single-column text (more args are
needed if the input is two pages per sheet etc. -- see later):

convert in.jpg tmp.ppm
unpaper tmp.ppm tmp_.ppm
convert tmp_.ppm tmp.tif
tesseract tmp.tif out -l eng

This produces a file "out.txt" of the OCR'd text. The commands are:
"convert" from ImageMagick, "unpaper" (a separate package of the same
name), and "tesseract" (an OCR program associated with Google).

Do similarly for pdf input, but instead of starting with

convert in.jpg tmp.ppm

use

pdftoppm <in.pdf >tmp.ppm

(pdftoppm is from the poppler-utils package). Put this in a loop for
whole documents, using e.g. pdftk in.pdf burst to separate a multi-page
pdf file into single pages.

--------------------------------------------------------------

2010-07-18. The original file.

http://www.linux.com/archive/articles/138511
"How to scan and OCR like a pro with open source tools" (2008)

# check what scanners are seen
scanimage -L

# show options for a particular scanner
scanimage --help --device 'snapscan:libusb:001:010'

# get an image (-l -t are distances [mm] from top left; -x -y are page sizes)
scanimage --device 'snapscan:libusb:001:010' --format=pnm --mode=Gray \
    --resolution 300 -l 10 -t 10 -x 100 -y 150 \
    --brightness -20 --contrast 15 >01.pnm

# this is suggested for splitting double pages, aligning, sorting out
# edges, etc.; see unpaper --help for the many details of rotation,
# 2=>1, 1=>2, etc.
unpaper 01.pnm 01_.pnm

# convert to a format suitable for tesseract
convert 01_.pnm 01.tif

That webpage used all of gocr, Ocrad, and Tesseract-OCR, but found gocr
and Ocrad to make many mistakes. I haven't tried Ocrad, but from the
website's samples it looks almost as awful as gocr -- and the website's
gocr output was possibly rather /better/ than the junk I've ever got out
of gocr myself. Tesseract was clearly vastly superior, so I'll ignore
the others here.
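The quick-fix pipeline above can be wrapped in a small reusable function.
This is only a sketch assuming convert, unpaper, and tesseract are on the
PATH; the DRY_RUN switch and the run helper are my own additions so the
sketch can be exercised without the tools installed:

```shell
#!/bin/bash
# Sketch of the quick-fix pipeline as a function.
# DRY_RUN=1 prints the commands instead of running them (my addition).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# ocr_image input-image output-basename  ->  writes output-basename.txt
ocr_image() {
    local in="$1" out="$2"
    run convert "$in" tmp.ppm        # ImageMagick: to PPM for unpaper
    run unpaper tmp.ppm tmp_.ppm     # deskew / clean up
    run convert tmp_.ppm tmp.tif     # TIFF input for tesseract
    run tesseract tmp.tif "$out" -l eng
}

# Example (dry run: prints the four pipeline commands without running them)
DRY_RUN=1 ocr_image in.jpg out
```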
# run simple tesseract
tesseract 01.tif 01 -l eng

# output was /excellent/; faultless for the 10 lines in the example
cat 01.txt

# That said, the output /is/ sensitive to the presence of images, extra
# lines, distorted and darkened text around the spine of the scanned book,
# etc. The proprietary programs would doubtless handle this rather better.
# But for many purposes, such as pdf sources, tables, and dvd /subtitles/,
# tesseract should be excellent. I always felt that gocr couldn't really
# be the 'best' available! (It takes more time to correct the junk it
# generates than it would to do the whole job manually!)

# Recipe for converting double-page pdf scans into a single text file:
# N, 2010-09-09.

#!/bin/bash
pdftk "input_file.pdf" burst
i=0
for f in pg_*.pdf
do
    F="${f%.pdf}"
    pdftoppm <"$f" >"$F.ppm"
    unpaper -l double -op 2 "$F.ppm" "${F}_%1d.ppm"
    for g in "${F}"_?.ppm
    do
        i=$((i+1))
        G="${g%.ppm}"
        echo "$i: $F, $G"
        convert "$g" "$G.tif"
        tesseract "$G.tif" "$G" -l eng
        ln -s "$G.txt" "$(printf "%03d" $i).txt"
    done
done
cat [0-9][0-9][0-9].txt >OCR_output.txt
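The script's file-name handling leans on bash parameter expansion and
printf zero-padding; the fragment below isolates just that logic, with
made-up file names for illustration:

```shell
#!/bin/bash
f="pg_0001.pdf"

# ${f//\.pdf/} replaces every occurrence of ".pdf"; ${f%.pdf} strips only
# a trailing ".pdf", which is safer if ".pdf" appears elsewhere in the name.
F="${f%.pdf}"
echo "$F"              # pg_0001

# unpaper's %1d output pattern yields pg_0001_1.ppm, pg_0001_2.ppm, ...
g="${F}_1.ppm"
G="${g%.ppm}"
echo "$G"              # pg_0001_1

# printf %03d zero-pads the running page counter for the symlink names,
# so the final  cat [0-9][0-9][0-9].txt  concatenates pages in order.
i=7
printf "%03d\n" "$i"   # 007
```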