2010-07-18. Another attempt at including subtitles from DVDs as /text/ in a ripped file (e.g. Matroska container). This was completely given up a few years ago, when the egregious gocr seemed the best available program for producing the text ... but now, with tesseract shown to do much better OCR, another shot!

The native format of subtitles on a DVD is a sort of overlay image: the larger filesize compared to text is not important on a DVD, and the independence from character sets and available glyphs is a boon for worldwide compatibility of players.

The mplayer/mencoder package allows these 'vob' subtitles to be saved as files. The files come in pairs, one name ending in .sub (the data), the other in .idx (the timings?). This part of the mencoder command line selects a subtitle stream by its id-number NUM (seen in the mplayer console output when playing) and saves it to NAME.{idx,sub}:

    -sid NUM -vobsubout NAME

E.g.

    mencoder -dvd-device file.iso dvd://1 -nosound -ovc frameno -o /dev/null -sid NUM -vobsubout NAME

Then, using commands from subpackages of the ogmrip package, the sub/idx files can be converted to an XML file of timings, along with an image for each subtitle:

    subp2tiff -v -n -o OUTNAME NAME

which reads NAME.{idx,sub} and writes OUTNAME0001.tif, OUTNAME0002.tif, ...

With tesseract, these images can be converted to text (still not very wonderfully, albeit much much better than gocr: the glyphs have mucky middles and edges, and frequently connect to each other):

    for f in *.tif ; do tesseract "$f" "$f" -l eng ; done

This makes OUTNAME0002.tif.txt from OUTNAME0002.tif, and so on, assuming the text to be English.
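
The three steps so far (dump the stream, split it into images, OCR the images) could be strung together roughly as below. This is only a sketch: the function name subs_to_text and its argument order are made up for illustration, while the mencoder, subp2tiff and tesseract invocations are the ones given in these notes.

```shell
#!/bin/sh
# Sketch only: subs_to_text and its argument order are hypothetical.
# The three commands are used exactly as in the notes.
subs_to_text() {
    iso=$1      # DVD image, e.g. file.iso
    title=$2    # DVD title number, e.g. 1
    sid=$3      # subtitle stream id, as reported by mplayer
    name=$4     # basename for NAME.idx / NAME.sub
    out=$5      # prefix for the tiff images

    # 1. Dump the chosen subtitle stream to NAME.idx and NAME.sub.
    mencoder -dvd-device "$iso" "dvd://$title" -nosound -ovc frameno \
        -o /dev/null -sid "$sid" -vobsubout "$name" || return 1

    # 2. Write one image per subtitle, plus the XML of timings.
    subp2tiff -v -n -o "$out" "$name" || return 1

    # 3. OCR every image; tesseract appends .txt to the output base,
    #    so OUT0001.tif gives OUT0001.tif.txt.
    for f in "$out"*.tif; do
        [ -e "$f" ] || continue     # the glob matched nothing
        tesseract "$f" "$f" -l eng || echo "OCR failed: $f" >&2
    done
}
```

Called as, say, `subs_to_text file.iso 1 NUM NAME OUTNAME`. A failed OCR on one image is reported and skipped rather than stopping the whole run, since a single bad image should not cost the rest of the subtitles.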
Finally, from ogmrip again, the subptools command can replace all the references to tiff files in the XML with the respective OCR-treated text, and give output in the common 'srt' subtitle format:

    subptools --subst --convert srt NAME.srt

The part of this chain that needs attention is the images: either a further step should be added between subp2tiff and tesseract to make the images nicer for OCR, or one of those two steps should be modified, to produce nicer images or to make tesseract more tolerant.
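
One possible shape for such an in-between step, as an untested guess: run each image through ImageMagick before tesseract sees it. Everything here is an assumption rather than something tried in these notes: that ImageMagick's convert is installed, that greyscale plus enlargement plus a hard threshold helps with the mucky glyphs, and the particular values 300% and 50%. The function name clean_for_ocr is hypothetical.

```shell
#!/bin/sh
# Sketch only: clean_for_ocr is a hypothetical helper, and the
# ImageMagick settings are guesses to be tuned by eye.
clean_for_ocr() {
    for f in "$@"; do
        tmp="${f%.tif}.clean.tif"
        # Greyscale, enlarge, then force every pixel to black or white.
        if convert "$f" -colorspace Gray -resize 300% -threshold 50% "$tmp"
        then
            # Keep the original filename, so that the references in
            # the XML from subp2tiff still point at the right file.
            mv "$tmp" "$f"
        else
            rm -f "$tmp"
            echo "could not clean: $f" >&2
        fi
    done
}
```

It would be run as `clean_for_ocr OUTNAME*.tif` after subp2tiff and before the tesseract loop. The images are overwritten in place, so a copy of the originals is worth keeping while experimenting with the settings.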