Linux OCR: Tesseract and pdfsandwich, plus scanning notes

Scanning notes

VueScan is $50; can one license be used on four machines?

https://www.hamrick.com/reg.html

Paperwork scans, OCRs, and searches documents

http://www.linux-magazine.com/Issues/2014/166/Paperwork-Document-Manager

https://github.com/jflesch/paperwork

https://github.com/jflesch/paperwork/blob/unstable/doc/install.debian.markdown

Tesseract

http://onetransistor.blogspot.com/2015/12/ocr-searchable-pdf-linux.html

sudo apt-get install tesseract-ocr tesseract-ocr-all

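----

#!/bin/bash
# OCR language code; replace with your own. A separate name avoids
# clobbering the locale variable LANG.
OCR_LANG=eng

# Make *.tif expand to nothing (not a literal "*.tif") when no files match.
shopt -s nullglob

for f in *.tif; do
    echo "Running OCR on $f"
    # Tesseract 4 and later spell this option --psm; 3.x releases used -psm.
    tesseract --psm 1 -l "$OCR_LANG" "$f" "${f%.tif}" pdf
done

echo "Joining files into single PDF..."
# Note: this merges, then deletes, every PDF in the current directory.
pdftk *.pdf cat output ../outdocument.pdf
rm -f -- *.pdf

----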

This script takes all .tif files from the directory where it is run and processes them with tesseract. To use it, you also need pdftk installed. Copy the above snippet into a new file ocr.sh, make it executable (chmod +x ocr.sh), then place it in the folder with the scanned images and run it. The combined PDF is written to ../outdocument.pdf, one level above that folder.
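
Since the script depends on both tesseract and pdftk, it can help to confirm they are on the PATH before a long OCR run. The following is only a sketch; the function name missing_tools is made up for illustration and is not part of either tool.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every listed tool
# that cannot be found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check for the two tools the OCR script depends on.
missing=$(missing_tools tesseract pdftk)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
fi
```

If nothing is printed, both tools were found and ocr.sh should be able to start.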

…Things get complicated if you already have a PDF document that you want to make searchable. …

In this situation, you can use the pdfsandwich script by Tobias Elze:

pdfsandwich -lang eng input_document.pdf

The result will be input_document_ocr.pdf in the same folder as the initial document.

Two options are worth knowing:

  • -nopreproc is useful when the PDF already contains processed images and you don’t want any other processing. Note that by default, this script will convert your document to black and white! Using this option you avoid any kind of conversion.
  • -resolution has a default value of 300 DPI. This is used when converting PDF pages to images, and 300 is a good value. But if your document contains small text and you believe it was scanned at a higher DPI, specify that value.
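
As a rough sketch of how the two options and the output naming fit together, the helpers below combine them. Both function names are hypothetical, the 600 DPI default is only an example value, and the naming rule is simply the input_document.pdf to input_document_ocr.pdf pattern described in these notes.

```shell
#!/bin/bash
# Derive the default output name described above: NAME.pdf -> NAME_ocr.pdf.
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical wrapper: OCR an already-clean scan without preprocessing,
# at a chosen resolution (600 DPI here is just an example default).
sandwich_clean_scan() {
    local input dpi
    input="$1"
    dpi="${2:-600}"
    pdfsandwich -lang eng -nopreproc -resolution "$dpi" "$input"
    echo "Expected output: $(ocr_output_name "$input")"
}
```

For example, sandwich_clean_scan input_document.pdf 600 would run pdfsandwich once and then report input_document_ocr.pdf as the file to look for.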

From the pdfsandwich site:

“[For] some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR’ed file.”
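
To judge whether a second pass actually helped, the two results can be compared by size. This is a minimal sketch; file_size and smaller_file are made-up helper names, and the second-pass file name in the comments is an assumption that follows the _ocr naming pattern above.

```shell
#!/bin/bash
# Print the size of a file in bytes.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Print whichever of two files is smaller (the first one on a tie).
smaller_file() {
    if [ "$(file_size "$2")" -lt "$(file_size "$1")" ]; then
        printf '%s\n' "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Usage sketch (file names are assumptions):
#   pdfsandwich -lang eng input_document_ocr.pdf    # second pass
#   smaller_file input_document_ocr.pdf input_document_ocr_ocr.pdf
```

Keeping whichever file smaller_file prints is a simple way to pick between the first and second pass.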