Scanning notes
VueScan is $50, use on four machines?
https://www.hamrick.com/reg.html
Paperwork scans, OCRs, and searches documents
http://www.linux-magazine.com/Issues/2014/166/Paperwork-Document-Manager
https://github.com/jflesch/paperwork
https://github.com/jflesch/paperwork/blob/unstable/doc/install.debian.markdown
Tesseract
http://onetransistor.blogspot.com/2015/12/ocr-searchable-pdf-linux.html
sudo apt-get install tesseract-ocr tesseract-ocr-all
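----
#!/bin/bash
OCR_LANG=eng  # replace with your language code (LANG itself is the locale variable)
shopt -s nullglob
for f in *.tif; do
    echo "Running OCR on $f"
    # Newer Tesseract releases expect --psm; older 3.x releases used -psm
    tesseract --psm 1 -l "$OCR_LANG" "$f" "$f" pdf
done
echo "Joining files into single PDF..."
pdftk *.pdf cat output ../outdocument.pdf
# Caution: this removes every PDF in the current directory
rm -f *.pdf
----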
This script takes all .tif files from the directory where it is run and processes them with tesseract. To use it, you also need pdftk installed. Copy the above snippet into a new file ocr.sh, make it executable (chmod +x ocr.sh), then place it in the folder with scanned images and run it.
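The script above deletes every PDF in the scan folder when it finishes. A possible variant, sketched here as a function, keeps the per-page PDFs in a temporary directory so nothing else in the folder is touched. The function name ocr_tifs_to_pdf and its arguments are made up for this example; only the tesseract and pdftk calls come from the script above.

```shell
#!/bin/bash
# Sketch only: ocr_tifs_to_pdf and its arguments are hypothetical names.
# Usage: ocr_tifs_to_pdf SCAN_DIR OUTPUT_PDF [LANGUAGE]
ocr_tifs_to_pdf() {
    local src_dir="$1" out_pdf="$2" ocr_lang="${3:-eng}"
    local tmp_dir f status=0
    tmp_dir=$(mktemp -d) || return 1
    for f in "$src_dir"/*.tif; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "Running OCR on $f"
        # tesseract adds the .pdf extension to the output base name itself
        tesseract --psm 1 -l "$ocr_lang" "$f" "$tmp_dir/$(basename "$f" .tif)" pdf || status=1
    done
    # Reuse the positional parameters as the list of per-page PDFs
    set -- "$tmp_dir"/*.pdf
    if [ "$status" -eq 0 ] && [ -e "$1" ]; then
        pdftk "$@" cat output "$out_pdf" || status=1
    else
        status=1   # OCR failed or there were no .tif files
    fi
    rm -rf "$tmp_dir"
    return "$status"
}
```

Example call, run from the folder with the scans: ocr_tifs_to_pdf . ../outdocument.pdf eng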
…Things get complicated if you already have a PDF document that you want to make searchable. …
In this situation, you can use the pdfsandwich script by Tobias Elze.
- -nopreproc is useful when the PDF already contains processed images and you don’t want any other processing. Note that by default, this script will convert your document to black and white! Using this option you avoid any kind of conversion.
- -resolution has a default value of 300 DPI. This is used when converting PDF pages to images and 300 is a good value. But if your document contains small text and you know/believe it may have been scanned at a higher DPI, specify it.
pdfsandwich -lang eng input_document.pdf
The result will be input_document_ocr.pdf in the same folder as the initial document.
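To show how the two options combine with the basic command, here is a small wrapper sketch. The function name make_searchable and its argument order are invented for this note; the flags -lang, -nopreproc and -resolution are the ones mentioned above.

```shell
#!/bin/bash
# Sketch only: make_searchable is a hypothetical wrapper around pdfsandwich.
# Usage: make_searchable INPUT_PDF [DPI] [yes|no]
#   DPI defaults to 300; pass "no" as the third argument to skip
#   preprocessing (and with it the black-and-white conversion).
make_searchable() {
    local input="$1" dpi="${2:-300}" preprocess="${3:-yes}"
    if [ "$preprocess" = no ]; then
        pdfsandwich -lang eng -nopreproc -resolution "$dpi" "$input"
    else
        pdfsandwich -lang eng -resolution "$dpi" "$input"
    fi
}
```

For a document with small text that is already cleaned up: make_searchable input_document.pdf 600 no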
From the pdfsandwich site:
“[For] some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR’ed file.”
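One way to decide when a second pass is worth trying is to compare file sizes first. The helper below is only a sketch: the name grew_more_than and the idea of a size factor are assumptions of this note, not something pdfsandwich provides.

```shell
#!/bin/bash
# Sketch only: grew_more_than is a hypothetical helper.
# Succeeds when file $2 is more than $3 times the size of file $1.
grew_more_than() {
    local before after
    [ -f "$1" ] && [ -f "$2" ] || return 2
    # $(( ... )) strips the padding some wc implementations add
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    [ "$after" -gt $(( before * $3 )) ]
}

# Example (the factor 3 is an arbitrary choice):
# if grew_more_than input_document.pdf input_document_ocr.pdf 3; then
#     pdfsandwich -lang eng input_document_ocr.pdf
# fi
```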