{"id":2018,"date":"2016-09-24T14:21:06","date_gmt":"2016-09-24T22:21:06","guid":{"rendered":"http:\/\/systemsolver.com\/StatlerBlog\/?p=2018"},"modified":"2016-09-24T14:21:06","modified_gmt":"2016-09-24T22:21:06","slug":"linux-ocr-tesseract-pdfsandwich","status":"publish","type":"post","link":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/2016\/09\/24\/linux-ocr-tesseract-pdfsandwich\/","title":{"rendered":"Linux ocr Tesseract &#038; PDFsandwich also scanning"},"content":{"rendered":"<h2>Scanning notes<\/h2>\n<p>vuescan is $50, use on four machines?<\/p>\n<p><a href=\"https:\/\/www.hamrick.com\/reg.html\">https:\/\/www.hamrick.com\/reg.html<\/a><\/p>\n<p>Paperwork scans, OCRs and searches<\/p>\n<p><a href=\"http:\/\/www.linux-magazine.com\/Issues\/2014\/166\/Paperwork-Document-Manager\">http:\/\/www.linux-magazine.com\/Issues\/2014\/166\/Paperwork-Document-Manager<\/a><\/p>\n<p><a href=\"https:\/\/github.com\/jflesch\/paperwork\">https:\/\/github.com\/jflesch\/paperwork<\/a><\/p>\n<p><a href=\"https:\/\/github.com\/jflesch\/paperwork\/blob\/unstable\/doc\/install.debian.markdown\">https:\/\/github.com\/jflesch\/paperwork\/blob\/unstable\/doc\/install.debian.markdown<\/a><\/p>\n<h2>Tesseract<\/h2>\n<p><a href=\"http:\/\/onetransistor.blogspot.com\/2015\/12\/ocr-searchable-pdf-linux.html\">http:\/\/onetransistor.blogspot.com\/2015\/12\/ocr-searchable-pdf-linux.html<\/a><\/p>\n<blockquote><p><code>sudo apt-get install tesseract-ocr tesseract-ocr-all<\/code><\/p>\n<p>----<\/p>\n<p>&nbsp;<\/p>\n<p>#!\/bin\/bash<br \/>\n# Replace with your language code. Not named LANG, which is the shell's locale variable.<br \/>\nOCR_LANG=eng<\/p>\n<p>shopt -s nullglob<\/p>\n<p>for f in *.tif; do<br \/>\necho &quot;Running OCR on $f&quot;<br \/>\n# Tesseract 4 and later spell this option --psm<br \/>\ntesseract -psm 1 -l &quot;$OCR_LANG&quot; &quot;$f&quot; &quot;$f&quot; pdf<br \/>\ndone<\/p>\n<p>echo &quot;Joining files into single PDF...&quot;<br \/>\npdftk *.pdf cat output ..\/outdocument.pdf<br \/>\nrm -r -f *.pdf<\/p>\n<p>&nbsp;<\/p><\/blockquote>\n<p>&nbsp;<\/p>\n<blockquote><p>----<\/p>\n<p>This script 
takes all <b>.tif<\/b> files from the directory where it is run and processes them with <b>tesseract<\/b>. To use it, you also need <b>pdftk<\/b> installed. Copy the above snippet into a new file <b>ocr.sh<\/b>, make it executable (<b>chmod +x ocr.sh<\/b>), then place it in the folder with the scanned images and run it.<\/p>\n<p>&#8230;Things get complicated if you already have a PDF document that you want to make searchable. &#8230;<\/p>\n<p>In this situation, you can use the <b><a href=\"http:\/\/www.tobias-elze.de\/pdfsandwich\/\" target=\"_blank\" data-blkn-colour=\"rgba(43,100,215,1)\" rel=\"noopener\">pdfsandwich<\/a><\/b> script by <a href=\"http:\/\/www.tobias-elze.de\/\" target=\"_blank\" data-blkn-colour=\"rgba(43,100,215,1)\" rel=\"noopener\">Tobias Elze<\/a>.<\/p>\n<ul>\n<li><b>-nopreproc<\/b> is useful when the PDF already contains processed images and you don&#8217;t want any further processing. Note that by default, this script will convert your document to black and white! Using this option, you avoid any kind of conversion.<\/li>\n<li><b>-resolution<\/b> has a default value of 300 DPI. This is used when converting PDF pages to images, and 300 is a good value. But if your document contains small text and you know or believe it was scanned at a higher DPI, specify that value.<\/li>\n<\/ul>\n<pre>pdfsandwich -lang eng input_document.pdf<\/pre>\n<p>The result will be <b>input_document_ocr.pdf<\/b> in the same folder as the initial document.<\/p><\/blockquote>\n<p>From the pdfsandwich site:<\/p>\n<p>&#8220;some pdf files, pdfsandwich produces much larger files after OCR processing. In this case, it might help to call pdfsandwich again on the already OCR&#8217;ed file&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scanning notes vuescan is $50, use on four machines? 
https:\/\/www.hamrick.com\/reg.html Paperwork scans, ocr&#8217;s and searches http:\/\/www.linux-magazine.com\/Issues\/2014\/166\/Paperwork-Document-Manager https:\/\/github.com\/jflesch\/paperwork https:\/\/github.com\/jflesch\/paperwork\/blob\/unstable\/doc\/install.debian.markdown Tesseract http:\/\/onetransistor.blogspot.com\/2015\/12\/ocr-searchable-pdf-linux.html sudo apt-get install tesseract-ocr tesseract-ocr-all &#8212;- &nbsp; #!\/bin\/bash LANG=eng #replace with your language code shopt -s nullglob for f in *.tif; do echo &#8220;Running OCR on $f&#8221; tesseract -psm 1 -l $LANG $f $f pdf [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-2018","post","type-post","status-publish","format-standard","hentry","category-general"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/posts\/2018","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/comments?post=2018"}],"version-history":[{"count":0,"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/posts\/2018\/revisions"}],"wp:attachment":[{"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/media?parent=2018"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/categories?post=2018"},{"taxonom
y":"post_tag","embeddable":true,"href":"https:\/\/systemsolver.goodhealthyday.com\/StatlerBlog\/wp-json\/wp\/v2\/tags?post=2018"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}