Current scan process

Here "current" means as of roughly 2024-03-24.

scan pages to tif files

Three options here:

for separate A4/letter pages: HP7650 scanner with ADF (automatic document feeder), using scanimage. Example invocation:
scanimage -y 292 -x 209 -l 3 --batch --format=tiff --mode Lineart --resolution 600 --source ADF --batch-increment 2 --batch-start=1 --batch='BH_05_%03d.tif' -p
for larger pages: HP7650 scanner in flatbed mode, using the xsane package
for bound books: book scanner with Canon A2500 cameras, using the Pi-Scan package. This produces JPGs, which ScanTailor converts to TIFFs. Source: Pi-Scan
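
The scanimage example above combines --batch-start=1, --batch-increment 2 and a printf-style file name pattern. As a small illustration of which file names that combination yields, here is a Python sketch. The helper name batch_names is made up for this example. The idea that the skipped even numbers are reserved for a second pass over the reverse sides is an assumption, not something stated above.

```python
def batch_names(pattern, start, increment, count):
    """List the file names a printf-style batch pattern yields.

    Hypothetical helper: it only mimics the numbering used by the
    --batch, --batch-start and --batch-increment options.
    """
    return [pattern % (start + i * increment) for i in range(count)]


# Values taken from the example invocation above; three sheets assumed.
print(batch_names("BH_05_%03d.tif", start=1, increment=2, count=3))
# ['BH_05_001.tif', 'BH_05_003.tif', 'BH_05_005.tif']
```

Under that assumption, a second run with a start value of 2 would fill in BH_05_002.tif, BH_05_004.tif and so on.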

post-process files with ScanTailor

generate PDFs with a text layer with tesseract

Tesseract is available as a standard package on most Linux distributions. Example invocation (the output directory ocrpdf must exist, so it is created first):
mkdir -p ocrpdf; for FILE in *.tif; do tesseract "$FILE" "ocrpdf/${FILE%.tif}" pdf; done

concatenate pdfs with pdftk. pdftk is a standard Linux package.

Example invocation: pdftk *.pdf cat output totalPdf.pdf
This results in a PDF file without bookmarks; pages are numbered from 1 upward.

generate a bookmarks file with genMetaBookmarks.py*.

Example invocation from the directory containing the PDFs: python3 genMetaBookmarks.py > bookmarkMeta.txt

add the bookmarks to totalPdf.pdf with pdftk.

Example invocation: pdftk totalPdf.pdf update_info bookmarkMeta.txt output totalPdf_ocr.pdf verbose

*) The genMetaBookmarks.py script is still under development and very limited. It is a simple script that creates a bookmarks file for pdftk, using the file name of each single-page PDF as the bookmark text. It currently works only when the page files are named in strict alphabetical order.
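
For orientation, here is a minimal Python sketch of what such a bookmarks generator could look like. It is not the actual genMetaBookmarks.py. The function name bookmark_entries, the use of the file name without its .pdf extension as the title, and the single bookmark level are all assumptions made for this example. The output uses the BookmarkBegin / BookmarkTitle / BookmarkLevel / BookmarkPageNumber records that pdftk accepts for update_info.

```python
from pathlib import Path


def bookmark_entries(pdf_names):
    """Return pdftk bookmark records, one per single-page PDF.

    Assumption: sorting the names reproduces the order in which the
    pages were concatenated, so entry N points at page N.
    """
    lines = []
    for page_number, name in enumerate(sorted(pdf_names), start=1):
        lines.append("BookmarkBegin")
        # Assumption: title is the file name without the .pdf suffix.
        lines.append(f"BookmarkTitle: {Path(name).stem}")
        lines.append("BookmarkLevel: 1")
        lines.append(f"BookmarkPageNumber: {page_number}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Lists every *.pdf in the current directory, like the shell glob.
    names = [p.name for p in Path(".").glob("*.pdf")]
    print(bookmark_entries(names), end="")
```

Note that this sketch picks up every PDF in the directory, so a combined file such as totalPdf.pdf would have to be kept elsewhere or filtered out. The reliance on sorted order is also why strictly alphabetical page names matter.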