Current scan process

Here, "current" means as of about 2024-03-24.

scan pages to tif files
Three options here:
post-process files with ScanTailor
source: scantailor-advanced
generate PDFs with a text layer using tesseract
Tesseract is a standard Linux package. Example invocation (create the ocrpdf output directory first): for FILE in *.tif; do tesseract "$FILE" "ocrpdf/${FILE%.tif}" pdf; done
concatenate pdfs with pdftk. pdftk is a standard Linux package.
Example invocation: pdftk *.pdf cat output totalPdf.pdf
This results in a PDF file without bookmarks; pages are numbered from 1 upward.
generate a bookmarks file with genMetaBookmarks.py*.
Example invocation from the directory containing the PDFs: python3 genMetaBookmarks.py > bookmarkMeta.txt
add the bookmarks to totalPdf.pdf with pdftk.
Example invocation: pdftk totalPdf.pdf update_info bookmarkMeta.txt output totalPdf_ocr.pdf verbose
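
The command-line steps above can be sketched as a small Python helper that only builds the argument lists, so they can be inspected before anything is run. This is a rough sketch, not part of the actual process: the function name build_commands and its parameters are hypothetical, and it assumes that the page PDFs are written to ocrpdf/ and concatenated from the parent directory, which differs slightly from running pdftk *.pdf inside the PDF directory.

```python
from pathlib import Path


def build_commands(tif_names, ocr_dir="ocrpdf", total="totalPdf.pdf",
                   meta="bookmarkMeta.txt", final="totalPdf_ocr.pdf"):
    """Return argument lists for the OCR, concatenation and bookmark steps.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    commands = []
    page_pdfs = []
    # Sort explicitly so the page order does not depend on directory order.
    for name in sorted(tif_names):
        # tesseract takes an output base name and appends ".pdf" itself.
        base = f"{ocr_dir}/{Path(name).stem}"
        commands.append(["tesseract", name, base, "pdf"])
        page_pdfs.append(base + ".pdf")
    # One pdftk call to concatenate all single-page PDFs.
    commands.append(["pdftk", *page_pdfs, "cat", "output", total])
    # One pdftk call to merge the bookmark metadata into the result.
    commands.append(["pdftk", total, "update_info", meta,
                     "output", final, "verbose"])
    return commands
```

Each returned list could be passed to subprocess.run(cmd, check=True). The bookmark file named by meta still has to be generated between the two pdftk calls.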

*) The genMetaBookmarks.py script is still under development and very limited. It is a simple script that creates a bookmarks file for pdftk, using the file name of each single-page PDF as the bookmark text. It currently works only if the page files are named in strict alphabetical order.
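
For illustration, a minimal sketch of what such a bookmark generator might look like. This is not the actual genMetaBookmarks.py: the function name bookmark_lines is hypothetical, and it assumes one bookmark per single-page PDF, alphabetical page order, and the file name without its .pdf extension as the bookmark title. The output uses the bookmark record format that pdftk reads with update_info.

```python
from pathlib import Path


def bookmark_lines(pdf_names):
    """Build pdftk bookmark metadata, one top-level bookmark per page file.

    Assumption: each file is a single page and alphabetical order of the
    names equals page order in the concatenated PDF.
    """
    lines = []
    for page_number, name in enumerate(sorted(pdf_names), start=1):
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {Path(name).stem}",  # file name without ".pdf"
            "BookmarkLevel: 1",
            f"BookmarkPageNumber: {page_number}",
        ]
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Use all PDFs in the current directory, as in the invocation above.
    names = [p.name for p in Path(".").glob("*.pdf")]
    print(bookmark_lines(names), end="")
```

If the concatenated totalPdf.pdf sits in the same directory as the page files, it would be picked up as a page too, so it should be kept elsewhere or filtered out.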