Current scan process
Here "current" means as of around 2024-03-24.
- scan pages to tif files
- Three options here:
- for loose A4/letter pages: HP7650 scanner with ADF, using scanimage (the SANE command-line frontend).
Example invocation:
scanimage -y 292 -x 209 -l 3 --batch --format=tiff --mode Lineart --resolution 600 --source ADF --batch-increment 2 --batch-start=1 --batch='BH_05_%03d.tif' -p
- for larger pages: HP7650 scanner in flatbed mode, using the xsane package.
- for bound books: book scanner with Canon A2500 cameras, using the Pi-Scan package. This produces JPGs, which ScanTailor converts to TIFFs.
- source: Pi-Scan
- post-process files with ScanTailor
- source: scantailor-advanced
- generate PDFs with a text layer (OCR) with tesseract
- Tesseract is a standard Linux package. Example invocation:
mkdir -p ocrpdf; for FILE in *.tif; do tesseract "$FILE" "ocrpdf/${FILE%.tif}" pdf; done
- concatenate pdfs with pdftk. pdftk is a standard Linux package.
- Example invocation:
pdftk *.pdf cat output totalPdf.pdf
This results in a PDF file without bookmarks; pages are numbered from 1 upward.
- generate a bookmarks file with genMetaBookmarks.py*.
- Example invocation from the directory containing the PDFs:
python3 genMetaBookmarks.py > bookmarkMeta.txt
- add the bookmarks to totalPdf.pdf with pdftk.
- Example invocation:
pdftk totalPdf.pdf update_info bookmarkMeta.txt output totalPdf_ocr.pdf verbose
*) The genMetaBookmarks.py script is still under development and very limited. It is a simple script that creates a bookmarks file for pdftk, using the file name of each single-page PDF as the bookmark text. It currently works only when the page files are named in strict alphabetical order.
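
For illustration, a minimal sketch of what such a bookmarks generator could look like. This is not the actual genMetaBookmarks.py; the function name bookmark_entries and the EXCLUDE set are assumptions made for this example. It writes one level-1 bookmark per single-page PDF in the BookmarkBegin/BookmarkTitle/BookmarkLevel/BookmarkPageNumber format that pdftk update_info reads, and it assumes that plain alphabetical order of the file names equals page order.

```python
from pathlib import Path

# Assumption: the combined output files live in the same directory and must be skipped.
EXCLUDE = {"totalPdf.pdf", "totalPdf_ocr.pdf"}


def bookmark_entries(pdf_names):
    """Return pdftk bookmark metadata: one level-1 bookmark per single-page PDF."""
    lines = []
    # Plain alphabetical sort; page N is assumed to be the N-th file name.
    for page_number, name in enumerate(sorted(pdf_names), start=1):
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {Path(name).stem}",  # file name without ".pdf"
            "BookmarkLevel: 1",
            f"BookmarkPageNumber: {page_number}",
        ]
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Run from the directory containing the single-page PDFs.
    names = [p.name for p in Path(".").glob("*.pdf") if p.name not in EXCLUDE]
    print(bookmark_entries(names), end="")
```

Redirecting the output to bookmarkMeta.txt gives a file that can be passed to pdftk update_info as in the step above. Note that Python's sorted() may order names differently from the shell's *.pdf expansion (for example with mixed upper and lower case), so the bookmarks only line up with the pages if both orders agree.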