Make faster

Running tesseract 4, this script takes anywhere from 20 to 50 seconds per page to generate hOCR. @isong reports that ABBYY usually takes 5 or 6 seconds to generate OCR. Clearly we need to make this script faster.

Some possible optimization strategies:

Test with tesseract 5 on Windows (so far it's been running tesseract 4 on Ubuntu) - apparently 5 is faster than 4.
Test the same script but not using pytesseract; that is, call tesseract directly. Word on the web has it that pytesseract adds overhead. This would also make the script more portable.
The script already uses Python's multiprocessing to create two parallel processes, one for odd-numbered pages and one for even-numbered pages, with a slight increase in per-page performance compared to just one process. I don't think we'll gain much by creating more processes on the same CPU.
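
For the "call tesseract directly" option, here is a minimal sketch that goes through Python's subprocess module with no pytesseract in between. It assumes the `tesseract` binary is on the PATH. The function names, the `eng` language default, and the output directory layout are placeholders for illustration, not what the script currently does. Each call is timed so the result can be compared with the pytesseract timings.

```python
import subprocess
import time
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argv list for a direct tesseract call.

    The trailing "hocr" config tells tesseract to write <output_base>.hocr.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "hocr"]


def ocr_page(image_path, output_dir, lang="eng"):
    """Run tesseract on one page image; return (hocr_path, seconds_elapsed).

    Hypothetical helper: assumes tesseract is installed and on the PATH.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_base = output_dir / Path(image_path).stem

    start = time.perf_counter()
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,
        capture_output=True,
    )
    elapsed = time.perf_counter() - start

    # tesseract appends ".hocr" to the output base name itself.
    return Path(str(output_base) + ".hocr"), elapsed
```

Something like `ocr_page("page_0001.tif", "hocr_out")` should leave `hocr_out/page_0001.hocr` behind and return how long the call took. Running it over the same pages we already timed would show whether pytesseract's overhead is significant.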

This is not "make the script faster", but we may need to consider how we'd run this on 5 or more computers simultaneously.
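
One possible way to spread the work over several computers is to extend the existing odd/even split to N workers, with each machine taking every Nth page. This is a rough sketch only. It assumes every machine can see the same set of page images (for example on shared storage) and is told its own index. `pages_for_worker` and its parameters are made up for illustration.

```python
def pages_for_worker(page_paths, worker_index, worker_count):
    """Return the pages one worker should OCR, using a strided split.

    Hypothetical helper: worker_index runs from 0 to worker_count - 1.
    Sorting first means every machine derives the same assignment
    from the same set of file names.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in range(worker_count)")
    return sorted(page_paths)[worker_index::worker_count]
```

With `worker_count=2`, indexes 0 and 1 reproduce the current odd/even split. With `worker_count=5`, each of five machines gets roughly a fifth of the pages and no page is assigned twice.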

Edited Jan 12, 2024 by Mark Jordan