Make faster
Running tesseract 4, this script takes anywhere from 20 to 50 seconds per page to generate hOCR. @isong reports that ABBYY usually takes 5 or 6 seconds to generate OCR. Clearly we need to make this script faster.
Some possible optimization strategies:
- Test with tesseract 5 on Windows (so far it's been running tesseract 4 on Ubuntu) - apparently 5 is faster than 4.
- Test the same script but not using pytesseract; that is, call tesseract directly. Word on the web has it that pytesseract adds overhead. This would also make the script more portable.
- The script already uses Python's multiprocessing to create two parallel processes, one for odd-numbered pages and one for even-numbered pages, with a slight increase in per-page performance compared to just one process. I don't think we'll gain much by creating more processes on the same CPU.
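To compare against pytesseract, here is a rough sketch of what calling tesseract directly could look like. It assumes the `tesseract` binary is on the PATH and uses the standard `tesseract <image> <outputbase> hocr` invocation. The function names, the `eng` default language, and the output layout are placeholders for illustration, not what the script currently does.

```python
import subprocess
import time
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argv list for a direct tesseract call.

    The trailing "hocr" config makes tesseract write <output_base>.hocr.
    The "eng" default is an assumption; use whatever language the pages need.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "hocr"]


def ocr_page(image_path, output_dir, lang="eng"):
    """Run tesseract on one page image and return (hocr_path, seconds_taken)."""
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    command = build_tesseract_command(image_path, output_base, lang)

    started = time.perf_counter()
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, check=True, capture_output=True)
    elapsed = time.perf_counter() - started

    # tesseract appends ".hocr" to the output base itself.
    return Path(str(output_base) + ".hocr"), elapsed
```

Calling `ocr_page()` on the same page images we feed to pytesseract, and comparing the returned timings, should show whether the wrapper overhead is significant.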
This is not "make the script faster", but we may need to consider how we'd run this on 5 or more computers simultaneously.
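One possible way to divide the work across machines is to generalize the existing odd/even split into a round-robin split over N workers. The sketch below only shows the page assignment. It assumes every machine can read the same page images (for example from shared storage), and `shard_pages` is a hypothetical helper, not part of the current script.

```python
def shard_pages(pages, num_workers):
    """Split a list of pages round-robin into num_workers lists.

    With num_workers=2 this reproduces the odd/even split the script
    already uses; with 5 it yields one page list per computer.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Slice i takes every num_workers-th page, starting at index i.
    return [pages[i::num_workers] for i in range(num_workers)]
```

Each computer would then run the script against its own list only, so no coordination is needed beyond agreeing on the worker count and each machine's index.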
Edited by Mark Jordan