TY  - THES
ID  - 136802613
TI  - Optimizing OCR Workflow – reOCRing
AU  - Li, Chenlu
AU  - Fantoli, Margherita
AU  - KU Leuven. Faculteit Wetenschappen. Opleiding Master of Digital Humanities (Leuven)
PY  - 2023
PB  - Leuven KU Leuven. Faculteit Wetenschappen
DB  - UniCat
UR  - https://www.unicat.be/uniCat?func=search&query=sysid:136802613
AB  - The preservation and accessibility of historical newspapers have posed significant challenges over the past few decades. Optical Character Recognition (OCR) technology is crucial for converting printed text on physical paper into a machine-readable format. This paper serves as a comprehensive record of the work and learning experience from an internship focused on optimizing OCR of historical newspapers. The objective was to explore workflows and use various tools to digitize and extract text from these invaluable resources. The use of tools such as Tesseract, pytesseract, OCRmyPDF, Transkribus, Layoutparser, Google Vision API, and OpenCV is documented. Three different re-OCRing workflows for improving the accuracy of OCR results are compared. However, the physical damage and degradation inherent in historical newspapers presented challenges that reduced OCR accuracy. The paper highlights the challenges faced, the methodologies employed, the lessons learned, and the limitations encountered, providing a practical reference for future projects on historical newspapers.
ER  - 