TY  - THES
ID  - 136802613
TI  - Optimizing OCR Workflow – reOCRing
AU  - Li, Chenlu
AU  - Fantoli, Margherita
AU  - KU Leuven. Faculteit Wetenschappen. Opleiding Master of Digital Humanities (Leuven)
PY  - 2023
CY  - Leuven
PB  - KU Leuven. Faculteit Wetenschappen
DB  - UniCat
UR  - https://www.unicat.be/uniCat?func=search&query=sysid:136802613
AB  - The preservation and accessibility of historical newspapers have posed significant challenges over the past few decades. Optical Character Recognition (OCR) technology is crucial for converting printed text on physical paper into a machine-readable format. This paper serves as a comprehensive work record and an account of the learning experience from an internship focused on optimizing OCR of historical newspapers. The objective was to explore workflows and use various tools to digitize and extract text from these invaluable resources. The use of tools such as Tesseract, pytesseract, OCRmyPDF, Transkribus, Layoutparser, Google Vision API, and OpenCV is documented. Three different re-OCRing workflows for improving the accuracy of OCR results are compared. However, the physical damage and degradation inherent in historical newspapers presented challenges that reduced OCR accuracy. The paper highlights the challenges faced, methodologies employed, lessons learned, and limitations, providing practical insights for future projects on historical newspapers.
ER  -