A Web-based Semantic Navigation System for Migne’s Patrologia Graeca based on OCR extracted Page and Volume Numbers from the Table of Contents of Dorotheos Scholarios

AUTHORS: Evagelos Varthis, Marios Poulos, Ilias Giarenis, Sozon Papavlasopoulos

ABSTRACT: This paper presents the prototype of a new tool for navigating a 19th-century collection of Greek authors. The collection was published by Jacques Paul Migne and is known today as Patrologia Graeca (PG). The project aspires to interconnect the roughly 120,000 scanned pages of PG with the scanned Table of Contents (TOC) published by D. Scholarios in 1879. Scholarios’s work contains summaries of the chapters and sub-chapters of PG, each accompanied by the volume and page number of the corresponding location in PG. Using Optical Character Recognition (OCR) and pattern recognition techniques, we extract this information from Scholarios’s work in order to create links to the specific pages of PG. Our aim is to provide a Web interface in which Scholarios’s work serves as a semantic compass for the subjects that PG covers. The complete system consists of three main components: (1) a REST API backbone service for the scanned images of PG; (2) OCR and pattern recognition techniques for extracting the volume and page information from the scanned pages of Scholarios’s work; and (3) a Web interface presenting Scholarios’s TOC with the appropriate functionality. The originality of our system lies in the interconnection of two different scanned texts for semantic enrichment and browsing convenience, especially given that one is nearly 120,000 pages long and the other about 600 pages.
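As an illustration of the extraction and linking step described in the abstract, the following is a minimal sketch, not the implementation used in the system. It assumes that a reference appears in the OCR output as a volume number, a comma, and a page number (for example "31, 540"); the actual layout in the scanned TOC may differ. The names `REF_PATTERN`, `URL_TEMPLATE`, `extract_references`, and `build_link`, as well as the URL shape, are hypothetical.

```python
import re

# Assumed reference shape: volume number, comma, page number (e.g. "31, 540").
# This is an illustrative guess at the OCR output, not the documented format.
REF_PATTERN = re.compile(r"\b(\d{1,3})\s*,\s*(\d{1,4})\b")

# Hypothetical endpoint template; the real REST API routes are not specified here.
URL_TEMPLATE = "https://example.org/pg/volumes/{volume}/pages/{page}"

# PG volumes are numbered 1 to 161; used here as a sanity filter on OCR noise.
MAX_VOLUME = 161


def extract_references(ocr_line, max_volume=MAX_VOLUME):
    """Return the (volume, page) pairs found in one line of OCR text."""
    refs = []
    for match in REF_PATTERN.finditer(ocr_line):
        volume, page = int(match.group(1)), int(match.group(2))
        # Discard pairs whose volume number cannot exist in PG.
        if 1 <= volume <= max_volume:
            refs.append((volume, page))
    return refs


def build_link(volume, page):
    """Build a link to one scanned PG page from a (volume, page) pair."""
    return URL_TEMPLATE.format(volume=volume, page=page)


if __name__ == "__main__":
    line = "Περὶ πίστεως 31, 540"
    for volume, page in extract_references(line):
        print(build_link(volume, page))
```

The range check on the volume number is one simple way to reject spurious digit pairs produced by OCR errors; a production system would likely need further validation, such as checking the page number against the known page range of each volume.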

KEYWORDS: Migne’s Patrologia Graeca, Dorotheos Scholarios, REST API, Web Interface, Semantic Web



