Automated Processes to Analyze Text Information and Identify Travel Locations in Rare Books

Professor Detlev Doherr
Co-author Andreas Jankowski
Offenburg University of Applied Sciences

Abstract: Alexander von Humboldt, the 19th-century German explorer, saw the world as a whole in which everything is interconnected and interdependent. He published an enormous set of volumes containing the discoveries and scientific observations he made on his travels to the Americas. Digital libraries and online archives provide these volumes as scanned images and text documents, making even rare books accessible to the public. To ensure high-quality discovery and accessibility of digitized surrogates, many libraries have developed web applications that open access to their collections and improve the searchability of information by creating and promoting interoperability standards, including institutional repositories. Reflecting these quality aspects, the Portal Alexander von Humboldt was an approach to creating an interconnected information network with portal technology, using the transdisciplinary legacy of Humboldt as an example. But the portal is a lot more: its embedded text analysis feature opens a path to a full-text search function across various digital libraries, in some cases with deeper results than the libraries’ original search functions. The text analyzer recognizes terms internally as information objects, which can be classified by references to linked archives and by comparison of identified objects. Additionally, a feature was implemented to recognize travel routes and locations from Humboldt’s travels and to present them in a Google Maps application. Because the travel route is known, the accuracy of the service can be improved by probability rankings, expressed by the distance of a location from Humboldt’s route. Unfortunately, place names are not unique and may have been renamed or changed in notation over the last hundred years. This makes the recognition of terms difficult, and there is no simple remedy.
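The ranking of ambiguous place names by their distance to the known route could be sketched as follows. This is a minimal illustration only, not the portal's implementation: the function names, the representation of the route as a list of waypoints, and the approximate coordinates are all assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def distance_to_route_km(point, route):
    """Distance from a point to the nearest route waypoint.

    Simplification (assumption): only waypoints are considered,
    not the segments between them.
    """
    return min(haversine_km(point, waypoint) for waypoint in route)


def rank_candidates(candidates, route):
    """Sort (label, (lat, lon)) candidates, closest to the route first."""
    return sorted(candidates, key=lambda c: distance_to_route_km(c[1], route))


# Hypothetical waypoints with approximate coordinates: Cumana, Caracas, Havana.
ROUTE = [(10.45, -64.17), (10.48, -66.90), (23.11, -82.37)]

# Two places sharing the name "Valencia" (approximate coordinates).
CANDIDATES = [
    ("Valencia, Spain", (39.47, -0.38)),
    ("Valencia, Venezuela", (10.16, -68.00)),
]

for label, coords in rank_candidates(CANDIDATES, ROUTE):
    print(f"{label}: {distance_to_route_km(coords, ROUTE):.0f} km from route")
```

The candidate closest to the route is listed first and would be treated as the most probable match. A fuller version would measure the distance to the route segments and convert it into a probability score.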
A case study of Goethe’s “Italienische Reise” led to an improved process chain for text recognition, which ensures an almost automatic extraction, recognition and presentation of the locations mentioned or described in the text. As a first step, the whole text was split into separate object-oriented definitions of terms, containing properties such as classification, coordinates or further object-related information. To harvest these properties and relevant details, common and publicly available open-source databases and tools with well-documented APIs, such as Wikipedia, Wiktionary and OpenStreetMap, were used. It can be shown that object recognition improves the identification of location names mentioned in rare books in comparison to earlier developments. With further improvement, the method of object definition can be used to automatically detect details such as names of persons, animals and plants, which can support a better understanding of the text. As a consequence, the system can support the reader of a rare book by identifying most of the place names and visualizing the mentioned locations on a map alongside the text pages.
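The first step described above, splitting the text into term objects that carry a classification and coordinates, might look roughly like the following sketch. The `TermObject` class, the `extract_terms` function and the small local gazetteer are hypothetical names introduced for illustration. In the described system the properties were harvested from Wikipedia, Wiktionary and OpenStreetMap, whereas here a dictionary with approximate coordinates stands in for those lookups.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TermObject:
    """One term from the text together with its harvested properties."""
    term: str
    classification: str = "unknown"
    coordinates: Optional[Tuple[float, float]] = None
    properties: dict = field(default_factory=dict)


# Stand-in for external lookups (assumption): approximate (lat, lon) values.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "Verona": (45.44, 10.99),
    "Rom": (41.90, 12.50),
    "Neapel": (40.85, 14.27),
}


def extract_terms(text: str, gazetteer: Dict[str, Tuple[float, float]]) -> List[TermObject]:
    """Split text into words and wrap each one in a TermObject.

    Words found in the gazetteer are classified as places and receive
    coordinates; all other words remain unclassified.
    """
    objects = []
    for token in re.findall(r"[^\W\d_]+", text):  # runs of Unicode letters
        if token in gazetteer:
            objects.append(TermObject(
                term=token,
                classification="place",
                coordinates=gazetteer[token],
                properties={"source": "local gazetteer"},
            ))
        else:
            objects.append(TermObject(term=token))
    return objects


terms = extract_terms("Von Verona nach Rom", GAZETTEER)
places = [t for t in terms if t.classification == "place"]
for place in places:
    print(place.term, place.coordinates)
```

Keeping every term as an object, including the unclassified ones, leaves room for later classification as a person, animal or plant name, as the abstract suggests.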

Brief Biography of the Speaker: Dr. Detlev Doherr has been Professor of Informatics and Geoinformatics at the University of Applied Sciences Offenburg, Germany, since 1990. He received his diploma and Dr. rer. nat. degrees from the University of Göttingen, Germany, in 1983. He then worked in the German rock salt and potash industry, where he developed a Geographical Information System for mining and exploration, before taking up his professorship in Offenburg in 1990. There he developed the campus network of the University and became director of its Computer Center. In 1992 he founded the Steinbeis-Transfercenter of Information Technology in Offenburg, which is part of the German Steinbeis-Stiftung. Since 2001 he has been working in the fields of digital libraries, Internet portals and virtual environments. He has more than 20 years of experience in developing Internet-based information systems combined with artificial intelligence. He has published about 100 articles and given presentations on GIS, the Humboldt Digital Library, and digital libraries. Prof. Doherr is a member of the organization and programme committee of the International Conference on Complexity, Informatics and Cybernetics, USA, and a member of the International Institute of Informatics and Systemics. His current interests include artificial intelligence, knowledge-based computing, information technology, and the history of natural sciences.

