21 September 2020

ARTICLE: Annemieke ROMEIN, Sara VELDHOEN & Michel DE GRUIJTER, “The Datafication of Early Modern Ordinances”, [Digital Humanities in Society] DH Benelux Journal II (2020)


The journal Digital Humanities in Society has published “The Datafication of Early Modern Ordinances” by Annemieke Romein, Sara Veldhoen and Michel de Gruijter.

Abstract: The project Entangled Histories used early modern printed normative texts. The computer used to have significant problems reading Dutch Gothic print, which is used in the vast majority of the sources. Using the Handwritten Text Recognition suite Transkribus (v.1.07-v.1.10), we reprocessed the original scans that had poor quality OCR, obtaining a Character Error Rate (CER) much lower than our initial expectations of <5% CER. This result is a significant improvement that enables searching through 75,000 pages of printed normative texts from the seventeen provinces, also known as the Low Countries. The books of ordinances are compilations; thus, segmentation is essential to retrace the individual norms. We have applied – and compared – four different methods: ABBYY, P2PaLA, NLE Document Recognition and a custom rule-based tool that combines lexical features with font recognition. Each text (norm) in the books concerns one or more topics or categories. A selection of normative texts was manually labelled with internationally used (hierarchical) categories. Using Annif, a tool for automatic subject indexing, the computer was trained to apply the categories by itself. Automatic metadata makes it easier to search relevant texts and allows further analysis. Text recognition, segmentation and categorisation of norms together constitute the datafication of the Early Modern Ordinances. Our experiments for automating these steps have resulted in a provisional process for datafication of this and similar collections.
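
For readers unfamiliar with the metric mentioned in the abstract: CER is commonly computed as the edit distance between the recognised text and a ground-truth transcription, divided by the length of the ground truth. The following is a minimal Python sketch under that standard definition. The abstract does not describe the authors' exact evaluation procedure, so the function names and the absence of any text normalisation are assumptions made for illustration only.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `reference` into `hypothesis`."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground truth.
    (Hypothetical helper; no whitespace or case normalisation is applied.)"""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a CER below 5% means fewer than five character edits per hundred ground-truth characters are needed to correct the recognised text.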

 

The full text is available via DOI 10.17613/80sx-m116.
