The journal Digital Humanities in Society has published “The Datafication of Early Modern Ordinances” by Annemieke Romein.
Abstract: The project Entangled
Histories used early modern printed normative texts. Computers previously had
significant difficulty reading Dutch Gothic print, which is used in
the vast majority of the sources. Using the Handwritten Text Recognition suite
Transkribus (v.1.07-v.1.10), we reprocessed the original scans that had
poor-quality OCR, obtaining a Character Error Rate (CER) well below our initial
expectation of <5%. This result is a significant improvement that
enables searching through 75,000 pages of printed normative texts from the
Seventeen Provinces, also known as the Low Countries. The books of ordinances
are compilations; thus, segmentation is essential to retrace the individual
norms. We have applied – and compared – four different methods: ABBYY, P2PaLA,
NLE Document Recognition and a custom rule-based tool that combines lexical
features with font recognition. Each text (norm) in the books concerns one or
more topics or categories. A selection of normative texts was manually labelled
with internationally used (hierarchical) categories. Using Annif, a tool for
automatic subject indexing, a model was trained to assign the categories
automatically. Automatic metadata makes it easier to search for relevant texts and allows
further analysis. Text recognition, segmentation and categorisation of norms
together constitute the datafication of the Early Modern Ordinances. Our
experiments in automating these steps have resulted in a provisional process for
the datafication of this and similar collections.
The full text can be found here (DOI 10.17613/80sx-m116).
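
For readers unfamiliar with the Character Error Rate mentioned in the abstract, the following is a minimal sketch of how CER is commonly computed: the edit (Levenshtein) distance between a reference transcription and the recognised text, divided by the length of the reference. The function names and the sample strings are illustrative assumptions; this is not the Transkribus implementation, and the article may compute the metric differently.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: one wrong character in a four-character word.
    print(character_error_rate("abcd", "abed"))  # 0.25
```

Under this definition, a CER below 5% means fewer than five character edits per hundred reference characters.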