From 9 to 13 June 2025, Princeton University hosted a week-long discussion of the state of the art in Handwritten Text Recognition (HTR) and its broader context within Automated Text Recognition (ATR).
Following a series of HTR winter schools held in recent years at the Institute for Medieval Research at the Austrian Academy of Sciences, a three-day workshop at the beginning of the week provided Princeton humanities students with both foundational knowledge and hands-on experience. The instructors introduced the technical basis of machine learning, the specific challenges of the HTR/ATR use case, and the range of available tools—particularly Transkribus and eScriptorium, which played a central role in the practical sessions. Students conducted test runs using their own data, and considerable time was spent discussing experiences, model selection, and the critical evaluation of results. Key questions included: Was layout detection functioning properly? To what extent was the Character Error Rate (CER) influenced by punctuation or abbreviations? And, when training one’s own model—how scalable is it?
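To make the CER question above concrete, the following minimal Python sketch (purely illustrative: the example strings and the punctuation-stripping rule are invented here, and real evaluations would normally rely on the metrics built into Transkribus or eScriptorium) scores a hypothetical transcription against a reference with and without punctuation, showing how much of a reported error rate can stem from normalization choices alone.

```python
# Illustrative sketch only: CER = edit distance / reference length,
# computed once on the raw strings and once with punctuation stripped.
# The sample transcription below is invented for demonstration purposes.
import string

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, prediction: str) -> float:
    return levenshtein(reference, prediction) / max(len(reference), 1)

def strip_punct(text: str) -> str:
    return "".join(c for c in text if c not in string.punctuation)

reference  = "Anno d'ni. m.cccc.xxi."   # hypothetical ground truth
prediction = "Anno dni mcccc xxi"       # hypothetical model output

print(f"raw CER:      {cer(reference, prediction):.2f}")
print(f"no-punct CER: {cer(strip_punct(reference), strip_punct(prediction)):.2f}")
```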
The workshop, held each afternoon, was followed by a two-day conference at the Institute for Advanced Study, which broadened the conversation. Around 50 participants (mainly from European and North American universities) presented their work and projects, helping to map out the diversity of approaches and spark lively debate during the final session.
One central issue concerns the future development of the large-scale models currently available in Transkribus and eScriptorium. One path forward is represented by PARTY, a multilingual and multiscript ATR model that encodes images and decodes them into text—page by page—using language models. Its strengths include a broad training base (660,000 historical document pages), while its drawbacks include the need for substantial datasets to fine-tune the model. The contrasting paradigm involves a variety of smaller models trained on relatively modest datasets, often aimed simply at producing a functional first transcription—frequently for “minority” languages or scripts such as Yiddish or Glagolitic. An overarching question persists: to what extent should large language models (LLMs) be integrated into this process, and if so, where, how, and with what kind of explanation (especially regarding terminology) for scholarly users?
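To illustrate what the encoder-decoder paradigm mentioned above involves, the sketch below shows a schematic PyTorch model that encodes a page image into visual features and decodes them autoregressively into text tokens. It is a toy stand-in under invented assumptions (layer sizes, vocabulary, module names), not a description of PARTY's actual architecture or training setup; its only point is the division of labour between an image encoder and a language-model-style text decoder working page by page.

```python
# Schematic page-level image-encoder / text-decoder ATR model.
# All sizes and names are illustrative assumptions, not PARTY's real design.
import torch
import torch.nn as nn

class PageATR(nn.Module):
    def __init__(self, vocab_size=512, d_model=256):
        super().__init__()
        # Image encoder: a tiny CNN standing in for a real vision backbone;
        # it turns a page image into a sequence of visual feature vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, page_image, token_ids):
        # page_image: (batch, 1, H, W); token_ids: (batch, seq_len)
        feats = self.encoder(page_image)               # (batch, d_model, h, w)
        memory = feats.flatten(2).transpose(1, 2)      # (batch, h*w, d_model)
        tgt = self.embed(token_ids)                    # (batch, seq_len, d_model)
        seq_len = token_ids.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)  # text decoding conditioned on the page
        return self.lm_head(out)                       # next-token logits

model = PageATR()
logits = model(torch.randn(1, 1, 128, 96), torch.zeros(1, 16, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 16, 512])
```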
The choice of approach depends significantly on the community behind a corpus or project and on the goals being pursued. These may include efforts to scale up the quantity of handwritten Persian texts in OpenITI via Automatic Collation for Diversifying Corpora, or the engagement of local communities in Chocó, Colombia, to help correct historical materials auto-catalogued with vision-language models (VLMs). Other initiatives might focus on tracking character frequency distributions in time and space, or on using CNNs for text recognition of Greek epigraphy.
Once ATR/HTR has been successfully applied to manuscript material, it opens the door to a range of analytical possibilities. One example is a project on “Scripts, Scribes and the Production of Literature in London 1377-1471,” in which computer vision methods are used for scribal hand identification, essentially a form of digital palaeography, with a particular focus on Gothic Cursive. Automated scribe classification using vision transformer models has also been tested on Vat. Lat. 653, with accompanying accuracy visualizations designed to explain the algorithm’s decisions, although these often hinge more on background features than on actual character shapes. Such explanations remain ex post facto and do not necessarily reveal the model’s internal statistical reasoning.
At a more granular level, transformer models are used to compare individual letters in Greek papyri from the era between Alexander the Great and the Arab conquest. But the role of the individual character remains ambiguous in digital palaeography, not least because medieval scribes often aimed to de-individualize their handwriting. Conversely, images of entire papyrus pages (not individual characters) have shown potential for automated dating.
Other research cases discussed included: Latin and Celtic glossing traditions (analyzed using eScriptorium with SegmOnto for layout and a CATMuS model for transcription); orthographic change in Czech texts between the 15th and the 19th centuries (using a Transkribus model based on a late medieval norm); and code-switching in Czech words embedded in Latin sermons, including the identification of trigger words for such switches.
Future directions may include automated AI-based text comparison or Named Entity Recognition; many dedicated software packages already exist for such purposes.
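As a concrete pointer to what such downstream steps can look like, a minimal example with one widely used package (spaCy, assuming its small English model is installed; the sample sentence is invented) runs Named Entity Recognition over a transcribed line:

```python
# Minimal illustration of Named Entity Recognition on ATR output with spaCy.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The letter was sent from Princeton to Vienna in June 1725.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. place names as GPE, "June 1725" as DATE
```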
Some use cases present particular challenges that go far beyond typical HTR/ATR applications, for instance X-ray tomography (with Dragonfly) to read fragments inside book bindings, or the automated transcription of Byzantine neumes, in which the neumes are converted into word sequences and aligned semantically and visually. This raises the question of whether it might eventually be possible to work directly with the sonic dimension of historical music.
Other presentations focused on community-building, often centered around a particular language or script, such as Syriac, Sanskrit, Indic scripts, or Arabic. In many cases, layout recognition, a crucial component of ATR, may be best addressed using lightweight vision models. These communities, as in the case of Old French, also highlight the need for shared standards, whether for eScriptorium server specifications, dataset citation standards of different types, or transcription guidelines like those developed by CATMuS. Such resources help researchers make informed decisions when creating ground-truth data for training.
Yet another form of community is built within heritage institutions, which increasingly see manuscript collections as research data and aim to support workflows and pipelines accordingly. The Linguistic Data Consortium provides similar support, for instance through workflows for Multilingual Automatic Document Classification, Analysis and Translation, through the linguistic processing of low-resource and endangered languages, or through ground-truth datasets (together with images). While using existing editions as ground truth has not yet been tested, it is clear that such tools are essential for academic teaching; otherwise, students will not be able to work effectively with manuscript sources.
Throughout the discussions, participants repeatedly emphasized the need for collaboration and exchange—whether through domain-specific communities, shared infrastructure, collaborative projects, or international fora.
A long-term goal will be to structure and organize the discussions and findings from this workshop into concrete agendas—at the levels of community, academic research, and technical infrastructure. These agendas must remain realistic, adapted to what can be achieved through cooperation, and also identify areas where individual solutions (shaped by local infrastructures) will likely continue to dominate.
This will require distinguishing between academic fields in which collaboration can build on shared models, datasets, and results, and services that depend more on user engagement than on research per se.
Topical clusters emerging from the workshop thus include: (a) Corpus building / standards / ground-truth provision; (b) Models – universality, AI integration, chunking, quality control; (c) Outreach, teaching, community-building, and business models; (d) Interfaces, workflows, publication pipelines, and scholarly use cases.
A follow-up meeting is planned for late summer 2026 in Vienna, with the aim of further developing such agendas, potentially in the framework of a COST Action or similar collaborative initiative. The event will be announced and featured on this website. (Report by Thomas Wallnig, Vienna)