IMC Leeds Paper: Sending 15th-Century Missives through Algorithms: Testing and Evaluating HTR with 2,200 Documents

[This paper was given at the IMC 2017 in Leeds.] For the discussion on Twitter, see also the Storify.


Abstract

Is it possible to teach algorithms to read medieval handwriting? Does it make sense to have the material prepared by students who are learning to read gothic script at the same time? These two simple questions lay the groundwork for discussing how, and whether, handwritten text recognition and the teaching of the Middle Ages can be intertwined.

The material for both tasks consists of 2,200 missives from Thun, a small town in Switzerland. 120 documents were transcribed and used for training. In the process, three difficulties were identified: different and changing hands, difficult layout structures, and abbreviations. These difficulties are typical for such an endeavor. Unfortunately, the results of the recognition are insufficient and cannot be used by scholars. The "small" amount of training material is one reason for this. Using language models, the results can be improved, although crucial parts such as names and verbs still remain only partially identifiable.

At the same time, the combination of teaching and the use of cutting-edge technological tools proved engaging. The students involved were highly motivated and welcomed the possibility to take part in a digital research endeavor.


Intro

The teaching of paleography is, in my opinion, one of the core means of gaining insight into what the study of the Middle Ages, and the Middle Ages themselves, are about.

At the same time, technology promises to help us with transcription and, in the future, also with the identification of places, persons, etc.

In the last six months, I tried to bring both aspects – the teaching of paleography as well as the technology – together by teaching students and algorithms to read gothic cursive of the 15th century.

This paper therefore lays out the idea behind the experiment, the material used, and the outcomes of the text recognition as well as of the teaching of paleography. It is basically a lab report (and thus fits well as a blog post).

Since the results of the tests with these documents were not sufficient, I would like to briefly present, at the end of the paper, a case in which handwritten text recognition worked better.

Idea

The idea was as simple as it was fitting: bring together a software tool called Transkribus and students from the University of Zurich in an endeavor to transcribe missives. The goal was to let the students experience reading handwriting and at the same time provide material to train handwritten text recognition. Of course, we also wanted to try out the HTR in the classroom.

READ_imc-leeds-2017-07-03-1.png

The idea came from teaching with an e-learning tool developed at the University of Zurich, called «ad fontes».

The e-learning tool teaches people to read handwriting (with a focus on medieval and early modern scripts).

The approach has a lot of advantages; for example, the user can get tips or receive feedback on words that were not deciphered or were deciphered wrongly (e.g. you see that "Schultheiss" was incorrectly transcribed with a "z").

The downside of the tool lies in the fact that nothing new can be discovered: everything is presented in a perfectly streamlined way. In order to give students an impression of how actual research, or at least the transcription of individual documents, works, I started using Transkribus, a tool that allows for transcription in the cloud.

The documents used in Transkribus are stored on servers at the University of Innsbruck. Access is given only to those designated by the uploader and/or the owner of a collection. Transkribus can also be used to train recurrent neural networks for handwritten text recognition (or rather, to produce a model for text recognition). It was therefore an obvious step to bring both approaches together and use the material provided to produce a model specific to fifteenth-century missives.

READ_imc-leeds-2017-07-03-2.png

Since the training of handwritten text recognition, like that of other neural networks, depends on large amounts of training material to give good results, I added further transcriptions that had already been provided by students on a wiki (developed some years ago).

But let’s first take a step back and look at the material:

From the first half to the end of the fifteenth century, more than two thousand missives have been preserved containing instructions of the city council of Berne to its bailiffs in Thun, a small town that became part of the Bernese territory at the end of the fourteenth century. These documents shed light on local dimensions of a city’s territorial lordship. Like many other cities subject to similar development, Berne acquired territorial lordship over an extended hinterland, claimed control over minor cities, and used them as district-towns in its territorial administration.

The topics dealt with in the missives are very diverse and sometimes amusing to the modern reader: skirts stolen by husbands, or roosters deemed too loud in the morning. As a consequence, petitioners (male and female) went to Berne and complained in front of the council, and the bailiff was then contacted by the council by means of such missives.

The corpus spans more than 100 years, therefore we find a variety of scribes and also an evolving chancery that put forth rather irregular scripts.

Four problems, or rather elements, influenced the results of the HTR:

  1. Different scribes: the students did not have to focus on a single scribe or time frame; the transcriptions produced therefore came from a variety of scribes.
  2. Rather irregular scripts (the documents were not meant for display but had to be sent out quickly).
  3. Layout recognition (or layout analysis, as it is called) does not work properly: the task is difficult even for regular scripts, and more so for missives (for this reason, layout analysis was not taken into account).
  4. Only problematic to some degree is the missing language model (like all vernaculars of the Middle Ages, early modern German lacks standardization); as we will see, we can still produce a very simple language model.

From the teaching perspective

Transkribus can be recommended, but only if you are willing to spend some time explaining the tool:

  • it allows for a simple feedback-loop thanks to shared working-space (everything saved in a cloud)
  • but no real-time help

Every student prepared transcriptions of two missives (11 students, which makes 22 missives); they were also responsible for correcting the layout. The process of transcribing led to intense (and heated) discussions about rules for transcription (normalization, dealing with abbreviations). For HTR this is very important: expanding abbreviations without "telling" the algorithm, for instance, leads to diminished recognition quality.

At the same time, the discussions were a very fruitful way to understand what rules and forms of normalization are common in editions (digital or analogue), but also to raise the question of what makes a "text".

In order to produce further training material, I added about 100 already transcribed missives myself. In the end, the so-called ground truth consisted of 21,682 words on 1,900 lines.

READ_imc-leeds-2017-07-03-4.png

A look at the training curve shows that the test sets (the documents that were prepared but not used for training, in order to determine the quality of the recognition) were recognized with a Character Error Rate of around 26%.

This means that more than every fourth character was recognized incorrectly. That is quite a lot, as the example in the screenshot shows.
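The Character Error Rate is, in essence, the edit distance between the recognized text and the reference transcription, divided by the length of the reference. A minimal sketch of the computation (the sample strings are invented, not taken from the Thun corpus):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, recognized: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(reference, recognized) / len(reference)

# Invented example line (not an actual missive)
print(round(cer("Schultheissen und Rat zu Bern", "Schultheizen vnd Rat zu Bern"), 2))  # → 0.1
```

A CER of 0.26, as reported above, means 26 character errors per 100 reference characters.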

READ_imc-leeds-2017-07-03-5.png

This is one of the "better" examples, with a CER of 20%: "only" one in five characters is identified incorrectly.

As mentioned, there is the possibility to add a "language model", or rather a vocabulary consisting of the words used in the 120 transcribed and prepared missives. With such a model, the Character Error Rate (CER) decreases dramatically, by around 7-9 percentage points (so we get from 20% down to 13%!).

READ_imc-leeds-2017-07-03-6.png

Of course this is some sort of "cheating", but medieval documents do tend to be repetitive. The missives are similar to diplomatic documents and are rather formulaic in their wording.

The improvement in recognition mostly concerns standard phrases, which leaves us with wrongly recognized named entities and verbs.
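One crude way to picture what such a word list does is to snap each recognized word to the closest vocabulary entry, but only when it is similar enough: standard words get corrected, while a garbled name stays garbled. This is only an illustration of the idea (the vocabulary below is invented), not how the decoder in Transkribus actually integrates its language model:

```python
from difflib import get_close_matches

# Tiny invented vocabulary standing in for the word list from the 120 missives
vocabulary = ["schultheiss", "vogt", "stadt", "bern", "thun", "klage", "rat"]

def snap_to_vocabulary(word: str, cutoff: float = 0.75) -> str:
    """Replace a recognized word with the closest vocabulary entry, if one is similar enough."""
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(snap_to_vocabulary("schultheizz"))  # a common word snaps to "schultheiss"
print(snap_to_vocabulary("Ruodolfus"))    # a name has no close match and stays unchanged
```

This also mirrors the limitation noted above: only words present in the training vocabulary (i.e. the standard phrases) can be fixed, while names fall through.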

From the recognized transcriptions alone, the content is hard to identify. They can nevertheless be helpful if you are familiar with the script and only need a quick look at a document. At the same time, this means there is a disadvantage for anyone unfamiliar with, or unable to read, the script: the hard-to-read parts are also mostly the wrongly recognized ones (so it is not very helpful for beginners, and even less so for lay people).

Thanks to the close connection between text and image, you can check the transcription against the image and make sure it is correct. You can also search through collections within Transkribus without exporting the documents (even fuzzy searches are implemented).

Still, with as few as a hundred-odd very short pages, you get a model that helps give a first impression of the content of a document, but not much more.

Dealing with such a broad range of documents (with their different scribes), it would be interesting to apply writer identification in order to determine for which documents a specific model works better. This is currently being integrated into Transkribus, and we expect it to be ready in autumn.


One scribe, one cartulary

Since the example demonstrated above led to rather insufficient results, let me briefly bring up an example that worked with a similar amount of training material (see also this blog post).

READ_imc-leeds-2017-07-03-7.png

For Königsfelden abbey, a monastery founded in 1309, a cartulary was produced in 1336. The codex was written in a very regular gothic book-script. Copies of charters given to the monastery were entered in the cartulary.

We prepared 25,658 words for training (almost 4,300 very short lines). The result is a character error rate of 10% on average (without language models, so it does not matter whether the copied charter was written in Latin or in the vernacular). Even abbreviations and special characters (such as the Latin genitive ending -rum) can be recognized with some reliability.

The difference between the two examples is obvious:

Whilst one corpus consists of different hands from different times, the other is the writing of just two or three scribes (and, to be honest, the model only works well for the main hand). Still, it can be said that the more training material for a single hand is available, the better the recognition is going to be.

Conclusion

As for the teaching experience, I can very much recommend the approach taken. Especially the high level of motivation among the students bears witness to their interest in using new technology, combined with the ambition to learn the skill set of an "analogue" paleographer. Nonetheless, it needs to be emphasized that the task of learning to decipher gothic writing remains an arduous one. One of the strengths of the approach was the need to discuss what "text" means with regard to a medieval document, and what is necessary in order to use it for further inquiries.

After all, the best-case scenario would be that we get a transcription of a folio or a page according to what we trained an algorithm to do.

Regarding technical aspects, it is in my opinion fair to say that we will shortly have good recognition of medieval handwriting. I am sure that Character Error Rates of around 10% (and even below) will become the standard.

Of course, we need to be aware that this cannot be compared to editions and human-produced transcripts, but we will still be able to access vast amounts of text from the Middle Ages in the years to come.

 

[Disclaimer: I’m associated with the project READ as a research associate at the state archives of Zurich]

READ_imc-leeds-2017-07-03-0.png


Interview on Digital Editions

A bit of advertising on my own behalf…

During the conference «Edition! Wozu? Wie? Und Wieviele?» last November, infoclio.ch took the opportunity to conduct several interviews with the people involved (see the reporting page with all interviews and recordings of some of the talks). I had the good fortune of giving short statements together with Christiane Sibille (DODIS) and Gerhard Lauer (University of Göttingen).

The result:

Interview on Panel 4: T. Hodel, C. Sibille, G. Lauer from infoclio.ch on Vimeo.


UZH and the «Subject» of Swiss History

[This post is, like the entire blog, independent of my employment at UZH]

At the History Department of the University of Zurich, Swiss history is now also being removed from the canon of minor subjects. In my opinion, this is no great loss; on the contrary. The outcry of some cantonal parliamentarians will probably remain a storm in a teacup (link: http://zol.ch/ueberregional/kanton-zuerich/UniFach-Schweizer-Geschichte-faellt-weg/story/13333164).

Until now, it was possible at the University of Zurich to study a variety of subfields alongside history: besides the chronological areas (ancient, medieval, and modern history), also Swiss history, at both Bachelor's and Master's level and with different numbers of credit points. Cutting these programs was a decision of the department conference, supported also by the university's constituent bodies, in order not to offer an overly broad and confusing range of programs that contribute nothing to the department's profile.

The problem with the now-abolished study programs lay in their conception, or rather the lack thereof. Little was invested in creating a curriculum that would have distinguished the programs as independent. This would in principle have been easy, since many of the courses held anyway could be offered within the programs.

The skills acquired in these programs accordingly did not differ at all from those acquired in the "main programs" (history as a major or minor).

Conversely, this means that no courses were held because of the existence of the programs.
Accordingly, the courses provided no added value either for the students of the programs or for the department. On the contrary, their existence tended to cause confusion and uncertainty (and, in part, a search for courses that could be credited within the programs).

Does this mean that Swiss history will no longer be taught at the University of Zurich?
No, on the contrary. A look at the courses of the coming spring semester, for instance, shows that many of them address events related to Switzerland (even if this sometimes only becomes clear with the programs in mind).

The final theses, too, show that Switzerland and its history are of interest to students and their supervisors. Even, and especially, at chairs with a focus outside Switzerland, strong links to the country's history are evident (see, for example, the theses supervised by Prof. Krüger: http://www.hist.uzh.ch/fachbereiche/neuzeit/lehrstuehle/krueger/forschung/lizmasterarbeiten.html)

Consequently, Swiss history at the University of Zurich is not weakened by the abolition of the corresponding study programs; all that is removed is an administrative burden without added value.


Quantifying Witness Lists – An approach doomed to fail

The following is a short paper prepared for a course. Since I was not able to elaborate on it further and embed it in more material, it remains a miscellany with no specific claims.

The Witness Lists of the Cartulary of Holy Trinity, Aldgate

Research concerning persons, especially research dealing with masses of persons, often points to the use and possibilities of information and computational technologies: faster research and more intriguing results are said to be possible. Through digitization and quantification, more efficient and more precise work by historians ought to be possible. This paper seeks to test these promises by applying quantitative (factive) analyses to a collection of documents of Holy Trinity, Aldgate (London).

Introduction

Witness lists lend themselves to analysis based on quantification. For example, it can be useful to distinguish how many witnesses were "needed" for a certain type of document, if any, and how this changed over time. By looking solely at measurable factors, I try to reach as many conclusions as possible that could then be compared with results gained by close reading of similar sources. The danger of conclusions based on misinterpreted data is willingly accepted here, as a test of approaches that are mostly neglected in the field of historical studies. Besides the use of witnesses in medieval documents, the methodology applied is itself of interest.

The questions posed in this paper are therefore threefold:

  1. How often, and in what types of documents, did witnesses occur in the cartulary of Holy Trinity abbey, Aldgate?
  2. What conclusions regarding typological as well as temporal patterns can be reached?
  3. How useful is the methodology for gaining insights into document production and the connections between the appearance of witnesses, the date of document production, and the types of documents?

The idea is to neglect the "content" of the entries in order not to be influenced by biases such as expectations of how many and which witnesses should appear, and when.

In order to have a quantifiable sample, I chose to analyze the documents copied into the cartulary of Holy Trinity. Since the cartulary is available in a normalized English form following a standardized description, it is possible to structure the entries as data without great (linguistic and paleographic) effort. The structuring as well as the basics of the documents will be explained in part 3. Part 4 deals with the insights gained by the applied methods. Part 5 offers conclusions as well as a critique of the methods applied. But first, a short introduction to Holy Trinity and its cartulary.

Holy Trinity and its Cartulary

Holy Trinity, Aldgate (also called Christ Church) was one of the most important monasteries within the city of London. Founded in 1108 by Queen Matilda (c. 1080–1118), it was inhabited by secular canons following an Augustinian rule until its dissolution in 1532.[1] From the foundation onward, a strong connection to King and Queen as patrons can be found. Right from the beginning, the endowments were invested heavily in buildings, vestments, and other objects of display, leading to a scarcity of food and to the involvement of the locals at Aldgate, who donated bread to the canons.[2] Land was acquired foremost in London and rented to citizens; therefore most of the income stemmed from the city and was only partially augmented by revenues from out of town.[3] Starting around 1290, the income of the monastery increased until the dissolution in the 16th century.[4]


Figure 1: Detail of structured document, containing seven entries of the cartulary

The cartulary is one of the main sources for the economic and political history of Holy Trinity. In the 18th century the manuscript was edited and partially printed.[5] The 1971 edition by Hodgett follows this tradition and treats the manuscript as a trustworthy collection of documents in the possession of the monastery.[6] The production of the cartulary itself is given only a limited account: although the time (between 1425 and 1427) and the scribe (Thomas de Axbridge) are known, the edition does not ask what the reasons for the production of the cartulary could have been, or why the documents were ordered by parish. Similarly, it does not ask why fewer and fewer documents were copied into the cartulary after the 13th century.

Without being able to consult the manuscript, it is hard to judge what reasons stand behind the production in the 1420s. The order of the documents suggests connections between the cartulary and books of accounts, bringing the scattered documents of a parish together in one place. Since even summations (aggregations) were part of every parish entry, it is highly likely that the book was needed in order to defend or execute entitlements. This would explain the different types of documents and some of the frequencies (outlined below in figure 3).[7] Yet these are only assumptions that need further research.

In order to establish a distinct nomenclature for this paper, a "document" refers to a document that was copied into the cartulary. An "entry" is a part of the cartulary, i.e. a document, but also a notice (similar to a chronicle entry) or a summation. "Manuscript", meanwhile, describes the cartulary as a book.

Structures of the documents – structuring the cartulary

The basis of this analysis is the edition by G. A. J. Hodgett published in 1971. This edition of the cartulary was digitized by British History Online,[8] "a digital library containing some of the core printed primary (…) sources for the medieval (…) history of the British Isles."[9] Most parts of Hodgett's edition do not consist of full-text transcriptions but of modernized and standardized summaries of the documents copied into the cartulary. The cartulary itself, as mentioned above, is not analyzed as an object in its own right by the editor; thus every copy is treated as a single entry referring to the document that should have existed at the time of the production of the cartulary.[10] Each entry is numbered,[11] followed by the time of production (or a time frame if unsure) and a typological classification.[12] There are many types of documents, sometimes overlapping: grants appear most often, followed by lists of those paying (quit) rents, notes, and summations of parishes.[13] The type of the charter is followed by a description of the act attested by the document, outlining the legal act, the parties involved, and the amount of money that was part of the agreement. Given the goals of this paper, those parts have been almost entirely ignored. Of far more interest are the lists of witnesses attached to the descriptions of the documents. Although witness lists were often shortened in cartularies, it seems this practice was not followed at Holy Trinity. While this is obviously beneficial for the present study, we may still wonder why the witnesses were copied.

figure2

Figure 2: Quantity of entries in cartulary by year.

In order to work with the available material, a structured document was created that can be searched and interpreted using so-called "regular expressions".[14]
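The structured file itself is not reproduced here, so the fragment below is invented (including the tag names) purely to illustrate how regular expressions pull witness names out of such structured entries:

```python
import re

# Invented fragment in the spirit of the structured file; the tag names are hypothetical
entries = """
<entry n="101" year="1222" type="grant">
  <witness>Gilbert Fulc</witness><witness>William de Alegate</witness>
</entry>
<entry n="102" year="1197" type="lease"></entry>
"""

# Every witness name in the whole file
witnesses = re.findall(r"<witness>(.*?)</witness>", entries)
print(witnesses)  # ['Gilbert Fulc', 'William de Alegate']

# Witnesses counted per entry
for n, body in re.findall(r'<entry n="(\d+)"[^>]*>(.*?)</entry>', entries, re.S):
    print(n, body.count("<witness>"))  # 101 → 2, 102 → 0
```

Queries of this kind (how many entries have witnesses, how many witnesses per entry) underlie the figures reported in the following paragraphs.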

Of the 1,073 entries, at least 366 contain a witness. Subtracting the summations of parish totals (84), the lists of those paying (quit) rent (264), and the chronicle entries (22), 703 entries could possibly include witnesses.[15] In slightly more than 50 percent (52%) of these entries, at least one witness is mentioned.

A charter that names witnesses at all mentions 3.65 of them on average. Of the 1,336 witnesses in the cartulary, about 1,080 are mentioned only once; 134 appear twice or more.[16]
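Once the witness names are extracted, figures of this kind reduce to a frequency table. A minimal sketch with placeholder names (the real list would hold the roughly 1,336 mentions):

```python
from collections import Counter

# Placeholder mentions standing in for the full extracted witness list
mentions = ["Gilbert Fulc", "William de Alegate", "Gilbert Fulc",
            "Stephen the Tanner", "Ralph", "Gilbert Fulc", "Ralph"]

freq = Counter(mentions)
once = sum(1 for c in freq.values() if c == 1)           # persons appearing exactly once
twice_or_more = sum(1 for c in freq.values() if c >= 2)  # persons appearing repeatedly

print(len(freq), once, twice_or_more)  # 4 2 2
print(freq.most_common(1))             # [('Gilbert Fulc', 3)]
```

The same table also yields the maximum and minimum numbers of witnesses per entry discussed next.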

The structured document proves useful for determining how many witnesses to expect. It is thus possible to determine that at most 18 witnesses were listed (in 1193),[17] whilst several entries name only one witness.[18]

Entries mentioning an undefined number of witnesses (like "and further noblemen") were not counted.

The distribution of charters over the years shows that most entries were written between 1147 and 1272. Some years dominate the entries in the cartulary, perhaps because in cases of uncertain dating (i.e. postquam dating) the earliest possible date was taken.[19] No distinction was made between dates validated with certainty and dates only assumed. On average, almost 1.06 dated entries per year can be found in the cartulary.[20]

In order to fully understand the witnesses that appear (especially their shifting quantities), it is necessary to describe all entries of the cartulary in a similar manner, which leads to a typology taken over from Hodgett's edition (and, of course, a strong point of attack).

Although several checks and controls were conducted, there will still be errors in the 3,945 lines of the structured files, a caveat that relativizes all conclusions to come.

figure3

Figure 3: Distribution of types of charters, including those containing witnesses.

The typology of the different charters demonstrates that mostly grants were copied into the cartulary. Lists of those paying (quit) rents make up the second largest part (together with the grants, more than 80 percent). The fact that mostly grants were witnessed is very intriguing, since it makes claims about the nature of grants possible and strengthens the presupposition that the transmission of grants was one of the main goals of the production of the cartulary.[21] Combined with the entries of summations and the lists of leaseholders, a system of accounting becomes the most likely "background" of the cartularisation. Regarding the distribution of entries containing witnesses, it becomes obvious that no type of document necessarily required the involvement of named witnesses.[22]

Applied Quantifications – What the Numbers Tell

The next question concerning the witness lists deals with the distribution of witnesses per entry per year, in order to tell whether there was a shift in the sheer quantity of witnesses listed in the documents.

Every dot in figure 4 symbolizes the number of witnesses in a given document in a particular year. Looking for patterns, it becomes obvious that no development towards a more standardized number of witnesses per document can be stated over the long run. On the contrary, although two or three witnesses seem to become rather "normal" at the end of the 12th century, around 1300 the diversity grows again (perhaps also because the sample around that time gets thinner). Between 1190 and 1280, moreover, many documents were produced naming either more witnesses than the usual two or three, or fewer, naming just one. The one-witness entry is a frequent option only between 1215 and 1250, and it diminishes by the turn of the 14th century.

Further insights are promised by analyzing three factors at the same time: date, type, and quantity of witnesses (figure 5). The regular connection between witnesses and grants becomes obvious once again. As already shown in the typological comparison, grants mostly come with witnesses (more than 88% of the documents), and they do so steadily over time. Although grants usually contain about three witnesses, peaks and lows are not missing, and no connection between time frame and quantity (concerning peaks and lows) can be found. Concerning the overall quantity of witnesses in the documents, no pattern or evolution towards a consistent quantity is detectable, not even for certain types of documents. A tendency towards three or four witnesses on average per grant can perhaps be found between 1230 and 1280.

Figure 4

Figure 4: Distribution of Witnesses by year per document. Grey cross lines stand for two witnesses. The figure is to scale.

For the same time period, a concentration on using witnesses only in grants is also detectable (except for two leases and one release). Both before and after this time frame, the variety of types was broader, although it did not consist of the same types of documents in both periods. Whereas before, types such as "confirmations", "letters", and a "release" can be found, in the later period an "acquittance", "quitclaims", and others appear. Witnessed "exchanges" occurred in both periods (before 1230 and after 1280).

Patterns in documents produced at the same time – an excursus

Since certain years appear more often as dates of entries, the likelihood of patterns among the appearing witnesses is higher. Analyzing the entries of 1222 (or rather post 1222), which include 42 entries containing witnesses, shows that certain people, and even identical or almost identical combinations of people, appear: 7 documents were witnessed by Gilbert Fulc (or "son of Fulk"), either alone or accompanied by no more than one other named witness. Even more intriguing is the appearance of a recurring combination of witnesses in the same year: William de Alegate, Ralph his brother, Stephen the Tanner, Terricus, and Bartholomew (also a brother of William) appear among others (and twice in a different sequence) in 9 entries.

Similar to McKitterick's insights for Saint Gall, it can be stated that in 13th-century London witnesses were denominated (at least partially) in groups.[24] Since the charters cannot be dated exactly, it remains questionable whether the issuing of the charters happened on the same date or whether the same group was called upon on different dates.

Interestingly, the same cannot be concluded for the documents dated 1197 (or rather post 1197). In this group of documents, only 3 persons appear more than once.[25] In the same period, it is also conspicuous that a majority of the people listed as witnesses have a clerical background (especially in comparison with the group of 1222).[26]

Quantifying Witness Lists: a Conclusion and a Critique

The idea of this project was to rely solely on the "data" (rather than information) gained from the cartulary's digitized version, in order to test how far and in what directions a quantitative analysis could lead. The results are biased:

No constant pattern of when, and how many, witnesses were present at the production of a charter was found. Nor is there, barring grants to a certain degree, any type of document identifiable that had to have witnesses mentioned. Except for the period between 1230 and 1280, no evolution or streamlining of documents is detectable. Interestingly, however, right at the beginning of this period a pattern of groups of witnesses can be stated. Taken together, these two observations could point to an attempt to produce documents in a certain way using a certain group of people, or they could be a sign of the influence claimed by a certain group in the 1220s and 1230s. Between 1222 and 1248 Richard was prior, right at the time that "the greatest business activity took place"[27] according to Hodgett.[28]

These conclusions make two points obvious. First, a quantitative analysis only makes sense if compared with, and enhanced by, further perspectives that cannot be gained from pure numbers. Second, one of the main problems of this paper remains, or is even aggravated: the cartulary stands like a semi-translucent curtain between the documents and the historian. The uncertainty about what is trustworthy and what is not remains.[29] For example, the repeated occurrence of the same group of people could indicate a forgery.

Figure 5

Figure 5: Representation of quantity of witnesses, type, and time of production. The chart (years) is not to scale! The average value is taken if the same type appeared more than once in one year.

Nevertheless, depictions and quantifications might help to approach questions of why and how witnesses were "used" in documents (and, further, in medieval societies). Dealing with quantifications might help to detect patterns and modifications that would have gone unnoticed in close reading. Comparisons become more easily feasible and hone our approaches to different institutions and settings. Of course, a wider array of data would need to be collected in order to make more sustainable arguments.

[1] A short introduction to the monastery, its history, and its economic standing is given in the introduction to the edition: Hodgett, G. A. J.: The Cartulary of Holy Trinity, Aldgate: London Record Society 7 (1971), pp. xi-xxi, here: xiii-xvi. The site of the monastery was already inhabited by canons before its foundation, see ibid., p. xiii.

[2] Without in-depth insight, one could argue that this was done not because of the scarce endowment but in order to popularize the newly established monastery. Following the narrative of scarcity: ibid., p. xiv.

[3] Ibid., xvi; Hodgett estimates that 60 percent of the income came from the city. One of the neglected sources of income were the churches collated to Holy Trinity, cf. ibid., xvii.

[4] Ibid.

[5] Ibid, xi.

[6] Hodgett claims that the scribe of the book (Thomas de Axbridge) was not negligent but partially ill-informed.

[7] See Figure 2, p. 6.

[8] Hodgett: The Cartulary of Holy Trinity, Aldgate: London Record Society 7 (1971). URL: http://www.british-history.ac.uk/report.aspx?compid=64000 [accessed: 15 October 2013].

[9] Quoted from the site's self-description: http://www.british-history.ac.uk/Default.aspx [accessed 2013-10-15]. The resource was created and is maintained by the Institute of Historical Research and the History of Parliament Trust.

[10] Except for page breaks of the cartulary that are mentioned within the documents.

[11] Numbers run from 1 to 1073, barring an appendix.

[12] The classification is only partially consistent, since some of the charters were described at length rather than classified.

[13] As mentioned above (see page 1), the main goal of the cartulary might have been a stricter control of the dues; understandably, therefore, the parts mentioned occur most often. A list of the most frequent types of documents is to be found in figure 3.

[14] The document is in XML, a markup language that does not define the interpretation of the tags used but demands a strict hierarchy. The style of the structure is close to (but does not conform to) the quasi-standard of the TEI (Text Encoding Initiative) for the structured encoding of texts (especially editions): http://www.tei-c.org/index.xml [accessed: 2013-10-15].

[15] The subtracted entries were either never produced as charters and appear in the cartulary for the first time (such as chronicle entries and summations of parishes), or are traditionally not known to have contained a witness list (such as lists of those paying rents etc.).

[16] There is an uncertainty in these numbers because they were collected by comparing names, independently of the time of their appearance. It is thus possible that persons were counted as identical because they merely shared the same name. Conversely, it is also possible that persons appeared several times as witnesses but were counted as distinct persons, because the spelling of their name varied greatly (small variations were taken into consideration where possible) and/or they were referred to only by their first name.
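The counting procedure sketched in this note, treating small spelling variations of a witness name as the same person, could look roughly like the following. This is a minimal illustration, not the method actually used for the paper; the similarity threshold and the sample names are invented assumptions.

```python
from difflib import SequenceMatcher

def same_person(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two witness names as identical if their spellings are
    sufficiently similar (small variations tolerated)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def count_distinct(names):
    """Count distinct witnesses by greedy pairwise comparison of names."""
    distinct = []
    for name in names:
        if not any(same_person(name, seen) for seen in distinct):
            distinct.append(name)
    return len(distinct)

# Invented spellings: two variants of one name plus a second person.
witnesses = ["Ricardus prior", "Richardus prior", "Iohannes capellanus"]
print(count_distinct(witnesses))  # → 2
```

The two failure modes described above fall out directly: a too-low threshold merges namesakes into one person, a too-high threshold splits one person with variant spellings into several.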

[17] Entry n° 270, a grant of Jordan to Holy Trinity.

[18] To be found in the years 1087, 1135, 1136, 1170 (twice), 1180, 1197 (twice), 1215, 1222 (eleven times), 1223, 1228, 1231, 1241, 1243, 1247, 1250 (twice), 1252, 1270, 1303, 1308, as well as five undated entries.

[19] 1222 is mentioned in 45 entries, 1170 in 41, 1197 in 28. The post quem dating could refer to: 1170, assassination of Thomas Becket; 1222, council at Osney.

[20] The average per year is 1.05974 (all years considered).

[21] In this regard a comparison of grant holders and lease paying people could be very fruitful.

[22] “Sales” and “quitclaims” always contain named witnesses, but since they appear only in small numbers, the conclusion would not be robust. There is also no pattern to be found in the grants that do not contain witness lists.

[23] Grey cross lines stand for two witnesses. The figure is to scale.

[24] McKitterick, Rosamond: The Carolingians and the Written Word, Cambridge 1989, pp. 98-103.

[25] Roger, the chaplain of St. Edmund (twice); Robert, the chaplain (five times); John, chaplain of St. Michael.

[26] In the group of 1197 slightly more clericus than laicus can be found, whereas in the group of 1222 less than a handful clericus appear.

[27] Hodgett, Cartulary, p. xv.

[28] Assuming this is correct, it would mean that it is not the biggest spikes in the production of documents that point to such activity, but rather a steady production.

[29] Similar to the observations of: Geary, Patrick J.: Phantoms of Remembrance: Memory and Oblivion at the End of the First Millennium, Princeton 1994, pp. 112-114.


A roundup of my own contributions on DH2014 Lausanne around the web

While continuing to neglect my own blog, I have been active on other corners of the web and written about various aspects of the Digital Humanities 2014 in Lausanne:

For instance on Ordensgeschichte:

The piece was also published on the blog of Geschichte und Informatik (blog.ahc-ch.org).

Contributions for Infoclio, which kindly granted me free access to the congress:

Not to forget the Twitter retweets of much that was (more or less) relevant during DH2014 (see the Twitter column on the right and the search link below).


//


Digital Humanities Defined: too narrow?

In his post from Sunday, Michael Piotrowski (Twitter: @true_mxp) puts his finger on an important issue: what on earth are the DH, actually (link to his blog post)?

It should be said up front that Michael is right on many points: that DH is not a discipline in its own right (but has to operate within the “old disciplines”), that it offers a (broad, undefined) set of methods, etc. (in short: read the post).

His definition of the Digital Humanities is concise and coherent; he distinguishes between a “narrow” (1) and an “extended” (2) definition.
(1) The application of quantitative, computer-based methods to humanities research (that is, to answering humanities research questions).
(2) The application of computer-based tools to humanities research.

By way of illustration, he counts digital editions, for example, among the second group, but not as DH in the first, narrower sense.

In principle I am very much in favour of occasionally holding short discussions about the definition of Digital Humanities. Like Michael, I think that many definitions include too much and that the term is already misused too often as a buzzword (be it in research grant applications or for self-promotion). As he explains, keeping a blog has little to do with DH, just as little as conversations on Twitter or announcements on H-Soz-Kult.

My problem with his definitions is their strictness: (1) is too narrow for me, (2) too broad. (This sounds like a delightfully endless scholastic debate...)
The second definition would, in my view, turn every scholar into a digital humanist who merely switches on a computer now and then and runs a query on the search engine or library catalogue of their choice. Although I am not yet entirely convinced by the split into broad and narrow definitions, a refinement would therefore be needed here.

Regarding definition (1), I fully agree that all research it describes counts as DH. At the same time, more belongs in this category: when the term quantification is brought up, it implies statistical verification or falsification. In many conceivable humanities use cases, however, the data will be too sparse ever to reach a statistically robust result (and widening the sample is, for me, a thoroughly double-edged sword; big data is not the answer to everything). Would such research consequently still count as “quantified”?

Another field that is left out entirely is visualization. The skilled debater will object that this is not research/“thinking” in the strict sense, granted. Nevertheless, displaying material on a map, for example, can lead to insights that would escape even those familiar with the area (to say nothing of the potential of combining it with geological or other layers).

This brings me to the actual core: in my view, the humanities are about understanding people, their imagined worlds (realities?), their actions, and the influences upon them. To get closer to this, one often has to overcome one's own notions and logic (“estranging oneself”), and it is precisely for this that the machine on the desk can (and should) be used. With newly presented, rearranged data (sources), it is possible to arrive at different (perhaps better) intuitions about tangled problems. And precisely such impulses belong to the Digital Humanities (in my opinion even when the analysis could in theory be carried out without computing power, for instance by drawing a map oneself). Otherwise we are quickly back at the debates about quantification and “making things countable” that already tried to revolutionize historical scholarship in the 1970s and, in retrospect, triggered numerous trench wars.

The category Michael proposes for natural language processing and the like, Humanities Computing (outside the DH), could be taken as a prompt to think about introducing further categories that use DH as an umbrella term. Instead of a two-way split we would then have a multitude of sub-definitions...

And what, then, does it mean to practise Digital Humanities?

The use of digital(ized) resources and algorithms (programs?) to answer humanities research questions. (Well, not entirely satisfying either...) [a continuation of the discussion is probably inevitable :]


6th Schoenberg Manuscript Symposium – takeaways

The symposium, held in Philadelphia, regularly brings together experts from the “old” manuscript studies and the “new” digital world, and of course especially those who move in both worlds (the dividing line turned out to be extremely thin). That this collaboration invigorates all the disciplines involved was shown over the past days, and by a number of connections that grew out of earlier symposia.

Here are some approaches and tools that were (highly subjectively) judged exceptional or interesting:

To start with, a new approach to identifying scribal hands: in contrast to the attempt to identify individual characteristics of scribes, as practised in DigiPal, Elaine Treharne (Stanford) suggests that measuring the spaces between letters and lines might be a means of identification.

Kathryn Rudy's investigations point in a quite different direction: her “dirty books” (mostly books of hours from the Low Countries) are partly worn from handling, which opens up the possibility of estimating how the books were actually used. By means of densitometric analysis (essentially a measurement of discoloration) it can be shown that the pages opened most often were those promising the highest indulgences. More detail in her 2010 article (here). In a variation on this, she is now working on “osculatory targets”: illuminations that were kissed and thereby sometimes suffered serious “damage”. She argues that this act found its way from public ritual into the private handling of written artefacts.
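The core idea of the densitometric approach, using darkening as a proxy for how often a page was handled, can be illustrated with a toy sketch. The pixel values and page names below are invented for illustration; real densitometry works on calibrated scans, not raw pixel lists.

```python
def mean_darkness(page):
    """Average darkness of a grayscale page scan, where pixel values
    run from 0 (clean parchment) to 255 (fully darkened).
    'page' is a list of rows of pixel values."""
    pixels = [value for row in page for value in row]
    return sum(pixels) / len(pixels)

def rank_by_handling(pages):
    """Order page identifiers from most to least handled,
    using mean darkness as a proxy for wear."""
    return sorted(pages, key=lambda name: mean_darkness(pages[name]),
                  reverse=True)

# Tiny invented 2x3 "scans": the indulgence page is the darkest.
pages = {
    "calendar":   [[10, 12, 11], [9, 10, 12]],
    "indulgence": [[80, 95, 90], [85, 88, 92]],
    "litany":     [[30, 28, 33], [29, 31, 30]],
}
print(rank_by_handling(pages))  # → ['indulgence', 'litany', 'calendar']
```

The ranking step is where the historical argument enters: if the most darkened openings coincide with the pages promising the highest indulgences, wear becomes evidence of devotional practice.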

Quite different problems can be circumvented (and newly created) with T-Pen: the tool, which has been available for some time, assists in producing transcriptions and editions. By now the product has matured to the point that TEI is supported, and custom definition packages can also be loaded. Thanks to the inclusion of freely available manuscripts (for instance from e-codices), there is enough material even for a user who just wants to try it out. T-Pen is designed as free software and can therefore be installed and run on one's own servers, which should make the product interesting for editorial projects.

Much more closed and geared towards teaching is Homer Multitext. Although the Iliad manuscripts it covers are openly accessible, the true range of features is not visible on the website. Any part (of an image, of a text passage, etc.) can be cited and linked. Dictionaries and ontologies prescribe which values can be entered at all, with the entry of spelling variants also accounted for.

Finally, attention should be drawn to the project of Martin Foys (UCL), who with the Virtual Mappa Project annotates different maps and creates links between the maps and contemporary texts dealing with geography (described here). The software that makes the linking possible is exceptionally intuitive to use and well suited to building one's own corpora: DM Tools for Digital Annotation and Linking.

The programme of the whole symposium can be found here.