Old English Newsletter

The MANCASS C11 Database: A Tool for Studying Script and Spelling in the Eleventh Century

Kathryn Powell, Manchester Centre for Anglo-Saxon Studies

In 2004, the Manchester Centre for Anglo-Saxon Studies published the MANCASS C11 Database as a freely available web resource. The database is the outcome of a three-year AHRB-funded project, "An Inventory of Script Categories and Spellings in Eleventh-Century English," directed by Donald Scragg and Alex R. Rumble. The aim of the database is to provide users with a tool for studying the range of English spellings in use in the eleventh century and relating those spellings to the wide variety of scribes who used them. The project is online at http://www.art.man.ac.uk/english/mancass/data/index.htm.

Rationale

For many decades now, scholars have sometimes asserted, sometimes taken for granted the existence of a standard written form of English in the eleventh century and have formulated arguments—particularly about the origins and transmission of texts—based on this assumption. Yet there has been no systematic study to establish the existence and the form of such a standard, and Old English grammars still present the language of King Alfred as the standard dialect. It is easy to understand why this is so. While the language of King Alfred survives and can be studied in a manageable number of manuscripts, the language of the eleventh century is represented in the majority of surviving manuscripts written in Old English. Because editions typically cannot be relied upon to report variant spellings from all available manuscripts, any such study of the written form of Old English in the eleventh century would require one to consult at least a representative sample of these manuscripts. By gathering material from such a representative sample together in one place, organizing that material and making it searchable, the MANCASS C11 Database aims to make such a study feasible. Using the database, scholars should be able to investigate what spellings are used most frequently in eleventh-century manuscripts (perhaps frequently enough to constitute a standard), how much variation there is in eleventh-century spelling, and whether certain spelling variants can be associated with a particular scribe, variety of script or historical period. The database is not intended to provide the last word on standard written Old English in the eleventh century. Rather, by bringing together data on the spellings used in a large number of eleventh-century manuscripts with collected and updated information on the scripts found in them, the database should enable scholars to conduct their own research on the written language of the eleventh century and draw their own informed conclusions.

Methodology

The database draws its spellings from text files, marked up in XML, each of which represents the state of a particular text in a particular manuscript written in the eleventh century. So, for example, while the Vercelli Book itself pre-dates the eleventh century, versions of its homilies survive in eleventh-century manuscripts. Our database therefore would not simply include a text file of Vercelli Homily 1, but texts of the versions of this work that appear in eleventh-century manuscripts (such as Oxford, Bodleian Library MS Bodley 340; Cambridge, Corpus Christi College MSS 162 and 198; etc.). Within the database, we have referred to the contents of each of these text files as an "item." Not everything from every item gets included in results returned by the database. The database can only consider spellings where a whole word is clearly legible and where one can be certain that the original eleventh-century scribe, and not a later corrector, is responsible for that spelling. So, for example, erasures, corrections made in the margins or above the line of writing, and bits of badly charred text in the Cotton manuscripts are included in our text files, but marked up in XML as words to be ignored by the database. The database also ignores very commonly abbreviated words such as ond and þæt because, being so often abbreviated, they offer no good basis for comparing spellings across all the manuscripts included. Also, we have excluded proper nouns from our results because this work would overlap with or repeat work done by the Prosopography of Anglo-Saxon England project and the English Place-Name Society. However, most other words found in the text files uploaded into the database can be searched for and compared with alternate spellings found in other text files.
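The filtering described above can be pictured with a minimal sketch. The element and attribute names used here (item, su, w, status) are invented for illustration, since the article does not describe the project's actual XML schema; the only point is that flagged words stay in the text file but are skipped when spellings are collected.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag and attribute names are assumptions,
# not the project's real schema.
SAMPLE = """
<item id="example">
  <su n="1">
    <w>se</w>
    <w>cyning</w>
    <w status="corrected">cining</w>
    <w status="abbreviated">ond</w>
    <w status="damaged">hæfd</w>
    <w>hæfde</w>
  </su>
</item>
"""

def searchable_words(xml_text):
    """Return the words that carry no 'status' flag, in document order."""
    root = ET.fromstring(xml_text)
    return [w.text for w in root.iter("w") if w.get("status") is None]

words = searchable_words(SAMPLE)
```

Under these assumptions, only the unflagged words (se, cyning, hæfde) would be passed on for comparison; the flagged ones remain in the file as a record of the manuscript.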

In addition to these text files, the database also holds information on eleventh-century manuscripts containing Old English: the dates and (where known) provenances of manuscripts, information about scribes, characteristics of their hands, (in the case of a known scribe) other manuscripts worked on by a particular scribe, and the folio boundaries of individual scribal performances. Such information is drawn from a variety of sources including Ker's Catalogue of Manuscripts Containing Anglo-Saxon, Sawyer's Anglo-Saxon Charters, etc. Additionally, the database contains thumbnail images of various types of letterforms—a horned a, flat-topped a, etc.—drawn from facsimiles of eleventh-century manuscripts. These images can be assembled by the database to form an identikit of any scribe's hand which the user can view and compare with that of other scribes. When a text file is uploaded into the database, it is linked to what we have termed a "sequence"—a continuous piece of writing by a single scribe in a single manuscript. Through this linkage, any word in the text file can then be associated with any information contained in the database about the scribe who wrote it and the manuscript in which it appears.

Once items are loaded into the database and linked to manuscript sequences, they are then automatically broken up into individual words, and any words not already in the database have to be "lemmatized." This is a two-step process, whereby the part of a word that the computer should consider significant when it is comparing spellings has to be identified, and the word then has to be assigned to a lemma group of etymologically related words with which its spelling should be compared. It would be ideal from every perspective if this process could be automated, but it turns out that computers are not particularly trainable when it comes to recognizing Old English words. So the lemmatizing process is a somewhat time-consuming one that has to be carried out by the academic team, primarily by Donald Scragg and myself. It is not an easy process and it has required us to make some fairly arbitrary decisions. For example, in the case of compounds, since a single word cannot be counted twice and assigned to two different lemma groups in the database, we can only compare spelling in one element of a compound and we have normally chosen to compare the first element (on the basis that this is usually the element that receives the primary stress).
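As a rough sketch of the two-step record this implies, each word form can be thought of as carrying a stem and a lemma group, with forms not yet in the table set aside for manual work. The class, the table, and the stems assigned below are assumptions made for illustration, not the database's own structures or decisions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lemmatization:
    stem: str         # the part of the word compared for spelling
    lemma_group: str  # label for the group of etymologically related words

# Hand-built lookup table; the stems given here are illustrative guesses.
LEMMA_TABLE = {
    "cyninges": Lemmatization(stem="cyning", lemma_group="cyning"),
    "cynninges": Lemmatization(stem="cynning", lemma_group="cyning"),
    "cingc": Lemmatization(stem="cingc", lemma_group="cyning"),
}

def lemmatize(words, table):
    """Split words into those already lemmatized and those awaiting manual work."""
    known, pending = {}, []
    for word in words:
        if word in table:
            known[word] = table[word]
        elif word not in pending:
            pending.append(word)
    return known, pending

known, pending = lemmatize(["cyninges", "cingc", "cyninge", "cynninges"], LEMMA_TABLE)
```

Here cyninge is not yet in the table, so it lands in the pending list, which stands in for the queue of words that a member of the team would have to lemmatize by hand.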

One final caveat: because the processes of data entry and lemmatizing are being carried out by fallible human beings (sometimes after 3pm on a Friday), they are not immune to error. Members of the project team have tried to check each other's work and have corrected some mistakes, but there must be more that remain to be corrected by users of the database. Any errors users spot and let us know about will help improve the overall quality and usefulness of this resource. Please report any errors you spot via e-mail to mancass@manchester.ac.uk.

Using the Database

So, what can you do with the MANCASS C11 Database? At the simplest level, you can easily look up every eleventh-century spelling variant of a particular word, even if you only know one spelling, by following a link from the C11 homepage labeled "Search items for words or stems." For example, the word cyning [king] occurs frequently in Old English and can be spelled in a variety of different ways. To see all of its variant spellings, click on "Search items for words or stems," which takes you to a search form. The form lets you search either for whole words or for "stems" (words minus any prefixes, suffixes or inflectional endings). If you simply want to search for every spelling of a particular word, it is best to check "words." You can then type the word cyning into the text field and hit the "show words" button, which takes you to a page of results listing every variant spelling of cyning attested in the eleventh-century manuscripts and documents entered into the database.

On this page, the left-hand column lists every whole word found in the lemma group containing cyning. The second column lists the stem of each of these words. Since words in the list are sorted by stem, it is relatively easy to glance down this list and determine how many different spellings, minus inflectional endings, prefixes, etc., were found by the database. The third column contains statistics on how many times each word occurs and in how many items. The final column contains buttons labeled "show details," which we will return to later. A quick glance at the results page from our search for cyning tells us that, at present, forms of the word occur 812 times in the items that have been entered into the database, and that these occurrences include 45 different forms of the word with 21 different stems. Cyning is by far the most common stem, accounting for 644 of the 812 total instances, which suggests that this may have been a standard spelling of which all the other forms may be understood as variants.
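The summary figures on such a results page are simple tallies, and a short sketch may make the arithmetic concrete. The rows below are invented toy counts, not figures from the database; only the final line uses the numbers reported above (644 of 812).

```python
from collections import Counter

# (word form, stem, occurrences): invented toy figures for illustration only.
rows = [
    ("cyning", "cyning", 5),
    ("cyninges", "cyning", 3),
    ("cynninges", "cynning", 1),
    ("cingc", "cingc", 1),
]

def summarize(rows):
    """Tally occurrences per stem and report the share of the commonest stem."""
    by_stem = Counter()
    for _form, stem, count in rows:
        by_stem[stem] += count
    total = sum(by_stem.values())
    top_stem, top_count = by_stem.most_common(1)[0]
    return {
        "occurrences": total,
        "forms": len({form for form, _stem, _count in rows}),
        "stems": len(by_stem),
        "top_stem": top_stem,
        "top_share": top_count / total,
    }

summary = summarize(rows)

# The share implied by the figures quoted in the article.
reported_share = 644 / 812
```

On the article's own figures, the stem cyning accounts for a little over 79 per cent of all instances, which is the basis for treating it as a possible standard spelling.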

If you would like to know more about where and in what context particular forms of a word occur, clicking on the "show details" button next to a form will take you to a page which lists all the items in which that form occurs and shows you the sentence or sense unit containing that form. For example, if you ask the database to show details of the form cingc, you will find that this form occurs in copies of a king list and the will of King Alfred in London, BL Stowe 944, as well as in two different writs of King Edward, one found in London, Westminster Abbey, W.A.M. XVIII and one in Paris, Archives nationales K.19, no. 6a. Each of the sense units in which the form occurs is listed under the item heading and is preceded by a sense unit number (taken from the Toronto DOE Corpus wherever possible), which should give you some idea of where in the text the word occurs. This sense unit number is hyperlinked; by following this link, you can see if other copies of the same text survive and have been entered into the database and, if the database does contain such copies, compare spellings in parallel sense units. For example, if you ask the database to show details of the form cynninges, you see that the form occurs only once, in a copy of Ælfric's Catholic Homilies I.5 in Cambridge, Corpus Christi College 188, sense unit 39. If you click on the sense unit number, however, you will be taken to a page that displays parallel sense units from every copy of the homily included in the database (eight copies at present). Here, you can see that only CCCC 188 deviates from the dominant spelling cyninges in this sense unit.
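The comparison of parallel sense units amounts to finding the dominant reading and listing the copies that depart from it. The sketch below assumes the situation just described (eight copies, of which only CCCC 188 reads cynninges); the other seven copies are not named in the article, so they appear under placeholder labels.

```python
from collections import Counter

def deviating_copies(readings):
    """Return the dominant spelling and the copies that do not share it."""
    dominant, _count = Counter(readings.values()).most_common(1)[0]
    deviants = sorted(ms for ms, form in readings.items() if form != dominant)
    return dominant, deviants

# One named copy plus seven placeholder labels for the unnamed copies.
readings = {"CCCC 188": "cynninges"}
readings.update({f"copy {n}": "cyninges" for n in range(2, 9)})

dominant, deviants = deviating_copies(readings)
```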

The "show details" facility is particularly necessary in the case of homographs. The database has no way of distinguishing between two words that are spelled the same but have different meanings and are etymologically unrelated, such as witan, meaning either "to know" or "to depart." In such cases, unrelated words will appear as part of the same lemma group. By clicking on "show details," however, you can pull up the context of every occurrence of an ambiguous spelling and decide which occurrences are really relevant to your search. In this way and more generally, the ability to access the context in which a word occurs provides a means of double-checking the results returned by the database.

Our hope is that, as more items are uploaded into the database, you will be able to use it to help verify or qualify assertions that others have made about eleventh-century English. Just to provide an example, Kenneth Sisam bases his well-known argument about the compilation of the Beowulf manuscript in part on the distribution of io spellings in the manuscript. Specifically, he writes,

Since Ten Brink presented the evidence from Beowulf and Judith, it has generally been accepted that Beowulf is copied from a manuscript in which io often occurred for Late West Saxon eo of all origins. Beowulf, first hand, has 11 examples of io in 87 manuscript pages; Beowulf, second hand, 115 in 53 pages; Judith, in the second hand, no example in 15 pages. Ten Brink inferred that the second scribe preserved io in his copy, while the first usually eliminated it. He did not know that the three prose pieces are in the first hand of Beowulf, and that the Wonders contains a couple of examples of io, Alexander's Letter has many, Christopher none. This suggests that both Christopher and Judith were added to a collection characterized by io spellings (Kenneth Sisam, 'The Compilation of the Beowulf Manuscript', in Studies in the History of Old English Literature (Oxford, 1953), 65-96, at 67-8).

He goes on to suggest, on this basis and others, that the compilation came together in three stages—that Beowulf and Alexander's Letter have travelled together the longest, that the Wonders joined them at a later stage, and that Christopher and Judith were the most recent additions to the collection represented by this manuscript.

Unfortunately, the database does not currently contain any of Beowulf, but all of the other texts in the manuscript have been uploaded. One can easily use the database to search for io spellings in the stems of words. From the C11 homepage, follow the link, "Search items for words or stems." Check "stems" rather than words so that the database does not return words where io occurs as part of the ending of a verb, for example. Then enter the search "%io%" and hit "show words." The percent signs are wildcards which stand for any number of letters or none. The database also allows you to use an underscore ("_") as a wildcard standing for any single letter. After entering the search "%io%", you should see a page of results showing all the lemma groups containing at least one io spelling, including all of the eo, i and y spellings with which io varies. If we look, for example, at the word digel, we can see that digel and digl are the predominant spellings of this word, and that the database contains only one instance of an io spelling, dioglum. By clicking the "show details" button next to dioglum, we can see that this single example is indeed drawn from the Beowulf manuscript. If you continue working through the results of this search, you find (perhaps unsurprisingly) that Sisam was quite right: the Wonders of the East contains exactly two io spellings (hio in sense unit 48 and sio in sense unit 7) and Alexander's Letter contains 48, while Christopher and Judith contain none. Further, the io forms are hardly predominant even in the Letter, suggesting that the first scribe of the Beowulf manuscript has tended to correct them; for example, while the spelling dior occurs once in the Letter, forms of the more common deor occur thirteen times, and while forms of triow occur eight times, forms of treo(w) occur seventeen times.
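The two wildcards behave like those of an SQL LIKE pattern, and their effect can be imitated with a regular expression. This is only a sketch of the matching rule as described above, not the database's own implementation, and the list of stems (including diogl as the stem of dioglum) is assumed for illustration.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '%' (any run of letters, or none) and '_' (one letter) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def search_stems(pattern, stems):
    """Return the stems that match the whole wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return [stem for stem in stems if rx.fullmatch(stem)]

# Illustrative stems only; the stem assignments are assumptions.
stems = ["diogl", "digel", "dior", "deor", "triow", "treow", "hio", "sio"]

io_anywhere = search_stems("%io%", stems)
three_letters_ending_io = search_stems("_io", stems)
```

With this list, "%io%" picks out diogl, dior, triow, hio and sio, while "_io" matches only the three-letter stems hio and sio.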

So you can use the database to investigate the spellings of a single manuscript. You might well be able to do that sort of work on your own, although it might take you longer than it takes with the help of the database. But the database can also help you correlate your findings in one manuscript across a range of eleventh-century manuscripts. To turn to Sisam again for another example, he notes in the same article that the use of nænig in the Wonders is an unusual feature—specifically, that this form may be more Anglian than West Saxon and that it is abnormal in Ælfric (Sisam, Studies, p. 73). Our database was first assembled around a core of material from Ælfric's Catholic Homilies and so is reasonably well-equipped to demonstrate what is normal and abnormal in Ælfric. If we search for nænig, we find that the word is indeed unusual in the eleventh century and, if we look at the contexts in which it occurs, we find that it never occurs in Ælfric's work. But we also find some other interesting things. First, nænig is not confined to the Wonders in the Beowulf manuscript; it also occurs in Alexander's Letter and, interestingly, in Judith. Thus, it is used by both scribes of Beowulf. Second, the other items in which it occurs do not clearly support the suggestion that the form is more Anglian than West Saxon. Although the majority of occurrences are found in the Beowulf manuscript, which is largely unhelpful, the word also occurs in manuscripts associated with New Minster, Winchester and with Worcester. In other words, nænig is rare in the eleventh century, but not localizable. One would not want to make too much of such findings in isolation, but they are suggestive. And that is what we hope the database will be in general: a tool for discovery, suggesting further areas of research that might not occur to an academic working alone, examining one manuscript at a time in isolation.

In this regard, the database also offers a paleographic catalogue that includes images of letterforms found in eleventh-century manuscripts. You can use this catalogue to discover similar hands in different manuscripts without having a large number of manuscripts or facsimiles in front of you. From the C11 homepage, if you follow the link labeled "Search Paleographic Catalogue," you should see a collection of images of letterforms. Each of these images represents a particular type of letterform. So, for example, there are images representing different types of a, c, d, ð, etc., as well as various types of ascenders and ligatures. Each of these images is accompanied by a number that tells you how many sequences in the database contain that letterform. We are still in the process of assigning paleographic images to all of the sequences in the database, so these numbers are not definitive. Nonetheless, they should give you some idea of which letterforms are used most frequently in the eleventh century and which are unusual. Clicking on one of these paleographic images takes you to a list of sequences which contain that letterform. From this page, you can narrow down your search for similar hands by adding more letterforms to your search (e.g., searching for sequences which contain particular forms of a and g in combination). In addition to any research use that this paleographic catalogue might have, it may also prove useful in teaching, since the images allow individuals without any paleographic training to begin comparing letterforms and hands.
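Narrowing a search by adding letterforms is, in effect, a subset test on the set of letterforms recorded for each sequence. The sketch below assumes a simple mapping from sequence to letterforms; the sequence names and the "g type" labels are hypothetical placeholders, and only "horned a" and "flat-topped a" are letterform names mentioned in this article.

```python
from collections import Counter

# Hypothetical catalogue: sequence name -> letterforms recorded for that hand.
catalogue = {
    "sequence 1": {"horned a", "g type 1"},
    "sequence 2": {"flat-topped a", "g type 1"},
    "sequence 3": {"horned a", "g type 2"},
}

def letterform_counts(catalogue):
    """Count how many sequences contain each letterform."""
    return Counter(form for forms in catalogue.values() for form in forms)

def sequences_with(wanted, catalogue):
    """Return the sequences that contain every letterform in 'wanted'."""
    wanted = set(wanted)
    return sorted(name for name, forms in catalogue.items() if wanted <= forms)

counts = letterform_counts(catalogue)
one_form = sequences_with({"horned a"}, catalogue)
two_forms = sequences_with({"horned a", "g type 1"}, catalogue)
```

Each added letterform can only shrink the list of matching sequences, which is why combining two or three distinctive forms is a quick way to isolate similar hands.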

The database offers a number of other features as well, including the ability to search for spelling variants involving doubled letters (e.g., all words that can be spelled with both t and tt medially or finally), the ability to search for single-letter substitutions in the stems of words (e.g., all words where y is sometimes substituted for i), and the ability to look up specific manuscripts and find information about their dates, hands, contents, and what items from them have been uploaded into the database. Furthermore, development of the database will continue through the project's end date in April 2005. In addition to uploading more material into the database during this period, we are also in the process of developing new features, including a search for invariant spellings and a search for word frequency by paleographic era, which will offer a statistical summary of the use of a particular spelling during each of the roughly twenty-year periods covered by the database (980-1100). It is hoped that all of these tools will help scholars speculate on and arrive at new conclusions not only about the development of the language but also about the transmission history of particular texts and manuscripts, thus extending the range of work that can be done by both philologists and literary scholars.
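The doubled-letter and single-letter-substitution searches can be pictured as two small tests applied to pairs of stems within a lemma group. The functions below are a simplified sketch under that assumption, not the database's method; in particular, the doubling test ignores the "medially or finally" restriction, and the spelling cining is used purely as an illustrative variant.

```python
def differs_by_substitution(a, b, x, y):
    """True if a and b differ only at positions where one has x and the other y."""
    if len(a) != len(b) or a == b:
        return False
    return all(p == q or {p, q} == {x, y} for p, q in zip(a, b))

def collapse_doubled(stem, letter):
    """Reduce any run of the given letter to a single occurrence."""
    while letter * 2 in stem:
        stem = stem.replace(letter * 2, letter)
    return stem

def differs_by_doubling(a, b, letter):
    """True if a and b differ only in single versus doubled 'letter'."""
    return a != b and collapse_doubled(a, letter) == collapse_doubled(b, letter)

y_for_i = differs_by_substitution("cyning", "cining", "y", "i")
n_doubled = differs_by_doubling("cyning", "cynning", "n")
```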