Old English Newsletter

 



 

Lexomics for Anglo-Saxon Literature

 

Michael D.C. Drout, Michael Kahn, Mark D. LeBlanc, Amos Jones, Neil Kathok and Christina Nelson, Wheaton College, Norton, Mass.

The DNA of every organism is composed of millions of base pairs, chemical units that are abbreviated as A, C, T or G. The pattern of these base pairs provides the "recipe" for the development of the organism. The discipline of bioinformatics uses computational approaches to analyze the information encoded in these very long strings of A, C, T and G. Through this kind of analysis, researchers are able to find evidence of organismal history and of the relationships between different organisms. For example, researchers can demonstrate descent through shared transcription errors (even though we have not yet fully deciphered genomic codes).

"Lexomics" is a term originally coined to describe the computer-assisted detection and explanation of patterns in genomes, patterns that without accurate, swift computational and mathematical tools would be undetectable (Dyer, LeBlanc, et al., 2004). Under the auspices of an NEH Digital Humanities Initiative start-up grant, [1] we have developed software tools and analytical methods that allow us to apply lexomic methods to the corpus of Old English, opening an entirely new channel of information about relationships among Anglo-Saxon texts. Our software is available for free download at http://lexomics.wheatoncollege.edu.

Our research began when it became clear to us that investigators in bioinformatics were doing work very similar to that done by textual scholars (cf. Dyer, Kahn, LeBlanc, 2007). Through a teaching collaboration (a "Wheaton connection" called Computing with Texts, which linked Drout's Anglo-Saxon Literature with LeBlanc's Computing for Poets course) we discovered that the same sorts of phenomena that interest literary researchers—shared error, palindromes, stem diagrams—were of great interest to bioinformatics researchers. The "lexomics" approach developed by Wheaton's bioinformatics research group turned out to be directly applicable to Old English texts, and a few promising student projects demonstrated that we could profitably adapt the techniques of bioinformatics to literary texts.

The lexomic approach, when applied to texts, looks at patterns in individual word use, with a word being defined as a group of letters bounded by white space. We recognize that this working definition of a word can be problematic in linguistic and philological terms, but our approach has the enormous benefit of making a "word" something simple enough to implement in a computer's language. As far as the lexomics approach is concerned, the word "cyning" is a separate word from "kyning" or "cyninge" or "cyningas." Obviously then, orthographic variants, dialect variations, inflectional endings and strong verbs are all 'read' as different words. This approach goes against the grain of both Anglo-Saxon studies and much work in contemporary linguistics, but it has the enormous advantage of completely avoiding time-consuming and potentially tendentious lemmatizing or other kinds of mark-up, thus allowing the computer to do what it does best—count things—and saving the analysis for the trained minds of the researchers. Furthermore, dialect variants, strong verbs, morphological differences, and inflections are all data (although how relevant they are as data is an area of great controversy in Anglo-Saxon studies), and by not merging any of these categories together, we have allowed subsequent researchers to make their own decisions about the relevance of the difference between "cyning" and "kyning."
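
This whitespace-bounded definition is simple enough to show in a few lines of Perl, the language of our downloadable scripts. The following is a minimal illustrative sketch, not the code behind the website:

    #!/usr/bin/perl
    # Minimal sketch of lexomic tokenization: a "word" is any run of
    # characters bounded by white space, counted with no lemmatization
    # or mark-up, so "cyning" and "kyning" remain distinct keys.
    use strict;
    use warnings;

    my %count;
    while (my $line = <>) {
        for my $word (grep { length } split /\s+/, $line) {
            $count{$word}++;
        }
    }
    for my $word (sort keys %count) {
        print "$word\t$count{$word}\n";
    }

Run over a text file, the script prints each distinct spelling with its count, so "cyninge" and "cyningas" appear as separate entries, exactly as described above.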

The lexomics approach is useful for researchers in Anglo-Saxon in several ways. First, the web-based tools we describe below make it easy for researchers to find rare and interesting words that are shared by only a few texts, potentially illuminating relationships that have not yet been noted. Second, because the tools produce rich data, researchers can apply sophisticated statistical techniques, such as cluster analysis and classification methods, to extract patterns that are otherwise too subtle to be noticed by the unaided mind. Finally, because the programs are open-source and freely downloadable, researchers can use them on texts outside the Old English corpus; in fact, any electronic text can be examined with our tools, with only minimal effort required to modify the programs to fit the corpus being used. Our own research has been enormously enriched by these tools, which we have found can be used to distinguish between individual sections of texts, to shed light on textual relationships (and, perhaps, authorship), and to guide us towards words and phrases that will reward further study. By offering these tools to the community of Anglo-Saxonists, we hope to encourage additional researchers to use them and to continue the project of reconstructing the tattered web of relationships among Old English texts.

 

Lexomics and the Dictionary of Old English Corpus

Although our lexomics software can be applied to any electronic textual corpus, currently the website is linked to the Toronto Dictionary of Old English (DOE) corpus. [2] Access to the DOE corpus requires either a subscription (institutional or individual) or a digital copy of the corpus (which can be purchased from the DOE), but researchers who do not have access to the Corpus can still use our software to generate reports and perform analysis on the texts in the corpus. Because it is based upon the standard editions of the various texts, as selected by its editors, the DOE corpus is not a precise replica of any existing manuscript (manuscripts include errors, variants, and illegible sections) but is instead a compilation of editorial judgments, some of which are disputed. Researchers who purchase the electronic corpus from the DOE can make any modifications they believe necessary and run our programs on those modified files, or they can use their own edited files for analysis. However, the reports generated at the lexomics.wheatoncollege.edu website are based upon the unmodified DOE corpus.

Texts in the DOE corpus are classified into several categories, indicated by the letter at the beginning of the file name: Poetry (file names beginning in A), Prose (B), Glosses (C), Glossaries (D), Runic Inscriptions (E), Non-runic Inscriptions (F), and charters, laws and liturgical materials. Each text in the DOE corpus is given a Cameron Number, which indicates the manuscript and the text within that manuscript. For example, the poem The Fortunes of Men is the twelfth text in MS Exeter, Cathedral Library, 3501 ("The Exeter Book") and is therefore given the Cameron number A3.012 (A3 for the Exeter Book, 012 because "Fortunes" is the twelfth poem). Thus, analysis focused strictly on poetry would use only "A" texts, that focused on prose only "B" texts, and that which looks at both poetry and prose, "A" and "B" texts. At this time, the tools on our website can be used to analyze both poetry and prose; other types of texts will be added later to the web-based application (and researchers can add these texts on their own using the downloadable programs).

Texts in the Old English corpus also contain words in other languages (primarily Latin) and corrections made to the manuscripts by contemporary or later hands (i.e., not by modern editors). Both are tagged as such in the corpus, and our software allows such words either to be included in or excluded from statistical analyses with the click of a radio button.

Two special letters and one standard abbreviation create potential problems for researchers using the corpus. Because the þ (thorn) and ð (eth) are generally interchangeable for the voiced and unvoiced interdental fricative, for some purposes it may be useful not to distinguish between the two. For example, a listing of the most common words in any given text is likely to have both þæt and ðæt in the top 20, but the combined totals for þæt and ðæt would rank higher still. On the other hand, some researchers may want to distinguish between orthographic variants of different words and would want to count þæt separately from ðæt. To solve this problem, we have enabled, but not forced, the consolidation of thorn and eth into one letter, thorn, so that the orthographic variants are not counted separately. Researchers can switch between consolidation and separation by clicking one button.

The words and and ond and the abbreviation 7 (the Tironian nota) present similar problems. For some purposes they are meaningless orthographic variants; for others, indicators of dialect or date. The Tironian nota could be expanded as either and or ond, depending upon the dialect of the scribe, but because scribes were not completely consistent, and because dialect is often disputed, choosing in which form to expand the abbreviation is problematic. Our software deals with this problem by allowing and, ond and 7 to be consolidated into one word if a researcher desires. Figure 1 shows a user-selected set of options for a particular search.


Figure 1. Screen layout showing user-selected options prior to searching, including keeping or discarding tagged words and/or word consolidations. In this case, the user has chosen to discard any tagged words, to convert each eth to thorn, and to treat and, ond, and Tironian nota as identical.
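
Both consolidation options in Figure 1 can be sketched as a small Perl subroutine; this is an assumed implementation for illustration, not the website's actual code:

    #!/usr/bin/perl
    # Sketch of the two optional consolidations: eth folded into thorn,
    # and "ond" and the Tironian nota (typed here as "7") folded into
    # "and". Assumed implementation, for illustration only.
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    sub consolidate {
        my ($word, $merge_thorn, $merge_and) = @_;
        $word =~ tr/ðÐ/þÞ/ if $merge_thorn;                   # ðæt -> þæt
        $word = 'and' if $merge_and && $word =~ /^(?:ond|7)$/; # ond, 7 -> and
        return $word;
    }

    # Demo with both consolidations switched on, as in Figure 1.
    print consolidate($_, 1, 1), "\n" for qw(ðæt þæt ond 7 cyning);

With both options on, ðæt is counted together with þæt, and ond and 7 are counted together with and; with both off, every spelling remains a separate word.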

 

Tools for Analysis

To do lexomics-based analyses, researchers need to be able to sort and tabulate the counts of words in a given text or group of texts. Our web-based tools enable this sorting and tabulation, either for the corpus as a whole or for selected texts within the corpus. Researchers can view the mean, minimum, median, maximum and standard deviation of the counts of each word across the entire corpus, the poetic corpus, the prose corpus, or any specific text or group of texts. These data are stored as Excel spreadsheets, which can then be downloaded, modified and further analyzed.
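
For readers who wish to recompute these summaries themselves, the statistics reduce to a short routine. The following Perl sketch (illustrative only, not the website's internals) derives them from a list of one word's counts across a group of texts:

    #!/usr/bin/perl
    # Sketch of the per-word summary statistics described above: given
    # one word's counts across several texts, report mean, minimum,
    # median, maximum, and (sample) standard deviation.
    use strict;
    use warnings;
    use List::Util qw(sum);

    sub summarize {
        my @c = sort { $a <=> $b } @_;
        my $n = @c;
        my $mean = sum(@c) / $n;
        my $median = $n % 2 ? $c[ int($n / 2) ]
                            : ( $c[ $n / 2 - 1 ] + $c[ $n / 2 ] ) / 2;
        my $sd = sqrt( sum( map { ($_ - $mean)**2 } @c ) / ($n - 1) );
        return ( $mean, $c[0], $median, $c[-1], $sd );
    }

    # e.g., the counts of one word in five hypothetical texts
    printf "mean %.2f  min %d  median %.1f  max %d  sd %.2f\n",
        summarize(3, 0, 7, 2, 5);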

After researchers have chosen a text or group of texts to analyze and have decided which consolidations to use, they can select the number of words to examine, requesting the top N words in the texts (N can be any number from 1 to the total number of words in the texts). For example, if we select Poetry, and then under Poetry, the Exeter Book (A3), and then the poem "The Fortunes of Men" (A03.012), the program will calculate the following:

Total Words: 536

Unique Words: 364

Words that only show up once: 316

Total Words is simply the number of individual words in the text. Unique Words is the total number of different words (i.e., if "and" appears three times, it is counted three times in the Total Words category but only once in Unique Words). Words that only show up once is a self-explanatory category. The program then follows with the complete list of all words, which can be sorted either by frequency or alphabetically. Figure 2 shows the top ten most frequent words in "The Fortunes of Men" when both consolidations are in effect and tagged words are discarded.


Figure 2. The top-10 most frequent words in "The Fortunes of Men" when both consolidations are in effect and tagged words are discarded. The category "in text" merely displays the word according to the DOE mark-up conventions.

The table presents the word both in readable form (Word) and in the marked-up form as it is found in the electronic corpus (In Text). Each word appears as a link to the DOE website, so clicking on an individual word will automatically run a concordance search on that word (if the individual or institution has a site-license for the DOE). The Count column shows how many times the individual word appears, and the Proportion column shows the frequency of the word in relation to the other words in the text. So, for example, "him" appears 7 times in the 536-word poem, slightly more than 1% of the entire text. These results are presented as a table on the web, and they can, with one click, be downloaded as an Excel spreadsheet which will then allow the researcher to perform additional statistical analysis.
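
The whole report can be sketched in Perl as well, using the same whitespace definition of a word as above and computing each proportion as count divided by total (so "him" is 7/536, about 0.013). This reproduces the form of the output, not the website's internals:

    #!/usr/bin/perl
    # Sketch of the summary report: total words, unique words, hapax
    # legomena, and the top-10 list with proportions. Illustrative only.
    use strict;
    use warnings;

    my (%count, $total);
    while (my $line = <>) {
        for my $w (grep { length } split /\s+/, $line) {
            $count{$w}++;
            $total++;
        }
    }
    my $unique = keys %count;
    my $hapax  = grep { $count{$_} == 1 } keys %count;

    print "Total Words: $total\n";
    print "Unique Words: $unique\n";
    print "Words that only show up once: $hapax\n\n";

    # Top ten, most frequent first; ties broken alphabetically.
    my @by_freq = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    my $top = @by_freq < 10 ? $#by_freq : 9;
    for my $w (@by_freq[0 .. $top]) {
        printf "%-12s %5d  %.4f\n", $w, $count{$w}, $count{$w} / $total;
    }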

 

Virtual Manuscript

One of the most exciting aspects of our suite of software and tools is the ability to create what we are calling a "Virtual Manuscript," a user-selected set of electronic files that "binds" any combination of Anglo-Saxon texts, poetry and prose, into a single file. Until now, it has not been possible to examine arbitrary combinations of texts both together, as a unit, and separately from the larger corpus. With this tool, researchers can create groupings of poetry, prose, or mixed poetry and prose, allowing them to represent an actual manuscript or to assemble one of their own. Researchers can create virtual manuscripts that contain all putatively "old" texts, or "new" texts, or "Winchester vocabulary" texts, or any other grouping, and then generate statistics on these collections of texts.

For example, the manuscript London, British Museum, Cotton Tiberius B.i. is made up of four texts: King Alfred's translation of Orosius, the Menologium, Maxims II, and the C-text of the Anglo-Saxon Chronicle. Using our "virtual manuscript" tool, a researcher can perform statistical analyses on the entire manuscript even though the individual texts are separate in the DOE corpus. After the researcher has clicked the appropriate buttons for the Cameron numbers of the different texts in the manuscript (see Figure 3), the program is able to analyze Tiberius B.i.


Figure 3. A user selecting texts to build the manuscript London, British Museum, Cotton Tiberius B.i.

A partial listing of results is shown in Figure 4.

A14_Men_T01540 + B09.002.002_Or_1_T06590 + B09.002.003_Or_2_T06600 + B09.002.004_Or_3_T06610 + B09.002.005_Or_4_T06620 + B09.002.006_Or_5_T06630 + B09.002.007_Or_6_T06640 + B09.002.001_OrHead_T06580 + A15_Max_II_T01550 + B17.007_ChronC__Rositzke__T22040

Total Words: 75249

Unique Words: 10088

Words that only show up once: 5926


Figure 4. A partial listing of word count results when using the Virtual Manuscript tool to construct the manuscript London, British Museum, Cotton Tiberius B.i.

The full table is over 10,000 words long. Above, we show a sample of some of the most frequent and least frequent words, as well as a selection of words in the middle range of frequencies. Note that words that have the same frequency count are alphabetized within that count, explaining the great number of "w" words in this particular excerpt. In the full table, these are preceded by groups of other words with a frequency count of 2. Researchers can use the full table to attempt to identify rare or anomalous words or to identify unusual patterns of word use, for example, by comparing the frequency profile of Tiberius B.i. with other manuscripts. Using the data generated on various "virtual manuscripts," it may be possible to ferret out the tendencies of scribes, authors, schools or scriptoria by noting differences between different manuscripts, and further research may be able to give us new diagnostic tests for provenances or dates of manuscripts and archetypes. An important point behind our design of the Virtual Manuscript tool is not to presume what a user will want and limit possibilities but, instead, to facilitate the investigation of user-centered questions that we have not anticipated.
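
In outline, a virtual manuscript is nothing more than a set of files counted as one. The following Perl sketch shows the idea; the .txt file names are hypothetical stand-ins for the DOE files listed above:

    #!/usr/bin/perl
    # Sketch of the "Virtual Manuscript" idea: concatenate a chosen set
    # of corpus files into one collection before counting. File names
    # below are hypothetical stand-ins for the DOE files.
    use strict;
    use warnings;

    my @virtual_ms = qw(A14_Men.txt B09_Or.txt A15_MaxII.txt B17_ChronC.txt);
    my (%count, $total);
    for my $file (@virtual_ms) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            for my $w (grep { length } split /\s+/, $line) {
                $count{$w}++;
                $total++;
            }
        }
        close $fh;
    }
    printf "Total Words: %d\nUnique Words: %d\n", $total, scalar keys %count;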

 

Uses of Results

The word counts, lists and frequencies are interesting in themselves, potentially shedding new light on tendencies within and among Anglo-Saxon texts, but they can also be employed as shortcuts to or augmentations of traditional analyses. For example, different types of texts may be characterized by the frequency with which certain common words appear (some of our preliminary research suggests that the frequency of the word "and" in a text can be a significant clue as to whether that text is poetry or prose). Differences between texts in the appearance of high-frequency words may also potentially be used as a guide to attribution or affiliation.

Rare words are even easier to work with. For example, it is a very simple matter to list the entire corpus in order of word frequency, scroll down to the very long section of words that appear only twice in the corpus, and then start clicking through to the DOE to see in which texts these words appear. Searching these pairs of words for interesting coincidences can suggest further research on the inter-relations of texts or the use of rare words, and such work may allow some enterprising researcher to discover phenomena similar to the "Winchester vocabulary." In fact, Walter Hofstetter's painstaking work in delineating the Winchester vocabulary would have been much less labor-intensive if our tools had been available in the mid-1980s (Hofstetter 1987, 1988). Likewise, it is easy to create a complete list of hapax legomena in any given context (the entire corpus, any group of texts, etc.) and examine these to see how many are merely orthographic variants, how many are likely to be spelling errors, and how many are potentially significant. And although the examination of hapax legomena and dislegomena is both obvious and easy, even more information may be found in the words in the middle of the range, neither rare nor overly common.
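
A dislegomena survey of this kind can likewise be sketched in a few lines of Perl; the file handling and names here are hypothetical, for illustration only:

    #!/usr/bin/perl
    # Sketch of the rare-word survey described above: list every word
    # occurring exactly twice across a set of files, with the files
    # that contain it. Usage (hypothetical): perl pairs.pl A*.txt
    use strict;
    use warnings;

    my (%count, %texts);
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            for my $w (grep { length } split /\s+/, $line) {
                $count{$w}++;
                $texts{$w}{$file} = 1;
            }
        }
        close $fh;
    }
    # Pairs shared between two *different* texts are the interesting ones.
    for my $w (sort grep { $count{$_} == 2 } keys %count) {
        print "$w: ", join(', ', sort keys %{ $texts{$w} }), "\n";
    }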

The "Virtual manuscript" tool also provides researchers with the ability to treat actual Anglo-Saxon manuscripts as wholes or to put together and characterize the work of multiple scribes across different manuscripts. For example, a researcher can create two different virtual manuscripts, each representing a different hypothesized center of production, and then compare the two virtual manuscripts not only to each other, but to other texts that might possibly be affiliated with either one. Researchers could then use shared rare words, or percentages of high-frequency words, or some combination of different metrics to support arguments for or against different affiliations.

Because the data compiled by the tools can be downloaded as Excel spreadsheets, more sophisticated mathematical and statistical analyses can also be employed to examine relationships. We use the statistical package R (2008) to examine each word's frequency relative to the total number of words in a text (or subset of a text). Our work using R produces dendrograms, tree-like diagrams of the relationships among texts based upon many small measurements of similarity and difference (i.e., shared and unique words). Our methods have thus far been able to accurately identify, without mark-up or lemmatization, the part of Daniel which is similar to Azarias, and to distinguish Genesis A from Genesis B. The details of this research are forthcoming in JEGP, [3] but, briefly, we divided Daniel into ten "chunks" and then used cluster analysis to determine which chunks of Daniel are most like Azarias. The chunk of Daniel that was identified as being most similar to Azarias consisted of words 1801 through 2250, almost exactly the lines of Daniel (279-361) that are paralleled in Azarias (see Figure 5). Similarly, we divided Genesis into eleven chunks, which we then tested for similarity against each other and, for the purpose of having outside comparanda, against chunks of Juliana, Beowulf, and Andreas. Our analysis accurately identified Genesis B, grouping the chunks that contain these lines separately from the rest of the poem.


Figure 5. Dendrogram showing the 1064-word poem Azarias (Az) clustered with ten non-overlapping 450-word "chunks" of the poem Daniel (the last chunk, Dan10, is 421 words). As expected, chunk Dan5 (words 1801-2250, or lines 279-361) clusters most closely with Azarias.
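
The chunking step behind Figure 5 can be sketched as follows; the chunk size and labels come from the description above, while the CSV layout is our illustrative assumption, with the clustering itself left to R (e.g., its dist() and hclust() functions):

    #!/usr/bin/perl
    # Sketch: split a poem into non-overlapping 450-word chunks and
    # write each chunk's relative word frequencies as one CSV row,
    # ready for cluster analysis in R. Illustrative only.
    use strict;
    use warnings;

    my @words = grep { length } split /\s+/, do { local $/; <> };
    my $size  = 450;
    my @chunks;
    push @chunks, [ splice @words, 0, $size ] while @words;

    # Columns: every word occurring anywhere in the text.
    my %vocab;
    $vocab{$_} = 1 for map { @$_ } @chunks;
    my @cols = sort keys %vocab;

    print join(',', 'chunk', @cols), "\n";
    for my $i (0 .. $#chunks) {
        my %freq;
        $freq{$_}++ for @{ $chunks[$i] };
        my $n = @{ $chunks[$i] };
        print join(',', 'Dan' . ($i + 1),
            map { sprintf('%.6f', ($freq{$_} // 0) / $n) } @cols), "\n";
    }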

These preliminary experiments, in which our methods identify known relationships between texts and known divisions within texts, serve as controls: evidence that our lexomics approach can identify similarities and differences correctly, without lemmatization or parsing. Future research will examine divisions within poems, as well as relationships between poems. It is our hope that additional researchers will modify and refine our methods, using the tools to test different hypotheses about the Old English corpus.

 

The Software: Free, Open-Source, Easily Modified

In support of, yet independent from, the web-based tools is a group of well-documented Perl scripts that organize files by predefined genres (e.g., poetry vs. prose), determine the relative frequency of each word in a file, and prepare statistical summaries of word frequency profiles for each individual text, all texts within each manuscript, and all texts, individually and jointly, within the poetry and prose classes. Researchers are free to modify these programs to perform analyses that are not at present supported by the website (LeBlanc and Dyer, 2007). For example, a comparison of Anglo-Saxon with Middle English or Latin texts might prove fruitful, and there is also no reason why the software could not be adapted for use with Modern English texts (or, for that matter, texts in any language). The software is freely available from our website by following the "Software" link: http://lexomics.wheatoncollege.edu/software.

 

Some Conclusions

The lexomics tools we have developed are valuable for a variety of reasons. First, researchers can use the frequency profiles of texts to home in on important, rare, or surprisingly frequent words. The tools are not a replacement for philological knowledge or literary intuition: they are a prompt for additional research, allowing a scholar to note patterns that might otherwise be too subtle for the unaided eye and mind. Second, researchers can, through the "Virtual Manuscript" tool, analyze the word frequency patterns of multiple texts in ways that have not previously been possible. Finally, the lexomics approach has been able to correctly identify the similarity between Azarias and the relevant passage in Daniel, and it was able to identify Genesis B within the complete poem of Genesis. We are now ready to examine more disputed relationships in the Anglo-Saxon corpus, including the structure of Christ, the canon of Cynewulfian poems, and the relationships between the poems of the major codices. Moving on to the prose, we hope to use the tools discussed above to further examine the "Winchester vocabulary" texts, seeing if we can find additional diagnostic "Winchester words" and determine additional connections between texts.

Lexomics and other computer-assisted approaches will not find definitive answers to age-old questions, but they powerfully augment the intuition and training of scholars. The software and the approaches it enables will continue to evolve as more users find ways to apply them to different texts, and it is our hope that other researchers will expand the tools far beyond what we have imagined, thus increasing our knowledge of Anglo-Saxon texts and further extending the capabilities of the human mind.

Website: http://lexomics.wheatoncollege.edu

 

Works Cited

Dyer, B.D., LeBlanc, M.D., Benz, S., Cahalan, P., Donorfio, B., Sagui, P., Villa, A., and Williams, G. (2004). A DNA motif lexicon: cataloguing and annotating sequences. In Silico Biology, v4, 0039. http://www.bioinfo.de/isb/2004/04/0039/

Dyer, B.D., Kahn, M.J., and LeBlanc, M.D. (2007). Classification and regression tree (CART) analyses of genomic signatures reveal sets of tetramers that discriminate temperature optima of archaea and bacteria. Archaea 2:159–167.

Hofstetter, Walter (1987). Winchester und der spätaltenglische Sprachgebrauch. Munich.

Hofstetter, Walter. "Winchester and the Standardization of Old English Vocabulary." Anglo-Saxon England 17 (1988): 139-61.

LeBlanc, M.D. and Dyer, B.D. (2007). Perl for Exploring DNA. Oxford University Press.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. http://www.R-project.org, ISBN: 3-900051-07-0.

 

Notes

[1] "Pattern Recognition through Computational Stylistics: Old English and Beyond." NEH HD-50300-08.

[2] http://ets.umdl.umich.edu/o/oec

[3] Drout, M.D.C., Michael J. Kahn, Mark D. LeBlanc and Christina Nelson. "Of Dendrogrammatology: Lexomic Methods for Analyzing the Relationships Among Old English Poems," JEGP (forthcoming in 2010).