Determining Authorship of 1 Henry VI: Bayesian Analysis of the Works of Shakespeare
John M. Cox
Department of English
University of Michigan
April 2004
(NB: This article is part two in a series which attempts to bring technological methods to bear on problems which preoccupy literary scholars. Part one deals with the difficulties inherent in such a process.)
- Abstract
- Introduction
- Bayes' Theorem
- The Texts
- nbc: A Naïve Bayes Classifier
- Results
- Conclusions
- Citations
- Appendix A: The nbc Code
- Notes
1. Abstract
Bayesian analysis is a simple yet powerful technique for classifying texts. Based on the work of the 18th-century logician Thomas Bayes and popularized recently by information technology specialists as a method for classifying spam, Bayes' theorem can also be applied in literary studies as an aid for determining authorship. In this paper, we apply Bayes' theorem to the works of Shakespeare and his contemporaries in an effort to obtain statistical evidence for the authorship of The First Part of Henry the Sixth, which some scholars doubt was written by Shakespeare. By doing so, we get an answer to the question: what is the probability that Shakespeare wrote 1 Henry VI?
2. Introduction
Literary scholars have not traditionally used data mining or techniques of statistical inference. This is partly due to their training, which does not emphasize mathematics or computing, and partly due to a reluctance to entrust the task of reading or interpretation to computers, which are notoriously poor with natural languages.
This reluctance is well founded. Computers are, in fact, terrible with human languages. The strength of the computer, which was created during World War II as a tool for decrypting Axis messages, is its ability to do mathematics rapidly and without error. This causes a large problem for anyone interested in using computers with human languages: how can language, with all its eccentricities and variety, be analyzed and operated upon mathematically?
Neal Stephenson's Cryptonomicon, set during World War II, examines this problem in detail. The first computers interacted with language on a letter-by-letter basis, with a number assigned to each character in the Roman alphabet (1 for A, 2 for B, and so on) [1]. Translating letters into numbers allowed World War II technologists to bring their entire range of mathematical tools to bear on the texts of communiques, which in turn allowed them to decipher secret German and Japanese messages, and to keep Allied messages secret. This problem is an easy one, however, because it does not deal with language as such: computers during World War II were capable of doing mathematics on individual characters, but they were completely incapable of distinguishing a sentence of natural language from a string of random alphabetical symbols. The earliest computers were quite able to recognize and process numbers, but they were hopelessly unable to tell the difference between a noun and a verb.
In some respects things are not so different today. Though researchers have made progress in the field of natural language processing, computers still do not have what most literary scholars would consider any great facility with human writing or speech [2]. Fortunately for this analysis, however, computers are a highly useful tool for literary analysis even without the ability to "read" in the conventional sense. Bayesian analysis is one example of a technique that yields highly useful results but does not require the computer to be able to parse a sentence, identify parts of speech, or extract what we think of as "meaning" from a given text. It has more in common with World War II techniques (that is, somehow translating natural language into numbers and thereby opening a text up to mathematical approaches) than it does with any type of human reading.
3. Bayes' Theorem
Bayes' theorem is a method for inverting conditional probability. This is useful when we can determine the likelihood of event A given event B, but what we really want is the likelihood of event B given event A. The theorem states that:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
Or, in plain English, Bayes' theorem tells us that the probability of B given A is equal to the probability of A given B multiplied by the probability of B, with that quantity divided by the probability of A. This gives us probabilities for one occurrence of A and B.
Often, however, we are interested in making decisions based on recurring events. In that case, we use the more elaborate version of the theorem:

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_j P(A \mid B_j)\,P(B_j)}$$
In plain English, this version of Bayes' theorem gives us a probability for a certain B with respect to A given all instances of B. It is an extension of the above version, which is useful for analyzing only a single instance.
To illustrate this, let us look at one conventional application of Bayes' theorem: determining if an email is spam or not. For the purposes of this example, let us assume that we have received many email messages to date (called "all instances of B" in the above paragraph), and we know from past experience that 70% of what we receive is spam and 30% of what we receive is not. Therefore, P(B1) = .7 and P(B2) = .3. Let A be the event that a given message contains the word "free." From past experience, we know that 90% of our spam messages contain the word "free" while 2% of our non-spam messages also contain the word "free." Therefore, P(A|B1) = .9 and P(A|B2) = .02. Now assume that we receive a message containing the word "free." What is the probability that this specific message (the "certain B" mentioned in the above paragraph) is spam? In other words, what is the value for P(B1|A)? Applying the theorem:

$$P(B_1 \mid A) = \frac{P(A \mid B_1)\,P(B_1)}{P(A \mid B_1)\,P(B_1) + P(A \mid B_2)\,P(B_2)} = \frac{(.9)(.7)}{(.9)(.7) + (.02)(.3)} = \frac{.63}{.636} \approx .99$$
We can then be 99% sure that an email containing the word "free" is spam.
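The arithmetic of this example can be checked with a few lines of code. The following is a minimal sketch in Python (not the Perl used for nbc, and not part of nbc itself); the function name `posterior` and the dictionary keys are illustrative assumptions.

```python
def posterior(prior, likelihood):
    """Return P(B_i | A) for every hypothesis B_i using Bayes' theorem.

    prior[h]      is P(B_i), the probability of hypothesis h
    likelihood[h] is P(A | B_i), the probability of the evidence under h
    """
    # Numerator of the theorem for each hypothesis: P(A | B_i) * P(B_i)
    joint = {h: prior[h] * likelihood[h] for h in prior}
    # Denominator: the sum of the numerators over all hypotheses
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Figures from the spam example: 70% of mail is spam, and the word "free"
# appears in 90% of spam and 2% of legitimate mail.
prior = {"spam": 0.7, "not_spam": 0.3}
likelihood = {"spam": 0.9, "not_spam": 0.02}

result = posterior(prior, likelihood)
print(round(result["spam"], 4))  # 0.9906
```

The two posteriors necessarily sum to 1, since the message must be either spam or not.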
Using the mathematical power of computers, we can do a similar analysis, this time getting a probability for every word in each incoming email based on word frequencies for all messages we have received to date. Doing this analysis for every word increases the chance that we will catch all incoming spam, while simultaneously decreasing the chance that messages we want to read will be incorrectly identified.
For example, the general population rarely wants to read emails containing the word "penis." A urologist, however, is more likely to receive legitimate mail containing that word. This would pose a problem for the urologist if we were only looking at the word "penis." However, if we are looking at all the words in the email, we would start to notice that messages containing "penis" are more likely to be good if they contain the words "vas" and "deferens" and more likely to be spam if they contain the words "free" and "teen." When we analyze prior mail in this way, word by word, we say we are "training" our Bayesian filter.
Paul Graham, a noted programmer and Bayesian spam filter enthusiast, reports that carefully written and trained filters catch up to 99.87% of incoming spam messages with no false positives. He also notes that his own catch rate is 99.84%, meaning that Bayesian filters have the potential to be better than humans at classifying spam [3].
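The word-by-word training described above can be sketched in a few lines. This is a toy illustration in Python under stated assumptions: the four training messages are invented, the add-one smoothing is one common choice among several, and none of it is taken from nbc or from any particular spam filter.

```python
import math
from collections import Counter


def train(labelled_docs):
    """Count words per category and documents per category."""
    word_counts = {}
    doc_totals = Counter()
    for label, text in labelled_docs:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_totals[label] += 1
    return word_counts, doc_totals


def classify(text, word_counts, doc_totals):
    """Return a probability for each category, treating words as independent."""
    vocab = set().union(*word_counts.values())
    n_docs = sum(doc_totals.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Start from the prior: the share of training documents in this category
        score = math.log(doc_totals[label] / n_docs)
        for word in text.lower().split():
            # Add-one smoothing keeps an unseen word from zeroing the product
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Convert log scores back to probabilities that sum to 1
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


# Invented training mail, for illustration only
training = [
    ("spam", "free teen pics free offer"),
    ("spam", "free penis enlargement offer"),
    ("good", "patient vas deferens penis exam results"),
    ("good", "vas deferens surgery schedule"),
]
word_counts, doc_totals = train(training)

urology = classify("penis vas deferens", word_counts, doc_totals)
junk = classify("free teen penis", word_counts, doc_totals)
print(urology["good"] > urology["spam"], junk["spam"] > junk["good"])  # True True
```

Both test messages contain "penis," yet the surrounding words pull them into different categories, which is the behaviour the urologist needs.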
4. The Texts
We need three texts for this analysis:
- A body of dramatic texts we know were authored by Shakespeare. This is analogous to the body of texts above that we know are spam.
- A body of dramatic texts that we know were not authored by Shakespeare, but are otherwise similar. This is analogous to the body of texts above that we know are not spam.
- The text of 1 Henry VI. This is analogous to the message above that we wanted to classify as spam or not spam.
Texts were obtained from Project Gutenberg, as well as numerous authoritative online repositories. All texts were reviewed and edited by literary scholars. They are similar in quality to what you would find in the Riverside or other established anthologies. For non-Shakespeare texts, I selected works by Dekker, Middleton, Kyd, Jonson, and Marlowe. Availability of good text-only Early Modern drama is limited, so some canonical plays such as The Shoemaker's Holiday have not yet been included. These authors were grouped together in one category rather than receiving categories of their own in order to keep the sample sizes consistent. More Shakespeare survives than Kyd, for example, and in order to train our classifier properly the sample size for non-Shakespeare works needed to be the same as the sample size for Shakespeare.
For each text, I removed all apparatus (footnotes, hyperlinks, introductions, lists of dramatis personæ, and so on), leaving only the text of the play proper. Invocations, prologues, and other material the author intended for the audience to weigh along with the text itself were preserved. Each play was marked with its author(s), and categorized either as "Shakespeare" or "Other."
5. nbc: A Naïve Bayes Classifier
The computer program that does the classification is called nbc. I opted to author the program in a language called Perl, which is considered something of a mongrel by programmers but is extremely well suited to text manipulation [4]. nbc is what statisticians call a "Naïve Bayes Classifier," meaning that it implements Bayes' theorem as described in section 3. Code for nbc can be found in Appendix A. nbc is far from optimal for the following reasons:
- Perl is interpreted rather than compiled. This means that it executes more slowly than languages like C.
- The algorithm for corpus generation is inefficient. See the code for details.
- It runs only on UNIX- or Linux-based machines. It will not run on Windows computers because it expects Unix path name conventions.
In spite of these problems, however, nbc is a solid implementation of a Naïve Bayes Classifier. I mention these issues only to point out that the code should be considered a prototype rather than a finished product. nbc gives good results, but it is extremely machine-intensive and on a desktop computer takes hours to process the amount of data necessary for this analysis.
nbc makes use of the Perl module Algorithm::NaiveBayes, an implementation of Bayesian analysis [5].
6. Results
nbc analyzes each play and determines probabilities for two categories:
- Shakespeare
- Other
Probability is expressed as a number between 0 and 1. The higher the number, the higher the probability. A result of 1 for a category means that there is a 100% probability the text belongs to that category. Similarly, a result of 0 means there is a 0% probability. Alternatively, a result of 0 can be understood as a 100% probability that the text does not belong to the given category. Finally, a result of 0.5 indicates a 50% probability, which can be understood as complete uncertainty.
As we can see from the following table, nbc assigned a value of 1 to 1 Henry VI, meaning that Bayesian analysis yields a 100% probability the play was written by Shakespeare. In practice, however, we should not assume this probability indicates certainty. See section 7 below.
Figure 1: Results by play (entire corpus)
| Title | Other | Shakespeare |
|---|---|---|
| 1 Henry IV | 0 | 1 |
| 1 Henry VI | 0 | 1 |
| 2 Henry IV | 0 | 1 |
| 2 Henry VI | 0 | 1 |
| 3 Henry VI | 0 | 1 |
| All's Well That Ends Well | 0 | 1 |
| Antony and Cleopatra | 0 | 1 |
| As You Like It | 1.66E-36 | 1 |
| Comedy of Errors | 0 | 1 |
| Coriolanus | 0 | 1 |
| Cymbeline | 0 | 1 |
| Hamlet | 0 | 1 |
| Julius Caesar | 0 | 1 |
| Henry V | 0 | 1 |
| Henry VIII | 0 | 1 |
| King John | 0 | 1 |
| King Lear | 0 | 1 |
| Richard II | 0 | 1 |
| Richard III | 0 | 1 |
| Love's Labour's Lost | 0 | 1 |
| Macbeth | 0 | 1 |
| Massacre at Paris | 0 | 1 |
| Measure for Measure | 0 | 1 |
| The Merchant of Venice | 0 | 1 |
| The Merry Wives of Windsor | 0 | 1 |
| A Midsummer Night's Dream | 0 | 1 |
| Much Ado about Nothing | 0 | 1 |
| Othello | 0 | 1 |
| Pericles, Prince of Tyre | 0 | 1 |
| Romeo and Juliet | 0 | 1 |
| Sejanus | 2.83E-158 | 1 |
| The Taming of the Shrew | 0 | 1 |
| The Tempest | 0 | 1 |
| The Duchess of Malfi | 0 | 1 |
| The Jew of Malta | 4.17E-37 | 1 |
| The Spanish Tragedy | 0 | 1 |
| Timon of Athens | 0 | 1 |
| Titus Andronicus | 0 | 1 |
| Troilus and Cressida | 0 | 1 |
| Two Gentlemen of Verona | 0 | 1 |
| The Winter's Tale | 0 | 1 |
| Dr. Faustus | 1 | 5.67E-168 |
| The Bloody Banquet | 1 | 5.84E-305 |
| A Chaste Maid in Cheapside | 1 | 0 |
| A Fair Quarrel | 1 | 0 |
| Anything for a Quiet Life | 1 | 0 |
| A Trick to Catch the Old One | 1 | 0 |
| Blurt, Master Constable | 1 | 0 |
| Cynthia's Revels | 1 | 0 |
| Eastward, Ho! | 1 | 0 |
| Epicoene | 1 | 0 |
| Every Man in His Humor | 1 | 0 |
| Every Man out of His Humor | 1 | 0 |
| Hengist, King of Kent | 1 | 0 |
| No Wit, No Help Like a Woman's | 1 | 0 |
| Tamburlaine, Part I | 1 | 0 |
| Tamburlaine, Part II | 1 | 0 |
| The Alchemist | 1 | 0 |
| The Changeling | 1 | 0 |
| The Family of Love | 1 | 0 |
| The Honest Whore | 1 | 0 |
| The Nice Valour | 1 | 0 |
| The Phoenix | 1 | 0 |
| The Poetaster | 1 | 0 |
| The Puritan | 1 | 0 |
| The Revenger's Tragedy | 1 | 0 |
| The Roaring Girl | 1 | 0 |
| The Second Maiden's Tragedy | 1 | 0 |
| The Witch | 1 | 0 |
| Twelfth Night | 1 | 0 |
| Volpone | 1 | 0 |
| Your Five Gallants | 1 | 0 |
7. Conclusions
According to nbc, we can be certain that Shakespeare authored 1 Henry VI.
We also notice, however, that nbc misattributes certain plays, such as Sejanus, The Duchess of Malfi, and The Spanish Tragedy, to Shakespeare. Two questions arise from this: first, where do these errors come from; and second, how can we trust any of nbc's conclusions?
To address the first: these errors are the result of a bad sample set. There is an old saying in computer science, so aphoristic it has become a cliché. "Garbage in, garbage out," the saying goes. This is not to say that the input here is garbage in and of itself, but rather that it is garbage as a data set [6]. In this case, a data set skewed heavily away from one writer (meaning that a given writer has a much lower word count in the set compared to the others included) yields bad results because of how the words are weighted by Bayes' theorem. For example, our set contains only one work by Kyd, his Spanish Tragedy. If a given play of Shakespeare's happens to have word counts with patterns similar to those in Kyd's play, nbc will tend to view the author as someone other than Shakespeare. My attempt to alleviate this by including a bevy of other authors with a number of words roughly equal to that in Shakespeare's corpus was not enough to overcome this aspect of the mathematics.
Now let us address the question of how, in the light of the above limitations, we can trust the output of nbc. Looking at our sample, we note that the misattributed plays are by authors who are poorly represented. Clean text files of plays by Jonson, Webster, and Kyd are much harder to find than those by Dekker, Middleton, and Shakespeare. If we run nbc using only the works of Middleton and Dekker, remembering to exclude random Shakespeare plays until the sample sets are the same size, we see much neater classification. The solution, then, is to provide good data, meaning a data set that is created with the minimum necessary amount of compositing work by multiple authors. To get the best results, we would build each corpus from the works of only one author.
I mention this problem and provide the above table to point to a central caveat: problems in the sample set, even if they seem minor, lead to bad results. Construction of a good data set, then, is governed by the following principles:
- The data itself must be clean, devoid of extraneous information (such as editorial comments).
- The two bodies of work we wish to compare must be of equal word count.
- The two bodies of work we wish to compare should each consist wholly of one author's output.
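The second principle, together with the random exclusion of plays described earlier, can be sketched as follows. This is a minimal Python illustration, not the procedure used for nbc: the function name `balance_by_word_count`, the play labels, and the word counts are all hypothetical.

```python
import random


def balance_by_word_count(larger, smaller, seed=0):
    """Drop randomly chosen plays from `larger` until its total word
    count no longer exceeds the total word count of `smaller`.

    Both arguments map a play title to its word count.
    """
    rng = random.Random(seed)
    target = sum(smaller.values())
    kept = dict(larger)
    while kept and sum(kept.values()) > target:
        # Sorting the titles makes the choice reproducible for a given seed
        del kept[rng.choice(sorted(kept))]
    return kept


# Hypothetical word counts, for illustration only
shakespeare = {"Play A": 24000, "Play B": 30000, "Play C": 21000, "Play D": 26000}
other = {"Play X": 20000, "Play Y": 25000}

balanced = balance_by_word_count(shakespeare, other)
print(sum(balanced.values()) <= sum(other.values()))  # True
```

Because whole plays are removed, the two totals will rarely be exactly equal; the sketch only guarantees that the larger corpus is trimmed to at most the size of the smaller one.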
In the case of early modern drama, the third guideline cannot always be followed, at least given the current state of carefully edited electronic texts. There is at present no other author relevant to our purposes whose output matches Shakespeare's in terms of word count. We cannot, then, run an analysis that includes all of Shakespeare's dramatic work. We are forced to work with a subset. This has led me to run a test with compromised data, the output of which is much more reasonable:
Figure 2: Results by play (balanced corpus)
| Title | Other | Shakespeare |
|---|---|---|
| 1 Henry IV | 0 | 1 |
| 1 Henry VI | 0 | 1 |
| 2 Henry IV | 0 | 1 |
| 3 Henry VI | 0 | 1 |
| Henry V | 0 | 1 |
| Richard II | 0 | 1 |
| Richard III | 0 | 1 |
| A Chaste Maid in Cheapside | 1 | 0 |
| A Trick to Catch the Old One | 1 | 0 |
| No Wit, No Help Like a Woman's | 1 | 0 |
| The Changeling | 1 | 0 |
| The Honest Whore | 1 | 0 |
| The Nice Valour | 1 | 0 |
| The Puritan | 1 | 0 |
| The Roaring Girl | 1 | 0 |
| The Second Maiden's Tragedy | 1 | 0 |
| Your Five Gallants | 1 | 0 |
With a selection of plays by Shakespeare balanced against an equal amount of text by Middleton, we notice much more sensible results. This underscores the importance of activities like Project Gutenberg [7], which provide clean text-only versions of works in the public domain. If analyses like this are to be most useful, good data must be made readily available.
Finally, I would like to offer a few brief notes on the current state of the software used here, and on the role of computers within literary study.
As Ken Williams, author of the implementation of Bayes' theorem used in nbc, notes, results tend to cluster around 0 and 1 as a result of normalization. This does not prevent us from answering our question, but it does remove some of the granularity from our analysis. We do not, for example, get results that are fine enough to determine the difference in authorship probability for texts that are very similar in word count, such as the three parts of Henry VI. Reauthoring nbc with a different Bayesian algorithm would help address this problem, as would using a greater floating point precision during normalization. Additionally, nbc should be reauthored for efficiency and speed (see the appendix for more details).
With respect to the broader question of statistical approaches to textual analysis, it is clear that Bayesian analysis, for all its predictive power, is no replacement for a human reading. At present, techniques of natural language processing do not obviate the necessity of a motivating human intelligence behind textual analysis, especially as it is performed in the humanities. Bayesian, and, more broadly, statistical approaches to questions of authorship are highly useful supplements to the field.
I do not pretend, here, to offer a full account of the friction between this methodology and the traditional operations of literary scholars. Such an account would be a rehearsal from both advocates of technology and advocates of conventional literary scholarship of a form dating back at least to the time of Daniel, a form which runs "mene, mene, tekel, upharsin," interpreted either as "it has been counted, weighed, and divided" or "you have been weighed and found wanting." For all my advocacy of counting and weighing, I intend nothing divisive with this work. My intention is not, in other words, that of the author of this form; I do not wish to break up a party or, currently, to offer dire warnings.
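Why normalized results pile up at 0 and 1 can be shown with a short sketch. It assumes, as one common approach and not as a confirmed description of Algorithm::NaiveBayes, that each category's score is a sum of log probabilities that is exponentiated and rescaled at the end. The two log scores below are hypothetical values of the magnitude a full-length play might produce.

```python
import math


def normalize(log_scores):
    """Turn per-category log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the largest score avoids overflow; very negative
    # differences still underflow to exactly 0.0 in double precision.
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical log scores for one long text, for illustration only
log_scores = {"Shakespeare": -151200.0, "Other": -152000.0}

probs = normalize(log_scores)
log_odds = log_scores["Shakespeare"] - log_scores["Other"]

print(probs)     # {'Shakespeare': 1.0, 'Other': 0.0}
print(log_odds)  # 800.0
```

A gap of a few hundred between log scores, which is small relative to the scores themselves, is enough to drive the losing category to zero. Reporting the difference of log scores alongside the normalized value is one way to keep the granularity that normalization discards.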
When I have presented this work to literary scholars, both in formal and informal settings, I have largely been met with a response I did not expect. Curiosity or hostility to foreign methods are understandable and, thanks to the work of our new historicists, scholars of identity, and analysts of culture, better understood than at any time in the past. The response this work largely garners, a response that is deeply shocking to me and deeply threatening to our progress as literary scholars, is indifference.
My shock comes from a certain amount of naïveté, since I have always been fascinated with the so-called "hard" disciplines (a binary categorization which does nothing to illuminate and everything to divide, but which nonetheless would consider logic, science, and mathematics part of the in-group), and in the depths of my interest I often forget that this fascination is not universal. The consuming fascination for literary scholars is, of course, reading rather than computation, interpretation rather than prediction. Many of my acquaintance find the products of logic, science, and mathematics useful in their own right, but wanting when applied to texts. I think it reasonable to assume that this preference extends beyond my own circle of acquaintances, both personal and via writing, and to literary scholars as a whole. Why study literature without a love of literature?
Nonetheless, we should not, as literary scholars, be indifferent to the methods of the so-called "hard" disciplines. Certainly these methods should not replace our tools of analysis, tools which were created for good reasons and have kept us going up to this point. I say "up to this point" because we currently find ourselves in the midst of a great change as regards textuality. The internet will receive no lionization from me, for it is a deeply flawed structure, but flaws and all this new publication mechanism is having and will continue to have a dramatic effect on writing, on the text, and on the codex, an effect which I hasten to add is but poorly understood. In the face of this change, and given that our current methods have so far yielded nothing authoritative by way of analysis of this change, we hinder ourselves if we refuse to supplement our methods, if we refuse, in other words, the right tool for a given job. In this case the job is determining authorship, and our tool, though born out of a discipline many of us find wanting in its approaches to literature, nonetheless has an undeniable merit: it can get us an answer.
This is why I find indifference so shocking. An answer is available. It is one thing to challenge an answer and another to credulously assume its validity. I would certainly welcome the first over the second, since any answer of worth can weather contention. But indifference in matters of questioning is deeply unsettling, since it implies that the question is so lacking in merit that an answer—any answer, regardless of its derivation—is below concern.
8. Citations
I gratefully acknowledge the following authors for their texts, programs, and insight.
- Manning, Christopher, and Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press, 2003.
- Stephenson, Neal. Cryptonomicon. New York: Perennial, 1999.
- Wasserman, Larry. All of Statistics. New York: Springer, 2004.
- Williams, Ken. Algorithm::NaiveBayes. http://cpan.uwinnipeg.ca/module/Algorithm::NaiveBayes.
9. Appendix A: The nbc Code
Download from johnmcox.org. Please view the file itself for licensing and notes.
10. Notes
1: See pages 235-238 of Neal Stephenson's Cryptonomicon for an extended account of how a cryptanalyst can turn a human-readable sentence such as "Two by four boards one hundred count length eight feet" into an encrypted string of characters readable only to those with knowledge of the code (or the mathematical and technical acumen to break said code).
2: For a thorough account of what today's computers are capable of, see Foundations of Statistical Natural Language Processing by Manning and Schütze.
3: See Paul Graham's essay for a more complete account.
4: Many programmers find Perl poorly suited to large-scale applications because of efficiency and maintainability. Often called "the duct tape of the Internet," Perl is most useful for short, simple programs. Perl code may be executed without a manual compilation step, which makes it a good language for prototyping software that will later be rewritten in a more efficient language such as C or C++.
5: Available from http://cpan.uwinnipeg.ca/module/Algorithm::NaiveBayes.
6: The distinction between the value of a thing itself and that thing's value as data will, sadly, have to wait for another paper. Here it is enough to make a distinction between the aesthetic value of a text and its utilitarian value as a piece of data that helps us develop a statistical prediction. This is one of the many assumptions and methodological choices demanded by this type of analysis that have caused many of the literature scholars to whom I've presented this project to view computational approaches with distaste or, worse, indifference.
7: An online repository of carefully edited texts, available at http://gutenberg.org.