Thursday, September 18, 2014

Stylometrics and Edward II by Peter Farey

One piece of stylometric evidence which seems at first sight to throw quite a large spanner into the Marlovian works appears in Shakespeare, Computers, and the Mystery of Authorship, edited by Hugh Craig and Arthur F. Kinney.[1] The book itself is not concerned with the Shakespeare authorship question as such, but with whether certain parts of Shakespeare's works or apocrypha are either by him or by a collaborator. For example, Craig argues quite convincingly for Marlowe having made significant contributions to parts one and two of Shakespeare's Henry VI trilogy.

It is in a chapter called 'The authorship of The Raigne of Edward the Third' (by Timothy Irish West), however, that the item having most significance for the Marlovian theory appears. It is a chart (Fig.6.8, p.130) in which he is simply testing the validity of an approach being used to see whether Shakespeare wrote either the 'Countess' scenes (I.ii–II.ii) or the 'French campaign' scenes (III.i–IV.iii) in Edward III.

In this graph there are 90 shaded circles representing segments of 6000 words each from 27 plays which are taken to be solely by Shakespeare. There are also 236 diamond shapes representing 6000-word segments from 85 other single-author plays written between 1580 and 1619.

The horizontal (X) axis represents the extent to which each segment includes lexical words[2] which have been identified as more characteristic of Shakespeare's works than the others'. The vertical (Y) axis shows the extent to which it uses words more typical of the others' works than Shakespeare's.[3] Not surprisingly, the segments divide into two fairly distinct clusters – the circles in one and the diamonds in another.
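The word-scoring step described in the note on characteristic words can be sketched roughly as follows. This is only an illustration of the arithmetic, not the book's actual software: the function names are hypothetical, each segment is assumed to be a list of lexical words that have already been filtered, and the toy segments are invented.

```python
def marker_scores(target_segments, other_segments):
    """Score every word as: share of target segments that contain it
    plus share of other segments that do not (theoretical maximum 2.0)."""
    target_sets = [set(seg) for seg in target_segments]
    other_sets = [set(seg) for seg in other_segments]
    vocab = set().union(*target_sets, *other_sets)
    scores = {}
    for word in vocab:
        in_target = sum(word in s for s in target_sets) / len(target_sets)
        not_in_other = sum(word not in s for s in other_sets) / len(other_sets)
        scores[word] = in_target + not_in_other
    return scores


def top_markers(scores, n=500):
    """Return the n highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Invented toy segments, purely for illustration.
shakespeare = [["gentle", "king", "spirit"], ["gentle", "mother", "crown"]]
others = [["king", "sword", "crown"], ["sword", "mother", "blood"]]

x_axis_scores = marker_scores(shakespeare, others)  # 'Shakespeare' markers
y_axis_scores = marker_scores(others, shakespeare)  # 'others' markers
```

Swapping the two arguments gives the list for the other axis. In the toy data 'gentle' occurs in every 'Shakespeare' segment and in none of the others, so it reaches the maximum of 2.0 on the first list, as 'sword' does on the second.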

In this particular chart, West has removed the three 6000-word segments of Marlowe's Edward II from the data, and treated them as if they were of unknown authorship. Shown as black triangles on the chart, all three fall quite clearly within the 'other authors' cluster, and not within the 'Shakespeare' one. Here is the result. (The caption should of course read 'segments of Edward II', not 'segments of Edward III'.)

Against this, he shows (pp.127–8) segments from Shakespeare's King John, Henry IV (part 1) and Henry V, all of which fall in the 'Shakespeare' cluster, albeit at the edge nearest to the other one.

There is no doubt that this is a fairly strong piece of evidence that the author of Edward II (i.e. Marlowe) was not also the author of the Shakespeare canon. A few points need to be borne in mind, however, before we all admit defeat.

1. Although the data for Edward II are ignored in arriving at the characteristic words for the 'other authors', all of the rest of Marlowe's plays (except Dido, because of the possible input by Thomas Nashe) are included, whereas they would all need to be ignored too if the intention was to assess Marlowe as a Shakespeare authorship candidate – which it wasn't.

2. The three Henry VI plays, Titus Andronicus and The Taming of the Shrew – those which because of time proximity are most likely to have similarities to Marlowe – play no part in this calculation.

3. This also means that there is no play in the 'Shakespeare' set which is known to have been written less than five years or so after Edward II.

4. Furthermore, the Shakespeare set includes plays as late as The Tempest, written some twenty years later.

This question of date is of crucial importance in any stylometric attempt to assign authorship. Let me give an example which is in essence a very much simplified version of the method employed by Craig and Kinney (and West).

Suppose that there are two bodies of work, one which we will ascribe to playwright A and the other playwright B.

We work out that the frequency with which they each use the words 'most' and 'then' differs greatly. In fact, if we add up the occurrences of both words in a play by either of them and find what percentage of that total is 'most', we can be fairly sure that:

* if it's less than 40%, it's by playwright A

* if it's more than 40%, it's by playwright B

(In fact this works for all of A's 21 plays bar one, and all of B's 16 except two. You would need to get a bit more complicated to get 100% in each case!)
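The simplified test above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and sample sentences are invented, tokenisation is a crude lower-case split, and no attempt is made to handle early modern spelling variants.

```python
import re


def most_share(text):
    """Percentage of the combined 'most' + 'then' count that is 'most'.
    Returns None when neither word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    most = words.count("most")
    then = words.count("then")
    total = most + then
    if total == 0:
        return None
    return 100.0 * most / total


def classify(text, threshold=40.0):
    """Below the threshold suggests playwright A, above it playwright B."""
    share = most_share(text)
    if share is None or share == threshold:
        return "undetermined"
    return "A" if share < threshold else "B"


# Invented sample passages, not quotations from any play.
sample_a = "Then he spoke, and then most nobly left; then silence."
sample_b = "Most true, most strange; then farewell."
```

Here `sample_a` has one 'most' against three 'then' (25%), so it falls on the playwright A side, while `sample_b` has two against one and falls on the playwright B side. The same function could be run act by act on a suspected collaboration.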

Now let’s imagine that we have a play where we suspect collaboration between the two playwrights. We find that Acts 1 & 2 are well below 40% (so probably playwright A) and Acts 3 & 5 well above (playwright B). Act 4 is more doubtful at 43%.

So does this tell us anything at all about whether the two playwrights are different people? No. In fact playwright A is Shakespeare before 1600, and playwright B is Shakespeare after 1600. And Twelfth Night (1601?) was the play in question, if you were wondering.

What we can see, therefore, is that claiming this tells us they were different people is circular reasoning. If you start with an assumption that they are two different people, and take no account of time, then it’s hardly surprising that this is just what the figures will seem to show.

Don't get me wrong, though. This is a powerful piece of evidence against the Marlovian theory, and it would be wrong to think otherwise.

© Peter Farey, September 2014

[1] Craig, Hugh; Kinney, Arthur F. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press.
[2] According to Craig & Kinney (p.224), "Words can be classified into functional words and lexical words (with just a few doubtful cases). Function words have a grammatical function; examples are the, and, she, before, and of. [...] Lexical words [are] nouns, verbs, adjectives, and adverbs which can be substituted for each other in a given sentence." They give king and mother as examples.
[3] The most characteristic words for Shakespeare (the horizontal axis) are found by calculating the proportion of 'Shakespeare' segments within which a given word appears and adding this to the proportion of 'others' segments within which it does not appear (giving a theoretical maximum of +2). The 500 words with the highest scores are the ones used. For the vertical axis, the same procedure is followed, but finding those words with the highest combination of proportions 'within the others' and 'not within Shakespeare'.