Bootstrap consensus trees and frequency tables

1 Petroglyph
May 2, 2022, 1:53 pm

(This is a repost of a comment posted here)

In the interest of testing a method for questions we already know the answer to, I propose today's Lunch Break Experiment (tm) (though technically it involved a Lunch hour and now a pre-dinner baking potatoes hour). If I run R:Stylo on a series of texts whose authorial attributions are known, more or less, does the software produce the expected results, and if not: why not? (In all honesty: I also wanted to generate some Bootstrap Consensus Trees, because they are neat.)

So in today's episode we're going to take a look at some of the texts from the New Testament. Specifically: from the Latin-language Vulgate.


Corpus

I downloaded this corpus containing most of the books of the New Testament; these were originally sourced from The Latin Library. To download the corpus yourself, you need to log in to Github, then press the green "Code" button and select "download zip".

Note: Several texts have been split into two or more files, for example 1 Corinthians (cor i 1 and cor i 2) and Revelation (ap 1 and ap 2).


Individual dendrograms

Before I get to the Bootstrap Consensus Trees, I kind of have to explain what they are and under which conditions they are useful.

To begin with, let's generate a series of normal dendrograms (tree graphs, cluster graphs) for the same corpus but at different Most Frequent Words -- or in this case, Most Frequent Character groups of 3 characters.

The settings in Stylo: I set the language to Latin, and on the Features tab I chose to use characters instead of words, in groups of 3. I then set the minimum MFC to 100, the maximum to 1000, and the increment to 300. This means that Stylo will start generating dendrograms using the 100 Most Frequent Character trigrams and, moving up in steps of 300, stop at the 1000 most frequent trigrams.

In other words, four dendrograms were generated, at 100 MFC, 400 MFC, 700 MFC and 1000 MFC. I've put them below. Right now, I'm not going to comment on the individual groupings, but I want to say a few things about the general shape of the trees first. (Open in new tab for larger images)
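
For anyone who wants to see those two settings spelled out, here is a minimal Python sketch (not Stylo's own code). It lists the cutoffs that a minimum of 100, a maximum of 1000 and an increment of 300 produce, and it splits a string into overlapping character trigrams. The helper name char_trigrams and the sample sentence are made up for illustration, and whether Stylo treats spaces and punctuation exactly like this is an assumption.

```python
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    """Return every overlapping 3-character window in the text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


# Cutoffs: start at 100, move up in steps of 300, stop at 1000 (inclusive).
mfc_levels = list(range(100, 1000 + 1, 300))

# Illustrative sample only; spaces are kept as ordinary characters here.
sample = "in principio erat verbum"
trigram_counts = Counter(char_trigrams(sample))

print(mfc_levels)
print(trigram_counts.most_common(5))
```

Ranking trigram_counts from most to least frequent is what gives you the "Most Frequent Character trigrams" at each of those cutoffs.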




You can see that the clusters of "leaves" of this tree (the individual texts) don't really change all that much: the synoptics cluster together; the gospel of John is separate from the synoptics; the Pauline letters cluster together. Also constant is the initial split, which is between the more narrative texts (gospels, acts, revelation) and the advice-cum-philosophy epistles. But we'll get back to these in a bit.

What does change is the overall shape of the tree -- the way the branchings go from the basic split between narrative/non-narrative texts and the eventual text clusters.

This is due to the fact that 100 MFC (or 100 MFW) measures slightly different things than 400, 700 and 1000 MFC (or MFW). In general, the 100 most frequent words will contain a larger proportion of function words than the larger MFW groups, which will be mainly content words. From the other end, at 1000 MFW more "noise" will be included -- comparatively rare words may have an outsize effect on the calculations. Deciding on the right MFW involves balancing these two extremes. The results shift a little at each level of magnification, and picking one level over the others introduces a subjectivity that may be undesirable.


Bootstrap Consensus Tree

So. In order to smooth out some of that unwanted and fuzzy variability, Stylo offers a "Bootstrap Consensus Tree" -- a tree diagram that takes multiple snapshots at various MFW, and keeps only those branchings that are part of the majority of those snapshots. The image below is what that looks like. I generated this image by setting the minimum MFC to 100, the maximum MFC to 1000, and the step to 50. This means that I told Stylo to generate a total of 19 trees, at 100 MFC, 150, 200, 250, ... 1000 MFC. On the Statistics tab, I selected Consensus Tree, and set the consensus strength to 0.7. This means that only those branchings were kept that featured in at least 70% of the trees. Anything below that is excluded from the final dendrogram. (Open in new tab to embiggen.)
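
The consensus idea can be sketched in a few lines of Python. This is a rough illustration and not what Stylo runs internally. Each tree is reduced to the set of groupings (clades) it contains, and a grouping survives only if it turns up in at least 70% of the trees. The function name consensus_clades and the toy groupings are hypothetical and do not reproduce the actual output.

```python
from collections import Counter


def consensus_clades(trees, strength=0.7):
    """Keep the groupings found in at least `strength` of the trees.

    Each tree is given as a set of clades; a clade is a frozenset of leaf names.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    needed = strength * len(trees)
    return {clade for clade, n in counts.items() if n >= needed}


# Minimum 100, maximum 1000, step 50: one tree per cutoff.
levels = list(range(100, 1000 + 1, 50))

# Toy input (made up): two groupings that are stable, one that is not.
synoptics = frozenset({"mat", "mr", "luc"})
pauline = frozenset({"rom", "cor_i", "cor_ii"})
luke_acts = frozenset({"luc", "act"})
trees = [{synoptics, pauline}] * 15 + [{synoptics, luke_acts}] * 4

kept = consensus_clades(trees, strength=0.7)
print(len(levels), "trees;", len(kept), "groupings kept")
```

In this toy run the grouping that appears in only 4 of the 19 trees is dropped, which is the fate of any branching that falls below the consensus strength.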



(Important side note: in this type of graph distance does not correlate with similarity/difference.)


Results

Right. Let's see how well the text groupings produced by this method correspond with what we know about authorship in the New Testament. Overall, the tree produced here is entirely in line with expectations and seems pretty damn accurate.

  1. The three synoptic gospels (mr, mat, luc) form a single tight cluster -- which is entirely to be expected, given how much material they share, sometimes verbatim. The gospel of John (io 1-3 in orange) is very different, and forms a separate branch some distance away from the synoptics.

  2. Traditionally, the "John" who wrote the book of Revelation (ap 1 and ap 2) was identified with "John the apostle", who supposedly wrote the gospel of John. Modern scholarship no longer takes that view. And indeed: the consensus tree shows a very clear separation between the gospel (io 1-3 in yellow) and Revelation (ap 1-2 in green).

  3. Luke-Acts, despite layers of revisions, is likely by the same author. This graph, however, separates Acts (act 1-3 in red) from Luke, though the former is the closest to the synoptics cluster. I suspect that the reason Luke and Acts aren't on the same branch may be all the material Luke shares with Mark and Matthew. But I'm not an NT scholar, so I'll refrain from making claims I cannot possibly back up.

  4. The Epistle of James (iac, in black) is on its own branch, indicating a separate author from all the others.

  5. The cluster in the top left contains Romans and first and second Corinthians -- three letters that are pretty universally seen as from the hand of Paul.

  6. Finally: notice how this graph clearly stretches between two higher-level clusters: the more narrative texts at the bottom (gospels, acts, revelation) versus the epistles at the top.



Conclusion

So there you have it: Nearly all of this fits exactly with what we already know to be the case. Conclusion: This method can provide reliable results that we may apply to questions where the answers aren't as well studied.


Just one more thing

Just for shits and giggles I decided to perform one final test.

Because the original corpus did not include any letters that are almost-universally regarded as not by Paul, I decided to download those myself from The Latin Library (same source as the original corpus). I grabbed First Timothy, Second Timothy, and Titus.

Here is that consensus tree (same settings as last time) with those three letters added (clearly marked as not part of the original corpus):



Neat! The software places the three pastoral letters together with the other epistles, close to Paul but not on the same branch, suggesting their style may be similar to / inspired by Paul but not quite by Paul. It looks like James is closer to Paul's letters, at least in terms of MFC trigrams. I'm not going to comment on whether or not those three pastoral letters are by the same author -- I'm not an NT scholar, after all, and I don't really have a dog in this race, either.

2 Petroglyph
May 2, 2022, 1:53 pm

I also wanted to add a few notes on two files that Stylo creates in the working directory: table_with_frequencies.txt and wordlist.txt.



Frequency list

As a first step, R:Stylo generates a list of all the words in the entire corpus, and arranges them by frequency from highest to lowest. The 5000 most frequent words are then dumped into a word frequency table. Stylo will save this file to your working folder, and call it "table_with_frequencies.txt". For the NT corpus I used in >1037, the first few rows and columns look like this (right-click to embiggen):



(5000 words is a default that can be adjusted in the GUI > Features > List cutoff.)

This file is the basis for many of the tests that Stylo runs: if you've told Stylo to look at the 100 MFW (or 300 or 1000 or whatever) the programme will take the first 100 rows (or whatever) from this file for its calculations.

Let's look at a few of the most frequent words: et "and", in "in", autem "however", est "he/she/it is", non "not", cum "with", ad "towards", ut "so that"; qui is a relativizer ("who"), quia "because", si "if", eum, eius, vobis, me, vos are pronouns. In places 25 and 26 we get dixit "he/she said" and iesus "Jesus".

That's pretty much what you'd expect: high-frequency function words. It takes a while before the content words show up in any sizeable numbers.

Each file in the corpus gets a column. The numbers indicate how often each of the words occurs in that particular file.

Those numbers are relative: how often does this word occur in that particular text, relative to all the other words? Take the frequency of that word in text act_1, and divide it by the total number of words in that text (all words, not just the top 5000). Put differently: how much of the text in this file, percentage-wise, consists of that word? For act_1, et "and" accounts for 8.3% of all the words. Going down the column for act_1: 4 in every 100 words in that text are in "in", 1.5 words of every 100 words in that text are qui "who". And so on.
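
Here is how I understand that calculation, as a small Python sketch. It is an illustration only, not Stylo's code: the two tiny "texts" are invented, and the names relative_frequencies, corpus and table are hypothetical. The word list is taken from the whole corpus, and each count is divided by the total number of words in that one text.

```python
from collections import Counter


def relative_frequencies(tokens, wordlist):
    """Percentage of a text's words accounted for by each word on the list."""
    counts = Counter(tokens)
    total = len(tokens)  # all words in the text, not only those on the list
    return {word: 100 * counts[word] / total for word in wordlist}


# Invented toy corpus: file name -> list of words.
corpus = {
    "text_a": "et in et ad et non".split(),
    "text_b": "in ad in ut".split(),
}

# The word list comes from the corpus as a whole (top 3 here instead of 5000).
overall = Counter(word for tokens in corpus.values() for word in tokens)
wordlist = [word for word, _ in overall.most_common(3)]

# One column of percentages per file.
table = {name: relative_frequencies(tokens, wordlist) for name, tokens in corpus.items()}

for name, column in table.items():
    print(name, column)
```

Note that text_b gets a score of 0 for et: the word is on the corpus-wide list, so it still gets a row, even though this particular text never uses it.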

Out of curiosity, I added up all 5000 percentages for act_1, and the sum ended up being 88.83%. That means that some 11% of this particular text consists of words not in the 5000 most frequent words corpus-wide. The long tail, if you like. But as I said at the start, that 5000-word cutoff is customizable.
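
That adding-up exercise looks like this in Python. Again a sketch under assumptions: the function name coverage and the toy text are invented, and only the 88.83% figure comes from the actual table.

```python
def coverage(tokens, wordlist):
    """Percentage of a text's words that appear on the word list."""
    listed = set(wordlist)
    hits = sum(1 for token in tokens if token in listed)
    return 100 * hits / len(tokens)


# Invented toy text: five of its six words are on the list.
tokens = "et in et ad et non".split()
covered = coverage(tokens, ["et", "in", "ad"])
long_tail = 100 - covered

# The same subtraction for the act_1 figure quoted above.
tail_act_1 = round(100 - 88.83, 2)

print(round(covered, 2), round(long_tail, 2), tail_act_1)
```

Summing a file's column in table_with_frequencies.txt gives the same number as coverage() does here, because each row is already that word's share of the text.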

Another very important point is that these "top 5000 words" are not the top 5000 of any particular text in the corpus, but the top 5000 of the entirety of the corpus. This is important for a number of reasons:

  • The list of 5000 scores for each file forms a unique fingerprint of that file within this particular corpus: it tells you how, exactly, this particular text relates to the entire corpus.
  • Some of the top 5000 words will simply not occur in a particular text, or perhaps only rarely. That's useful information, too. (example: ergo "therefore", in row 30, column F)
  • When you compare the fingerprints of two texts with each other, the basis against which you do so is that of the entire corpus -- a shared baseline that is the same for all texts. The idea is that, given a large enough representative corpus, the same author's word choices and language patterns will remain consistent enough to show up as a similar signature against that corpus-wide baseline.


By way of an informal little analysis, I'll just point out some things that jump out at me in the files for the book of Revelation (ap_1 and ap_2; columns E and F):

  • The two files show very similar numbers: et "and" makes up more than a staggering 13% of both files. That is massive! No other text in this corpus comes close to that figure (closest is the gospel of Mark, mr_1 and mr_2, with 11.75% and 9.45%, respectively). This is one feature that suggests these texts are by the same author, and not by any other author in this corpus.
  • Along similar lines: both texts have very, very low frequencies for the word iesus "Jesus" (row 25) -- this is undoubtedly a factor in determining that Revelation is fairly different from, say, the gospels.
  • In row 18, the word me "me" makes up 0.08% and 0.04% of the text in ap_1 and ap_2 -- similar numbers, especially when compared to the surrounding columns. The same goes for columns G and H (the two files making up First Corinthians): their two figures are similar, and different from the other documents in this snippet.
  • Look at the figures for "Jesus" (row 25) in columns B, C and D (the three files making up the Acts): that name occurs much more in the first part of acts than it does in parts 2 and 3 (or in Revelation, for that matter). In a qualitative analysis, you could use this to argue that, as the book of Acts moves away from Jesus' lifetime, the acts and sayings of other people take centre stage.
  • In row 10, both ap_1 and ap_2 feature the word ut "so that" about 0.5 per hundred; look at the three columns to the right (first and second Corinthians), where ut makes up 1% -- that's double the numbers of Revelation. You'd have to consider other subordinating conjunctions as well to see if this is a trend. If so, you can argue that Revelation uses simpler language, with fewer complex sentences. I am not saying that this is definitely so -- only that this is the kind of evidence you would need to argue such a case.


This is a very brief, very incomplete, and very "first impressions" look at differences in word frequencies across texts and what they might imply. I've only listed them to illustrate the kind of similarities that the software will use to judge whether texts are close enough to be by the same author. Only, of course, it will do so systematically, with precise calculations, using every single one of the top 100/300/1000 whatever words, and comparing each of those words in any individual text to those words in every single one of the other texts.
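
To give an idea of what "systematically" can look like, here is a Python sketch of one distance measure that is widely used in stylometry, Burrows's Delta. I haven't said which measure my runs used, so treat the choice as an assumption made purely for illustration. The frequencies in the toy table are invented and are only loosely inspired by the et and ut figures above.

```python
from statistics import mean, stdev


def burrows_delta(table, a, b):
    """Mean absolute difference between the z-scored frequencies of texts a and b."""
    differences = []
    for word in table[a]:
        column = [table[text][word] for text in table]
        spread = stdev(column)
        if spread == 0:
            continue  # the word is used identically everywhere: no information
        centre = mean(column)
        z_a = (table[a][word] - centre) / spread
        z_b = (table[b][word] - centre) / spread
        differences.append(abs(z_a - z_b))
    return sum(differences) / len(differences) if differences else 0.0


# Invented percentages: file name -> {word: share of the text}.
table = {
    "ap_1": {"et": 13.0, "ut": 0.5},
    "ap_2": {"et": 13.2, "ut": 0.5},
    "cor": {"et": 5.0, "ut": 1.0},
}

print(burrows_delta(table, "ap_1", "ap_2"))
print(burrows_delta(table, "ap_1", "cor"))
```

The smaller the number, the more alike the two fingerprints are. With these toy figures the two Revelation files end up far closer to each other than either is to the third file.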

Everything I've said so far about single words can be applied to character trigrams or word bigrams, or whatever parameter you choose to use. R:Stylo will simply create the table_with_frequencies file for those instead of for individual words.



wordlist.txt

Finally: R:Stylo will also create a file in the working directory called wordlist.txt, which contains the 5000 most frequent words (or whatever you've set this limit to) and which looks like this: (in English, this time)



Stylo allows you to set an option "use existing wordlist", which means that for the next analysis Stylo will use the wordlist.txt that is already in your working directory. If, for instance, you would like to run an analysis on only the function words, or on some other specific group of words you're interested in (say, only subordinating conjunctions, or the modal auxiliaries), you simply copy and paste a properly formatted list into this file and tell Stylo on the next run-through to use the existing wordlist.
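
If you'd rather generate such a list with a script than paste it in by hand, something like the Python sketch below would do. It assumes the file simply holds one word per line, so check that against the wordlist.txt that Stylo wrote for you before relying on it. The function name write_wordlist and the short list of Latin function words are examples only, and the demo writes to a temporary folder instead of a real working directory.

```python
import tempfile
from pathlib import Path


def write_wordlist(path, words):
    """Write lower-cased, de-duplicated words to a file, one per line."""
    cleaned = (word.strip().lower() for word in words)
    unique = list(dict.fromkeys(word for word in cleaned if word))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Example list only: a few of the function words discussed above.
function_words = ["et", "in", "autem", "non", "cum", "ad", "ut", "qui", "quia", "si"]

with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "wordlist.txt"
    written = write_wordlist(target, function_words + ["ET", " ut "])
    lines = target.read_text(encoding="utf-8").splitlines()

print(lines)
```

Point the path at your own working directory to replace the real file, and keep a copy of the original wordlist.txt in case you want to go back.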