For this week’s data analysis and visualization blogpost, I decided to revisit the University of North Carolina’s Documenting the South’s “North American Slave Narratives” collection. While I previously selected two specific narratives to compare based on word frequencies, this time I chose to examine a much larger selection. This corpus has 294 documents with 11,356,449 total words and 74,752 unique word forms. The longest document with 298,181 words is Reverend William J. Simmons’s “Men of Mark: Eminent, Progressive, and Rising.” The shortest, on the other hand, with only 953 words is the “Life, Last Words and Dying Speech of Stephen Smith, a Black Man, Who Was Executed at Boston This Day Being Thursday, October 12, 1797 for Burglary.”
While we’ve been using R to produce many of our charts and graphs in class, I thought I’d use this blogpost as an opportunity to try something different. The program I have selected is called Voyant Tools, which is an open-source, web-based reading and analysis environment for digital texts. There are certainly many benefits to R, such as the ability to fully control how your data is displayed. However, one of the advantages of Voyant Tools is that it is highly dynamic: it enables your online audience to interact with the data visualizations you produce. Because of this interactive component, public viewers might be able to discern previously unidentified patterns and comment on them in the blog. Another advantage of Voyant Tools is its ability to analyze a variety of document formats, including plain text, HTML, XML, MS Word, RTF, and PDF; it can also strip text from webpages.
Anyway, let’s get started visualizing and analyzing the text. The first visualization I have produced shows basic trends for the corpus, such as the longest and shortest documents, as well as those with the highest and lowest vocabulary densities. Perhaps the most intriguing part of this table is the list of distinctive words generated from single documents in comparison to the entire corpus. In this example, we can see how the program recognizes dialect spellings as distinctive, such as “uv” (you’ve), “wuz” (was), or “dat” (that). This points us to one of the documents, entitled “John Jasper: The Unmatched Negro Philosopher and Preacher,” in which Reverend William E. Hatcher writes the story of John Jasper, a former slave and prominent preacher in Virginia. A quick look at the text reveals that Hatcher reproduced non-standard aspects of Jasper’s colloquial speech in his book to convey to readers the essence of his character.
The following graphic shows a “word cloud” of the most frequent words within the corpus. This is where the dynamic aspect is helpful, as hovering over each word provides the individual counts. The most common words include “man” (24,561), “time” (25,723), “said” (24,561), and “mr” (22,780). “Men” (17,402) also occurs frequently, as does “master” (13,626), perhaps suggesting a gendered slant to the corpus. In comparison, “woman” (4,966) and “female” (875) occur much less frequently. Given that the corpus consists solely of slave narratives, it is not surprising that “slave” (13,879), “slaves” (13,101), and “slavery” (11,267) occur frequently. The high usage of “god” (13,879) and “church” (11,115) suggests that the majority of these texts deal with the role of religion in these persons’ lives. Lastly, another trend visible in the word cloud is the number of terms relating to time, including “day” (16,774), “years” (16,655), “night” (10,011), “long” (8,989), and “soon” (8,766). It is also interesting to note that “old” (12,934) and “new” (11,572) are mentioned at roughly similar rates.
While the word cloud provides an interesting visualization of word frequencies, one drawback is that it does not show how those words may be connected. The collocates graph, however, conveys a network of high-frequency terms that appear in proximity to one another. In this instance, the graph reveals how “man” appears in close proximity to “named,” “mr,” “yes,” “god,” and “said,” while “time” appears in proximity to “years,” “mr,” “come,” and “came.”
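For readers curious about what a collocates view is counting, here is a rough sketch in Python. It is not Voyant’s implementation; the function name `collocates`, the simple regex tokenizer, and the window size are all my own assumptions for illustration. The idea is simply to tally the words that fall within a few positions of a keyword.

```python
import re
from collections import Counter

def collocates(text, keyword, window=2):
    """Count words appearing within `window` tokens of `keyword`.

    Hypothetical helper for illustration; the tokenizer is a
    simplifying assumption (lowercase letters and apostrophes only).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            start = max(0, i - window)
            # Words just before and just after the keyword.
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

sample = "the man said yes and the old man said no"
print(collocates(sample, "man", window=1).most_common())
```

A wider window catches looser associations but also more noise, so the choice is a judgment call.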
The next visualization shows a documents grid. It contains the following information: the title of each document; the number of individual words, or tokens, found in each document; the number of distinct word forms, or types, in each document; and the ratio of types to tokens expressed as a percentage (with a higher percentage indicating a more diverse vocabulary). The table is useful because its search box lets you query certain words or word combinations. A useful step in this process is to sort the table by words, types, or ratios.
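To make the types/tokens ratio concrete, here is a minimal Python sketch of the arithmetic. The function names and the regex tokenizer are assumptions of mine, not anything taken from Voyant, which may tokenize text differently.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_stats(text):
    """Return (token count, type count, types/tokens as a percentage)."""
    tokens = tokenize(text)
    types = set(tokens)  # each distinct word form counts once
    ratio = 100 * len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

sample = "the man said the man came"
n_tokens, n_types, ratio = type_token_stats(sample)
print(n_tokens, n_types, round(ratio, 1))  # 6 tokens, 4 types, 66.7 percent
```

One caveat worth keeping in mind: longer documents tend to have lower ratios simply because common words keep repeating, so the percentage is most meaningful when comparing texts of similar length.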
The trends graph is perhaps most meaningful when limiting the comparison to a few words, such as “man” and “woman.” The majority of the authors have male names, and this graph suggests that the content of the corpus likewise focuses on men more than women. Another way to view the trends graph is to limit the total number of documents for comparison. One limitation of the program is that the graph can only show about 100 documents across the x-axis, because showing the entire corpus would make the chart incredibly wide. One useful aspect of the graph is the ability to switch between relative and raw frequencies.
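The difference between raw and relative frequencies matters because the documents vary so much in length. Here is a small Python sketch of the distinction. The function name, the tokenizer, and the choice to scale per 10,000 words are my own assumptions and not necessarily how Voyant normalizes its counts.

```python
import re

def raw_and_relative(documents, word, per=10_000):
    """Map each title to (raw count, count per `per` tokens) for `word`.

    `documents` is a dict of title -> text. The `per` scaling factor
    is an assumed convention for illustration only.
    """
    results = {}
    for title, text in documents.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        raw = tokens.count(word)
        relative = per * raw / len(tokens) if tokens else 0.0
        results[title] = (raw, relative)
    return results

docs = {
    "short": "the man spoke",
    "long": "man man " + "word " * 8,
}
for title, (raw, rel) in raw_and_relative(docs, "man").items():
    print(title, raw, round(rel, 1))
```

In this toy example the longer document has the higher raw count but the lower relative frequency, which is exactly the kind of reversal the toggle in the trends graph can reveal.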
Below is another trends graph, where we see that “white” occurs about as frequently as “negro.” However, “colored” also occurs relatively frequently.
The next table shows keywords in the context of the phrases occurring before and after the keyword. This is useful to examine how terms are used in different contexts. The collocates graph showed us a strong connection between the words “man” and “said” so we can use this table to query this combination.
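The keyword-in-context idea is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions (the function name `kwic`, the tokenizer, and the context width are all hypothetical), not a description of how Voyant builds its table.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match.

    Hypothetical helper: `width` is the number of words kept on each
    side, and matching ignores case.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

sample = "The old man said he would come. Then the man left."
for left, word, right in kwic(sample, "man", width=2):
    print(f"{left:>10} | {word} | {right}")
```

To look specifically at “man” followed by “said,” one could filter the rows to those whose right context starts with “said.”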
While most of the tools thus far have provided ways to analyze trends across the entire corpus, the next visualization is more powerful for identifying patterns within individual documents. It shows how the use of high-frequency words shifts from the beginning of the document to its end. This example looks at Reverend William E. Hatcher’s “John Jasper: The Unmatched Negro Philosopher and Preacher.” We see that “negro” is mentioned more often at the beginning of the book, after which the focus seems to shift to Jasper and the people to whom he preaches.
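A rough way to reproduce this kind of within-document view is to cut the text into equal segments and count a word in each one. The Python sketch below does just that. The function name, the tokenizer, and the default number of segments are assumptions of mine rather than Voyant’s actual settings.

```python
import re

def segment_counts(text, word, segments=4):
    """Count occurrences of `word` in each of `segments` equal slices.

    Hypothetical helper: slices are by token position, so the first
    entry covers the start of the document and the last covers the end.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = [0] * segments
    for i, tok in enumerate(tokens):
        if tok == word:
            # Map token position i to a segment index in 0..segments-1.
            counts[i * segments // len(tokens)] += 1
    return counts

sample = "negro negro a b c d e jasper"
print(segment_counts(sample, "negro", segments=2))
print(segment_counts(sample, "jasper", segments=2))
```

Fewer segments give a smoother picture, while more segments show finer shifts at the cost of noisier counts.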
The Voyant Tools environment provides numerous other tools to visualize both the corpus and individual documents. While the analysis didn’t provide any groundbreaking insights, it did help confirm some suspected trends. Perhaps one of the most useful aspects was the ability to identify texts within the corpus that are distinct, as in the case of Hatcher’s book.