The graphical user interface, or "GUI," of the popular topic modeling implementation MALLET is a useful alternative to the terminal or command-line input MALLET typically requires. Freely downloadable here, it is a quick and easy way to get started with topic modeling without being comfortable in the command line. To start, simply download the file; once the download is finished, open the file TopicModelingTool.jar to begin (you will need Java installed on your machine in order to run it).

With the program started, you will first want to select an input file or directory: this is the file or group of files on which you want to run your topic model. After you select the correct input directory, you'll also want to specify your output directory. This is where MALLET will deposit your results, both in a flat-file CSV version and in an HTML version for viewing in your web browser. The default option is a new folder within your "Downloads" directory, but it's helpful to move it to a more permanent spot. If you want to make your results shareable online, you can specify here where on your web server you'd like the results to go, and they will be sent there in fully interactive form.

Lastly, before you run the program, click on the "Advanced" button next to the number of topics you want to create and make sure that the "Remove Stopwords" box is checked and that the MALLET default file is selected in the Stopword File tab at the top of the window. If you are working with a corpus that is not modern English, i.e., one with lots of "thy" or "thou" or regional dialects, the basic English stopwords list is housed here on GitHub. You can download this file, make your own additions, and then select it for use in MALLET instead of the default list. To do this, just select your modified file instead of the default MALLET list in the "Advanced" section. Now you are ready to hit "Learn Topics" and run MALLET.
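Building that custom stopword file can be sketched in a few lines. The snippet below is a minimal illustration, not part of the tool itself: the abbreviated default list stands in for the full file you would download from GitHub, and the output filename "my_stopwords.txt" is a placeholder of my choosing.

```python
# Extend an English stopword list with archaic forms ("thy", "thou", etc.)
# and save the result as a one-word-per-line file, the format the
# Stopword File option expects. The default list here is abbreviated;
# in practice, start from the full file downloaded from GitHub.

default_stopwords = {"a", "an", "and", "in", "of", "the", "to"}  # abbreviated
archaic = {"thy", "thou", "thee", "hath", "doth"}

custom = sorted(default_stopwords | archaic)

with open("my_stopwords.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(custom))
```

Once saved, you would point the Stopword File option in the "Advanced" section at this file instead of the default list.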


After MALLET is done learning the topics (it took the program a little shy of two minutes to run my directory of 36 volumes totalling around 16,000 words), there are a number of useful results to analyze. Before you do, it's important to remember that MALLET's results will differ every time, as the algorithm learns topics anew on each run (for further explanation, see David Blei's "Introduction to Probabilistic Topic Models"); the topics are therefore not perfectly stable, although they should remain close from run to run. Secondly, it's also important to know that the topics produced are not listed in order of frequency, i.e., topic 1 is not necessarily the most frequent topic within the corpus.


Keeping these caveats in mind, I've found the HTML output more useful to analyze than the CSV files, as the HTML version allows you to interactively pivot and zoom through different layers of analysis. The main Topic Index (all_topics) is generally a good place to start. Here you have an overview of the 10 or 20 (or however many you chose) topics that you asked the program to find. Although the topics themselves are not organized by frequency, the words within each topic are. For instance, in my topic model of early-nineteenth-century Tractarian (conservative Anglican) writings, "faith" is the most common word in the first topic, followed by "justification," "righteousness," and so on.

Selecting a topic then leads you to a breakdown of how many times words from that topic occur in each item. After selecting topic 1, I see that it is overwhelmingly concentrated within John Henry Newman's "Lectures on Justification": that volume contains 23,586 words from topic 1, whereas the next closest, the fifth volume of Newman's "Parochial Sermons," contains only 2,057. A quick scan of the rest of the list shows that every other text has less than 10% of the topic 1 words that "Lectures on Justification" contains. Thus, we can safely assume that "Lectures on Justification" is relatively unique within the Tractarian literature, as it essentially constitutes an entire topic by itself.

To further support this assumption, we can zoom in one more level within the MALLET results. Clicking on the link to "Lectures on Justification" yields a sample of the text itself along with a list of the most frequent topics occurring within the volume. Unsurprisingly, we find that topic 1 is far and away the dominant topic: 48% of the words in the document were assigned to it, whereas the next closest, topic 20 from the main index ("god christ world man lord holy men day faith st"), is a distant second at 12%.


This exercise ably demonstrates some of the limits of topic modeling. Especially in a large corpus, the topics do not necessarily encompass themes bridging various works; instead, they are often just the predominant theme of one or two works within the greater corpus. That is to say, what exactly constitutes a "topic" at the document or corpus level is a variable that depends greatly on the corpus itself and requires some analysis to unveil. Another caveat is that the "top ranked documents" within a topic (and perhaps even the topic itself) can be skewed by document length. To use another example from my Tractarian corpus, topic 16 ("church scripture doctrine truth system fathers divine doctrines words catholic") has the most occurrences (16,970) in the fifth volume of the Tracts for the Times. But if you zoom down another level into the breakdown of topics within that particular volume, you will find that topic 16 constitutes only 14% of the words in the document: out of approximately 313,000 words, 14% equals 43,820 words. Juxtapose that against another volume titled "ElucidationsNewman," a shorter pamphlet by Newman written to address the perceived theological heterodoxy of one of his rivals: 35% of the words in "Elucidations" fall within topic 16, and out of approximately 16,500 words, 35% equals 5,775 words. "Elucidations," though much shorter than the Tracts for the Times, concerns itself much more intimately with ideas of church, scripture, and doctrine than does the lengthy volume of Tracts, both in close reading and in the algorithmically generated percentages. While it may be obvious to say, ranking documents by raw word counts within a topic, as opposed to the percentage of each document assigned to that topic, automatically favors longer texts.
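The arithmetic behind that comparison is worth spelling out, since it is exactly where raw counts and percentages pull in opposite directions (the figures are the approximations quoted above):

```python
# Raw topic-word counts favor long documents; proportions do not.
tracts_total = 313_000        # approx. word count, Tracts for the Times vol. 5
tracts_share = 0.14           # fraction of its words assigned to topic 16
elucidations_total = 16_500   # approx. word count, "Elucidations"
elucidations_share = 0.35     # fraction of its words assigned to topic 16

tracts_count = tracts_total * tracts_share                  # ~43,820 words
elucidations_count = elucidations_total * elucidations_share  # ~5,775 words

# The Tracts "win" on raw counts but lose badly on proportion:
print(tracts_count > elucidations_count)    # True
print(tracts_share < elucidations_share)    # True
```

Ranking by `tracts_count`-style totals would put the Tracts first; ranking by share would put "Elucidations" first.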


One of the drawbacks of the MALLET GUI is that it does not tell you about relationships between topics. Especially in a larger corpus with many documents and many topics, it's very difficult to go through by hand and figure out which topics tend to co-occur. The ability to look at co-occurrence of topics throughout the corpus would provide yet another layer to the idea of thematized searching that topic modeling embodies. As a corollary, it would be helpful if you could read co-occurring topics across documents as well. Of course, the "top ranked documents" page shows the documents and how frequently one particular topic occurs in each of them, but it would be interesting to see in which documents two topics tend to co-occur.
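That co-occurrence view is something you could rough out yourself from the tool's CSV output. The sketch below is only an illustration of the idea: the `doc_topics` matrix is toy data, and in practice you would parse the real document-topic proportions from MALLET's CSV files, whose exact layout I leave aside here.

```python
# Rough topic co-occurrence: treat each document as a vector of topic
# proportions and count, for every pair of topics, the documents in which
# both topics are prominent. `doc_topics` is toy data standing in for the
# document-topic proportions parsed from MALLET's CSV output.

doc_topics = [
    # t0    t1    t2
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.10, 0.15, 0.75],
]

def cooccurrence(matrix, threshold=0.2):
    """Count documents where both topics of a pair exceed `threshold`."""
    n_topics = len(matrix[0])
    counts = {}
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            counts[(i, j)] = sum(
                1 for row in matrix if row[i] > threshold and row[j] > threshold
            )
    return counts

print(cooccurrence(doc_topics))  # {(0, 1): 2, (0, 2): 0, (1, 2): 0}
```

Here topics 0 and 1 co-occur in two documents while topic 2 keeps to itself, which is precisely the kind of pattern that is tedious to spot by paging through the HTML output document by document.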

Additionally, while MALLET quickly and ably produces topics for a single volume (it took two seconds to work through Darwin's On the Origin of Species), the rest of its functionality as an analytic tool suffers at that scale. For instance, running the program on Origin of Species gives a standard list of 20 topics, ranging from "life conditions change ancient animals sterility struggle existence instincts parents" to "plants birds insects productions numbers early area seeds tendency perfectly," but when you click on a topic to see where it appears in the documents, you get a dizzying list of over 500 different locations. Clicking on one of these "documents" leads you to a text fragment within the volume. Taken out of context, these fragments are generally too small to be of much help.


For those curious, my results are available online to play with, although this is an ongoing experiment, so they may or may not always be up: http://digitalmedia.library.cornell.edu/digital_humanities/output_html/all_topics.html. Happy topic modeling!
