This page is a companion to a guest lecture on text analysis for ASRC 4513 - Science Fiction and the Value of Utopia/Dystopia (instructor: Ricardo Wilson), held 3/16/2017 in 701 Olin Library.

Preparation 
 

  1. Please bring a laptop! Bring your own or feel free to check one out at the Olin circulation desk.  No special software will be needed. All exercises will be done through a Web browser, without any special plugins.
  2. Read through the first three sections of the Text-mining article in Wikipedia.  This will give us a common orientation to text analytics.
  3. Read Ted Underwood's blog post "Seven ways humanists are using computers to understand text".  Make note of items that interest you or that you want to learn more about.
  4. Optional - Bring samples of text to load into Voyant. Input can be plain text, a PDF (with OCR), or an MS Word document.

Discussion from the reading (fairly open)

Exercises

All exercises will be demonstrated; we will learn together through "making".  Let's approach this informally.

Google nGrams

Google's nGram Viewer depicts the frequency of a word or phrase by publication year. Many modifications can be made to refine the analysis, so please consider the links below as starting points. The syntax for refinement is found on the About page.
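For anyone curious about what "frequency by publication year" means in practice, here is a minimal Python sketch of the idea. It is not Google's implementation: the tiny `corpus` list and the function name `ngram_frequency_by_year` are made up for illustration, and the frequency is simply the number of times a phrase occurs in a year divided by the number of same-length phrases published that year.

```python
from collections import Counter

# Hypothetical toy corpus of (publication year, text) pairs.
# These are invented sentences, not Google Books data.
corpus = [
    (1980, "utopia and dystopia in fiction"),
    (1980, "a dystopia of machines"),
    (1990, "cyberpunk dystopia and cyberpunk style"),
]

def ngram_frequency_by_year(corpus, phrase):
    """Return {year: phrase count / count of all n-grams of that length in the year}."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        # Slide a window of n tokens across the text to list its n-grams.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        totals[year] += len(ngrams)
        hits[year] += ngrams.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

freq = ngram_frequency_by_year(corpus, "dystopia")
for year, value in freq.items():
    print(year, round(value, 3))
```

Dividing by the yearly total matters because more text is published in some years than in others; a raw count alone would mostly track the size of the corpus.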

Building a progressive example (American English corpus, 1980-2008).  This example assumes that when writing about a movement, authors also give examples of that movement.  We can track how the frequency of specific examples changes over time.

Questions for thought -

Voyant

Voyant is a low-barrier text analysis tool that delivers a rich, interactive interface and a variety of visualizations.  (A guide is available with extensive documentation.) Input can be plain text, a PDF (with OCR), an MS Word document, or a URL for HTML analysis. Please feel free to bring your own material for upload, understanding that any material you upload will be subject to the Voyant privacy policy.  Sample texts and URLs for analysis are listed below for experimentation, in case you run low on ideas.
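Many of the visualizations in a tool like Voyant start from a simple count of how often each word appears, usually after setting aside very common "stopwords". The Python sketch below shows that counting step only. It is an illustration, not Voyant's code: the short `STOPWORDS` set and the function name `top_terms` are assumptions made for this example.

```python
import re
from collections import Counter

# A very small stopword list, assumed for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}

def top_terms(text, k=5):
    """Return the k most frequent words in text, ignoring case and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(k)

sample = "It was the best of times, it was the worst of times."
print(top_terms(sample, 3))
```

Try changing the stopword list and watch how the top terms shift; the same choice affects what a word cloud emphasizes.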

Sample texts for uploading from Project Gutenberg (choose the Plain Text UTF-8 version for download)

Questions for thought -

Immersion

Immersion is a tool for discovering the connections in a corpus of email.  It analyzes the flow data (information found in email headers) and represents it as a network of entities.  The analysis is done in real time on the flow data for which you provide credentials.  The display is rich and interactive. By design, Immersion collects only header information (From, To, Cc and Timestamp).  However, using the actual flow data from your account may raise privacy concerns. Be sure to read over the FAQs to understand what information you are granting access to and how it will be used.  If you do not like the terms of the tool, you can try it with their demo data.
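To make the idea of "a network from header data" concrete, here is a minimal Python sketch that reads only the From, To and Cc headers of a few messages and counts how often each pair of people is connected. It is not how Immersion itself is built: the messages and addresses are invented, and `header_network` is a hypothetical helper name.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical messages containing headers only; the addresses are made up.
raw_messages = [
    "From: ann@example.org\nTo: bob@example.org\nCc: cy@example.org\n"
    "Date: Thu, 16 Mar 2017 10:00:00 -0400\n\n",
    "From: bob@example.org\nTo: ann@example.org\n"
    "Date: Thu, 16 Mar 2017 11:00:00 -0400\n\n",
]

def header_network(raw_messages):
    """Count undirected sender-recipient links using From, To and Cc only."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = [
            addr
            for _, addr in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        ]
        for sender in senders:
            for recipient in recipients:
                if sender != recipient:
                    # frozenset makes ann-bob and bob-ann the same link.
                    edges[frozenset((sender, recipient))] += 1
    return edges

edges = header_network(raw_messages)
for pair, weight in edges.items():
    print(sorted(pair), weight)
```

Note that no message body is read at any point; the pattern of who writes to whom is enough to draw the network, which is also why header data alone can be sensitive.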

Questions for thought -

Wrap up

Other resources