This page is a companion to a guest lecture on text analysis for ASRC 4513 - Science Fiction and the Value of Utopia/Dystopia (instructor: Ricardo Wilson), held 3/16/2017 in 701 Olin Library.

Preparation 
 

  1. Please bring a laptop! Bring your own or feel free to check one out at the Olin circulation desk.  No special software will be needed. All exercises will be done through a Web browser, without any special plugins.
  2. Read through the first three sections of the Text mining article in Wikipedia. This will give us a common orientation to text analytics.
  3. Read Ted Underwood's blog post "Seven ways humanists are using computers to understand text". Make note of items that interest you or that you want to learn more about.
  4. Optional: bring samples of text to load into Voyant. Input can be plain text, a PDF (with OCR), or an MS Word document.

Discussion from the reading (fairly open)

  • What thoughts seemed most interesting?
  • For what thoughts might you need clarification?
  • What thoughts merit further exploration?
  • In terms of the continua mentioned in the article, where do you see your interests among the possible areas for exploration? Some areas of possible tension are noted below:
    • automating familiar tasks vs. making new discoveries
    • individual texts vs. large corpora
    • modeling to explain vs. modeling to predict
    • supervised vs. unsupervised learning

Exercises

All exercises will be demonstrated; we will learn together through "making".  Let's approach this informally.

Google nGrams

Google's nGram Viewer depicts the frequency of a word or phrase by publication year, expressed as a share of all the words or phrases of the same length in that year's corpus. Many modifications can be made to refine the analysis, so please consider the links below as starting points. Syntax for refinement is found on the About page.
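For those curious about what sits behind such a chart, here is a minimal Python sketch of the underlying idea: count how often a phrase appears in each year and divide by the total number of phrases of the same length for that year. This is optional and not needed for the session. It is an illustration under stated assumptions, not Google's implementation: the function names and the tiny `toy_corpus` are made up for this example, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(documents, phrase):
    """Share of each year's n-grams that match `phrase`.

    documents: iterable of (year, text) pairs (hypothetical toy data below).
    phrase: a word or multi-word phrase, e.g. "dystopia".
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    matches = Counter()
    totals = Counter()
    for year, text in documents:
        # Simplification: lowercase and split on whitespace, no punctuation handling.
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        matches[year] += sum(1 for gram in grams if gram == target)
    # Skip years that contain no n-grams of this length to avoid dividing by zero.
    return {year: matches[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Invented sample data, for illustration only.
toy_corpus = [
    (1980, "utopia is a place and dystopia is a warning"),
    (1980, "the utopia we imagine"),
    (2008, "dystopia after dystopia fills the shelves"),
]

freqs = relative_frequency_by_year(toy_corpus, "dystopia")
for year, share in freqs.items():
    print(year, round(share, 3))
```

Dividing by the yearly total is what makes years comparable: a year with more published text would otherwise show more hits for nearly every word.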

Building a progressive example (American English corpus, 1980-2008). This example assumes that when writing about a movement, authors also give examples of that movement. We can track how the frequency of specific examples changes over time.

Questions for thought -

  • What kind of thought does each stage elicit?
  • Are there conclusions we can make at any stage?
  • Are there provocative thoughts that we can further test by refining the model?
  • What are the boundaries of a feasible or logical hypothesis?
  • Is this a tool for exploring a corpus, proving a theory, or both? 

Voyant

Voyant is a low-barrier text analysis tool that delivers a rich, interactive interface and a variety of visualizations. (A guide is available with extensive documentation.) Input can be plain text, a PDF (with OCR), an MS Word document, or a URL for HTML analysis. Please feel free to bring your own material for upload, understanding that any material you upload will be subject to the Voyant privacy policy. Sample texts and URLs for analysis are listed below for experimentation, in case you run low on ideas.
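Many text analysis visualizations, word clouds among them, start from a simple count of how often each term occurs once very common words are set aside. The optional Python sketch below shows that counting step on a short passage. It is not Voyant's code: the `top_terms` function, the tiny `STOPWORDS` set, and the sample sentence are all assumptions made for illustration, and real tools use far longer stopword lists.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for this example only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def top_terms(text, k=5):
    """Return the k most frequent non-stopword terms as (term, count) pairs."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(k)


# Invented sample passage, for illustration only.
sample = "The city was a machine, and the machine was the city. The machine slept."
terms = top_terms(sample, k=2)
print(terms)
```

Changing the stopword list changes which terms rise to the top, which is worth keeping in mind when adjusting analysis settings in any tool of this kind.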

Sample texts for uploading from Project Gutenberg (choose the Plain Text UTF-8 version for download)

Questions for thought -

  • What types of tools or visualizations in this toolset do you find most helpful in providing insight into a text? 
  • What types of applicability does this toolset have?
  • Are there provocative thoughts that we can further test by adjusting the analysis settings?
  • Are there aspects of this toolset that seem misleading?  Can these be mitigated/corrected?

Immersion

Immersion is a tool for discovering the connections in a corpus of email. It analyzes the flow data (information found in email headers) and represents it as a network of entities. The analysis is done in real time on the flow data for the account whose credentials you provide. The display is rich and interactive. By design, Immersion collects only header information (From, To, Cc and Timestamp). However, using the actual flow data from your account may raise privacy concerns. Be sure to read over the FAQs to understand what information you are granting access to and how it will be used. If you do not like the terms of the tool, you can explore it with their demo data.
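To make the idea of a network built from headers concrete, here is an optional Python sketch that reads only the From, To and Cc headers of a few messages and counts how often each pair of addresses appears on the same message. This is not Immersion's implementation. Linking everyone who shares a message is an assumption chosen for simplicity, and the `co_occurrence_edges` function and the sample messages are invented for this example.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses
from itertools import combinations


def co_occurrence_edges(raw_messages):
    """Count, for each pair of addresses, the messages on which both appear.

    Only the From, To and Cc headers are read; message bodies are ignored.
    """
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        headers = (msg.get_all("From", [])
                   + msg.get_all("To", [])
                   + msg.get_all("Cc", []))
        # Normalize case and drop duplicates so each person counts once per message.
        people = sorted({addr.lower() for _, addr in getaddresses(headers) if addr})
        for a, b in combinations(people, 2):
            edges[(a, b)] += 1
    return edges


# Invented sample messages, for illustration only.
sample_messages = [
    "From: ana@example.org\nTo: ben@example.org, cy@example.org\n\nFirst message body.",
    "From: ben@example.org\nTo: ana@example.org\n\nSecond message body.",
]

edges = co_occurrence_edges(sample_messages)
for pair, weight in edges.most_common():
    print(pair, weight)
```

Each pair is an edge and its count is the edge weight. Even without reading a single message body, the headers alone reveal who corresponds with whom, which is why the privacy terms deserve a careful read.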

Questions for thought -

  • Are there relationships depicted in this visualization that confirm your intuition?
  • Are there relationships that seem surprising?
  • Are there aspects of the tool and its visualization that seem misleading?

Wrap up

  • Returning to the blog post, have new questions come up for you?  
  • Has anything changed with regard to your interests in terms of possible areas for exploration?

Other resources

 

 

 
