An overview of the computational analysis of text: its foundations, and an exploration of its challenges and strategies.

Preparation

Going down the rabbit hole: anatomy of a digital book

How is a digital book made?  How does the structure relate to its function? What opportunities does this afford us in terms of text analysis?

(1) Consider this book: Alice's Adventures in Wonderland.  Explore the controls on the right side of the page turner.  

(2) Now explore the controls at the top of the page turner. 

(3) What might this page be?  (It also has this view.) Is this also part of the book? When and how might it be used?

(4) Diving deeper into text.  Optical Character Recognition (OCR) processes are not perfect. Consider some areas of special challenge:

Computational analysis of text

We count tokens - What is tokenization?  Why tokenize?  What are some strategies used to tokenize?

(1) Let's look again at the Arapaho gospel of St. Luke. Switch to text view.

(2) Discussion: What are the opportunity points that the structure and arrangement of a book afford?
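To make the tokenization questions above concrete, here is a minimal sketch in Python comparing two simple strategies. The function names and the sample sentence are illustrative assumptions, not part of any tool used in this session, and real tokenizers make many more decisions (hyphens, curly apostrophes, numbers, language-specific rules).

```python
import re
from collections import Counter


def whitespace_tokenize(text):
    """Strategy 1: split on runs of whitespace; punctuation stays attached."""
    return text.split()


def word_tokenize(text):
    """Strategy 2: lowercase, then keep runs of letters.

    [^\\W\\d_] matches any Unicode letter, so accented characters survive.
    A straight apostrophe is allowed inside a word (e.g. "alice's").
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


# Illustrative sample text (an assumption, chosen only for demonstration).
sample = "Alice's cat sat. The cat, said Alice, was a CAT!"

print(whitespace_tokenize(sample))
print(word_tokenize(sample))

# Once text is tokenized, counting is straightforward.
counts = Counter(word_tokenize(sample))
print(counts.most_common(3))
```

Notice that the whitespace strategy treats "cat," and "CAT!" as different tokens, while the second strategy folds them into "cat". Which behavior is preferable depends on the question being asked of the text.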

Introducing Control - Microanalysis with Voyant

Voyant is a low-barrier text analysis tool that delivers a rich, interactive interface and a variety of visualizations based on token counts within a single text or a few texts. Input can be plain text, a PDF (with OCR), an MS Word document, or a URL for HTML analysis. Documentation will help you use this tool and its many features. Any material you upload is subject to the Voyant privacy policy. Sample texts and URLs for analysis are listed below for experimentation, but feel free to use other source data that interests you.

(1) Visualization of derived data

(2) Exerting control

(3) Discussion

Moving from Microanalysis to Macroanalysis (Google nGrams and HTRC Bookworm)

nGrams are words or phrases that have been tokenized and counted in a defined corpus. An nGram viewer graphs the relative frequency of a phrase against publication date. The two tools referenced below provide a basis for exploring nGrams. Each tool is bound to secondary data derived from analysis of a different corpus, so results for the same nGram will not necessarily align.
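The counting behind an nGram graph can be sketched in a few lines of Python. This is a toy illustration only: the corpus, the function names, and the normalization (dividing by the total number of nGrams of the same length in that year) are assumptions made for the example, and neither tool below is claimed to work exactly this way.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """Relative frequency of `phrase` per publication year.

    corpus: dict mapping year -> list of documents (each a list of tokens)
    phrase: tuple of tokens, e.g. ("white",) or ("the", "white")
    """
    n = len(phrase)
    result = {}
    for year, documents in sorted(corpus.items()):
        counts = Counter()
        for tokens in documents:
            counts.update(ngrams(tokens, n))
        total = sum(counts.values())
        result[year] = counts[phrase] / total if total else 0.0
    return result


# Hypothetical two-year corpus, already tokenized.
toy_corpus = {
    1865: [["the", "white", "rabbit", "ran"]],
    1871: [["the", "red", "queen", "ran"], ["the", "white", "queen", "sat"]],
}

print(relative_frequency_by_year(toy_corpus, ("white",)))
print(relative_frequency_by_year(toy_corpus, ("the", "white")))
```

Because the frequency is relative to everything else published in the same year, swapping in a different corpus changes the denominator as well as the count, which is one reason the two tools can disagree.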

Google's nGram Viewer. Use the links below as starting points; dynamic modifications can be made at any point. Rules for syntax can be found on the About page.

HathiTrust Research Center (HTRC) Bookworm (tied to a pre-1923 corpus). Again, consider these links as starting points. Rules for faceting and controls are available on the HTRC wiki.

Discussion

More Macroanalysis: HTRC

HathiTrust Research Center (HTRC) is a collaborative research center (jointly managed by Indiana University and the University of Illinois) dedicated to developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to large amounts of digital text. A basic orientation of HTRC services is available, and features and steps for each are documented on the HTRC community wiki.  We will be spending time in the HTRC Portal looking at the results of a few algorithms as a sampling of possibilities (links below will not render for all, but are parked to make sharing easier).  Algorithms in the portal can

Discussion

On your own - network analysis and image analysis

Immersion

Immersion is a tool for discovering the connections in a corpus of email. It analyzes the flow data (information found in email headers) and represents it as a network of entities. The analysis is done in real time on the flow data for which you provide credentials. The display is rich and interactive.
By design, Immersion collects only header information (From, To, Cc and Timestamp). However, using the actual flow data from your account may raise privacy concerns. Be sure to read the FAQs to understand what information you are granting access to and how it will be used. If you are not comfortable with the terms of the tool, you can experience it with their demo data.
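To see how header fields alone can yield a network, here is a minimal sketch using Python's standard email module. It is not Immersion's actual method: the rule used here (link every pair of addresses that appear together in one message's From, To or Cc fields) and the sample messages are assumptions made for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses
from itertools import combinations


def participants(raw_message):
    """Return the sorted, de-duplicated addresses in From, To and Cc."""
    msg = message_from_string(raw_message)
    fields = (msg.get_all("From", [])
              + msg.get_all("To", [])
              + msg.get_all("Cc", []))
    return sorted({addr.lower() for _name, addr in getaddresses(fields) if addr})


def build_network(raw_messages):
    """Count how often each pair of addresses shares a message.

    The result maps (address_a, address_b) -> number of shared messages,
    which can serve as a weighted edge list for a network graph.
    """
    edges = Counter()
    for raw in raw_messages:
        for pair in combinations(participants(raw), 2):
            edges[pair] += 1
    return edges


# Hypothetical messages; only the headers matter for this analysis.
messages = [
    "From: alice@example.org\nTo: bob@example.org\nCc: carol@example.org\n\nbody",
    "From: Bob <bob@example.org>\nTo: alice@example.org\n\nbody",
]

for (a, b), weight in sorted(build_network(messages).items()):
    print(a, "<->", b, weight)
```

Note that the message bodies are never read. The whole network comes from who wrote to whom, which is why header data alone can still be revealing.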

Image analysis

Ukiyo-e.org is a database and image similarity analysis engine, created by John Resig to aid researchers in the study of Japanese woodblock prints. The data comprises over 213,000 digital copies of prints from 24 institutions, along with their cataloging metadata. The metadata is indexed and searchable, as you might expect. (Details are noted on the About page.) But the images are also searchable. Resig's image search uses the TinEye matching engine to determine edges in an uploaded sample and compares them with the analyzed edges of images in the database, returning probable matches (edge analysis). Use the samples below to experiment.

Resources

"Formatting Science Reports." Academic and Professional Writing: Scientific Reports. University of Wisconsin - Madison, 24 Aug. 2014. Web. 03 June 2016.

Jockers, Matthew L. Text Analysis with R for Students of Literature. Cham: Springer International Publishing, 2014.

Underwood, Ted. The Stone and the Shell. Ted Underwood. Web. Accessed 03 June 2016.