
...


What is HathiTrust (HT)?

a consortium: an international partnership of over 100 institutions.
a digital library containing about 13.5 million books, ~5 million (38%) of which are viewable in full online. All items are fully indexed, allowing full-text search within all volumes. You can log in with your Cornell NetID to:
- create Collections (public or private)
- download PDFs of any item available in full text
a trustworthy preservation repository providing long-term stewardship, redundant and robust backup, continuous monitoring, and persistent identifiers for all content.

...

a collaborative research center (jointly managed by Indiana University and the University of Illinois) dedicated to developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to large amounts of digital text. Let's unpack this:
- "computational access" - computational analysis, algorithmic analysis, distant reading, text-mining
- "cyberinfrastructure" - Data to Insight Center at Indiana University: supercomputers, data warehouse, Solr
- "large amounts" - "at scale", the bigger the better (more signal, less noise)
- "cutting edge" - experimental by nature: things can break, things are unfinished or in development; see the DSPS blog post on HTRC Uncamp 2015 for the most recent developments
intended to serve and build community for scholars interested in text analysis; to join the user group mailing list, send an email to htrc-usergroup-l-subscribe@list.indiana.edu
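
To make "computational access" concrete, here is a minimal sketch of the simplest kind of text mining: counting word frequencies across a small set of volumes. Everything in it is an assumption for illustration only. The `workset` dictionary, its volume IDs, and its texts are made up, and the code does not use any HTRC tool or API.

```python
import re
from collections import Counter

# Hypothetical stand-in for a set of volumes: volume ID -> plain text.
# These IDs and texts are invented for illustration; they are not HTRC data.
workset = {
    "vol-001": "The whale surfaced. The crew watched the whale.",
    "vol-002": "A whale is not a fish, the naturalist wrote.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(volumes):
    """Aggregate token counts across every volume in the collection."""
    counts = Counter()
    for text in volumes.values():
        counts.update(tokenize(text))
    return counts


counts = word_counts(workset)
print(counts.most_common(3))
```

The same counting logic applies whether the collection holds two sentences or millions of volumes, which is why scale matters for this kind of analysis.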

...

allows researchers to create a set of texts to analyze algorithmically; see Portal & Workset Builder Tutorial for v3.0, "Create Workset", for a tutorial
linked from the portal, but also at https://sharc.hathitrust.org/blacklight
it really, really helps to use it in a second window and operate the portal in the first
worksets can be private (open to your own use and management) or public (viewable by all logged-in HTRC users, management restricted to owner)

...

allows researchers to run ready-to-use algorithms against specific collections; see Portal & Workset Builder Tutorial for v3.0, "Create Workset", for a tutorial
many algorithms provided; others can be added at scholars' request as time permits development
workshop dedicated to these alone (ask and I can give you a tour)
handout available

allows researchers to create a virtual machine environment, configure it with tools, and analyze texts; see Portal & Workset Builder Tutorial for v3.0, "Use the HTRC Data Capsule", for details
requires a VNC application for your browser, like VNC Viewer for Google Chrome
designed to be a secure analytical environment that respects access restrictions to text while allowing for computational analysis; maintenance mode / secure mode
not yet tied to worksets
currently restricted to "open-open" (non-restricted) corpus; eventual objective is to allow for access to full HT corpus

...

open source project, same basic functionality as Google Ngram Viewer, although graphically faceted
base data is currently "open-open" data; working on the legal aspects required for the base data to shift to the entire HT corpus, regardless of viewability
plans and an allocated grant to develop a tie-in to worksets
see wiki for tutorial
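
The "same basic functionality as Google Ngram Viewer" noted above comes down to tracking how often a term appears per year, relative to all words published that year. The sketch below shows that calculation on a made-up corpus. The `corpus` records and the `yearly_relative_frequency` helper are hypothetical and are not part of this tool or its data.

```python
from collections import Counter

# Hypothetical (publication year, text) records, invented for illustration.
corpus = [
    (1850, "steam engine and steam ship"),
    (1850, "the engine room"),
    (1900, "electric light and electric motor"),
    (1900, "steam gives way to electric power"),
]


def yearly_relative_frequency(records, term):
    """Return {year: occurrences of term / total tokens in that year}."""
    term_hits = Counter()
    total_tokens = Counter()
    for year, text in records:
        tokens = text.lower().split()
        total_tokens[year] += len(tokens)
        term_hits[year] += tokens.count(term)
    return {year: term_hits[year] / total_tokens[year]
            for year in sorted(total_tokens)}


trend = yearly_relative_frequency(corpus, "steam")
for year, share in trend.items():
    print(year, round(share, 3))
```

Dividing by the yearly token total keeps years with more published text from dominating the trend line.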