HathiTrust Research Center - Introduction

Note that hands-on use of the HTRC portal and its tools requires a login. Please see the information linked from the section titled "The portal", below. Those wishing to try the tools on a collection of scholarly interest may want to construct such a collection by following the tutorial referred to in the section titled "Workset builder".

What is HathiTrust (HT)?

A consortium: an international partnership of over 100 institutions.
A digital library containing about 13.5 million books, ~5 million (38%) of which are in the public domain. All items are fully indexed, allowing full-text search within all volumes. You can log in with your Cornell NetID to:
- Create Collections (public or private)
- Download PDFs of any item available in full text
A trustworthy preservation repository providing long-term stewardship, redundant and robust backup, continuous monitoring, and persistent identifiers for all content.

What is the HathiTrust Research Center (HTRC)?

A collaborative research center (jointly managed by Indiana University and the University of Illinois) dedicated to developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to large amounts of digital text. Let's unpack this:
- "research" - algorithmic analysis, computational analysis
- "cyberinfrastructure" - the Data to Insight Center at Indiana University, supercomputers, data warehouse (Solr indices)
- "large amounts" - "at scale", the bigger the better
- "cutting-edge" - experimental by nature, although steps are being taken to move into production
Intended to serve and build community for scholars interested in text analysis.

What specific services does the HTRC offer scholars?

The offerings are documented on the HTRC User Community Wiki, which provides links to services, user support documentation, meeting notes, email list addresses and sign-up information, and FAQs.

The "portal"

The portal URL carries "SHARC" branding; eventually all services will be accessed through the portal.
Functionality depends on login; see the Portal & Workset Builder Tutorial for v3.0, "Sign up for an account, and sign in", for details.

Workset builder

Allows researchers to create a set of texts to analyze algorithmically; see the Portal & Workset Builder Tutorial for v3.0, "Create Workset", for a tutorial.
Worksets can be private (open only to your own use and management) or public (viewable by all logged-in HTRC users, with management restricted to the owner).
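
The private/public rule above can be illustrated with a small Python sketch. This is not HTRC code or the portal's API: the `Workset` class, the `can_view` and `can_manage` helpers, the user names, and the volume identifiers are all hypothetical, chosen only to make the visibility rule concrete.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Workset:
    """Hypothetical model of a workset: a named set of volume IDs."""
    name: str
    owner: str
    volume_ids: Tuple[str, ...]
    public: bool = False  # assumed default: private


def can_view(ws: Workset, user: str, logged_in: bool = True) -> bool:
    # Public worksets are viewable by any logged-in user;
    # private worksets are viewable only by their owner.
    if not logged_in:
        return False
    return ws.public or user == ws.owner


def can_manage(ws: Workset, user: str, logged_in: bool = True) -> bool:
    # Management is restricted to the owner in both cases.
    return logged_in and user == ws.owner


# Made-up identifiers, for illustration only.
private_ws = Workset("drafts", "alice", ("example.0001", "example.0002"))
public_ws = Workset("shared", "alice", ("example.0003",), public=True)
```

Under these assumptions, a second logged-in user can view `public_ws` but cannot manage it, and can neither view nor manage `private_ws`.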

Algorithms

Allows researchers to run ready-to-use algorithms against specific collections; see the Portal & Workset Builder Tutorial for v3.0, "Create Workset", for a tutorial.

Bookworm

Extracted Features datasets

Data Capsule

Allows researchers to create a virtual machine environment, configure it with tools, and analyze texts; see the Portal & Workset Builder Tutorial for v3.0, "Use the HTRC Data Capsule", for details.
Not yet tied to worksets.
Currently restricted to the "open-open" (non-restricted) corpus.
