
Note that hands-on use of the HTRC portal and its tools requires a login.  Please see the information linked from the section titled "The portal", below.  Those wishing to try the tools with a collection of scholarly interest may want to construct such a collection by following the tutorial referenced in the section called "Workset builder".

 

What is HathiTrust (HT)?

  • a consortium - an international partnership of over 100 institutions
  • a digital library containing about 13.5 million books, ~5 million (38%) of which are viewable in full online. All items are fully indexed, allowing full-text search within every volume. You can log in with your Cornell NetID to:
    • create collections (public or private)
    • download PDFs of any item available in full text
  • a trustworthy preservation repository providing long-term stewardship, redundant and robust backup, continuous monitoring, and persistent identifiers for all content

What is the HathiTrust Research Center (HTRC)?

  • a collaborative research center (jointly managed by Indiana University and the University of Illinois) dedicated to developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to large amounts of digital text. Let's unpack this:
    • "computational access" - algorithmic analysis, computational analysis, distant reading, text mining
    • "cyberinfrastructure" - the Data to Insight Center at Indiana University, supercomputers, a data warehouse with Solr indices
    • "large amounts" - "at scale"; the bigger the better
    • "cutting-edge" - experimental by nature; things can break, and things are unfinished or still in development
  • intended to serve and build community for scholars interested in text analysis; join the user-group mailing list (send an email to htrc-usergroup-l-subscribe@list.indiana.edu)

What specific services does the HTRC offer scholars?

Offerings are documented on the HTRC User Community Wiki, which links to services, user-support documentation, meeting notes, mailing-list addresses and sign-up information, and FAQs.

The "portal"

  • note the "SHARC" branding in the URL - eventually all services will be accessed through the portal
  • functionality depends on whether you are logged in; see the Portal & Workset Builder Tutorial for v3.0, "Sign up for an account, and sign in", for details

Workset builder

  • allows researchers to create a set of texts to analyze algorithmically; see the Portal & Workset Builder Tutorial for v3.0, "Create Workset", for a tutorial
  • linked from the portal, but also available directly at https://sharc.hathitrust.org/blacklight
  • it helps considerably to open the workset builder in a second window and operate the portal in the first
  • worksets can be private (open to your own use and management) or public (viewable by all logged-in HTRC users, with management restricted to the owner)

Algorithms

  • allows researchers to run ready-to-use algorithms against specific collections; see the Portal & Workset Builder Tutorial for v3.0, "Create Workset", for a tutorial
  • many algorithms are provided; others can be added at scholars' request as development time permits
  • a separate workshop is dedicated to these alone (ask and I can give you a tour)
  • a handout is available

Data Capsule

  • allows researchers to create a virtual machine environment, configure it with tools, and analyze texts; see the Portal & Workset Builder Tutorial for v3.0, "Use the HTRC Data Capsule", for details
  • designed to be a secure analytical environment that respects access restrictions to text while allowing for computational analysis
  • not yet tied to worksets
  • currently restricted to the "open-open" (non-restricted) corpus; the eventual objective is to allow access to the full HT corpus

Extracted Features datasets

  • page-level attributes (volume-level and page-level data) for 4M+ open-open volumes; rationale and features explained
  • the full datasets can be downloaded via rsync (watch out - they are BIG); see the sketch after this list for working with a downloaded file
  • details are available on leveraging the dataset: select data using a workset, then use the EF_Rsync_Script_Generator algorithm to download data for just that set
  • David Mimno's "word similarity tool" is built from the full Extracted Features dataset
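To give a concrete sense of working with a downloaded Extracted Features file, here is a minimal Python sketch that tallies token counts for a single volume. It is only an illustration under stated assumptions: an EF 1.x-style layout in which features → pages → body → tokenPosCount maps each token to part-of-speech tags and counts, files distributed as bzip2-compressed JSON, and a hypothetical file name. Check the current Extracted Features documentation, as field names differ between schema releases.

```python
# Minimal sketch (not HTRC tooling): tally token counts from one volume's
# Extracted Features file. Assumes an EF 1.x-style layout in which
# features -> pages -> body -> tokenPosCount maps token -> POS tag -> count,
# and that the file is bzip2-compressed JSON. The file name is hypothetical.
import bz2
import json
from collections import Counter

def volume_token_counts(path):
    with bz2.open(path, "rt", encoding="utf-8") as f:
        volume = json.load(f)

    totals = Counter()
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals

if __name__ == "__main__":
    counts = volume_token_counts("mdp.39015012345678.json.bz2")  # hypothetical file
    print(counts.most_common(20))
```

Aggregated counts like these are the sort of input from which tools such as the word similarity tool mentioned above are built.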

Bookworm

  • offers the same functionality as Google Ngrams: charting how often terms occur over time (see the illustrative sketch after this list)
  • the base data is currently the "open-open" corpus; work is under way on the legal aspects required to shift the base data to the entire HT corpus, regardless of viewability
  • there are plans, and an allocated grant, to develop a tie-in to worksets
  • see the wiki for a tutorial
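To make the Ngrams comparison concrete, the sketch below charts a term's relative frequency by year, which is the kind of view Bookworm produces. Bookworm itself is driven through its web interface; the CSV file and its columns (year, term, count, total_tokens) are hypothetical stand-ins for per-year counts, not a Bookworm export format.

```python
# Purely illustrative: plot a term's relative frequency over time, the kind
# of chart Bookworm (like Google Ngrams) provides. The CSV and its columns
# (year, term, count, total_tokens) are hypothetical, not a Bookworm format.
import csv
from collections import defaultdict

import matplotlib.pyplot as plt

def plot_term_frequency(csv_path, term):
    term_counts = defaultdict(int)
    year_totals = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            year = int(row["year"])
            year_totals[year] = int(row["total_tokens"])  # all tokens published that year
            if row["term"] == term:
                term_counts[year] += int(row["count"])

    years = sorted(year_totals)
    freqs = [term_counts[y] / year_totals[y] if year_totals[y] else 0.0 for y in years]
    plt.plot(years, freqs)
    plt.xlabel("year")
    plt.ylabel(f"relative frequency of {term!r}")
    plt.show()

plot_term_frequency("term_counts_by_year.csv", "telegraph")  # hypothetical inputs
```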

 
