Below are some sources of text that might provide the foundation for projects in computational text analysis. Please feel free to suggest other sources and improvements to this page in the comments section below, and I will integrate them into the page.
Digitized text
- HathiTrust - https://www.hathitrust.org/
- Download PDFs of full-view books (requires login)
- APIs available for bibliographic info - https://www.hathitrust.org/data
- HTRC - https://analytics.hathitrust.org/
- Good for working with in-copyright digitized books
- Worksets allow you to assemble a corpus that satisfies your research interests from volumes in HathiTrust
- Data Capsule - allows command-line and GUI analysis of worksets of your choice
- Structured data sets - https://analytics.hathitrust.org/datasets
- Internet Archive - https://archive.org/details/texts
- Early English Books Online - https://eebo.chadwyck.com/home
- Library-subscribed databases -
- e.g., ProQuest/EBSCO/Gale - library selectors can negotiate purchase of texts of interest.
- Reference staff can put you in touch with key personnel who can negotiate appropriate extracts for you.
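As a rough illustration of the HathiTrust bibliographic API mentioned above, the sketch below builds a lookup URL and fetches the JSON record for one identifier. It assumes the endpoint layout `https://catalog.hathitrust.org/api/volumes/<brief|full>/<id type>/<id>.json`; the function names, the list of identifier types, and the OCLC number in the usage note are illustrative only, so check the HathiTrust data documentation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL for the HathiTrust bibliographic API.
BIB_API_BASE = "https://catalog.hathitrust.org/api/volumes"

# Identifier types this sketch accepts (an assumption, not a complete list).
ID_TYPES = ("oclc", "lccn", "issn", "isbn")


def bib_api_url(id_type: str, identifier: str, detail: str = "brief") -> str:
    """Build the lookup URL for one identifier."""
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type!r}")
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BIB_API_BASE}/{detail}/{id_type}/{quote(identifier)}.json"


def fetch_bib_record(id_type: str, identifier: str, detail: str = "brief") -> dict:
    """Fetch and parse the JSON response for one identifier (needs network access)."""
    with urlopen(bib_api_url(id_type, identifier, detail), timeout=30) as response:
        return json.load(response)
```

For example, `fetch_bib_record("oclc", "424023")` would request the brief record for that OCLC number. The response structure is not covered here; inspect the returned dictionary to see which fields are available.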
Transcribed text
- Project Gutenberg - https://www.gutenberg.org/
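Plain-text files from Project Gutenberg usually wrap the book itself in license text, which is worth removing before analysis. The sketch below is one possible way to do that, assuming the file uses the common "*** START OF THE PROJECT GUTENBERG EBOOK ..." and "*** END OF THE PROJECT GUTENBERG EBOOK ..." marker lines. The marker wording has varied over time, so the patterns and the function name here are assumptions to adapt to your files.

```python
import re

# Assumed marker lines; older files may say "THIS" instead of "THE".
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```

Because a missing marker leaves the text unchanged on that side, it is worth spot-checking a few files to confirm the boilerplate was actually removed.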
Born digital text
Look for APIs (Application Programming Interfaces), which make it easier to download the text you want.
- US Government Publishing Office bulk data - https://www.gpo.gov/fdsys/bulkdata/
- Online news - look for APIs offered by the news provider
- Twitter - https://developer.twitter.com/en/docs/api-reference-index
- Facebook -
- Documentation - https://developers.facebook.com/docs/
- Tools - https://developers.facebook.com/tools-and-support/
- Email - download an account in mbox format (the most open format available)
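Once an account has been exported as mbox, Python's standard `mailbox` module can read it without extra dependencies. The following is a minimal sketch, assuming you only want the subject, sender, and plain-text body of each message; the function names and the returned dictionary layout are illustrative choices, and HTML-only messages will come back with an empty body.

```python
import mailbox
from email.message import Message


def message_text(msg: Message) -> str:
    """Join the text/plain parts of one message into a single string."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def read_mbox(path: str) -> list[dict]:
    """Collect subject, sender, and plain-text body for each message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        return [
            {"subject": msg["subject"], "from": msg["from"], "body": message_text(msg)}
            for msg in box
        ]
    finally:
        box.close()
```

Calling `read_mbox("account.mbox")` (a hypothetical file name) returns a list of dictionaries that can be passed directly to a tokenizer or written out as one text file per message.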