An introduction to and examination of the HathiTrust Digital Library corpus and the broader context of the ethics of mass digitization.
This lesson plan introduces learners to the HathiTrust Digital Library corpus and HathiTrust Research Center tools for text analysis, while examining the underlying HathiTrust collection, including gaps in the corpus. It also touches on potential structural biases in common tools and algorithms for text analysis. The lesson focuses on two specific and foundational skills for using the HathiTrust corpus for text analysis while also situating the HathiTrust collection within the broader ethics of mass digitization. For the text analysis work, this lesson uses user-created collections of HathiTrust volumes (HTRC worksets) that were created to highlight and center the work of historically under-resourced and marginalized textual communities.
Janet Swatscheno; Digital Scholarship Librarian, HathiTrust; Associate Director for Outreach and Education, HathiTrust Research Center
Felix Oke; Master of Arts in Digital Humanities, Loyola University
Learners will be able to explain the limitations on data from HathiTrust and other mass digitization projects and some of the structural factors that lead to those limitations.
Learners will be able to create worksets and analyze them using algorithms in the HathiTrust Research Center portal.
Learners will be able to explain potential biases introduced by algorithmic analysis.
Researchers who are interested in beginner-level text analysis, upper-level undergraduate students, and graduate students.
This lesson plan is designed for a three-hour, in-person workshop, but could be adapted to fit into an introductory digital humanities or critical digital humanities course, ideally over two class sessions.
The presentation material for this lesson plan builds on years of workshop development by HathiTrust Research Center staff and includes much material that has been adapted and revised over time, especially to include more information about the structural biases inherent to working with collections of mass digitized materials.
The premade worksets which will be used for the hands-on portion of the lesson were created as part of a Mellon-funded project, SCWAReD, which is specifically aimed at centering the work of historically under-resourced and marginalized textual communities. The overall project also examined methods for determining gaps in the HathiTrust collection and potential ways to fill gaps in the collection. Each SCWAReD project itself is worth exploring but is beyond the scope of the lesson. Instead, we will make use of the worksets for analysis. The discussion section addresses what context would be necessary for interpreting the results.
Learners should have some familiarity with text data mining, digital libraries, and algorithms but do not need to have specific experience with coding.
This workshop is best conducted in a synchronous in-person or virtual format.
Instructors need a computer, internet access, and a projector or monitor.
Learners need a computer and internet access.
Learners who want to follow along in real time should set up HathiTrust and HathiTrust Research Center accounts ahead of time.
Anyone can create a HathiTrust account at https://hathitrust.org/.
Learners from HathiTrust Member Institutions can log in with their single sign-on account.
Learners who are not affiliated with a Member Institution can log in with a guest account.
Our tools are well documented, including step-by-step instructions and videos on our documentation wiki: https://htrc.atlassian.net/wiki/spaces/COM/overview?mode=global
We recommend that instructors browse the wiki before giving the workshop.
D’Ignazio, C., & Klein, L. (2020). 6. The Numbers Don’t Speak for Themselves. In Data Feminism. Retrieved from https://data-feminism.mitpress.mit.edu/pub/czq9dfs5
SCWAReD Project Reports (2023): https://htrc.github.io/scwared/
Presentation (Slides)
Introduction to HathiTrust and the HathiTrust Digital Library (20 min)
Introduction to the HathiTrust Research Center (20 min)
Ethical Considerations + Discussion (30 min)
Break (10 min)
Designing and Building Worksets (hands-on activity) (40 min) (Slides + Creating Worksets)
SCWAReD Worksets
Analysis (hands-on activity) + Discussion (60 min) (Slides + Text Analysis Algorithm Handout)
SCWAReD worksets as an example
Word Cloud
Other canned algorithms
Discussion
Note: we highly recommend instructors practice the hands-on activities ahead of time.
The lesson has two discussion sessions, which serve as moments for instructors to check in with learners and discuss the critical topics introduced in the slides. Because this is designed as a one-shot workshop, we hope to capture the learners’ comprehension of the lesson through the discussion as much as possible. Discussion questions are provided in the slides. We also typically offer participants an optional survey that asks them to describe the most interesting and/or valuable thing they learned from the workshop.
The preferred method for delivery of this material is in person or online in a workshop format. The material could easily be adapted to fit in a course, especially if spread out over two course sessions. In that case, we recommend breaking up the material so that the hands-on portions are also spread across both sessions. A more robust assessment would also need to be designed, potentially using the outputs from the Analysis section of the workshop.
This lesson is the result of years of iteration on introductory workshop material that is very specific to the HathiTrust Research Center and the tools available for analyzing the HathiTrust corpus. For that reason, some idiosyncratic terms are used to describe different processes and tools, and the slides help to put in context why certain terminology is used. Over time we have also added more content that addresses gaps in the collection, bias in digital libraries, and ethical considerations when using a corpus like the HathiTrust Digital Library. Some of those concerns are specific to this collection and described in detail in the SCWAReD project reports, but many of them are relevant to other digital collections. In using digital libraries for research, we are always constrained by what gets published, what gets collected by libraries, what is digitized, and how findable and accessible these items are when compiling data for analysis. We hope that this lesson provides some practical tutorials for using the HTRC tools, which are available to all students and researchers, while also asking learners to be critical of the data and methods available for text analysis.
The material for this lesson is adapted and remixed from many different slides and tutorials created by HTRC staff over the years and has been released under a Creative Commons Attribution 4.0 License. The full attributions are available on the slides provided in this lesson. With that in mind, I would like to acknowledge Ryan Dubnicek, Jennifer Christie, Eleanor Koehl, and Kaylen Dwyer, all of whom have contributed to the slide deck and tutorials over the years. In addition, I acknowledge the extremely thoughtful work of the SCWAReD teams, principal investigators (J. Stephen Downie, Maryemma Graham, John A. Walsh), and HTRC staff who created the worksets used in this lesson.