Data Design, Organization, & Analysis for Humanities Research
This lesson was taught over two class sessions of an 80-minute, bi-weekly senior seminar focused on the use of data in History. It could fairly easily be condensed into a one-shot workshop with a homework assignment, and a slightly pared-down version has been used in a number of different humanities and social science courses.
This section works as a lead-in to a data-focused assignment. In the past, it has been the scaffolding for a brief assignment where students use a tool to visualize and construct an argument about a given dataset, and it has also functioned as an introduction to data-focused work, with later workshops on a specific method or tool for data analysis following.
The primary purpose of this exercise is to teach students how to collect, derive, and organize data for computational analysis. Because of this, the lesson has a few central biases:
The entire lesson bends towards understanding how schemas, formats, and filetypes can serve particular research questions.
While other formats and methods are mentioned and the principles are broadly useful, the lesson encourages a dataset format that suits a few specific visualization and analysis tools: RawGraphs.io, Tableau, and Palladio.
One important note: data ethics and geospatial data are both covered in other class periods, so they were left out of this lesson. In other contexts, it may be desirable to include them here.
Upperclass undergraduates, with many History majors
This course is one of many 400-level senior History seminars. These courses must have a significant portion of the final grade focused on an independent research and writing project. Senior History majors must take three of these courses, but they often attract other majors from the college. This particular course is focused on the use and misuse of data in historical analysis, and students in the course are required to curate a dataset and use computational analysis for their final project.
Be aware of different definitions and understandings of data
Be aware of different formats and filetypes
Critique data design and organization
Design a data set for a specific research use
Use data to create a visualization
This works best if you can customize the examples and problem sets with datasets that are relevant to the course material.
For the data exercise, have a few research questions and links to relevant data resources prepared for the group work.
A projector or large screen for the instructor to use
Accompanying slides (see additional materials below)
Paper and pens/pencils for the audience
A sufficient number of computers: either a computer for everyone or one computer per two audience members. Computers will need an internet connection. Students can use either Google Sheets or spreadsheet software (like Excel or LibreOffice).
Quick overview of what we’ll do today, and frame the purposes of this workshop within the context of the class.
What do we mean by data?
Start by asking what comes to mind when they hear the word ‘data’ (I’ve learned the hard way that asking for someone to define data will only result in blank looks).
Introduce some basic definitions that we’ll use going forward.
There are many competing understandings of what is/isn’t data, what is/isn’t structured, and what is/isn’t machine readable, but this is intended to set a baseline understanding and frame the purpose of schemas/types.
Although this focuses on tabular data (which can come in many different file formats!), introduce the idea that the exact same information can be represented in many different ways.
The example of multiple subject headings shows a limitation of tabular data and an advantage for JSON/XML.
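As a rough illustration of that limitation, here is a minimal Python sketch. The record, its title, and its subject headings are invented placeholders and are not taken from the course slides. In a table, a cell holds one value, so several subject headings have to be packed into one delimited string (or spread across extra columns or rows), whereas JSON can keep them as a list.

```python
import csv
import io
import json

# Hypothetical record: one item with several subject headings.
record = {
    "title": "Example Title",
    "year": 1925,
    "subjects": ["Subject A", "Subject B", "Subject C"],
}

# Tabular form: the headings are packed into a single delimited cell
# that any analysis software has to split apart again later.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["title", "year", "subjects"])
writer.writerow([record["title"], record["year"], "; ".join(record["subjects"])])
csv_text = buffer.getvalue()

# JSON form: the headings remain a real list of separate values.
json_text = json.dumps(record, indent=2)

print(csv_text)
print(json_text)
```

The choice of `"; "` as the delimiter is arbitrary here; whatever delimiter a project picks needs to be documented so the cell can be split reliably.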
I posed a few hypothetical datasets that were relevant to the course and we had a conversation about what types of fields would be needed to ask or investigate different questions.
This can be pretty confusing to start off. I’ve found that it’s helpful to push through to the diagram showing variables/observations and then go backwards in the slides to revisit definitions.
It’s important to keep in mind that ‘Tidy Data’ principles are very helpful, but in the end you need to focus on your individual questions and the units of analysis you need.
Remind students that this is an ongoing process they will need to return to: if the research question or analysis software needs a different organization, the data will have to be reorganized again.
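One way to make the variables/observations idea concrete is a small reshaping sketch in Python. The table below is hypothetical (placeholder cities and counts, with a column name `listings` chosen only for illustration). It starts "wide", with one column per year, and is rearranged so that each row is a single observation: one city in one year.

```python
import csv
import io

# Hypothetical wide table: one column per year (values are placeholders).
wide_csv = """city,1940,1950
City A,3,5
City B,1,2
"""

rows = list(csv.DictReader(io.StringIO(wide_csv)))

# Reshape so that each row is one observation (a city in a given year)
# and each column is one variable (city, year, listings).
tidy = []
for row in rows:
    for column, value in row.items():
        if column == "city":
            continue  # "city" identifies the row; it is not a year column
        tidy.append({"city": row["city"], "year": int(column), "listings": int(value)})

for observation in tidy:
    print(observation)
```

Whether the long form is the right one still depends on the research question and the tool; the same loop could be reversed if a tool expects one column per year.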
Depending on class size, the problems can be done in groups, think-pair-share, or just open questions to the class.
These problems are tricky. Many students don’t see an issue with sample 2, but the issue in it is very common, and the results help to clarify it.
This should get students to focus on the way each observation is recorded with an eye towards visualization and analysis.
The instructor and students go to app.rawgraphs.io to use some of the sample data there to see what types of visualizations are available and play with the different data types and the impact they have.
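The effect of data types can also be shown away from any particular tool. The following Python sketch is a generic illustration, not a description of how RawGraphs handles types internally, and the values are placeholders. It shows that the same column behaves differently depending on whether it is treated as text, as a number, or as a date.

```python
from datetime import date

# Hypothetical column as exported from a spreadsheet: every value is text.
raw_counts = ["10", "9", "100"]

# Treated as text, values sort character by character.
as_text = sorted(raw_counts)

# Treated as numbers, values sort by magnitude.
as_numbers = sorted(int(value) for value in raw_counts)

print(as_text)     # ['10', '100', '9']
print(as_numbers)  # [9, 10, 100]

# Dates stored as ISO-formatted text can be parsed into real date values,
# which makes parts such as the year available for grouping.
raw_dates = ["1936-05-01", "1940-01-15"]
parsed_dates = [date.fromisoformat(value) for value in raw_dates]
years = [d.year for d in parsed_dates]

print(years)  # [1936, 1940]
```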
Data creation exercise
Students broke into groups of 2-3 and were given slips of paper with research questions and a source they could use to develop a dataset (see addendum).
After making decisions about the schema and data type, they assembled a small dataset and used app.rawgraphs.io to see if they would be able to answer their research questions.
Students shared their datasets with the instructor, and we put them on the projector and discussed their decisions.
This exercise is really open-ended and hypothetical, so the assessment was done as a group to spur conversation. The class as a whole discussed the strengths and weaknesses of the decisions each group made. Many said they would make different decisions next time.
When I taught this, I had a bit more material in the slides that was specific to the course, and we only had about 20 minutes for the students to compile their own datasets on the second day. This wasn’t enough for them to assemble and discuss, so they worked together in groups to assemble and visualize the dataset, and we discussed their datasets the following day. The extraneous material has been cut from the slides, but the time available for this exercise may still be somewhat short.
I originally taught this with the Tidy Data portion coming after the data types portion. There is probably a case to be made for either order, but I have switched them here because the first day was nearly all lecture and discussion and the second day was all activities; breaking that up a little should help.
Students said that the data exercise was really helpful — it seemed really easy until they had to develop their own schema and think about the details.
Additional Instructional Materials
Data exercise prompts
Dataset: Early African American films — look at Wikipedia’s “Race films category” for some data
Research Questions: What did the early African American filmmaker community look like? Were there small clusters that worked together all the time, or was it one large community?
Dataset: Early African American films — look at Wikipedia’s “Race films category” & other online resources for some data
Research Questions: Were the production companies black- or white-owned? Were the writers black or white? Do these factors change the storylines at all (think of some factors you may be able to track across movies)?
Dataset: Green Books - travel guides for African Americans published in 1936-1967
Research Questions: Were business locations primarily in large cities, or were there a lot of small towns as well? What are the demographics of the places where the businesses are located?
Dataset: Green Books — travel guides for African Americans published in 1936-1967
Research Questions: How did participation in Green Books change over time? Did participation in particular locations grow or shrink? Did the type of businesses participating change over time?
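For instructors who want a worked example to compare against student schemas, here is one possible shape for the last prompt, sketched in Python. Everything in it is an assumption: the column names (`edition_year`, `city`, `business_type`) are just one reasonable choice, and the rows are placeholders rather than real Green Book entries. The unit of observation is one business listing in one edition, which lets the same small table answer questions about change over time, by place, and by business type.

```python
from collections import Counter

# Hypothetical rows: one row per business listing per edition year.
listings = [
    {"edition_year": 1940, "city": "City A", "business_type": "hotel"},
    {"edition_year": 1940, "city": "City A", "business_type": "restaurant"},
    {"edition_year": 1950, "city": "City A", "business_type": "hotel"},
    {"edition_year": 1950, "city": "City B", "business_type": "hotel"},
    {"edition_year": 1950, "city": "City B", "business_type": "restaurant"},
]

# Participation over time: number of listings in each edition.
per_year = Counter(row["edition_year"] for row in listings)

# Change in business types: listings per (edition, type) pair.
per_year_and_type = Counter(
    (row["edition_year"], row["business_type"]) for row in listings
)

for year in sorted(per_year):
    print(year, per_year[year])
```

A group that chose a different unit, such as one row per city per year with a count column, would find the first question easy but would lose the ability to ask about business types, which is the kind of trade-off the group discussion is meant to surface.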