This lesson introduces the terms and concepts of relational data for network analysis, how to structure data for network visualization, and how to use network visualization to answer specific research questions. The lesson also provides a sample dataset, discusses what to keep in mind while cleaning the dataset, and demonstrates how to use Palladio to visualize the network.
3.3 Cleaning, organizing, and managing data: This lesson discusses modeling spreadsheet data for network analysis software, and demonstrates different methods for sorting, cleaning, and deduplicating data points without disintegrating the structural integrity of the dataset.
4.3 Critical data visualization: This lesson asks students to critically evaluate their own data visualizations to look for patterns, gaps, and biases.
Information Creation as a Process: Students participate in the data creation process and are responsible for the iterative process of shaping data so that it can be used for visual analysis.
Research as Inquiry: Students begin working with data with a research question in mind, and are asked to keep the question in mind as they work through cleaning their datasets.
Quantitative Literacy (AAC&U VALUE): Students become familiar with the different metrics that are used to calculate degree and centrality in relational data.
Upper-level undergraduate or graduate students who do not have prior hands-on experience with relational data. Some experience with using spreadsheets is helpful but not required.
This lesson was developed for an introductory digital humanities course, which is taught as a survey of different computational methods in Mississippi University for Women’s Digital Studies minor. The minor is interdisciplinary, and this course is taught by library faculty with digital scholarship experience.
In this lesson, students used data from an archival set of letters. The letters are part of a collection of local family papers that span the 20th century and are written to and from several family members and close friends. (Learn more about the collection.)
The network analysis lesson is flipped and implemented over two 90-minute meetings. Students review parts 1 and 2 of the lesson before the first class meeting, which is online/synchronous and establishes comprehension of network analysis terms and concepts and familiarity with spreadsheets. Next, students review parts 3 and 4 before the next meeting, which allows time for supervised independent work.
Hillary Richardson, Instructor of Digital Studies and Undergraduate Research Librarian
Russell Brandon, Data Services Librarian
Computers/laptops for each participant
Screen-sharing capabilities (e.g., web-conference software for remote instruction or a projector for in-person instruction)
A web browser for accessing Palladio (https://hdlab.stanford.edu/palladio/)
This lesson has four parts that take students through the process of structuring, cleaning, and modeling data for network analysis.
Instructors can focus on one part or select several, depending on time or course-level constraints.
This lesson uses a sample dataset of names from an archival letter collection, but any relational dataset can be substituted (e.g., characters from a literary text or scientific data). Palladio provides sample datasets, as do most of the tools mentioned below.
This lesson uses the Palladio Data Visualization Tool, but other tools can be adopted based on instructor preference. Some include:
Students will be able to:
Explain how network analyses help answer research questions with quantitative and visual components.
Identify parts of a visualized network using terms like “node” and “edge” to describe the location and associations between different items.
Analyze relational data using different modes of centrality to determine the influence of individual nodes within the network.
Clean data according to a specific research question.
Utilize Palladio to customize and interrogate a network of interconnected items or concepts.
Explain the existing relationships, gaps, and potential biases in the network.
Instructors can use the provided sample dataset or prepare one ahead of time using an existing dataset. The prepared dataset needs to be structured but not cleaned, which means arranging the data into tidy columns and rows but avoiding cleaning individual rows.
Provide students with a viewable link to the Google Sheet that contains your prepared data, and instruct them to create their own copies for the lesson.
Instructors should go through the steps to identify obstacles students might struggle with and be prepared to help them work through them.
Predefined dataset that is structured, but not cleaned, with permissions to copy via Google Sheets
Parts I-IV of the lesson are available as a PDF or to fork via https://github.com/hillaryAHR/DLFTeachToolkit3/. The elements provided include:
Images of the visual aids (slides, screenshots) in the lessons
PDF of the lesson
The lesson has various built-in check-points for discussion and student comprehension of terms, concepts, and processes to conduct formative assessment. Students share their final network analyses and a short, written description of the patterns, gaps, and potential biases they notice from working with the data, which allows for summative assessment.
Based on class observations and student feedback, the instructor made the following notes for implementing these lessons:
At the beginning of the lesson, determine students’ comfort level with spreadsheets. For example, some students might feel intimidated by numerical and structured data.
Allow more time than you think is necessary for the in-person session on data cleaning methods. If time runs out during the session, provide opportunities for students to meet with you one-on-one for assistance.
Some students need explicit prompting to explore the network visualization beyond what is apparent. For example, they need to be guided to look beyond the basic observation that one node is larger than the rest.
Do not underestimate students’ discomfort with spreadsheets and working with data. When cleaning data, some students were unsure how to reconcile differences in cells, remove duplicates in a separate spreadsheet, and sort cells without changing the observations in the rows. For this reason, it is important to demonstrate the cleaning methods in front of them, even if you go over these steps in detail during the asynchronous lesson.
A “cleaning method” actually turned out to be a way to see where cleaning needed to happen, not to eliminate data. In addition to the scheduled sessions, I held frequent office hours (and called them “cleaning parties”) to give students a chance to stop by with questions about cleaning. Students who were able to attend these had less trouble going through data.
Also, keep in mind that making the context of the data available will help with questions about who someone is and why their connections outnumber others’ (e.g., the person who donated the collection has the most relationships). The lesson's goal is not to have students make incredibly insightful interpretations of the data but to have them understand the process and become familiar with data through hands-on experience. The observations they make won’t necessarily be ground-breaking, and that’s OK!
Please note, at the initial time of this writing, Palladio had not yet developed the metrics feature, which allows you to calculate the centralities for different nodes (for uni-partite networks). This lesson could include a section on using the metrics feature for more emphasis on quantitative literacy skills.
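For instructors who want to add that quantitative emphasis, the sketch below computes normalized degree centrality (a node's number of connections divided by the largest number it could have) for a small undirected network. The names and edges are invented for illustration, and this is only one common centrality measure; it is not a description of how Palladio calculates its metrics.

```python
from collections import defaultdict

# Hypothetical cleaned, undirected edges; the names are invented.
edges = [
    ("Mary", "John"),
    ("Mary", "Anne"),
    ("Mary", "Ruth"),
    ("John", "Anne"),
]

def degree_centrality(pairs):
    """Return each node's degree divided by (n - 1), the most it could have."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

scores = degree_centrality(edges)
# Print nodes from most to least connected.
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 2))
```

A node that is connected to every other node scores 1.0, which gives students a number to set beside the "largest node" they see in the visualization.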
The instructor prepares a dataset with columns labeled “Source,” “Target,” and whatever other relational data is being analyzed (e.g., type, location, address, etc.). This data is structured but needs cleaning by the students, which they will do in the latter parts of the lesson. For example, you should structure the data types, but leave inconsistencies in the data, like spelling, spacing, etc., that will differentiate rows from each other.
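As a rough illustration of this structured-but-uncleaned layout, the sketch below reads a few rows with "Source" and "Target" columns into a list of edges. The names, the "Type" values, and the inconsistencies are all invented for illustration and do not come from the sample dataset.

```python
import csv
import io

# Hypothetical rows: arranged into columns, but deliberately left uncleaned
# (inconsistent spelling, a doubled space, a repeated row).
raw = """Source,Target,Type
Mary Smith,John Smith,letter
Mary  Smith,Jon Smith,letter
John Smith,Mary Smith,letter
Mary Smith,John Smith,letter
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Each row becomes one edge from the Source node to the Target node.
edges = [(row["Source"], row["Target"]) for row in rows]

# Every distinct spelling counts as its own node until the data is cleaned.
nodes = {name for edge in edges for name in edge}

print(len(edges))
print(sorted(nodes))
```

Only two people appear in these rows, but the uncleaned spellings produce four distinct nodes, which is the kind of inconsistency students are meant to find later.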
The sample datasets provided illustrate two stages of structuring data. The initial raw data (Name tags_Unstructured-20220110.csv) is a set of names that have been copied and pasted into a spreadsheet and haven’t been rearranged to answer a specific question. The second version of that data (Name tags_Structured-20220110.csv) has been tidied into columns as “sources” and “targets” but hasn’t been checked for inconsistencies.
We recommend that the instructor test the datasets in Palladio (or a similar visualization app) to uncover any issues that might exist in the dataset, and note what stands out, what is missing, or what subsequent questions might arise from a visualization. The lessons are made available to students for asynchronous review before the live lesson.
The instructor walks through the purpose, terms, and definitions of network analyses, and gives students a chance to demonstrate comprehension through discussion and “check-ins.” Check-ins are simple comprehension checks, like a quiz or short-answer questions. For example, “Which kind of edge indicates that an interaction between two nodes is reciprocal?” can be answered via multiple choice, aloud in a classroom, etc. They also allow students to discuss what might be examined in a network analysis (e.g., potential connections, gaps, etc.). Students are given the opportunity to ask questions and review parts III and IV before the next session.
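To make the idea of a reciprocal interaction concrete during a check-in, the sketch below finds pairs of nodes that are linked in both directions in a directed edge list. It assumes each edge is read as (sender, recipient); the names are invented for illustration.

```python
# Hypothetical directed edges (sender, recipient); the names are invented.
edges = [("Mary", "John"), ("John", "Mary"), ("Mary", "Anne")]

def reciprocal_pairs(pairs):
    """Return the unordered pairs of nodes that are linked in both directions."""
    directed = set(pairs)
    return {
        frozenset((a, b))
        for a, b in directed
        if a != b and (b, a) in directed
    }

mutual = reciprocal_pairs(edges)
print([sorted(pair) for pair in mutual])
```

Here Mary and John each appear as the other's source, so they form a reciprocal pair, while the edge from Mary to Anne runs in one direction only.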
The instructor demonstrates various cleaning methods (e.g., reconciling several spellings into a uniform spelling, de-duplicating rows, etc.) outlined in the lessons and allows students the time to work individually or in groups to mirror those methods on their own. The instructor also demonstrates using a graphing tool like Palladio to add the cleaned data into a network analysis, and allows students time to mirror that process.
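The lesson carries out these cleaning steps in a spreadsheet. For instructors who want to show the same logic in code, the sketch below reconciles variant spellings and then removes duplicate rows. The names and the reconciliation table are hypothetical, and title-casing is a simplification that would mishandle names such as "McDonald".

```python
import re

# Hypothetical uncleaned edge list (Source, Target); the names are invented.
edges = [
    ("Mary Smith", "John Smith"),
    ("mary  smith ", "John Smith"),
    ("Mary Smith", "Jon Smith"),
    ("John Smith", "Mary Smith"),
]

# Manual reconciliation table: variant spelling -> preferred spelling.
# In practice this is built by inspecting the sorted values by eye.
preferred = {"jon smith": "john smith"}

def clean_name(name):
    """Collapse whitespace, ignore case, then apply the reconciliation table."""
    key = re.sub(r"\s+", " ", name).strip().lower()
    key = preferred.get(key, key)
    return key.title()

def clean_edges(pairs):
    """Clean both ends of each edge and drop exact duplicate rows, keeping order."""
    seen = set()
    result = []
    for source, target in pairs:
        edge = (clean_name(source), clean_name(target))
        if edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result

cleaned = clean_edges(edges)
print(cleaned)
```

Each row is cleaned as a whole unit, so a source is never separated from its target. That is the same concern students face when sorting cells without changing the observations in the rows.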
Link to asynchronous network analysis lessons I-IV, via PDF or on GitHub: https://github.com/hillaryAHR/DLFTeachToolkit3/blob/main/network-analysis.md
Heather Froehlich’s presentation on Excel is an excellent spreadsheets tutorial that can be given to students as supplemental learning material.