Introduction to Named Entity Recognition: Manual Labor or Magic?
One-shot, 50 to 90 minute session
Undergraduate or graduate students in a digitally inflected literary studies course or an introductory digital humanities course
as an early-in-the-term demonstration of the challenges of processing textual data for computational analysis and/or manipulation, and hence a demonstration both of the limits of humanities computing and why we are motivated to keep pushing those limits.
as a precursor to a literary mapping project.
as an introduction to Named Entity Recognition as a technique for processing textual data and several tools for doing so.
understand, at a basic level, the concept of textual mark-up and how it can be used to make implicit textual information explicit and machine-interpretable.
recognize the interpretive and intellectual labor required to accurately and meaningfully identify place names in textual data.
gain basic experience with manipulating text using web-based tools and a spreadsheet.
use one or more sources of textual data (e.g. HathiTrust, Project Gutenberg).
Instructor should identify example projects and texts relevant to the course.
The example project serves only to show one way of using place or spatial data.
Projects: Luchetta (2017) includes a number of examples of literary mapping projects.
The texts are a readily customizable part of this lesson. Anything that you can get both a page view of and a plain text copy of will work.
Instructor should go through the steps of the exercise for personal familiarity and troubleshooting. Interface details change often and may need to be updated.
Students should create a login to HathiTrust Analytics before class. This login is not the same as a login for the HathiTrust Digital Library. Your institution does not need to be a HathiTrust member for students to create a login or be able to use the Algorithms tool in HathiTrust Analytics.
Computers with access to the Internet (enough for each student or group).
Access to Excel for opening and sorting the .csv data file, OR time for teaching data cleaning using regular expressions and a text editor. Text editors vary, so I have not provided instructions for this process. See citations for resources that can help build these instructions.
If desired, slides and an example HathiTrust NER output file can be found in Additional Instructional Materials below.
Show an example of a literary mapping project.
Describe what the map is showing and why the creators chose this method of representing literary texts. Perhaps it was a specific question, or perhaps it was a recognition that place is significant for a particular author in a particular way. In either case, the map creators sought to be able to visualize the relationship between text and place in order to understand that relationship better.
Demo a selected example (see suggestions in Preparation above).
What do you need to make a map?
First, you need spatial data. Eventually you’ll need to have that data in the form that your particular mapping tool understands, but first, you need to know what places are relevant to your text.
Present text as data
How do you get spatial data? If desired: use slides in Additional Instructional Materials below for outlining transformation of text to data.
Pull up reading view of HathiTrust text: example.
Another way of asking this question is, how do you get from a book that a human can read to a book that computers can read?
First, you need the words in plain text format so the computer knows what they are. You can get this through transcription (slow but accurate) or optical character recognition (fast but messy).
Option for longer session: Let’s compare the same text processed both ways.
Have students look at the text of Moby Dick on Project Gutenberg and the text-only view of the HathiTrust copy.
HathiTrust Moby Dick: click on the “Go to the text-only view of this item” link on the left-hand side of the reading viewer. You will need to click through several pages using the “Next Page” link at the bottom to get to the main text.
Note: Actually, these two are both pretty good from a typo perspective. There are other texts in the HathiTrust that demonstrate OCR errors more glaringly.
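If students have some Python available, the comparison can also be made word by word rather than by eye. The following is a minimal sketch, not part of the original lesson: the clean line is a simplified version of the novel's opening, the noisy line is an invented OCR-style corruption, and the function names are hypothetical. It uses only the standard library's difflib module.

```python
import difflib

# Assumption: both sample lines are illustrative only. The second is an
# invented corruption, not output copied from HathiTrust or Project Gutenberg.
transcribed = "Call me Ishmael. Some years ago, never mind how long precisely."
ocr_output = "Call rne Ishmael. Some vears ago, never mind how 1ong precisely."


def differing_words(clean, noisy):
    """Return (clean, noisy) word pairs where the two versions disagree."""
    clean_words = clean.split()
    noisy_words = noisy.split()
    matcher = difflib.SequenceMatcher(a=clean_words, b=noisy_words)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append(
                (" ".join(clean_words[i1:i2]), " ".join(noisy_words[j1:j2]))
            )
    return pairs


def similarity(clean, noisy):
    """Character-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    return difflib.SequenceMatcher(a=clean, b=noisy).ratio()


for clean_word, noisy_word in differing_words(transcribed, ocr_output):
    print(f"{clean_word!r} was read as {noisy_word!r}")
print(f"similarity: {similarity(transcribed, ocr_output):.2f}")
```

With real texts, students would paste a matching passage from each source into the two variables. Differences in line breaks and punctuation between editions will also show up as mismatches, which is itself worth discussing.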
Second, you need to give the computer some way to know what those words mean. To do this, we use mark-up. We tag specific words as being specific types of things. But before we can mark things up, we need to identify what words should have specific tags. Like turning books into plain text, there are two ways to do this: by hand and by machine. We’re going to experiment doing it both ways.
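To make the idea of tagging concrete, here is a minimal sketch of what mark-up does once the places have been identified. It is an illustration under assumptions, not the lesson's method: the placeName tag is borrowed from TEI conventions, the sample sentence is a paraphrase rather than a quotation, and tag_places is a hypothetical helper. Note that the list of places still has to come from somewhere, by hand or by machine.

```python
import re


def tag_places(text, places):
    """Wrap each listed place name in a <placeName> element.

    This is a sketch: it does not escape characters such as & or <,
    and it only finds the exact spellings it is given.
    """
    if not places:
        return text
    # Try longer names first so a multi-word name is matched as a whole.
    ordered = sorted(places, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b"
    )
    return pattern.sub(r"<placeName>\1</placeName>", text)


# Assumption: a paraphrased sample sentence and a hand-made place list.
sample = "I started for Cape Horn and the Pacific, and duly arrived in New Bedford."
hand_list = ["Cape Horn", "Pacific", "New Bedford"]

print(tag_places(sample, hand_list))
```

The tagged output makes explicit what a human reader infers without effort: that these particular words name places.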
Activity (5-10 minutes): make a list of places in the first 10 pages of Moby Dick. Look at the text, identify places, and write them down (on paper or electronically, your choice).
Instructor notes: This instruction is intentionally ambiguous in that it does not address all of the methodological questions that will quickly come up. It does not specify whether page 1 is the title page or the first page of chapter 1. It does not specify what places “count”: for example, does the city of publication count? Do the places listed in the Extracts before chapter 1 count? Some students may realize that a simple list isn’t very helpful: shouldn’t we at least know what page the place word was on? Either students will quickly ask these questions or they will make decisions on their own that seem obvious to them but that can later be unpacked as methodological choices. The length of time for this activity is somewhat arbitrary, because it is unlikely they’ll finish. Part of the point is to highlight that this process is laborious.
Debrief: what did you notice about this process? How many pages did you get through? How many pages are there in Moby Dick? How many pages in Herman Melville’s complete works? You can begin to see why people want to automate this process. We’re going to give that a try next.
Using Named Entity Recognition on textual data in HathiTrust Analytics
Log in to HathiTrust Analytics.
Click on Algorithms in the top navigation bar.
Scroll down to Named Entity Recognition and click on Execute right below it.
Enter a job name that means something to you. This will not be displayed publicly.
Select a workset by clicking on the drop down menu and beginning to type Moby Dick. Select the workset titled “MobyDickHarpers@researcher749.”
Specify English as the predominant language.
Run algorithm. This will take a couple of minutes. You can refresh the screen to see progress. When it is finished, you will see a link to the results under Completed Jobs.
Note: to return to your completed jobs at any point, go to Algorithms and click on the Jobs button in the upper right-hand corner.
Do a quick review of results in the preview table: what do you notice?
Basically, errors. Also, what is MISC?
Download csv with “Click here to download entities.csv” button.
Open with Excel. Click on the upper left-hand corner of the spreadsheet to select all cells. Go to Data > Filter. Using the down arrow that has now appeared at the top of the column, filter for locations only.
Compare the resulting list to your hand-generated list. Examine one discrepancy and try to determine the reason why you and the algorithm disagreed.
What would you need to do to “clean” this data? Can you think of some strategies you could use to do so? Example: control-F for “Queequeg,” a character name that is obviously never a place.
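For classes that prefer a script to Excel, the filtering, cleaning, and comparison steps above can be sketched as follows. Everything specific here is an assumption to check against your own download: the column names (vol_id, page_seq, entity, type), the LOCATION label, and the sample rows, which are invented for illustration. The function names are hypothetical.

```python
import csv
import io

# Assumption: invented rows in an assumed column layout. Replace this with
# open("entities.csv", newline="", encoding="utf-8") for a real download.
SAMPLE_CSV = """vol_id,page_seq,entity,type
example.vol,00000015,Nantucket,LOCATION
example.vol,00000015,Queequeg,LOCATION
example.vol,00000016,Ishmael,PERSON
example.vol,00000017,New Bedford,LOCATION
example.vol,00000018,Nantucket,LOCATION
"""

# Names the class has decided are never places (the "control-F" strategy).
KNOWN_NON_PLACES = {"Queequeg"}


def load_locations(csv_file, non_places=KNOWN_NON_PLACES):
    """Keep rows tagged LOCATION, dropping names known not to be places."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if row["type"] == "LOCATION" and row["entity"] not in non_places
    ]


def compare_lists(hand_list, machine_rows):
    """Show where the hand-made list and the algorithm agree and disagree."""
    hand = set(hand_list)
    machine = {row["entity"] for row in machine_rows}
    return {
        "hand_only": sorted(hand - machine),
        "machine_only": sorted(machine - hand),
        "both": sorted(hand & machine),
    }


rows = load_locations(io.StringIO(SAMPLE_CSV))
print(compare_lists(["Nantucket", "Cape Horn"], rows))
```

Each entry under hand_only or machine_only is a discrepancy to examine. The exclusion list is a methodological choice that students should be able to justify, just like the choices they made while listing places by hand.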
Importing data into Google MyMaps
Frame segment: Now we are going to try to visualize the spatial data we have culled from the text.
Save a copy of your .csv file. Title it so you will be able to differentiate from the original download.
Copy the entity column filtered to location. Paste it into a new sheet. Delete the first sheet. Save.
Go to Google MyMaps
Click “Create a new map.”
Import your entities-only .csv file as a map layer. The entities column will be both your place and place name.
Pretty quickly, you should be looking at a map with a lot of points. Let’s take a closer look.
First off, why did this work? All you have is free-text place names, with no latitude or longitude. It worked because Google maintains a massive geocoding dataset and service, and it allows a small number of queries for free. It runs your place names through this process in the background. How well did it do? Click on a few points. What do you notice?
Instructor note: pre-select a couple of examples. In the sample data set, for example, the extracted place name “West Coast” is a point somewhere on the coast of Alaska. Is this accurate or misleading?
Secondly, what didn’t work? Click on the error report. What types of terms are causing errors?
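The copy-and-paste steps for preparing the entities-only file can also be scripted. The sketch below is one possible approach under assumptions: the entity and type column names and the LOCATION label are assumed, the sample rows are invented, and write_entities_only is a hypothetical helper. Unlike the paste step in the lesson, it collapses repeated names into one row and records how often each appeared.

```python
import csv
import io
from collections import Counter


def write_entities_only(rows, out_file, entity_type="LOCATION"):
    """Write one row per distinct entity of the given type, with a count."""
    counts = Counter(
        row["entity"].strip() for row in rows if row["type"] == entity_type
    )
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["entity", "mentions"])
    for name, mentions in sorted(counts.items()):
        writer.writerow([name, mentions])
    return counts


# Assumption: invented rows standing in for the downloaded entities.csv.
sample_rows = [
    {"entity": "Nantucket", "type": "LOCATION"},
    {"entity": "New Bedford", "type": "LOCATION"},
    {"entity": "Ishmael", "type": "PERSON"},
    {"entity": "Nantucket", "type": "LOCATION"},
]

buffer = io.StringIO()
write_entities_only(sample_rows, buffer)
print(buffer.getvalue())
```

To produce a file for import, pass an open file in place of the in-memory buffer. The mentions column is optional, but it gives students a starting point for asking which places matter most in the text and whether frequency is a good measure of that.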
Loop back to showing the literary map you began the class with. Ask: in terms of authority and reliability, how do you think the map we created with NER and Google MyMaps compares to this map? How could you find out?
Group brainstorm answers to two questions:
Now that you have done this exercise, what do you see as the values and limits of the Named Entity Recognition tool?
Can you think of a research project where this tool could be helpful or one where it could be misleading?
This lesson is customizable at the level of content and scalable to different sorts of digital literacy goals, depending on the tool you use. On perhaps the most technical side of the spectrum, you could have students set up a full Python environment and use the Natural Language Toolkit. On the least technical side, you could supply pre-created files for every step of the process.
There is a range of tools available to implement Named Entity Recognition on plain text data. Many of them are relatively approachable given enough time and computers you can download software onto. Each of them has its wrinkles for use in a one-shot setting. I selected HathiTrust Analytics for this lesson because it does not require any software download and produces data that is readily downloadable. It is also an invaluable resource for accessing the full text of public domain books and doing basic textual analysis of large corpora without extensive programming.
My preferred method of teaching this lesson would be to have students use the Stanford NER GUI, which is freely available. However, it can be hard to get it up and running, and unless you have a very long class session and very patient students, doing the download and set up during class is likely going to be frustrating. If you have access to a computer lab and labor to get the software set up ahead of time, it’s well-documented and doesn’t require a large amount of technical skill. The basic steps of this lesson could easily be adapted to that option, and Rachel Sagner Buurma’s NER lesson (cited below) is another fabulous resource.
If you want to use different content, some parts of this lesson will change. You will need to find or create a workset with that content within HathiTrust Analytics and direct students to that workset.
Buurma, R.S. People, places, time money: Listing Robinson Crusoe [lesson plan]. Retrieved from https://github.com/rbuurma/rise-2015/blob/master/Assignments/Rise_assignment_1.md
Luchetta, S. Exploring the literary map: An analytical review of online literary mapping projects. Geography Compass. 2017;e12303. https://doi.org/10.1111/gec3.12303
Stanford Natural Language Processing Group. Stanford named entity recognizer. https://nlp.stanford.edu/software/CRF-NER.shtml