
Thinking About Digital Archives: One Tool at a Time

This lesson plan introduces digital archives to an audience of students, faculty, and cultural heritage professionals through a locally-adaptable lesson kit that includes glossaries, links to sources, individual lessons, and a sample archive.

Published on Sep 19, 2019

Session Specifics

Can be used for workshop or for introductory lessons in a semester-long class

Instructional Partners

Archivists, Digital Humanities Librarians, Digital Humanities faculty, discipline-specific faculty


  • In the academic world: mid-track undergraduate students, senior capstone students, graduate students in methods courses, and faculty interested in Digital Collections.

  • Staff working with cultural heritage and archival collections in museums, libraries, and historical societies.

Curricular Context

Archival research, Digital Humanities methods using primary source materials, and digital collection building

Learning Outcomes

Participants/students will have been exposed to a variety of practices and methods applied to digital collection development and exploitation. The lessons can be used to introduce digital collection methods to a variety of audiences including: undergraduates, first year graduate students, and faculty in an academic setting, as well as staff at cultural heritage organizations, archivists, and librarians. Participants can build out a comprehensive Digital Humanities project using an archive or improve skills in any of the areas covered.


The lesson plan is conceived as a kit of resources. It includes glossaries, links to sources, and a sample archive, as well as individual lessons. The instructor should have a baseline comfort with the key concepts and best practices associated with digital collection work. In preparation for teaching these modules, they should familiarize themselves with the key concepts and the tools provided, and understand the wider application of the lessons included. Audience members require no prior knowledge or preparation, but it is assumed that audience members possess basic research skills.


All materials are included in the kit. The strategies can then be applied to local archives and digital collections or any other analog or digital collections. It is assumed that the instructor will have prior knowledge of the tools and concepts. The kit is intended to be a first introduction to these tools and concepts.

Session Outline

The kit below includes an introduction, basic definitions, project and tool examples, a sample archive, and exercises that can be performed on those materials using the tools and concepts.

Thinking About a Digital Archive: One Tool at a Time

A road trip is just a road trip until it’s a digital project

Created by

Greta Bahnemann and Jeannine Keefer

October 1, 2018

Table of Contents


  • Using the Kit

  • Thinking About Digital Collections: Purpose & Applications

Techniques and Tools

  1. Text Analysis

  2. Image Analysis

  3. Online Exhibits

  4. Mapping and Geospatial Projects

  5. Sharing with an Aggregate

  6. Creating Metadata

  7. Copyright & Rights Management

  8. Transcription & OCR

A Sample Digitized Archive

  • Resources 1-12: Digital photographs

  • Resources 13-14: Postcards

  • Resource 15: Journal

  • Resources 16-17: Brochures

  • Resource 18: Map

  • Metadata table for sample digitized archive

Metadata Mapping Tables

  • Table 1. Field Mapping to the Dublin Core Metadata Element Set

  • Table 2. Crosswalks: Dublin Core - MODS - VRA CORE 4.0

Exercises and Activities

  • Exercise #1: Text Analysis

  • Exercise #2: Image Analysis

  • Exercise #3: Image Analysis

  • Exercise #4: Online Exhibits

  • Exercise #5: Mapping and Geospatial Projects

  • Exercise #6: Sharing with an Aggregate

  • Exercise #7: Creating Metadata

  • Exercise #8: Creating Metadata

  • Exercise #9: Creating Metadata

  • Exercise #10: Copyright and Rights Management

  • Exercise #11: Transcription and OCR

Introduction: Using the Kit

This resource kit is designed to help users create, explore, and leverage digital collections. It provides a myriad of ideas on pedagogical and practical approaches to the various issues associated with digital collections, including metadata creation, geospatial metadata, copyright, image and text analysis, and more.

The kit offers concrete starting points for skill building, finding vetted and reliable examples, and viewing creative applications in multiple disciplines. Each section includes a definition, a brief overview of the resource, links to examples of the idea or concept, and links to vetted content that provides additional information. Note: The definition is derived from the resource’s website or from current scholarship relating to this resource. The examples show that particular component or part in the context of a real-world application. The list of additional resources and tools enables users to learn more about different areas of digital collection practice.

The projects, websites, and software mentioned in the “Examples” and “Additional Resources” sections are provided by the authors to serve as illustrative examples. They are in no way an endorsement of said products, projects, and services.

Thinking About Digital Collections: Purpose & Applications

Digital collections are created to preserve rare or fragile materials, to provide access to unique or valuable materials, and to increase the exposure of all collections. But these reasons are just the tip of the iceberg in terms of understanding the usefulness of digital collections. Digital collections can also serve as the content source and provide structure for classes and teaching, exhibits and storytelling, and promotional materials for your department or organization, and they can serve as the foundation of digital methods of inquiry in small- and large-scale digital humanities projects.

Techniques & Tools

The following resources and exercises will help users understand the concepts, considerations, issues, and problems involved in building and exploiting digital collections. They will familiarize users with ways in which archive and collection content can be interrogated, shared, and understood.

1. Text Analysis

Text Analysis is a broad term covering the many different processes by which language-based documents are examined to understand the various ways in which people make sense of who they are and how they fit into the larger world. Texts are parsed to extract machine-readable data and facts, which creates sets of structured data out of unstructured documents.

Text analysis is performed on text-based materials, which include books, films, television programs, magazines, advertisements, and more. Text analysis allows users to see both similarities in patterns and differences in how individuals interpret the world around them and the world’s cultures and subcultures. Text Analysis is sometimes referred to as Text Mining. Text Analysis tools help users both visualize and interpret the resulting analysis. In addition to ready-to-use tools, programming languages such as Python and R are often employed to perform text analysis.
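As a minimal sketch of how a short script can turn unstructured text into structured data, the following Python example counts word frequencies in a passage. The sample passage, the stop word list, and the function name `word_frequencies` are illustrative assumptions and are not part of the kit.

```python
import re
from collections import Counter

# Hypothetical stop words; a real project would use a fuller list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "we", "at"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Return a Counter of lowercase words, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


# Invented sample passage standing in for a transcribed journal entry.
sample = (
    "We left the lake at dawn and drove north. "
    "The road followed the lake for miles, and the lake was calm."
)

freqs = word_frequencies(sample)

# Print the three most frequent remaining words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

The resulting counts are one simple form of the structured data described above. The same counts could feed a chart or a comparison across several documents.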


Explores trends and frequency in use of biblical quotations in newspapers mined from Chronicling America: Historic American Newspapers (LoC).

Provides an introduction and explanation of the various tools and visualizations that are possible using the EEBO Text Creation Partnership using the XML/SGML encoded transcriptions of early printed books in Early English Books Online. Presents examples with tools and visualizations, such as the N-gram browser and corpus analysis of English print culture before 1700 in EEBO-TCP.

Text mining, analysis, and topic modeling of the diary (1785-1812) of Martha Ballard, who documented daily life as a midwife in Maine.

Text mining methods are being used to produce semantic metadata and indexes and, along with visualization, crowdsourcing, and social media, to provide enhanced access to Biodiversity Heritage Library documents.

The full run of the Richmond Daily Dispatch from November 1860 to April 1865 is text mined and presented with topic models based on prominent topics found within articles of this newspaper. Charts display the topic proportions by month in all articles containing that topic, and transcriptions of articles can be viewed with additional topics identified.

Additional Resources:

Chapter 1, “What is Textual Analysis” guides students away from finding the "correct" interpretation of a text and explains the various nuances and meanings found in text-based materials.

Text analysis Library Guide from Duke University including a concept overview, web scraping, and analysis methods and tools.

Voyant Tools is a web-based digital text reading and analysis environment. It is intended to facilitate reading and interpretive practices for digital humanists and the general public. Users can type or paste in text or URLs and obtain analyzed results.

Java based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction and other machine learning applications to text.


TAPoR 3 is a gateway for tools for sophisticated analysis and retrieval, along with representative texts for experimentation. The gateway has a curated list of tools for Web Scrapers, Data modeling and management, Voyant Tools, social media analysis, and more.

2. Image Analysis

Image Analysis is the extraction of any meaningful data from images. Images can take on a multitude of meanings and subjects. Digital images’ meaning is primarily extracted by means of visual analysis conducted by humans, though there are algorithms that allow a machine to discern patterns such as faces, objects, etc. in digital images.

The human eye, however, can discern layers of meaning and intent beyond the simple content of a digital item. Mechanical image analysis continues to evolve and improve with recent improvement in pattern recognition and facial recognition, but as of this writing it remains an inexact science. While the examples below come out of the discipline of art history, the methods are the same, no matter the content of the image. Cataloguers will likely need to conduct research on the image to accurately describe it. These descriptions can be as simple as “a baby sitting on a woman’s lap” or as nuanced as “Christ sitting on the lap of the Virgin Mary holding a pomegranate and giving the blessing.” Both are correct, but one provides more context and layers of possible meaning.


While this source focuses on the analysis of art, the methods employed to research and describe a work of art can be applied to any object or image. Dr. Glass also covers the object, visual (formal) analysis, style, cultural artifacts, subject, iconography, and function.

Picturing America was a project of the National Endowment for the Humanities (NEH) that brought masterpieces of American art into classrooms and libraries nationwide. The project concluded in 2009; but many of the educational materials created for the program are still available for use by students, teachers, and lifelong learners.

Beyond the Taj is a collection of visual and written materials assembled to support instruction and research on South Asian architectural expression in the joint perspectives of architecture and ethnography. The visual core of the collection consists of approximately 7000 photographs of works of architecture, pilgrimage locales and domestic life taken in India and Sri Lanka by Professors Robert D. Scotty MacDougall (1940-1987) and Bonnie G. MacDougall (1941-2017), an anthropologist and an architect.

Additional Resources:

The selected resources represent free Web sites used by Library of Congress staff to catalog material in a general documentary picture collection. Many of the sites feature visual images for easy comparison to what you are cataloging.

Website is maintained by Zachary M. Schrag, Professor of History at George Mason University. His image analysis advice is designed primarily for undergraduate and graduate students studying United States history.

Includes multiple entries on symbolism in material culture. There are pages devoted to Symbols in Christian Art, Ancient Greek and Roman Symbols, Symbols of Death, Masonic, and Norse Symbols, etc.

Article on face recognition accuracy of forensic examiners, super recognizers, and facial recognition algorithms.

3. Online Exhibits & Digital Storytelling

Online exhibits provide a space to display, interpret, and create a narrative for digital collections. Exhibits can serve any or all of the following goals: preservation, access, promotion/awareness, and pedagogy. The platform used to create an online exhibit might be as simple as a web page or as complex as a full digital arts or humanities project. Exhibit platforms range from expensive software for purchase to open source options freely available online. Keep in mind that free, open source options often require maintenance that may result in additional costs as well as demand staff time and expertise.

Online exhibits are just one form of digital storytelling. There are other ways to add context, meaning, and access points to digital content, including collection guides, finding aids, and primary source sets. These methods are often less complex than a narrative-based exhibit and are faster and easier to create, but they can still engage users with digital content in new and innovative ways.


Stories of national significance drawn from source materials in Libraries, Archives, and Museums across the United States.

Explore various topics in history, literature, and culture via sets of primary source materials developed by educators, librarians, and museum professionals complete with teaching guides for classroom use.

Modeled from the DPLA’s Primary Source Sets, these resources are designed to help students develop critical thinking skills by exploring a variety of topics related to Minnesota history and culture. Using both primary and secondary sources, these sets bring together different resources in new ways to help students better understand historic events and people in their context.

This site was created by students at the University of Oklahoma enrolled in the Presidential Dream Course, “Making Modern America: Discovering the Great Depression and New Deal” during Fall 2015. During the course, students became immersed in history, politics, and culture of the 1930s through lectures, readings, film screenings, and field trips. Note: uses Omeka.

Online exhibit at the National Archives that tells the behind-the-scenes story of how and why the meeting between Elvis Presley and President Richard Nixon in the Oval Office occurred - told through original photographs, letters, and memos.

Additional Resources

A leading open source web publishing platform for digital collections. First funded by the Institute of Museum and Library Services (IMLS) from 2007-2010, Omeka provides an option for museums, libraries, and archives wishing to publish collections and narrative exhibits to the web.

CollectiveAccess is software for describing all manner of things, and allows users to create catalogues that closely conform to user needs without custom programming.

CollectionSpace is a free, open-source collections management application that meets the needs of museums, historical societies, biological collections, and other collections-holding organizations. CollectionSpace is designed to be configurable to each organization’s needs, serving as a gateway to digital and physical assets across an institution.

Open Exhibits is an initiative that looks to transform the way in which museums and other informal learning institutions produce and share computer-based exhibits. Open Exhibits is both a collection of software and a growing community of practice.

Drupal is open source content management software used to make many websites and digital applications. Anyone can download, use, work on, and share it with others. It is built on the principles of collaboration, globalism, and innovation. It is distributed under the terms of the GNU General Public License (GPL).

Northwestern University Knight Lab is a community of designers, developers, students, and educators working on experiments designed to push journalism into new spaces. Easy to use storytelling tools such as Storymap, TimelineJS, SoundciteJS, and Juxtapose allow users to create narratives using digital media, text, maps, and timelines.

Created by the Alliance for Networking Visual Culture and housed at the University of Southern California, Scalar is an open source digital authoring and publishing platform created for long-form, born-digital scholarship online. The platform allows authors to utilize nested, recursive, and non-linear narrative formats. Authors can assemble media from a variety of sources including YouTube, Vimeo, and [scholarly] digital collections with minimal technical expertise required. Scalar features a built-in API which allows authors to share their Scalar content with other applications.

4. Mapping and Geospatial Projects

A number of different standards currently exist for creating and capturing geographic metadata. Adding Geographic Metadata or Geolocations to digital collections can aid in user discovery, access, and use of digital items. Adding geolocation metadata can make a digital collection ready for future geospatial based projects and activities. Mapping and Geospatial data can also be gleaned from primary texts, reports, and other scholarly resources. Some mapping and geospatial projects might serve only as research while others can be digital humanities projects.

The tools available for these projects vary greatly in terms of cost, complexity, and user friendliness. Those listed under “Additional Resources” range from tools for formatting and creating the metadata to those that employ that metadata to make a map. Some mapping programs focus on narrative, while others are powerful engines for processing large amounts of geospatial data into legible and useful maps.
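To illustrate how geolocation metadata can make a collection ready for mapping, the sketch below converts item records into GeoJSON, a format that many mapping tools (QGIS among them) can read. The records, field names, and coordinates are invented for illustration and do not come from the sample archive.

```python
import json

# Hypothetical item records; field names and coordinates are illustrative.
items = [
    {"id": "resource-01", "title": "Photograph of a lakeshore",
     "latitude": 47.0, "longitude": -91.5},
    {"id": "resource-18", "title": "Road map",
     "latitude": None, "longitude": None},
]


def to_feature(item):
    """Convert one record to a GeoJSON Feature, or None if it lacks coordinates."""
    if item["latitude"] is None or item["longitude"] is None:
        return None
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {
            "type": "Point",
            "coordinates": [item["longitude"], item["latitude"]],
        },
        "properties": {"id": item["id"], "title": item["title"]},
    }


# Keep only the items that could be placed on a map.
features = [f for f in map(to_feature, items) if f is not None]
collection = {"type": "FeatureCollection", "features": features}

print(json.dumps(collection, indent=2))
```

Items without coordinates are skipped rather than guessed at, which keeps the map honest about what has actually been geolocated.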


MDL has applied GeoNames to all items in their public collection “Minnesota Reflections.” Each item in Minnesota Reflections has been assigned a geolocation using GeoNames and includes a GeoNames URI.

The MEGA-Jordan project was launched as a collaboration of the Getty Conservation Institute (GCI) and the World Monuments Fund (WMF) with the DoA for the development and implementation of a geospatial information system to inventory and manage Jordan's numerous archaeological sites.

View the world’s 25 busiest airports using geospatial data to obtain a sense of their gargantuan scale and global significance.

Renewing Inequality visualizes the displacement of hundreds of thousands of families under the urban renewal initiatives of the 1950s and 1960s.


Additional Resources:

TGN is intended to aid cataloging, research, and discovery of art historical, archaeological, and other scholarly information. However, its unique thesaurus structure and emphasis on historical places make it useful for other disciplines in the broader Linked Open Data cloud. For GIS information, TGN may be linked to existing major, general-purpose, geographic databases.

The GeoNames geographical database is available for download free of charge under a Creative Commons Attribution license. It contains over 25 million geographical names and consists of over 11 million unique features, of which 4.8 million are populated places, along with 13 million alternate names. All features are categorized into one of 9 feature classes and further subcategorized into one of 645 feature codes.

In July of 2013, the Mountain West Digital Library Geospatial Discovery Task Force convened with the purpose of developing a standard format for recording geospatial metadata.

StoryMapJS is a free tool to help users tell stories on the web that highlight the locations of a series of events.

Mapping and analysis: location intelligence for everyone. ArcGIS Online is a mapping and analysis product. Users can use it on its own or expand their work with other ArcGIS products. Users can share and integrate their work with other ArcGIS users. In addition to interactive maps, you can create StoryMaps to add narrative to your work.

QGIS is a free and open source geographic information system that allows you to create, edit, visualize, analyze, and publish geospatial information on Windows, Mac, and Linux platforms.

  • CARTO Location Intelligence Software

CARTO is the platform to build powerful location intelligence apps with the best data streams available.

Neatline is a geo-temporal exhibit builder that allows you to create complex maps, image annotations, and narrative sequences from Omeka collections of archives and artifacts, and to connect your maps and narratives with timelines that are more-than-usually sensitive to ambiguity and nuance. You can import these documents (georeferenced historical maps, manuscripts, high-res photographs, etc.) from an existing collection, or create a new digital archive yourself. Neatline offers the user the ability to not just place dots on maps, but to conceive and create complex, meaningful visual representations of scholarship. Neatline only works with the installed version of Omeka. See the Neatline Quickstart Guide for information on how to get started.

5. Sharing with an Aggregated Collection

The digital collection landscape is ever changing and evolving. No longer must digital collection projects stand alone. In today’s world more and more institutions and organizations are choosing to participate in aggregations of collections. These aggregations can happen across departments and divisions at large institutions, across organizations to create regional or local collaborations, and can also include large-scale projects such as participating in a statewide aggregation such as the Minnesota Digital Library, the Digital Library of Georgia, the California Digital Library, and many others. Aggregated collections can also cross state, regional, or national boundaries. Examples are the Mountain West Digital Library, the Digital Public Library of America, and Europeana.


Resource page for DPLA Service and Content hubs as well as prospective members.

The Minnesota Digital Library (MDL) supports discovery and education through access to unique digital collections shared by cultural heritage organizations from across the state of Minnesota. MDL works with 188 contributing partners from cultural heritage organizations across Minnesota.

The Digital Library of Georgia is a GALILEO initiative based at the University of Georgia Libraries that collaborates with Georgia's Libraries, archives, museums, and other institutions of education and culture to provide access to key information resources on Georgia history, culture, and life.

Mountain West Digital Library provides free access to over 960,000 resources from universities, colleges, public libraries, museums, historical societies, and government agencies, counties, and municipalities in Utah, Nevada, Idaho, Montana, Hawaii, and other parts of the U.S. West.

Europeana works with thousands of European archives, libraries, and museums to share cultural heritage for general consumption and research. Users have access to over 50 million digitised items including books, music, artworks, and more.

Additional Resources:

DPLA’s Terms of Service including DPLA’s Privacy Policy, account creation, restrictions on use, and licensed content.

The DPLA Metadata Application Profile (MAP) is the basis for how metadata is structured and validated in DPLA, and guides how metadata is stored, serialized, and made available through our API in JSON-LD.

6. Creating Metadata

Metadata. No term is more tied to the creation of digital content, and no single term is more confusing, intimidating, or misunderstood. Metadata is essentially information, and it can refer to any kind of content: analog or digital. Metadata can describe single objects such as photographs and maps, and it can also describe objects made up of multiple parts such as books, pamphlets, archaeological sites, and the built environment. Other examples of objects made up of component parts include a set of historical postcards, a set of video-taped course lectures, or an oral history made up of an audio file, transcript, and image of the person being interviewed.

Metadata can be embedded into the file, or it can be in a system that is external to the file, such as a database. It is important to realize that metadata provides users with multiple access points into a digital collection. It identifies, describes, and disambiguates content. Metadata also enables content to be discovered and facilitates user searching and browsing.

Metadata can be further divided into three major categories:

1. Descriptive Metadata
2. Structural Metadata
3. Administrative Metadata

Administrative Metadata is further differentiated into three sub-parts:

  1. Technical Metadata

  2. Preservation Metadata

  3. Rights Metadata

Metadata Definitions:

  • Descriptive Metadata - Information that refers to the physical attributes and intellectual content of material. Descriptive metadata aids in the discovery and identification of such materials.

  • Structural Metadata - Information about the relationship between the intellectual or physical elements of a digital object. Structural Metadata denotes whether a resource is a simple/single page or compound/complex object.

  • Administrative Metadata - Data necessary to manage and use information resources. Typically external to the “informational content” of the resource.

  • Technical Metadata - Data that captures information about the process of digitizing an item and provides the requirements for using a digital item.

  • Preservation Metadata - Information about an object used to protect the digital object from harm, injury, deterioration, or destruction.

  • Rights Metadata - Information about how, when, and where an object can be used/consumed, shared, reused and repurposed and/or transformed.
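The categories above can be pictured as one nested record. The following is a minimal sketch, assuming a Python dictionary as the container. Every field name and value is invented for illustration, and the helper `missing_sections` is hypothetical rather than part of any standard.

```python
# Hypothetical record for one digitized postcard; every value is invented.
record = {
    "descriptive": {"title": "Postcard of a harbor", "date": "1935"},
    "structural": {"object_type": "compound",
                   "parts": ["front.tif", "back.tif"]},
    "administrative": {
        "technical": {"file_format": "image/tiff", "resolution_ppi": 600},
        "preservation": {"checksum_algorithm": "SHA-256"},
        "rights": {"statement": "Copyright status not evaluated"},
    },
}

# The three major categories, with the three sub-parts of administrative metadata.
REQUIRED = {
    "descriptive": [],
    "structural": [],
    "administrative": ["technical", "preservation", "rights"],
}


def missing_sections(rec):
    """List the categories and sub-parts that a record does not contain."""
    missing = []
    for category, sub_parts in REQUIRED.items():
        if category not in rec:
            missing.append(category)
            continue
        for sub_part in sub_parts:
            if sub_part not in rec[category]:
                missing.append(f"{category}.{sub_part}")
    return missing


print(missing_sections(record))
```

A check like this only confirms that each category is present. It says nothing about whether the values inside are accurate or complete.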

Metadata serves many purposes. While it can describe, identify, and disambiguate content, it can also relate information associated with the physical ownership of the originals as well as the rights holders. Metadata can also provide valuable information about the digital production of materials, including file size, equipment used for format conversion, and information on the master files and web access files.

Metadata can also provide more sophisticated forms of access and understanding. It can help users understand the meaning associated with the content. The meaning associated with content can be tied to specific creators, styles, and time periods.

  • Is it a painting of a mother and child OR is it a painting of the Virgin Mary and the infant Christ?

  • Is it a black and white image of a house OR is it a photograph of “Fallingwater,” an example of modern, organic architecture designed by Frank Lloyd Wright?

  • Is it a drawing of a table OR is it a drawing of the altar at St. Stephen’s Cathedral in Vienna, Austria?

A parade is just a parade until we are given the tools to understand more. A user’s understanding of such an image changes when the user knows it is a Temperance Parade or a 4th of July Parade, or a worker’s strike parade, etc.

This understanding of both content and context can inform how we describe the digital resource. Descriptive Metadata can provide clues as to the meaning, the context, and the relationships associated with the analog original. Without metadata, the original meaning, understanding of relationships, and intent are essentially lost.

Metadata is structured into a schema or order. A metadata schema is essentially a logical plan showing the relationships between the metadata elements. It establishes the rules for both the use and management of metadata (note: sometimes a metadata schema is referred to as a “metadata element set” or a “data structure standard”). A metadata schema might be homegrown, or it might conform to one of several standardized schemas available such as Dublin Core, MODS, and VRA Core. Some schemas, such as Dublin Core, are purposefully broad to accommodate all types of content, while others, such as VRA Core, have been developed with a specific content type in mind (i.e., visual content of art, architecture, urban planning, and cultural objects). These discipline-specific schemas are used to address specific elements needed by a discipline. Decisions about the structure of metadata for a digital collection should be determined by the digital collection, its content, and its users.

Creating a more detailed metadata schema allows metadata creators to produce detailed, precise, and unambiguous metadata records. In plain terms, creating distinct fields for creators, descriptions, dates, subjects, etc. in your metadata schema lets metadata creators know what specific information belongs in each field. Ideally, metadata should also conform to controlled vocabularies. Using controlled vocabularies provides standardized language for names, types, formats, and subjects. This use of standardized language vs. natural language enables content to be searched, browsed, and shared. Metadata schemas can be mapped to one another in order to share information between databases, collections, and institutions. See Metadata Mapping Tables: Table 1 and Table 2.
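A schema mapping of this kind can be sketched in a few lines of Python. The example below maps a homegrown record to Dublin Core elements and checks one field against a controlled vocabulary (the DCMI Type Vocabulary). The local field names, the `CROSSWALK` mapping, and the sample record are hypothetical and are not taken from the kit's mapping tables.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Terms of the DCMI Type Vocabulary, used here as a controlled vocabulary.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# Hypothetical mapping from homegrown field names to Dublin Core elements.
CROSSWALK = {
    "Item Title": "title",
    "Photographer": "creator",
    "Caption": "description",
    "Date Taken": "date",
    "Object Type": "type",
    "Usage Rights": "rights",
}


def to_dublin_core(local_record):
    """Map a local record to Dublin Core; also return the fields left unmapped."""
    dc, unmapped = {}, []
    for field, value in local_record.items():
        element = CROSSWALK.get(field)
        if element is None:
            unmapped.append(field)
        else:
            # Dublin Core elements are repeatable, so store values in lists.
            dc.setdefault(element, []).append(value)
    return dc, unmapped


# Invented sample record.
local = {
    "Item Title": "Main Street, looking north",
    "Photographer": "Unknown",
    "Object Type": "StillImage",
    "Camera": "Box camera",
}

dc_record, unmapped_fields = to_dublin_core(local)
invalid_types = [t for t in dc_record.get("type", []) if t not in DCMI_TYPES]

print(dc_record)
print("Unmapped fields:", unmapped_fields)
print("Type values outside the vocabulary:", invalid_types)
```

Reporting unmapped fields, rather than silently dropping them, makes it easier to see what information would be lost when sharing records with another system.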

Relational vs. Non-Relational Cataloging

The creators of digital collections can approach the naming, identifying, and the structuring of relationships in two different ways. The first approach is referred to as “Relational Cataloging.” This approach is predicated on the creation of separate work and image records that establish and connect relationships.

Separate records are created for different creators, place names, repositories, and individual works of art/architecture, etc. This type of cataloging creates a structure that is built by linking these different records to each other. A painting by Vincent van Gogh will utilize a creator record for van Gogh, a work record for the individual painting The Yellow House, and a repository record for the museum which holds the original (in this case the Van Gogh Museum in Amsterdam). This approach results in highly interconnected digital content which can easily show data patterns and relationships. Relational cataloging reuses data so that complex relationships and connections between works and multiple images of that work, as well as creators and repositories, can be leveraged and displayed. This type of cataloging lends itself to works of art, architecture, urban planning, decorative arts, archaeology, etc. A schema that provides for relational cataloging is VRA Core, along with the companion work Cataloging Cultural Objects, which provides thorough explanations on how to use the schema.
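The linking of separate records can be sketched with plain Python dictionaries. This is a simplified illustration of the idea, not the VRA Core schema itself. The identifiers (`c1`, `w1`, and so on), field names, and image views are hypothetical.

```python
# Separate record sets, each keyed by a hypothetical identifier.
creators = {"c1": {"name": "Vincent van Gogh"}}
repositories = {"r1": {"name": "Van Gogh Museum", "city": "Amsterdam"}}
works = {
    "w1": {"title": "The Yellow House",
           "creator_id": "c1", "repository_id": "r1"},
}
# Two image records point at the same work record, so work data is reused.
images = {
    "i1": {"work_id": "w1", "view": "overall"},
    "i2": {"work_id": "w1", "view": "detail"},
}


def describe_image(image_id):
    """Follow the links from an image to its work, creator, and repository."""
    image = images[image_id]
    work = works[image["work_id"]]
    creator = creators[work["creator_id"]]
    repository = repositories[work["repository_id"]]
    return (f"{work['title']} ({image['view']}) by {creator['name']}, "
            f"{repository['name']}, {repository['city']}")


def images_of_work(work_id):
    """List every image record linked to one work record."""
    return [iid for iid, img in images.items() if img["work_id"] == work_id]


print(describe_image("i2"))
print(images_of_work("w1"))
```

Because the creator and repository are stored once and referenced by identifier, correcting a name in one record updates every work and image that links to it.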

The second approach to cataloging digital content is less relationship-based and is more frequently employed by the creators of digital collections of cultural heritage materials. The difference can be partially accounted for by the differences in the collection types. Cultural Heritage digital collections are composed of a vast amount of unique and unpublished content, such as photographs, diaries, letters, and journals. This content is highly specific to times, places, and communities, with little connectedness to other things (for example: a photograph of Main Street, Richmond, Virginia has little connection to a photograph of Main Street, International Falls, Minnesota).

Finally, Dublin Core does not fully allow for or acknowledge complex and interconnected relationships in its metadata. While it does have a simple relationship field, it does not express the interconnected, hierarchical relationships found in disciplines such as archaeology and art history. One example of such a hierarchical relationship is a pair of bronze candlesticks displayed on an altar. That altar is located in a chapel, which is in turn one part of a large cathedral. Each of those components should have a metadata record that relates it to the other parts and shows the hierarchy. When using Dublin Core, digital content creators must build relationships in other ways, using controlled vocabularies, subject headings, etc. This web of connectedness is created via mindful and intentional metadata work.
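The candlestick example can be sketched as a set of records that each point to the record one level up. The record identifiers and the `part_of` field name are hypothetical; they stand in for whatever relationship mechanism a given schema or system provides.

```python
# Each record names the record it is part of; the cathedral is the top level.
records = {
    "cathedral": {"title": "Cathedral", "part_of": None},
    "chapel": {"title": "Chapel", "part_of": "cathedral"},
    "altar": {"title": "Altar", "part_of": "chapel"},
    "candlesticks": {"title": "Pair of bronze candlesticks", "part_of": "altar"},
}


def hierarchy(record_id):
    """Return the titles from the given record up to the top-level record."""
    chain = []
    while record_id is not None:
        record = records[record_id]
        chain.append(record["title"])
        record_id = record["part_of"]
    return chain


print(" > ".join(reversed(hierarchy("candlesticks"))))
```

A flat record with only subject headings can name the cathedral and the chapel, but it cannot be traversed in this way, which is the limitation the paragraph above describes.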


The DPLA Metadata Application Profile (MAP) is the basis for how metadata is structured and validated in DPLA, and guides how metadata is stored, serialized, and made available through our API in JSON-LD. The MAP was originally developed in 2012 and has been updated occasionally since. It is based on the Europeana Data Model (EDM), and integrates the experience and specific needs for aggregating the metadata of America’s cultural heritage institutions. The current version is 4.0.

Web page with numerous links to current information on resource description formats and digital library standards.

Metadata standards from the Canadian perspective including specific management and documentation guidelines for various collection types including: Humanities, Art, Visual Resources, Architecture, Ethnological and Archaeological Collections, and Sites and Monuments.

The MDL Metadata Guidelines provide organizations contributing collections to Minnesota Reflections with detailed information and assistance on completing the data entry process of their projects.

The Metropolitan Museum of Art’s online collection includes over 375,000 hi-res images of public-domain works of art which can be downloaded, shared, and remixed without restriction.

“Anthropology is all about what makes us human, our place in nature, our common concerns, and our differences. We explore these ideas through laboratory and collections-based research at the Museum and at field sites throughout the world. We build and maintain the Museum's world class collection, which now includes more than a million and a half objects, documenting the diversity and accomplishments of humankind. Through registration, conservation, collections management, and curation we preserve this collection and its documentation in order to connect communities, researchers, and the public to our shared global heritage.

Due to the size of our collection, please be aware that there may be errors and inconsistencies in the data presented here. We are continually updating and correcting our data. We welcome scholars, members of descendant communities, and others to contact us for confirmation or clarification on data you find here. If you would like to request a confirmation, a correction, or send an update to a record, please contact us.”

These Metadata Guidelines were written to better position UMass Amherst Libraries' Digital Collections for optimal indexing and display in an aggregated environment.

Additional Resources:

The Dublin Core Metadata Initiative, or "DCMI", is an open organization supporting innovation in metadata design and best practices across the metadata ecology. DCMI's activities include work on architecture and modeling, discussions and collaborative work in DCMI Communities and DCMI Task Groups, global conferences, meetings and workshops, and educational efforts to promote widespread acceptance of metadata standards and best practices.

The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.

The MODS schema for a bibliographic element set may be used for a variety of purposes, and particularly for library applications. The standard is maintained by the Network Development and MARC Standards Office of the Library of Congress with input from users.

Encoded Archival Description is an XML standard for encoding archival finding aids. EAD is maintained by the Technical Subcommittee for Encoded Archival Standards of the Society of American Archivists, in partnership with the Library of Congress.

Use Library of Congress Authorities to browse and view authority headings for Subject, Name, Title and Name/Title combinations, and download authority records in MARC format for use in a local library system. This service is offered free of charge.

The Getty vocabularies contain structured terminology for art, architecture, decorative arts, archival materials, visual surrogates, conservation, and bibliographic materials. Compliant with international standards, they provide authoritative information for catalogers, researchers, and data providers. The vocabularies grow through contributions. In the new linked, open environments, they provide a powerful conduit for research and discovery for digital art history.

Information about the rights that may be held in and over a digital item is referred to as copyright or rights management. In recent years, the digital collections community has moved away from collection-specific statements and towards more standardized statements. This move towards understanding and assigning standardized rights statements aims to establish rights as just another form of standardized metadata. Creating standardized data about the use and reuse of digital materials relates to larger issues of usability and Linked Open Data.

Copyright and rights management are governed and determined by the nation of origin and vary greatly. Copyright in the United States is determined on a case-by-case basis.


On February 7, 2017, The Metropolitan Museum of Art implemented a new policy known as Open Access, which makes images of artworks it believes to be in the public domain widely and freely available for unrestricted use, and at no cost, in accordance with the Creative Commons Zero (CC0) designation and the Terms and Conditions of this website.

Additional Resources:

The Code of Best Practices in Fair Use was created with and for the visual arts community. Copyright protects artworks of all kinds, audiovisual materials, photographs, and texts (among other things) against unauthorized use by others, but it is subject to a number of exceptions designed to assure space for future creativity. The Code describes common situations in which there is a consensus within the visual arts community about practices to which this copyright doctrine should apply and provides a practical and reliable way of applying it.

Provides a set of standardized rights statements that can be used to communicate the copyright and re-use status of digital objects to the public. Our rights statements are supported by major aggregation platforms such as the Digital Public Library of America and Europeana. The rights statements have been designed with both human users and machine users (such as search engines) in mind and make use of semantic web technology.

The Digital Image Rights Computator (DIRC) program is intended to assist the user in assessing the intellectual property status of a specific image documenting a work of art, a designed object, or a portion of the built environment. Understanding the presence or absence of rights in the various aspects of a given image will allow the user to make informed decisions regarding the intended educational uses of that image.

Creative Commons is a global nonprofit organization that enables sharing and reuse of creativity and knowledge through the provision of free legal tools. Our legal tools help those who want to encourage reuse of their works by offering them for use under generous, standardized terms; those who want to make creative uses of works; and those who want to benefit from this symbiosis. Our vision is to help others realize the full potential of the internet. CC has affiliates all over the world who help ensure our licenses work internationally and who raise awareness of our work.

8. Transcription & OCR

Converting documents, text, images, and sound files to digital and/or machine-readable formats is a prerequisite for many digital collection and digital humanities projects. Digitization is the process of capturing analog materials as digital images. Both transcription and OCR take digitized files one step further by creating new resources that are more accessible and understandable to a larger audience.

Transcription is the process of converting audio and/or video files into a text format. Transcription can also be done manually to recover and display information that is difficult to read in an original document, whether because of the script, the printing, or the quality of the scan. Another option is to use Optical Character Recognition (OCR) programs to “read” digitized images of book pages, newspapers, etc. and convert them to text-based documents that can be full-text searched easily and fairly accurately. Users can then copy, edit, or use this data for computational text analysis, convert it to a more readily accessible format, etc. The quality of OCR can vary tremendously and depends on a number of factors, including the resolution of the scanned document, the format and quality of the original print materials, and how well the OCR program can deal with diacritical marks. Often OCR-ed text must be corrected manually to be fully and accurately searchable. Each collection must balance imperfect OCR against fair and equal access to materials within the context of ADA compliance and how useful the text can be for those using the works for research.
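One way to put a number on OCR quality is to compare the raw OCR text with a manually corrected version. The sketch below uses Python's standard `difflib` module to estimate word-level accuracy. The function name and the sample sentence pair are assumptions made for illustration, and this is a rough measure rather than a standard evaluation method.

```python
from difflib import SequenceMatcher


def word_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Share of words in the corrected text that the OCR output got right."""
    ocr_words = ocr_text.split()
    corrected_words = corrected_text.split()
    if not corrected_words:
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(None, ocr_words, corrected_words, autojunk=False)
    # Sum the lengths of all runs of words that match in order.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(corrected_words)


# Hypothetical OCR output and its manual correction.
ocr = "We can't wait to see it in the dayligbt"
fixed = "We can't wait to see it in the daylight"
score = word_accuracy(ocr, fixed)
print(f"{score:.0%} of words recognized correctly")
```

A collection could run a check like this on a small hand-corrected sample to decide whether uncorrected OCR is good enough for full-text search or whether manual correction is warranted.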


The Collegian is the student newspaper at the University of Richmond. The archive has been digitized and OCR-ed, but users are invited to manually correct errors as they are searching the archive.

The names of millions of African Americans, slave and free, who lived, worked, worshiped, loved, and died in Virginia, are buried deep in the archival records and manuscript collections housed at the Library of Virginia. Untold Virginia: African American Narrative seeks to find these long silent voices. Whether contained in local court and state government records, private papers and business records, or newspapers and journals from the time, the untold narrative of a people is waiting to be discovered. Users are invited to transcribe the documents online. This is a crowdsourced transcription project and relies on the public to provide initial transcriptions.

Provides an overview for users of the standardized creation of transcripts for oral histories, as well as editing and processing guidelines.

The Columbia Center for Oral History Research (CCOHR) provides a range of education and research services (including the new 2018 Transcription Style Guide). CCOHR has resources to help users learn about oral history, as well as resources to help users design and construct their own projects. CCOHR can also put users in touch with others in their city or community who might be doing similar work. CCOHR can share, upon special request, the “Telling Lives” curriculum guide developed for middle school students.

Additional Resources:

ABBYY FineReader is a robust tool for OCR. ABBYY FineReader works well with digital camera images, unusually structured text (e.g. magazine layouts, newspaper columns), offers automated workflows for conversion, and supports up to 190 different languages.

Tesseract is an open source OCR engine. It can be used directly (via the command line) or with an API. Several third-party graphical user interfaces (GUI) are available for users who would like a drag-and-drop interface. Specialized packages for working with different languages and scripts, such as cuneiform and Vietnamese, are also available.

Google Docs allows users to perform OCR on uploaded images and PDFs. See this blog for a walkthrough and screenshots.

Scripto brings the power of MediaWiki to your collections. Designed to allow members of the public to transcribe a range of different kinds of files, Scripto will increase your content’s findability while building your user community through active engagement.

Software for transcribing documents and collaborating on transcriptions with others.

A Sample Digitized Archive

Resource #1

Title: “The Great Corn Palace!”
Creator: Image by sunnyamyk
Description: Exterior view of the 2013 Corn Palace in Mitchell, SD.

Type: Still Image
Format: Color photographs

Date Created: 2013-08-05
Software/Hardware: Apple iPhone 5
Rights: CC BY-NC-SA 2.0


GeoNames URI:

Subject Headings:

Corn Palace (Mitchell, S.D.);

Crop art;

Concert halls;

Moorish Revival;

Onion domes;
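The fields listed for Resource #1 can be serialized as a simple Dublin Core record. The sketch below uses Python's standard `xml.etree.ElementTree` module and the Dublin Core elements namespace. The wrapping `record` element and the choice of `dc:date` for "Date Created" are assumptions made for the example, and fields with no value in the listing (such as the GeoNames URI) are left out.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# (Dublin Core element, value) pairs taken from the Resource #1 listing.
resource_1 = [
    ("title", "The Great Corn Palace!"),
    ("creator", "sunnyamyk"),
    ("description", "Exterior view of the 2013 Corn Palace in Mitchell, SD."),
    ("type", "Still Image"),
    ("format", "Color photographs"),
    ("date", "2013-08-05"),
    ("rights", "CC BY-NC-SA 2.0"),
    ("subject", "Corn Palace (Mitchell, S.D.)"),
    ("subject", "Crop art"),
    ("subject", "Concert halls"),
    ("subject", "Moorish Revival"),
    ("subject", "Onion domes"),
]


def to_dublin_core(fields):
    """Build an XML element with one dc:* child per (element, value) pair."""
    root = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return root


record = to_dublin_core(resource_1)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Note that the five subject headings become five repeated `dc:subject` elements rather than one delimited string.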


Resource #2

Pink Elephant “Pinkie”, DeForest, WI

Image by: Eric Wilcox, CC BY-NC 2.0

November 17, 2009


Image title: “My Pink Elephant”

Resource #3

Witches Gulch, Wisconsin Dells, WI

Image by: jmehre, CC BY-NC-SA 2.0

May 23, 2008

Image title: “Wisconsin Dells: Witches Falls, immediately followed by Witches Bathtub in Witches Gulch”

Resource #4

Mount Rushmore, Mount Rushmore National Park, Keystone, SD

Image by: Diana Morales, CC BY-NC 2.0

October 21, 2015

Nikon D3100

Image title: “Mount Rushmore: Mount Rushmore National Park”

Color photographs

Resource #5

Jolly Green Giant, Blue Earth, MN

Image by: Mykl Roventine, CC BY-NC-SA 2.0

August 5, 2006

Sony DSC-W30

Image title: “World’s Largest Jolly Green Giant: 55-feet tall - ho ho ho! A creation of F.A.S.T Corp, Blue Earth, Minnesota”

Resource #6

1880 Cowboy Town, Buffalo Ridge, SD

Image by: Nick Sherman, CC BY-NC-SA 2.0

August 30, 2014

Apple iPhone 5s

Image title: “OLD WeST SALOON: 1880 Cowboy Town; Buffalo Ridge, South Dakota”

Resource #7

F.A.S.T. Corporation graveyard, Sparta, WI

Image by: sporst, CC BY 2.0

September 4, 2017

Google Pixel

Image title: “Fast Corporation graveyard of fiberglass molds, Sparta, WI”

Resource #8

Chapel in the Hills, Rapid City, SD

Image by: Jake DeGroot, CC BY-SA 3.0

July 5, 2008

Image title: “The front of the Chapel in The Hills in Rapid City, SD”

Resource #9

Garnet Ghost Town, Garnet, MT

Image by Kyle Freeman, CC BY-NC-SA 2.0

September 7, 2013

Image title: “Garnet Ghost Town - Garnet, MT: Abandoned mining town in the mountains”

Resource #10

Snoqualmie Falls, Snoqualmie, WA

Image by: Jeannine Keefer, CC BY-NC-SA 2.0

June 29, 2014

Canon PowerShot S90

Image title: “Snoqualmie Falls”

Resource #11

The Troll (Fremont Troll), Seattle, WA by Steve Badanes, Will Martin, Donna Walter, and Ross Whitehead

Image by: Jeannine Keefer, CC BY-NC-SA 2.0

June 14, 2018

iPhone SE

Resource #12

Ballard Locks, Seattle, WA

Image by: Jeannine Keefer, CC BY-NC-SA 2.0

June 14, 2018

iPhone SE

Resource #13 a

Title: Hiram Chittenden “Ballard Locks” postcard (obverse), Seattle, Washington

Creator: Impact Photo Graphics

Description: Multi-view postcard of the Hiram Chittenden Lock and Dam. The Hiram M. Chittenden Locks opened in 1917 and are nicknamed the Ballard Locks. The Lock and Dam links Puget Sound with the freshwater Ship Canal, which connects to Lake Union and Lake Washington. The multi-view postcard includes images of: a small map, dam construction, and a cross-section diagram demonstrating how locks work.

Type: Still Image

Format: Postcards

Materials: paper

Dimensions: 10.5 x 15.7 cm

Subject Headings:

Resource #13b

Ballard Locks Postcard: Reverse

Materials: paper

Measurements: 10.5 x 15.7 cm

Resource #14a

Edward Curtis Postcard: Obverse

Materials: coated paper

Measurements: 10.2 x 15.2 cm

Resource #14b

Edward Curtis Postcard: Reverse

Materials: coated paper

Measurements: 10.2 x 15.2 cm

Resource #15a

Road Trip Journal

Creator: Jeannine Keefer

Format: PDF

Description: PDF of original handwritten journal

Resource #15b

Road Trip Journal

Creator: Jeannine Keefer

Format: Transcript

Description: Transcript of handwritten journal

Road Trip Journal

Day 1

�?!Today we set off on our adventuer. We picked up the car and headed west out of Chicago along Route 90. Our destination is Seattle. I think we will take some time so we can see the weird and wonderful things along the way…

Eleven hours later and we managed to make it to Mitchell, SD - home of the Corn Palace! We can’t wait to see it in the daylight. Along the way we made sure to see an number of fiberglass roadside wonders.

Day 2

The Corn Palace - not just any corn palace - “The World’s ONLY Corn Palace”. This Moorish Revival beauty actually functions! While its decoration is made fresh each year, the structure sers as a venue for concerts, sports events, exhibits, and other community happenings. After visiting the Co� rn Palace we did some antiquing and had a meal at Fanny Harris’s Eating Establishment. � Tomorrow we head to Mount Rushmore with a stop at the Cowboy Town.

Day 3-4

Mount Rushmore. I keep expecting to see characters from North by Northwest. We must push on though (will update everyone w/ the pics we have taken�?!). Our next stop is Rapid City to find a hotel and then see the Capel in the Hills tomorrow.�?!

Day 5

Long day on the road as we ran across Montana from Rapid City to Missoula with a stop at the Garnet Ghost Twn. If we had come in April rather than June, we could have rented a cabin there! Oh well, a quick tour will have to suffice this time.

Day 6

Missoula to Snoqualmie, WA. Mountains, Mountains, Mountains. We are stopping in Snoqualmie for a few days before we hit Seattle. Aside from, or rather in addition to, the � Twin Peaks locions (the falls and Twede’s Cafe in North Bend for cherry pie), we plan to take advantage of the spa to revive after the long drive.

Day 8

Seattle! Our itinerary�?! : Fremont Troll, Ballard Locks, Seattle Art Museum… everything we can see before we need to return home.

Resource #16

Mount Rushmore brochure (1965)

Mount Rushmore Brochure

Resource #17

Ginkgo Petrified Forest State Park

Ginkgo Petrified Forest State Park brochure

Resource #18

Map of Interstate 90 in Washington State.

Title: Washington Oregon

Creator: AAA

Date: 2/17-5/18

Mapping to the Dublin Core Metadata Element Set

Table 1. Field Mapping to the Dublin Core Metadata Element Set

The purpose of this table is to show that field names are not always the same as the field mapping name, and that not every field maps to an established element set. It also shows the distinction between Descriptive, Administrative, and Technical Metadata.

Digital Collection With User-Friendly Field Names

Mapping to Dublin Core Metadata Element Set, Version 1.1

Descriptive Metadata









Date Created


Publishing Agency



Format – Extent



Physical Format

Format – Medium

Library of Congress Subject Headings




City or Township

Coverage – Spatial


Coverage – Spatial

State or Province

Coverage – Spatial


Coverage – Spatial

Geographic Feature

Coverage – Spatial

GeoNames URI

Coverage – Spatial



Administrative Metadata

Collection Name

Relations – Is Part Of

Contributing Institution


Contact Information


Rights Management


Local Identifier


MDL Identifier


Project Affiliation


Fiscal Sponsor


Technical Metadata

Date Digital

Date – Available

Item Digital Format


Master File Format


Master File Size


Master File Bit Depth


Master File Resolution


Master File Compression


Master File Width


Master File Height


Master File Hardware


Master File Software


Master File System


Master File Checksum
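The idea behind the field mapping can be sketched as a lookup from local, user-friendly field names to Dublin Core elements. The mapping below uses only pairs that appear in Table 1. The sample record values and the "Shelf Note" field are hypothetical, added to show a field with no mapping.

```python
# Local field name -> Dublin Core element, using pairs shown in Table 1.
FIELD_TO_DC = {
    "Physical Format": "Format – Medium",
    "City or Township": "Coverage – Spatial",
    "State or Province": "Coverage – Spatial",
    "Geographic Feature": "Coverage – Spatial",
    "GeoNames URI": "Coverage – Spatial",
    "Collection Name": "Relations – Is Part Of",
    "Date Digital": "Date – Available",
}


def map_record(local_record):
    """Group a record's values by Dublin Core element; list fields with no mapping."""
    mapped = {}
    unmapped = []
    for field, value in local_record.items():
        target = FIELD_TO_DC.get(field)
        if target is None:
            unmapped.append(field)
        else:
            # Several local fields may share one element, so values accumulate.
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


local_record = {
    "City or Township": "Mitchell",
    "State or Province": "South Dakota",
    "Physical Format": "Color photographs",
    "Shelf Note": "hypothetical local-only field",
}
mapped, unmapped = map_record(local_record)
print(mapped)
print(unmapped)
```

The city and the state end up under the same element, so a system that sees only the mapped record can no longer tell which value was which. That loss of granularity is one of the trade-offs of mapping several fields to one element.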




Metadata Crosswalks

Table 2. Crosswalks: Dublin Core - MODS - VRA Core 4.0

Metadata Crosswalks translate element values from one schema to another. Crosswalks facilitate interoperability between different metadata schemas, and they serve as the basis for metadata harvesting, as well as for exchanging records between different projects, different collections, and different institutions. Table 2 shows the field crosswalk/mapping between Dublin Core (DC), Metadata Object Description Schema (MODS), and the VRA Core 4.0.
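A crosswalk can be sketched as a lookup table applied to every element of a record. The example below covers only Dublin Core to MODS. The target paths follow the general shape of the Library of Congress Dublin Core to MODS mapping, but they are abbreviated here and should be checked against the published crosswalk before use. VRA Core is omitted, and the coverage value in the sample record is hypothetical.

```python
# Dublin Core element -> MODS element path (abbreviated, to be verified).
DC_TO_MODS = {
    "title": "titleInfo/title",
    "creator": "name/namePart",  # MODS also records the role, e.g. creator
    "subject": "subject/topic",
    "publisher": "originInfo/publisher",
    "identifier": "identifier",
    "rights": "accessCondition",
}


def crosswalk(dc_record, mapping):
    """Translate a DC record; return converted values and anything left behind."""
    converted, dropped = {}, {}
    for element, values in dc_record.items():
        target = mapping.get(element)
        if target is None:
            dropped[element] = values
        else:
            converted.setdefault(target, []).extend(values)
    return converted, dropped


# Title and creator come from Resource #18; the coverage value is hypothetical.
dc_record = {
    "title": ["Washington Oregon"],
    "creator": ["AAA"],
    "coverage": ["Washington (State)"],
}
converted, dropped = crosswalk(dc_record, DC_TO_MODS)
print(converted)
print(dropped)
```

Returning the unmapped elements separately makes it visible when a crosswalk loses information, which matters when records are exchanged between institutions.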

















Role = Creator

Role = Contributor or Commissioner

Role = Publisher





































































Exercises and Activities:

Text Analysis

Exercise #1: Using transcriptions of the textual materials in the Road Trip archive or copying the text from the brochures, perform a simple text analysis to find patterns in language and to create machine-readable data based on the written texts. Use any two of the tools listed in the “Additional Resources” section of “Text Analysis” to analyze the text. Compare results between platforms. What are the different ways the resulting data can be displayed? Add another text for comparison; for instance, compare the journal to the postcards.
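For participants who want to see what such tools do under the hood, a word-frequency count is the simplest form of text analysis. The sketch below is not one of the listed tools. The stopword list is a small, arbitrary assumption, and the sample sentence is adapted from the Day 8 journal entry.

```python
import re
from collections import Counter

# A deliberately short, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "we", "at", "is", "it", "for"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase words in the text, ignoring the given stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample = (
    "Seattle! Fremont Troll, Ballard Locks, Seattle Art Museum: "
    "everything we can see before we need to return home."
)
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Running the same function over the journal and over a brochure gives two frequency tables that can be compared side by side, which mirrors the comparison the exercise asks for.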

Image Analysis

Exercise #2: Image Analysis and Research activity using Resources #5 and #7. Read these images for larger cultural meanings. Are these identifiable characters? With what brands or companies are these images associated? What do company mascots tell us about the company and its brand?

Exercise #3: Look at the Edward Curtis postcard, Resource #14a. Do research to better understand the complexity and controversy surrounding Curtis and his representations of Native American culture. Write up a response to this image using both image analysis and historical research. Discuss how this information would inform the metadata you create for the image in the digital collection.

Online Exhibits

Exercise #4: Create an Exhibit using the images in the resources, including the map you created. Write a story or narrative using the Digitized Resources. If your story needs more information in one area, go to DPLA and find images to add to your exhibit (be sure to credit correctly, also look at rights management and be sure that the resources you selected are free to be shared and reused). Note that the online exhibit might make use of any number of online platforms for presentation, but having no platform in place does not preclude one from creating the exhibit via a storyboard including images and text that will be used to assemble the exhibit.

Consider the following factors:

  • Metadata

  • Rights and Re-use

  • Attribution page

  • Body of text that provides context and understanding (i.e. your exhibit script) as well as image captions

  • Teaching guide with discussion questions and classroom activities

Mapping and Geospatial Projects

Exercise #5: Using Digitized Resource #15a and #15b, create a map of the road trip using KnightLab StoryMap or another mapping tool. Consider how a geo-temporal narrative can provide not only an interpretation of the archive, but also highlight the collection and add context. Feel free to use the other images from the collection as illustrations of items mentioned in the diary. Also feel free to use content from DPLA and the internet, but be sure to note rights and fair use parameters.

Sharing with an Aggregate

Exercise #6: Create a set of criteria and requirements for submitting the digital content in this kit and its associated metadata into an aggregate at the state/regional level (such as Minnesota Digital Library or Digital Library of Georgia, etc.) and/or at the national level (such as Digital Public Library of America).

Consider the following factors:

  • Rights and Re-Use

  • Metadata

Creating Metadata

Exercise #7: Look at the Metadata fields in Digitized Resource #1. Use the same format and apply additional metadata fields to another resource in the collection.

Exercise #8: Ask users to refer to the section on Relational Cataloging and to catalog Digitized Resource #1 in both of the ways outlined: relational vs. non-relational cataloging. Assume first that the work is the Corn Palace and not the photograph of the Corn Palace. The first option will necessitate the creation of a work record for the Corn Palace, as well as a creator record, and an image record. The second approach will require the creation of a single record which should incorporate creator and work information into a single record. Next assume the photograph as the work with the Corn Palace as its subject.

Discuss the advantages and disadvantages of using the work/image model vs. the single record model. Describe the differences in metadata creation in terms of the level of granularity of data, the ability to reuse data, and the clarity of the user experience.

Exercise #9: Using both the Dublin Core and Generic Field schemas on the worksheets below, create metadata for Digitized Resource #1. Compare the specificity of the fields and the data created for each field. Discuss the value of using more fields which capture more granular information versus using fewer fields which capture less granular information. Discuss the display and functional implications of having multiple fields map to the same field name.

Talking Points:


Concept of repeating field mapping

Broad generic schema vs. more granular fields

Digitized Resource #1

DC Field Name

dc: Title

dc: Creator

dc: Contributor

dc: Description

dc: Subject

dc: Coverage – Spatial

Digitized Resource #1

Example of a Digital Collection With User-Friendly Field Names





Library of Congress Subject Headings


City or Township


State or Province


Exercise #10: A user wants to use all of the images in their publication “The History of the Old West in Modern Photographs.” Can they? Are there any items in the collection that warrant further investigation into rights ownership? How would you go about determining any restrictions? Next write up a rights statement for the collection using
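A first pass at this question can be sketched by treating the license labels as data. The labels below are copied from Resources #1 through #12 in the sample archive. The assumption that the publication is commercial, and the function names, are made for the sake of the example. Resources #13 through #18 list no license, so they fall outside this check and need separate investigation.

```python
# Resource number -> license label as given in the sample archive.
LICENSES = {
    1: "CC BY-NC-SA 2.0",
    2: "CC BY-NC 2.0",
    3: "CC BY-NC-SA 2.0",
    4: "CC BY-NC 2.0",
    5: "CC BY-NC-SA 2.0",
    6: "CC BY-NC-SA 2.0",
    7: "CC BY 2.0",
    8: "CC BY-SA 3.0",
    9: "CC BY-NC-SA 2.0",
    10: "CC BY-NC-SA 2.0",
    11: "CC BY-NC-SA 2.0",
    12: "CC BY-NC-SA 2.0",
}


def license_terms(label):
    """Split a label such as 'CC BY-NC-SA 2.0' into its terms: BY, NC, SA."""
    code = label.split()[1]
    return set(code.split("-"))


def needs_review_for_commercial_use(licenses):
    """Return the resource numbers whose license carries the NonCommercial term."""
    return sorted(
        number for number, label in licenses.items()
        if "NC" in license_terms(label)
    )


flagged = needs_review_for_commercial_use(LICENSES)
print(flagged)
```

This only sorts the resources by license term. It does not settle whether a particular use is permitted, which still depends on the license text, the nature of the publication, and any rights in the depicted subjects.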

Transcription & OCR

Exercise #11: Consulting the original document PDF, clean up the transcript of the journal and make a more readable copy. Remove diacritical marks and incorrect punctuation, and correct the spelling mistakes. Determine a policy for “correcting” misspellings from the original text. This policy should account for work parameters and include a discussion of an acceptable margin of error.
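Before correcting anything by hand, it can help to list where the transcript is visibly damaged. The sketch below flags lines that contain the Unicode replacement character (U+FFFD), optionally followed by a stray "?!". That pattern is an assumption based on the debris visible in the sample transcript, and the sample text is a shortened excerpt. The function deliberately reports problems rather than fixing them, so that every correction remains an editorial decision.

```python
import re

# U+FFFD, optionally followed by a stray "?!" (an assumed debris pattern).
SUSPECT = re.compile("\ufffd(?:\\?!)?")


def flag_lines(transcript):
    """Return (line number, number of hits, line text) for each damaged line."""
    flagged = []
    for number, line in enumerate(transcript.splitlines(), start=1):
        hits = SUSPECT.findall(line)
        if hits:
            flagged.append((number, len(hits), line))
    return flagged


sample = (
    "Day 2\n"
    "After visiting the Co\ufffd rn Palace we did some antiquing.\n"
    "Our itinerary\ufffd?! : Fremont Troll, Ballard Locks"
)
for number, count, line in flag_lines(sample):
    print(f"line {number}: {count} suspect sequence(s)")
```

Misspellings such as those in the journal cannot be detected this way and still need to be checked against the original document PDF.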


There are essentially no right or wrong answers to the kit’s exercises. Each student should make informed decisions regarding the exercises based on their knowledge and experience, as well as their capacity to adopt the tools and resources presented. Exercise outcomes will vary depending on the type of student and the collection’s and/or project’s needs and purposes.

In the real world, different collections deal with the same issues differently depending on a number of factors including the size and scope of the collection, the collection’s intended audience, and the purpose of the digital project. Given the variables that exist in digital collection creation, collection developers will find a myriad of solutions to the problems all digital collections face.


Instructors should lead the participants in discussion of the outcomes of their lesson work. In particular, discussions should focus on how the tools and concepts might be applied elsewhere.

Contact Information

Greta Bahnemann
Metadata Librarian, Minnesota Digital Library
Minitex, University of Minnesota
Wilson Library, Room 60
309 19th Ave. South
Minneapolis, MN 55455
telephone: 612.625.6497
e-mail: [email protected]

Dr. Jeannine Keefer
Visual Resources Librarian
Boatwright Memorial Library
261 Richmond Way
University of Richmond, VA 23173
telephone: 804.289.8275
e-mail: [email protected]