About this Event
Kaylen Dwyer, Digital Media Specialist, The Institute for Digital Research in the Humanities
Jamene Brooks-Kieffer, Associate Librarian / Data Services Librarian & Coordinator of Digital Scholarship, University of Kansas Libraries
Optical Character Recognition (OCR) software converts scanned images of typed, handwritten, or printed text into machine-readable and searchable files. This hands-on workshop will teach participants how to use OCR with printed text to create a text corpus for humanities data analysis. We will cover the basics of OCR, survey tools for processing, and introduce workflows for building a text-as-data corpus from printed text. Participants will learn how to use Tesseract-OCR, an open-source command line program.
For humanists without experience working in the command line, the first hour of the workshop (12-1pm) will be a primer on the command line interface: what it’s for, how to open it, how to navigate directories, and how to manipulate files.
The second hour (1-2pm) will dig into OCR and how to use Tesseract for extracting structured information from a page.
Participants with experience using the command line are welcome to join us at 1pm.
Register via email to firstname.lastname@example.org and state which parts you will be attending: the command line primer (12-1pm), the OCR session (1-2pm), or both.
Registration is limited to 20 seats. The registration deadline is Wednesday, April 27th.