CSE, Library partnership brings 19th century documents into the 21st
Imagine you are a history scholar. You need historical documents that mention the political structure of a 19th-century town that no longer exists. You spend months, maybe years, searching thousands of scanned, handwritten documents looking for even one sentence that could help paint the picture.
So you ask yourself, why can’t a computer do this for me? University Librarian Greg Colati posed this very question when he reached out to CSE Associate Professor in Residence Joe Johnson. The two pulled in a few more colleagues, put their heads together and got to work. The result: a project funded by LYRASIS titled “Unlocking the Past: Handwritten Text Recognition for 19th Century Manuscripts.”
While Optical Character Recognition (OCR) has been around for over 20 years, its value with handwritten historical documents is limited. Dr. Johnson’s research in neural networks suggests that a computer can be “trained” to recognize a set of handwritten documents from an author, and possibly from a small group of authors who may be influenced by each other in the same time period.
Using seven volumes of John Quincy Adams’ diary, Johnson and Colati are expanding the framework built by CSE undergrad Matt Mulhall in 2019. Matt spent that summer creating a training set of over 16,000 images of 22 different characters. Johnson and Colati hope to use this data to further develop computer recognition of words, then lines, and finally sentences. “Neural networks are all the rage right now,” says Dr. Johnson, “but you need a tremendous amount of annotated historical manuscript data which we just don’t have, hence the staged approach.”
The text recognition project will provide a foundation for developing large-scale, open-source software for handwriting recognition in historical documents. Its ultimate expected outcome is improved access to handwritten historical documents, which will have a major impact on research in the humanities.
LYRASIS is a non-profit organization whose mission is to support enduring access to the world’s shared academic, scientific and cultural heritage through leadership in open technologies, content services, digital solutions and collaboration with archives, libraries, museums and knowledge communities worldwide. The grant is part of its Catalyst Fund, which provides support for new ideas and innovative projects that explore, test, refine and collaborate on innovations with community-wide impact.
To read more about this project, please visit the UConn library blog: https://blogs.lib.uconn.edu/news/2020/07/#.XylXsUBFwuz