Internet Archive Book Images: Scholar digitizes huge collection of images from
Attention fans of history, historical images, book art, and archives in general:
Georgetown University scholar, Yahoo fellow, and developer Kalev Leetaru is working to create a searchable database of 12 million copyright-free images drawn from 600 million library book pages.
The recently launched Flickr-based Internet Archive Book Images currently hosts 2.6 million images that were automatically added with searchable tags. The images date from 1500 through 1922, so most are beyond the scope of copyright.
With its focus on images as individual items, Leetaru’s project differs from earlier large-scale library digitization efforts.
In a BBC article posted today, Leetaru notes:
For all these years all the libraries have been digitising their books, but they have been putting them up as PDFs or text searchable works.
They have been focusing on the books as a collection of words. This inverts that.
Stretching half a millennia, it’s amazing to see the total range of images and how the portrayals of things have changed over time. Most of the images that are in the books are not in any of the art galleries of the world – the original copies have long ago been lost.
The Internet Archive had used an optical character recognition (OCR) program to analyse each of its 600 million scanned pages in order to convert the image of each word into searchable text.
As part of the process, the software recognised which parts of a page were pictures in order to discard them.
Mr Leetaru’s code used this information to go back to the original scans, extract the regions the OCR program had ignored, and then save each one as a separate file in the jpeg picture format.
The software also copied the caption for each image and the text from the paragraphs immediately preceding and following it in the book.
Each jpeg and its associated text was then posted to a new Flickr page, allowing the public to hunt through the vast catalogue using the site’s search tool.
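For readers curious about what that process might look like in code, here is a minimal sketch. It is not Leetaru's code, which had not been shared at the time of writing. It assumes the OCR step has already produced a list of page blocks, each labelled as text or picture with a pixel bounding box; the names Block, ImageRecord, crop and extract_images are hypothetical, and the scan is represented as a plain grid of numbers rather than a real image file. Caption detection and the upload to Flickr are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One region of a scanned page, as an OCR layout step might report it (assumed shape)."""
    kind: str                        # "text" or "picture"
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
    text: str = ""                   # recognised words; empty for pictures


@dataclass
class ImageRecord:
    """An extracted picture plus the text found on either side of it."""
    pixels: List[List[int]]
    text_before: str
    text_after: str


def crop(scan: List[List[int]], box: Tuple[int, int, int, int]) -> List[List[int]]:
    """Cut the rectangle described by box out of the original scan."""
    left, top, right, bottom = box
    return [row[left:right] for row in scan[top:bottom]]


def extract_images(scan: List[List[int]], blocks: List[Block]) -> List[ImageRecord]:
    """Go back to the scan and save every region the OCR step marked as a picture."""
    records = []
    for i, block in enumerate(blocks):
        if block.kind != "picture":
            continue
        # Nearest text block before and after the picture, in reading order.
        before = next((b.text for b in reversed(blocks[:i]) if b.kind == "text"), "")
        after = next((b.text for b in blocks[i + 1:] if b.kind == "text"), "")
        records.append(ImageRecord(crop(scan, block.box), before, after))
    return records


# A toy 4-row by 6-column "scan" and a made-up page layout.
scan = [[row * 10 + col for col in range(6)] for row in range(4)]
blocks = [
    Block("text", (0, 0, 6, 1), "Paragraph before the picture."),
    Block("picture", (1, 1, 4, 3)),
    Block("text", (0, 3, 6, 4), "Paragraph after the picture."),
]
records = extract_images(scan, blocks)
print(len(records), records[0].text_before, records[0].text_after)
```

In a real pipeline the grid would be an actual page scan, the crop would be saved as a JPEG with an imaging library, and the record would carry the book's catalog details along with the surrounding text.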
Full records show text before and after an image, hyperlinked tags, access to the book in which the image appears and its catalog entry, as well as the ability to gather all the images from the book.
This database is a huge boon for students of history.
It offers ample fodder for multimedia production and for creating galleries for study. With its wealth of journalistic photographs, charts, portraits, headlines, maps, drawings and editorial cartoons, it also offers vast opportunities to teach visual and historical analysis. Images relating to race, religion and gender will also spark conversation.
Science teachers will appreciate its content and diagrams relating to invention and innovation, as well as its beautiful images of natural history.
Art teachers will appreciate the wealth of decorative images and historical design elements.
This is also a huge boon to libraries.
Leetaru plans to share his code and encourages other libraries to apply the process to their own books, constantly expanding this universe of images.
Thanks again to @infodocket for this lead!
For more public domain historical images, consider:
- American Memory Collection
- Archive.org (Internet Archive)
- Archive.org Newsreel Search
- NARA (National Archives and Records Administration)
- USA.gov Photos & Images
- National Archives Presidential Libraries and Museums
Filed under: digitization, flickr, images
About Joyce Valenza
Joyce is an Assistant Professor of Teaching at Rutgers University School of Information and Communication, a technology writer, speaker, blogger and learner. Follow her on Twitter: @joycevalenza