LIL at IIPC: The Story So Far

We’re halfway through the International Internet Preservation Consortium’s annual web archiving conference. Here are just a few notes from our time so far:

Auto-captioned photo of Jack, Genève, and Matt — thanks CaptionBot!

April 12

  • Andy Jackson kicks the conference off with “Have I accidentally committed international journalism?” — he has contributed to the open source software that was used to review the Panama Papers.
  • Andrea Goethals describes the desire for smaller modules in the web archive tool chain, one of her conclusions from Harvard Library’s Environmental Scan of Web Archiving. This was the first of many calls throughout the day for more nimble tools.
  • Stephen Abrams shares the California Digital Library’s success story with Archive-It. “Archive-It is good at what it does, no need for us to replicate that service.”
  • John Erik Halse encourages folks to contribute code and documentation. Don’t be intimidated and just dive in.
  • There seems to be consensus that Heritrix is a tool that everyone needs but no one is in charge of — that’s tough for contributors. A few calls for the Internet Archive to ride in and save the day.
  • We’re not naming names, but a number of organizations have had their IT departments, or IT contractors, seek to run virus scanners that would edit the contents of an archive after preservation. (Hint: it’s not easy to archive malware, but “just delete it” isn’t the answer.)
  • Some kind member of IIPC reminds us of the amazing Malware Museum hosted by the Internet Archive.
  • David Rosenthal notes that Iceland has been called the “Switzerland of bits”. After being in Reykjavik for only a few days, we sort of agree!
  • Jefferson Bailey of the Internet Archive echoes concerns about looming web entropy: web archiving is growing significantly, but storage for those archives remains concentrated.
  • Nicholas Taylor of the Stanford Digital Library is responsible for the most wonderful acronym of all time, WASAPI (“Web Archiving Systems API”).
  • The Memento Protocol remains the greatest thing since sliced bread. (Here we refer to the web discovery standard, not the Christopher Nolan movie.)
  • We chat with Michael Nelson about his projects at ODU, from the Mink browser plugin to the icanhazmemento Twitter bot.

April 13

  • Hjálmar Gíslason points out that 500 hours of video are uploaded to YouTube each minute. It would take 90,000 employees working full time to watch it all. Conclusion: Google needs to hire some people and get on this.
  • Hjálmar also mentions Tim Berners-Lee’s 5-Star Open Data standard. Nice goal to work toward for Free the Law!
  • Vint Cerf on Digital Vellum: the Catholic Church has lasted for an awfully long time, and breweries tend to stick around a long time. How could we design a digital archiving institution that could last that long?
  • (Perma’s suggestion: how about a TLD for URLs that never change? We were going to suggest .cool, because cool URLs don’t change. But that seems to be taken.)
  • Ilya Kreymer shows off the first webpage ever in the first browser ever, running in a simulated NeXT Computer, courtesy of
  • Dragan Espenschied says Rhizome views the web as “performative media” while showing Jan Robert Leegte’s [untitled]scrollbars piece through different browsers in Sometimes the OS is the artwork.
  • Matthew S. Weber and Ian Milligan have been running web archive hackathons to connect researchers to computer programmers. Researchers need this: “It would be dishonest to do a history of the 90s without using web archives.” Cue <marquee> tags here.
  • Brewster Kahle pitches the future of national digital collections, using as a model the fictional (but oh-so-cool) National Library of Atlantis. Shows off clever ways to browse a nation’s TV news, books, music, video games, and so much more.
  • Brewster encourages folks to recognize that there is no “The Web” anymore: collections will differ based on context and provenance of the curator or crawler. (What is archiving “The Web” if each of us has a different set of sites that are blocked, allowed, or custom-generated for us?)
  • Brewster voices the need for broad, high level visualizations in web archives. He highlights existing work and thinks we can push it further.
  • And oh by the way, he also shows off Wayback Explorer over at Archive Labs — graph major and minor changes in websites over time.
  • Bonus: We’re fortunate enough to grab some whale sushi (or vegan alternatives) with David Rosenthal, Ilya Kreymer, and Dragan Espenschied.

Looking forward to the next couple of days …

LIL at IIPC: Noticing Reykjavik

Matt, Jack, and Anastasia are in Reykjavik this week, along with Genève Campbell of the Berkman Center for Internet and Society, for the annual meeting of the International Internet Preservation Consortium. We’ll have lots of details from IIPC coming soon, but for this first post we wanted to share some of the things we’re noticing here in Reykjavik. 

[Genève] Nothing in Reykjavik seems to be empty space. There is always room for something different, new, creative, or odd to fill voids. This is the parking garage of the Harpa concert hall. Traditional fluorescent lamps are interspersed with similar ones in bright colors.


[Jack] I love how many ways there are to design something as simple as a bathroom. Here are some details I noticed in our guest house:

Clockwise from top left: shower set into floor; sweet TP stash as design element; soap on a spike and exposed hot/cold pipes; toilet tank built into wall.

[Matt] Walking around the city is colorful and delightful. Spotting an engaging piece of street art is a regular occurrence. A wonderful, regular occurrence.



[Anastasia] After returning from Iceland for the first time a year ago, I found myself missing something I don’t normally give much thought to: Icelandic money is some of the loveliest currency I have ever seen.


The banknotes are quite complex artistically, and yet every denomination abides by thoughtful design principles. Each side of a banknote shows either a culturally significant figure or a scene. The denomination is displayed prominently, and the typography is ornate but consistent. The colors, beautiful.


But what trumps the aesthetics is the banknotes’ dimensions. Icelandic paper money is sized according to amount: the 500Kr note is smaller than the 1000Kr note, which in turn is outsized by the 5000Kr note. This is incredibly important — it allows visually impaired people to move about more freely in the world.

In comparison, our money looks silly and our treasury department negligent, as it is impossible to differentiate the values by touch alone. And, confoundingly, there don’t seem to be movements to amend this either: in 2015 the department made “strides” by announcing it would start providing money readers, little machines that read value to people who filled out what I’m sure is not a fun amount of paperwork, instead of coming up with a simple design solution.


The coins are a different story. When I first arrived the clunky coins were a happy surprise — they’re delightfully weighty (maybe even a little too bulky for normally non-cash-carrying types), adorned with beautifully thoughtful designs. On one side of each of the coins (gold or silver), the denomination stands out in large type along with local sea creatures: a big Lumpfish, three small Capelin fish, a dolphin, a Shore crab.

On the reverse side the four great guardians of Iceland gaze intensely. They are the dragon (Dreki), the griffin (Gammur), the bull (Griðungur), and the giant (Bergrisi), each of which in turn protected Iceland from Danish invasion, according to the Saga of King Olaf Tryggvason. On the back of the 1 Krona, only the giant stands, commanding.

And that’s it. No superfluous information. No humans, either, only mythology and fish.

Returning home is a good thing, but sometimes it also means re-entering a world where money is just sad green rectangles (and oddly sized coins) full of earthly men.

How We’re Freeing the Law, Part 1: Books

Adi Kamdar is a 1L at Harvard Law School and our embedded reporter on the Free the Law project. In this first post, he tracks the progress of a casebook through our scanning process from start to finish.

Harvard Law Library is one of the few collections with nearly every law reporter—roughly 40,000 books in total. The Free the Law project’s goal is to put the court decisions inside these volumes online, so anyone can access the precedents that shape the American legal system. Right now, the project is about halfway through, and within the next couple years they’ll have completed this monumental task.

But how exactly does a book become a byte? And what happens to these physical texts after they’ve been digitized?

Harvard Depository

The project begins each week with a book order—a 600 book order, to be exact, for law reporters that chronicle U.S. legal history since the country’s inception.

The law reporters are held in a sprawling warehouse 30 miles away from the law school—the Harvard Depository. With over 200,000 square feet of storage space, the climate-controlled Depository’s mission is pure efficiency: each book—and there are over nine million—is sorted and stored by size, rather than by name or author, in order to maximize space.

But it turns out law reporters are the packing peanuts of the Harvard Depository. When the reporters were first sent over to the warehouse, the staff did not store them normally but kept the volumes around in the packaging room. Whenever staff filled a cardboard box with other books for storage, they would throw in a reporter or two if there was any extra space that needed to be filled. No one thought the print reporters would be that useful anymore, so making them easily available in bulk was a low priority. Plus, the library had decided to cancel print runs of reporters in 2010, saving valuable shelf space, especially when digital copies were easily available online.

Because of this tactic, law reporters are spread all throughout the Depository. Asking for, say, Michigan’s volumes isn’t as simple as pulling out a handful of boxes—it’s a hunt.

Langdell Library

Every Wednesday, the team receives the 600 volumes of case reporters. They line the hallway of the ground floor of Langdell, filling shelf after shelf. One by one, each book is examined before it can be taken apart. (Some books—for example, volumes with marginalia—are flagged for archival purposes.) Each volume is then catalogued and given a unique barcode so it can be tracked throughout the whole process.


The books are then taken to the Prep Room where, ironically, they’re repaired before they’re chopped up. Damaged pages are taped together, book bindings are cut off by hand, and the remaining sheets are taken over to a guillotine. Once aligned, the operator has to press two separate buttons underneath the cutting table at the same time to make sure her hands aren’t under the blade. The result? Cleanly cut pages.



Next, the bundle of pages is hauled over to the Scanning Room. Here, six employees work overlapping shifts to ensure that pages are being scanned every day, 14 hours a day. Roughly 200 documents per minute are fed through the machine, which has a camera on top and bottom to image both sides of the page.



Now that the books are chopped and scanned, what happens to the physical pages? After all, the purpose of this project is to digitize the law. Plus, according to circulation records, very few people were reading the old reporters anyway. Rebinding them and keeping them in the library would be a waste of space, time, and money. But just in case anyone questions the authenticity of the scans, Harvard decided it would be valuable to have the physical copies accessible. So the project decided to vacuum seal the pages. Once the pages are jogged together (using a state-of-the-art paper-jogging machine) and placed back inside their book jacket, the volumes are taken over to one last room—where they will be put inside a meat packing device. Yes, it turns out that the meat industry unwittingly stumbled across the best way to preserve books. The machine shrink wraps the pages, maintaining the integrity of the volume while handily adding an extra layer of protection from mold, humidity, and bugs.


The sealed volumes are then re-shelved, where they await being shipped off to…



Louisville, Kentucky

Because of the Harvard Law Library’s limited shelf capacity, the newly packaged pages will soon be loaded onto trucks and shipped down to Louisville.

Why Kentucky? Well, because of Underground Vaults & Storage, a company that has been storing all manner of things in Louisville’s old limestone mines. The sealed books will be stored there (where they will “fear no tornado, wildfire, flood or other natural disaster”) until the rare instance that they need to be recalled.

And that’s the story of these legal volumes—from one massive depository to another, by way of a guillotine, a scanner, and a meat packer. In our next post, we’ll explore what happens after they become digital images, and how Free the Law is building the largest free database of legal opinions in the world.

Free the Law Wintersession Sprint

USB stick labeled 'all of the caselaw'

TL;DR: We are running a two week data mining sprint from January 4-15, 2016, open to current Harvard students, based on early access to a brand new data set of American caselaw. To apply, send a resume and brief statement of interest to jcushman at law dot harvard dot edu.



We recently announced Free the Law, our project to scan every legal decision ever published in the United States. We’re generating the first consistent, comprehensive, and open database of American law, from the colonial era right up to 2015. You can read the New York Times coverage of the project here.

By the end of this project we’ll have millions of cases in the dataset — no one knows exactly how many. We’re scanning and processing tens of thousands of pages a day, and will soon have entire states completed.

Now it’s time to start exploring what to do with all that data. What new questions can we ask with millions of cases?

The answers cross every discipline at Harvard:

  • Can a spam filter be retrained to guess which torts cases make the most interesting stories?
  • How much money are we willing to fight over — and does the answer offer an alternate inflation index?
  • How has the use of Latin in the law changed over time — are judges writing more or less like regular people?
  • How have defendants’ choice of murder weapon changed? The gender balance of litigants? The reliance on scientific evidence?
  • Can we trace a family’s history through the cases they were involved in?

Caselaw is the historical record of applied moral philosophy under the law. Unlocking its secrets will have an incredible impact on scholarship of all kinds.

The Challenge

Hence our challenge: pick a question you think caselaw might help you answer, perhaps drawn from one of your classes. Build a tool to help answer it – whether that means loading up your favorite ML library, configuring an off-the-shelf statistical tool, or writing code from scratch. In a two-week sprint, do your best to answer the question, and to generalize your tool to help other researchers answer similar questions. We’ll help share the discoveries you make and the tools you build.
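To make "writing code from scratch" concrete, here is a rough sketch of how one might start on the Latin question above. Everything in it is an assumption for illustration: the short phrase list, the `latin_rate_by_decade` function name, and the idea that cases arrive as (year, text) pairs. None of it comes from the Free the Law data set or tooling.

```python
import re
from collections import defaultdict

# Hypothetical and far-from-complete list of Latin legal phrases (an assumption).
LATIN_PHRASES = [
    "habeas corpus",
    "res judicata",
    "prima facie",
    "mens rea",
    "stare decisis",
]

# One case-insensitive pattern that matches any listed phrase on word boundaries.
LATIN_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in LATIN_PHRASES) + r")\b",
    re.IGNORECASE,
)
# A crude word tokenizer: runs of letters and apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def latin_rate_by_decade(cases):
    """Return {decade: Latin phrases per 1,000 words} for (year, text) pairs."""
    phrase_counts = defaultdict(int)
    word_counts = defaultdict(int)
    for year, text in cases:
        decade = (year // 10) * 10
        phrase_counts[decade] += len(LATIN_PATTERN.findall(text))
        word_counts[decade] += len(WORD_PATTERN.findall(text))
    # Skip decades with no words to avoid dividing by zero.
    return {
        decade: 1000 * phrase_counts[decade] / word_counts[decade]
        for decade in sorted(word_counts)
        if word_counts[decade] > 0
    }
```

A fixed phrase list is the simplest possible choice and will miss plenty of Latin. A real project would probably want a better lexicon and some handling of OCR errors, but normalizing by word count per decade is a reasonable first step toward comparing eras of very different sizes.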

The Data

The data set we will share with participants will include a single state’s complete published caselaw. The data includes: (1) TIFF and JPEG2000 images for each scanned page; (2) ALTO XML files for each scanned page; and (3) structured XML files for each case.
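For a sense of what working with the page-level files might look like, here is a minimal sketch that pulls plain text out of one ALTO XML page using only the Python standard library. ALTO stores each recognized word in a `String` element's `CONTENT` attribute, grouped into `TextLine` elements. The exact ALTO version and layout of the files in this data set are not specified here, so the sketch ignores namespaces entirely; the `alto_page_text` function and the `SAMPLE` document are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO-style page, used only for illustration (an assumption,
# not an excerpt from the actual data set).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Smith"/><SP/><String CONTENT="v."/><SP/><String CONTENT="Jones"/>
    </TextLine>
    <TextLine><String CONTENT="Opinion"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_page_text(alto_xml):
    """Return the text of one ALTO page, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each direct String child carries one recognized word in CONTENT.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Calling `alto_page_text(SAMPLE)` returns two lines of text, "Smith v. Jones" and "Opinion". Matching on local tag names keeps the sketch independent of whichever ALTO namespace version the files declare.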


December 2015: Application period.

January 4, 2016: Delivery of data set to participants.

January 4, 6, 8, 11, 13, 15: The group will check in three times a week, either in person or remotely, to share notes, progress updates, and requests for help.

Week of January 18: demo day (date TBD).

To Apply

Send your resume and brief statement of interest (such as a general idea of what sort of project you would like to work on) to jcushman at law dot harvard dot edu. If you would like to work with others, feel free to apply as a group.

Link roundup November 30, 2015

This is the good stuff.

The Irony of Writing About Digital Preservation

The Original Mobile App Was Made of Paper | Motherboard

The most Geo-tagged Place on Earth

The Illustrated Interview: Richard Branson

Why is so much of design school a waste of time?

Link roundup November 16, 2015

This is the good stuff.


Rebellious Group Splices Fruit-Bearing Branches Onto Urban Trees | Mental Floss

Idea Sex: How New Yorker Cartoonists Generate 500 Ideas a Week – 99u

Google Cardboard’s New York Times Experiment Just Hooked a Generation on VR

Link roundup November 2, 2015

This is the good stuff.


French Vending Machines Dispense Short Stories Instead Of Snacks | Mental Floss

Swiss Style Color Picker

Chicago Ideas Week

Link roundup October 21, 2015

A little late to the party, but happy to be using the taco emoji!

The Internet’s Dark Ages

An Error Leads to a New Way to Draw, and Erase, Computing Circuits

Searching the world for original Pizza Hut buildings

Will digital books ever replace print?

Hiring! Devops energy wanted.

The Harvard Library Innovation Lab is looking for a devops engineer to help us build tools to explore the open internet and see deep into the future of libraries.

Our projects range in scope from fast-moving prototypes to long-term innovations. The best way to get a feel for what we do is by looking at some of our current efforts.


Perma.cc, a web archiving service that is powered by libraries


H2O, a platform for creating, sharing and adapting open course materials


Awesome Box, an alternate returns box used by hundreds of libraries


What you’ll do

Own the production infrastructure that ensures Lab applications are responding quickly to people and bots on the internet

Write code that will monitor systems and develop logic that will automate common deployment and maintenance tasks

Act as a core member of our fun and dynamic team by helping us shape ideas and efforts in libraries, technology, and law. We’re freewheelin’. We fully encourage the pursuit of interests and opportunities


We’re hiring a person and not a skillset, but our current stack of keywords might be helpful

Heroku, AWS, S3, Python, Django, Fabric, git and GitHub, Ruby, Rails, MySQL, PostgreSQL, Apache, NGINX, Elasticsearch, Redis, UNIX, Bash, Rollbar, Splunk


Find details and apply using the Harvard Recruitment Management System. If you have questions, email us directly.

Link roundup October 1, 2015

Homogeneously contributed

Why Preserving Old Computer Games is Surprisingly Difficult | Mental Floss

Get Peanutized | Turn Yourself into a Peanuts Character

This Camera Refuses to Take Pictures of Over-Photographed Locations | Mental Floss