Announcing Summer Fellows

We’re beyond thrilled to introduce our first cohort of LIL Summer Fellows. The following seven brilliant minds will be working here in the Harvard Law School Library (HLSL) for the next 12 weeks, exploring new pathways in technology, law, and libraries.


Fellows, staff, and interns in the Langdell reading room

 

Neel Agrawal, neelagrawal.com  Neel comes from the Los Angeles County Law Library where he managed one of the world’s largest collections of foreign and international legal materials. He’s also a dedicated world percussionist. Neel will spend the summer making significant progress on his African drumming laws project, https://africandrumminglaws.org/ Email Neel at neel.k.agrawal@gmail.com

Jay Edwards, @meangrape  Jay will spend his weeks in the Lab this summer analyzing web archives in Perma.cc and recently digitized cases in Free the Law. Jay was the ninth employee at Twitter and was the lead database engineer for Obama for America in 2012. Potentially more exciting 🙂 for the HLSL community is Jay’s eight-year-old daughter, who will be popping in to share her Hebrew cataloging skills. Email Jay at jay@meangrape.com

Sara Frug, @ssfrug  Sara is the Associate Director at the Legal Information Institute, housed at Cornell Law School. She runs the engineering team and helps design tools to make legal texts more accessible, usable and valuable. Her time this summer will be spent applying the techniques she uses with her team here in the Lab, and learning some new ones that she can take back to LII. Email Sara at sara@liicornell.org

Ilya Kreymer, @ikreymer  Ilya currently leads the development of Webrecorder, a tool designed to allow any user to create high-fidelity web archives of any content simply by browsing the web through this tool. Ilya will spend his fellowship working to improve Webrecorder, and working together with the Perma.cc team to solve some of the more difficult problems facing web archiving today. Email Ilya at ilya.kreymer@rhizome.org

Muira McCammon, @muira_mccammon  Muira will spend her fellowship building on her Guantánamo Bay Detainee Library research by writing a narrative nonfiction book about her journey probing the Guantánamo Bay Detainee Library, organizing a two-day, interdisciplinary international colloquium on GiTMO (marking the 15th anniversary of the opening of the detention camp) for Feb 2017, and interviewing a good number of GiTMO defense attorneys, journalists, veterans, and civilian book donors. Email Muira at muira.n.mccammon@gmail.com

Alexander Nwala, @acnwala  Alex is a PhD student in the Computer Science department at Old Dominion University in Norfolk, VA. Alex will spend the summer studying and building solutions in the personal and event-centered digital archives space, http://www.cs.odu.edu/~anwala/ Email Alex at anwala@cs.odu.edu

Tiff Tseng, @scientiffic  Tiff will spend her time in the program working with makerspaces in libraries to help patrons skill share and connect over common interests using Spin, a documentation tool she created as part of her PhD work at the MIT Media Lab. Email Tiff at ttseng@mit.edu

 

The fellows have had a whirlwind first week sharing their research plans, running the first ever LIL fellows hour, and touring the HLSL.


Muira and Jay


Jack, Ilya, Paul


Sara, Alex, Anastasia, Neel

Adam talks LIL on the Lawyerist Podcast

This May, Managing Director Adam Ziegler was a guest on the Lawyerist podcast, discussing recent goings-on at the Library Innovation Lab.

Sam Glover and Adam discuss the future of law, its challenges, and how the Innovation Lab is working to address them. The conversation centers on Perma.cc and also covers H2O and the Free the Law project.

Listen here!


The Lawyerist Podcast is a weekly show about lawyering and law practice hosted by Sam Glover and Aaron Street.

 

IIPC: Two Track Thursday


A protester throwing cookies at the parliament.

Here are some things that caught our ear this fine Thursday at the International Internet Preservation Consortium web archiving conference:

  • Tom Storrar at the UK Government Web Archive reports on a user research project: ~20 in-person interviews and ~130 WAMMI surveys resulting in 5 character vignettes. “WAMMI” replaces “WASAPI” as our favorite acronym.
    • How do we integrate user research into day-to-day development? We’ll be chewing more on that one.
  • Jefferson Bailey shares the Internet Archive’s ups and downs with Archive-It Research Services. Projects from the last year include .GOV (100TB of .gov data in a Hadoop cluster donated by Altiscale), the L3S Alexandria Project, and something we didn’t catch with Ian Milligan at Archives.ca.
  • What the WAT? We hear a lot about WATs this year. Common Crawl has a good explainer.
  • Ditte Laursen sets out to answer a big research question: “What does Danish web look like?” What is the shape of .dk? Eld Zierau reports that in a comparison of the Royal Danish Library’s .dk collection with the Internet Archive’s collection of Danish-language sites, only something like 10% were in both.
  • Hugo Huurdeman asks an important question: what exactly is a website? Is it a host, a domain, or a set of pages that share the same CSS? To visualize change in whatever that is, he uses ssdeep, a fuzzy hashing mechanism for page comparison.
  • Let’s just pause to say how inspiring this all is. It’s at about this point in the day that we started totally rethinking a project we’ve been working on for months.
  • Justin Littman shares the Social Feed Manager, his happenin’ stack to harvest tweets and such.
  • We learned that TWARC is either twerking for WARCs or a Twitter-harvesting Python package — we’re not entirely sure. Either way it’s our new new favorite acronym. Sorry, WAMMI.
  • Nick Ruest and Ian Milligan give a very cool talk about sifting through hashtagged content on Twitter. Did you know that researchers only have 7-9 days to grab tweets under a hashtag before Twitter only makes the full stream available for a fee? (We did not know that.)
  • We were also impressed by Canada’s huge amount of political social media engagement. Even though Canada isn’t a huge country [Ian’s words, not ours], 55,000 tweets were generated in one day with the #elxn42 tag.
  • Fernando Melo of Arquivo.pt pointed out that the struggle is real with live-web leaks in his research comparing OpenWayback and pywb. Fernando says in his tests OpenWayback was faster but pywb has higher-quality playbacks (more successes, fewer leaks). Both tools are expected to improve soon. We say it’s time for something like arewefastyet.com to make this a proper competition.
  • Nicola Bingham is self-deprecating about the British Library’s extensive QA efforts: “This talk title isn’t quite right because it implies that we have Quality Assurance Practices in the Post Legal Deposit Environment.” They use the Web Curator Tool QA Module, but are having to go beyond that for domain-scale archiving.
  • We’re also curious about this paper: Current Quality Assurance Practices in Web Archiving.
  • Todd Stoffer demos NC State’s QA tool. A clever blend of tools like Google Forms, Trello, and IFTTT to let student employees provide archive feedback during downtime. Here are Todd’s [snazzy HTML/JS] slides.
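
For readers curious about the page-comparison idea from Hugo Huurdeman’s talk above, here is a minimal sketch of scoring how much a page changed between two captures. To be clear, this is not ssdeep: ssdeep computes a fuzzy hash for each page and compares the hashes, while this stand-in uses Python’s standard-library difflib to compare the two texts directly. The function name `page_similarity` and the sample captures are our own inventions for illustration.

```python
from difflib import SequenceMatcher


def page_similarity(old_html: str, new_html: str) -> float:
    """Return a similarity score between 0.0 (nothing shared) and 1.0 (identical).

    A stand-in for a fuzzy-hash comparison: difflib compares the raw texts
    directly rather than comparing compact hashes the way ssdeep does.
    """
    return SequenceMatcher(None, old_html, new_html).ratio()


# Hypothetical captures of the same page at two points in time.
capture_2015 = "<html><body><h1>News</h1><p>Election results soon.</p></body></html>"
capture_2016 = "<html><body><h1>News</h1><p>Protest at parliament.</p></body></html>"

print(page_similarity(capture_2015, capture_2015))  # identical captures score 1.0
print(page_similarity(capture_2015, capture_2016))  # edited captures score below 1.0
```

A real fuzzy hash has the advantage that only the short hashes need to be stored and compared, which matters at archive scale; the direct comparison here is only meant to show what a similarity score looks like.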

 
TL;DR: lots of exciting things happening in the archiving world. Also exciting: the Icelandic political landscape. On the way to dinner, the team happened upon a relatively small protest right outside of the parliament. There was pot clanging, oil barrel banging, and an interesting use of an active smoke alarm machine as a noise maker. We were also handed “red cards” to wave at the government.
 

Now we’re off to look for the northern lights!

LIL at IIPC: The Story So Far

We’re halfway through the International Internet Preservation Consortium’s annual web archiving conference. Here are just a few notes from our time so far:

Auto-captioned photo of Jack, Genève, and Matt — thanks CaptionBot!

April 12

  • Andy Jackson kicks the conference off with “Have I accidentally committed international journalism?” — he has contributed to the open source software that was used to review the Panama Papers.
  • Andrea Goethals describes the desire for smaller modules in the web archive tool chain, one of her conclusions from Harvard Library’s Environmental Scan of Web Archiving. This was the first of many calls throughout the day for more nimble tools.
  • Stephen Abrams shares the California Digital Library’s success story with Archive-It. “Archive-It is good at what it does, no need for us to replicate that service.”
  • John Erik Halse encourages folks to contribute code and documentation. Don’t be intimidated and just dive in.
  • There seems to be consensus that Heritrix is a tool that everyone needs but no one is in charge of — that’s tough for contributors. A few calls for the Internet Archive to ride in and save the day.
  • We’re not naming names, but a number of organizations have had their IT departments, or IT contractors, seek to run virus scanners that would edit the contents of an archive after preservation. (Hint: it’s not easy to archive malware, but “just delete it” isn’t the answer.)
  • Some kind member of IIPC reminds us of the amazing Malware Museum hosted by the Internet Archive.
  • David Rosenthal notes that Iceland has been called the “Switzerland of bits”. After being in Reykjavik for only a few days, we sort of agree!
  • Jefferson Bailey of the Internet Archive echoed concerns about looming web entropy: there is significant growth in web archiving, but a concentration of storage for archives.
  • Nicholas Taylor of the Stanford Digital Library is responsible for the most wonderful acronym of all time, WASAPI (“Web Archiving Systems API”).
  • The Memento Protocol remains the greatest thing since sliced bread. (Here we refer to the web discovery standard, not the Jason Bourne movie.)
  • We chat with Michael Nelson about his projects at ODU, from the Mink browser plugin to the icanhazmemento Twitter bot.

April 13

  • Hjálmar Gíslason points out that 500 hours of video are uploaded to YouTube each minute. It would take 90,000 employees working full time to watch it all. Conclusion: Google needs to hire some people and get on this.
  • Hjálmar also mentions Tim Berners-Lee’s 5-Star Open Data standard. Nice goal to work toward for Free the Law!
  • Vint Cerf on Digital Vellum: the Catholic Church has lasted for an awfully long time, and breweries tend to stick around a long time. How could we design a digital archiving institution that could last that long?
  • (Perma’s suggestion: how about a TLD for URLs that never change? We were going to suggest .cool, because cool URLs don’t change. But that seems to be taken.)
  • Ilya Kreymer shows off the first webpage ever in the first browser ever, running in a simulated NeXT Computer, courtesy of oldweb.today.
  • Dragan Espenschied says Rhizome views the web as “performative media” while showing Jan Robert Leegte’s [untitled]scrollbars piece through different browsers in oldweb.today. Sometimes the OS is the artwork.
  • Matthew S. Weber and Ian Milligan have been running web archive hackathons to connect researchers to computer programmers. Researchers need this: “It would be dishonest to do a history of the 90s without using web archives.” Cue <marquee> tags here.
  • Brewster Kahle pitches the future of national digital collections, using as a model the fictional (but oh-so-cool) National Library of Atlantis. He shows off clever ways to browse a nation’s TV news, books, music, video games, and so much more.
  • Brewster encourages folks to recognize that there is no “The Web” anymore: collections will differ based on context and provenance of the curator or crawler. (What is archiving “The Web” if each of us has a different set of sites that are blocked, allowed, or custom-generated for us?)
  • Brewster voices the need for broad, high level visualizations in web archives. He highlights existing work and thinks we can push it further.
  • And oh by the way, he also shows off Wayback Explorer over at Archive Labs — graph major and minor changes in websites over time.
  • Bonus: We’re fortunate enough to grab some whale sushi (or vegan alternatives) with David Rosenthal, Ilya Kreymer, and Dragan Espenschied.
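
A quick back-of-the-envelope check of the YouTube figure Hjálmar cites above. The 500 hours per minute comes from the talk; the 8-hour shift, worked every day of the week with no breaks, is our own assumption about what “full time” means here, and the function name is ours too.

```python
HOURS_UPLOADED_PER_MINUTE = 500  # figure quoted in the talk
SHIFT_HOURS_PER_DAY = 8          # assumption: one 8-hour shift, seven days a week


def viewers_needed(hours_uploaded_per_minute: float, shift_hours_per_day: float) -> float:
    """Number of people needed to watch one day of uploads within that same day."""
    hours_uploaded_per_day = hours_uploaded_per_minute * 60 * 24
    return hours_uploaded_per_day / shift_hours_per_day


print(viewers_needed(HOURS_UPLOADED_PER_MINUTE, SHIFT_HOURS_PER_DAY))  # 90000.0
```

Under those assumptions the numbers line up: 720,000 hours of video arrive each day, which works out to 90,000 viewers. Give everyone weekends off and the headcount goes up.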

Looking forward to the next couple of days …

LIL at IIPC: Noticing Reykjavik

Matt, Jack, and Anastasia are in Reykjavik this week, along with Genève Campbell of the Berkman Center for Internet and Society, for the annual meeting of the International Internet Preservation Consortium. We’ll have lots of details from IIPC coming soon, but for this first post we wanted to share some of the things we’re noticing here in Reykjavik. 

[Genève] Nothing in Reykjavik seems to be empty space. There is always room for something different, new, creative, or odd to fill voids. This is the parking garage of the Harpa concert hall. Traditional fluorescent lamps are interspersed with similar ones in bright colors.


[Jack] I love how many ways there are to design something as simple as a bathroom. Here are some details I noticed in our guest house:

Clockwise from top left: shower set into floor; sweet TP stash as design element; soap on a spike and exposed hot/cold pipes; toilet tank built into wall.

[Matt] Walking around the city is colorful and delightful. Spotting an engaging piece of street art is a regular occurrence. A wonderful, regular occurrence.


[Anastasia] After returning from Iceland for the first time a year ago, I found myself missing something I don’t normally give much thought to: Icelandic money is some of the loveliest currency I have ever seen.


The banknotes are quite complex artistically, and yet every denomination abides by thoughtful design principles. Each side of a banknote shows either a culturally significant figure or a scene. The denomination is displayed prominently, and the typography is ornate but consistent. The colors, beautiful.


But what trumps the aesthetics is the banknotes’ dimensions. Icelandic paper money is sized according to amount: the 500Kr note is smaller than the 1000Kr note, which in turn is outsized by the 5000Kr note. This is incredibly important — it allows visually impaired people to move about more freely in the world.

In comparison, our money looks silly and our treasury department negligent, as it is impossible to differentiate the values by touch alone. And, confoundingly, there don’t seem to be movements to amend this either: in 2015 the department made “strides” by announcing it would start providing money readers, little machines that read value to people who filled out what I’m sure is not a fun amount of paperwork, instead of coming up with a simple design solution.


The coins are a different story. When I first arrived the clunky coins were a happy surprise — they’re delightfully weighty (maybe even a little too bulky for normally non-cash-carrying types), adorned with beautifully thoughtful designs. On one side of each of the coins (gold or silver), the denomination stands out in large type along with local sea creatures: a big Lumpfish, three small Capelin fish, a dolphin, a Shore crab.

On the reverse side the four great guardians of Iceland gaze intensely. They are the dragon (Dreki), the griffin (Gammur), the bull (Griðungur), and the giant (Bergrisi), each of which in turn protected Iceland from Danish invasion, according to the Saga of King Olaf Tryggvason. On the back of the 1 Krona, only the giant stands, commanding.

And that’s it. No superfluous information. No humans, either, only mythology and fish.

Returning home is a good thing, but sometimes it also means re-entering a world where money is just sad green rectangles (and oddly sized coins) full of earthly men.

How We’re Freeing the Law, Part 1: Books

Adi Kamdar is a 1L at Harvard Law School and our embedded reporter on the Free the Law project. In this first post, he tracks the progress of a casebook through our scanning process from start to finish.

Harvard Law Library is one of the few collections with nearly every law reporter—roughly 40,000 books in total. The Free the Law project’s goal is to put the court decisions inside these volumes online, so anyone can access the precedents that shape the American legal system. Right now, the project is about halfway through, and within the next couple years they’ll have completed this monumental task.

But how exactly does a book become a byte? And what happens to these physical texts after they’ve been digitized?

Harvard Depository

The project begins each week with a book order—a 600 book order, to be exact, for law reporters that chronicle U.S. legal history since the country’s inception.

The law reporters are held in a sprawling warehouse 30 miles away from the law school—the Harvard Depository. With over 200,000 square feet of storage space, the climate-controlled Depository’s mission is pure efficiency: each book—and there are over nine million—is sorted and stored by size, rather than by name or author, in order to maximize space.

But it turns out law reporters are the packing peanuts of the Harvard Depository. When the reporters were first sent over to the warehouse, instead of storing them normally, Depository staff kept the volumes around in the packaging room. Whenever staff filled a cardboard box with other books for storage, they would throw in a reporter or two if there was any extra space that needed to be filled. No one thought the print reporters would be that useful anymore, so making them easily available in bulk was a low priority. Plus, the library had decided to cancel print runs of reporters in 2010, saving valuable shelf space, especially when digital copies were easily available online.

Because of this tactic, law reporters are spread all throughout the Depository. Asking for, say, Michigan’s volumes isn’t as simple as pulling out a handful of boxes—it’s a hunt.

Langdell Library

Every Wednesday, the team receives the 600 volumes of case reporters. They line the hallway of the ground floor of Langdell, filling shelf after shelf. One by one, each book is examined before it can be taken apart. (Some books—for example, volumes with marginalia—are flagged for archival purposes.) Each volume is then catalogued and given a unique barcode so it can be tracked throughout the whole process.

 

The books are then taken to the Prep Room where, ironically, they’re repaired before they’re chopped up. Damaged pages are taped together, book bindings are cut off by hand, and the remaining sheets are taken over to a guillotine. Once aligned, the operator has to press two separate buttons underneath the cutting table at the same time to make sure her hands aren’t under the blade. The result? Cleanly cut pages.

 


Next, the bundle of pages is hauled over to the Scanning Room. Here, six employees work overlapping shifts to ensure that pages are being scanned every day, 14 hours a day. Roughly 200 documents per minute are fed through the machine, which has a camera on top and bottom to image both sides of the page.

 


Now that the books are chopped and scanned, what happens to the physical pages? After all, the purpose of this project is to digitize the law. Plus, according to circulation records, very few people were reading the old reporters anyway. Rebinding them and keeping them in the library would be a waste of space, time, and money. But just in case anyone questions the authenticity of the scans, Harvard decided it would be valuable to have the physical copies accessible. So the project decided to vacuum seal the pages. Once the pages are jogged together (using a state-of-the-art paper-jogging machine) and placed back inside their book jacket, the volumes are taken over to one last room—where they will be put inside a meat packing device. Yes, it turns out that the meat industry unwittingly stumbled across the best way to preserve books. The machine shrink wraps the pages, maintaining the integrity of the volume while handily adding an extra layer of protection from mold, humidity, and bugs.


The re-bound volumes are then re-shelved, where they await being shipped off to…

 


Louisville, Kentucky

Because of the Harvard Law Library’s limited shelf capacity, the newly packaged pages will soon be loaded onto trucks and shipped down to Louisville.

Why Kentucky? Well, because of Underground Vaults & Storage, a company that has been storing all manner of things in Louisville’s old limestone mines. The sealed books will be stored there (where they will “fear no tornado, wildfire, flood or other natural disaster”) until the rare instance that they need to be recalled.

And that’s the story of these legal volumes—from one massive depository to another, by way of a guillotine, a scanner, and a meat packer. In our next post, we’ll explore what happens after they become digital images, and how Free the Law is building the largest free database of legal opinions in the world.

Free the Law Wintersession Sprint

USB stick labeled 'all of the caselaw'

TL;DR: We are running a two week data mining sprint from January 4-15, 2016, open to current Harvard students, based on early access to a brand new data set of American caselaw. To apply, send a resume and brief statement of interest to jcushman at law dot harvard dot edu.


 

Background

We recently announced Free the Law, our project to scan every legal decision ever published in the United States. We’re generating the first consistent, comprehensive, and open database of American law, from the colonial era right up to 2015. You can read the New York Times coverage of the project here.

By the end of this project we’ll have millions of cases in the dataset — no one knows exactly how many. We’re scanning and processing tens of thousands of pages a day, and will soon have entire states completed.

Now it’s time to start exploring what to do with all that data. What new questions can we ask with millions of cases?

The answers cross every discipline at Harvard:

  • Can a spam filter be retrained to guess which torts cases make the most interesting stories?
  • How much money are we willing to fight over — and does the answer offer an alternate inflation index?
  • How has the use of Latin in the law changed over time — are judges writing more or less like regular people?
  • How have defendants’ choice of murder weapon changed? The gender balance of litigants? The reliance on scientific evidence?
  • Can we trace a family’s history through the cases they were involved in?

Caselaw is the historical record of applied moral philosophy under the law. Unlocking its secrets will have an incredible impact on scholarship of all kinds.

The Challenge

Hence our challenge: pick a question you think caselaw might help you answer, perhaps drawn from one of your classes. Build a tool to help answer it – whether that means loading up your favorite ML library, configuring an off-the-shelf statistical tool, or writing code from scratch. In a two-week sprint, do your best to answer the question, and to generalize your tool to help other researchers answer similar questions. We’ll help share the discoveries you make and the tools you build.

The Data

The data set we will share with participants will include a single state’s complete published caselaw. The data includes: (1) TIFF and JPEG2000 images for each scanned page; (2) ALTO XML files for each scanned page; and (3) structured XML files for each case.
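
If you have not worked with ALTO XML before, here is a minimal sketch of pulling the recognized text out of one page using only Python’s standard library. ALTO stores each recognized word in a `String` element’s `CONTENT` attribute, grouped into `TextLine` elements. The sample page, its namespace version, and the function names below are invented for illustration; the files in the actual data set may use a different ALTO version or carry more structure.

```python
import xml.etree.ElementTree as ET

# Invented sample page. Real files from the data set may use a different
# ALTO version, namespace, or layout.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="page_1">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine>
            <String CONTENT="The"/>
            <SP/>
            <String CONTENT="court"/>
            <SP/>
            <String CONTENT="affirmed"/>
          </TextLine>
          <TextLine>
            <String CONTENT="the"/>
            <SP/>
            <String CONTENT="judgment."/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def page_lines(alto_xml: str) -> list:
    """Return the text of each TextLine on the page, in document order."""
    root = ET.fromstring(alto_xml.strip())
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(page_lines(SAMPLE_ALTO))  # ['The court affirmed', 'the judgment.']
```

Matching on the local tag name rather than a fixed namespace is a deliberate shortcut so the same code tolerates different ALTO versions; a stricter tool would check the namespace explicitly.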

Schedule

December 2015: Application period.

January 4, 2016: Delivery of data set to participants.

January 4, 6, 8, 11, 13, 15: The group will check in three times a week, either in person or remotely, to share notes, progress updates, and requests for help.

Week of January 18: demo day (date TBD).

To Apply

Send your resume and brief statement of interest (such as a general idea of what sort of project you would like to work on) to jcushman at law dot harvard dot edu. If you would like to work with others, feel free to apply as a group.

Link roundup November 30, 2015

This is the good stuff.

The Irony of Writing About Digital Preservation

The Original Mobile App Was Made of Paper | Motherboard

The most Geo-tagged Place on Earth

The Illustrated Interview: Richard Branson

Why is so much of design school a waste of time?

Link roundup November 16, 2015

This is the good stuff.

pixelweaver

Rebellious Group Splices Fruit-Bearing Branches Onto Urban Trees | Mental Floss

Idea Sex: How New Yorker Cartoonists Generate 500 Ideas a Week – 99u

Google Cardboard’s New York Times Experiment Just Hooked a Generation on VR

Link roundup November 2, 2015

This is the good stuff.

ITUNES TERMS AND CONDITIONS: The Graphic Novel

French Vending Machines Dispense Short Stories Instead Of Snacks | Mental Floss

Swiss Style Color Picker

Chicago Ideas Week