Warc.games at IIPC

At IIPC last week, Jack Cushman (LIL developer) and Ilya Kreymer (former LIL summer fellow) shared their work on security considerations for web archives, including warc.games, a sandbox for developers interested in exploring web archive security.

Slides: http://labs.rhizome.org/presentations/security.html#/

Warc.games repo: https://github.com/harvard-lil/warcgames

David Rosenthal of Stanford also has a great write-up on the presentation: http://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html

IIPC 2017 – Day Three

On day three of IIPC 2017 (day 1, day 2), we heard more about what I see as the two main themes of the conference: archive users and metadata for provenance.

On the user front, I’ll point out Sumitra Duncan’s talk on NYARC Discovery; like WALK, presented yesterday, this project aggregates search across multiple archives, improving access for users. Peter Webster of Webster Research & Consulting and Chris Fryer from the Parliamentary Archives spoke about their study of the archive’s users: the questions of what users want and need, and how they actually use the archive, are fundamental. How we think archives should or could be used may not be as pertinent as we imagine….

On the metadata front, Emily Maemura and Nicholas Worby from the University of Toronto spoke about the ways in which documentation and curatorial process affect users’ experience of and access to archives – the staffing history of a collecting organization, for example, could be an important part of understanding why a web archive contains what it does. Jackie Dooley (OCLC Research), Alexis Antracoli (Princeton University), and Karen Stoll Farrell (Frick Art Reference Library) presented their work on developing web archiving metadata best practices to meet user needs – and it becomes clear that my two main themes could really be seen as one. OCLC Research will issue their reports in July.

I’ll also point out Nicholas Taylor’s excellent talk on the legal use cases for archives, and, of course, LIL’s Anastasia Aizman and Matt Phillips, who gave a super talk on their ongoing work on comparing web archives. Thanks again, and hope to see you all next year!

IIPC 2017 – Day Two

Most of us attended the technical track on day two of IIPC 2017. (See also Matt’s post about the first day.) Andrew Jackson of the British Library expanded on his talk the previous day about workflows for ingesting and processing web archives. Nick Ruest and Ian Milligan described WALK, or Web Archiving for Longitudinal Knowledge, a system for aggregating Canadian web archives, generating derivative products, and making them accessible via search and visualizations. Gregory Wiedeman from University at Albany, SUNY, described his process for automating the creation of web archive records in ArchivesSpace and adding descriptive metadata using Archive-It APIs according to DACS (Describing Archives: A Content Standard).

After the break, the Internet Archive’s Jefferson Bailey roared through a presentation of IA’s new tools, including systems for analysis, search, capture (Brozzler!), and availability. Mat Kelly from Old Dominion University described three tools for enabling non-technical users to create, index, and view web archives: WARCreate, WAIL, and Mink. Lozana Rossenova and Ilya Kreymer of Rhizome demonstrated the use of containerized browsers for playback of web content that is no longer usable in modern browsers (think Java applets), as well as some upcoming features in Webrecorder for patching content into incomplete captures.

Following lunch, Fernando Melo and João Nobre from Arquivo.pt described their new APIs for search and temporal analysis of Portuguese web archives. Nicholas Taylor of Stanford University Libraries talked about the ongoing rearchitecture of LOCKSS (Lots of Copies Keep Stuff Safe), expanding its role from a focus on the archiving of electronic journals to a tool for preserving web archives and other digital objects more generally. (In the Q&A, LOCKSS founder David Rosenthal mentioned the article “Familiarity breeds contempt: the honeymoon effect and the role of legacy code in zero-day vulnerabilities”.) Jefferson Bailey returned, along with Naomi Dushay, also from the Internet Archive, to talk about WASAPI (the Web Archiving Systems API) for transfer of data between archives.

After another break, LIL’s own Jack Cushman took the stage with Ilya Kreymer for a fantastic presentation of warc.games, a tool for exploring security issues in web archives: serving a captured web page is very much akin to hosting attacker-supplied content, and warc.games provides a series of challenges for trying out different kinds of attacks against a simplified local web archive. Mat Kelly then returned with David Dias of Protocol Labs to discuss InterPlanetary Wayback, which stores web archive files in IPFS, the InterPlanetary File System. Finally, Andrew Jackson wrapped up the session by leading a discussion of planning for an IIPC hackathon or other mechanism for gathering to code.

Thanks, all, for another excellent day!

IIPC 2017 – Day One

 

It's exciting to be back at IIPC this year to chat Perma.cc and web archives!

 

The conference kicked off on Wednesday, June 14, at 9:00 with coffee, snacks, and familiar faces from all parts of the world. Web archives bring us together physically!

 

So many people to meet. So many collaborators to greet!

 

Jane Winters and Nic Taylor welcomed us. It’s wonderful to converse and share in this space: grand, human, bold, warm, strong. Love the Senate House at University of London. Thank you so much for hosting us!

Leah Lievrouw, UCLA
Web history and the landscape of communication/media research

Leah told us that computers are viewed today as a medium, as human communication devices. This view is common now, but it hasn’t been for long. The idea of computers as a medium was very fringe even in the early 80s.

We walked through a history of communications to gain more understanding of computers as human communication devices and started with some history of information organization and sharing.

Paul Otlet pushed forward efforts to organize all of the world’s information in late 19th-century Belgium and France.

The Coldwar Intellectuals by J Light describes how networked information moved from the government and the military to the public.

That networked information became interesting when it was push and pull: send an email and receive a response, or send a message on a UNIX terminal to another user and chat. Computers are social machines, not just calculating machines.

Leah took us through how the internet and early patterns of the web were formed by the time and the culture: in this case, the incredible activity of Stanford and Berkeley. The milieu of the Bay Area ran from bits and boolean logic through psychedelics. Fred Turner’s From Counterculture to Cyberculture is a fantastic read on this scene.

Stewart Brand, Ted Nelson, the WELL online community, and so on.

We’re still talking about way before the web here. The idea of networked information was there, but we didn’t have a protocol (http) or a language (html) being used (web browser) at large scale (the web). Wired Cities by Dutton, Blumler, and Kraemer sounds like a fantastic read to understand how mass wiring/communication made a massive internet/web a possibility!

The Computer as a Communication Device, by J.C.R. Licklider and Bob Taylor, offered a clear vision of the future. We’re still not at a place where computers understand us as humans; we’re still fairly rigid, with defined request and response patterns.

The web was designed to access and create docs, that’s it. Early search engines and browsers exchanged discrete documents, and we thought about the web as discrete, linked documents.

Then user-generated content came along: wikis, blogs, tagging, social network sites. Now it’s easy for lots of folks to create content, and the network is even more powerful as a communication tool for many people!

The next big phase came with mobile — about mid 2000s. More and more and more people!

Data subject (data cloud or data footprint) is an approach that has felt interesting recently at UCLA. Maybe it’s real-time “flows” rather than “stacks” of docs or content.

Technology as cultural material and material culture.

 

University of London is a fantastic space!

 

Jefferson Bailey, Internet Archive
Advancing access and interface for research use of web archives

Internet Archive is a massive archive! 32 Petabytes (with duplications)

And, they have search APIs!!

Holy smokes!!! Broad access to wayback without a URL!!!!!!!

IA has been working on a format called WAT. It’s about 20-25% the size of a WARC and contains just about everything (including title, headers, link) except the content. And, it’s a JSON format!
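To make the idea of a metadata-only record concrete, here is a minimal Python sketch. The record is invented for illustration, and its field names (`url`, `title`, `headers`, `links`) are simplified placeholders rather than the actual WAT schema; it only shows how a JSON record that keeps the title, headers, and links but drops the page content can be read with the standard library.

```python
import json

# Hypothetical, simplified metadata record. Real WAT files use a richer,
# nested structure; every field name here is an assumption for illustration.
SAMPLE_RECORD = """
{
  "url": "http://example.com/",
  "title": "Example Domain",
  "headers": {"Content-Type": "text/html"},
  "links": ["http://example.com/about", "http://example.org/"]
}
"""


def summarize_record(raw_json):
    """Return (title, content type, number of outgoing links) for one record."""
    record = json.loads(raw_json)
    title = record.get("title", "")
    content_type = record.get("headers", {}).get("Content-Type", "")
    link_count = len(record.get("links", []))
    return title, content_type, link_count


if __name__ == "__main__":
    print(summarize_record(SAMPLE_RECORD))
```

Because the page body is absent, questions like "what did this page link to?" can be answered without touching the much larger WARC.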

Fun experiments when you have tons of web archives!!! Gifcities.org and US Military powerpoints are two gems!

 

Digital Desolation
Tatjana Seitz

A story about a homepage can be generated using its layout elements — (tables, fonts, and so on). Maybe the web counter and the alert box mark the page in time and can be used to understand the page!

Analysis of data capture cannot be purely technical; it has to be socio-technical.

Digital desolation is a term that describes abandoned sites on the web. Sites that haven’t been restyled. Sites age over time. (Their wrinkles are frames and tables!!?? lol)

Old sites might not bubble to the top in today’s search engines — they’re likely at the long tail of what is returned. You have to work to find good old pages.

 

The team grabbing some morning coffee

 

Ralph Schroeder, Oxford Internet Institute
Web archives and theories of the web

Ralph is looking at how information is used and pursued.

How do you seek information? Not many people ask this core question. Some interesting researcher (anyone know?) in Finland does, though. He sits down with folks and asks “how do you think about getting information when you’re just sitting in your house? How does your mind seek information?”

Googlearchy: a few sites exist that dominate!

You can look down globally at which websites dominate the attention space. The idea that we’d all come together in one global culture hasn’t happened yet. Instead, there’s been a slow crystallization of different clusters.

It used to be an Anglo-ization of the web; now things may have moved toward South Asia. Angela Wu talks about this.

Some measurements show that Americans and Chinese devote their attention to about the same bubble of websites. It might be that Americans are no more outward looking than are Chinese.

We need a combined quantitative and qualitative study of web attention. We don’t access the web by typing in a URL (unless you’re in the Internet Archive); we go to Google.

It’s hard to know about internet as a human right
Maybe having reliable information about health could be construed as civil rights
And unreliable, false information goes against human rights

 
London is a delightful host for post-conference wanderings

 

Oh, dang, it’s lunch already. It’s been a fever of web archiving!

We have coverage at this year’s IIPC! What a fantastic way to attend a conference: with the depth and breadth of much of the Perma.cc team!

Anastasia Aizman, Becky Cremona, Jack Cushman, Brett Johnson, Matt Phillips, and Ben Steinberg are in attendance this year.

 

Caroline Nyvang, Thomas Hvid Kromann & Eld Zierau
Continuing the web at large

 

The authors surveyed 35 master’s theses from the University of Copenhagen and found 899 web refs: 26.4 per thesis on average, with a minimum of 0 and a maximum of 80.

About 80% of links in the theses were undated or only loosely dated. URLs without dates are not reliable for citations?

Students are not consistent when they refer to web material, even if they followed well known style guides.

The speakers studied another corpus, 10 Danish academic monographs, and found similar variation around citations. Maybe we can work toward a good reference style?

Form of suggested reference might be something like

 

Where page is the content coverage, or the thing the author is citing. Fantastic!

What if we were to put the content coverage in a fragment identifier (the stuff after the # in the address)? Maybe something like this:

web.archive.org/<timestamp>/<url>#<content coverage>
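As a rough sketch of how such a reference could be built and read back, here is a small Python example. It follows the pattern above literally; the function names, the example timestamp, and the choice to percent-encode the content coverage are all assumptions made for illustration, not part of the speakers' proposal.

```python
from urllib.parse import quote, unquote

ARCHIVE_HOST = "web.archive.org"  # host taken from the pattern above


def build_archive_reference(timestamp, url, content_coverage):
    """Assemble <host>/<timestamp>/<url>#<content coverage>.

    The content coverage is percent-encoded so that spaces or a literal '#'
    inside it cannot be confused with the fragment separator.
    """
    fragment = quote(content_coverage, safe="")
    return f"{ARCHIVE_HOST}/{timestamp}/{url}#{fragment}"


def read_content_coverage(reference):
    """Return the decoded content coverage, or '' if there is no fragment."""
    if "#" not in reference:
        return ""
    # The encoded coverage contains no '#', so the last '#' is the separator.
    return unquote(reference.rpartition("#")[2])


if __name__ == "__main__":
    ref = build_archive_reference(
        "20170614090000", "http://example.com/thesis", "page 12"
    )
    print(ref)
    print(read_content_coverage(ref))
```

One appeal of the fragment is that it is not sent to the server, so the archive can resolve the capture as usual while the citation still carries the pointer to what was cited.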

 

And totally unrelated, this fridge was spotted later that day on the streets of London. We need a fridge in LIL. Probably not worth shipping back though.

 

Some Author, some organization

The UK Web Archive has been actively grabbing things from the web since 2004.

Total collection of 400 TB of UK websites only, imposing a “territorial” boundary: .uk, .scot, .cymru, etc.

Those TLDs are not everything, though: a work is also in scope if it is made available from a website with a UK domain name or if the person behind it is physically based in the UK.

 

Fantastic first day!! Post-conference toast (with a bday cheers!)!!

Recap, decompress, and keep the mind active for day two of IIPC!

The day was full of energy, ideas, and friendly folks sharing their most meaningful work. An absolute treat to be here and share our work! Two more days to soak up!

 

LIL Talks: The 180-Degree Rule in Film

This week, Jack Cushman illustrated how hard it is to make a film, or rather, how easy it is to make a bad film. With the assistance of LIL staff and interns, he directed a tiny film of four lines in about four minutes, then used it as a counter-example. Any mistake can break the suspension of disbelief, and amateurs are likely to make many mistakes. Infelicities in shot selection, lighting, sound, wardrobe and makeup, set design, editing, color, and so on destroy the viewer’s immersion in the film.

An example is the 180-degree rule: in alternating shots over the shoulders of two actors facing each other, the cameras must remain on the same side of the imaginary line joining the two actors. Breaking this rule produces cuts where the spatial relationship of the two actors appears to change from shot to shot.
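The rule can be stated geometrically: every camera position must fall on the same side of the line through the two actors. Here is a minimal Python sketch of that check using floor-plan coordinates; the function names and the coordinates are made up for illustration.

```python
def side_of_line(actor_a, actor_b, camera):
    """Return +1 or -1 for the side of the actor line the camera is on, 0 if on it.

    Uses the sign of the 2D cross product of (actor_b - actor_a) and
    (camera - actor_a).
    """
    ax, ay = actor_a
    bx, by = actor_b
    cx, cy = camera
    cross = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
    return (cross > 0) - (cross < 0)


def obeys_180_degree_rule(actor_a, actor_b, cameras):
    """True if no two cameras sit on opposite sides of the actor line."""
    sides = {side_of_line(actor_a, actor_b, cam) for cam in cameras}
    sides.discard(0)  # a camera exactly on the line is treated as neutral
    return len(sides) <= 1


if __name__ == "__main__":
    actors = ((0, 0), (4, 0))
    print(obeys_180_degree_rule(*actors, [(1, -2), (3, -2)]))
    print(obeys_180_degree_rule(*actors, [(1, -2), (3, 2)]))
```

In the second call one camera has crossed the line, which is exactly the kind of cut where the actors appear to swap places.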

After some discussion of the differences between our tiny film and Star Wars, Jack gauged his crew’s enthusiasm, and directed another attempt, taking only slightly longer to shoot than the first try. Here are some stills from the set.

LIL Talks: Synthesizer

This week Ben Steinberg took us on a strange and magical trip through the world of synthesizers.

Ben wasn’t talking about those Casio keyboards we all had as kids:

No, he was talking about this kind of thing … a self-built modular synthesizer:

Before showing off the hardware, Ben first asked us to ponder: what is sound? For Ben, it’s the neurological, psychological, cultural, social phenomenon that occurs when waves of compression and attenuation hit the insides of your ears.

And then the sounds began, starting with a simple sine wave. Ben showed us what happens as you adjust the frequency on the sine wave, and explained that humans can hear frequencies ranging from 20 to 20,000 Hz.
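For readers who want to poke at this without hardware, here is a minimal Python sketch that computes the samples of a sine wave at a chosen frequency. The sample rate and the function names are assumptions for illustration; Ben's synthesizer produces its waves with analog circuitry, not code.

```python
import math

SAMPLE_RATE = 44100  # samples per second; a common audio rate, assumed here


def sine_wave(frequency_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a list of samples in [-1, 1] for a sine wave."""
    n_samples = round(duration_s * sample_rate)
    return [
        math.sin(2 * math.pi * frequency_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def is_audible(frequency_hz):
    """True if the frequency is within the 20 to 20,000 Hz range cited above."""
    return 20 <= frequency_hz <= 20000


if __name__ == "__main__":
    samples = sine_wave(440, 0.01)  # 10 ms of a 440 Hz tone
    print(len(samples), is_audible(440), is_audible(25000))
```

Raising `frequency_hz` packs more cycles into the same number of samples, which is what the ear hears as a higher pitch.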

(Did you know? There’s a device businesses (like malls) can use to emit “a high-pitched sound that drives teens crazy but can’t be heard by most adults over 25.” It’s called “The Mosquito.”)

Ben then introduced us to control voltage, envelopes, voltage-controlled oscillators (VCOs), voltage-controlled amplifiers (VCAs), and low frequency oscillators (LFOs). Look them up if you want to know more, but here’s the kind of sounds they make:

 

All of these effects conspire to produce the “timbre,” which is the quality of the sound produced by the distribution of frequencies within it. And you can add filters – like “low-pass,” “high-pass,” “bandpass,” and “notch” filters – to create even more interesting sound effects, like the effect created by a wah-wah pedal. And if that’s not enough, Ben showed us a cool, minimalistic interface called a monome grid, which you can use to trigger different sound patterns and effects:

Ben wrapped up with a discussion of the “most interesting” sounds. To him, these are the ones that aren’t simulating other sounds and don’t sound like anything else. They sound like “machines playing themselves.”

For a sample, visit Ben’s own sound machine on the web: http://partytronic.com/.

Thanks Ben!

LIL Talks: Seltzer!

In this week’s LIL talk, Matt Phillips gave us an effervescent presentation on Seltzer, followed by a tasting.

We tasted

  • Perrier – minerally, slightly salty, big bubbles with medium intensity
  • Saratoga – varied bubble size, clean… Paul says that this reminds him of typical German seltzers
  • Poland Spring – soft, smooth, sweet and clean
  • Gerolsteiner – Minerally with low carbonation
  • Borjomi – Graphite, very minerally, small bubbles, funk

Of course, throughout the conversation, we discussed the potential for the bottles affecting our opinions. We agreed that for a truly objective comparison, we’d transfer the samples to generic containers.

Though our tech and law talks are always educational and fun, our carbonated water talk was a refreshing change.

LIL Talks: A Small Study of Epic Proportions

(This is a guest post by John Bowers, a student at Harvard College who is collaborating with us on the Entropy Project. John will be a Berktern here this Summer.)

In last week’s LIL talk, team member and graduating senior Yunhan Xu shared some key findings from her prize-winning thesis “A Small Study of Epic Proportions: Toward a Statistical Reading of the Aeneid.” As an impressive entry into the evolving “digital humanities” literature, Yunhan’s thesis blended the empirical rigor of statistical analysis with storytelling and interpretive methods drawn from the study of classics.

The presentation dealt with four analytical methodologies applied in the thesis. For each, Yunhan offered a detailed overview of tools and key findings.

  1. Syntactic Analysis. Yunhan analyzed the relative frequencies with which different verb tenses and parts of speech occur across the Aeneid’s 12 books. Her results lent insight into the “shape” of the epic’s narrative, as well as its stylistic character in relation to other works.
  2. Sentiment Analysis. Yunhan used sentiment analysis tools to examine the Aeneid’s emotional arc, analyze the normative descriptive treatment of its heroes and villains, and differentiate—following more conventional classics scholarship—the tonality of its books.
  3. Topic Modeling. Here, Yunhan subjected existing bipartite and tripartite “partitionings” of the Aeneid to statistical inquiry. By applying sophisticated topic modelling techniques including Latent Dirichlet Allocation and Non-Negative Matrix Factorization, she made a compelling case for the tripartite interpretation. In doing so, she added a novel voice to a noteworthy debate in the classics community.
  4. Network Analysis. By leveraging statistical tools to analyze the coincidence of and interactions between the Aeneid’s many characters, Yunhan generated a number of compelling visualizations mapping narrative progression between books in terms of relationships.
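To give a flavor of the network analysis idea, here is a toy Python sketch that counts how often pairs of characters appear together. The passages below are invented groupings, not data from the thesis, and this is not a reconstruction of Yunhan's method; it only illustrates how co-occurrence counts become weighted edges in a character network.

```python
from collections import Counter
from itertools import combinations

# Toy data: the set of characters tagged in each made-up passage.
PASSAGES = [
    {"Aeneas", "Dido"},
    {"Aeneas", "Dido", "Anna"},
    {"Aeneas", "Turnus"},
    {"Turnus", "Lavinia"},
]


def cooccurrence_edges(passages):
    """Count, for each pair of characters, the passages they share.

    Pairs are stored as alphabetically sorted tuples so that
    (A, B) and (B, A) land on the same edge.
    """
    edges = Counter()
    for characters in passages:
        for pair in combinations(sorted(characters), 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    for (a, b), weight in cooccurrence_edges(PASSAGES).most_common():
        print(f"{a} -- {b}: {weight}")
```

Computing these edge weights book by book would give one graph per book, which is the kind of input a visualization of changing relationships needs.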

 

In the closing minutes of her presentation, Yunhan reflected on the broader implications of the digital humanities for the study of classics. While some scholars remain skeptical of the digital humanities, Yunhan sees enormous potential for collaboration and coevolution between the new way and the old.

LIL Talks: 1924 Democratic Convention by Caitlin

On May 5th, 2017 Caitlin went in depth on the intricacies of the 1924 Democratic Convention (also known as the ‘Klanbake’).


In the 20s the Democratic primary had a significantly different process than it does today. Back then only 12 states had primaries; the rest of the delegates were selected through state-level caucuses and conventions that were tightly controlled by political machines.

Going into the Primary, the ecosystem of the United States was divided and often heated. At a glance:

  • Prohibition had been in effect since 1920 and there were 20,000+ violation cases. Coincidentally grape juice sales skyrocketed during this time.

    Unknown. (October 1931). Labor union members in Newark, New Jersey march against Prohibition. [Photograph]. Retrieved from http://khooll.com/post/35407667831/ready-for-the-saturday-night
    John Binder Collection. (date unknown). Anti-Saloon League rally. [Photograph]. Retrieved from http://www.pbs.org/kenburns/prohibition/popup/S0964/
  • Coolidge signed the Immigration Act of 1924, which limited the number of immigrants admitted into the US from any country to 2% of the people from that country who were living in the US as of the 1890 census. This was primarily aimed at southern & eastern Europeans (i.e., Italians and Jews). Immigrants from Africa and Asia were outright banned.
  • The Ku Klux Klan was at its peak with an estimated 3 – 8 million members. The Klan’s platform at that time was to have a country that was white, Protestant and immigrant-free.

There were two front-runners of the Primary.

William McAdoo

Harris & Ewing. (Between 1905 and 1945). William G. McAdoo, half-length portrait, facing slightly left. [Photograph]. Retrieved from http://www.loc.gov/pictures/item/00652553/

He was the former treasury secretary in Wilson’s administration and Wilson’s son-in-law. He had the popular vote, was favored by the labor unions and formally accepted the Klan’s support. His supporters were generally: southern, western, rural, Protestant & dry (pro-prohibition)

Governor Al Smith

Harris & Ewing. (Between 1905 and 1945). Smith, Alfred. Honorable. [Photograph]. Retrieved from http://www.loc.gov/pictures/item/hec2009008185/

He was the NY Governor at the time and had entered the race primarily to block McAdoo for the western & urban political base. He was backed by the NY political machine Tammany Hall, and his supporters were generally: northern, urban, Catholic & wet (anti-prohibition)

The convention was held at Madison Sq Garden on June 24th, 1924. A ⅔ vote was needed to select a candidate and in order to accomplish that the convention lasted for 16 days and 103 ballots until a consensus was reached. The convention was PACKED and the Washington Post described it as full of “Tammany shouters, Yiddish chanters, vaudeville performers, saga Indians, hulu dancers, street cleaners, firemen, policemen, movie actors & actresses, bootleggers, 1,098 delegates and 15 presidential candidates.”
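The size of the hurdle is easy to work out from the numbers above. This short Python sketch assumes the ⅔ rule means "at least two thirds of all delegates"; the function name is made up for illustration.

```python
import math
from fractions import Fraction


def votes_needed(delegates, share=Fraction(2, 3)):
    """Smallest whole number of votes that is at least `share` of `delegates`."""
    return math.ceil(delegates * share)


if __name__ == "__main__":
    print(votes_needed(1098))
```

Using `Fraction` keeps the arithmetic exact, so a total that divides evenly by three is not accidentally rounded up by floating-point error.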


Underwood & Underwood. (June 20, 1924). Transfigured Interior of Madison Square Garden Ready for Biggest Convention in History!. [Photograph]. Retrieved from http://www.paragonauctionsite.com/lot-2251.aspx

There were fist fights on the floor between pro- and anti-Klan delegates. The Tammany Machine stacked the crowd with paid protestors, filling the air with the sounds of thousands of people with drums, tubas, trumpets and electric fire sirens in support of Smith after FDR gave his nominating speech.


Apic/Getty Images. (June 1924). Convention nationale democrate. [Photograph]. Retrieved from http://www.gettyimages.ca/license/112077749

After 16 days, neither McAdoo nor Smith had won. John W. Davis, a candidate from West Virginia, was the eventual compromise.

However, in the end it was all for nothing; Calvin Coolidge won the 1924 presidential election and Davis only captured about 29% of the total vote.

LIL Talks: Parsing Caselaw

In last week’s LIL talk, expert witness Adam Ziegler took the stand to explain the structure of legal opinions and give an overview of our country’s appellate process.

LILers listening to Adam lecturing

First on the docket was a general overview of our country’s judicial structure, specifically noting the similarities between our federal and state systems, which both progress from district courts, to appellate courts, to supreme courts.

Animated Gif of Adam Ziegler Lecturing

Next, we dissected several cases which would eventually be heard by the US Supreme Court. While some elements, such as a list of attorneys and the opinion text, are standard in all cases, each court individually decides how their cases will be formatted. They are, however, often forced to work within the guidelines and workflows specified by their contracted publishers.

LILers listening to Adam while eating their lunch.

In our Caselaw Access Project, we’re working on friendlier, faster, totally open, and more data-focused systems for courts to publish opinions. For more information, please send an email to: lil@law.harvard.edu