Open Repositories Developer Challenge: Microservices

As part of the Developer Challenge at Open Repositories 2011, Jessie Keck, Michael Klein, Bess Sadler and I submitted the emergent community around Ruby-based curation microservices. I had written some initial code in late 2010, but only to experiment with the California Digital Library microservices and to explore how the microservices model could be used within an application; it was never intended to be "production" ready. Taking inspiration from Jim Jagielski's opening keynote, "Open Source: It’s just not for IT anymore! (pdf)", we wanted to help foster a community around the microservices, so we took a number of initial steps to turn the various implementations of Ruby microservices into a community-driven, collaborative project:
  1. Created a microservices "organization" on github to hold the community-driven source code repositories. Previously, the projects were held under a personal account alongside a diversity of other projects in various states of use and support. By creating a topic-driven organization, we hope to attract contributors and make these projects easier to discover.
  2. Created a mailing list to record decisions, answer questions, and collaborate.
  3. Agreed to a set of standards and practices for microservices projects to ensure consistency and quality across these projects:
    1. Basic "meta" files -- like README, TODO, LICENSE, etc -- should be present and contain enough information to help people get started using and contributing to the projects
    2. Clarified source code licenses, and standardized on the Apache Public License 2.0 for each project.
    3. Vastly improved the source code testing and documentation coverage, and standardized around rspec and yard. Projects are now subject to continuous integration to ensure tests pass, documentation is built, and test coverage remains high.

Open Repositories '11 presentation

slides (pdf)

Managing digital media content presents different file-management challenges than traditional text and images. The content is time-based, and therefore more complex, and even the metadata needed to describe all aspects of the content to support better access is more complicated. Even after media materials have been cataloged, digitized, and stored in a repository or database, scholars and archivists lack the tools to manage and expose the data to the world, and significant workflow challenges remain in going from large, preservation-quality digital files to media appropriate for delivery across the Internet.

The WGBH Media Library and Archives (MLA) department manages a collection of over 750,000 items dating back to the late 1940s. As an educational foundation and the creator of a valuable collection of media resources, WGBH has embraced new developments in online media in its efforts to bring its archived materials to a broader audience and to serve the needs of the academic community. WGBH is successful in exposing content to the public through national production websites such as American Experience, FRONTLINE and NOVA, whose customized and carefully constructed features and services create added value for end users by encouraging the dissemination and use of WGBH-owned content. As at many similar institutions, this has been supported by the deployment of many ad-hoc, siloed content management systems on a project-by-project basis, with each portal maintaining unique metadata and media assets, making it difficult to create new, innovative interfaces and services with the underlying content.

In 2000, in partnership with a vendor, WGBH developed a DAM architecture for media access and published reference architecture documentation so other media organizations could replicate the work. The preservation DAM system is based on a proprietary system from the publishing and creative industries, with limitations in metadata structure and interface. The vendor tended to develop the system toward what they saw as market trends and viable business sales. WGBH has found that although the system works, it is not flexible enough for the changing needs of the media industry, and the vendor is unable to tailor the software to our particular user needs without significant additional investment. In addition, upgrades are costly and time consuming, and all of the site-specific customizations built around the software need simultaneous upgrading by internal teams (e.g. extensive customizations to support media ingest of large video files while requiring limited technical knowledge). These customization links often break and need to be rewritten with every upgrade.

Access

To address the need to expose archival content in a sustainable manner, for a variety of audiences, and to encourage innovation within media archives, WGBH created Open Vault, which provides a digital access portal into a cross-section of material from the WGBH Media Library and Archives. Although designed as an access portal, a secondary objective in creating Open Vault was to explore the potential for the system to fit within the multifaceted content management ecosystem for both access and preservation use. WGBH Open Vault is built using Blacklight, Solr and the Fedora repository. Beyond the Open Vault user interface, we exposed a number of APIs, either for internal use or to support existing data exchange projects, including Atom/RSS feeds, unAPI, oEmbed, and OAI-PMH. By taking advantage of existing open-source solutions as much as possible, we were able to focus our efforts on domain-relevant issues. This has proven a reliable platform, and we have since deployed similar technology for a couple of cross-institutional, data-intensive projects.

In 2006, WGBH launched the first Open Vault, an access repository based on CWIS. This site combined clips of media assets from four different series (three of which had separate finding aid websites created earlier). In 2008/9, WGBH MLA and Interactive completed an Andrew W. Mellon Foundation-funded project that allowed us to work closely with humanities scholars, researching their needs and habits in using digital media in their work. We developed a prototype, dubbed "Open Vault Research", using Fedora and a PHP front-end. One discovery was that scholars lack tools for working with media, while traditional scholarship is still focused on citing textual resources. To address this, we created a number of tools for working with media material:
  • aligned transcripts, which allow the user to rapidly scan the transcript of an interview and seek immediately to a section of interest;
  • annotations and tags, which allow the user to segment and describe media fragments and refer back to those notes later;
  • fragment addressing, which allows the savvy user to deep-link into a particular point in an object.

Taking these user needs into account, we developed Open Vault v2 using Blacklight and the Fedora repository. Finally, we are about to deploy a new iteration of Open Vault using Blacklight 3.0 (and, as a footnote, although our application has significantly different behavior, the customizations amount to only about 3,500 lines of code, more than half of which are HTML templates). Although the Hydra framework has matured significantly since the beginning of the project, because the management of the media and metadata is still performed in external systems, we continue to access the Fedora APIs directly.

In this redesign, we looked at usage patterns over the collection and re-organized and re-prioritized elements of the user experience:
  • The majority of our users entered the website at a record page from an external search engine (with about a 50% bounce rate). However, if a user stayed and watched a video, they would often navigate the website to "related content" (exposed using Solr's MoreLikeThis; see the sketch below).
  • Subject browse was used more frequently than expected to give an overview of the materials in the collection.
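As a rough illustration of that "related content" feature, the sketch below asks Solr's MoreLikeThis handler for documents similar to the one being viewed. The handler path, field names and identifier are placeholder assumptions, not the actual Open Vault configuration.

    // Hypothetical "related content" lookup against Solr's MoreLikeThis
    // handler; the path, fields and identifier below are illustrative only.
    $.getJSON('/solr/mlt', {
        q: 'id:"org.wgbh.mla:123"',      // the record currently being viewed
        'mlt.fl': 'subject_t,series_t',  // fields used to find similar documents
        'mlt.count': 5,
        wt: 'json'
    }, function(data) {
        // render the similar documents as a simple "related content" list
        $.each(data.response.docs, function(i, doc) {
            $('#related-content').append($('<li/>').text(doc.title_t));
        });
    });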

Re-use

Digital Commonwealth is an ongoing project to which we began contributing material from our first iteration of Open Vault using OAI-PMH. Project Vietnam was a collaboration with the Columbia Center for New Media Teaching and Learning (CCNMTL) that embedded material from Vietnam: A Television History that we exposed on Open Vault. We spent a significant amount of time figuring out how to exchange media and metadata with CCNMTL and settled on a handful of open standards. The Mozilla Foundation/WebMadeMovies project also wanted to work with us on HTML5-based media experiments. To develop a quick demonstrator, Mozilla, with little assistance or guidance from the Open Vault team, was able to build a javascript-based discovery interface using our OpenSearch API and integrate both our video content and TEI-encoded transcripts into their popcorn.js environment.

Technology

For our media player environment, we needed a technology that supports several requirements:
  • the ability to jump to any point in an item, which is especially important when serving hour-long raw interviews (and which rules out standard delivery of the content, over HTTP or otherwise);
  • an open source, or low cost, delivery platform;
  • a robust javascript API that allows us, at a minimum, to programmatically adjust the playhead (which we use to provide media/transcript synchronization, "deep linking" into a video, and annotation of media fragments).

We finally settled on a Flash-based player (which provides a more consistent user experience) with an HTML5-based fallback (to support iOS and other devices). For delivery, we're using h.264 pseudostreaming over HTTP, which is fully compatible with traditional HTTP delivery for clients that don't support pseudostreaming and makes alternate uses easier.

To support ease of reuse, we adopted the principle that the "website is the API", and in doing so were able to bake in discoverable standards and approaches that are replicable to other holdings and implementations. This approach included semantic markup, additional contextual information (as alternative link relations), and user state information (e.g. the location of the playhead) exposed within the page content. To support advanced usage, we also expose a number of auto-discoverable APIs that provide structured information, so page elements can be recomposed without parsing the web page HTML.
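As a rough sketch of how the HTML5 fallback could support deep linking and expose the playhead as page state, the snippet below seeks a video element to a time given in the URL fragment and writes the current position back into the page. The element id, fragment syntax and attribute name are assumptions for illustration, not the actual Open Vault implementation.

    // Hypothetical deep-linking and playhead exposure for the HTML5
    // fallback player; the element id, fragment syntax (#t=120) and the
    // data attribute are illustrative.
    $(function() {
        var video = document.getElementById('player');   // assumed player element
        if (!video) return;

        var match = window.location.hash.match(/t=(\d+)/);
        if (match) {
            // wait for metadata so the duration is known before seeking
            $(video).on('loadedmetadata', function() {
                video.currentTime = parseInt(match[1], 10);
            });
        }

        // expose the playhead position as page content, so other scripts
        // (e.g. transcript synchronization or annotation) can read it
        $(video).on('timeupdate', function() {
            $('#playhead').attr('data-seconds', Math.floor(video.currentTime));
        });
    });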

OAI-PMH

To support "traditional" aggregation, like the Digital Commonwealth project, we have an OAI-PMH endpoint. (see also why OAI-PMH should die)

OpenSearch (blacklight)

For other aggregation efforts, we provide an OpenSearch endpoint that allows simple machine-to-machine discovery in a standard way.
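A rough sketch of that machine-to-machine flow, with illustrative selectors and query term: read the OpenSearch description document advertised in a page's head, expand its URL template, and fetch a page of results.

    // Hypothetical OpenSearch client: discover the description document,
    // fill in its URL template, and fetch results. The feed type and the
    // search term are illustrative.
    var descriptionUrl = $('link[rel="search"][type="application/opensearchdescription+xml"]').attr('href');

    $.get(descriptionUrl, function(description) {
        var template = $(description).find('Url[type="application/rss+xml"]').attr('template');
        var searchUrl = template.replace('{searchTerms}', encodeURIComponent('vietnam'));

        $.get(searchUrl, function(feed) {
            console.log($(feed).find('item').length + ' results');
        }, 'xml');
    }, 'xml');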

Atom/RSS (blacklight)

All search results expose a discoverable Atom/RSS feed. Blacklight also provides functionality through the Document Extension Framework that allows clients to request specific representations of objects as part of the content of the feed.
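For example, a client could ask for the Atom feed of a search with a particular representation embedded in each entry. The content_format parameter below reflects my recollection of Blacklight's Atom response and should be treated as illustrative rather than authoritative.

    // Hypothetical request for a search's Atom feed, asking for a specific
    // export format in each entry's content; the path, query and
    // content_format value are illustrative.
    $.get('/catalog.atom', { q: 'vietnam', content_format: 'oai_dc' }, function(feed) {
        $(feed).find('entry').each(function() {
            // each entry's <content> element carries the requested representation
            console.log($(this).find('title').first().text());
        });
    }, 'xml');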

unAPI (blacklight plugin)

The unAPI endpoint allows applications to discover structured information based on an identifier and a content type.
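A sketch of the unAPI exchange, with a placeholder path and identifier: a request with only an identifier lists the available formats, and adding a format parameter retrieves that representation.

    // Hypothetical unAPI client: discover the formats available for a
    // record, then request one of them. The '/unapi' path and identifier
    // are illustrative.
    var id = 'org.wgbh.mla:123';

    $.get('/unapi', { id: id }, function(formats) {
        // the formats response lists <format name="..." type="..."/> elements
        var name = $(formats).find('format').first().attr('name');

        $.get('/unapi', { id: id, format: name }, function(record) {
            console.log(record);
        });
    }, 'xml');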

oEmbed (blacklight plugin)

Rather than forcing implementors to discover media assets (through page scraping or unAPI) and construct a player themselves, oEmbed allows a client to discover the embeddable properties of an asset in a standard way: it provides an easily parseable set of metadata required for embedding and, possibly, a pre-generated player implementation.
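The shape of that exchange, sketched with an assumed endpoint path and record URL: the oEmbed response is a small JSON document whose html property (for video types) carries a ready-made player.

    // Hypothetical oEmbed consumer: ask the provider endpoint for the
    // embeddable representation of a record page; the endpoint path and
    // record URL are illustrative.
    $.getJSON('/oembed', {
        url: 'http://example.org/openvault/record/123',  // the record page to embed
        format: 'json',
        maxwidth: 480
    }, function(data) {
        if (data.type === 'video') {
            // data.html contains a pre-generated player ready to drop into the page
            $('#embed').html(data.html);
        }
    });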

HTML <meta> tags

While encouraging re-use of materials, we documented possible improvements that would make ad-hoc innovation and mash-up creation significantly easier, including:
  • oEmbed: rather than forcing implementors to discover media assets (through page scraping or unAPI), introspect the assets for technical metadata, and then construct a player, oEmbed provides an easily parseable set of metadata required for embedding and, possibly, a pre-generated player implementation;
  • additional information in the Atom/RSS feeds, in particular ensuring the data contained within the feed representations is comparable to the normal user interface;
  • and exposing additional information on the page for developer use, which, in the case of technical or rights metadata, is less relevant to our primary audience but may be essential to building third-party interfaces to the content (see the sketch below).
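A minimal sketch of the kind of developer-facing discovery this enables: enumerating the endpoints and alternate representations a record page advertises through link relations, without scraping the page body. The rel values shown are examples, not a documented list.

    // Hypothetical discovery pass over a record page: list the endpoints
    // and alternate representations advertised via link relations in the
    // page <head>, rather than scraping the page body.
    $('link[rel="alternate"], link[rel="unapi-server"], link[rel="search"]').each(function() {
        var type = $(this).attr('type') || 'unknown type';
        console.log($(this).attr('rel') + ' (' + type + '): ' + $(this).attr('href'));
    });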

jQuery UI Autocomplete and LiquidMetal

By default, the jQuery UI Autocomplete widget filters the source data using a very basic regular expression match:
    filter: function(array, term) {
        var matcher = new RegExp($.ui.autocomplete.escapeRegex(term), "i");
        return $.grep(array, function(value) {
            return matcher.test(value.label || value.value || value);
        });
    }
source
While this works, it doesn't provide relevancy ranking or near-matches, both of which are important when selecting from long lists of values that are not well known or that contain a significant number of obscure items. To address this, I added a custom data source to the Autocomplete widget that uses the LiquidMetal library, a refinement of the Quicksilver scoring algorithm.
    source: function(request, response) {
        // with no term entered, return the full (unfiltered) list
        if (request.term == "") {
            return response(data);
        }

        // score each candidate value against the term, dropping weak matches
        var arr = $.map(data, function(value) {
            var score = LiquidMetal.score(value, request.term);
            if (score < 0.5) {
                return null; // jQuery.map compacts null values
            }
            return { 'value': value, 'score': score };
        });

        // sort by descending score so the best matches appear first
        arr.sort(function(a, b) { return b['score'] - a['score']; });

        return response($.map(arr, function(item) { return item['value']; }));
    }
demo
Surprisingly easy.

Blacklight OAI Demonstrator

I recently put together a simple Blacklight-based OAI-PMH harvester (https://github.com/cbeer/blacklight-oai-demo). Created primarily as an experiment, it was prompted by Ed Corrado's Code4Lib-L thread "Simple Web-based Dublin Core search engine?" and some recent inquiries to the Blacklight community about using Blacklight with non-MARC metadata. The whole experiment was surprisingly easy, thanks to ruby-oai. In its current form, you can configure OAI providers (with metadata formats and sets, using XSL transforms to convert the OAI-PMH records into Solr-ingestable XML), set up harvesting schedules, and use the standard Blacklight discovery framework. Finally, there is a minimal test suite (using VCR to mock OAI-PMH requests to the Library of Congress).

Useful Standards for Public Media Projects: Linkbacks

In the age of real-time web crawling, third-party comment services (e.g. Disqus), and a relatively standardized set of "engagement" platforms (Twitter, Facebook), the Linkback standards are probably less relevant than they used to be, but I believe they are still an easy way to add a meaningful layer of serious communication across a variety of platforms. For all the architectural and social flaws behind the standards, collecting link data is trivial to implement and gives you control of some of the most important pieces of information one can collect: how people are discovering, discussing or re-using your content. One possible advantage is that both Trackback and Pingback are more-or-less opt-in standards, giving the participants some control over how broadly they want to advertise their discussion. As the public media community keeps pushing user engagement, this is just another tool in the toolbox (and one already present in most established content management systems). I've heard a rumor that trackback URLs used to be a standard part of the PBS website infrastructure years ago -- I'd be very interested to know what, if anything, was learned from collecting that data.
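To illustrate how little is involved, the sketch below sends a Trackback ping: a form-encoded POST to a target's advertised trackback URL (a Pingback would instead be an XML-RPC call carrying only the source and target URIs). The URLs and values are placeholders, and in practice a ping like this is sent server-side rather than from the browser.

    // Hypothetical Trackback ping from a page that links to a record; the
    // trackback URL and the posted values are illustrative. (Real pings are
    // normally sent server-side, not from the browser.)
    $.post('http://example.org/openvault/record/123/trackback', {
        url: 'http://example.org/my-blog/post-about-this-interview',
        title: 'Notes on a 1982 interview',
        excerpt: 'A short summary of the post that links to the record...',
        blog_name: 'Example Blog'
    }, function(responseXml) {
        // the Trackback response is a tiny XML document; <error>0</error> means success
        console.log($(responseXml).find('error').text());
    });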