Open Repositories Developer Challenge: Microservices
As part of the Developer Challenge at Open Repositories 2011, Jessie Keck, Michael Klein, Bess Sadler and I submitted the emergent community for Ruby-based curation microservices. I had written some initial code in late 2010, but only to experiment with the California Digital Library microservices and to explore how the microservices model could be used within an application; it was never intended to be "production" ready. Taking inspiration from Jim Jagielski's opening keynote "Open Source: It’s just not for IT anymore! (pdf)", we wanted to help foster a community around the microservices, so we took a number of initial steps to turn the various implementations of Ruby microservices into a community-driven, collaborative project:
- Created a microservices "organization" on GitHub to hold the community-driven source code repositories. Before, the projects were held under a personal account that had a diversity of projects in various states of use and support. By creating a topic-driven organization, we hope to attract contributors and make these projects easier to discover.
- Created a mailing list to record decisions, answer questions, and collaborate.
- Agreed to a set of standards and practices for microservices projects to ensure consistency and quality across these projects:
- Basic "meta" files -- like README, TODO, LICENSE, etc. -- should be present and contain enough information to help people get started using and contributing to the projects.
- Clarified source code licenses, and standardized on the Apache License 2.0 for each project.
- Vastly improved the source code testing and documentation coverage, and standardized on RSpec and YARD. Projects are now subject to continuous integration to ensure tests pass, documentation is built, and test coverage remains high.
Open Repositories '11 presentation
slides (pdf)
Managing digital media content poses different file-management challenges than traditional text and images. The content is time based, and therefore more complex. Even the metadata needed to describe all aspects of the content to support better access is more complicated. Even after media materials have been cataloged, digitized, and stored in a repository or database, scholars and archivists lack the tools to manage and expose the data to the world. Significant workflow challenges exist in going from large, preservation-quality digital files to media appropriate for delivery across the Internet.
The WGBH Media Library and Archive department (MLA) manages a collection of over 750,000 items dating back to the late 1940s. As an educational foundation and the creator of a valuable collection of media resources, WGBH has embraced new developments in online media in its efforts to bring its archived materials to a broader audience and to serve the needs of the academic community. WGBH is successful in exposing content for the public through national production websites such as American Experience, FRONTLINE and NOVA, whose customized and carefully constructed features and services create added value for end users by encouraging the dissemination and use of WGBH-owned content. As at many similar institutions, this has been supported by the deployment of many ad-hoc, siloed content management systems on a project-by-project basis, with each portal maintaining unique metadata and media assets, making it difficult to create new, innovative interfaces and services with the underlying content.
In 2000, in partnership with a vendor, WGBH developed a digital asset management (DAM) architecture for media access and published reference architecture documentation for other media organizations to replicate the work.
The preservation DAM system is based on a proprietary system from the publishing and creative industries, with limitations in its metadata structure and interface. The vendor tended to develop the system toward what they saw as market trends and viable business sales. WGBH has found that although the system works, it does not adapt to the changing needs of the media industry, and the vendor is unable to tailor the software to our particular user needs without significant additional investment. In addition, upgrades are costly and time consuming, and all of the site-specific customizations built around the software need simultaneous upgrading by internal teams (e.g. extensive customizations that let staff with limited technical knowledge ingest large video files). The customization links often break and need to be rewritten with every upgrade.
Access
To address the need to expose archival content in a sustainable manner, for a variety of audiences, and to encourage innovation within media archives, WGBH created Open Vault [2], which provides a digital access portal into a cross-section of material from the WGBH Media Library and Archives. Although designed as an access portal, a secondary objective in creating Open Vault was to explore the potential for the system to fit within the multifaceted content management ecosystem for both access and preservation use.
WGBH Open Vault is built using Blacklight [3], Solr and the Fedora repository. Beyond the Open Vault user interface, we exposed a number of APIs, either for internal use or to support existing data exchange projects, including Atom/RSS feeds, unAPI [4], oEmbed [5], and OAI-PMH. By taking advantage of existing open-source solutions as much as possible, we were able to focus our efforts on domain-relevant issues. This has proven a reliable platform, and we have since deployed similar technology for a couple of cross-institutional, data-intensive projects.
In 2006, WGBH launched Open Vault, an access repository based on CWIS. This site combined clips of media assets from four different series (three of which had separate finding aid websites created earlier).
In 2008/9, WGBH MLA and Interactive completed an Andrew W. Mellon Foundation funded project which allowed us to work closely with humanities scholars, researching their needs and habits in using digital media in their work. We developed a prototype, dubbed "Open Vault Research", using Fedora and a PHP front-end. One finding was that scholars lack tools for working with media, and traditional scholarship is still focused on citing textual resources.
To address this, we created a number of tools for working with media material:
- aligned transcripts, which allow the user to rapidly scan the transcript of an interview and seek immediately to a section of interest;
- annotations + tags, which allow the user to segment and describe media fragments and refer back to those notes later;
- fragment addressing, which allows the savvy user to deep-link into a particular point in an object.
Taking these user needs into account, we developed Open Vault v2 using Blacklight and the Fedora repository.
Finally, we are about to deploy a new iteration of Open Vault using Blacklight 3.0 (and, as a footnote, although our application has significantly different behavior, the customizations are only about 3500 lines of code, more than half of them HTML templates). Although the Hydra framework has matured significantly since the beginning of the project, the management of the media and metadata is still performed in external systems, so we continue to access the Fedora APIs directly.
In this redesign, we looked at usage patterns over the collection and re-organized and re-prioritized elements of the user experience.
- The majority of our users entered the website at a record page from an external search engine (with about a 50% bounce rate). However, if a user stayed and watched a video, they would often navigate the website to "related content" (exposed using Solr's MoreLikeThis feature).
- Subject browse was used more frequently than expected to give an overview of the materials in the collection.
Technology
For our media player environment, we needed a technology that supports several requirements:
- the ability to jump to any point in an item, which is especially important when serving hour-long raw interviews (and which rules out standard delivery of the content, over HTTP or otherwise),
- an open source, or low cost, delivery platform,
OAI-PMH
To support "traditional" aggregation, like the Digital Commonwealth project, we have an OAI-PMH endpoint. (See also why OAI-PMH should die.)
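As a rough illustration of what an aggregator does with such an endpoint, here is a minimal Python sketch that builds OAI-PMH request URLs. The `verb` and `metadataPrefix` parameters and the `oai_dc` prefix come from the OAI-PMH protocol itself; the base URL, the record identifier, and the helper name `oai_request_url` are assumptions made up for the example, not Open Vault's actual values.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real Open Vault OAI-PMH URL is not given here.
BASE_URL = "https://example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for a verb plus its arguments."""
    params = {"verb": verb}
    # Leave out arguments the caller did not set.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


# Harvest every record as simple Dublin Core.
list_records_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Fetch one record by its (made-up) identifier.
get_record_url = oai_request_url(
    BASE_URL,
    "GetRecord",
    identifier="oai:example.org:item-1",
    metadataPrefix="oai_dc",
)

print(list_records_url)
print(get_record_url)
```

A real harvester would also follow the `resumptionToken` returned with long `ListRecords` responses; that step is left out to keep the sketch short.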
OpenSearch (Blacklight)
For other aggregation efforts, we provide an OpenSearch endpoint that allows simple machine-to-machine discovery in a standard way.
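To make "machine-to-machine discovery" a little more concrete: an OpenSearch description document publishes a URL template, and a client fills in placeholders such as `{searchTerms}`. The Python sketch below shows that substitution step only. The template URL and the function name `fill_template` are hypothetical; the placeholder syntax (a trailing `?` marks an optional parameter, which becomes an empty string when no value is supplied) follows the OpenSearch 1.1 draft.

```python
import re
from urllib.parse import quote

# Hypothetical template, as it might appear in a description document.
TEMPLATE = "https://example.org/catalog?q={searchTerms}&page={startPage?}&format=atom"


def fill_template(template, **values):
    """Substitute OpenSearch template parameters with URL-encoded values."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            # Optional parameters without a value become the empty string.
            return ""
        raise KeyError(f"missing required template parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)


print(fill_template(TEMPLATE, searchTerms="civil rights"))
print(fill_template(TEMPLATE, searchTerms="nova", startPage=2))
```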
Atom/RSS (Blacklight)
All search results expose a discoverable Atom/RSS feed. Blacklight also provides functionality through the Document Extension Framework that allows clients to request specific representations of objects as part of the content of the feed.
unAPI (Blacklight plugin)
The unAPI endpoint allows applications to discover structured information based on an identifier and a content type.
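The unAPI request pattern is small enough to sketch in a few lines of Python. A client may call the endpoint with no parameters (list all formats), with `id` (list the formats for one object), or with `id` and `format` (fetch that representation). The endpoint URL, the identifier, the sample response, and the format names below are all invented for illustration; only the `id`/`format` parameters and the `<formats>`/`<format>` response shape come from the unAPI specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint; not Open Vault's actual unAPI URL.
UNAPI_URL = "https://example.org/unapi"


def unapi_url(base_url, identifier=None, fmt=None):
    """Build one of the three unAPI request forms."""
    if fmt is not None and identifier is None:
        raise ValueError("unAPI requires an id whenever a format is given")
    params = {}
    if identifier is not None:
        params["id"] = identifier
    if fmt is not None:
        params["format"] = fmt
    return base_url + ("?" + urlencode(params) if params else "")


# Made-up example of a response to an "id only" request.
SAMPLE_FORMATS = """<formats id="item-1">
  <format name="oai_dc" type="application/xml"/>
  <format name="mods" type="application/xml"/>
</formats>"""


def format_names(xml_text):
    """Return the format names advertised in a unAPI formats response."""
    root = ET.fromstring(xml_text)
    return [fmt.get("name") for fmt in root.findall("format")]


print(unapi_url(UNAPI_URL, "item-1"))
print(format_names(SAMPLE_FORMATS))
print(unapi_url(UNAPI_URL, "item-1", "oai_dc"))
```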
oEmbed (Blacklight plugin)
oEmbed, rather than forcing implementors to discover media assets (through page scraping or unAPI), allows a client to discover the embeddable properties for an asset (and construct a player) in a standard way. oEmbed provides an easily parseable set of metadata required for embedding and, possibly, a pre-generated player implementation.
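From the consumer's side, that exchange looks roughly like the Python sketch below: ask the endpoint about a page URL, then read the embed markup out of the JSON response. The request parameters (`url`, `format`, `maxwidth`, `maxheight`) and the response fields (`type`, `version`, `html`, `width`, `height`) are defined by the oEmbed specification; the endpoint, the record URL, and the sample response are assumptions for the example, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; not Open Vault's actual oEmbed URL.
OEMBED_ENDPOINT = "https://example.org/oembed"


def oembed_request_url(endpoint, resource_url, maxwidth=None, maxheight=None):
    """Build an oEmbed consumer request for the given page URL."""
    params = {"url": resource_url, "format": "json"}
    if maxwidth is not None:
        params["maxwidth"] = maxwidth
    if maxheight is not None:
        params["maxheight"] = maxheight
    return endpoint + "?" + urlencode(params)


# Made-up response of type "video", shaped as the specification describes.
SAMPLE_RESPONSE = json.dumps({
    "version": "1.0",
    "type": "video",
    "title": "Example interview",
    "width": 640,
    "height": 360,
    "html": '<iframe src="https://example.org/embed/item-1"></iframe>',
})


def embed_html(response_text):
    """Return the ready-made player markup from a video or rich response."""
    data = json.loads(response_text)
    if data.get("type") not in ("video", "rich"):
        raise ValueError("response does not carry embeddable HTML")
    return data["html"]


print(oembed_request_url(OEMBED_ENDPOINT, "https://example.org/catalog/item-1", maxwidth=640))
print(embed_html(SAMPLE_RESPONSE))
```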
HTML <meta> tags
While encouraging re-use of materials, we documented possible improvements to make ad-hoc innovation and mash-up creation significantly easier, including:
- oEmbed: rather than forcing implementors to discover media assets (through page scraping or unAPI), introspect the assets for technical metadata, and then construct a player, oEmbed provides an easily parseable set of metadata required for embedding and, possibly, a pre-generated player implementation;
- additional information in the Atom/RSS feeds, in particular ensuring the data contained within the feed representations is comparable to the normal user interface;
- and exposing additional information on the page for developer use, which, in the case of technical or rights metadata, is less relevant to our primary audience but may be essential to building third-party interfaces to content.