Open Repositories '11 presentation

slides (pdf) Managing digital media content adds different challenges to file management than traditional text and images. The content is time based, and therefore more complex. Even the metadata needed to describe all aspects of the content to support better access is more complicated. Even after media materials have been cataloged, digitized, and stored in a repository or database, scholars and archivists lack the tools to manage and expose the data to the world. Significant workflow challenges exist to go from large, preservation-quality digital files to media appropriate for delivery across the Internet. The WGBH Media Library and Archive department (MLA) manages a collection of over 750,000 items dating back to the late 1940’s. As an educational foundation and the creator of a valuable collection of media resources, WGBH has embraced new developments in online media in its efforts to bring its archived materials to a broader audience and to serve the needs of the academic community. WGBH is successful in exposing content for the public through national production websites such as American Experience, FRONTLINE and NOVA, whose customized and carefully constructed features and services create added value for end users by encouraging the dissemination and use of WGBH- owned content. Like many similar institutions, this has been supported by the deployment of many ad-hoc, silo-ed content management systems on a project-by-project basis with each portal maintaining unique metadata and media assets, making it difficult to create new, innovative interfaces and services with the underlying content. In 2000, in partnership with a vendor, WGBH developed a DAM architecture for media access and published reference architecture documentation for other media organizations to replicate the work. The preservation DAM system is based on a proprietary system from the publishing and creative industries with limitations for metadata structure and interface. The vendor tended to develop the system toward what they saw as market trends and viable business sales. WGBH has found that although the system works, it is not flexible to the changing needs of the media industry, and the vendor is unable to tailor the software to our particular user needs without significant additional investment. In addition, upgrades are costly and time consuming, and all of the site-specific customizations built around the software need simultaneous upgrading by internal teams (e.g. extensive customizations to support media ingestions of large video files requiring limited technical knowledge). The customization links often break and need to be rewritten with every upgrade.

Access

To address the need to expose archival content in a sustainable manner, for a variety of audiences, and to encourage innovation within media archives, WGBH created Open Vault2, which provides a digital access portal into a cross-section of material from the WGBH Media Library and Archives. Although designed as an access portal, a secondary objective in creating Open Vault was to explore the potential for the system to fit within the multifaceted content management ecosystem for both access and preservation use. WGBH Open Vault is built using Blacklight3, Solr and the Fedora repository. Beyond the Open Vault user interface, we exposed a number of APIs, either for internal use or to support existing data exchange projects, including Atom/RSS feeds, unAPI4, oEmbed5, and OAI-PMH. By taking advantage of existing open-source solutions as much as possible, we were able to focus our efforts towards domain-relevant issues. This has proven a reliable platform, and we have since deployed similar technology for a couple cross-institutional, data-intensive projects. In 2006, WGBH launched Open Vault, an access repository based on CWIS. This site combined clips of media assets from four different series (three of which had separate finding aid websites created earlier). In 2008/9, WGBH MLA and Interactive completed an Andrew W. Mellon Foundation funded project which allowed us to work closely with humanities scholars researching their needs and habits in using digital media in their work. We developed a prototype, dubbed "Open Vault Research", using Fedora and a PHP front-end. One discovery was scholars lack tools for working with media, while traditional scholarship is still focused on citing textual resources. To address this, we created a number of tools for working with media material: - aligned transcripts, which allows the user to rapidly scan the transcript of an interview, and seek immediately to a section of interest; - annotations + tags, which allows the user to segment and describe media fragments and refer back to those notes later; - fragment addressing, which allows the savvy user to deep-link into a particular point in an object. Taking these user needs into account, we developed Open Vault v2 using Blacklight and the Fedora repository. Finally, we are about to deploy a new iteration of Open Vault using Blacklight 3.0 (and, as a footnote, although our application has significantly different behavior, the customizations are only about 3500 lines of code, more than half as HTML templates). Although the Hydra framework as matured significantly since the beginning of the project, because the management of the media and metadata is still performed in external systems, we continue to access the Fedora APIs directly. In this redesign, we looked at usage patterns over the collection and re-organized and re-prioritized elements of the user experience. - The majority of our users entered the website at a record page from an external search engine (with about a 50% bounce rate). However, if a user stayed and watched a video, often they would navigate the website to "related content" (exposed using solr more like this) - Subject browse was used more frequently than expected to give an overview of the materials in the collection

Re-use

Digital Commonwealth is an on-going project, to which we began contributing material from our first iteration of Open Vault using OAI-PMH. Project Vietnam was a collaboration with the Columbia Center for New Media, Teaching and Learning, that embedded material from Vietnam: A Television History that we exposed on Open Vault. We spent a significant amount of time figuring out how to exchange media and metadata with CCNMTL and settled on a handful of open standards. The Mozilla Foundation/WebMadeMovies also wanted to work with us around HTML5-based media experiments. To develop a quick demonstrator, Mozilla, with little assistance or guidance from the Open Vault team, was able to successfully build a javascript-based discovery interface using our OpenSearch API, and integrate both our video content and TEI-encoded transcripts, into their popcorn.js environment.

Technology

For our media player environment, we needed a technology that supports several requirements:

the ability to jump into any point of an item, which is especially important when serving hour long raw interviews (which excludes standard delivery (over HTTP, or otherwise) of the content),
an open source, or low cost, delivery platform,
a robust javascript API that allows us, at a minimum, to programmatically adjust the playhead (which we use to provide media/transcript synchronization, "deep linking" into a video, and annotation of media fragments)

We finally settled on a Flash-based player (which provides a more consistent user experience) with an HTML5-based fallback (to support iOS devices and others). For delivery, we're using h.264 pseudostreaming, which is delivered over HTTP (and is fully compatible with traditional HTTP delivery for clients that don't support h.264 pseudostreaming, which makes alternate uses easier). To support ease of reuse, we adopted principle that the “Website is the API”, and in so doing, were able to bake in discoverable standards and approaches that are replicable to other holdings and implementations. This approach included semantic markup, adding additional contextual information (as alternative link relations), and exposing user state information (e.g. the location of the playhead) within the page content. To support advanced usage, we also expose a number of auto-discoverable APIs which expose structured information to recompose page elements without needing to parse web page HTML.

OAI-PMH

To support "traditional" aggregation, like the Digital Commonwealth project, we have an OAI-PMH endpoint. (see also why OAI-PMH should die)

OpenSearch (blacklight)

For other aggregation efforts, we provide an OpenSearch endpoint that allows simple machine-to-machine discovery in a standard way.

Atom/RSS (blacklight)

All search results expose a discoverable Atom/RSS feed. Blacklight also provides functionality through the Document Extension Framework that allows clients to request specific representations of objects as part of the content of the feed.

unAPI (blacklight plugin)

The unAPI endpoint allows applications to discover structured information based on an identifier and a content type.

oEmbed (blacklight plugin)

oEmbed, rather than forcing implementors to discover media assets (through page scraping or unAPI), allows a client to discover the embeddable properties for an asset (and construct a player) in a standard way. oEmbed provides an easily parseable set of metadata required for embedding and, possibly, a pre- generated player implementation;

HTML <meta> tags

While encouraging re-use of materials, we documented possible improvements to make ad-hoc innovation and mash-up creation significantly easier, including:

oEmbed, rather than forcing implementors to discover media assets (through page scraping or unAPI), introspect the assets for technical metadata, and then construct a player, oEmbed provides an easily parseable set of metadata required for embedding and, possibly, a pre- generated player implementation;
additional information in the Atom/RSS feeds, in particular ensuring the data contained within the feed representations is comparable to the normal user interface;
and, exposing additional information on the page for developer-use, which, in the case of technical or rights metadata, is less relevant to our primary audience, but may be essential to building third-party interfaces to content.