Digital Asset Management for Public Broadcasting: Fedora Commons Repository (Part 1 of ??)

In my previous post, I provided a broad overview of the challenges and opportunities for developing an open source digital asset management system within the public broadcasting community, and described some fundamental technology that is already being developed and deployed within institutions. In this post, I want to look specifically at the role the Fedora Commons repository architecture can play in this environment. Additional reading is available from the Fedora Commons wiki, especially the Getting Started with Fedora article, which articulates some of the strengths of their approach in the abstract.

The Fedora Commons data model is built on top of the Kahn/Wilensky Architecture, which describes a data structure for primary digital objects (irrespective of the data or formats contained within). Already, this is an improvement over some systems, which differentiate between content types, relegating some content formats to second-class citizenship. By providing a single, fundamental data type, one can build consistent user experiences on top of the discoverable components and interact with the digital objects to GET THINGS DONE.

Within digital objects are datastreams, which may include both data and metadata about the object and are treated equally (more or less...). Datastreams can carry revision information, integrity checks, and other provenance information. By not distinguishing between "digital" assets (for which the data, e.g. the media files, is available electronically) and other kinds of assets (physical tapes, abstract entities, etc.), an asset management system can encompass the full range of materials within an active media archive.

Digital objects can be assigned content model types, which stipulate the required (and optional) component datastreams, as well as define the services that operate on objects of that type. These content models are themselves simply structured digital objects within the repository, allowing repository managers (and content creators, given a sufficient interface) to define the structure of their content rather than structuring their content to meet the needs of the digital asset management system.

The natively supported datastream types are inline XML, managed content, externally referenced content, and redirects. These types do not speak to the format of the content stored within them (except for inline XML), which allows content creators to easily provide content to the repository without first worrying about transcoding materials or other barriers to accessioning content (which is certainly not to say that standardizing the content types archived within the repository is problematic -- just that it shouldn't interfere with getting the materials in the first place). This variety of types also allows content to be stored and managed in the most appropriate places, rather than arbitrarily requiring centralization or "physical" ownership of content. Within a distributed organization like public broadcasting, this could be a powerful concept that allows content creators to control and manage their content at various stages of distribution (and, while this could be accomplished within a traditional database-driven system, it would require custom application logic that is unlikely to scale across a wide variety of applications, frameworks, and languages).

While all datastreams are equal, there are four (or more?) that are more equal than others:

- AUDIT, which stores the history of the digital object as it is modified.
- DC, a Qualified Dublin Core datastream that provides a minimal level of interoperability for the most generic of repository management interfaces. This is also the only fundamentally required datastream (without specifying required elements within it), and really is the bare minimum of information necessary to assert the existence of an object (if it doesn't have a title, identifier, or description, what is it we're talking about, exactly?).

- RELS-EXT (and RELS-INT), an RDF/XML datastream in which one can assert relationships to other digital objects (which may exist within the repository, but may also exist, or not exist, elsewhere). These relationships can come from any vocabulary and reference any type of object, which is handy when dealing with the complex relationships between media archive assets. This datastream is also generally indexed in an RDF triple store to support relationship queries.

- POLICY, which stores XACML security policies for the digital object. These can be used to restrict access to datastreams, services, or the object itself based on whatever the security needs are. In a digital asset management context, this could be used to restrict access to just the media files while still providing the metadata (so one could assert and describe the existence of an object without actually sharing it, for whatever reason, which seems atypical for some commercial solutions).

By default, these datastreams (and the digital object wrapper) are stored on the file system in relatively comprehensible ways, which is a bonus for implementors, who can set up the underlying hardware or other technology in traditional ways and just begin to use the software without too much fuss. There is ongoing development to build in support for additional and evolving standards around digital object storage, serialization, access, and other services, which should only help make the process as transparent as possible.

All of this technology and flexibility comes "free" with the repository architecture and doesn't try to interfere with actually making use of the assets (except as restricted by security policies, of course), which allows different use cases to be expressed in the most logical and straightforward way (rather than bending the use cases or the system in an attempt to mimic some of the elements the user needs). As a starting point for developing a digital asset management solution for media, I believe it offers a good balance of flexibility and requirements that can ensure user needs are met without sacrificing durability.

So, how can Fedora be applied in a digital asset management context for public broadcasting? First and foremost, Fedora provides a trusted platform for managing and maintaining content for many different contexts (production, long-term archiving, etc.) on top of a variety of hardware and standards. By managing metadata and data together, physical and digital assets can be revealed in a common interface (when appropriate) to meet the needs of researchers and scholars (for whom knowledge of the existence of an asset is more essential than on-demand access). Finally, by offering a stable API to a variety of resources, use-case-driven interfaces can be developed, shared, and maintained to meet different needs sensibly.
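To make that "stable API" point a little more concrete, here is a rough sketch of what talking to a Fedora 3.x repository over its REST interface could look like: reserving a PID, creating an object, attaching a managed and an external datastream, and asserting a relationship in RELS-EXT. The base URL, credentials, PID namespace, datastream IDs, file paths, and the collection object referenced are all assumptions for illustration, not a recommended setup.

```python
# A minimal sketch (not production code) of creating a digital object in a
# Fedora 3.x repository over its REST API. The base URL, credentials, PID
# namespace, and referenced collection object are assumptions.
import requests

FEDORA = "http://localhost:8080/fedora"          # assumed default location
AUTH = ("fedoraAdmin", "fedoraAdmin")            # assumed default credentials

# 1. Reserve a PID and create a new, empty digital object.
pid_xml = requests.post(f"{FEDORA}/objects/nextPID",
                        params={"namespace": "demo", "format": "xml"},
                        auth=AUTH).text
pid = pid_xml.split("<pid>")[1].split("</pid>")[0]   # crude parsing, for brevity

requests.post(f"{FEDORA}/objects/{pid}",
              params={"label": "Sample media asset"}, auth=AUTH)

# 2. Attach a managed (controlGroup=M) datastream -- e.g. a web proxy file.
with open("proxy.mp4", "rb") as proxy:
    requests.post(f"{FEDORA}/objects/{pid}/datastreams/PROXY",
                  params={"controlGroup": "M",
                          "dsLabel": "Web proxy",
                          "mimeType": "video/mp4"},
                  data=proxy, auth=AUTH)

# 3. Point at master material stored elsewhere (controlGroup=E, external).
requests.post(f"{FEDORA}/objects/{pid}/datastreams/MASTER",
              params={"controlGroup": "E",
                      "dsLabel": "Broadcast master (external)",
                      "dsLocation": "http://media.example.org/masters/1234.mxf",
                      "mimeType": "application/mxf"},
              auth=AUTH)

# 4. Assert a relationship in RELS-EXT (inline XML, RDF/XML); the collection
#    object "demo:openvault" is hypothetical.
rels = f"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:rel="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/{pid}">
    <rel:isMemberOfCollection rdf:resource="info:fedora/demo:openvault"/>
  </rdf:Description>
</rdf:RDF>"""
requests.post(f"{FEDORA}/objects/{pid}/datastreams/RELS-EXT",
              params={"controlGroup": "X",
                      "dsLabel": "Relationships",
                      "mimeType": "application/rdf+xml"},
              data=rels.encode("utf-8"), auth=AUTH)
```

The point isn't the specific calls, but that the same handful of HTTP requests applies whether the underlying bits are managed by the repository or merely referenced where they already live.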

Digital Asset Management for Public Broadcasting (Part 0 of ?)

Digital asset management is hard. Many people have solved many parts of the problem, but for a reasonably complex use-case, many of the existing solutions just aren't there yet -- especially in a vendor-driven world, for a niche market within a niche market that is concerned with all levels and life-cycles of an asset (from production, to reuse, to archiving and back again), and that is almost certainly not profitable given public broadcasting budgets. I believe this is an ideal area for the development of open source solutions built on existing open source software.

The "easy" part in the DAM ecosystem, I would argue, is archiving the material and ensuring its long-term preservation (and accessibility!). I've done a couple of projects and prototypes now based on the Fedora Commons repository architecture, and it seems to be a promising platform for this kind of development. Objects and datastreams are stored on the file system, which IT staff are traditionally prepared to manage (vs. some unique database structure almost certainly obfuscated in layers of (de-)normalization). Fedora will happily manage security policies, object relationships, data transformation services, and (shortly) more advanced file system interactions, while exposing a (relatively) consistent HTTP interface.

Discovery interfaces are probably the next easiest piece, having been examined and developed within the information science communities. Using a combination like Solr and Blacklight (deployed successfully for WGBH's Open Vault website), one can rapidly create interfaces to the underlying content that satisfy many use cases. With Solr, you get a bunch of discovery mechanisms and options out of the box, including relevancy ranking, term highlighting, faceting, and more.

From here, we start getting into the hard parts. Ingest and metadata editing are difficult to solve well in a content- and use-case-agnostic way, which is the approach most systems seem to take. While the need for a generic asset management view is important (and solved!), if the collection of services fails to meet the needs of the users, encouraging adoption (nicely) is problematic. By using infrastructure elements with open and well-documented APIs, developers can extend and customize the user experience to match the underlying data and processes. This is an area where the adoption and support of open source projects can encourage sustainable development of these interfaces.

It seems like, even after clearing these obstacles, many systems fail to account for the use and re-use of these objects within the media communities. Few systems account for batch encoding video and audio for web distribution, one-click publishing to blogs, social networking sites, or video portals, integration into broadcasting chains, etc. -- for very good reasons: there simply isn't the incentive when faced with large upfront costs for unique development. Given an open source platform, however, that supports (and encourages) sharable development of solutions, maybe we could start finding answers to these persistent problems (without re-inventing the wheel!).
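To make that batch-encoding point a bit more concrete, here is a rough sketch of the kind of web-encoding step that rarely comes built in: walk a directory of masters and produce web-friendly derivatives with FFmpeg. The directory layout and encoding settings are assumptions for illustration only.

```python
# A rough sketch of a batch web-encoding step, assuming ffmpeg is on the
# PATH; the directories and encoding settings are illustrative only.
import pathlib
import subprocess

MASTERS = pathlib.Path("/archive/masters")       # assumed source directory
WEB = pathlib.Path("/archive/web")               # assumed output directory
WEB.mkdir(parents=True, exist_ok=True)

for master in MASTERS.glob("*.mxf"):
    output = WEB / (master.stem + ".mp4")
    if output.exists():
        continue                                 # skip already-encoded items
    subprocess.run([
        "ffmpeg", "-i", str(master),
        "-c:v", "libx264", "-b:v", "1500k",      # modest web-friendly bitrate
        "-c:a", "aac", "-b:a", "128k",
        "-movflags", "+faststart",               # allow progressive playback
        str(output),
    ], check=True)
```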
I believe most of the core infrastructure pieces are there:

- Fedora, as I mentioned, which provides preservation and management services;
- Solr, which provides a discovery framework (and associated metadata extraction utilities like Tika);
- Blacklight, which provides discovery and access services;
- an ESB or other workflow solution, like Camel or Ruote;
- generic metadata editing options, like XForms, Django, etc.;
- open standards that allow for publishing and reuse (Atom, MediaRSS, RDF, ???);
- FFmpeg, which offers encoding and transcoding services.

This isn't an extensive development problem; these are well-established communities in their fields. It's a matter of getting initial momentum, tying the complex pieces together, and creating interesting and useful services on top.

So, why aren't we doing this? Money, time, the lack of a collaborative/communicative culture, and apathy toward (and acceptance of) second-rate, buggy commercial solutions that fail to address all aspects of a media object's life-cycle as it goes from rapid iterations in production, to many different distribution channels, back to relative obscurity in an archival context (until a new production pulls it out again). Without full support, no step in the process can realize the potential of the content or find the incentive to put in the hard work to ingest and describe the asset.
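Still, as a taste of how little glue some of that tying-together requires, here is a minimal sketch that reads an object's Dublin Core datastream out of Fedora and pushes a few fields into Solr. The URLs, core name, field names, and example PID are assumptions for illustration, not a worked-out schema.

```python
# A minimal sketch of "tying the pieces together": read an object's DC
# datastream from Fedora and push a few fields into Solr. The URLs, core
# name, and field names are assumptions, not a finished schema.
import requests
import xml.etree.ElementTree as ET

FEDORA = "http://localhost:8080/fedora"
SOLR = "http://localhost:8983/solr/assets"       # assumed core name
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def index_object(pid):
    # Fetch the Dublin Core datastream for this digital object.
    dc = requests.get(f"{FEDORA}/objects/{pid}/datastreams/DC/content")
    root = ET.fromstring(dc.content)

    doc = {
        "id": pid,
        "title": [e.text for e in root.iter(DC_NS + "title")],
        "description": [e.text for e in root.iter(DC_NS + "description")],
        "subject": [e.text for e in root.iter(DC_NS + "subject")],
    }

    # Post the document to Solr and commit so it is searchable immediately.
    requests.post(f"{SOLR}/update",
                  params={"commit": "true"},
                  json=[doc]).raise_for_status()

index_object("demo:1234")                        # hypothetical PID
```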

Linked data and public broadcasting

Lately, I've been talking up linked data and the semantic web to some of my colleagues in US-based public broadcasting, which is heavily fragmented (by design) and operates on a number of levels (producers, distributors, and broadcasters at both local and national levels) with many competing interests, funding models, and missions. Linked data seems to offer a common framework to disseminate, describe, and aggregate information, beyond one-way APIs, custom solutions, and one-size-fits-all software. It seems elegant to pair these organizational models with a data model that already deals with issues of authority, distributed information, and relationships between objects. Further, the BBC has done or enabled some exciting linked-data-based projects that expose the programme catalog, mash up BBC content with user-generated content, and contextualize BBC content within the wider web in a way that makes it useful and discoverable outside of a walled garden.

Getting started seems easy enough, and at least a few of us on the inside are making some quiet progress. Glenn Clatworthy at PBS has done some very early RDF experiments with the PBS catalog, which could unlock a valuable resource with the potential to tie together program assets, extra production material, and all manner of external resources.

So, why should public broadcasting begin this process now?

- It frees and decentralizes information, making it available for new applications and better resource discovery (especially within news and public affairs programming, which has many different outlets gathering different pieces and angles on a story).
- Legacy content is already being moved into new content management and asset management systems, so the additional overhead is minimal.
- It can begin at any level of effort and still produce valuable results -- and it can begin as unilateral collaboration, without the need for extensive oversight, project planning, or finalized use-cases.
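As a sense of how low the barrier to entry can be, here is a quick sketch of describing a program record as RDF with Python's rdflib. The example.org URIs, the identifiers, and the mix of vocabularies are purely illustrative assumptions on my part, not an actual PBS (or BBC) data model.

```python
# A quick, illustrative sketch of expressing a program record as RDF with
# rdflib. The example.org URIs, identifiers, and vocabulary choices are
# assumptions for demonstration only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

PO = Namespace("http://purl.org/ontology/po/")   # BBC Programmes Ontology

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("po", PO)

series = URIRef("http://example.org/programs/nova")       # hypothetical URIs
episode = URIRef("http://example.org/programs/nova/1234")

g.add((series, RDF.type, PO.Series))
g.add((series, DCTERMS.title, Literal("NOVA")))

g.add((episode, RDF.type, PO.Episode))
g.add((episode, DCTERMS.title, Literal("An Example Episode")))
g.add((episode, DCTERMS.date, Literal("2009-06-01")))
g.add((episode, DCTERMS.isPartOf, series))       # relate the episode to its series

print(g.serialize(format="turtle"))
```

Even a handful of triples like these can be published, crawled, and linked against without anyone else's sign-off, which is rather the point.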

NPR API + Solr = ?

Adapted from an email to the pubforge list.

Solr is a great application, and its out-of-the-box features still amaze me. With the newer versions, it's incredibly easy to hook Solr up to any data source (using the Solr Data Import Handler) and just let it do its thing. I don't have any thoughts about communication, but one of the tenets of the code4lib community is "less talk, more code". Public media spends a lot of time planning collaborations or trying to find funding (or worse, talking about doing those things) instead of actually doing it. I'd love to see more prototyping, iterative development, and open sharing and discussion about what new and interesting services we can provide.

In an earlier post to the list, John Tynan suggested the potential of providing a "More Like This" service for NPR News data, and in the interest of just getting something out there, I spent a little bit of time hooking everything together (a rough sketch of the kind of query involved appears at the end of this post). To give it a pretty front-end, I also hacked in a Solr AJAX interface. The NPR/Solr demonstrator uses this Solr endpoint. I've locked down the indexes, but left everything else open so you can see how the pieces fit together. If there is enough interest in this application, I would be willing to develop it further if you provide ideas, use-cases, etc. in the comments. The source code is available from the GitHub project npr-solr.

None of this took very long to develop; the most time-consuming part was importing from the paginated NPR API (with its absurdly low 20-records-per-request maximum...).
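For anyone curious what that "More Like This" query looks like in practice, here is a minimal sketch against a local Solr core of indexed NPR stories. The core name, field names, example story id, and the assumption that the MoreLikeThis handler is enabled in solrconfig.xml are all mine, not part of the demonstrator's actual configuration.

```python
# A small sketch of the "More Like This" idea: ask Solr for stories similar
# to a given one. Assumes a local core of indexed NPR stories with "title"
# and "text" fields and the MoreLikeThis handler enabled in solrconfig.xml;
# the core name and field names are assumptions.
import requests

SOLR = "http://localhost:8983/solr/npr"

def more_like_this(story_id, rows=5):
    params = {
        "q": f"id:{story_id}",      # the seed story
        "mlt.fl": "title,text",     # fields to mine for "interesting" terms
        "mlt.mintf": 1,             # minimum term frequency in the seed doc
        "mlt.mindf": 1,             # minimum document frequency in the index
        "rows": rows,
        "wt": "json",
        "fl": "id,title",
    }
    response = requests.get(f"{SOLR}/mlt", params=params)
    response.raise_for_status()
    return response.json()["response"]["docs"]

for doc in more_like_this("12345"):              # hypothetical NPR story id
    print(doc["id"], doc.get("title"))
```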