Public media links for the week of 3/6

Some thoughts on curation – adding context and telling stories

Just over two years ago I wrote a post about the importance of the resource and the URL, and I still stand by what I said there: the core of a website should be the resource and its URL. If those resources describe real-world things, and they are linked together in the way people think about the world, then you can navigate the site by hopping from resource to resource in an intuitive fashion. But I think I missed something important in that post: the role of curation, the role of storytelling.
Tom Scott's article is particularly interesting as public broadcasting begins to transform from distribution into conversation. It's great to see some thinking about the interaction of user-generated content and programming.

Consolidation: CPB renews its economy push for shared master control facilities

CPB has come up with another incentive and a new demonstration project in its long and sporadic campaign for the cost savings of shared technical facilities and staff.

Under a new rule adopted by the CPB Board, public TV stations won’t be eligible for master control equipment funding unless they share the facility with one or more other stations, according to Mark Erstling, senior v.p. for system development and media strategy.
I'm probably biased because I started in public media as a master control op, but consolidation worries me. One of the great things about public media has been its incredible local efforts. Even if public media has drifted away from that, keeping master control local should encourage better programming for local communities.
This is a little older, but the announced cuts to BBC Online (and the responses to it) are interesting, and I'd love to see a discussion in US public media about digital vision going forward:

The BBC: still no digital vision

I’ve been meaning for a while to write about a growing sense of frustration with the BBC (and, for that matter Channel 4) for their continuing failure to establish a strategy repositioning them in a way that makes sense for a public service media organisation in the emerging digital ecology. I drafted this before Mark Thompson’s recent announcement of cuts in BBC Online; the decisions he has announced recently only confirm a view that the BBC has yet to find a direction in the new media landscape.

What is the BBC?

Well, so what? What's so special about the BBC that we should have a right to public money? Well, we have no intrinsic right to this money in the same way that, say, the police and fire departments don't have an intrinsic right to public money. However, like any public good, society cooperates to share certain resources for public gain.

Fedora and Microservices

In this post, I want to discuss repository architecture philosophies. Although I will focus primarily on Fedora and the California Digital Library's microservices, there are some generalizations one can pull out of the comparison. It would also be interesting to pull in some very different repository models, like iRODS or a triple-store-backed system, but that's outside of my expertise.

The basics

This is not a section I really want to write, but I don't know of a good high-level answer to "when we say repository, this is what we mean." I spent a little time looking around for a summary, but more often than not I found more questions (or, perhaps more useful yet inappropriate for my purposes, answers driven by technology rather than use), so I've taken a stab at addressing what I believe are some key issues:

Repositories are a collection of services, with well-defined interfaces, for storing and managing data (both content and metadata) in a format-neutral, display-independent way. Repositories can be used as preservation repositories, as access repositories, as centralized aggregations of far-flung data, etc., and can operate at any scale for any audience. Furthermore, there are existing standards and agreements about what it means to be a certain type of repository (TDR, OAIS, etc.). All of these repositories, however, share some common services -- whether implemented as software, external processes, or manual processes.

Some essential repository services are:

  • Identifier services, which may include assignment + registration
  • Storage services (although the content stored may be only pointers to the "actual" content)
  • Content identification, matching identifiers to content items
  • Ingest workflows
  • Access mechanisms

Without these services in place, a repository system would face some difficult obstacles in creating and providing value-added services. Repositories may provide multiple flavors of these services, some of which may be defined in generally accepted standards, models, and specifications.
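As an illustration, the essential services above could be sketched as a minimal interface. The class and method names here are illustrative assumptions, not any particular repository's API:

```python
from abc import ABC, abstractmethod

class Repository(ABC):
    """A minimal sketch of the essential repository services listed above."""

    @abstractmethod
    def mint_identifier(self) -> str:
        """Identifier service: assign (and register) a new identifier."""

    @abstractmethod
    def store(self, identifier: str, content: bytes) -> None:
        """Storage service: persist content (or a pointer to the 'actual' content)."""

    @abstractmethod
    def retrieve(self, identifier: str) -> bytes:
        """Content identification + access: match an identifier to its content."""
```

Ingest workflows and richer access mechanisms would then be composed on top of these primitives.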

Other basic services, which operate on top of the above and are fairly common in most well-developed repository frameworks, include:

  • Dissemination services, to transform repository data into other forms + formats
  • Authorization services

More advanced services may include:

  • preservation services, including checksum (generation + verification), file format migration, support for models like LOCKSS
  • relationship services, using an RDF triplestore or similar, offering SPARQL endpoints, inferencing, etc
  • discovery services, using Lucene/Solr/etc, to provide relevancy, optimized user experience, drill-down faceting

These more advanced services are likely separate applications in the repository ecosystem and are generally useful utilities independent of any repository system. Repositories generally integrate with these external applications in a modular, mix-and-match manner using well-defined interfaces.
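Of the preservation services listed, checksum generation and verification is the most self-contained; a minimal sketch, using only the standard library:

```python
import hashlib

# Preservation checksum service: generate a digest at ingest,
# verify it later to detect bit rot or silent corruption.
def generate_checksum(content: bytes, algorithm: str = "sha256") -> str:
    return hashlib.new(algorithm, content).hexdigest()

def verify_checksum(content: bytes, recorded: str, algorithm: str = "sha256") -> bool:
    return generate_checksum(content, algorithm) == recorded

digest = generate_checksum(b"master file bytes")
assert verify_checksum(b"master file bytes", digest)
```

In a real repository this would run over every stored file on a schedule, with mismatches flagged for repair from a replica.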

Fedora

One approach to repository services is the "repository-in-a-box" model, where you can install and configure a base set of services provided by a single application. Within this group, Fedora provides a very basic implementation of the core repository services (versus a full-stack application like DSpace, which provides production-ready user interfaces). Fedora bills itself as a Flexible, Extensible Digital Object Repository Architecture.

  • Identifier services, through PIDGen, which provides sequential identifiers per namespace
  • Content identification, mapping HTTP URIs to dereferenceable URIs to files
  • REST + SOAP APIs for ingest + delivery
  • Dissemination services using WSDL
  • Authorization using XACML (and authentication using a number of plugins)
  • Integration with the Mulgara triplestore and a Lucene index (by default)
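Concretely, these services are addressed over HTTP. A sketch of how client code might construct URLs against the REST API follows; the base URL assumes a default Fedora 3.x install, and no request is actually made here:

```python
from urllib.parse import quote

# Hypothetical base URL for a local Fedora 3.x install.
BASE = "http://localhost:8080/fedora"

def object_profile_url(pid: str) -> str:
    # Object profile, e.g. for a PID like "demo:1" (colon must be escaped).
    return f"{BASE}/objects/{quote(pid, safe='')}?format=xml"

def datastream_content_url(pid: str, dsid: str) -> str:
    # Raw content of one datastream (e.g. the "DC" Dublin Core record).
    return f"{BASE}/objects/{quote(pid, safe='')}/datastreams/{dsid}/content"

print(datastream_content_url("demo:1", "DC"))
```

A client would fetch these URLs with any HTTP library; the SOAP API exposes the same operations for WSDL-based tooling.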

Fedora provides many opportunities for customization and enhancement through custom development:

As services go beyond the basic applications common in institutional repositories, enhanced repository services require custom development or supplemental services outside the repository itself. For most, this includes integration with a more advanced search provider (like Solr). At some point, additional services can blur the lines between the repository services and front-end user interfaces (which have to respond to local customization to meet user needs).

Repository-independent services, or third-party services, require some wrapper to make them interoperable with the Fedora APIs, which makes integration with existing technology more difficult. Even DuraSpace's DuraCloud offering is (currently) built as separate services with some possibility of storage-level integration. Some preservation support services will simply bypass the repository APIs and operate against the file system instead.

Considering the services Fedora doesn't provide, and the obstacles Fedora creates in integration, many ask why they should start using Fedora at all. The strongest response, I believe, is that Fedora provides a common structure for basic repository services while not creating major obstacles to future expansion or migration away from Fedora. Out of the box, Fedora provides a set of "training wheels" (ht Mike Giarlo <http://lackoftalent.org/michael/blog/>) for repository services development that can be removed when unnecessary, but in the meantime offer structure for the creation of new repositories and support for repository services as needed.

CDL Microservices

Another approach to repository services is the "microservices" model designed by the California Digital Library (CDL), which provides standards and specifications for individual repository services. These form a structure for standardized, mix-and-match repository services that can integrate, interoperate, and take advantage of existing technology independent of a repository application like Fedora. This, conceivably, allows all domain developers to take advantage of these common projects without committing to a specific technology. CDL provides microservices specifications for:

  • identifier assignment + registration, using NOID, which can act as a CLI tool or a CGI service
  • file-system structures, using the Pairtree convention
  • data exchange and verification, using BagIt
  • access standards, using the ARK URL format
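As a flavor of how lightweight these conventions are, the Pairtree mapping from an identifier to a filesystem path can be sketched in a few lines. This covers only the common character substitutions; the full spec defines additional cleaning rules:

```python
def pairtree_path(identifier: str) -> str:
    # Pairtree convention: apply the spec's character substitutions,
    # then split the identifier into two-character directory names.
    cleaned = identifier.replace("/", "=").replace(":", "+").replace(".", ",")
    parts = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(parts)

print(pairtree_path("ark:/13030/xt12t3"))
```

Because the mapping is deterministic and reversible, any tool that knows the convention can locate an object's directory with no registry lookup at all.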

The standards are developed in line with the "UNIX philosophy":

Write programs that do one thing and do it well. Write programs to work together. -- Doug McIlroy

These basic services can be organized and crafted using the existing capabilities of web servers, file systems, etc. More advanced services can act within this structure, using individual standards as needed. While significant development and customization may be required to get a microservices architecture to a usable state, the end result is more flexible and targeted to an institution's needs.

Flexing Fedora

These two approaches are certainly not incompatible, and Fedora is quite capable of using some of these microservices standards under the hood (replacing custom-developed approaches to these basic services). By taking this approach, Fedora could act as a management application on top of generic repository data, allow both Fedora-based and microservices-based services to operate on the data, and make it easier to reach around Fedora when necessary (or, go so far as to remove it entirely).

What follows is a short summary of ongoing work in this area, which mostly focuses on removing the Fedora-centric definitions of /how/ or /where/ services act. The majority of these ideas build on new developments and best practices in the repository community (since Fedora was initially created) that have come as a result of increased adoption or awareness of issues. Where available, I've included links to projects in the works.

Some of this work is quite easy to do:

Other projects are more involved, and require more work than just creating new modules for Fedora:

More advanced microservices integration is highly involved and would require a major re-work of the application:

  • Two-way messaging queues (or file alteration monitors, or database update hooks) to allow Fedora to receive updates
  • decreased reliance on self-generated registries; I think the situation is getting better, but I'm not sure it's fully there yet
  • pluggable storage modules with intelligent filtering, routing, multiplexing, and rules mechanisms -- the Akubra project may be doing (part of?) this <http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project>
  • workflow support hooks, to allow integration and automation of workflow tools (possibly a result of Hydra?)
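To give a flavor of the first item, here is a toy sketch of two-way messaging, with an in-process queue standing in for a real message broker; the message shapes and names are invented for illustration:

```python
import queue

# Toy model: the repository consumes update notifications from an inbound
# queue and acknowledges on an outbound queue. A real deployment would use
# a broker (e.g. JMS/ActiveMQ) or a file alteration monitor, not queue.Queue.
inbound = queue.Queue()
outbound = queue.Queue()

def handle_updates():
    while not inbound.empty():
        msg = inbound.get()
        # ...here the repository would reconcile its registries with the
        # externally made change described by msg...
        outbound.put({"pid": msg["pid"], "status": "synchronized"})

inbound.put({"pid": "demo:1", "action": "modified-on-disk"})
handle_updates()
print(outbound.get())
```

The point is the direction of flow: today Fedora mostly emits messages about its own changes; receiving them would let external tools modify repository data safely.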

Media, Blacklight, and Viewers Like You

[Video] [PDF] Libraries and archives share many problems (and solutions) in the interest of helping the user. There are also many "new" developments in the archives world that the library communities have been working on for ages, including item-level cataloging, metadata standards, and asset management. Even with these similarities, media archives have additional issues that are less relevant to libraries: the choice of video players, large file sizes, proprietary file formats, challenges of time-based media, etc. In developing a web presence, many archives, including the WGBH Media Library and Archives, have created custom digital library applications to expose material online. In 2008, we began a prototyping phase for developing scholarly interfaces by creating a custom-written PHP front-end to our Fedora repository. In late 2009, we finally saw the (black)light, and after some initial experimentation, decided to build a new, public website to support our IMLS-funded /Vietnam: A Television History/ archive (as well as existing legacy content). In this session, we will share our experience of and challenges with customizing Blacklight as an archival interface, including work in rights management, how we integrated existing Ruby on Rails user-generated content plugins, and the development of media components to support a rich user experience.

PBCore 2.0: What I'd like to see

This is a short writeup of things I would like to see in PBCore 2.0, which is currently in progress. It reflects my own personal opinions, etc.

One of the biggest challenges PBCore 2.0 will face is determining how all-encompassing a standard it should be. Media organizations create a large variety of assets through diverse mechanisms for a wide range of purposes, with any and all possible skill sets and technologies. Billed as the metadata standard for public broadcasting, it probably needs to respond to everyone's needs while avoiding requiring the impossible or limiting the foreseeable. It is for this reason I believe the most important thing PBCore 2.0 can do is provide a structure and framework for metadata without prescribing "the one true way". To do this, PBCore 2.0 must be flexible and, more importantly, extendable if it is going to succeed.

These ideas probably fall outside "core" PBCore compliance, but would enhance the descriptive power of the schema. All it would take are two considerations during the development of PBCore 2.0: a permissive data model and (more importantly) a system and place to document and describe standard extensions, best practices, and implementations.

One of the biggest strengths of PBCore 1.x, as I've written earlier, is the vast data dictionary that is the combination of a number of siloed applications full of current data. In PBCore 2.0, I truly hope due consideration is given to linked data and semantic ontologies to provide an easy way for an organization or individual to supplement a core vocabulary with a purpose-driven vocabulary for describing assets (the EBU's P-META classification schemes have taken the first tentative step into this realm and are well worth a look). This could be done as simply as providing URL-based references to data dictionary values, e.g.:
[XML example lost in formatting; it showed a descriptive element whose value carried URL-based references to external definitions, such as an RDF Schema term and a wikipedia.org page.]
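Since the original XML snippet did not survive this page, here is a hypothetical sketch of the idea using Python's ElementTree. The attribute names (source, ref) are my assumptions, not the actual PBCore 2.0 schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical: a genre element whose literal value is supplemented with a
# URL-based reference into an external data dictionary.
elem = ET.Element("pbcoreGenre", {
    "source": "PBCore genre vocabulary",                       # assumed attribute
    "ref": "http://en.wikipedia.org/wiki/Documentary_film",    # assumed attribute
})
elem.text = "Documentary"
print(ET.tostring(elem, encoding="unicode"))
```

Any tool that only understands plain PBCore values could simply read the element text and ignore the reference attributes.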
This system could be easily extended (in a standardized way) to provide data dictionary descriptions, relational information (sameAs, parentOf, etc) and more, while allowing some level of basic compliance that can ignore the extension. Other extensions to the schema are probably more complex and would require the PBCore 2.0 schema to be permissive, rather than restrictive. One important (and I'd argue, essential) example of this is temporal + spatial media fragments, which could allow a system to describe, in some level of detail, fragments of an asset. This could be represented like:
[XML example lost in formatting; it showed how temporal + spatial media fragments of an asset could be described.]
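Again, the original snippet was lost; a hypothetical sketch of describing a temporal fragment, with element and attribute names invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical: a part of an asset bounded in time, with its own title.
# Element and attribute names are assumptions, not the actual schema.
frag = ET.Element("pbcorePart", {
    "startTime": "00:01:30",   # assumed temporal bounds for the fragment
    "endTime": "00:02:45",
})
title = ET.SubElement(frag, "pbcoreTitle")
title.text = "Interview excerpt"
print(ET.tostring(frag, encoding="unicode"))
```

Spatial fragments (a region of the frame) could follow the same pattern with coordinate attributes.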
(Obviously the semantics, describing multiple instantiations, and other issues would need to be worked out.) I'd like to take this a step further and develop a systematic way of embedding other schemas, presumably designed for describing objects and ideas outside the core focus of PBCore, such as people and entities, rights metadata, and provenance. By developing some best practices, this could be done in a discoverable and standard way, maybe something like:
[XML example lost in formatting; it embedded a FOAF description of a contributor: name (Chris Beer), gender (Male), title (Mr), and role (Rabble-rouser).]
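A hypothetical reconstruction of the lost FOAF example, again in ElementTree; the pbcoreContributor wrapper and the choice of FOAF properties are my assumptions:

```python
import xml.etree.ElementTree as ET

FOAF = "http://xmlns.com/foaf/0.1/"
ET.register_namespace("foaf", FOAF)

# Hypothetical: embedding a FOAF description of a contributor inside a
# PBCore-style record, so FOAF-aware tools get rich entity data while
# others can ignore the foreign namespace entirely.
contributor = ET.Element("pbcoreContributor")
person = ET.SubElement(contributor, f"{{{FOAF}}}Person")
ET.SubElement(person, f"{{{FOAF}}}name").text = "Chris Beer"
ET.SubElement(person, f"{{{FOAF}}}gender").text = "Male"
ET.SubElement(person, f"{{{FOAF}}}title").text = "Mr"
print(ET.tostring(contributor, encoding="unicode"))
```

The namespace prefix is the discoverability hook: a consumer can decide, per namespace, whether to process or skip the embedded block.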
Tools that don't understand FOAF should be encouraged to ignore these additions, which nevertheless provide a rich method of extending the schema in a decentralized and flexible manner. Again, I'm not calling for the inclusion of advanced (and likely complicated) features into core PBCore compliance, just hoping that in developing a standard for the future, it remains flexible and extendable enough to meet the needs of all users while staying accessible to all.

Open source happenings

Just some quick notes:
  1. I got a patch into FITS to add some basic video metadata extraction. I'd like to take it further to ensure support for the formats that exiftool supports, but it's a good start.
  2. Today I pushed out a first release of ave-sync, a media/XML synchronization tool. Also a good start, and it should be a starting place to play with the W3C File API in Firefox 3.6.
  3. XForms applications are painful to write, but probably a good choice for XML-based workflows... more on that later.