Jul 18, 2010
To ease the transition from the previous incarnation of this blog (a shared blog) to a more focused, personal blog, I used the WordPress Import/Export feature to transfer all of my own posts into the new WordPress instance. To avoid disrupting other contributors and to leave the old history intact, I whipped up this quick plug-in to redirect requests for my posts to the new blog:
<?php
/*
Plugin Name: Author posts redirect
Plugin URI: http://cbeer.info/blog/2010/07/18/from-a-shared-blog-to-a-personal-site
Description: Redirect an author's posts to a new URL.
Version: 0.0a
Author: Chris Beer
Author URI: http://cbeer.info
*/

add_action('the_post', 'redirect_author_post');

function redirect_author_post($post) {
    // Only redirect single-post views belonging to my author ID on the old shared blog
    if ($post->post_author == 2 && is_single()) {
        // The post GUID carries the old permalink; swap in the new domain
        header("HTTP/1.1 301 Moved Permanently");
        header('Location: ' . str_replace('http://authoritativeopinion.com/', 'http://cbeer.info/', $post->guid));
        die();
    }
}
May 10, 2010
In the previous parts, I wrote about two "back-office" open source applications (and tangentially discussed a few others) that are well established in their communities and can support a wide variety of repository services. While it may be philosophically important that these are open source applications, I would argue that the next parts, covering the services and applications built on top of the repository infrastructure, are the more crucial: they benefit tremendously from the fact that anyone with a fairly broad skill-set can create and customize interfaces for specific use cases to the full extent necessary.
Blacklight grew out of a next-generation library catalog interface, and while it still has very firm roots in the library world, it is also being used for archives, digital collections, and institutional repository interfaces. It is an open source application built on the Ruby on Rails framework.
Out of the box, it is a fairly generic interface to a Solr index (with a little sprinkling of optional MARC data) and some relatively benign application features (users, bookmarks, saved searches). Connecting it to our existing Solr index is fairly trivial, requiring only a few small configuration changes:
config[:index_fields] = {
  :field_names => [
    "dc.description",
    "dc.creator",
    "dc.publisher",
    "dc.subject",
    "dc.date",
    "dc.format"
  ],
  :labels => {
    "dc.description" => "Description:",
    "dc.creator" => "Creator:",
    "dc.publisher" => "Publisher:",
    "dc.subject" => "Subject:",
    "dc.date" => "Date:",
    "dc.format" => "Format:"
  }
}
That alone gives you a very basic discovery interface to your collection.
Extending Blacklight to work with Fedora is also easy: in fewer than 50 lines of code, I had full access to the Fedora web services APIs and SPARQL interface. Adding management interfaces was also simple using normal Ruby on Rails techniques; with fewer than 500 lines of code, a passable repository manager interface was available and I could import assets and metadata.
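To give a flavor of what that integration involves, here is a minimal, hand-rolled sketch (not the actual plug-in code) of talking to a stock Fedora 3.x instance. The endpoint, the fedoraAdmin development credentials, and the demo:1 PID are all placeholder assumptions:

require 'net/http'
require 'uri'
require 'cgi'

FEDORA = 'http://localhost:8080/fedora'

# Generic GET helper; assumes the default development credentials
def fedora_get(path, params = {})
  uri = URI.parse(FEDORA + path)
  uri.query = params.map { |k, v| "#{CGI.escape(k.to_s)}=#{CGI.escape(v.to_s)}" }.join('&')
  req = Net::HTTP::Get.new(uri.request_uri)
  req.basic_auth('fedoraAdmin', 'fedoraAdmin')
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }.body
end

# Object profile from the REST API (demo:1 is a placeholder PID)
profile_xml = fedora_get('/objects/demo:1', 'format' => 'xml')

# SPARQL query against the Resource Index's risearch endpoint
sparql = 'select ?pid where { ?pid <info:fedora/fedora-system:def/model#hasModel> <info:fedora/fedora-system:FedoraObject-3.0> }'
results_csv = fedora_get('/risearch', 'type' => 'tuples', 'lang' => 'sparql', 'format' => 'CSV', 'query' => sparql)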
Adding a security layer on top of the repository content is also easy, thanks to the work the UPEI team put into the DrupalServletFilter, which allows Fedora to authenticate users against any SQL database. With that in place, we can use the XACML policy language built into Fedora to do record-level security (which, I confess, I don't entirely understand; it is, however, an enormously powerful and expressive language if you like XML verbiage). For storing re-use rights, I am very intrigued by the Open Digital Rights Language (ODRL), which can integrate with Fedora and Blacklight to express rights beyond object security (re-use, segmentation, etc.) using my proof-of-concept ruby-odrl.
With these fundamentals in place (ingest services, security policies, and resource discovery), one can build more advanced services on top of the repository, like collections, batch and on-demand conversion/transcode services, or export/transfer services (one-click "export to PBS COVE"?). Because these can be built as Rails plug-ins, they are readily shareable outside this single application and provide templates for others to continue developing and extending similar services on evolving platforms.
Because setting up a Blacklight application is so painless, it would be easy for public broadcasting institutions to create custom (yet shareable) modules and views for specific purposes (news, productions, archiving, etc.) that all share the same back-end infrastructure yet let users interact with their data in a way that makes sense for their work. As I mentioned in my Fedora article, you aren't limited to data you control and hold locally; you can bring in data from external sources (say, pulling in metadata from the NPR API or an RSS feed from a stock footage house) and present it both coherently and cohesively.
I'm looking for a good source of freely available test data; I would rather not invest too much time building a corpus of archival assets if something already exists. The biggest challenge is finding comprehensive metadata. The closest I've come is some podcast feeds from sources like Democracy Now!, but those don't capture the breadth of materials I'd like to demonstrate.
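Those feeds are at least easy to turn into test records. A quick sketch using Ruby's standard-library RSS parser (the feed URL is a placeholder, and the Dublin Core mapping is my own rough guess, not a standard one):

require 'rss'
require 'open-uri'

feed_url = 'http://example.com/podcast.rss'  # placeholder; any RSS 2.0 podcast feed works
feed = RSS::Parser.parse(open(feed_url).read, false)

docs = feed.items.map do |item|
  {
    'id'          => (item.guid ? item.guid.content : item.link),
    'title'       => item.title,
    'description' => item.description,
    'dc.date'     => item.pubDate,  # a Time; Solr wants ISO-8601 (pubDate.xmlschema, after require 'time')
    'dc.format'   => (item.enclosure ? item.enclosure.type : nil)  # e.g. audio/mpeg
  }
end

# docs can now be handed to whatever ingest/indexing step you prefer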
Finally, a couple of requisite screenshots now that there is something visual to work with, using the default Blacklight theme with some quick interface hacks.
[gallery]
May 8, 2010
The Lucene-based Apache Solr is an incredible platform for building decent search experiences. Compare it to the "more traditional" database-driven approach with many SQL JOINs, where it is difficult to efficiently add search features like stemming, ASCII folding, term highlighting, facets, and synonyms; these are, I would argue, essential parts of the discovery experience, and you get them essentially for free with Solr. Another benefit Solr provides is a foundation for many lightweight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).
For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, relationships, and metadata extracted from the assets, as well as the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements (which is probably all you can get users to contribute in any case). At this point, because legacy systems lack authority control, linked data, and the like, existing metadata is sparse, inaccurate, or limited; the entry-level bar is set pretty low, so ease-of-use and metadata collection are the priorities. Eliding a lot of detail, here's the skeleton schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="description" type="string" indexed="true" stored="true"/>
<dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<copyField source="title" dest="title_t" />
<copyField source="subject" dest="dc.subject" />
<copyField source="description" dest="description_t" />
<copyField source="comments" dest="text" />
<copyField source="dc.creator" dest="author" />
<copyField source="dc.*" dest="text" />
<copyField source="text" dest="text_rev" />
<copyField source="payloads" dest="text" />
<copyField source="dc.title" dest="dc.title_t" />
<copyField source="dc.description" dest="dc.description_t" />
<copyField source="dc.coverage" dest="dc.coverage_t" />
<copyField source="dc.contributor" dest="dc.contributor_t" />
<copyField source="dc.subject" dest="dc.subject_t" />
<copyField source="dc.contributor" dest="names_t" />
<copyField source="dc.coverage" dest="names_t" />
The new edismax query parser provides such a great balance of flexibility, advanced query features, and ease of use that it seems like an obvious choice here.
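As a concrete example (a sketch only; the qf boosts are hypothetical values against the skeleton schema above, and Blacklight normally builds these requests for you), an edismax query over the raw Solr HTTP API looks something like:

require 'net/http'
require 'uri'
require 'cgi'

# Hypothetical edismax search; field boosts in qf are illustrative, not tuned values
params = {
  'q'       => 'civil rights',
  'defType' => 'edismax',
  'qf'      => 'dc.title_t^5 dc.subject_t^2 text',
  'rows'    => '10'
}
query = params.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join('&')
response = Net::HTTP.get(URI.parse("http://localhost:8983/solr/select?#{query}"))
puts response  # raw XML response; Blacklight handles the parsing in practice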
The only penalty you pay by using Solr is having to keep the Solr index synchronized with your data sources. For synchronizing data from Fedora, there is now a proliferation of options, ranging from task-specific Java plug-ins like GSearch and Shelver to the more generic (ESBs and all that) like Apache Camel or the Ruote-based Fedora Workflow component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I've given Camel a try, and after a couple of days of Java-dependency-induced head pounding, I have something that works.
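Whatever transport you pick, the core step every one of these options performs is the same: notice that an object changed in Fedora, pull its metadata, and push an update to Solr. A deliberately naive sketch of that step (hand-rolled HTTP for illustration, not how Camel or GSearch is actually wired):

require 'net/http'
require 'uri'
require 'cgi'

# Re-index a single Fedora object into Solr (naive illustration).
# Assumes the object's Dublin Core is exposed as the standard DC datastream.
def reindex(pid)
  dc_xml = Net::HTTP.get(URI.parse(
    "http://localhost:8080/fedora/objects/#{CGI.escape(pid)}/datastreams/DC/content"))

  # Real code would map the DC XML into the schema fields above; here we just
  # store the id and push the raw XML into the catch-all text field.
  doc = "<add><doc>" +
        "<field name=\"id\">#{CGI.escapeHTML(pid)}</field>" +
        "<field name=\"text\">#{CGI.escapeHTML(dc_xml)}</field>" +
        "</doc></add>"

  uri = URI.parse('http://localhost:8983/solr/update?commit=true')
  Net::HTTP.new(uri.host, uri.port).post(uri.request_uri, doc, 'Content-Type' => 'text/xml')
end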
---
On Twitter, John Tynan requested a virtual machine image to encourage others to begin playing with this software, so I've actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.