blog.cbeer.info

Feb 1, 2009

Repositories: What are they and what are they good for?

Beer, C. Repositories: What are they and what are they good for? AMIA Open Source Moving Image Access Meeting, February 2009.

Dec 1, 2008

Semi-controlled-folksonomic-tagging-vocabulary: Encouraging Useful Metadata Contributions. New England Code4Lib, December 2008.

Beer, C., and Michael, C. Semi-controlled-folksonomic-tagging-vocabulary: Encouraging Useful Metadata Contributions. New England Code4Lib, December 2008.

Nov 20, 2008

jQuery, OpenSearch and Autocomplete

Here's a quick code snippet for making jQuery's autocomplete UI element consume an OpenSearch resource:

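    jQuery('#term').autocomplete('/proxy/opensearch', {parse: opensearch});

    // Convert an OpenSearch Suggestions response, [query, [completions], ...],
    // into the row objects the autocomplete plugin expects.
    function opensearch(data) {
        // Parse the raw response as JSON instead of eval()-ing remote content.
        if (typeof data === 'string') {
            data = JSON.parse(data);
        }
        var parsed = [];

        for (var i = 0; i < data[1].length; i++) {
            var row = jQuery.trim(data[1][i]);
            if (row) {
                parsed.push({
                    data: [row],
                    value: row,
                    result: row
                });
            }
        }
        return parsed;
    }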
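For reference, an OpenSearch Suggestions response is a JSON array whose second element holds the completions, which is why the parser reads `data[1]`. The sketch below is a jQuery-free version of the same transformation, so it can be tried outside a browser. The function name `parseSuggestions` and the sample payload are made up for illustration.

```javascript
// Hypothetical standalone version of the parse callback (no jQuery needed).
// An OpenSearch Suggestions response has the shape:
//   [queryString, [completion, ...], [description, ...], [url, ...]]
function parseSuggestions(responseText) {
    var data = JSON.parse(responseText);
    // Tolerate a response that omits the completions array.
    var completions = Array.isArray(data[1]) ? data[1] : [];
    var parsed = [];

    for (var i = 0; i < completions.length; i++) {
        var row = String(completions[i]).trim();
        if (row) {
            // Same row shape the autocomplete plugin expects from `parse`.
            parsed.push({ data: [row], value: row, result: row });
        }
    }
    return parsed;
}

// Made-up sample payload: one clean entry, one padded entry, one empty entry.
var sample = '["fed", ["fedora", " federated search ", ""]]';
var rows = parseSuggestions(sample);
console.log(rows.map(function (r) { return r.value; }));
```

Empty and whitespace-only completions are dropped, and padded ones are trimmed before they reach the dropdown.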

Oct 26, 2008

Federated/distributed digital repositories

For the bVault project I am developing, one of our secondary goals is to create a replicable model for other digital media repositories. One way we are pursuing this is by laying the foundations for an interface to a federated/distributed repository shared among public broadcasters. This takes advantage of an architectural feature of public broadcasting in the US: the public broadcasting network is really a federation of individual stations that subscribe and contribute to particular programming distribution services (PBS and NPR, among others).

A federated repository ultimately needs three things:

A common API among the participating repositories,
A search index that covers all the repositories, and
A resolver to translate a search result back to the originating repository.

Common API

For bVault, the common API is the set of web services exposed by Fedora, and the metadata translation dissemination service behind that, which allows a client to receive a particular metadata format, regardless of the underlying schema. This is an important feature, because it allows individual repositories to use whichever metadata format is most natural to their needs, while seamlessly generating interoperable metadata.

Search index

The exact method used to generate a spanning search index is essentially arbitrary. Solr provides some distributed/sharded search capabilities, but the index could also operate on a pub/sub model, where repositories push content out to a master search index, or use a search-engine-style crawler that harvests each repository's OAI-PMH endpoint. Because the search index is only loosely coupled to the rest of the system, the choice is ultimately an architectural decision rather than a technical one.
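To make the push option concrete, here is a minimal in-memory sketch in JavaScript. It is only an illustration of the idea, not part of bVault: the `SpanningIndex` class, its method names, and the sample records are all assumptions, and a real deployment would sit on something like Solr rather than a `Map`.

```javascript
// Minimal in-memory sketch of a push-style spanning index (all names hypothetical).
class SpanningIndex {
    constructor() {
        this.records = new Map(); // pid -> { pid, title }
    }

    // Called by a repository when an object is added or updated.
    publish(record) {
        this.records.set(record.pid, record);
    }

    // Called by a repository when an object is withdrawn.
    retract(pid) {
        this.records.delete(pid);
    }

    // Naive case-insensitive substring match on the title; returns matching PIDs.
    search(term) {
        const needle = term.toLowerCase();
        return [...this.records.values()]
            .filter((r) => r.title.toLowerCase().includes(needle))
            .map((r) => r.pid);
    }
}

const index = new SpanningIndex();
index.publish({ pid: 'stationA:1', title: 'Evening News 1978' });
index.publish({ pid: 'stationB:7', title: 'Morning news roundup' });
index.publish({ pid: 'stationB:8', title: 'Jazz Hour' });

console.log(index.search('news'));
```

The index returns only identifiers, which keeps it loosely coupled: it needs to know nothing about where or how the content itself is stored.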

Distributed Resolver

Now that we have a way to discover items within a repository, the interface needs a way to extract the content from the origin. For this, we need a way to resolve a unique resource identifier (URI!) back to its source. Again, the method is somewhat arbitrary, but for this project, we elected to require unique namespaces for each repository (quite reasonable, considering the application).
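The namespace approach can be sketched as a lookup table keyed on the PID prefix. The snippet below is an illustrative JavaScript sketch, not the bVault code: the `registry` contents, the example.org URLs, and the function names are all assumptions. It relies only on the fact that a Fedora PID has the form `namespace:id`.

```javascript
// Hypothetical registry: PID namespace -> base URLs of repositories holding it.
// Several URLs under one namespace represent mirrors of the same content.
const registry = {
    stationA: ['https://repo-a.example.org/fedora'],
    stationB: [
        'https://repo-b.example.org/fedora',
        'https://mirror-b.example.org/fedora',
    ],
};

// Everything before the first colon is the namespace;
// a value without a colon is treated as a bare namespace.
function namespaceOf(pid) {
    const colon = pid.indexOf(':');
    return colon === -1 ? pid : pid.slice(0, colon);
}

// Returns the candidate repositories for a PID,
// or an empty array when the namespace is unknown.
function resolve(pid) {
    const ns = namespaceOf(pid);
    return Object.prototype.hasOwnProperty.call(registry, ns)
        ? registry[ns].slice()
        : [];
}

console.log(resolve('stationB:7'));
console.log(resolve('unknown:1'));
```

Because each namespace maps to a list rather than a single endpoint, the same table can describe mirrors, and the caller is free to pick among them.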

To do this, I've slipped a namespace resolver into the client's API call to allow the interface to act independently from the source of the content. For a simple API call, like listDatastreams, we have:

    public function listDatastreams($pid, $asOfDateTime = null) {
        return Fedora_Repository::get('API-A', $pid)->listDatastreams(array(
            'pid' => $pid,
            'asOfDateTime' => $asOfDateTime
        ));
    }

This requests the API-A binding appropriate to the current persistent identifier (pid):

    /**
     * Retrieves a SOAP client for a Fedora repository that can provide
     * the $type endpoint for the PID/prefix $prefix
     *
     * @param string $type
     * @param string $prefix
     * @return SoapClient|false
     */
    static public function get($type, $prefix = '') {
        global $objManager;

        // All repositories (e.g. mirrors) registered for this PID's namespace.
        $arrRepository = $objManager->resolve($prefix);
        $objClient = false;

        // Try the candidates in random order and stop at the first one that
        // yields a client. array_rand() no longer shuffles the keys it
        // returns (PHP 5.2.10+) and fails on an empty array, so shuffle
        // the keys explicitly instead.
        $arrKey = array_keys($arrRepository);
        shuffle($arrKey);

        foreach ($arrKey as $key) {
            $objClient = $arrRepository[$key]->getSoapClient($type);
            if ($objClient !== false) {
                break;
            }
        }

        return ($objClient instanceof SoapClient) ? $objClient : false;
    }

Creating a distributed repository doesn't cost much now, and if you design it right, you can benefit from the potential for redundancy and mirroring immediately, even before there is a federated network to tap into.

The full source is available from the bVault Fedora PHP library.

Oct 25, 2008

Zend_Cache for Web Services

My current project involves a number of SOAP Web Services requests to retrieve information from our Fedora repository. To help minimize overhead from HTTP requests, I'm using Zend Framework's Zend_Cache_Frontend_Class to wrap the whole Fedora/PHP interface class. Zend_Cache allows me to implement this style of caching with only a single line of code.

Our web services consumer provides a couple of access methods that can be safely cached:

    class Fedora_Object
    {
        /* .... */
        public function getDissemination($pid, $sDefPid, $methodName, $parameters, $asOfDateTime = null) {
            try {
                return Fedora_Repository::get('API-A', $pid)->getDissemination(array(
                    'pid' => $pid,
                    'serviceDefinitionPid' => $sDefPid,
                    'methodName' => $methodName,
                    'parameters' => $parameters,
                    'asOfDateTime' => $asOfDateTime
                ));
            } catch (SoapFault $s) {
                return $s;
            }
        }
        /* .... */
    }

In the bootstrap file, instead of initializing the Fedora_Object class, I wrap it in a Zend_Cache instance:

    $fedora = Zend_Cache::factory('Class', 'File', array(
        'cached_entity' => new Fedora_Object(),
        'cached_methods' => array('getObjectXML', 'getDatastreamDissemination', 'getDissemination'),
        'cache_by_default' => false
    ));

This code tells Zend_Cache to cache only the specified cached_methods and pass everything else through. Easy.
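For readers who want to see the idea without Zend Framework, the sketch below shows the same "cache only the listed methods" behaviour in JavaScript. It is not how Zend_Cache is implemented: `cacheMethods` and the `service` stand-in are hypothetical, and the cache is a plain in-memory `Map` with no expiry.

```javascript
// Sketch of selective method caching; not how Zend_Cache is implemented.
// Wraps `target` so that only the methods named in `cachedMethods` are cached.
function cacheMethods(target, cachedMethods) {
    const cache = new Map();
    return new Proxy(target, {
        get(obj, prop) {
            const value = obj[prop];
            if (typeof value !== 'function') {
                return value; // plain properties pass straight through
            }
            if (!cachedMethods.includes(prop)) {
                return value.bind(obj); // unlisted methods always execute
            }
            return (...args) => {
                // Cache key: method name plus serialized arguments.
                const key = prop + ':' + JSON.stringify(args);
                if (!cache.has(key)) {
                    cache.set(key, value.apply(obj, args));
                }
                return cache.get(key);
            };
        },
    });
}

// Hypothetical stand-in for the web service client; counts real calls.
const service = {
    calls: 0,
    getObjectXML(pid) {
        this.calls += 1;
        return '<object pid="' + pid + '"/>';
    },
    purgeObject(pid) {
        this.calls += 1;
        return true;
    },
};

const cached = cacheMethods(service, ['getObjectXML']);
cached.getObjectXML('demo:1');
cached.getObjectXML('demo:1'); // served from the cache
cached.purgeObject('demo:1');  // not listed, so always executed
cached.purgeObject('demo:1');
console.log(service.calls);    // 3: one cached read plus two uncached calls
```

Listing cacheable methods explicitly, rather than caching by default, keeps state-changing calls from ever being answered out of the cache.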