For the bVault project I am developing, one of our secondary goals is to create a replicable model for other digital media repositories. One way we are pursuing this is by laying the foundations for an interface to a federated/distributed repository shared among public broadcasters. This takes advantage of an architectural feature of public broadcasting in the US: the public broadcasting network is really a federation of individual stations that subscribe and contribute to a particular programming distribution service (PBS and NPR, among others).

A federated repository ultimately needs three things:

  1. A common API among the participating repositories,
  2. A search index that covers all the repositories, and
  3. A resolver to translate a search result back to the originating repository

Common API

For bVault, the common API is the set of web services exposed by Fedora, and the metadata translation dissemination service behind that, which allows a client to receive a particular metadata format, regardless of the underlying schema. This is an important feature, because it allows individual repositories to use whichever metadata format is most natural to their needs, while seamlessly generating interoperable metadata.

Search index

The exact method used to generate a spanning search index is essentially arbitrary. Solr provides some distributed/sharded search capabilities, but the index could also operate on a pub/sub model, where repositories push content out to a master search index, or rely on a search-engine-like crawler that harvests each repository’s OAI-PMH endpoint. Because the search index is only loosely coupled to the rest of the system, the choice is ultimately an architectural decision rather than a technical one.
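As a rough illustration of the crawler option, here is a minimal sketch of how a harvester might build its OAI-PMH requests. The function name and the example endpoint are hypothetical and not part of bVault; only the `verb`, `metadataPrefix`, and `from` request parameters come from the OAI-PMH protocol itself.

```php
<?php
// Hypothetical helper (not part of the bVault library): builds an OAI-PMH
// ListRecords request URL for one repository's OAI endpoint.
function buildListRecordsUrl($baseUrl, $metadataPrefix = 'oai_dc', $from = null) {
    $params = array(
        'verb'           => 'ListRecords',
        'metadataPrefix' => $metadataPrefix,
    );

    // Optional selective harvesting: only records changed since $from.
    if ($from !== null) {
        $params['from'] = $from;
    }

    // Pass the separator explicitly so the result does not depend on php.ini.
    return $baseUrl . '?' . http_build_query($params, '', '&');
}

// Assumed endpoint, for illustration only.
echo buildListRecordsUrl('http://repo-a.example.org/oai', 'oai_dc', '2009-01-01'), "\n";
```

A crawler would loop over the known endpoints, fetch each URL, and feed the returned records into the shared index. Resumption tokens and error handling are left out of this sketch.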

Distributed Resolver

Now that we have a way to discover items within a repository, the interface needs a way to extract the content from the origin. For this, we need a way to resolve a unique resource identifier (URI!) back to its source. Again, the method is somewhat arbitrary, but for this project, we elected to require unique namespaces for each repository (quite reasonable, considering the application).

To do this, I’ve slipped a namespace resolver into the client’s API call to allow the interface to act independently from the source of the content. For a simple API call, like listDatastreams, we have:

public function listDatastreams($pid, $asOfDateTime = null) {
      return Fedora_Repository::get('API-A', $pid)->listDatastreams(array('pid' => $pid,
                    'asOfDateTime' => $asOfDateTime));
}

This requests the API-A binding appropriate to the current persistent identifier (pid):

/**
  * Retrieves a SOAP client for a Fedora repository that can provide the $type
  * endpoint for the given PID or namespace prefix.
  *
  * @param string $type   API binding to request, e.g. 'API-A'
  * @param string $prefix PID or namespace prefix used to look up the repository
  * @return SoapClient|false false if no repository could supply a client
  */
public static function get($type, $prefix = '') {
     global $objManager;

     $arrRepository = $objManager->resolve($prefix);

     // Try the candidate repositories in random order. shuffle() also copes
     // with zero or one candidates; array_rand() returns keys in their
     // original order when asked for all of them and fails on an empty array.
     $arrKey = array_keys($arrRepository);
     shuffle($arrKey);

     foreach($arrKey as $key) {
           $objClient = $arrRepository[$key]->getSoapClient($type);
           if($objClient instanceof SoapClient) {
                 return $objClient;
           }
     }

     return false;
}
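The `$objManager->resolve()` call is not shown in this post. As a minimal sketch of the idea only, and not the actual bVault implementation, a resolver could map each PID namespace to the list of repositories (including mirrors) that serve it. The class name, the `register()` method, and the example URLs below are all assumptions made for illustration.

```php
<?php
// Hypothetical registry: PID namespace => list of repositories serving it.
// Plain URL strings stand in for repository objects in this sketch.
class NamespaceResolver {
    private $map = array();

    // Register one repository (or mirror) for a namespace.
    public function register($namespace, $repository) {
        $this->map[$namespace][] = $repository;
    }

    // Accepts a full PID ("stationa:42") or a bare namespace ("stationa")
    // and returns every repository registered for that namespace.
    public function resolve($pid) {
        $namespace = strstr($pid, ':', true);
        if ($namespace === false) {
            $namespace = $pid; // no colon: treat the input as the namespace
        }
        return isset($this->map[$namespace]) ? $this->map[$namespace] : array();
    }
}

$resolver = new NamespaceResolver();
$resolver->register('stationa', 'http://repo-a.example.org/fedora');
$resolver->register('stationa', 'http://mirror.example.org/fedora');

print_r($resolver->resolve('stationa:42'));
```

Registering more than one repository under the same namespace is what makes mirroring possible: the caller receives every candidate and can fall back to the next one if a repository is unreachable.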

Creating a distributed repository doesn’t cost much now, and if you design it right, you can benefit from the potential for redundancy and mirroring immediately, even before there is a federated network to tap into.

The full source is available from the bVault Fedora PHP library.