Repository workflows

One of the sessions at RIRI '09, and a common theme across many of the conferences I attended this year, was workflows, workflow engines, and tools. Workflows initially came out of "enterprise" systems for managing web services, and moved into the repository in the hands of repository managers. They establish, in advance, rule-based workflows to manage (primarily) submission and dissemination with the Business Process Execution Language (BPEL). While this may be perfect for describing complex interactions, the burden of creating workflows and the additional overhead make this approach seem like overkill.

From the scientific community come two more basic workflow systems, Taverna and Kepler. The key difference between these systems and BPEL seems to be the intended audience, which for these two applications is the scientists themselves, looking to manage and marshal their data in a manner specific to their needs. At this point, many of the applications seem ad hoc (although, judging by myExperiment, they seem to be gathering interest in the scientific community). While these tools are certainly applicable to making use of the material in a repository environment, their application to repository management still seems questionable.

A third option is a programmatic workflow engine like Ruote, which allows one to specify business processes in Ruby, JSON, or an XML syntax, and can be linked with Fedora's Java Message Service (JMS) using stomp and some Fedora objects. Below, I've outlined a very basic Ruote workflow for updating a Solr search index every time a Fedora object is updated, similar to the GSearch plugin. Forgive my rather ugly Ruby code; this is just a quick sketch of a possible service. Here's a simple Fedora jms/ruote driver:

#--
# Copyright (c) 2009, Chris Beer, chris@authoritativeopinion.com
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#++

require 'rubygems'
require 'uri'
require 'cgi'
require 'net/http'
require 'rexml/document'
require 'rest_client'
require 'stomp'
require 'openwfe/engine'
require 'openwfe/participants'
require 'tempfile'
require 'set'

gem 'soap4r'
require 'FedoraAPIMDriver'
require 'FedoraMessage'


$driver = FedoraAPIM.new
$driver.options["protocol.http.basic_auth"] << ['http://localhost:8080/fedora/services/management', 'fedoraAdmin', 'fedora']

#Workflow engine
$engine = OpenWFE::Engine.new(:definition_in_launchitem_allowed => true)

# Participants
$engine.register_participant('file', OpenWFE::FileParticipant)
require 'participants/dam'
require 'participants/solr'
require 'participants/bagit'

#Message queue
client = Stomp::Client.open "stomp://localhost:61613"

client.subscribe('/topic/fedora.apim.update') do |msg|
  begin
    FedoraMessage.new(msg)
  rescue StandardError => e
    # Log the failure and keep the subscription alive
    puts e
  end
end

client.join

Each message received triggers the message handler, which asks a Fedora disseminator for an XML workflow definition. This makes it easy to tailor custom workflows per object type and, with some XSLT magic, per message type:

class FedoraMessage

  def initialize( msg )
    @msg = msg
    @doc = REXML::Document.new msg.body
    @repository_uri = REXML::XPath.first( @doc, "//author/uri").text
    @type = REXML::XPath.first( @doc, "//title" ).text
    @pid = REXML::XPath.first( @doc, "//summary" ).text
    @dsID = REXML::XPath.first( @doc, '//category[@scheme="fedora-types:dsID"]/@term' ).to_s

    handle_msg
  end

  #
  # Handle an incoming message
  #
  def handle_msg
    begin
      d = get_definition @pid + '/sdef:WORKFLOW'
      launch_workflow d
      #select $a from <#ri> where $a   $b
   # rescue
      #log
    end

    if self.respond_to? @type
      self.send @type
    end
  end


  #
  # Get the appropriate workflow definition from the repository
  #
  def get_definition( action = nil, params = {} )
    # Fall back to the escaped PID when no action path is given
    if action.nil? || action.empty?
      action = CGI::escape(@pid)
    end
    RestClient.get( [@repository_uri, "get", action, @type].compact.join('/') + hash_to_qs(params) )
  end

  def hash_to_qs(args={})
    return '' if args.empty?
    # URI.encode was removed in Ruby 3.0; CGI.escape (already required) covers query strings
    '?' + args.map { |k,v| "%s=%s" % [CGI.escape(k.to_s), CGI.escape(v.to_s)] }.join('&')
  end

  #
  # Launch a Ruote workflow process
  #
  def launch_workflow( definition, params={} )
    li = OpenWFE::LaunchItem.new definition

    li.repository_uri = @repository_uri
    li.pid = @pid
    li.type = @type
    li.msg = @msg

    params.each do |k,v|
      li[k] = v
    end

    fei = $engine.launch li
  end

#--
# Relationship workflow handlers
#++
[...]
end
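
To make the XPath expressions in the handler concrete, here is a minimal sketch that runs them against a sample message and then assembles the dissemination URL the way get_definition does. The sample entry is an assumption modelled on the Atom entries Fedora publishes for API-M calls (namespace declarations are omitted for brevity, and the PID demo:1, the datastream DC, and the repository URL are illustrative). parse_fedora_message and definition_url are hypothetical helper names, not part of the handler above.

```ruby
require 'rexml/document'

# Illustrative message body; a real Fedora message carries more fields
# and Atom namespace declarations.
SAMPLE_MESSAGE = <<~'XML'
  <entry>
    <author>
      <name>fedoraAdmin</name>
      <uri>http://localhost:8080/fedora</uri>
    </author>
    <title type="text">modifyDatastreamByValue</title>
    <category term="demo:1" scheme="fedora-types:pid"/>
    <category term="DC" scheme="fedora-types:dsID"/>
    <summary type="text">demo:1</summary>
  </entry>
XML

# Hypothetical helper: extracts the same four fields as FedoraMessage#initialize.
def parse_fedora_message(body)
  doc = REXML::Document.new(body)
  {
    :repository_uri => REXML::XPath.first(doc, '//author/uri').text,
    :type           => REXML::XPath.first(doc, '//title').text,
    :pid            => REXML::XPath.first(doc, '//summary').text,
    :ds_id          => REXML::XPath.first(doc, '//category[@scheme="fedora-types:dsID"]/@term').to_s
  }
end

# Hypothetical helper: mirrors the path that get_definition builds.
def definition_url(fields)
  [fields[:repository_uri], 'get', fields[:pid] + '/sdef:WORKFLOW', fields[:type]].join('/')
end

fields = parse_fedora_message(SAMPLE_MESSAGE)
puts fields.inspect
puts definition_url(fields)
```

Because the message type ends up as the last path segment, the disseminator behind sdef:WORKFLOW can return a different definition for each kind of API-M call.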

The workflow definition returned by sdef:WORKFLOW can then trigger an update of the Solr search index (useful if one wants more advanced routing than GSearch offers, say).
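
A definition along these lines might look like the sketch below. It is an assumption, not the original definition: the element names follow the OpenWFEru XML process-definition syntax as I understand it (process-definition, sequence, if with a test attribute, participant ref), the process name and revision are made up, and the check for D (deleted) relies on solr_prepare storing the object state in the status field. The Ruby wrapper only parses the XML and lists the participants it references, so it runs without Ruote installed.

```ruby
require 'rexml/document'

# Hypothetical workflow definition, as sdef:WORKFLOW might return it.
# Intended reading: the first child of <if> runs when the test holds,
# the second child otherwise.
WORKFLOW_XML = <<~'XML'
  <process-definition name="solr-index" revision="0.1">
    <sequence>
      <participant ref="solr_prepare"/>
      <if test="${f:status} == D">
        <participant ref="solr_purge"/>
        <participant ref="solr_update"/>
      </if>
      <participant ref="solr_close"/>
    </sequence>
  </process-definition>
XML

# Parse the definition and collect the participant names in document order.
doc  = REXML::Document.new(WORKFLOW_XML)
refs = REXML::XPath.match(doc, '//participant').map { |p| p.attributes['ref'] }
puts refs.join(', ')
```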

Finally, a couple of small Solr participants make the magic happen:

require 'solr'
require 'add_xml_document'

$conn_solr = Solr::Connection.new('http://localhost:8983/solr', :autocommit => :on)

$engine.register_participant(:solr_prepare) do |workitem|
  # Fetch the object profile and record the object state on the workitem
  profile = RestClient.get workitem.repository_uri + "/objects/" + workitem.pid + "?format=xml", :accept => 'text/xml'
  profile = REXML::Document.new profile
  workitem.status = REXML::XPath.first( profile, "//objState").text
end

$engine.register_participant(:solr_update) do |workitem|
  # Ask the repository for the Solr add-document XML and post it to the index
  data = RestClient.get workitem.repository_uri + "/get/" + workitem.pid + "/sdef:METADATA/SolrXML"
  doc = Solr::Request::AddXMLDocument.new data
  $conn_solr.post doc
  $conn_solr.commit
end

$engine.register_participant(:solr_purge) do |workitem|
  $conn_solr.delete workitem.pid
end

$engine.register_participant(:solr_close) do |workitem|
  # purge cache?
end
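
As a sanity check on solr_prepare, the sketch below runs the same //objState lookup against a sample object profile and maps the result to an indexing action. The profile is an illustrative, trimmed-down assumption of what the Fedora REST API returns for /objects/{pid}?format=xml (namespace omitted), and index_action_for is a hypothetical helper. Treating D (deleted) as a purge and every other state as an update is one possible policy, not something the participants above enforce.

```ruby
require 'rexml/document'

# Illustrative object profile; a real one carries more fields.
SAMPLE_PROFILE = <<~'XML'
  <objectProfile pid="demo:1">
    <objLabel>Sample object</objLabel>
    <objState>D</objState>
  </objectProfile>
XML

# Same lookup as the solr_prepare participant.
def object_state(profile_xml)
  REXML::XPath.first(REXML::Document.new(profile_xml), '//objState').text
end

# Hypothetical policy: purge deleted objects, (re)index everything else.
def index_action_for(state)
  state == 'D' ? :solr_purge : :solr_update
end

state = object_state(SAMPLE_PROFILE)
puts "#{state} -> #{index_action_for(state)}"
```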