Solr Data Import Handler

This week, I had the opportunity to write a data import handler (DIH) configuration for the Solr search server, which elegantly mapped a MySQL database to the Solr schema. Before this, I had been writing small scripts with XML output, because the scope of the underlying data wasn't neatly contained in a single document or database. The DIH is a new feature in Solr 1.3, and it really does seem to make integrating search almost trivial, to the point where anyone who can write an SQL query can begin replacing the built-in fulltext engines with a Solr service, gaining more flexibility, efficient faceting, and a document-centric view appropriate for search. The basic skeleton looked something like this:
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" batchSize="-1" url="jdbc:mysql://localhost:3306/cms?zeroDateTimeBehavior=convertToNull" user="root" />
  <document name="doc">
    <entity transformer="RegexTransformer" name="page" query="SELECT ... FROM ... JOIN ... JOIN ... JOIN ...">
      <field column="title" name="dc.title" />
      [...]
      <field column="names" splitBy="," name="dc.contributor" />
    </entity>
  </document>
</dataConfig>
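For completeness: the handler itself gets registered in solrconfig.xml and is driven over HTTP. Assuming the configuration above is saved as data-config.xml on a stock Solr 1.3 install, the registration looks something like:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

and a full import is then kicked off by requesting http://localhost:8983/solr/dataimport?command=full-import.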
A couple of things to note. In the dataSource configuration, I've set batchSize="-1", which lowers the number of rows kept in memory and prevents Solr (and the servlet engine) from running out of memory. Second, in the JDBC URL, I'm using zeroDateTimeBehavior=convertToNull, which is a very easy way of dealing with those pesky "0000-00-00 00:00:00" dates that normally come out of the database, and allows Solr to gracefully skip the field. Finally, in some multivalued field declarations (like names -> dc.contributor), I'm using the RegexTransformer and its splitBy attribute to reverse a MySQL GROUP_CONCAT() field, which at least saves a query (and pushes more of the data-marshaling logic into the SQL query, leaving the Solr mapping fairly straightforward). The Solr transformers look incredibly powerful and are almost certainly worth pursuing further. One update I eagerly await is the integration of the DIH with Solr Cell, a text+metadata extraction service, under SOLR-1358, which would let you merge previously extracted (or entered) metadata with the fulltext of documents. When that feature is added, I think I can pretty much give up on my transforming scripts and switch to the DIH for all purposes.
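To make the GROUP_CONCAT()/splitBy pairing concrete, here's a minimal sketch (the page and contributor tables, and their columns, are hypothetical): the query collapses a one-to-many join into a single comma-delimited column, and the RegexTransformer splits it back out into a multivalued field:

<entity transformer="RegexTransformer" name="page"
        query="SELECT p.id, p.title, GROUP_CONCAT(c.name SEPARATOR ',') AS names
               FROM page p JOIN contributor c ON c.page_id = p.id
               GROUP BY p.id">
  <field column="title" name="dc.title" />
  <field column="names" splitBy="," name="dc.contributor" />
</entity>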

Repository workflows

One of the sessions at RIRI '09, and a common theme across many of the conferences I attended this year, was workflows, workflow engines, and tools. Workflows initially came out of "enterprise" systems for managing web services, and moved into the repository in the hands of repository managers, who establish rule-based workflows in advance to manage (primarily) submission and dissemination, using the Business Process Execution Language (BPEL). While this may be perfect for describing complex interactions, the burden of creating the workflows and the additional overhead make this approach seem like overkill. From the scientific community come two more basic workflow systems, Taverna and Kepler. The key difference between these systems and BPEL seems to be the intended audience, which for these two applications is the scientists themselves, looking to manage and marshal their data in a manner specific to their needs. At this point, many of the applications seem ad hoc (although, judging by myExperiment, they seem to be gathering interest in the scientific community). While certainly applicable to making use of the material in a repository environment, their application to repository management seems questionable at this point.

A third option is a programmatic workflow engine like Ruote, which allows one to specify business processes in Ruby, JSON, or an XML syntax, and can be linked with Fedora's Java Message Service (JMS) using Stomp and some Fedora objects. After the fold, I've outlined a very basic Ruote workflow for updating a Solr search index every time a Fedora object is updated, similar to the GSearch plugin. Forgive my rather ugly Ruby code; this is just a quick sketch of a possible service. Here's a simple Fedora JMS/Ruote driver:
#--
# Copyright (c) 2009, Chris Beer, chris@authoritativeopinion.com
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#++

require 'rubygems'
require 'uri'
require 'cgi'
require 'net/http'
require 'rexml/document'
require 'rest_client'
require 'stomp'
require 'openwfe/engine'
require 'openwfe/participants'
require 'tempfile'
require 'set'

gem 'soap4r'
require 'FedoraAPIMDriver'
require 'FedoraMessage'


$driver = FedoraAPIM.new
$driver.options["protocol.http.basic_auth"] << ['http://localhost:8080/fedora/services/management', 'fedoraAdmin', 'fedora']

#Workflow engine
$engine = OpenWFE::Engine.new(:definition_in_launchitem_allowed => true)

# Participants
$engine.register_participant('file', OpenWFE::FileParticipant)
require 'participants/dam'
require 'participants/solr'
require 'participants/bagit'

#Message queue
client = Stomp::Client.open "stomp://localhost:61613"

client.subscribe('/topic/fedora.apim.update') do |msg|
  begin
    # each APIM event spawns a FedoraMessage, which looks up and launches
    # the appropriate workflow
    FedoraMessage.new(msg)
  rescue StandardError => e
    # log and carry on; one bad message shouldn't kill the listener
    puts e
  end
end

client.join
This triggers the message handler, which asks a Fedora disseminator for an XML workflow definition (an approach that makes it easy to tailor custom workflows per object type and, with some XSLT magic, per message type):
class FedoraMessage

  def initialize( msg )
    # Fedora publishes APIM events as Atom entries; pull out the repository
    # URI, the API method name (title), the pid (summary) and the datastream id
    @msg = msg
    @doc = REXML::Document.new msg.body
    @repository_uri = REXML::XPath.first( @doc, "//author/uri").text
    @type = REXML::XPath.first( @doc, "//title" ).text
    @pid = REXML::XPath.first( @doc, "//summary" ).text
    @dsID = REXML::XPath.first( @doc, '//category[@scheme="fedora-types:dsID"]/@term' ).to_s

    handle_msg
  end

  #
  # Handle an incoming message
  #
  def handle_msg
    begin
      # ask the object's WORKFLOW disseminator for a process definition
      d = get_definition @pid + '/sdef:WORKFLOW'
      launch_workflow d
    rescue StandardError => e
      # not every object carries a workflow; log and move on
      puts e
    end

    # dispatch to a method-specific handler for this message type, if any
    if self.respond_to? @type
      self.send @type
    end
  end


  #
  # Get the appropriate workflow definition from the repository
  #
  def get_definition( action = nil, params = {} )
    if action.nil? || action.empty?
      action = CGI::escape(@pid)
    end
    RestClient.get( [@repository_uri, "get", action, @type].compact.join('/') + hash_to_qs(params) )
  end

  def hash_to_qs(args={})
    return '' if args.empty?
    '?' + args.map { |k,v| "%s=%s" % [URI.encode(k.to_s), URI.encode(v.to_s)] }.join('&')
  end

  #
  # Launch a Ruote workflow process
  #
  def launch_workflow( definition, params={} )
    li = OpenWFE::LaunchItem.new definition

    li.repository_uri = @repository_uri
    li.pid = @pid
    li.type = @type
    li.msg = @msg

    params.each do |k,v|
      li[k] = v
    end

    fei = $engine.launch li
  end

#--
# Relationship workflow handlers
#++
[...]
end
The workflow definition returned by this sdef:WORKFLOW disseminator could be a way to trigger updating a Solr search index (if one wanted to do some more advanced routing outside the scope of GSearch, say); a definition along these lines checks the object's state and either reindexes or purges it:

<process-definition name="solr_index">
  <sequence>
    <participant ref="solr_prepare" />
    <if>
      <equals field-value="status" other-value="A" />
      <participant ref="solr_update" />
      <participant ref="solr_purge" />
    </if>
  </sequence>
</process-definition>

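Ruote will happily take the same definition in its Ruby DSL instead of XML; a rough equivalent (again, just a sketch):

class SolrIndexDefinition < OpenWFE::ProcessDefinition
  sequence do
    participant :solr_prepare
    _if do
      equals :field_value => 'status', :other_value => 'A'
      participant :solr_update  # object is active: (re)index it
      participant :solr_purge   # otherwise, drop it from the index
    end
  end
end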
and, finally, a couple of small Solr participants that make the magic happen:
require 'solr'
require 'add_xml_document'

$conn_solr = Solr::Connection.new('http://localhost:8983/solr', :autocommit => :on)

$engine.register_participant(:solr_prepare) do |workitem|
  # fetch the object profile and stash its state (A/I/D) in the workitem
  profile = RestClient.get workitem.repository_uri + "/objects/" + workitem.pid + "?format=xml", :accept => 'text/xml'
  profile = REXML::Document.new profile
  workitem.status = REXML::XPath.first( profile, "//objState" ).text
end

$engine.register_participant(:solr_update) do |workitem|
  # pull Solr-ready XML from the object's METADATA disseminator and post it
  data = RestClient.get workitem.repository_uri + "/get/" + workitem.pid + "/sdef:METADATA/SolrXML"
  doc = Solr::Request::AddXMLDocument.new data
  $conn_solr.post doc
  $conn_solr.commit
end

$engine.register_participant(:solr_purge) do |workitem|
  $conn_solr.delete workitem.pid
end

$engine.register_participant(:solr_close) do |workitem|
  # purge cache?
end
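One loose end: Solr::Request::AddXMLDocument (required at the top) isn't part of the solr-ruby gem; it's a tiny custom request class. A minimal sketch, assuming the METADATA disseminator already returns a complete <add><doc>...</doc></add> payload:

# a pass-through update request: the body is already Solr-ready XML
class Solr::Request::AddXMLDocument < Solr::Request::Update
  def initialize(xml)
    @xml = xml
  end

  # solr-ruby posts the return value of to_s to the update handler
  def to_s
    @xml
  end
end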

Video4All: HTML5

Earlier this week, Matt Mastracci released his video4all project, which replaces the HTML5 <video> element with a flowplayer-based alternative in browsers that don't support it. Independently, I've been working on bringing the HTML5 JavaScript API to some video plugins using a JavaScript wrapper. At this point it is still very basic, but hopefully it proves useful or interesting. I've created a basic flowplayer version, which currently requires the Prototype JavaScript library (although it should trivially port to the flowplayer subset). This layer supports functionality like play/pause, volume control, seeking/currentTime, and metadata, as well as more advanced features like cue ranges. Error states and events are not yet supported. Here is a very basic demo that demonstrates play/pause and seeking to a time. I've only tested this in Safari 4 and Firefox 3.5, but I believe it should work in earlier versions. There is some __getter__/__setter__ JavaScript which likely fails in Internet Explorer (although I am aware of a project that offers a workaround, I haven't tried it out yet).

MALLET topic analysis of JCDL + Open Video tweets

I'm working towards some interesting visualizations of the Twitter streams from a number of conferences (starting with JCDL and Open Video this past week). I'm using Judith Bush's very cool gawk script to parse the raw Atom files. My first step was to get topics for the corpus as a whole:

/Applications/mallet/bin/mallet train-topics --input data.mallet --num-topics 10 --output-state topic-stat.gz --output-doc-topics doc-topics --output-topic-keys doc-keys --num-iterations 2000 --optimize-interval 2500

JCDL
0	5	http bit ly org interesting marshall analysis wolf week existing pizza people
1	5	jcdl books data don works problem target foundation facilitate creating
2	5	jcdl libraries evaluation future discussion day multiple public lots univ
3	5	jcdl paper lightweight music back issues funny build dog
4	5	session user talk search talking papers documents great collection type tatted
5	5	conference library good mentors content students focus run building pints
6	5	jcdlgoogle www law participation dl dchud online nice bats duck
7	5	jcdl austin poster google tomorrow small librarian tonight nice
8	5	jcdl digital tags question collections social wikipedia war
9	5	workshop people time quality study alan live archive idea lots
Open Video
0	5	video conference open source net making metadata mozilla developers adobe brokep learned system presentation long openvideo ly app msf
1	5	openvideo ovc tv time gd week vlc html stuff folks nyc platform google meet checking slides startrek kdnlf ll
2	5	media goodman amy watch good great mainstream idea war days im tr flash tpb change put films class devine
3	5	openvideo youtube rt videos session world xenijardin system room art doesn show iran channel film audio totally activism presentation
4	5	openvideo content pirate public live sunde peter cc jardin project creative keynote speaker ogg sweden twitpic licensed seminar fisl
5	5	openvideo people internet talk access day tinyurl conf vid online storytelling awesome working hack digital miro final evolution similar
6	5	openvideo de free en la xeni el years amazing copyright film blog education works closed msurman tk iranian tagged
7	5	openvideo amp ted don great work culture fair back editing question technology site cable id lecture wiki form youtube
8	5	http bit ly check interviews wrap royblumenthal creativecommons based casts ll website footage archives ogg rad blogposts
9	5	openvideo org www openvideoconference make http web watching foss roflmemes put hope sessions online cool launches marketing rest rt
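For reference, the data.mallet file fed to train-topics is MALLET's binary corpus format; it would have been produced from the parsed tweets with an import step along these lines (tweets.txt here is a hypothetical one-tweet-per-line dump of the Atom data):

/Applications/mallet/bin/mallet import-file --input tweets.txt --output data.mallet --keep-sequence --remove-stopwords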
Future work will include temporal analysis and "speaker" analysis.

Developing a Flexible Content Model for Media Repositories: A Case Study

Beer, C., Pinch, P. and Cariani, K. 2009. Developing a Flexible Content Model for Media Repositories: A Case Study. JCDL '09, June 15-19, 2009.
This article describes the process and challenges of developing a content model that can support the content and metadata present in a complex media archive. Media archives have some of the most diverse requirements in an effort to catalog, preserve, and make accessible a wide range of content with multifaceted relationships between works. We focus particularly on the design and implementation of the WGBH Media Library and Archives’ Fedora digital access repository for scholars, educational users and the public. It is our hope that the process and findings from this work can support the architecture and development of other media archives.
Slides as prepared for JCDL '09 paper session 3 are also available.