LDPath in 3 examples

At Code4Lib 2015, I gave a quick lightning talk on LDPath, a declarative domain-specific language for flattening linked data resources to a hash (e.g. for indexing to Solr).

LDPath can traverse the Linked Data Cloud as easily as it works with local resources, and it can cache remote resources for future access. The LDPath language is also (generally) implementation-independent (Java, Ruby) and relatively easy to implement. The language also lends itself to integration within development environments (e.g. ldpath-angular-demo-app, with context-aware autocompletion and real-time responses). For me, working with the LDPath language and implementation was the first time that linked data moved from being a good idea to being a practical solution to some problems.

Here is a selection from the VIAF record [1]:

   void:inDataset <../data> ;
   a genont:InformationResource, foaf:Document ;
   foaf:primaryTopic <../65687612> .

   schema:alternateName "Bittman, Mark" ;
   schema:birthDate "1950-02-17" ;
   schema:familyName "Bittman" ;
   schema:givenName "Mark" ;
   schema:name "Bittman, Mark" ;
   schema:sameAs <http://d-nb.info/gnd/1058912836>, <http://dbpedia.org/resource/Mark_Bittman> ;
   a schema:Person ;
   rdfs:seeAlso <../182434519>, <../310263569>, <../314261350>, <../314497377>, <../314513297>, <../314718264> ;
   foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Mark_Bittman> .

We can use LDPath to extract the person’s name:

So far, this is not so different from traditional approaches. But, if we look deeper in the response, we can see other resources, including books by the author.

    schema:creator <../65687612> ;
    schema:name "How to Cook Everything : Simple Recipes for Great Food" ;
    a schema:CreativeWork .

We can traverse the links to include the titles in our record:

LDPath also gives us the ability to write this query using a reverse property selector, e.g.:

books = foaf:primaryTopic / ^schema:creator[rdf:type is schema:CreativeWork] / schema:name :: xsd:string ;
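
To make the forward and reverse selectors concrete, here is a minimal sketch in plain Ruby that walks a tiny in-memory list of triples. It is not the LDPath implementation; the helper names (`forward`, `reverse`, `with_type`, `flatten_record`) and the shortened identifiers `doc`, `viaf:65687612`, and `book:1` are hypothetical stand-ins for the URIs in the excerpts above.

```ruby
# Toy graph of [subject, predicate, object] triples based on the excerpts
# above. "doc", "viaf:65687612" and "book:1" are made-up short identifiers.
TRIPLES = [
  ["doc", "foaf:primaryTopic", "viaf:65687612"],
  ["viaf:65687612", "rdf:type", "schema:Person"],
  ["viaf:65687612", "schema:name", "Bittman, Mark"],
  ["book:1", "rdf:type", "schema:CreativeWork"],
  ["book:1", "schema:creator", "viaf:65687612"],
  ["book:1", "schema:name", "How to Cook Everything : Simple Recipes for Great Food"]
].freeze

# Forward step: follow `predicate` from each node to its objects.
def forward(nodes, predicate)
  nodes.flat_map do |node|
    TRIPLES.select { |s, p, _| s == node && p == predicate }.map { |t| t[2] }
  end
end

# Reverse step (the `^` selector): find the subjects that point at each node.
def reverse(nodes, predicate)
  nodes.flat_map do |node|
    TRIPLES.select { |_, p, o| o == node && p == predicate }.map { |t| t[0] }
  end
end

# Test step (the `[rdf:type is ...]` filter): keep nodes of the given type.
def with_type(nodes, type)
  nodes.select { |node| forward([node], "rdf:type").include?(type) }
end

# Flatten the document into a hash, roughly what the `books` path expresses.
def flatten_record(doc)
  topic = forward([doc], "foaf:primaryTopic")
  creators_of = reverse(topic, "schema:creator")
  {
    "name"  => forward(topic, "schema:name"),
    "books" => forward(with_type(creators_of, "schema:CreativeWork"), "schema:name")
  }
end

puts flatten_record("doc").inspect
```

The reverse step is the interesting part: the book points at the person through `schema:creator`, so starting from the person we have to look for triples where the person is the object.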

The resource links out to some external resources, including a link to DBpedia. Here is a selection from the record in DBpedia:

    dbpedia-owl:abstract "Mark Bittman (born c. 1950) is an American food journalist, author, and columnist for The New York Times."@en, "Mark Bittman est un auteur et chroniqueur culinaire américain. Il a tenu une chronique hebdomadaire pour le The New York Times, appelée The Minimalist (« le minimaliste »), parue entre le 17 septembre 1997 et le 26 janvier 2011. Bittman continue d'écrire pour le New York Times Magazine, et participe à la section Opinion du journal. Il tient également un blog."@fr ;
    dbpedia-owl:birthDate "1950+02:00"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    dbpprop:name "Bittman, Mark"@en ;
    dbpprop:shortDescription "American journalist, food writer"@en ;
    dc:description "American journalist, food writer", "American journalist, food writer"@en ;
    dcterms:subject <http://dbpedia.org/resource/Category:1950s_births>, <http://dbpedia.org/resource/Category:American_food_writers>, <http://dbpedia.org/resource/Category:American_journalists>, <http://dbpedia.org/resource/Category:American_television_chefs>, <http://dbpedia.org/resource/Category:Clark_University_alumni>, <http://dbpedia.org/resource/Category:Living_people>, <http://dbpedia.org/resource/Category:The_New_York_Times_writers> ;

LDPath allows us to transparently traverse that link and extract the subjects for the VIAF record:

[1] If you’re playing along at home, note that, as of this writing, VIAF.org fails to correctly implement content negotiation and returns HTML if it appears anywhere in the Accept header, e.g.:

curl -H "Accept: application/rdf+xml, text/html; q=0.1" -v http://viaf.org/viaf/152427175/

will return a text/html response. This may cause trouble for your linked data clients.
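
For comparison, here is a minimal sketch of how a server could honor the q-values in that Accept header. It is a simplification under stated assumptions: it ignores wildcards such as `*/*` and any parameters other than `q`, and the function names are hypothetical.

```ruby
# Parse an Accept header into [media_type, q] pairs, highest q first.
# A media type without an explicit q parameter defaults to q=1.0.
def parse_accept(header)
  pairs = header.split(",").map do |part|
    type, *params = part.strip.split(";").map(&:strip)
    q = params.map { |param| param[/\Aq=([0-9.]+)\z/, 1] }.compact.first
    [type, q ? q.to_f : 1.0]
  end
  pairs.sort_by { |_, q| -q }
end

# Pick the client's most-preferred type that the server can produce.
def negotiate(header, available)
  parse_accept(header).map(&:first).find { |type| available.include?(type) }
end

accept = "application/rdf+xml, text/html; q=0.1"
puts negotiate(accept, ["text/html", "application/rdf+xml"])
```

With the header from the curl example, RDF/XML carries the default q of 1.0 and HTML only 0.1, so a server that can produce both should prefer RDF/XML.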

Building a Pivotal Tracker IRC bot with Sinatra and Cinch

We're using Pivotal Tracker on the Fedora Futures project. We also have an IRC channel where the tech team hangs out most of the day, letting each other know what we're working on and which tickets we're taking, and giving each other feedback on those tickets. To document this, we try to put most of our discussion in the tickets for future reference (although we are logging the IRC channel, it's not nearly as easy to look up decisions there). Because we're (lazy) developers, we wanted updates in Pivotal to get surfaced in the IRC channel. There was an existing IRC bot, Pivotal-Tracker-IRC-bot, but it was designed to push and pull data from Pivotal based on commands in IRC (and seems fairly abandoned). So, naturally, we built our own integration: Pivotal-IRC. This was my first time using Cinch to build a bot, and it was a surprisingly pleasant and straightforward experience:
bot = Cinch::Bot.new do
  configure do |c|
    c.nick = $nick
    c.server = $irc_server
    c.channels = [$channel]
  end
end

# launch the bot in a separate thread, because we're using this one for the webapp.
Thread.new {
  bot.start
}
And we have a really tiny Sinatra app that can parse the Pivotal Webhooks payload and funnel it into the channel:
post '/' do
  # parse the webhook payload and post a one-line summary to the first channel
  message = Pivotal::WebhookMessage.new request.body.read
  bot.channel_list.first.msg("#{message.description} #{message.story_url}")
end
It turns out we also send links to Pivotal tickets not infrequently, and building two-way communication (using the Pivotal REST API, and the handy pivotal-tracker gem) was also easy. Cinch exposes a handy DSL that parses messages using regular expressions and capturing groups:
bot.on :message, /story\/show\/([0-9]+)/ do |m, ticket_id|
  # ticket_id is the first capture group from the pattern above
  story = project.stories.find(ticket_id)
  m.reply "#{story.story_type}: #{story.name} (#{story.current_state}) / owner: #{story.owned_by}"
end
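
The capture-group behavior can be tried outside of Cinch with plain Ruby. The sketch below reuses the same pattern; the helper name `ticket_id_from` and the sample URL are made up for illustration.

```ruby
# The same pattern the handler above listens for.
STORY_PATTERN = /story\/show\/([0-9]+)/

# Return the ticket id captured from a chat message, or nil if there is none.
def ticket_id_from(message)
  match = STORY_PATTERN.match(message)
  match && match[1]
end

# The ticket number below is invented for the example.
puts ticket_id_from("see https://www.pivotaltracker.com/story/show/12345 for details").inspect
puts ticket_id_from("no ticket link in this message").inspect
```

Note that the captured id is a string, which is what the handler receives as its second block argument.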

Real-time statistics with Graphite, Statsd, and GDash

We have a Graphite-based stack of real-time visualization tools, including the data aggregator Statsd. These tools let us easily record real-time data from arbitrary services with minimal fuss. We present some curated graphs through GDash, a simple Sinatra front-end. For example, we record the time it takes for Solr to respond to queries from our SearchWorks catalog, using this simple bash script:
tail -f /var/log/tomcat6/catalina.out | ruby solr_stats.rb

(We rotate these logs through truncation; for logs that are moved away when rotated, use `tail -F`, which is shorthand for `--follow=name --retry`.)

And the ruby script that does the actual parsing:
require 'statsd.rb'

STATSD = Statsd.new(...,8125)

# Listen to stdin
while str = gets
  if str =~ /QTime=([^ ]+)/
    # extract the QTime
    ms = $1.to_i

    # record it, based on our hostname
    STATSD.timing("#{ENV['HOSTNAME'].gsub('.', '-')}.solr.qtime", ms)
  end
end
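
For readers without the client library at hand, here is a minimal sketch of what a timing call boils down to: a small UDP datagram of the form `<metric>:<value>|ms`. The helper names, the sample log line, and the hostname are assumptions for illustration, and `send_timing` is defined but not called so the example runs without a Statsd server.

```ruby
require 'socket'

# Pull the QTime (in ms) out of a Solr log line, or nil if it is absent.
def qtime_from(line)
  line[/QTime=([^ ]+)/, 1]&.to_i
end

# Build a Statsd timing datagram: "<metric>:<value>|ms".
def timing_packet(hostname, ms)
  "#{hostname.gsub('.', '-')}.solr.qtime:#{ms}|ms"
end

# Fire-and-forget over UDP. The default host and port are assumptions.
def send_timing(packet, host = "127.0.0.1", port = 8125)
  socket = UDPSocket.new
  socket.send(packet, 0, host, port)
ensure
  socket&.close
end

# An illustrative log line, not copied from a real server.
line = "INFO: [core] webapp=/solr path=/select params={q=*:*} hits=42 status=0 QTime=17 "
ms = qtime_from(line)
puts timing_packet("search1.example.org", ms) if ms
```

Dots in the hostname are replaced with dashes because Graphite treats dots as path separators in metric names.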
From this data, we can start asking questions like:
Is our load-balancer configured optimally?
(hint: not quite; for a variety of reasons, we've sacrificed some marginal performance benefit for this non-invasive, simpler load-balancing configuration.)
Why are our 90th-percentile query times creeping up? (time in ms)

(Answers to these questions and more in a future post, I'm sure.)

We also use this setup to monitor other services, e.g.:
What's happening in our Fedora instance (and, which services are using the repository)?
Note the red line ("warn_0") in the top graph. It marks the point where our (asynchronous) indexing system is unable to keep up with demand, and updates may appear after a delay. Given time (and sufficient data, of course), this also gives us the ability to forecast and plan for issues:
  • Is our Solr query time getting worse? (Graphite can perform some basic manipulation, including taking integrals and derivatives)
  • What is the rate of growth of our indexing backlog, and, can we process it in a reasonable timeframe, or should we scale the indexer service?
  • Given our rate of disk usage, are we on track to run out of disk space this month? this week?
If we build graphs to monitor those conditions, we can add Nagios checks to trigger service alerts. GDash helpfully exposes a REST endpoint that reports whether a graph has crossed its WARN or CRITICAL threshold. We currently have a home-grown system-monitoring tool that we're tempted to fold into this setup as well. I've been evaluating Diamond, which seems to do a pretty good job of collecting granular system statistics (CPU, RAM, IO, disk space, etc.).
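
As a rough illustration of the disk-space question, here is a minimal sketch that fits a straight line to usage samples and extrapolates to capacity. It is plain Ruby rather than anything Graphite or GDash provides, the sample numbers are invented, and it assumes at least two samples taken at different times.

```ruby
# Least-squares slope and intercept for a list of [x, y] samples.
def linear_fit(samples)
  n = samples.size.to_f
  mean_x = samples.sum { |x, _| x } / n
  mean_y = samples.sum { |_, y| y } / n
  covariance = samples.sum { |x, y| (x - mean_x) * (y - mean_y) }
  variance = samples.sum { |x, _| (x - mean_x)**2 }
  slope = covariance / variance
  [slope, mean_y - slope * mean_x]
end

# Days from the last sample until usage reaches capacity, or nil when
# usage is flat or shrinking.
def days_until_full(samples, capacity)
  slope, intercept = linear_fit(samples)
  return nil unless slope > 0
  (capacity - intercept) / slope - samples.last[0]
end

# Hypothetical samples: [day, GB used], against a 500 GB volume.
usage = [[0, 400.0], [1, 410.0], [2, 420.0], [3, 430.0]]
puts days_until_full(usage, 500.0)
```

A straight line is the simplest possible model; bursty workloads would need something better, but it is enough to turn "this month? this week?" into a number that can be compared against a threshold.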

Icemelt: A stand-in for integration tests against AWS Glacier

One of the threads we've been pursuing as part of the Fedora Futures project is integration with asynchronous and/or very slow storage. We've taken on AWS Glacier as a prime, generally accessible example. Uploading content is slow, but can be done synchronously in one API request:
POST /:account_id/vaults/:vault_id/archives
x-amz-archive-description: Description
...Request body (aka your content)...
Where things get radically different is when requesting content back. First, you let Glacier know you'd like to retrieve your content:
POST /:account_id/vaults/:vault_id/jobs HTTP/1.1

{
  "Type": "archive-retrieval",
  "ArchiveId": String,
  ...
}
Then, you wait. And wait. And wait some more. From the documentation:
Most Amazon Glacier jobs take about four hours to complete. You must wait until the job output is ready for you to download. If you have either set a notification configuration on the vault identifying an Amazon Simple Notification Service (Amazon SNS) topic or specified an Amazon SNS topic when you initiated a job, Amazon Glacier sends a message to that topic after it completes the job. [emphasis added]


If you're iterating on some code, waiting hours to get your content back isn't realistic. So, we wrote a quick Sinatra app called Icemelt to mock the Glacier REST API (which perhaps took less time to code than retrieving content from Glacier would have). We've tested it using the Ruby Fog client, as well as the official AWS Java SDK, and it actually works! Your content gets stored locally, and the delay for retrieving content is configurable (default: 5 seconds). Configuring the official SDK looks something like this:
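PropertiesCredentials credentials = new PropertiesCredentials(
    /* credentials File or InputStream; elided in the original */);
AmazonGlacierClient client = new AmazonGlacierClient(credentials);
// Assumption: point the client at the local Icemelt server, using the same host and port as the Fog example below.
client.setEndpoint("http://localhost:3000");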
And for Fog, something like:
Fog::AWS::Glacier.new :aws_access_key_id => '',
                      :aws_secret_access_key => '', 
                      :scheme => 'http', 
                      :host => 'localhost', 
                      :port => '3000'
Right now, Icemelt skips a lot of unnecessary work (e.g. checking HMAC digests for authentication, validating hashes, etc), but, as always, patches are very welcome.
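
The upload, initiate, wait, and download flow described above can be sketched with a tiny in-memory store. This is not Icemelt's code; the class and method names are hypothetical, and an injectable clock stands in for real waiting so the example runs instantly.

```ruby
require 'securerandom'

# In-memory stand-in for an archive store with delayed retrieval.
class DelayedArchiveStore
  def initialize(delay_seconds: 5, clock: -> { Time.now })
    @delay = delay_seconds
    @clock = clock
    @archives = {}
    @jobs = {}
  end

  # Store content synchronously and return an archive id.
  def upload(content)
    id = SecureRandom.hex(8)
    @archives[id] = content
    id
  end

  # Ask for an archive back; the job only completes after the delay.
  def initiate_retrieval(archive_id)
    raise KeyError, "unknown archive #{archive_id}" unless @archives.key?(archive_id)
    job_id = SecureRandom.hex(8)
    @jobs[job_id] = { archive_id: archive_id, ready_at: @clock.call + @delay }
    job_id
  end

  def completed?(job_id)
    @clock.call >= @jobs.fetch(job_id)[:ready_at]
  end

  # Return the content, or nil while the job is still pending.
  def job_output(job_id)
    completed?(job_id) ? @archives[@jobs.fetch(job_id)[:archive_id]] : nil
  end
end

# A fake clock we can advance by hand instead of sleeping.
now = Time.at(0)
store = DelayedArchiveStore.new(delay_seconds: 5, clock: -> { now })
job = store.initiate_retrieval(store.upload("hello"))
puts store.job_output(job).inspect # the job is still pending here
now += 5
puts store.job_output(job).inspect # the delay has elapsed
```

Keeping the delay configurable is what makes a mock like this useful for tests: client code still has to handle the pending state, but a test suite does not have to wait hours to exercise it.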

The self-healing repository

One of the goals of the Fedora Futures project is to give repository administrators the tools they need to provide a highly available, scalable, resilient repository to their consumers.

Although Fedora 3.x does have support for replication and mirroring for high-availability, it is neither widely implemented nor easily scalable. As I understand it, each slave repository keeps a complete, independent copy of the data, making it, perhaps, not economical to maintain a large cluster to support either large read- or write-heavy work.

We're building the current Fedora Futures prototype on top of Modeshape and Infinispan, which come out of the box with support for clustering. Infinispan provides two main clustering modes, replication and distribution, and can be configured by the repository administrator to balance the needs of high availability, scalability, and durability.

Replication means when one node receives any "modify"-type request, it will replicate that change to every node in the cluster. Distribution, however, is a tunable pattern that allows you to set the number of copies that should be maintained in the cluster. In distribution, if numOwners=1, you have traditional sharding where a single copy of the data is maintained in the cluster; if numOwners=m, though, you can remove m - 1 nodes from the cluster and maintain availability.
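
To illustrate the numOwners idea, here is a minimal sketch that assigns each key to a fixed number of owner nodes. It uses a rendezvous-style ranking (hash each key and node pair, take the first few) purely for illustration; this is not Infinispan's actual consistent-hash implementation, and the node and key names are made up.

```ruby
require 'digest'

# Rank the nodes for a key by hashing each (key, node) pair; the first
# num_owners nodes in that ranking each hold a copy of the key.
def owners_for(key, nodes, num_owners)
  nodes.sort_by { |node| Digest::SHA1.hexdigest("#{key}:#{node}") }.first(num_owners)
end

# A key stays available as long as at least one of its owners is still up.
def available?(key, nodes, num_owners, failed)
  (owners_for(key, nodes, num_owners) - failed).any?
end

nodes = %w[node-a node-b node-c node-d]
owners = owners_for("object-1", nodes, 3)
puts owners.inspect
# Losing any two of the three owners still leaves one copy.
puts available?("object-1", nodes, 3, owners.first(2))
# Losing all three owners does not.
puts available?("object-1", nodes, 3, owners)
```

With num_owners set to 1 this degenerates to plain sharding, where the loss of a single owner makes the key unavailable.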

The different clustering configurations look something like:

<!-- REPLICATION -->
<clustering mode="replication">
</clustering>
<!-- SHARDING -->
<clustering mode="distribution">
 <l1 enabled="false" lifespan="0" onRehash="false"/>
 <hash numOwners="1"/>
 <stateTransfer fetchInMemoryState="true"/>
</clustering>
<!-- DISTRIBUTION; keep 3 copies of the data -->
<clustering mode="distribution">
 <l1 enabled="false" lifespan="0" onRehash="false"/>
 <hash numOwners="3"/>
 <stateTransfer fetchInMemoryState="true"/>
</clustering>

When you add a new node to the cluster, the cluster will re-balance objects from other nodes to the new node; when you remove a node, the cluster will redistribute existing data to maintain the distribution guarantees.

(I'll also note Infinispan's cross-site replication feature, which allows different clustering configurations between the replicated sites. In a repository context, perhaps this could be used to ensure 2 copies are kept in spinning-disk cache stores, but only 1 needs to be kept on tape.)

A brief note about fixity

Fixity checks in a single-node configuration are relatively simple. Your service can request the data from storage, compute checksums and compare them against stored values. With clustering modes, we'd also have to check each copy of the data. At least in Modeshape (and Infinispan), the high-level APIs do not seem to provide that kind of visibility down at the cache store levels. We'll have to dig deep into the Infinispan API to locate the data in a particular cache store and run our checks.

Once you bring true fixity checks this close to the repository, you could even start building intelligent, self-healing repositories that identify corrupt data, evict it, and let the cluster rebalance automatically. You could even reject suspect nodes, while bringing up new nodes to replace them.
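
The per-copy check at the heart of that idea can be sketched in a few lines. This is a plain Ruby illustration, not Modeshape or Infinispan API: the `copies` hash stands in for whatever a cache-store-level lookup would return, and the node names and content are invented.

```ruby
require 'digest'

# Return the names of the stores whose copy no longer matches the recorded
# checksum. `copies` maps a store name to the bytes held by that store.
def corrupt_copies(copies, expected_sha256)
  copies.reject { |_, bytes| Digest::SHA256.hexdigest(bytes) == expected_sha256 }.keys
end

original = "repository object content"
checksum = Digest::SHA256.hexdigest(original)

# Hypothetical cluster state: the copy on node-b has been damaged.
copies = {
  "node-a" => original,
  "node-b" => "repository object c0ntent",
  "node-c" => original
}

puts corrupt_copies(copies, checksum).inspect
```

The list of corrupt stores is what a self-healing step would act on, evicting those copies so the cluster can restore them from the healthy ones.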