Building a Pivotal Tracker IRC bot with Sinatra and Cinch

We're using Pivotal Tracker on the Fedora Futures project. We also have an IRC channel where the tech team hangs out most of the day, letting each other know what we're working on, which tickets we're taking, and giving each other feedback on those tickets. In order to document this, we try to put most of the discussion in the tickets for future reference (although we do log the IRC channel, it's not nearly as easy to look up decisions there). Because we're (lazy) developers, we wanted updates in Pivotal to get surfaced in the IRC channel. There is an existing Pivotal-Tracker-IRC-bot, but it was designed to push and pull data from Pivotal based on commands in IRC, and it seems fairly abandoned. So, naturally, we built our own integration: Pivotal-IRC. This was my first time using Cinch to build a bot, and it was a surprisingly pleasant and straightforward experience:
require 'cinch'

bot = Cinch::Bot.new do
  configure do |c|
    c.nick = $nick
    c.server = $irc_server
    c.channels = [$channel]
  end
end

# launch the bot in a separate thread, because we're using this one for the webapp.
Thread.new {
  bot.start
}
And we have a really tiny Sinatra app that can parse the Pivotal Webhooks payload and funnel it into the channel:
post '/' do
  message = Pivotal::WebhookMessage.new request.body.read
  bot.channel_list.first.msg("#{message.description} #{message.story_url}")
end
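The WebhookMessage class can be tiny too. Here's a minimal sketch (not our exact implementation), assuming the v3-era XML activity payload and Nokogiri for parsing; the XPath expressions are assumptions about where the description and story URL live in that payload:
require 'nokogiri'

module Pivotal
  class WebhookMessage
    attr_reader :description, :story_url

    def initialize body
      doc = Nokogiri::XML(body)

      # a human-readable summary of the activity
      @description = doc.at_xpath('//description').text

      # the activity may reference several stories; grab the first URL
      @story_url = doc.at_xpath('//stories/story/url').text
    end
  end
end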
It turns out we also send links to Pivotal tickets not infrequently, and building two-way communication (using the Pivotal REST API via the handy pivotal-tracker gem) was also easy. Cinch exposes a DSL that matches messages against regular expressions and passes the capture groups to your handler:
bot.on :message, /story\/show\/([0-9]+)/ do |m, ticket_id|
  story = project.stories.find(ticket_id)
  m.reply "#{story.story_type}: #{story.name} (#{story.current_state}) / owner: #{story.owned_by}"
end
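The `project` handle in that handler comes from the pivotal-tracker gem. Wiring it up looks roughly like this; the `$pivotal_token` and `$project_id` globals are stand-ins for however you load configuration:
require 'pivotal-tracker'

PivotalTracker::Client.token = $pivotal_token
project = PivotalTracker::Project.find($project_id)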

Real-time statistics with Graphite, Statsd, and GDash

We have a Graphite-based stack of real-time visualization tools, including the data aggregator Statsd. These tools let us easily record real-time data from arbitrary services with minimal fuss. We present some curated graphs through GDash, a simple Sinatra front-end. For example, we record the time it takes for Solr to respond to queries from our SearchWorks catalog, using this simple bash pipeline:
tail -f /var/log/tomcat6/catalina.out | ruby solr_stats.rb

(We rotate these logs in place through truncation; for logs that are moved aside when rotated, use `tail -F`, which follows the file by name and retries after rotation.)

And the ruby script that does the actual parsing:
require 'statsd.rb'

STATSD = Statsd.new(..., 8125) # '...' is your statsd host

# Listen to stdin
while str = gets
  if str =~ /QTime=([^ ]+)/
    # extract the QTime
    ms = $1.to_i

    # record it, based on our hostname
    STATSD.timing("#{ENV['HOSTNAME'].gsub('.', '-')}.solr.qtime", ms)
  end
end
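Counters work the same way. Here's a hypothetical companion script (the log pattern and metric name are made up, and it assumes your statsd client exposes an increment method alongside timing) that counts repository requests from an access log:
require 'statsd.rb'

STATSD = Statsd.new('statsd.example.com', 8125) # placeholder statsd host

# e.g. `tail -f /var/log/httpd/access_log | ruby fedora_stats.rb`
while str = gets
  if str =~ %r{/fedora/objects}
    # bump a counter for every repository request we see
    STATSD.increment("#{ENV['HOSTNAME'].gsub('.', '-')}.fedora.requests")
  end
end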
From the query-time data, we can start asking questions like:
  • Is our load-balancer configured optimally? (Hint: not quite; for a variety of reasons, we've sacrificed some marginal performance benefit for a non-invasive, simpler load-balancer configuration.)
  • Why are our 90th-percentile query times (in ms) creeping up?

(Answers to these questions and more in a future post, I'm sure.)

We also use this setup to monitor other services, e.g.: what's happening in our Fedora instance, and which services are using the repository?
Note the red line ("warn_0") in the top graph: it marks the point where our (asynchronous) indexing system is unable to keep up with demand and updates may appear at a delay. Given time (and sufficient data, of course), this also gives us the ability to forecast and plan for issues:
  • Is our Solr query time getting worse? (Graphite can perform some basic manipulation, including taking integrals and derivatives)
  • What is the rate of growth of our indexing backlog, and, can we process it in a reasonable timeframe, or should we scale the indexer service?
  • Given our rate of disk usage, are we on track to run out of disk space this month? this week?
If we build graphs to monitor those conditions, we can add Nagios checks to trigger service alerts. GDash helpfully exposes a REST endpoint that lets us know when a graph has crossed its WARN or CRITICAL thresholds. We currently have a home-grown system-monitoring setup that we're tempted to fold in here as well; I've been evaluating Diamond, which seems to do a pretty good job of collecting granular system statistics (CPU, RAM, IO, disk space, etc.).
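
If you'd rather check a metric directly, Graphite's render API makes that easy too. Here's a rough sketch of a Nagios-style threshold check against it; the Graphite host, metric name, and thresholds are placeholders, not our real configuration:
require 'net/http'
require 'json'

# with statsd's default prefixes, our Solr timing data ends up under
# stats.timers.<hostname>.solr.qtime.*
metric  = 'stats.timers.solr-example.solr.qtime.upper_90'
warn_at = 500   # ms
crit_at = 1000  # ms

uri  = URI("http://graphite.example.com/render?target=#{metric}&from=-10min&format=json")
data = JSON.parse(Net::HTTP.get(uri))

# datapoints are [value, timestamp] pairs; take the most recent non-null value
latest = data.first['datapoints'].map(&:first).compact.last

if latest.nil? || latest >= crit_at
  puts "CRITICAL: #{metric} = #{latest.inspect}"
  exit 2
elsif latest >= warn_at
  puts "WARNING: #{metric} = #{latest.inspect}"
  exit 1
else
  puts "OK: #{metric} = #{latest}"
  exit 0
end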

Icemelt: A stand-in for integration tests against AWS Glacier

One of the threads we've been pursuing as part of the Fedora Futures project is integration with asynchronous and/or very slow storage. We've taken on AWS Glacier as a prime, generally accessible example. Uploading content is slow, but can be done synchronously in one API request:
POST /:account_id/vaults/:vault_id/archives
x-amz-archive-description: Description
...Request body (aka your content)...
Where things get radically different is when requesting content back. First, you let Glacier know you'd like to retrieve your content:
POST /:account_id/vaults/:vault_id/jobs HTTP/1.1

{
  "Type": "archive-retrieval",
  "ArchiveId": String,
  [...]
}
Then you wait. And wait. And wait some more. From the documentation:
Most Amazon Glacier jobs take about four hours to complete. You must wait until the job output is ready for you to download. If you have either set a notification configuration on the vault identifying an Amazon Simple Notification Service (Amazon SNS) topic or specified an Amazon SNS topic when you initiated a job, Amazon Glacier sends a message to that topic after it completes the job. [emphasis added]

Icemelt

If you're iterating on some code, waiting hours to get your content back isn't realistic. So we wrote a quick Sinatra app called Icemelt to mock the Glacier REST API (which perhaps took less time to code than retrieving content from Glacier does). We've tested it using the Ruby Fog client as well as the official AWS Java SDK, and it actually works! Your content gets stored locally, and the delay for retrieving content is configurable (default: 5 seconds). Configuring the official SDK looks something like this:
PropertiesCredentials credentials = new PropertiesCredentials(
    TestIcemeltGlacierMock.class
        .getResourceAsStream("AwsCredentials.properties"));
AmazonGlacierClient client = new AmazonGlacierClient(credentials);
client.setEndpoint("http://localhost:3000/");
And for Fog, something like:
Fog::AWS::Glacier.new :aws_access_key_id => '',
                      :aws_secret_access_key => '', 
                      :scheme => 'http', 
                      :host => 'localhost', 
                      :port => '3000'
Right now, Icemelt skips a lot of unnecessary work (e.g. checking HMAC digests for authentication, validating hashes, etc), but, as always, patches are very welcome.
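
Because the signing and hash checks are skipped, you can also exercise the mocked endpoints with plain HTTP for a quick smoke test. A sketch, assuming Icemelt mirrors Glacier's paths and response headers (the vault name is arbitrary, and `-` is Glacier's shorthand for "the current account"):
require 'net/http'
require 'json'

http = Net::HTTP.new('localhost', 3000)

# create a vault and upload a (tiny) archive
http.put('/-/vaults/test-vault', '')
upload = http.post('/-/vaults/test-vault/archives', 'Hello, Glacier!',
                   'x-amz-archive-description' => 'smoke test')
archive_id = upload['x-amz-archive-id']

# ask for the archive back...
job = http.post('/-/vaults/test-vault/jobs',
                { 'Type' => 'archive-retrieval', 'ArchiveId' => archive_id }.to_json)
job_id = job['x-amz-job-id']

# ...and, a few (configurable) seconds later, fetch the job output
sleep 6
puts http.get("/-/vaults/test-vault/jobs/#{job_id}/output").body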

The self-healing repository

One of the goals of the Fedora Futures project is to give repository administrators the tools they need in order to successfully provide a highly available, scalable, resilient repository to their consumers.

Although Fedora 3.x does have support for replication and mirroring for high availability, it is neither widely implemented nor easily scalable. As I understand it, each slave repository keeps a complete, independent copy of the data, which, perhaps, makes it uneconomical to maintain a large cluster for either read-heavy or write-heavy workloads.

We're building the current Fedora Futures prototype on top of Modeshape and Infinispan, which come with clustering support out of the box. Infinispan provides two main clustering modes, replication and distribution, which the repository administrator can configure to balance the needs of high availability, scalability, and durability.

Replication means that when one node receives any "modify"-type request, it replicates that change to every node in the cluster. Distribution, however, is a tunable pattern that lets you set the number of copies that should be maintained in the cluster. In distribution, if numOwners=1, you have traditional sharding, where a single copy of the data is maintained in the cluster; if numOwners=m, though, you can remove m - 1 nodes from the cluster and still maintain availability.

The different clustering configurations look something like:

<!-- REPLICATION -->
<clustering mode="replication">
 <sync/>
</clustering>
<!-- SHARDING -->
<clustering mode="distribution">
 <sync/>
 <l1 enabled="false" lifespan="0" onRehash="false"/>
 <hash numOwners="1"/>
 <stateTransfer fetchInMemoryState="true"/>
</clustering>
<!-- DISTRIBUTION; keep 3 copies of the data -->
<clustering mode="distribution">
 <sync/>
 <l1 enabled="false" lifespan="0" onRehash="false"/>
 <hash numOwners="3"/>
 <stateTransfer fetchInMemoryState="true"/>
</clustering>

When you add a new node to the cluster, the cluster will re-balance objects from other nodes to the new node; when you remove a node, the cluster will redistribute existing data to maintain the distribution guarantees.

(I'll also note Infinispan's cross-site replication feature here, which allows different clustering configurations at each replicated site. In a repository context, perhaps this could be used to ensure two copies are kept in spinning-disk cache stores while only one needs to be kept on tape.)

A brief note about fixity

Fixity checks in a single-node configuration are relatively simple: your service can request the data from storage, compute checksums, and compare them against stored values. With clustering modes, we'd also have to check each copy of the data. At least in Modeshape (and Infinispan), the high-level APIs do not seem to provide that kind of visibility down at the cache-store level, so we'll have to dig deep into the Infinispan API to locate the data in a particular cache store and run our checks.
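
For the single-node case, the check itself is barely any code. A sketch, where the `content` and `stored_checksum` accessors are hypothetical stand-ins for however your repository exposes a datastream and the checksum recorded at ingest time:
require 'digest'

# recompute the digest of the stored content and compare it against
# the checksum recorded when the object was ingested
def fixity_ok? datastream
  Digest::SHA1.hexdigest(datastream.content) == datastream.stored_checksum
end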

Once you bring true fixity checks this close to the repository, you could even start building intelligent, self-healing repositories that identify corrupt data, evict it, and let the cluster rebalance automatically. You could even reject suspect nodes while bringing up new nodes to replace them.

Implementing Java Interfaces on JRuby classes

We wanted to implement a class in JRuby that implements a Java interface, to fulfill a dependency injection contract. This is, unfortunately, a much more difficult task than it sounds. Here's our simple Java interface, which defines one public method, "indexObject":
public interface ScriptIndexer {
    public void indexObject(RepositoryProfile profile, 
        String pid, 
        SolrInputDocument doc);
}
And here's what we really wish just worked (in JRuby 1.7.0):
require 'java'

java_package 'org.fcrepo.indexer.solr'

class DemoRubySolrIndexer
  include org.fcrepo.indexer.solr.ScriptIndexer

  def indexObject(profile, pid, doc)
    true
  end
end
Unfortunately, here's what that compiles to (using jrubyc):
public class DemoRubySolrIndexer extends RubyObject  {
    private static final Ruby __ruby__ = Ruby.getGlobalRuntime();
    private static final RubyClass __metaclass__;

    static {
        String source = new StringBuilder("require 'java'\n" +
            "require 'rubygems'\n" +
            "require 'active_fedora'\n" +
            "\n" +
            "java_package 'org.fcrepo.indexer.solr'\n" +
            "class DemoRubySolrIndexer\n" +
            "  include org.fcrepo.indexer.solr.ScriptIndexer\n" +
            "  \n" +
That's not a ScriptIndexer implementation! In the end, we gave in and used the spring-lang dynamic language support (which covers JRuby, Groovy, and BeanShell only):
  <lang:jruby id="rubyScriptClass" 
              script-interfaces="org.fcrepo.indexer.solr.ScriptIndexer" 
              script-source="classpath:demo_ruby_solr_indexer.rb" />
I assume that, under the hood, this creates a similar proxy class, but does so a little more intelligently. It's annoying that we have to tie ourselves so tightly to Spring to get this working without writing proxy classes ourselves.
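
For completeness, the script-source file is essentially the class we wished would compile above. The Spring reference examples for JRuby also return a new instance of the class on the script's last line (documented there as optional); a sketch of what demo_ruby_solr_indexer.rb might look like under that convention:
require 'java'

class DemoRubySolrIndexer
  include org.fcrepo.indexer.solr.ScriptIndexer

  def indexObject(profile, pid, doc)
    # index the object into Solr here
    true
  end
end

DemoRubySolrIndexer.new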