blog.cbeer.info

Mar 6, 2013

The self-healing repository

One of the goals of the Fedora Futures project is to give repository administrators the tools they need to provide a highly available, scalable, resilient repository to their consumers.

Although Fedora 3.x does support replication and mirroring for high availability, that support is neither widely implemented nor easily scalable. As I understand it, each slave repository keeps a complete, independent copy of the data, which may make it uneconomical to maintain a large cluster for either read-heavy or write-heavy workloads.

We're building the current Fedora Futures prototype on top of Modeshape and Infinispan, which come out of the box with support for clustering. Infinispan provides two main clustering modes, replication and distribution, and can be configured by the repository administrator to balance the needs of high availability, scalability, and durability.

Replication means when one node receives any "modify"-type request, it will replicate that change to every node in the cluster. Distribution, however, is a tunable pattern that allows you to set the number of copies that should be maintained in the cluster. In distribution, if numOwners=1, you have traditional sharding where a single copy of the data is maintained in the cluster; if numOwners=m, though, you can remove m - 1 nodes from the cluster and maintain availability.

The different clustering configurations look something like:

<!-- REPLICATION -->
<clustering mode="replication">
 <sync/>
</clustering>

<!-- SHARDING -->
<clustering mode="distribution">
 <sync/>
 <l1 enabled="false" lifespan="0" onRehash="false"/>
 <hash numOwners="1"/>
 <stateTransfer fetchInMemoryState="true"/>
</clustering>

<!-- DISTRIBUTION; keep 3 copies of the data -->
<clustering mode="distribution">
 <sync/>
 <l1 enabled="false" lifespan="0" onRehash="false"/>
 <hash numOwners="3"/>
 <stateTransfer fetchInMemoryState="true"/>
</clustering>

When you add a new node to the cluster, the cluster will re-balance objects from other nodes to the new node; when you remove a node, the cluster will redistribute existing data to maintain the distribution guarantees.
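To make the numOwners arithmetic concrete, here is a toy model in Java. It is only a sketch: the placement rule (hash the key, then take the next numOwners nodes in list order) and the class and method names are my own assumptions for illustration, not Infinispan's actual consistent-hash implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of distribution mode: each key lives on numOwners nodes. */
public class NumOwnersSketch {

    /**
     * Returns the nodes holding a copy of the key.
     * Hypothetical placement rule: start at hash(key) mod cluster size,
     * then take consecutive nodes. Infinispan's real hashing differs.
     */
    public static List<String> ownersOf(String key, List<String> nodes, int numOwners) {
        List<String> owners = new ArrayList<>();
        int start = Math.floorMod(key.hashCode(), nodes.size());
        int copies = Math.min(numOwners, nodes.size());
        for (int i = 0; i < copies; i++) {
            owners.add(nodes.get((start + i) % nodes.size()));
        }
        return owners;
    }

    /** True if at least one owner of the key is not in the failed set. */
    public static boolean isAvailable(String key, List<String> nodes,
                                      int numOwners, Set<String> failed) {
        for (String owner : ownersOf(key, nodes, numOwners)) {
            if (!failed.contains(owner)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c", "node-d", "node-e");
        String key = "object:1234";
        List<String> owners = ownersOf(key, nodes, 3);

        // With numOwners=3, losing two of the three owners still leaves one copy.
        Set<String> twoLost = new HashSet<>(owners.subList(0, 2));
        System.out.println("owners=" + owners
                + " availableAfterTwoLost=" + isAvailable(key, nodes, 3, twoLost));

        // Losing all three owners at once (before any rebalance) loses the key.
        Set<String> allLost = new HashSet<>(owners);
        System.out.println("availableAfterAllLost=" + isAvailable(key, nodes, 3, allLost));
    }
}
```

In a real cluster the surviving nodes would also re-copy the data after a failure to get back to numOwners copies; the model above only checks availability at the moment of failure.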

(I'll also note here Infinispan's cross-site replication feature, which allows a different clustering configuration at each replicated site. In a repository context, perhaps this could be used to ensure 2 copies are kept in spinning-disk cache stores, while only 1 needs to be kept on tape.)

A brief note about fixity

Fixity checks in a single-node configuration are relatively simple. Your service can request the data from storage, compute checksums, and compare them against stored values. With clustering modes, we'd also have to check each copy of the data. At least in Modeshape (and Infinispan), the high-level APIs do not seem to provide that kind of visibility down at the cache store level. We'll have to dig deep into the Infinispan API to locate the data in a particular cache store and run our checks.
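The single-node case can be sketched in a few lines of Java. This assumes SHA-1 as the checksum algorithm and a hex-encoded stored value; both are assumptions for illustration, and the class and method names are hypothetical rather than anything from Fedora, Modeshape, or Infinispan.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Minimal single-copy fixity check: recompute a checksum and compare. */
public class FixitySketch {

    /** Returns the lowercase hex SHA-1 digest of the given bytes. */
    public static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** True if the freshly computed checksum matches the stored one. */
    public static boolean fixityOk(byte[] data, String storedChecksum) {
        return sha1Hex(data).equalsIgnoreCase(storedChecksum);
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        String stored = sha1Hex(content); // pretend this was recorded at ingest
        System.out.println("intact copy passes: " + fixityOk(content, stored));

        byte[] damaged = "abd".getBytes(StandardCharsets.UTF_8);
        System.out.println("damaged copy passes: " + fixityOk(damaged, stored));
    }
}
```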

Once you bring true fixity checks this close to the repository, you could even start building intelligent, self-healing repositories that identify corrupt data, evict it, and let the cluster rebalance automatically. You could even reject suspect nodes, while bringing up new nodes to replace them.
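As a rough sketch of the identify-and-evict step, the Java below checks every copy of one object against an expected digest and drops the copies that fail. The map of node name to bytes is a stand-in for per-cache-store access, which (as noted above) the high-level APIs do not obviously expose; all names here are hypothetical, and the rebalance that would follow is left to the cluster.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy "self-healing" pass over the copies of a single object. */
public class SelfHealingSketch {

    /** SHA-1 digest of the given bytes (algorithm choice is an assumption). */
    public static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Removes every copy whose digest differs from the expected digest.
     * Returns the names of the nodes whose copies were evicted.
     * The map must be mutable; it models "node name -> that node's copy".
     */
    public static List<String> evictCorruptCopies(Map<String, byte[]> copiesByNode,
                                                  byte[] expectedDigest) {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, byte[]>> it = copiesByNode.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, byte[]> entry = it.next();
            if (!MessageDigest.isEqual(sha1(entry.getValue()), expectedDigest)) {
                evicted.add(entry.getKey());
                it.remove(); // evict the bad copy; the cluster would then re-copy it
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        byte[] good = "datastream".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "datastreXm".getBytes(StandardCharsets.UTF_8);

        Map<String, byte[]> copies = new LinkedHashMap<>();
        copies.put("node-a", good);
        copies.put("node-b", bad);
        copies.put("node-c", good);

        List<String> evicted = evictCorruptCopies(copies, sha1(good));
        System.out.println("evicted=" + evicted + " remaining=" + copies.keySet());
    }
}
```

A node that shows up in the evicted list repeatedly would be a natural candidate for the "reject suspect nodes" step.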

Mar 4, 2013

Implementing Java Interfaces on JRuby classes

We wanted to implement a class in JRuby that implements a Java interface, to fulfill a dependency injection contract. This is seemingly (unfortunately) a much more difficult task than it sounds. So, here's our simple Java interface that defines one public method, "indexObject":

public interface ScriptIndexer {
    public void indexObject(RepositoryProfile profile, 
        String pid, 
        SolrInputDocument doc);
}

And here's what we really wish just worked (in JRuby 1.7.0):

java_package 'org.fcrepo.indexer.solr'
class DemoRubySolrIndexer
  include org.fcrepo.indexer.solr.ScriptIndexer
  
  def indexObject a,b,c
    true
  end  
end

Unfortunately, here's what that compiles to (using jrubyc):

public class DemoRubySolrIndexer extends RubyObject  {
    private static final Ruby __ruby__ = Ruby.getGlobalRuntime();
    private static final RubyClass __metaclass__;

    static {
        String source = new StringBuilder("require 'java'\n" +
            "require 'rubygems'\n" +
            "require 'active_fedora'\n" +
            "\n" +
            "java_package 'org.fcrepo.indexer.solr'\n" +
            "class DemoRubySolrIndexer\n" +
            "  include org.fcrepo.indexer.solr.ScriptIndexer\n" +
            "  \n" +

That's not a ScriptIndexer instance! In the end, we gave in and ended up using the spring-lang dynamic language support (for JRuby, Groovy and BeanShell only):

  <lang:jruby id="rubyScriptClass" 
              script-interfaces="org.fcrepo.indexer.solr.ScriptIndexer" 
              script-source="classpath:demo_ruby_solr_indexer.rb" />

I assume, under the hood, this makes a similar proxy class, but does so a little smarter. It's annoying we have to tie ourselves so tightly to Spring to get this to work, without writing some unnecessary Proxy classes ourselves.

Mar 1, 2013

Code4Lib '13 Data Visualization Hackfest: Library Open Data

For the Code4Lib '13 Data Visualization Hackfest, we pulled together a short list of some library-relevant open data resources. So they don't get lost to the ages in a Google Doc, here's what we found:

Harvard Library Bibliographic Dataset: A collection of MARC21 data.
This dataset contains over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
Chicago Public Library Circulation Data
OhioLINK Collection and Circulation Analysis—Circulation Data
University of Huddersfield: Circulation and Recommendation Data (http://datahub.io/dataset/hud-library-usagedata)
Since 2005, the University of Huddersfield has provided book recommendations within its library catalogue, driven by mining of the historical circulation usage data. ... [T]he library has details of just under 3 million circulation transactions spanning a period of 13 years.
Various IMLS Data.gov datasets
Vancouver Public Library: Open Data Catalogue
VPL's open data are sets of aggregated data in three general categories (collections, circulation, and borrower demographics) all describing the Vancouver Public Library collections and use of materials in the collections by Library patrons.
Duke Law Library circulation data:
- daily circulation for the library for the last 8 years
- weekly cataloging aggregation

Dec 7, 2011

Code4Lib talk/proposal wordles

2008 (proposals)

2009 (proposals, talks)

2010 (proposals, talks)

2011 (proposals, talks)

2012 (proposals)

Nov 6, 2011