The self-healing repository
One of the goals of the Fedora Futures project is to give repository administrators the tools they need to provide a highly available, scalable, resilient repository to their consumers.
Although Fedora 3.x does support replication and mirroring for high availability, that support is neither widely implemented nor easily scalable. As I understand it, each slave repository keeps a complete, independent copy of the data, which may make it uneconomical to maintain a large cluster for either read-heavy or write-heavy workloads.
We're building the current Fedora Futures prototype on top of Modeshape and Infinispan, which support clustering out of the box. Infinispan provides two main clustering modes, replication and distribution, and the repository administrator can configure them to balance the needs of high availability, scalability, and durability.
Replication means when one node receives any "modify"-type request, it will replicate that change to every node in the cluster. Distribution, however, is a tunable pattern that allows you to set the number of copies that should be maintained in the cluster. In distribution, if numOwners=1, you have traditional sharding where a single copy of the data is maintained in the cluster; if numOwners=m, though, you can remove m - 1 nodes from the cluster and maintain availability.
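To make the numOwners arithmetic concrete, here is a minimal sketch in plain Java. It is not Infinispan's API or its actual consistent-hash implementation. The class name, the node names, and the "walk the node list from a hashed starting point" placement rule are all assumptions for illustration. It only shows why keeping m copies lets you lose m - 1 nodes and still find every object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a distributed cache's owner selection;
// real Infinispan uses segment-based consistent hashing instead.
public class DistributionSketch {

    // Pick numOwners distinct nodes for a key by walking the node list,
    // starting at a position derived from the key's hash code.
    static List<String> owners(String key, List<String> nodes, int numOwners) {
        List<String> result = new ArrayList<>();
        int start = Math.floorMod(key.hashCode(), nodes.size());
        int copies = Math.min(numOwners, nodes.size());
        for (int i = 0; i < copies; i++) {
            result.add(nodes.get((start + i) % nodes.size()));
        }
        return result;
    }

    // A key stays readable as long as at least one of its owners is still up.
    static boolean available(String key, List<String> nodes, int numOwners, List<String> failed) {
        for (String owner : owners(key, nodes, numOwners)) {
            if (!failed.contains(owner)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c", "node-d");
        // With numOwners = 3 we can afford to lose 3 - 1 = 2 nodes.
        List<String> failed = Arrays.asList("node-b", "node-c");
        for (String key : Arrays.asList("obj:1", "obj:2", "obj:3")) {
            System.out.println(key + " owners=" + owners(key, nodes, 3)
                    + " available=" + available(key, nodes, 3, failed));
        }
    }
}
```

With numOwners=1 the same code degenerates to sharding: each key has exactly one owner, and losing that node makes the key unavailable.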
[Figure: the different clustering configurations (replication and distribution).]
When you add a new node to the cluster, the cluster will re-balance objects from other nodes to the new node; when you remove a node, the cluster will redistribute existing data to maintain the distribution guarantees.
(I'll also note here Infinispan's cross-site replication feature, which allows different clustering configurations between the replicated sites. In a repository context, perhaps this could be used to ensure 2 copies are kept in spinning-disk cache stores, but only 1 needs to be kept on tape.)
A brief note about fixity
Fixity checks in a single-node configuration are relatively simple. Your service can request the data from storage, compute checksums and compare them against stored values. With clustering modes, we'd also have to check each copy of the data. At least in Modeshape (and Infinispan), the high-level APIs do not seem to provide that kind of visibility down at the cache store levels. We'll have to dig deep into the Infinispan API to locate the data in a particular cache store and run our checks.
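As a rough sketch of what "check each copy" means, the snippet below recomputes a checksum for every copy of an object and reports which stores disagree with the recorded value. It deliberately avoids the Infinispan API. The map of store names to bytes stands in for whatever per-cache-store lookup we end up finding, and the class name, method names, and choice of SHA-256 are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-copy fixity check; the Map is a stand-in for
// reading the same object out of each individual cache store.
public class FixitySketch {

    // Compute a lowercase hex SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Return the names of the stores whose copy does not match the stored checksum.
    static List<String> failingStores(Map<String, byte[]> copiesByStore, String expectedHex) {
        List<String> failing = new ArrayList<>();
        for (Map.Entry<String, byte[]> copy : copiesByStore.entrySet()) {
            if (!sha256Hex(copy.getValue()).equals(expectedHex)) {
                failing.add(copy.getKey());
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        byte[] original = "datastream content".getBytes(StandardCharsets.UTF_8);
        String recorded = sha256Hex(original); // checksum recorded at ingest time

        Map<String, byte[]> copies = new LinkedHashMap<>();
        copies.put("store-1", original.clone());
        copies.put("store-2", "datastream c0ntent".getBytes(StandardCharsets.UTF_8)); // simulated bit rot

        System.out.println("failing stores: " + failingStores(copies, recorded));
    }
}
```

The important design point is that the result is per store, not per object: a single pass/fail for the object would hide which copy went bad.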
Once you bring true fixity checks this close to the repository, you could even start building intelligent, self-healing repositories that identify corrupt data, evict it, and let the cluster rebalance automatically. You could even reject suspect nodes, while bringing up new nodes to replace them.
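The "identify, evict, rebalance" loop might look something like the sketch below. This is an in-memory toy, not something Modeshape or Infinispan provides: in a real cluster the re-copying step would be the cluster's own rebalancing, and here it is faked by overwriting the bad copy from an intact one. The class name, method names, and node names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical self-healing pass over the copies of a single object.
public class SelfHealingSketch {

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Replace every corrupt copy with an intact one and
    // return the names of the nodes that had to be healed.
    static List<String> heal(Map<String, byte[]> copiesByNode, byte[] expectedDigest) {
        byte[] good = null;
        List<String> corrupt = new ArrayList<>();
        for (Map.Entry<String, byte[]> copy : copiesByNode.entrySet()) {
            if (MessageDigest.isEqual(sha256(copy.getValue()), expectedDigest)) {
                good = copy.getValue();
            } else {
                corrupt.add(copy.getKey());
            }
        }
        if (good == null) {
            // Nothing trustworthy to copy from; this needs a human (or a backup).
            throw new IllegalStateException("no intact copy left to heal from");
        }
        for (String node : corrupt) {
            // "Evict and rebalance", simulated by overwriting the bad copy.
            copiesByNode.put(node, good.clone());
        }
        return corrupt;
    }

    public static void main(String[] args) {
        byte[] original = "datastream content".getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> copies = new HashMap<>();
        copies.put("node-a", original.clone());
        copies.put("node-b", "garbage".getBytes(StandardCharsets.UTF_8));

        System.out.println("healed: " + heal(copies, sha256(original)));
    }
}
```

A node that shows up in the healed list again and again would be a natural candidate for the "reject suspect nodes" step.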
Implementing Java Interfaces on JRuby classes
We wanted to implement a class in JRuby that implements a Java interface, to fulfill a dependency injection contract. This is (unfortunately) a much more difficult task than it sounds.
So, here's our simple Java interface that defines one public method, "indexObject":
And here's what we really wish just worked (in JRuby 1.7.0):
Unfortunately, here's what that compiles to (using jrubyc):
That's not a ScriptIndexer instance! In the end, we gave in and ended up using the spring-lang dynamic language support (for JRuby, Groovy and BeanShell only):
I assume that, under the hood, this makes a similar proxy class, but does so a little smarter. It's annoying that we have to tie ourselves so tightly to Spring to get this to work without writing some unnecessary proxy classes ourselves.
Code4Lib '13 Data Visualization Hackfest: Library Open Data
For the Code4Lib '13 Data Visualization Hackfest, we pulled together a short list of some library-relevant open data resources. So they don't get lost to the ages in a Google Doc, here's what we found:
- Harvard Library Bibliographic Dataset: A collection of MARC21 data.
This dataset contains over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
- Chicago Public Library Circulation Data
- OhioLINK Collection and Circulation Analysis—Circulation Data
- University of Huddersfield -- Circulation and Recommendation Data (http://datahub.io/dataset/hud-library-usagedata)
Since 2005, the University of Huddersfield has provided book recommendations within its library catalogue, driven by mining of the historical circulation usage data. ... [T]he library has details of just under 3 million circulation transactions spanning a period of 13 years.
- Various IMLS Data.gov datasets
- Vancouver Public Library: Open Data Catalogue
VPL's open data are sets of aggregated data in three general categories (collections, circulation, and borrower demographics) all describing the Vancouver Public Library collections and use of materials in the collections by Library patrons.
- Duke Law Library circulation data: