One of the goals of the Fedora Futures project is to give repository administrators the tools they need in order to successfully provide a highly available, scalable, resiliant repository to their consumers.
Although Fedora 3.x does have support for replication and mirroring for high-availability, it is neither widely implemented nor easily scalable. As I understand it, each slave repository keeps a complete, independent copy of the data, making it, perhaps, not economical to maintain a large cluster to support either large read- or write-heavy work.
We're building the current Fedora Futures prototype on top of Modeshape and Infinispan, which come out of the box with support for clustering. Infinispan provides two main clustering modes, replication and distribution, and can be configured by the repository administrator to balance needs of high-availability, scalability, durability.
Replication means when one node receives any "modify"-type request, it will replicate that change to every node in the cluster. Distribution, however, is a tunable pattern that allows you to set the number of copies that should be maintained in the cluster. In distribution, if numOwners=1, you have traditional sharding where a single copy of the data is maintained in the cluster; if numOwners=m, though, you can remove m - 1 nodes from the cluster and maintain availability.
The different clustering configurations look something like:
When you add a new node to the cluster, the cluster will re-balance objects from other nodes to the new node; when you remove a node, the cluster will redistribute existing data to maintain the distribution guarantees.(I'll also note here, Infinispan's cross-site replication feature, which allows different clustering configurations between the replicated sites. In a repository-context, perhaps this could be used to ensure 2 copies are kept in spinning-disk cache stores, but only 1 needs to be kept on tape.)
A brief note about fixity
Fixity checks in a single-node configuration are relatively simple. Your service can request the data from storage, compute checksums and compare them against stored values. With clustering modes, we'd also have to check each copy of the data. At least in Modeshape (and Infinispan), the high-level APIs do not seem to provide that kind of visibility down at the cache store levels. We'll have to dig deep into the Infinispan API to locate the data in a particular cache store and run our checks.
Once you bring true fixity checks this close to the repository, you could even start building intelligent, self-healing repositories that to identify corrupt data, evict them, and let the cluster rebalance automatically. You could even reject suspect nodes, while bringing up new nodes to replace them.
Harvard Library Bibliographic Dataset: A collection of MARC21 data.
This dataset contains over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials.
- Chicago Public Library Circulation Data
- OhioLINK Collection and Circulation Analysis—Circulation Data
- University of Huddersfield -- Circulation and Recommendation Data - See more at: http://datahub.io/dataset/hud-library-usagedata#sthash.gfw5zdHD.dpuf
Since 2005, the University of Huddersfield has provided book recommendations within its library catalogue, driven by mining of the historical circulation usage data. ... [T]he library has details of just under 3 million circulation transactions spanning a period of 13 years.
- Various IMLS Data.gov datasets
- Vancouver Public Library: Open Data Catalogue
VPL's open data are sets of aggregated data in three general categories (collections, circulation, and borrower demographics) all describing the Vancouver Public Library collections and use of materials in the collections by Library patrons.
- Duke Law Library circulation data: