Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)
The Lucene-based Apache Solr is an incredible platform for building decent search experiences, especially compared to the "more traditional" database-driven approach, where a pile of SQL JOINs makes it difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms. I would argue these are essential parts of the discovery experience, and you essentially get them for free with Solr. Another benefit Solr provides is a foundation for many light-weight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).
For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, relationships, and metadata extracted from the assets, as well as the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements (which is probably all you can get users to contribute in any case). At this point, because legacy systems lack authority control, linked data, or anything similar, existing metadata is sparse, inaccurate, or limited. That sets the entry-level bar pretty low, so ease-of-use and metadata collection are the priorities. Eliding a lot of detail, here's the skeleton schema:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="description" type="string" indexed="true" stored="true"/>
<dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<copyField source="title" dest="title_t" />
<copyField source="subject" dest="dc.subject" />
<copyField source="description" dest="description_t" />
<copyField source="comments" dest="text" />
<copyField source="dc.creator" dest="author" />
<copyField source="dc.*" dest="text" />
<copyField source="text" dest="text_rev" />
<copyField source="payloads" dest="text" />
<copyField source="dc.title" dest="dc.title_t" />
<copyField source="dc.description" dest="dc.description_t" />
<copyField source="dc.coverage" dest="dc.coverage_t" />
<copyField source="dc.contributor" dest="dc.contributor_t" />
<copyField source="dc.subject" dest="dc.subject_t" />
<copyField source="dc.contributor" dest="names_t" />
<copyField source="dc.coverage" dest="names_t" />

The new edismax query parser provides such a good balance of flexibility, advanced query features, and ease-of-use that it seems like an obvious choice here. The only penalty you pay by using Solr is having to keep the Solr index synchronized with your data sources.
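To make the edismax choice a bit more concrete, here is a minimal sketch (in Python, standard library only) of the request parameters a light-weight interface might send to Solr's select handler. The field names in `qf` come from the copyField destinations in the skeleton schema and are assumed to be tokenized text fields; the boost values, the facet fields, the function names, and the base URL are all assumptions for illustration, not settings from a working install.

```python
from urllib.parse import urlencode


def edismax_params(user_query, facet_fields=("dc.subject", "dc.creator"), rows=10):
    """Build (name, value) pairs for a Solr select request using edismax."""
    params = [
        ("defType", "edismax"),  # choose the extended dismax query parser
        ("q", user_query),       # raw user input, passed through as typed
        # qf lists the fields to search; the ^N boosts are illustrative guesses
        ("qf", "title_t^5 dc.subject_t^2 description_t text"),
        ("facet", "true"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    # facet.field may repeat, so params stay a list of pairs rather than a dict
    params.extend(("facet.field", field) for field in facet_fields)
    return params


def select_url(base_url, user_query):
    """Join a Solr base URL (hypothetical here) with the encoded parameters."""
    return base_url.rstrip("/") + "/select?" + urlencode(edismax_params(user_query))


if __name__ == "__main__":
    print(select_url("http://localhost:8983/solr", "civil rights interview"))
```

Keeping the parameters as a list of pairs is a small design choice: it lets `facet.field` appear once per facet, which is how Solr expects repeated parameters.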
For synchronizing data from Fedora, there is now a proliferation of options, ranging from task-specific Java plugins like GSearch and Shelver to more generic tools (ESBs and all that) like Apache Camel or the Ruote-based Fedora Workflow component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I've given Camel a try, and after a couple of days of Java-dependency-induced head pounding, I have something that works.

---

On Twitter, John Tynan requested a virtual machine image to encourage others to begin playing with this software, so I've actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.
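Whatever ends up doing the synchronization or ingest, the core of the job is turning an asset's Dublin Core record into a document that fits the skeleton schema. The sketch below is one hedged way to do that mapping in Python. The function name, the input shape (a plain dict keyed by DC element name), and the example PID are hypothetical, and this is not the Camel route described above.

```python
import json

# The fifteen Dublin Core elements; anything else in the input is ignored.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_solr_doc(pid, dc_record):
    """Map a Fedora PID plus a dict of DC values onto the skeleton schema."""
    doc = {"id": pid}
    for element, values in dc_record.items():
        if element not in DC_ELEMENTS:
            continue
        if isinstance(values, str):
            values = [values]  # accept a single string or a list of strings
        cleaned = [v.strip() for v in values if v and v.strip()]
        if cleaned:
            doc["dc." + element] = cleaned  # matches the dc.* dynamicField
    if "dc.title" in doc:
        doc["title"] = doc["dc.title"]  # title is multiValued in the schema
    if "dc.description" in doc:
        # description is single-valued in the schema, so keep the first value
        doc["description"] = doc["dc.description"][0]
    return doc


if __name__ == "__main__":
    record = {"title": "Evening News", "subject": ["news", "local"]}
    # Serialized only for inspection; the envelope the update handler
    # expects depends on the Solr version in use.
    print(json.dumps(to_solr_doc("demo:1", record), indent=2))
```

Dropping empty values and unknown elements at this step keeps sparse or messy legacy metadata from producing empty fields in the index.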