Solr Data Import Handler

This week, I had the opportunity to write a data import handler (DIH) for the Solr search server, which elegantly mapped a MySQL database to the Solr schema. Before this, I had been writing small scripts that produced XML output, because the underlying data wasn't neatly contained in a single document or database. The DIH is a new feature in Solr 1.3, and it makes integrating search almost trivial: anyone who can write an SQL query can begin replacing a database's built-in fulltext engine with a Solr service, which offers more flexibility, efficient faceting, and a document-centric view appropriate for search. The basic skeleton looked something like this:

<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" batchSize="-1" url="jdbc:mysql://localhost:3306/cms?zeroDateTimeBehavior=convertToNull" user="root" />
    <document name="doc">
        <entity transformer="RegexTransformer" name="page" query="SELECT ... FROM ... JOIN ... JOIN ... JOIN ..">
            <field column="title" name="dc.title" />
            [...]
            <field column="names" splitBy="," name="dc.contributor" />
        </entity>
    </document>
</dataConfig>
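
For the skeleton above to be picked up, the handler also has to be registered in solrconfig.xml. A minimal sketch, assuming the config is saved as data-config.xml in the core's conf directory and the handler is mounted at /dataimport (both names are conventional choices on my part, not taken from the setup described here):

```xml
<!-- solrconfig.xml: register the DataImportHandler.
     "/dataimport" and "data-config.xml" are assumed names; adjust to your setup. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <!-- Path to the dataConfig file, relative to the core's conf directory -->
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>
```

With something like this in place, a request to that handler with command=full-import should start a full import, and command=status reports on its progress.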

A couple of things to note. First, in the dataSource configuration, I've set batchSize="-1", which lowers the number of rows kept in memory (the MySQL driver streams rows rather than buffering the whole result set) and prevents Solr (and the servlet engine) from running out of memory. Second, in the JDBC URL, I'm using zeroDateTimeBehavior=convertToNull, which is a very easy way of dealing with those pesky "0000-00-00 00:00:00" dates that normally come out of the database: they arrive as NULL, which allows Solr to gracefully skip that field.

In some multivalued field declarations (like names -> dc.contributor), I'm using the RegexTransformer and its splitBy attribute to split a MySQL GROUP_CONCAT() column back into separate values, which at least saves a query (and pushes more of the data marshaling logic into the SQL query, leaving the Solr mapping fairly straightforward). The Solr transformers look incredibly powerful and are almost certainly worth pursuing further.

One update I eagerly await is the integration of the DIH with Solr Cell, a text and metadata extraction service, under [#SOLR-1358], which would let you merge previously extracted (or entered) metadata with the fulltext of documents. When this feature is added, I think I can pretty much give up on my transforming scripts and switch to the DIH for all purposes.
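
To make the GROUP_CONCAT() and splitBy pairing concrete, here is a rough sketch of what such an entity could look like. The table and column names (page, page_author, author, and their keys) are hypothetical stand-ins, not the schema of the actual CMS database; only the names column, the dc.title and dc.contributor fields, and the comma separator come from the config above.

```xml
<!-- Goes inside <document>. All table and column names in the query are hypothetical. -->
<entity name="page" transformer="RegexTransformer"
        query="SELECT p.id, p.title,
                      GROUP_CONCAT(a.name SEPARATOR ',') AS names
               FROM page p
               LEFT JOIN page_author pa ON pa.page_id = p.id
               LEFT JOIN author a ON a.id = pa.author_id
               GROUP BY p.id, p.title">
    <field column="title" name="dc.title" />
    <!-- splitBy turns the single "Alice,Bob" string into two dc.contributor values -->
    <field column="names" splitBy="," name="dc.contributor" />
</entity>
```

For this to work, dc.contributor has to be declared multiValued="true" in schema.xml. Two caveats worth keeping in mind: MySQL truncates GROUP_CONCAT() output at group_concat_max_len (1024 bytes by default), and a value that itself contains the separator will be split in the wrong place, so a less common separator may be a safer choice than a comma.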