We have a Graphite-based stack of real-time visualization tools, including the data aggregator StatsD. These tools let us easily record real-time data from arbitrary services with minimal fuss. We present some curated graphs through GDash, a simple Sinatra front-end. For example, we record the time it takes for Solr to respond to queries from our SearchWorks catalog, using this simple bash script:
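To make the moving parts concrete, here is a minimal Ruby sketch of the parse-and-report step. It is an illustration under assumptions, not the script we run: the StatsD address (`127.0.0.1:8125`), the bucket name `solr.searchworks.qtime`, and the helper names are all hypothetical. It relies on two things: Solr request log lines carry a `QTime=<milliseconds>` field, and StatsD accepts timers as `<bucket>:<value>|ms` datagrams over UDP.

```ruby
require "socket"

# Assumed StatsD location and bucket name; adjust for a real deployment.
STATSD_HOST = "127.0.0.1"
STATSD_PORT = 8125
BUCKET = "solr.searchworks.qtime" # hypothetical metric name

# Pull the query time (in milliseconds) out of a Solr request log line.
# Returns nil for lines that carry no QTime field.
def extract_qtime(line)
  match = line.match(/QTime=(\d+)/)
  match && match[1].to_i
end

# Format a StatsD timer datagram: "<bucket>:<value>|ms".
def statsd_timing(bucket, millis)
  "#{bucket}:#{millis}|ms"
end

# Read log lines from `input` and send one UDP datagram per logged query.
# Lines without a QTime field are skipped.
def report(input, socket: UDPSocket.new)
  input.each_line do |line|
    qtime = extract_qtime(line)
    next unless qtime
    socket.send(statsd_timing(BUCKET, qtime), 0, STATSD_HOST, STATSD_PORT)
  end
end
```

Log lines would be piped in on standard input (for example from `tail -f` on the Solr log) and handed to `report($stdin)`. UDP is fire-and-forget, so a StatsD outage should not slow down or break the log reader.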
(We rotate these logs through truncation; you can also use `tail -f --retry` for logs that are moved away when rotated.)

And the Ruby script that does the actual parsing:

From this data, we can start asking questions like:

- Is our Solr query time getting worse? (Graphite can perform some basic manipulation, including taking integrals and derivatives.)

(Answers to these questions and more in a future post, I'm sure.)

We also use this setup to monitor other services, e.g.:

Note the red line ("warn_0") in the top graph. It marks the point where our (asynchronous) indexing system is unable to keep up with demand, and updates may appear after a delay. Given time (and sufficient data, of course), this also gives us the ability to forecast and plan for issues:
- What is the rate of growth of our indexing backlog, and can we process it in a reasonable timeframe, or should we scale the indexer service?
- Given our rate of disk usage, are we on track to run out of disk space this month? this week?
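
The forecasting questions above mostly come down to fitting a trend and extrapolating it. As a rough sketch of that arithmetic in Ruby (not something our stack ships with), the code below fits a least-squares line through `[time, value]` samples and estimates how long until the trend crosses a limit. The helper names, the sample readings, and the 500 GB capacity are made up for illustration, and a straight line is only a reasonable model when growth is roughly steady.

```ruby
# Fit a straight line (ordinary least squares) through [time, value] samples.
# Returns [slope, intercept].
def linear_fit(samples)
  n = samples.size.to_f
  mean_t = samples.sum { |t, _| t } / n
  mean_v = samples.sum { |_, v| v } / n
  covariance = samples.sum { |t, v| (t - mean_t) * (v - mean_v) }
  variance = samples.sum { |t, _| (t - mean_t)**2 }
  slope = covariance / variance
  [slope, mean_v - slope * mean_t]
end

# Estimate how long after the last sample the fitted trend reaches `limit`.
# Returns nil when the trend is flat or shrinking (or cannot be fitted).
def time_until(samples, limit)
  slope, intercept = linear_fit(samples)
  return nil unless slope.positive?
  time_at_limit = (limit - intercept) / slope
  [time_at_limit - samples.last[0], 0.0].max
end

# Hypothetical readings: [day, GB used] on a 500 GB volume.
usage = [[0, 400.0], [1, 404.0], [2, 408.0], [3, 412.0]]
days_left = time_until(usage, 500.0)
puts "Projected days until full: #{days_left.round(1)}"
```

The same two functions apply to the indexing backlog: feed in `[time, queue length]` samples, and a positive slope means the indexer is falling behind at that rate.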