Collectd: performance monitoring on servers

24 Jan 2011 19:17

As I mentioned on the official Wikidot blog, we use collectd to monitor the health of our servers, so that we can notice problems early and often prevent disasters.

What's collectd

Collectd's design is fairly simple. It is built to do just the following:

  • initialize plugins with configuration data
  • ask "read" plugins for data at configurable time intervals
  • accept data from "read" plugins
  • forward collected data to "write" plugins
  • forward notifications to "notification" plugins if data returned from a read plugin is outside the configured threshold

Some plugins come bundled with the collectd packages:

  • read plugin: system stats, like CPU, load, memory, disk usage etc
  • read plugin: exec: runs configured command to fetch some data (read below)
  • write plugin: save collected data to RRD files (it's easy to make graphs out of them)
  • read-write plugin: network — passes data from one server to another one (useful to collect data from many servers on one machine)
  • notification plugin: send an email if some statistic is outside its defined bounds

Amazing graphs

A number of WWW (and desktop GUI) interfaces to the RRD files generated by collectd exist, but none of them is perfect :-(. One of the most exciting we found is Jarmon, but even it sucks at some things.

I think the project should improve in this area. It's not really collectd core (the WWW interface only displays the RRD files, so it's already outside collectd), but nice screenshots mean a lot when you decide what monitoring software to install.

Ability to easily monitor "your own custom stuff"

The exec plugin runs a script (or executable, it doesn't really care) that is expected to produce statistics data. The program should print at least one line to standard output that looks like this:

PUTVAL "myhost/mystats-stat1/gauge" interval=10 1179574444:666.44
  • myhost is the host name: used to group stats by host (useful when one machine collects data from many hosts)
  • mystats is the plugin name: the plugin is actually exec, but when collectd processes THIS line it sets the plugin name to mystats (useful to tell which exec script the data comes from)
  • stat1 is the datum you return a value for: one plugin can return multiple values; for example, a plugin that returns how much of the file system is used would return one value for each mounted file system
  • gauge: experimentally chosen as the most universal type of data
  • interval=10: this data is generated every 10 seconds
    NOTE: if the script ends in less than this interval, collectd will launch another process to generate new values
  • 1179574444: current epoch time (date +%s)
  • 666.44: datum value (given that gauge is used, this can be a float, or U if the value could not be measured for some reason)

The full protocol definition can be found on the collectd wiki.
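
To make the field layout concrete, here is a minimal sketch of a shell helper that assembles such a line from its parts. The function name format_putval and its argument order are made up for illustration; only the output format is taken from the example line above.

```shell
#!/bin/bash

# format_putval is a hypothetical helper (not part of collectd).
# Arguments: host plugin instance interval epoch value
format_putval() {
    local host=$1 plugin=$2 instance=$3 interval=$4 epoch=$5 value=$6
    printf 'PUTVAL "%s/%s-%s/gauge" interval=%s %s:%s\n' \
        "$host" "$plugin" "$instance" "$interval" "$epoch" "$value"
}

# Reproduces the example line shown earlier:
format_putval myhost mystats stat1 10 1179574444 666.44
```

Keeping the formatting in one place makes it harder to misplace a quote or a slash when a script reports more than one value.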

Tricks for writing an "exec" plugin

There's official documentation on how to write such a plugin, but a simple plugin can be written with (almost) no prior knowledge. Just pick your favorite scripting language and go.

  1. Don't trust COLLECTD_INTERVAL and COLLECTD_HOSTNAME. These environment variables should hold the interval and hostname configured in collectd, but for some reason they seem not to. Not a big deal: use the hostname command to get the hostname, and decide for yourself how often you need to calculate a given value.
  2. For simplicity, print the value(s) once and just exit. Collectd will periodically run your script. If you keep your script fast or make the interval big, this should not be a problem. In my experience, it's not easy to create a plugin that produces values continuously, so unless you have a good reason, don't try.
  3. Unless you need to, don't use any value type other than gauge. It's universal and just works.
  4. Test your script manually (by running it). It should (ideally) instantly print one line, like the example above, and exit with code 0. No standard error, no verbose stuff, no fancy options etc.
  5. If you want, you can pass configuration parameters to your script as command line arguments; write the actual parameters in collectd.conf then.
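
The manual test from trick 4 can be partly automated. Below is a rough sketch of a checker that tells whether a line looks like the gauge example above. The function name check_putval is hypothetical, and the pattern is deliberately strict; it is an assumption about what a "good" line looks like here, not the full grammar that collectd accepts.

```shell
#!/bin/bash

# check_putval is a hypothetical helper: it succeeds (exit status 0) only if
# its argument looks like a single gauge PUTVAL line as described above.
# The pattern is NOT the complete collectd protocol, just this article's shape:
#   PUTVAL "host/plugin-instance/gauge" interval=N EPOCH:VALUE
check_putval() {
    local line=$1
    local re='^PUTVAL "[^/"]+/[^/"-]+-[^/"]+/gauge" interval=[0-9]+ [0-9]+:(U|-?[0-9]+(\.[0-9]+)?)$'
    [[ $line =~ $re ]]
}

check_putval 'PUTVAL "myhost/mystats-stat1/gauge" interval=10 1179574444:666.44' \
    && echo "looks fine"
```

To check a real script, its output could be passed in the same way, for example check_putval "$(./myscript.sh)", where ./myscript.sh stands for your own plugin.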

Let's look at an example/template script then:

#!/bin/bash

hostname=$(hostname) # ask the system rather than COLLECTD_HOSTNAME (see trick 1)
date=$(date +%s)     # current epoch time

# don't use "-" in those names:
plugin=myplugin
data=somedata
interval=60 # seconds

# compute the value:
value="30.44"

echo "PUTVAL \"$hostname/$plugin-$data/gauge\" interval=$interval $date:$value"

And that's all. You need to find some fancy (but not too fancy of course) name for the plugin and a way to compute the actual value :-).
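
The template prints a single value. As noted earlier, one plugin can report several values, such as disk usage for each mounted file system. Here is a rough sketch of that idea, assuming POSIX df -P output with the capacity in the fifth column and the mount point in the sixth (mount points containing spaces would break it). The plugin name fsusage and the helper df_to_putval are made up for illustration.

```shell
#!/bin/bash

# Hypothetical names; avoid "-" in the plugin and instance names.
plugin=fsusage
interval=60 # seconds

# Reads `df -P` output on stdin and prints one PUTVAL line per file system.
# The instance name is derived from the mount point: "/" becomes "root", and
# every other character outside [A-Za-z0-9_] is replaced with "_".
# If the capacity column is not a plain number, "U" is reported instead.
df_to_putval() {
    local host=$1 now=$2
    awk -v host="$host" -v plugin="$plugin" -v interval="$interval" -v now="$now" '
        NR > 1 {
            mount = $6
            if (mount == "/") {
                name = "root"
            } else {
                name = substr(mount, 2)
                gsub(/[^A-Za-z0-9_]/, "_", name)
            }
            used = $5
            sub(/%$/, "", used)
            if (used !~ /^[0-9]+$/) used = "U"
            printf "PUTVAL \"%s/%s-%s/gauge\" interval=%s %s:%s\n", \
                host, plugin, name, interval, now, used
        }'
}

df -P | df_to_putval "$(hostname)" "$(date +%s)"
```

Each output line carries its own instance name, so every file system ends up as a separate series.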

Note that if a value isn't generated within the interval, the data is considered lost, and if you have any thresholds set for this value, you'll get a notification about the missing value (and then another one when it's back).
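
When a value can't be computed at all, one option is to print U (the "not measured" marker mentioned earlier) instead of printing nothing. The sketch below combines that with trick 5: a made-up plugin called linecount that takes a file path as its first argument and reports the number of lines in it. The default path, the plugin name and the helper count_lines are assumptions for illustration only, and how your thresholds react to U is something to verify on your own setup.

```shell
#!/bin/bash

# Hypothetical plugin: reports the number of lines in the file given as the
# first command line argument. In collectd.conf the path would be passed as an
# argument, along the lines of (user name and script path are placeholders):
#   <Plugin exec>
#     Exec "nobody" "/usr/local/bin/linecount.sh" "/var/log/syslog"
#   </Plugin>
plugin=linecount
interval=60 # seconds

# Prints the line count of a readable regular file, or "U" otherwise.
count_lines() {
    local file=$1
    if [ -f "$file" ] && [ -r "$file" ]; then
        # some wc implementations pad the number with blanks, so strip them
        wc -l < "$file" | tr -d '[:space:]'
    else
        printf 'U'
    fi
}

file=${1:-/var/log/syslog} # assumed default, purely for illustration

# Instance name: the file name without its directory, unsafe characters
# (including "-") replaced with "_".
data=$(basename "$file" | tr -c 'A-Za-z0-9_\n' '_')

echo "PUTVAL \"$(hostname)/$plugin-$data/gauge\" interval=$interval $(date +%s):$(count_lines "$file")"
```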

Here is what a section of threshold.conf that sets thresholds for such exec plugins looks like:

<Threshold>

# Any value returned by plugin "myplugin"
# warning if value > 200
# failure if value > 400
<Plugin "myplugin">
    <Type "gauge">
        FailureMax 400
        WarningMax 200
    </Type>
</Plugin>
# Specific value (somedata) returned by plugin "myplugin"
# warning if value < 20
# failure if value < 1
<Plugin "myplugin">
    Instance "somedata"
    <Type "gauge">
        WarningMin 20
        FailureMin 1
    </Type>
</Plugin>

</Threshold>

Consult the documentation (or the actual file) for other options.
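
To get a feel for how the bounds in the two blocks above would sort values, here is a small illustration in shell. It only mirrors the rules written in the comments of the config (warning and failure limits); the function classify is hypothetical and is not how collectd itself evaluates thresholds. In particular, it says nothing about which block wins when both match.

```shell
#!/bin/bash

# classify is a hypothetical illustration, not collectd code.
# Usage: classify VALUE WARNING_MIN FAILURE_MIN WARNING_MAX FAILURE_MAX
# Pass "-" for a bound that is not set.
classify() {
    awk -v v="$1" -v wmin="$2" -v fmin="$3" -v wmax="$4" -v fmax="$5" 'BEGIN {
        state = "OKAY"
        if ((wmin != "-" && v + 0 < wmin + 0) || (wmax != "-" && v + 0 > wmax + 0))
            state = "WARNING"
        if ((fmin != "-" && v + 0 < fmin + 0) || (fmax != "-" && v + 0 > fmax + 0))
            state = "FAILURE"
        print state
    }'
}

# First block (WarningMax 200, FailureMax 400):
classify 250 - - 200 400   # prints WARNING
classify 450 - - 200 400   # prints FAILURE

# Second block (WarningMin 20, FailureMin 1):
classify 30.44 20 1 - -    # prints OKAY
classify 0.5 20 1 - -      # prints FAILURE
```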

Bottom line

Collectd is worth trying, but be prepared that not everything is going to be pretty from the start. The system allows you to write your own data sources, and once you know the tricks (see above) it's damn easy and still very reliable, so it's definitely a monitoring solution worth considering.

I hope that by listing the tricks I learned while working with collectd I'll make at least one life a bit easier :-). I have to say that collectd was quite easy to start with, but as problems arose we were about to switch to some other monitoring solution. Now, after finding our way with it, we'll probably stay with it for a while.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License