Introduction and Goals
We use Chef from Opscode at Fotopedia. Among many other things, Chef generates the Nagios configuration for all our services.
We currently have 220 tests running on the production grid, and every one of them is declared inside the Chef cookbooks.
Because the configuration of Nagios is quite complex, we want to have a high level of flexibility for a given test:
- Name of the probe
- active or passive test
- check command
- retry interval
- normal check interval
- dependency of the service
- maximum check attempts
- notification period
All these settings are the usual Nagios settings and you probably know what they are about. If this is not the case, please refer to the Nagios documentation for more info.
These settings should have a reasonable default value so that we can quickly add a test that will most of the time do what it's supposed to do.
Of course, the default value should depend on the grid the test is running on: there is no need to be notified of a disk space warning on the testing environment at night, whereas on the production environment we have to be.
Apart from these settings, we also want to specify the grids on which the test should run. The grid is a setting we have on our servers that indicates whether we are on the development, test, staging or production servers.
Finally, all tests should be declared in Chef recipes as close to the tested thing as possible, so that they are easy to maintain or at least easy for our fellow developers to cargo-cult.
Also, the machine running the Nagios server must have the ability to fetch a consistent list of services to monitor so as to be able to generate the configuration on the disk and have Nagios pick that up correctly.
These requirements led us to several Chef tips and tricks to implement this solution.
Consistent per host registry
During a Chef run, you can save the content of the node at any time by calling node.save. This is very useful but induces an additional reindex in chef-solr. Before 0.10, Solr was really CPU-hungry, and calling node.save outside the usual save (nodes are normally saved at the end of the run) was not a very good idea.
Moreover, calling node.save at any moment during a Chef run might close the Chef transaction of the current node, which can have ugly side effects: Chef runs and schedules a restart for the end of the run, the node is saved by your scripts, then Chef crashes. At the next run the restart will not be triggered, because the action leading to it has already been saved to the registry.
When chef runs, we want to list all the currently active tests on the box and to save that at the end of the run in some part of the registry where the Nagios server will be able to search for it.
Because we want something that reflects the actual content of the box, we want to enumerate all tests at every chef run, and hence start from an empty state and fill it up until the final state reflects all that's supposedly controlled by chef.
If chef runs on the Nagios server at the same time, we don't want to expose a semi-empty list of services to monitor.
So until chef-client has finished running on the box, we want to keep the old list of services, and switch it with the new list of services upon completion of the Chef run and not before.
And so we need an "atomic" commit in chef registry.
The trick is a bit complicated to implement in Chef because of the lack of hooks to run code before and after the run.
There are hooks for post run reports but at the time of the writing of this code, this was not simple, so we implemented two recipes for that. These recipes HAVE to be run as the first and the last recipe of the whole chef run.
The first recipe just creates an empty part of the registry with the "next_key" key that contains the place where all registered services will be stored until the end of the run. It also checks that it's running first and will crash if it's not the case. Because of the way we then declare monitored services, this could be optimized by being done at the first monitored service declaration (I leave that as an exercise for the reader) and not in the first recipe.
Of course, you have to run this recipe as the first recipe on all the nodes you want to monitor. An easy way to do that is to include the recipe in a role that you run on all your nodes anyway.
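To make the idea concrete, here is a minimal plain-Ruby sketch of what the first recipe does. It is not our actual recipe: the node is a plain Hash standing in for the Chef node, and the helper name begin_monitoring_run and the recipe names are assumptions made up for the illustration.

```ruby
# Hypothetical helper: opens an empty staging area for the current run.
# `node` is a plain Hash standing in for the Chef node attributes.
def begin_monitoring_run(node, run_list, this_recipe)
  # Crash early if this recipe is not the first one of the run.
  unless run_list.first == this_recipe
    raise "#{this_recipe} must be the first recipe of the run"
  end

  node[:monitored] ||= {}
  # Services declared during this run are staged here; :current is untouched.
  node[:monitored][:next] = {}
  node
end

# Usage with made-up recipe names:
node = { monitored: { current: { "apache2" => {} } } }
begin_monitoring_run(node, ["monitoring::begin", "apache2", "monitoring::end"], "monitoring::begin")
```

The important point is that the previous list stays readable under [:monitored][:current] while the new one is being filled.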
The last recipe does a very simple job too: after having checked that it's really the last recipe to run, it moves the [:monitored][:next] content to [:monitored][:current]. This is done in a Chef ruby_block because it has to run in the converge step of the chef-client run and not before.
As a bonus, the recipe also saves the list of monitored objects on the disk. This is useful for debugging and for interfacing with CLI tools that don't speak Chef.
After the last recipe has run, the node is saved and [:monitored] now contains the current list of services to monitor.
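The swap itself can be sketched in plain Ruby as follows. This is only an illustration under assumptions: the node is a plain Hash, commit_monitored! is a hypothetical name for what the body of the ruby_block does, and the JSON dump stands in for the file we write on the disk.

```ruby
require "json"

# Hypothetical helper: promotes the staged list to the visible one.
def commit_monitored!(node)
  # KeyError here means the first recipe never opened a staging area.
  staged = node[:monitored].fetch(:next)
  node[:monitored][:current] = staged
  node[:monitored].delete(:next)
  # Returned as a string; a real recipe would write it to a file for CLI tools.
  JSON.pretty_generate(staged)
end

node = {
  monitored: {
    current: { "old_service" => {} },
    next:    { "apache2" => { "check_command" => "check_http" } }
  }
}
dump = commit_monitored!(node)
```

Until commit_monitored! is called, anything reading [:monitored][:current] still sees the list from the previous run.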
Chef Helpers to declare services
Default Values and Usual tests
In the nagios/libraries folder of our nagios cookbook, the following file exists:
This file first contains all the default values for the Nagios tests (as described in the introduction). In particular, note that notification_period is a Hash that maps the grid to the notification period.
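As a rough sketch of the idea (not the file itself), grid-dependent defaults could look like this. The constant name, the helper resolve_defaults, the values and the timeperiod names ("24x7", "workhours") are all assumptions chosen for the example.

```ruby
# Hypothetical defaults; every value here is illustrative.
MONITORING_DEFAULTS = {
  "active"              => true,
  "max_check_attempts"  => 3,
  "check_interval"      => 5,
  "retry_interval"      => 1,
  # A Hash value means "pick the entry matching the grid".
  "notification_period" => {
    "production"  => "24x7",
    "staging"     => "workhours",
    "test"        => "workhours",
    "development" => "workhours"
  }
}.freeze

# Merge per-test overrides, then resolve grid-dependent values.
def resolve_defaults(grid, overrides = {})
  MONITORING_DEFAULTS.merge(overrides).each_with_object({}) do |(key, value), out|
    out[key] = value.is_a?(Hash) ? value.fetch(grid) : value
  end
end

production = resolve_defaults("production")
testing    = resolve_defaults("test", "max_check_attempts" => 5)
```

A test that overrides notification_period with a plain string gets that string on every grid.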
The other functions are simple wrappers that declare what is used to run the tests, both in Nagios (with command names) and from the CLI (as hinted before, we have a very crufty tool that allows us to run part of the tests from the CLI without Nagios).
Nothing too fancy. It works nevertheless.
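For illustration only, wrappers of this kind could be as small as the following. The return shape (one string for Nagios, one for the CLI) and the use of localhost are assumptions; only the helper names check_http and nrpe come from our recipes.

```ruby
# Hypothetical wrappers: each returns the Nagios form ("command!args")
# and a command line that a CLI tool could run directly.
def check_http(args)
  {
    "nagios" => "check_http!#{args}",
    "cli"    => "check_http -H localhost #{args}"
  }
end

def nrpe(command)
  {
    "nagios" => "check_nrpe!#{command}",
    "cli"    => "check_nrpe -H localhost -c #{command}"
  }
end

http_check = check_http("-p 80 -u /")
load_check = nrpe("check_load")
```

Whatever shape you pick, the Nagios form must match a command declared in your commands.cfg.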
Monitored define
The helper we use a lot is a define that can be used anywhere in our recipes to declare a monitored service:
This define bends the define mechanism of Chef a bit, particularly regarding the parameters. Apart from that, note that the name must be unique on the node (as you'd expect, the name is also important in Nagios, so this limitation seems sane).
Monitored takes all the parameters you want and stores them in the registry for later retrieval.
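The "take any parameter and store it" behaviour can be imitated outside Chef with method_missing. This is a plain-Ruby sketch, not the define itself: the ParamCollector class, the explicit registry argument and the error raised on a duplicate name are assumptions.

```ruby
# Collects every `name value` call made inside the block into a Hash.
class ParamCollector
  attr_reader :params

  def initialize
    @params = {}
  end

  def method_missing(name, *args)
    return super if args.empty? # a bare word is most likely a typo
    @params[name.to_s] = args.length == 1 ? args.first : args
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

# Hypothetical stand-in for the define: stores the parameters under `name`.
def monitored(registry, name, &block)
  raise ArgumentError, "service #{name} is already declared" if registry.key?(name)
  collector = ParamCollector.new
  collector.instance_eval(&block) if block
  registry[name] = collector.params
end

registry = {}
monitored(registry, "apache2") do
  check_command "check_http!-p 80 -u /"
  restart_command "apache2ctl graceful"
  has_internet true
end
```

Nothing is validated at declaration time; whatever was stored is interpreted later, when the Nagios configuration is generated.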
The depends parameter is a magic one and will help you write dependency chains in Nagios directly in Chef. It must be an array. You can use the following syntax for each of its elements:
- fservice@localhost : currently declared service is dependent on fservice running on the same host
- fservice@server : service depends on fservice on server
We have thought about extending this option to support fservice@* (so as to describe a service depending on all available fservice instances), but this has not been implemented.
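Parsing this syntax takes only a few lines. The sketch below is an assumption about how it could be done, not our actual code; parse_dependency is a hypothetical name, and the wildcard is rejected explicitly since it is not supported.

```ruby
# Turns "service@host" into a Hash, resolving "localhost" to the
# host that declares the dependency.
def parse_dependency(entry, current_host)
  service, host = entry.split("@", 2)
  if service.to_s.empty? || host.to_s.empty?
    raise ArgumentError, "expected service@host, got #{entry.inspect}"
  end
  raise NotImplementedError, "service@* is not supported" if host == "*"

  { "service" => service, "host" => host == "localhost" ? current_host : host }
end

local  = parse_dependency("fservice@localhost", "web1.example.com")
remote = parse_dependency("fservice@server", "web1.example.com")
```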
Here are some real-world examples:
monitored "apache2" do
  check_command check_http("-p #{node[:apache][:listen_ports][0]} -u /")
  restart_command "apache2ctl graceful"
end
Apache2 is monitored using check_http (helper defined in the usual tests, see previous gist), the port is fetched from the node registry.
Oh yeah. The restart_command is used by our CLI tools to tell our fellow developers what to do when Apache is down. Of course, you could add has_internet true or any other parameter and retrieve it later.
[ :queues, :failures, :pending, :processed, :workers, :working, :failed ].each do |kind|
  passive_monitored "resque #{kind}"
end
This declares the various Resque queues as monitorable in Nagios.
{ 'Current_users'   => 'check_users',
  'Total_Processes' => 'check_total_procs',
  'Current_Load'    => 'check_load',
  'Disk_Usage'      => 'check_disk' }.each do |name, command|
  monitored name do
    kind 'nrpe'
    check_command nrpe(command)
    infrabox_check false
  end
end
This one creates NRPE monitored tests. It maps each name to an NRPE command, which you'd better declare in the nrpe cookbook. It also indicates that these tests are not supposed to run on the infrabox.
As you can see, it's very easy to add new tests to the node registry. All these monitored instructions can happen right next to the daemon configuration.
Generating Nagios Configuration
First, all the commands which are in check_commands must be declared in your Nagios commands.cfg.
By improving the OO structure of check_commands.rb, it would be possible to store everything necessary to generate the commands.cfg file of Nagios. The truth is that this file doesn't change a lot, so it's usually very quick to add a new test to the file directly in the Nagios cookbook and then create a simple function in check_commands.rb.
I will only describe how to generate the configuration files for all the nodes. The installation and regular setup of Nagios can be done however you like.
We use the following recipe:
Nothing too fancy here, most of the magic has happened elsewhere. Three steps:
- Delete files corresponding to old nodes no longer monitored
- Get current monitored nodes (it is used to resolve the dependency graph). This means that you might have to run chef twice to have the correct tree available for your dependencies.
- Resolve dependencies and dump /etc/nagios3/conf.d/#{n['fqdn']}.cfg on the disk. Also ask Nagios to reload
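The file handling behind these three steps can be sketched in plain Ruby. This is a simplified stand-in, not the recipe: sync_nagios_configs is a hypothetical name, the list of monitored nodes is passed in as a Hash instead of coming from a Chef search, the rendering is delegated to a block, and the directory is assumed to contain nothing but per-node files.

```ruby
require "fileutils"
require "tmpdir"

# Writes one <fqdn>.cfg per monitored node and removes files of nodes
# that are gone. Returns true when Nagios needs a reload.
def sync_nagios_configs(conf_dir, monitored_nodes)
  wanted = monitored_nodes.keys.map { |fqdn| File.join(conf_dir, "#{fqdn}.cfg") }
  stale  = Dir.glob(File.join(conf_dir, "*.cfg")) - wanted
  FileUtils.rm_f(stale)

  changed = !stale.empty?
  monitored_nodes.each do |fqdn, services|
    path = File.join(conf_dir, "#{fqdn}.cfg")
    body = yield(fqdn, services) # rendering is left to the caller
    next if File.exist?(path) && File.read(path) == body
    File.write(path, body)
    changed = true
  end
  changed
end

Dir.mktmpdir do |dir|
  nodes = { "web1.example.com" => { "apache2" => {} } }
  sync_nagios_configs(dir, nodes) { |_fqdn, services| services.keys.join("\n") }
end
```

Returning whether anything changed avoids reloading Nagios on every run.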
And the missing part, the service template:
Note that we reuse the Default defined earlier and dump the monitored registry of the node into the file. Note also node[:swissr][:grid], which is used as a selector for some of the default values.
The service dependencies are generated at this point too.
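To give an idea of the shape of such a template, here is a reduced ERB sketch that runs outside Chef. It is not our template: the generic-service template name, the notification_failure_criteria value and the layout of the services Hash are assumptions, and the defaults are expected to be resolved already.

```ruby
require "erb"

# Hypothetical template: one service block per test, plus one
# servicedependency block per resolved dependency.
SERVICE_TEMPLATE = ERB.new(<<~'TPL', trim_mode: "-")
  <%- services.each do |name, s| -%>
  define service {
    use                   generic-service
    host_name             <%= fqdn %>
    service_description   <%= name %>
    check_command         <%= s["check_command"] %>
    notification_period   <%= s["notification_period"] %>
  }
  <%- s.fetch("depends", []).each do |dep| -%>
  define servicedependency {
    host_name                       <%= dep["host"] %>
    service_description             <%= dep["service"] %>
    dependent_host_name             <%= fqdn %>
    dependent_service_description   <%= name %>
    notification_failure_criteria   w,u,c
  }
  <%- end -%>
  <%- end -%>
TPL

def render_services(fqdn, services)
  SERVICE_TEMPLATE.result_with_hash(fqdn: fqdn, services: services)
end

config = render_services(
  "web1.example.com",
  "apache2" => {
    "check_command"       => "check_http!-p 80 -u /",
    "notification_period" => "24x7",
    "depends"             => [{ "host" => "db1.example.com", "service" => "mysql" }]
  }
)
```

In a servicedependency block, host_name and service_description designate the service being depended upon, and the dependent_ fields designate the service declared in the recipe.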