The first thing we did at my previous company (a start-up) was to automate everything. Granted, there are some things that you cannot automate for whatever reason, but you should try to automate as much as possible for the sake of scalability. Once an automated process is set up, the only thing left is to make sure it is running fine and to find out about any warnings or failures.
Nagios is a great tool for monitoring this type of thing. There is an almost infinite number of things that you can do with it. It used to be a pain to install, but lately the installation is very straightforward.
In Nagios, there are two types of checks: active checks and passive checks. Active checks are initiated by Nagios on a regular schedule, very much like a heartbeat in high-availability architectures. The example that I always give managers is a constant "are you OK?" conversation between the Nagios server and its clients. This is great if you need a consistent check on a particular resource. For example, uptime, drive space, CPU load, and memory usage should be monitored at a fixed interval (say, every 5 minutes). For processes that are asynchronous and triggered by a particular event, we use passive checks, where an external application reports the result to Nagios. In my case, I had the following requirements:
- Client(s) can provide me a file at any time between 9 am and 6:30 pm
- The file must follow a specific format
- If the format is not valid, we need to contact the client
- If the format is correct, persist it into the DB
- Once it is in the DB, we launch another process that performs some statistical calculations
Passive checks can be for a host or a particular service. In this example, I will cover the steps to configure a service:
- Install Nagios
- Configure a particular service for this component
- Create an external application that checks the state of the application (in my case I used Groovy and shell script - groovysh)
- Write to an external command file
This is the picture of the process for Nagios:
A few settings need to be enabled in the main Nagios configuration file (/usr/local/nagios/etc/nagios.cfg) for passive checks to work. Make sure that the following are set to "1" (enabled):
- accept_passive_service_checks=1
- check_external_commands=1
There should also be a "command_file" entry pointing to the external command file. For example: command_file=/usr/local/nagios/var/rw/nagios.cmd
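A quick way to confirm these three settings is to grep for them. This is only a sketch: show_passive_settings is a hypothetical helper name, and the default path assumes the /usr/local/nagios install used in this post.

```shell
#!/bin/sh
# Assumption: default install path from this post; override NAGIOS_CFG if yours differs.
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Print the uncommented passive-check settings found in a Nagios main config file.
show_passive_settings() {
    grep -E '^(accept_passive_service_checks|check_external_commands|command_file)=' "$1"
}

# Only run against the real file when it is readable.
if [ -r "$NAGIOS_CFG" ]; then
    show_passive_settings "$NAGIOS_CFG"
fi
```

Commented-out lines are skipped on purpose, so a setting that is missing from the output is either absent or disabled with a leading "#".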
Then, configure the service and set passive_checks_enabled to 1. This is done in the host configuration file (localhost in my case):
vi /usr/local/nagios/etc/objects/localhost.cfg
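The definition might look like the sketch below. The service name asynch_client_files is the one used in this post; everything else is an assumption based on the sample configuration that ships with Nagios Core (the local-service template and the check_ping command). A check_command is still required for a passive service, but with active checks disabled it should not be run on a schedule.

```
define service{
        use                     local-service        ; assumption: template from the sample config
        host_name               localhost
        service_description     asynch_client_files
        check_command           check_ping!100.0,20%!500.0,60%   ; placeholder only (assumption)
        active_checks_enabled   0                    ; no scheduled checks
        passive_checks_enabled  1                    ; accept results from the command file
        }
```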
Restart Nagios:
sudo -i service nagios restart
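It can save a failed restart to validate the configuration first. Nagios has a verify mode (the -v option) for this. The binary path below assumes the same /usr/local/nagios prefix as the rest of this post, and verify_config is a hypothetical helper name.

```shell
#!/bin/sh
# Assumptions: binary and config live under the /usr/local/nagios prefix.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Run the Nagios pre-flight check; returns non-zero if the config has errors.
verify_config() {
    "$NAGIOS_BIN" -v "$NAGIOS_CFG"
}

# Skip quietly on machines where Nagios is not installed at this path.
if [ -x "$NAGIOS_BIN" ]; then
    verify_config && echo "Configuration looks valid, safe to restart"
fi
```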
You should now be able to see the "asynch_client_files" service under localhost. The next step is to check whether the passive check works via the command file. Nagios learns about passive events by reading lines that external applications write to the command file (nagios.cmd). Each line has the following format:
[<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<svc_description>;<return_code>;<plugin_output>
where...
timestamp is the time in time_t format (seconds since the UNIX epoch) that the service check was performed (or submitted). Please note the single space after the right bracket.
host_name is the short name of the host associated with the service in the service definition
svc_description is the description of the service as specified in the service definition
return_code is the return code of the check (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
plugin_output is the text output of the service check (i.e. the plugin output)
For a quick test, you can get the current timestamp on Linux with: date +%s
To run the test, write a line in that format to the command file from a terminal. In my case, I had to log in as root to do so.
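A minimal sketch of such a test submission follows. The host, service, and message are the ones from this post; passive_result is a hypothetical helper name, and the fallback to a dry run when the pipe is missing is my own addition so the script is harmless on a machine without Nagios.

```shell
#!/bin/sh
# Path from the command_file setting; override CMD_FILE if yours differs.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Build one passive service check line.
# $1=host_name  $2=svc_description  $3=return_code  $4=plugin_output
passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

line=$(passive_result localhost asynch_client_files 0 "The security master is up to date")

# The command file is a named pipe that exists while Nagios is running.
if [ -p "$CMD_FILE" ]; then
    printf '%s\n' "$line" > "$CMD_FILE"
else
    printf 'dry run (no command pipe at %s): %s\n' "$CMD_FILE" "$line"
fi
```

Checking for a named pipe first avoids accidentally creating a regular file at that path when Nagios is stopped.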
If you go back to the Nagios site (http://localhost/nagios), you should see that the change has been applied: the service now shows a status of "OK" with the comment "The security master is up to date".
The only thing missing is an application that sets the status of the service. You can write it in the language of your choice and use crontab or something else to run it.
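As an illustration only, here is a shell sketch of such an application. It is not the Groovy program I used: the incoming file path, the helper names, and the status messages are all hypothetical, and it only checks whether the file has arrived, since format validation depends on the client's format.

```shell
#!/bin/sh
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"
INCOMING="${INCOMING:-/data/incoming/client_file.csv}"   # hypothetical path

# Map the file's state to "<return_code>;<plugin_output>".
file_status() {
    if [ ! -e "$1" ]; then
        echo "1;File not received yet"
    elif [ ! -s "$1" ]; then
        echo "2;File is empty"
    else
        echo "0;File received"
    fi
}

# Print one external-command line for the asynch_client_files service.
build_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;localhost;asynch_client_files;%s\n' \
        "$(date +%s)" "$(file_status "$1")"
}

# Write to Nagios only when its command pipe exists; otherwise just print.
if [ -p "$CMD_FILE" ]; then
    build_result "$INCOMING" > "$CMD_FILE"
else
    build_result "$INCOMING"
fi
```

A crontab entry such as `*/5 9-18 * * * /usr/local/bin/check_client_file.sh` (script path hypothetical) would run it every five minutes across approximately the delivery window.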