This article describes the basic principles of the RSMON distributed monitoring system. It may be of interest to specialists who want more reliable, centralized and flexible monitoring with fewer false states.
The main goal of every monitoring system is to form a valid picture of the status of the network and equipment. Like any resource, however, the monitoring system itself is subject to periodic disruption, which requires duplication of its controlling elements (in our case, Nagios instances). Under high load, the work must also be distributed between several Nagios nodes.
Nagios duplication is not difficult. It can easily be handled by a couple of shell scripts that use rsync, svn or other mechanisms to replicate the configuration.
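As a rough illustration of such replication, the sketch below builds and runs one rsync command per peer. It is written in Python rather than shell only to keep the examples in this article in one language. The configuration path, the peer names and the helper functions are assumptions made for illustration, not a description of our actual scripts.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumed location of the Nagios configuration; adjust to the real install.
CONFIG_DIR = "/usr/local/nagios/etc/"


def build_rsync_command(peer: str, config_dir: str = CONFIG_DIR) -> List[str]:
    """Return the rsync command that mirrors config_dir to the same path on peer."""
    # -a keeps permissions and timestamps, -z compresses data in transit,
    # --delete removes files on the peer that no longer exist locally.
    return ["rsync", "-az", "--delete", config_dir, f"{peer}:{config_dir}"]


def replicate(peers: Sequence[str],
              run: Callable[[List[str]], int] = subprocess.call) -> List[str]:
    """Push the configuration to every peer and return the peers that failed."""
    failed = []
    for peer in peers:
        # A non-zero exit status from rsync means the push did not complete.
        if run(build_rsync_command(peer)) != 0:
            failed.append(peer)
    return failed
```

After a successful push, each peer still has to reload Nagios before the new configuration takes effect.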
Load distribution is more complicated: it has to rely either on an administrative division of hosts or on complex automated mechanisms based on Nagios add-ons.
In this case, every Nagios instance is itself an additional monitoring point that requires user intervention. Simply multiplying Nagios instances leads to an unmanageably large number of such points, which makes it impossible to grow the monitoring system beyond a certain number of Nagios instances.
Other problems include false-negative reports. For example, Nagios1 shows a connect timeout alert for the SSH service on server1, but Nagios2 shows no such timeout. Do we trust Nagios1 or Nagios2? To resolve this manually, several items need to be compared, such as the time of the last check, the state of that check, or the service status in Nagios3. A manual check may also be performed. All of this takes additional time and requires human intervention.
It is possible to check one service this way, or two, or even five, but in practice we often experience temporary unavailability of some routes. In this case, Nagios1 in Europe cannot reach 100-200 servers in the USA (with 1,000-2,000 services), while Nagios2 in the USA and Nagios3 in Asia check the same services and find that they are working properly.
Almost all problems of this kind were solved by developing the REMSYS Monitoring interface.
The RSMON system is based on Nagios (as the main monitoring element) and NDOutils (as the database back-end).
RSMON is not only a visual interface. It is a complete solution that includes multiple Nagios instances, secure and reliable transport of Nagios messages to a central database, a decision system and a visualization interface.
Multiple Nagios instances are used to monitor the required hosts and services. There are 2-3 major monitoring servers with a common configuration and 3-5 specific servers, each of which handles its own list of services. (Formerly, hosts were inside private networks, and administrative infrastructure was distinguished from the public network.)
Currently, RSMON performance is very high, which makes large horizontal expansion of the monitoring system possible.
In our case, one main Nagios instance runs on a single quad-core Intel i7 920 (2.67GHz) with a 1-minute load average of 1.8-6 (15% CPU time at load) and monitors about 700 hosts with 7000 services. Three Nagios instances monitor 95% of the servers. The other 5% are monitored by their own instances.
Information from all instances is sent to RSMON, where a decision is made on each event in real time. The RSMON web interface is then used to view the current network map; typically 5-20 web pages are active at once, each with a 90-second refresh interval. Our inventory system has also been integrated with RSMON, which allows our customers to see the current status of our own hardware.
The RSMON OpenVZ VPS requires at most 5% of the capacity of its host (one quad-core Xeon X3360, 2.83GHz).
[Operation schema and principles]
Structure of Nagios hosts:
Monitored hosts are administratively organized into groups by accessibility.
Each group is controlled by a cluster of Nagios instances. In some cases, the Nagios cluster consists of only one instance. All nodes in a cluster share the same configuration.
Every instance is configured to run the NDOutils event broker, which is used to transfer the event flow to the RSMON central database.
Structure of RSMON monitoring system:
Currently, we have two RSMON hosts. Each host receives the full event flow from each Nagios instance. Every RSMON node is fully autonomous and does not depend on the other nodes. We use different fail-over mechanisms to switch the visualization interfaces to an accessible and working RSMON node.
Internally, the RSMON host runs the NDOutils ndo2db daemon to transfer incoming events to a MySQL database. Next, decision logic working at the database level makes a decision on every event. After the decision is made, the event either replaces the current state of the service as a newer event or is moved to the history. Problems arise if the current state already holds an event with a newer timestamp, for example because a timestamp is incorrect or lies in the future, so we strictly enforce and monitor the synchronization of local clocks among all Nagios instances.
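To illustrate why clock synchronization matters, here is a minimal sketch of a sanity check that flags events stamped ahead of the receiver's clock. The function name, the tolerance value and the use of Python datetime objects are assumptions for illustration; the article does not describe how RSMON itself monitors clock drift.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance for clock drift between a Nagios instance and RSMON.
MAX_SKEW = timedelta(seconds=5)


def is_from_future(event_time: datetime, now: datetime,
                   max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the event is stamped ahead of the local clock beyond tolerance."""
    return event_time - now > max_skew


now = datetime(2010, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
# An instance whose clock runs 30 seconds fast produces a suspicious event.
suspicious = is_from_future(now + timedelta(seconds=30), now)
```

An event stamped in the future would look "newer" than every correct event that arrives before real time catches up, which is why such timestamps are worth detecting early.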
The decision system makes its decisions based on three major principles:
1. If a message is late, it is moved to the archive immediately. No outdated message should replace fresh information in the current network map.
2. If a message is marked as non-authoritative (such as a timeout), it is penalized for a few seconds. This delays the appearance of such messages, which resolves most of the false-negative states.
3. Fresh messages should always update the current state. This advances the network map.
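The three principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real decision logic runs at the database level, and the class names, the 10-second penalty, the definition of "late" as "not newer than the event currently shown", and the archiving of replaced events are all assumptions rather than details taken from RSMON.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PENALTY_SECONDS = 10.0  # assumed value; the article only says "a few seconds"


@dataclass
class Event:
    host: str
    service: str
    state: str                  # e.g. "OK" or "CRITICAL"
    check_time: float           # time of the check, seconds since the epoch
    authoritative: bool = True  # False for results such as timeouts


@dataclass
class DecisionSystem:
    current: Dict[Tuple[str, str], Event] = field(default_factory=dict)
    history: List[Event] = field(default_factory=list)
    pending: List[Tuple[float, Event]] = field(default_factory=list)

    def receive(self, event: Event, now: float) -> None:
        """Handle an event arriving from any Nagios instance at time now."""
        if not event.authoritative:
            # Principle 2: hold the event back for a short penalty period.
            self.pending.append((now + PENALTY_SECONDS, event))
        else:
            self._apply(event)

    def tick(self, now: float) -> None:
        """Release penalized events whose penalty period has expired."""
        due = [event for release, event in self.pending if release <= now]
        self.pending = [(r, e) for r, e in self.pending if r > now]
        for event in sorted(due, key=lambda e: e.check_time):
            self._apply(event)

    def _apply(self, event: Event) -> None:
        key = (event.host, event.service)
        shown = self.current.get(key)
        if shown is not None and shown.check_time >= event.check_time:
            # Principle 1: a late message goes straight to the archive.
            self.history.append(event)
            return
        # Principle 3: a fresh message replaces the current state.
        if shown is not None:
            # Assumption: the replaced event is kept in the history.
            self.history.append(shown)
        self.current[key] = event


# The Nagios1/Nagios2 conflict from earlier in the article:
rsmon = DecisionSystem()
rsmon.receive(Event("server1", "SSH", "CRITICAL", 100.0, authoritative=False), now=100.0)
rsmon.receive(Event("server1", "SSH", "OK", 103.0), now=103.0)
rsmon.tick(now=111.0)
```

In this run the timeout reported at check time 100 is held back, the OK result from check time 103 becomes the current state, and when the penalty expires the timeout is already outdated and goes to the history without ever appearing on the network map. If no fresher result had arrived, the timeout would have been shown after the penalty period.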
At this point, the information is ready to be visualized.
We have two major visualization interfaces:
- my.remsys customer panel – integrated status
This interface is used by regular customers to see the status of corresponding servers and services. It is also possible to request a list of historic events.
- monitoring interface
This interface was developed for human monitoring. Generating a monitoring page takes 0.01s of CPU time and produces about 90KB of HTML code (with 39 hosts and 66 services reported), which makes this a very fast monitoring interface.
Main interface functions:
Per-user configurations (stored in cookies):
- report type: events from the decision system, or unfiltered events from all instances
- different sort orders
- include penalized events or not
- include OK events or not
- some output form customizations
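As an illustration of per-user settings kept in cookies, the sketch below parses a Cookie header into a preferences object using Python's standard http.cookies module. The cookie names, the allowed values and the defaults are hypothetical; the article does not specify how RSMON encodes these options.

```python
from dataclasses import dataclass
from http.cookies import SimpleCookie

# Hypothetical option values; the real interface may use different ones.
REPORT_TYPES = ("decision", "raw")
SORT_ORDERS = ("time", "host", "state")


@dataclass
class ViewPrefs:
    report_type: str = "decision"  # "raw" means unfiltered events from all instances
    sort_order: str = "time"
    show_penalized: bool = False
    show_ok: bool = False


def prefs_from_cookie(header: str) -> ViewPrefs:
    """Parse a Cookie header into view preferences, falling back to defaults."""
    cookie = SimpleCookie()
    cookie.load(header)
    prefs = ViewPrefs()
    # Unknown or malformed values are ignored so a bad cookie cannot break the page.
    if "report_type" in cookie and cookie["report_type"].value in REPORT_TYPES:
        prefs.report_type = cookie["report_type"].value
    if "sort_order" in cookie and cookie["sort_order"].value in SORT_ORDERS:
        prefs.sort_order = cookie["sort_order"].value
    if "show_penalized" in cookie:
        prefs.show_penalized = cookie["show_penalized"].value == "1"
    if "show_ok" in cookie:
        prefs.show_ok = cookie["show_ok"].value == "1"
    return prefs


prefs = prefs_from_cookie("report_type=raw; show_ok=1; sort_order=host")
```

Keeping these settings in cookies means the server needs no per-user storage for display options; each browser carries its own view configuration.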
There is one major improvement: hosts are divided into host groups, and the person who monitors the network state can enable filtering based on lists of host groups (stored in the DB). If a member of the high-level tech support staff needs to monitor specific groups, they can enable filtering and will not be disturbed or overloaded by events from unwanted hosts.
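Host-group filtering of this kind can be sketched as follows. The group names, the in-memory mapping (standing in for the lists stored in the database) and the event tuples are hypothetical and serve only to show the idea.

```python
from typing import Dict, Iterable, List, Set, Tuple

# An event is represented here as (host, service, state).
EventRow = Tuple[str, str, str]


def visible_events(events: Iterable[EventRow],
                   group_members: Dict[str, Set[str]],
                   enabled_groups: Iterable[str]) -> List[EventRow]:
    """Keep only events whose host belongs to one of the enabled host groups."""
    allowed: Set[str] = set()
    for group in enabled_groups:
        # A group name with no known members simply contributes nothing.
        allowed |= group_members.get(group, set())
    return [event for event in events if event[0] in allowed]


groups = {"usa-web": {"web1", "web2"}, "eu-db": {"db1"}}
events = [("web1", "HTTP", "CRITICAL"), ("db1", "MySQL", "WARNING"), ("web2", "SSH", "OK")]
shown = visible_events(events, groups, ["eu-db"])
```

With only the "eu-db" group enabled, the operator sees the db1 event and nothing from the web servers.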
Flexible expansion and high event-processing performance, supported by an optimized database structure, leave this system plenty of room for future growth.
Additional information can be found at:
Nagios: http://www.nagios.com & http://wikipedia.org/wiki/Nagios