The main goal of this project is to keep a critical service such as VoIP operational with minimum downtime in case of an emergency, network congestion, or hardware failure. The same failover scheme can also be used with servers in different geographical locations and on different ISPs, or in a virtual office environment. It keeps downtime to a minimum when the network is inaccessible for a short period, up to 5 minutes, for example because of congestion or an unplugged cable. The system is based on bash scripts and uses only basic system tools, so it is easy to use and its resource usage is low.
The system uses an active/passive clustering approach (PBX01 is the master, PBX02 is the slave) with semi-automatic failover: if the master PBX runs into trouble, switching to the backup PBX server is automatic, but recovery back to the master is performed manually.
For reliability we also use three check points. This avoids false alerts when, because of a routing problem, the master server is not accessible from the slave server even though it is still running; without this safeguard both servers would become active and both would process calls.
The main script runs on PBX02 (the slave PBX). If PBX01 is not accessible, we start the PBX service on the slave server and change the CNAME of sip.domain.com to point to PBX02, so that all inbound and outbound calls are processed by this server. This ensures that the domain always points to the currently active server.
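The article does not say how the DNS record is actually changed. As one possible approach, assuming the zone accepts RFC 2136 dynamic updates, the CNAME switch could be expressed as an nsupdate batch. The name server, target host and key file below are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch only: build an nsupdate batch that repoints a CNAME.
# Assumption: the zone accepts RFC 2136 dynamic updates; the article
# does not describe how its DNS record is really changed.

# build_cname_update <dns-server> <zone> <record> <ttl> <target>
# Prints the nsupdate commands on stdout without contacting any server.
build_cname_update() {
    local server=$1 zone=$2 record=$3 ttl=$4 target=$5
    printf 'server %s\n' "$server"
    printf 'zone %s\n' "$zone"
    # Remove the old CNAME first, then add the new one with the given TTL.
    printf 'update delete %s CNAME\n' "$record"
    printf 'update add %s %s CNAME %s\n' "$record" "$ttl" "$target"
    printf 'send\n'
}

# Hypothetical usage (host names and key path are placeholders):
# build_cname_update ns1.domain.com domain.com sip.domain.com 30 pbx02.domain.com. \
#     | nsupdate -k /root/bin/ddns.key
```

Keeping the batch generation separate from the nsupdate call makes it possible to inspect the exact update before it is sent.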
All of this is implemented in a script that monitors the accessibility of PBX01 from PBX02 using NRPE on three check points.
The script is run by cron every minute on the slave PBX:
*/1 * * * * /root/bin/check_voip.sh
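The contents of check_voip.sh are not shown here, so the following is only a minimal sketch of how three check points might be queried through NRPE. It assumes the standard check_nrpe plugin (its install path varies by distribution) and a hypothetical NRPE command, here called check_pbx01, defined on each check point.

```shell
#!/bin/bash
# Sketch only: ask several check points whether PBX01 is reachable.
# CHECK_CMD defaults to the check_nrpe plugin; the path is an assumption.
CHECK_CMD=${CHECK_CMD:-/usr/lib/nagios/plugins/check_nrpe}

# count_failures <nrpe-command> <check-point>...
# Prints how many check points reported a non-OK status.
count_failures() {
    local nrpe_command=$1 failures=0 host
    shift
    for host in "$@"; do
        # Nagios plugins exit 0 for OK; anything else counts as a failure.
        if ! "$CHECK_CMD" -H "$host" -c "$nrpe_command" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}

# all_failed <nrpe-command> <check-point>...
# Succeeds only when at least one check point was given and every one failed.
all_failed() {
    local total=$(( $# - 1 ))
    [ "$total" -gt 0 ] && [ "$(count_failures "$@")" -eq "$total" ]
}

# Hypothetical usage (check point names are placeholders):
# if all_failed check_pbx01 cp1.domain.com cp2.domain.com cp3.domain.com; then
#     echo "PBX01 unreachable from all check points"
# fi
```

Requiring every check point to fail is what protects against the split situation described earlier, where only the path between the two PBX servers is broken.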
The failover script works according to the following algorithm:
1. Get data from all check points. If everything is OK, reset the fail counter and start again from the beginning*.
2. If all check points report a failure, increment the fail counter by one and go to the next step.
3. If the fail counter is equal to 5 (meaning PBX01 has been inaccessible from all check points for 5 minutes), enable the PBX service on the slave PBX and change the DNS record for sip.domain.net to point to PBX02 (slave PBX).
4. If the PBX service is started on PBX02, check the PBX service on PBX01. If it is enabled, stop the PBX service on PBX02 (slave).
5. In case of a PBX service failure, a notification is sent by email.
* The fail counter is used to count 5 minutes of inaccessibility and to start the failover procedure.
- The TTL for sip.domain.net should be set to 30 seconds in advance.
- /root/bin/event – this file stores all information about the script's activity (start events, stop events, the counter, etc.).
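The counter part of the algorithm above can be sketched as a small function. This is an illustration under assumptions, not the actual script: the threshold of 5 comes from the text, while the one-number state-file format and the action names (reset, wait, failover) are invented for the example.

```shell
#!/bin/bash
# Sketch only: the fail-counter step, with no service or DNS side effects.
# THRESHOLD=5 matches the article; the state-file format is an assumption.
THRESHOLD=5

# next_action <state-file> <check-result>
# check-result is "ok" or "fail" (fail = every check point reported failure).
# Updates the counter stored in the state file and prints one of:
#   reset, wait, failover
next_action() {
    local state_file=$1 result=$2 count=0
    if [ -r "$state_file" ]; then
        read -r count < "$state_file" || true
    fi
    # Treat a missing or corrupt counter as zero.
    case $count in ''|*[!0-9]*) count=0 ;; esac

    if [ "$result" = "ok" ]; then
        echo 0 > "$state_file"
        echo reset
        return 0
    fi

    count=$((10#$count + 1))
    echo "$count" > "$state_file"
    if [ "$count" -ge "$THRESHOLD" ]; then
        echo failover
    else
        echo wait
    fi
}
```

With a one-minute cron interval, five consecutive "fail" results correspond to the five minutes of inaccessibility. Once the threshold is reached the function keeps printing "failover" on later failures, so a caller would still need to check whether the PBX service on PBX02 is already running before starting it again.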
PBX02 (slave) holds a full working copy of PBX01 (master), without any control panel or other control scripts. The copy is refreshed by a cron job every two minutes:
*/2 * * * * /root/bin/rsync.sh
Synchronization is this frequent because all files and database data need to stay synchronized. This gives us the following advantages:
- Call queue configuration – agents who were logged in to a queue on the master server do not need to log in again on the slave server and start receiving calls without any action on their part.
- PBX user parameter configuration – all users can immediately receive incoming calls and place outgoing calls.
The servers use master-slave MySQL replication, with the MySQL master on PBX01.
With a VSP (VoIP Service Provider) such as DIDww, we set up a rule: a call is first sent to PBX01, and if that server is not accessible the call is forwarded to PBX02, so the call always lands on the right server. Providers that work by registration receive a new registration, and all their calls are sent to the active PBX server.
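Returning to the two-minute file synchronization job: the contents of /root/bin/rsync.sh are not published here, so the following is only a sketch of what such a job might do. The master host name and the directory list are hypothetical, and the database itself is assumed to be covered by MySQL replication rather than by rsync.

```shell
#!/bin/bash
# Sketch only: pull a list of directories from the master with rsync.
# MASTER and the directories passed in are hypothetical placeholders.
RSYNC_CMD=${RSYNC_CMD:-rsync}
MASTER=${MASTER:-root@pbx01}

# sync_dirs <dir>...
# Mirrors each directory from the master to the same local path.
# Returns non-zero if any transfer failed.
sync_dirs() {
    local dir status=0
    for dir in "$@"; do
        # -a preserves permissions and times, -z compresses in transit,
        # --delete removes local files that no longer exist on the master.
        "$RSYNC_CMD" -az --delete "$MASTER:$dir/" "$dir/" || status=1
    done
    return "$status"
}

# Hypothetical cron entry; flock -n skips a run if the previous one
# is still in progress:
# */2 * * * * flock -n /var/lock/pbx-rsync.lock /root/bin/rsync.sh
```

Guarding the job with a lock is a design choice worth considering at a two-minute interval, since a slow transfer could otherwise overlap with the next run.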
To recover PBX01 after a failover:
On the first server, run /root/bin/recoverymysql.sh. This can cause a downtime of up to 2-3 minutes because of the database synchronization. The main purpose of this script is to synchronize all data that was added on PBX02 (slave) while that server was active. It also changes the CNAME of sip.domain.com back to point to PBX01.
In conclusion, this approach helps keep critical services such as VoIP, which are tremendously important for many companies, up and running with minimal downtime. The scheme can be applied in a virtual office environment, where offices and PBX services are located in geographically separate places. Downtime can be reduced further by tuning the failover system parameters.