08.02.2011

VoIP PBX server failover.




1. Introduction

The main goal of this project is to keep a critical service such as VoIP operational with minimum downtime in case of an emergency, network congestion or hardware failure. The same failover scheme can also be used with servers in different geographical locations, on different ISPs, or in a virtual office environment. It keeps downtime to a minimum when the network is inaccessible for a short period (up to 5 minutes), for example because of congestion or an unplugged cable. The system is built from bash scripts and uses only basic system tools, so it is not hard to use and its resource usage is low.

2. Description

The system uses an active/passive clustering approach (PBX01 is the master, PBX02 is the slave) with semiautomatic fail-over: if the master PBX runs into trouble, switching to the backup PBX server is automatic, but fail-recovery is performed manually.
For reliability we use three check points. This avoids false alerts in the case where a routing problem makes the main server inaccessible from the slave server only, which would otherwise leave both servers active and processing calls.
The main script runs on PBX02 (slave PBX). If PBX01 is not accessible, the script starts the PBX service on the slave server and changes the CNAME of sip.domain.com to point to PBX02, so that all inbound and outbound calls are processed by this server. This ensures that the domain always points to the currently active server.
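
The three-check-point rule can be illustrated with a small bash sketch. It is not the original script: the function name master_unreachable is hypothetical, and it assumes that each check point reports an exit status where 0 means PBX01 was reached and any other value means it was not.

```bash
#!/usr/bin/env bash
# Sketch: decide whether PBX01 is really down, based on several check points.
# Assumption: each argument is the exit status of one probe,
# 0 = PBX01 reachable, anything else = not reachable.

# master_unreachable STATUS...
# Returns 0 (true) only when every check point failed to reach PBX01.
master_unreachable() {
    local status

    # With no results at all we cannot claim the master is down.
    [ "$#" -gt 0 ] || return 1

    for status in "$@"; do
        if [ "$status" -eq 0 ]; then
            return 1    # at least one check point still sees PBX01
        fi
    done
    return 0
}

# Usage:
#   master_unreachable 2 2 2 && echo "PBX01 is down from every check point"
#   master_unreachable 2 0 2 || echo "only some check points fail, do nothing"
```

Requiring a unanimous verdict is what protects against the split-brain case described above: a routing problem seen by only one check point never activates the slave.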

3. Realization

To implement everything discussed above, we developed a script that monitors the accessibility of PBX01 from PBX02 using nrpe on three check points.
The script is run by cron every minute on the slave PBX.
*/1 * * * * /root/bin/check_voip.sh

The fail-over script works by the following algorithm:
1. Get data from all check points. If everything is OK, reset the counter and start again from the beginning*.
2. If all check points report a failure, increment the counter by one and go to the next step.
3. If the fail counter is equal to 5 (meaning PBX01 has not been accessible for 5 minutes from any check point), enable the PBX service on the slave PBX and change the DNS record for sip.domain.net to point to PBX02 (slave PBX).
4. If the PBX service is started on PBX02, check the PBX service on PBX01. If it is enabled, stop the PBX service on PBX02 (slave).
5. In case of a PBX service failure, a notification is sent by email.

* The fail counter is used to count 5 minutes of inaccessibility and to start the fail-over procedure.
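
Steps 1 to 4 can be sketched as a single decision function that is evaluated once per cron run. This is only an illustration under stated assumptions: the name decide, its arguments and the action labels are hypothetical, and a partial failure (some check points fail, others do not) is treated like a healthy master, which the algorithm above does not spell out.

```bash
#!/usr/bin/env bash
# Sketch of one cron run of the fail-over decision (steps 1-4).
# Inputs are plain values so the logic can be tried without real servers.

FAIL_THRESHOLD=5   # consecutive failed minutes before fail-over (from the text)

# decide COUNTER FAILED_POINTS TOTAL_POINTS SLAVE_ACTIVE MASTER_PBX_UP
# SLAVE_ACTIVE and MASTER_PBX_UP are "yes" or "no".
# Prints "<new_counter> <action>", action being none, failover or stop_slave.
decide() {
    local counter=$1 failed=$2 total=$3 slave_active=$4 master_up=$5

    # Step 4: the slave is serving calls but the master PBX is back.
    if [ "$slave_active" = "yes" ] && [ "$master_up" = "yes" ]; then
        echo "0 stop_slave"
        return
    fi

    # Step 1: at least one check point still reaches PBX01 (assumption:
    # a partial failure also resets the counter).
    if [ "$failed" -lt "$total" ]; then
        echo "0 none"
        return
    fi

    # Step 2: every check point reported a failure.
    counter=$((counter + 1))

    # Step 3: threshold reached and the slave is not active yet.
    if [ "$counter" -ge "$FAIL_THRESHOLD" ] && [ "$slave_active" != "yes" ]; then
        echo "$counter failover"
    else
        echo "$counter none"
    fi
}

# Usage: fifth failed minute in a row, slave still passive.
#   decide 4 3 3 no no    # prints "5 failover"
```

Keeping the decision separate from the actions (starting the PBX service, changing DNS, sending email) makes the counter logic easy to check on its own.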

Note that:
- The TTL for sip.domain.net should be set to 30 seconds in advance.
- /root/bin/event is the file where we store all information about the script's work (start event, stop event, counter, etc.).
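
Because cron starts a fresh process every minute, the fail counter has to survive between runs, which is the role of the event file. The sketch below shows one possible way to do that. The KEY=VALUE layout and the function names read_counter and write_state are assumptions made for this example, not the format used by the original script.

```bash
#!/usr/bin/env bash
# Sketch: keep the fail counter between cron runs in a small state file.
# EVENT_FILE defaults to the path named in the text; the file layout
# (counter=, last_event=, updated=) is an assumption.

EVENT_FILE="${EVENT_FILE:-/root/bin/event}"

# read_counter: print the stored counter, or 0 if the file is missing
# or does not contain a valid counter line.
read_counter() {
    local value=""
    if [ -r "$EVENT_FILE" ]; then
        value=$(sed -n 's/^counter=\([0-9][0-9]*\)$/\1/p' "$EVENT_FILE" | tail -n 1)
    fi
    echo "${value:-0}"
}

# write_state COUNTER EVENT: replace the file with the latest state.
# Writing to a temporary file first avoids leaving a half-written file.
write_state() {
    local counter=$1 event=$2 tmp
    tmp="${EVENT_FILE}.tmp.$$"
    {
        echo "counter=$counter"
        echo "last_event=$event"
        echo "updated=$(date '+%Y-%m-%d %H:%M:%S')"
    } > "$tmp" && mv "$tmp" "$EVENT_FILE"
}

# Usage inside the cron script:
#   counter=$(read_counter)
#   write_state "$((counter + 1))" "check_failed"
```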

PBX02 (slave) holds a full working copy of PBX01 (master), without any control panel or other control scripts. The copy is refreshed by a cron job every two minutes:
*/2 * * * * /root/bin/rsync.sh
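
The contents of rsync.sh are not shown here, so the following is only a sketch of what such a script could look like. It assumes the job runs on PBX02 and pulls from PBX01 over SSH. The host name and the directory list are placeholders, and sync_from_master is a hypothetical function name.

```bash
#!/usr/bin/env bash
# Sketch of a pull-style sync run on PBX02.
# Host name and paths are placeholders, not values from the original setup.

MASTER_HOST="${MASTER_HOST:-pbx01.example.com}"        # hypothetical host
SYNC_PATHS="${SYNC_PATHS:-/etc/pbx/ /var/lib/pbx/}"    # hypothetical paths
RSYNC="${RSYNC:-rsync}"

# sync_from_master: copy every path in SYNC_PATHS from the master.
# Returns non-zero if any single transfer failed.
sync_from_master() {
    local path rc=0
    for path in $SYNC_PATHS; do
        # -a preserves permissions, owners and timestamps;
        # --delete removes files that no longer exist on the master.
        "$RSYNC" -a --delete -e ssh "root@${MASTER_HOST}:${path}" "$path" || rc=1
    done
    return "$rc"
}

# Usage (from cron): sync_from_master || echo "sync failed" >&2
```

With a pull from the slave side, a master outage simply makes the transfer fail and leaves the last good copy on PBX02 untouched.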

Synchronization is this frequent because all files and database data need to stay in sync. This gives us the following advantages:
- Call queue configuration: agents who were logged in to a queue on the master server do not have to log in again on the slave server and start receiving calls without any action on their part.
- PBX user parameter configuration: all users can immediately receive incoming calls and place outgoing ones.

The servers use master-slave MySQL replication, with the MySQL master on PBX01. With a VSP (VoIP Service Provider) such as DIDww, we set up a rule: a call is first sent to PBX01, and if that server is not accessible the call is forwarded to PBX02, so the call always lands on the right server. Other providers that work by registration receive a new registration, and all calls are sent to the active PBX server.

To fail-recover PBX01:
On the first server, run /root/bin/recoverymysql.sh. This can cause downtime of up to 2-3 minutes due to database synchronization. The main objective of this script is to synchronize all data that was added on PBX02 (slave) while that server was active. It also changes the CNAME of sip.domain.com back to point to PBX01.
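
The article does not say how the CNAME is changed. One possibility, assuming the DNS server accepts dynamic updates, is to feed a short batch to nsupdate. The sketch below only builds that batch. The function name cname_update_batch, the target host name and the key path in the usage note are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: build an nsupdate batch that repoints a CNAME record.
# Assumption: the zone allows dynamic updates; nothing is sent from here.

# cname_update_batch RECORD TARGET [TTL]
# RECORD and TARGET should be fully qualified names ending in a dot.
# TTL defaults to 30 seconds, the value recommended in the notes above.
cname_update_batch() {
    local record=$1 target=$2 ttl=${3:-30}
    printf 'update delete %s CNAME\n' "$record"
    printf 'update add %s %s CNAME %s\n' "$record" "$ttl" "$target"
    printf 'send\n'
}

# Usage (host name and key file are placeholders):
#   cname_update_batch sip.domain.com. pbx01.domain.com. | nsupdate -k /path/to/keyfile
```

The same function serves both directions: fail-over passes the PBX02 name as the target, and fail-recovery passes the PBX01 name.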

4. Conclusion

In conclusion, this approach helps keep critical services such as VoIP, which matter a great deal to many companies, up and running with the minimum admissible downtime. The scheme can be applied in a virtual office environment where offices and PBX services are geographically separated from each other. Downtime can be reduced further by tuning the fail-over system parameters.