Disasters Happen: Is your Business Continuity Plan Ready? – Interview with Mr. Jayesh Shah, MD & CEO, Prism Cybersoft Private Limited
What is BCP and Disaster Recovery?
Institutions such as banks, brokerages, exchanges and depositories face threats to continuity from incidents like fire, flooding, terrorism and other natural and man-made calamities. Business must go on even in the face of such adversity: services need to be rendered and data needs to be protected. A Business Continuity Plan (BCP) and Disaster Recovery (DR) plan is any plan to continue operations if a place of business is affected by disasters like those mentioned above. Such a plan sets out how the business will restart its operations, how quickly, and how it will recover lost data or move all operations to another location. For example, if a fire destroys the building an institution was operating from, how will the institution resume operations from somewhere else with minimal loss of time, effort and continuity? Businesses in developed countries have placed a lot of emphasis on BCP and DR since 9/11. The concept itself is not new. In olden days, kings used to involve their teenage sons in the affairs of the kingdom so that if the king was killed in battle, a son could take over without much loss to the kingdom.
What is the importance of Business Continuity Planning and Disaster Recovery?
Like many other businesses, the financial services business is sensitive. Enormous wealth is made or lost in seconds. If an institution is an intermediary, like a stock broker, it shoulders a responsibility. Its clients, such as investors and traders, carry out transactions worth millions through it. Any disruption in its service could result in losses worth millions to its clients and, in turn, to itself if it is not covered and has not planned to meet such events effectively.
Why should an institution plan for DR?
An institution such as an intermediary executes millions of transactions on a daily basis. Even a brief disruption in service due to natural or man-made events like flood, fire or malicious software could lead to losses for itself and its clients. Apart from monetary loss, the loss of reputation and data could be fatal. Once a client’s trust is lost, it is extremely difficult to regain. BCP and DR are fast becoming a must-have for critical businesses like financial institutions.
Is there any area that is ignored in a DR plan?
It’s a myth that a DR setup is very expensive. Technologies like virtualization and cloud make it easy and cost effective to set up a DR site.
DR is like an emergency service. One prays that it never needs to be activated, but once it is triggered, it needs to work seamlessly, as expected.
The problem with most DR plans is that while a lot of planning and care goes into getting them implemented, virtually no effort goes into testing whether they work as expected. An untested DR facility may fail just when it is actually needed. Ideally, once a quarter, the institution should invoke the DR facility without giving any notice and work on it for one full day to ensure that operations can be run on DR when invoked.
Also, care needs to be taken that the same disaster doesn’t hit both the primary and DR sites. For example, having a DR site in Pune for operations in Mumbai is not a good idea because both Mumbai and Pune fall in the same seismic zone.
Another important point relates to a myth about the source of disruptions. Institutions plan very carefully for natural disasters. However, natural disasters account for only about 3% of disruptions. More than 75% of outages are caused by hardware malfunctions, human error or corrupted software, including computer viruses.
Is there any guideline or regulation for brokerages around the need to have DR?
Yes. The regulator has laid down guidelines for BCP and DR and has provided subsequent guidance. However, most of the guidelines are aimed at exchanges, depositories and clearing houses.
What is the difference between BCP and disaster recovery?
Most people think they are one and the same thing. However, DR is a part of Business Continuity Planning. BCP is a much larger plan that covers failures of systems, processes or people. Recovery from infrastructure and system failure, which is what DR addresses, is one part of it.
How does one need to plan for DR? Is a real time DR needed?
Normally, a Business Impact Analysis (BIA) is conducted. In this, business processes are separated into critical and non-critical. For example, a brokerage must analyze very carefully all the aspects of a transaction value chain. It must also analyze the criticality of each function and how much downtime the business can tolerate for each function. For example, an institutional client’s DMA business could be said to be extremely critical, with zero tolerance for downtime. The same could be said for dealing and real time risk management. Back office operations are important but not mission critical, in the sense that an hour of delay in the back office can be managed and won’t prove to be a show stopper. Once such a criticality map is drawn up, brokerages need to draw up a DR plan accordingly. Since real time DR can be resource hungry, not all parts of the transaction value chain need to go to real time DR. Functions like back office could go into delayed DR, which is available for use but not instantly. An hour or two of delay could be tolerated.
One broker I knew placed their servers in a third party data centre which itself had a strong real time DR facility. Without spending a single additional rupee, the brokerage moved to real time DR.
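The criticality map described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the function names, the downtime figures and the one-minute threshold are assumptions chosen to mirror the examples in the answer (zero tolerance for DMA, dealing and risk management, about an hour for back office).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessFunction:
    name: str
    # Longest outage the business can tolerate, in minutes (hypothetical figures).
    tolerable_downtime_minutes: float


def dr_tier(function: BusinessFunction, realtime_threshold_minutes: float = 1.0) -> str:
    """Assign real time DR to functions that cannot tolerate more than the threshold."""
    if function.tolerable_downtime_minutes <= realtime_threshold_minutes:
        return "real time DR"
    return "delayed DR"


# Illustrative value chain for a brokerage; the numbers are assumptions.
functions = [
    BusinessFunction("institutional DMA", 0),
    BusinessFunction("dealing", 0),
    BusinessFunction("real time risk management", 0),
    BusinessFunction("back office", 60),
]

criticality_map = {f.name: dr_tier(f) for f in functions}

for name, tier in criticality_map.items():
    print(f"{name}: {tier}")
```

Keeping the threshold as a parameter makes the cost trade-off explicit: raising it moves more functions into the cheaper delayed tier.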
To what level is this planning necessary?
The business impact analysis discussed above has to be detailed, and it must finally arrive at two critical figures: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). RPO is the maximum age of data that the business accepts it may not recover, that is, how much recent work can be lost. For example, if there is a fire in the office and the latest backup is from yesterday, today’s work will be lost. In many cases, this will be acceptable. RTO is the defined acceptable time within which all functions must be restored. Suppose a critical function like trading is halted by a server crash and the plan allows 10 minutes for the backup server to start and all trading to migrate to it. In that case, the RTO is 10 minutes. Obviously, the lower the desired RPO and RTO, the more an institution will need to spend to build redundancy. Current guidelines say that exchanges and depositories should have a Recovery Point Objective and Recovery Time Objective of 4 hours and 30 minutes respectively. However, most institutions will try and do better than this.
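To make the two measures concrete, here is a minimal Python sketch that checks one incident against an RPO and an RTO. The timestamps, the one-minute RPO and the function name `check_objectives` are assumptions for illustration; only the 10-minute recovery comes from the example above.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, failure_time, service_restored, rpo, rto):
    """Compare the data loss window and the downtime of one incident with RPO and RTO."""
    data_loss = failure_time - last_backup      # work done since the last usable copy
    downtime = service_restored - failure_time  # time taken to bring the function back
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: the trading server crashes at 11:00, the last
# replicated copy is 30 seconds old, and the backup server takes over at 11:10.
result = check_objectives(
    last_backup=datetime(2024, 1, 15, 10, 59, 30),
    failure_time=datetime(2024, 1, 15, 11, 0, 0),
    service_restored=datetime(2024, 1, 15, 11, 10, 0),
    rpo=timedelta(minutes=1),   # assumed objective
    rto=timedelta(minutes=10),  # objective from the trading example
)
print(result)
```

The same check applied to the office fire example, with a backup from the previous day, would show a data loss window of roughly a day, which passes only if the business has accepted an RPO that large.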
How fast can a DR facility be activated?
It actually depends on the overall business and technical architecture of the DR site. Activation can take anywhere from a few milliseconds to a few minutes, depending on a variety of factors like hardware redundancy, bandwidth and scalability of the DR site.
One popular exchange has its DR facility in Chennai. From time to time, it tests this facility by switching off the Mumbai facility during live market hours. All traders are then shifted to the Chennai-based DR site for subsequent trading within milliseconds, and the shift is so seamless that traders don’t even notice.
Fast activation naturally needs more money in terms of hardware and support.
What is the state of BCP and DR in Indian Capital Markets?
Readiness on BCP and DR today varies from one institution to another and typically depends on the IT sophistication of each institution. However, one common theme that cuts across all institutions is that significantly more planning and investment is needed, especially in the bottom three quartiles of institutions.
Do you think Financial Institutions must step up their efforts in this area?
Yes, of course. Much better understanding and more financial investment are needed. Institutions don’t lack the money to put such processes and infrastructure in place. Most have the resources and do millions worth of transactions on a daily basis. What they lack is the IT awareness and expertise to put this in place.
Are there any people issues that one needs to keep in mind?
Yes. It is important to keep a couple of very high quality people at the BCP or DR site too. They can take over operations when needed if the institution’s regular staff fail to reach the office. If done intelligently, a lot of cost optimization can also be achieved. For example, if a flood prevents Mumbai staff from reaching the office, the institution can have a simple failover plan to start the DR server and a mechanism for critical people to connect their home PCs to this server. Hardware redundancy has to be backed up by a working plan, and there should be people to run operations as well as to enable the technology. If a couple of trained people are not there to manage the DR setup and the main office staff are stranded, applications will start but there won’t be anyone to run them.
Is it all about Hardware Redundancy and Process Planning?
No, a detailed Threat and Risk Analysis needs to be conducted, in which the institution properly analyzes every potential threat it faces, like earthquake, fire, electricity outage, flood and cyber attacks such as viruses. Many threats are purely human and need solutions involving people rather than hardware. If the institution outsources some of its processes, it must ensure that the vendor doing the outsourced job also has a proper BCP and DR in place, or else the vendor may prove to be a weak link.
Institutions must realize that BCP and DR are no longer just a requirement. They are a necessity. They are like insurance for your business. An institution doesn’t realize the impact of not having them until disaster strikes.