Systemic approach to business continuity & disaster recovery


This document outlines a possible reference scenario for introducing operational continuity architectures (Business Continuity and Disaster Recovery).

The growing availability of remote access to services has, over time, increased the amount of information handled and stored in the datacenter, and with it management's sensitivity to keeping this information highly available and protecting it from possible loss of information or of operability.

The distinction between Business Continuity and Disaster Recovery lies first of all in the type of service to be offered, before any consideration of technology or connectivity.

The first question to ask when choosing one of the two alternatives, or leaning toward both, is the following:

Does the system of services we want to safeguard cease to exist together with the site on which those services reside?

If the answer is yes, the type of operational continuity to undertake is Business Continuity. If the answer is no, operational continuity will require a site geographically distant from the main one, able to guarantee service delivery in the event of disasters or events affecting the region of the main site. In the latter case we speak of Disaster Recovery.

To give an example, consider an emergency rescue service and an international courier service. If an event such as a flood or an earthquake made the emergency department itself physically unavailable to healthcare workers, users would turn to another emergency department. In this case it would make little sense to reactivate the department's IT services at a geographically distant site, whereas good local survivability of those services (Business Continuity) would ensure operational continuity even in the absence of connectivity, provided at least one datacenter within the hospital complex remained usable. A backup of the data at a different site is, of course, still essential.

In the second case, that of the courier company, a problem with the main site that made the remote services hosted at headquarters unavailable must not disrupt the branch offices, warehouses and logistics in general, which absolutely must continue to operate independently of the main site. In situations like this a Disaster Recovery site is indispensable for the continuity of services.

Only after answering this question can we turn to RTO and RPO, latency, throughput, bandwidth and everything else needed to realize operational continuity.
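As a purely illustrative aside, the Python sketch below (all timestamps and target values are invented for the example) shows how RPO and RTO are measured against an outage timeline: the RPO is bounded by the replication lag at the moment of failure, the RTO by the time needed to deliver services again.

```python
from datetime import datetime, timedelta

# Hypothetical outage timeline (all timestamps invented for illustration).
last_replicated_point = datetime(2024, 1, 10, 3, 55)   # last consistent copy on the secondary site
failure_time          = datetime(2024, 1, 10, 4, 0)    # primary site becomes unavailable
service_restored_time = datetime(2024, 1, 10, 6, 30)   # services delivered again elsewhere

# RPO: how much data (expressed as time) can be lost; bounded by replication lag.
rpo_actual = failure_time - last_replicated_point
# RTO: how long the service may stay unavailable.
rto_actual = service_restored_time - failure_time

# Targets agreed with the business; the measured values must stay within them.
rpo_target = timedelta(minutes=15)
rto_target = timedelta(hours=4)

print(f"RPO: {rpo_actual} (target {rpo_target}) -> {'OK' if rpo_actual <= rpo_target else 'MISSED'}")
print(f"RTO: {rto_actual} (target {rto_target}) -> {'OK' if rto_actual <= rto_target else 'MISSED'}")
```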

Acronyms and Definitions

BC – Business Continuity

DR – Disaster Recovery

CLI – Command Line Interface

FDR – Fast Disaster Recovery

SAN – Storage Area Network

SDR – Site Disaster Recovery

SS-A – Semi-site A

SS-B – Semi-site B

SPOF – Single Point of Failure

RSS – Redundant Semi-Site

SSR – Semi-sites redundancy

SSO – Single Sign On

RTO – Recovery Time Objective

RPO – Recovery Point Objective

Objectives of the document

The objective of this document is to provide guidelines for those approaching operational continuity projects (Business Continuity and/or Disaster Recovery).

A BC or DR project must be carried out with minimal impact on operations, in order to ensure continuity of service delivery.

The project envisages the following macro-steps:

  1. Categorization of services and identification of mission-critical services
  2. Data categorization
  3. Preparation of the basic storage and Storage Area Network (SAN) services
  4. Securing of services and operations (infrastructure splitting)
  5. Testing
  6. Transition from project to operational and evolutionary management

Reference Documentation

The reference documentation consists of the following publications, which can also serve as a basis for activities outside the public sector.

Http://www.digitpa.gov.it/fruibilita-del-dato/continuita-operativa

Guidelines FOR DISASTER RECOVERY OF PA_0.pdf

Circular_1_December_2011_n58.pdf

Orlandi_lg_x_14_3_2013_v7.pdf

Rellini_14 March_v.1.0.pdf

Self Assessment.ods

Self Assessment.xls

LBL®Application Availability Infrastructure

LBL_BusinessContinuityAndDisasterRecovery.pdf

The reference context

The context of this document originates from the directives contained in DigitPA's GUIDELINES FOR DISASTER RECOVERY OF THE PUBLIC ADMINISTRATIONS, published in their latest revision in January 2012. These directives can also be applied to activities outside the public sector, since they are general guidelines on data retention and continuity of services.

The centralization of remote services and the digitization/dematerialization of documents require measures that protect the information assets maintained and constantly updated inside the datacenter.

The security of the information and of service operation must have three fundamental characteristics:

1 – preservation of the data

2 – continuity in the delivery of critical services

3 – security of access to services and to sensitive data

Another very important factor is the evolution of systems over time. While the design must strive to attain the objectives described in the three previous points, operability and evolutionary management must be considered at the architectural stage, so as not to create constraints on future development or additional economic burdens.

The existing architectural situation

The evolution of the datacenter has seen the services offered grow in quality and quantity, in a natural process of centralization of computing resources accelerated in recent years by the growth and spread of connectivity.

The services provided by the datacenter have evolved over the years into modular infrastructures in which multiple applications work together to deliver the final service offered to the customer.

In most cases a common deployment model can still be found. The distribution of the components can be divided as follows:

– Delivery points;

– Application servers;

– Database servers;

– Directory servers;

– Storage Area Network.

Below is an image that visually summarizes the layers that service requests, represented by colored ribbons, must cross in order to be satisfied.

The multi-level structure (AppServer, Database, Storage Area Network) was introduced to ensure scalability as the “public” these services address grows. At the same time it effectively reduces single points of failure by allowing local duplication of functions.

It is clear, however, that such a concentration of technology in a single place can itself become a single point of failure for localized but generalized events such as blackouts, earthquakes, floods, fires, vandalism, etc.

In this respect, architectural interventions should ensure, at the various levels, the preservation of information and, where necessary, go as far as guaranteeing the continuity of delivery of critical services.

Continuity of Operations: Scenario

Preserving information and operation has always been an important topic in automated information processing.

The starting point of any high-reliability project is certainly the elimination of single points of failure (SPOF).

The early nineties saw the first commercial failover clusters, designed essentially around a two-level structure: Application+Database; Storage.

A limited group of users accessed a departmental server that contained both the application components and the database components. The departmental server had a homologous redundant server, and both were connected to a single storage system.

Operating server (Master)

Fail-over server (Sleeping Master)

The system was designed to guarantee exclusive access to the data and to avoid split-brain phenomena and the consequent logical corruption of data. The strategy for preserving the data was therefore based on the resilience of the storage layer and, as a last resort, on backup copies in the event of data corruption. In no case could the two departmental servers access the storage concurrently.

With the advent of the Internet, the reference architectural scenario changed profoundly. The delivery of services to large numbers of users, in different formats of use, drove the evolution of the technological infrastructure into more specialized levels:

The evolution of the infrastructure into distinct levels was accompanied by the evolution of virtualization technologies. Likewise, mass storage systems have seen a huge increase in capacity and access speed along with richer functionality: synchronous and asynchronous replication, snapshots, and so on.

All these technologies were, and still are, evolving rapidly, each delivering its best in terms of efficiency and effectiveness and providing tools that until recently were inconceivable in terms of cost, density and speed.

Today it is possible to distribute services across different locations and to induce high reliability and continuity of service beyond the single point of failure that the datacenter, located in a single place, has now become.

Infrastructures have thus been conceived and realized that can ensure data persistence and continuity of service through duplication on adjacent sites.

If we approached a project of this type starting from a failover cluster as conceived at the beginning of the nineties, the transformation of a single-site infrastructure into a double-site one would seem at first sight very simple. Having a cluster composed of two systems with duplicated services, and using the new storage replication technologies, it would seem sufficient to move one of the two nodes to the new site and start replication to obtain redundancy across the two sites.


In reality, the result falls short of expectations. In the event of a “disaster” at the main site, this architecture does protect the data and the physical operation; on the other hand, it does not protect the data in more mundane cases such as a temporary loss of connectivity between the two sites, a case which is, moreover, by far the most frequent.

In the latter case, with services distributed over multiple locations and no longer able to communicate with each other, whoever is primary will tend to remain primary, while whoever was secondary, in replication, will tend to become primary. The two sites will then populate their databases in parallel, making them logically inconsistent.


When connectivity is restored there will also be serious communication problems. The clusters built so far rely on virtual addresses, and when connectivity between the two sites is re-established those addresses will enter into contention, further increasing the confusion and inconsistency. The result is not predictable.

This scenario, applied to current architectures, would lead in a very short time to a logical inconsistency of the data that can no longer be reconciled.

SITES SPLIT BRAIN

Any intervention at the individual levels (storage, database, application server, virtualization) would be inadequate because partial. Nor does the use of quorum systems guarantee certain results: multiplying quorum systems for every single service would quickly make the whole unsustainable, besides being useless, since it is not known a priori at which level the problem (split-brain) will occur.

This changed scenario therefore requires a radical change in the approach to high reliability in Business Continuity.

In these circumstances the problem is easier to solve if approached from the point of view of the datacenter “architecture”: not by intervening on the individual components, but at the point where all services are requested. These access points can take on the role of delivery coordinators, blocking access to the entire application stack when consistency is ambiguous and thus avoiding logical split-brain safely and simply.

This “block” of access alone to the datacenter's services is very important because it does not in any way alter the activity state of the datacenter: all backend layers remain active and ready to deliver their services in the unfortunate case in which theirs turns out to be the only surviving site.

The reactivation of connectivity, and therefore of access to services, should in this case be entrusted to a human decision. Reactivating access must be very simple, manageable even by non-specialized operators, given the moment of crisis and/or the difficulty of quickly finding suitable staff.
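A minimal sketch of this approach, assuming a hypothetical `peer_site_reachable()` probe and a manual override flag set by the operator; it is illustrative logic only, not the behaviour of any specific product:

```python
class AccessGate:
    """Gate placed in front of the whole application stack of one semi-site.

    When consistency between the two semi-sites is ambiguous (the peer cannot
    be reached and no human decision has been recorded), client access is
    blocked; the backend layers are never touched and stay active.
    """

    def __init__(self) -> None:
        self.manual_override = False      # set by a human on the surviving site

    def peer_site_reachable(self) -> bool:
        # Placeholder probe: a real deployment would check the inter-site
        # links and the health-check endpoints of the other semi-site.
        return False

    def allow_client_traffic(self) -> bool:
        if self.peer_site_reachable():
            return True                   # both semi-sites visible: normal operation
        # Peer unreachable: the state is ambiguous, so block access to avoid a
        # logical split-brain. Nothing in the backend is stopped.
        return self.manual_override       # a human re-opens access on the surviving site


gate = AccessGate()
if not gate.allow_client_traffic():
    print("Access to services blocked: consistency ambiguous, waiting for a human decision")
gate.manual_override = True               # the operator's decision, simulated for the example
if gate.allow_client_traffic():
    print("Access re-opened toward the backend layers, which were never stopped")
```

The key design choice is that the gate blocks only client access: the backend layers are never stopped, so reopening service is a single, simple action.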

Human intervention on the surviving site, with reactivation of access to services…


HUMAN OR AUTOMATIC DECISIONS

For both Business Continuity and Disaster Recovery, decision-making can be either human or automatic. The issue is therefore whether a system may take the decision to move services from one site to another, be it BC or DR.

It is generally accepted that BC relies on automatic decision systems, while for DR the decision is entrusted to a human, usually the DR committee.

In reality the type of decision does not depend on BC or DR as such (see the introduction on choosing the form of continuity on the basis of service delivery), but on the criticality of the service and on the possibility of keeping it staffed.

For both BC and DR, if the fail-over decision must be automatic, the split-brain problem has to be tackled far more thoroughly than when the decision is entrusted to a human, although it must be addressed in that case too.

SPLIT BRAIN & QUORUMS

When we speak of automatic decisions, split-brain inevitably comes up, and with it, normally, the notion of quorum. In a stretched cluster, regardless of the type of cluster (storage, database, operating system, balancing systems, etc.), the “quorum node” identifies the object that, in case of failure, determines the automatic recovery action. It is generally accepted that the quorum node is indispensable in a stretched architecture, but its actual use deserves careful evaluation.

 


Let us take any architectural layer of our stretched infrastructure (stretched cluster) and reduce it to three elements: the primary node, the secondary node (in replication, whether in a split/mirrored or primary-primary configuration) and the quorum node.

 

These three elements are used to ensure high reliability: if the primary node is no longer available, an automatic switch promotes the secondary system in replication to the primary state.

 


In this case the promotion of the replicating node to “primary node” allows services to be delivered continuously, and for this reason it appears to be sufficient.

 

We normally stop at this consideration, but whether this scenario is “always” true deserves a closer look, considering that we are talking about critical services and a substantial investment in both acquisition and implementation.

 


If we consider the same scenario with a different failure from the one just described, this architecture presents a “paradox” that in complex, stretched environments can lead to total disservice, thus invalidating the very characteristic of high reliability and the costs incurred. The scenario, which unfortunately has roughly a 50% probability, is the one in which the secondary node, together with visibility of the quorum, is no longer reachable from the primary node. In this case the primary node will go into “STOP” to avoid split-brain, no longer having any certainty about the state of the secondary site.

 

Applied, for example, to a storage system (but it could equally be a DB), this behavior drives the only surviving site into a generalized failure by blocking writes to the volumes, consequently causing failures in databases, applications and batch processing. The staff will then be forced into the extremely complicated restart of an entire site… which in this case is also the only surviving one.
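To make the “paradox” concrete, the sketch below enumerates what the primary node of one stretched layer (storage, DB, or anything else) does in each visibility scenario; it is illustrative logic, not a specific vendor's cluster implementation. The last case is the one described above: the primary stops even though it may be the only surviving site.

```python
def primary_action(sees_secondary: bool, sees_quorum: bool) -> str:
    """Decision taken by the primary node of one stretched layer (storage, DB, ...)."""
    if sees_secondary and sees_quorum:
        return "keep serving (normal operation)"
    if not sees_secondary and sees_quorum:
        return "keep serving (secondary lost, but the quorum rules out a promotion)"
    if sees_secondary and not sees_quorum:
        return "keep serving (quorum lost, replica still aligned)"
    # Neither the secondary nor the quorum is visible: the primary cannot rule
    # out that the secondary has been promoted, so it STOPS to avoid split-brain,
    # even when it is in fact the only surviving site.
    return "STOP all writes (possible split-brain)"


for sees_secondary in (True, False):
    for sees_quorum in (True, False):
        print(f"secondary visible={sees_secondary!s:5}  quorum visible={sees_quorum!s:5}  -> "
              f"{primary_action(sees_secondary, sees_quorum)}")
```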


 

To paraphrase the famous film “WarGames” and its game of Tic-Tac-Toe: there is no winning solution if the problem is addressed layer by layer.

 

Implementing fail-over infrastructure at the level of each individual layer therefore involves considerable problems: an exponential increase of quorum nodes, one for each layer and each service, and in general a widespread complexity that is very often the cause of the very problems it should solve.

 


Let us now try to move the management of high reliability to a higher level. If we move the problem upwards, to the point where service delivery requests are received, the problem of the stop still exists, but its only consequence is stopping access to the services. Reactivation is simple and immediate, since it only requires reopening communication toward the backend.

 


LBL®A.A.I., by controlling the network and access to services, was designed for stretched-cluster environments (Business Continuity & Disaster Recovery) and is today the only integrated cross-platform, cross-virtualization tool that solves this problem with simplicity. The powerful LBL®ADC, integrated with the geographical high-reliability components of LBL®Surface Cluster (Decision Engine and Work Flow), makes it possible to orchestrate fail-over in a simple, efficient and traceable way.

 

Quorum Games© was published through the newsletter on 25 June 2013, all rights reserved.

Simplification

From what has been described so far, it is evident that a high-reliability infrastructure must have as its main objective the simplification of the processes that determine it.

The need to divide individual services into layers that contribute, each in a specialized way, to the delivery of the final result, together with the diversity of platforms and the multiple releases of the individual components, requires the introduction of an element able to orchestrate fail-over and restore activities in a coordinated, repeatable and above all self-documenting manner.

To understand the need for an instrument that coordinates activities within the datacenter, let us break down the actions to be performed to carry out a fail-over in a distributed environment following a failure of the primary semi-site:

Example of the activities to be performed on the secondary semi-site to deliver the services

  1. Promotion to primary of the replicating storage for unstructured data
  2. Promotion to primary of the database replicas (where present)
  3. Start/mount of the volumes of the physical/virtual database servers
  4. Start of the application servers
  5. Start of the virtualized guests
  6. Change of addresses where necessary
  7. Start of service delivery

These operations must be performed in a coordinated sequence to ensure the final result, i.e. the delivery of the application services. Executing them in an uncoordinated manner would jeopardize the final outcome. A sketch of such a sequence is shown below.
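A minimal sketch of such a coordinated sequence, assuming placeholder commands (`echo ...`) in place of the real storage, database, hypervisor and ADC commands; the point is only that the steps run in a fixed order and any failure aborts the remaining ones.

```python
import subprocess
from typing import List, Tuple

def run(description: str, command: List[str]) -> None:
    """Run one step of the fail-over sequence; any failure aborts the rest."""
    print(f"-> {description}")
    subprocess.run(command, check=True)   # raises CalledProcessError on failure

# Ordered fail-over sequence for the secondary semi-site. Every command is a
# placeholder (echo): the real ones depend on the storage, database,
# hypervisor and ADC actually in use.
FAILOVER_SEQUENCE: List[Tuple[str, List[str]]] = [
    ("Promote the replicating storage (unstructured data)", ["echo", "storage promote-replica"]),
    ("Promote the database replicas",                        ["echo", "db promote-standby"]),
    ("Mount DB volumes and start the DB servers",            ["echo", "db start"]),
    ("Start the application servers",                        ["echo", "appserver start"]),
    ("Start the virtualized guests",                         ["echo", "hypervisor start-guests"]),
    ("Change addresses / DNS where necessary",               ["echo", "dns update"]),
    ("Re-open service delivery at the access layer",         ["echo", "adc open-traffic"]),
]

for description, command in FAILOVER_SEQUENCE:
    run(description, command)
print("Semi-site B now delivers the services as primary")
```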

Another factor to evaluate is the lack of homogeneity of the individual elements that make up the services: different vendors, different operating systems and releases, different virtualization systems, different physical systems.

The coordinator must stand above the parts and be as independent as possible of the platforms and of their releases.

LBL®COMMANDER ARCHITECTURE

LBL®Commander was designed and built to address the new needs emerging with stretched-cluster architectures (operational continuity).

The need for high reliability and for the removal of the single point of failure (SPOF) that the datacenter itself represents today has led to a redesign of the overall strategy for inducing high application reliability.

A modern high-reliability system must respond to the needs set out in the preceding paragraphs, safeguarding the three fundamental characteristics listed above.

What does a failover cluster do?

  1. It constantly checks proper functioning
  2. It takes the decisions
  3. It performs a series of actions to return to state 1.

Since current architectures are structured on several levels, the high-reliability service itself must be rethought from a distributed perspective.

To answer points 1, 2 and 3 in today's distributed scenarios, with their split-brain phenomena and logical data corruption, it is unfeasible to equip each individual component with its own fail-over system: this would soon lead to fragility, loss of control and excessive, costly redundancy of hardware and software.

It is therefore necessary to reduce the number of control points and adopt an instrument able to orchestrate and automatically distribute the actions to be undertaken at the various levels to guarantee restore operations in the event of a failure.

The three features of a cluster have therefore been condensed into two software tools capable of concentrating control activities while simplifying management.

How does LBL®Commander work?

  1. It constantly checks proper functioning
  2. It takes the decisions
  3. It performs a series of actions to return to state 1.

The two components take the name of:

LBL®Commander Decision Engine

LBL®Commander Work Flow

LBL®Commander centralizes the control points and the operation of services, simplifying management tasks for the people responsible for supervision.


As the number of services to keep highly reliable increases, the verification points remain unchanged in number and location, simplifying control and configuration for the people responsible for management.

Service A – Service B

In this example scenario an additional service, Service B, has been introduced. By centralizing the management of high reliability, the control points do not increase.

A fundamental role in this environment is played by LBL®ADC, which has the ability to route service requests to where they can and must be delivered. LBL®ADC constantly collaborates with LBL®Commander Decision Engine in determining the routing paths.

LBL®Commander self-documenting

In such articulated and dynamic environments, any operational documentation of the real environment “written up by hand” immediately becomes obsolete. The datacenter is a dynamic environment in continuous evolution. Any high-reliability instrument needs to certify its own operation in order to be ready for critical events. Keeping an environment that is in constant evolution “certified” means performing ongoing tests with methods that document the actual correspondence between the procedures and the implemented reality.

With LBL®Commander, the description of the high-reliability processes is embedded directly in the very processes that determine high reliability.

The breakdown of high-reliability processes into logic and actions, and the resulting simple description of the procedures in workflows, self-documents the actions that induce high reliability across the entire datacenter:

Centralized collection of coordinated actions within the datacenter

The ability to navigate through the actions that will be taken in the event of a failure, a critical moment or a simple routine operation makes the processes that take place inside the datacenter more trustworthy, because they are controllable and self-documenting.

Contextual navigation within the actions, with drill-down verification

List of the steps of a workflow

View of a single action step

The evolution of the project: SSR / RSS

The development of an operational continuity project applied to an existing datacenter involves several implementation stages. The design must take into account not only delivery at time T0, but also subsequent management and evolution.

The first question to answer is whether we want to realize operational continuity with redundancy within each single semi-site (redundant semi-site) or with redundancy given by the combination of the two semi-sites (semi-sites redundancy). This difference is very important in terms of cost: with a redundant semi-site (RSS) solution the infrastructure is made redundant within each individual semi-site, up to total duplication of the components, while with semi-sites redundancy (SSR) redundancy is reached through the sum of at least two semi-sites:

SEMI-SITES REDUNDANCY (SSR)

REDUNDANT SEMI-SITE (RSS)

The substantial difference is that the SSR solution does not tolerate a double failure, while the RSS solution, in case of unavailability of an entire site, can also withstand further local failures. In other words, the two solutions differ in the resilience of the single site in the event of a crash of the homologous semi-site.

Although the choice between SSR and RSS is very significant in terms of cost, from the implementation standpoint it does not present substantial differences.

The storage for both solutions is represented with a single symbol because practically all storage on the market provides redundant data persistence through RAID technologies. The only feature to check is the duplication of the control heads (RAID controllers) within each individual semi-site: in both solutions (SSR or RSS) it is advisable that they be duplicated. Solutions with a single storage control “head” in each individual semi-site are not suitable.

Recommended

Not recommended

The evolution of the project: CATEGORIZATION OF SERVICES

Once the resilience (redundancy) to be obtained within the single site has been established, attention must turn to cataloguing the services to be placed in operational continuity.

The cataloguing activity can be carried out following the self-assessment tables drawn up by DigitPA, which can be taken as a reference in the private sector as well.

The cataloguing must be done per individual service and must include the dependency hierarchies of the individual components, e.g.:

Service 1
  Authoritative DNS
    Address xx.xx.xx.xx – Physical Server ZZZZZ
    Address xx.xx.xx.xx – Physical Server ZZZZZ
  LBL®ADC
    Address xx.xx.xx.xx – Physical Server XXXX
    Address xx.xx.xx.xx – Physical Server YYYY
  App Server
    Address xx.xx.xx.xx – JBOSS on VMware
    Address xx.xx.xx.xx – JBOSS on VMware
  DB Server
    Address yy.yy.yy.yy – Oracle Enterprise
    SAN BBBB
    SAN CCCC
    Storage Kkkk

Service 2
  Authoritative DNS
    Address xx.xx.xx.xx – Physical Server ZZZZZ
    Address xx.xx.xx.xx – Physical Server ZZZZZ
  LBL®ADC
    Address xx.xx.xx.xx – Physical Server XXXX
    Address xx.xx.xx.xx – Physical Server YYYY
  …

The categorization must highlight both the “physical” elements and the logical elements such as addressing and/or virtualization. Already at this stage it is possible to identify, for each single service, the dependencies on the data bases on which it rests. Data bases are to be understood as including any directory services, Global File Systems and NAS.

Service dependencies also include dependencies on Single Sign On (SSO) systems.
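A minimal sketch of how such a catalogue can be kept in machine-readable form, using the placeholder names and addresses from the example above; the structure, not the field names, is what matters here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    role: str        # e.g. "Authoritative DNS", "ADC", "App Server", "DB Server"
    address: str     # placeholder addresses, as in the tables above
    host: str        # physical server, virtual guest or storage identifier

@dataclass
class Service:
    name: str
    components: List[Component] = field(default_factory=list)
    data_dependencies: List[str] = field(default_factory=list)  # DBs, directory services, GFS, NAS
    sso_dependencies: List[str] = field(default_factory=list)   # Single Sign On systems

service1 = Service(
    name="Service 1",
    components=[
        Component("Authoritative DNS",  "xx.xx.xx.xx", "Physical Server ZZZZZ"),
        Component("ADC",                "xx.xx.xx.xx", "Physical Server XXXX"),
        Component("App Server (JBOSS)", "xx.xx.xx.xx", "VMware guest"),
        Component("DB Server (Oracle)", "yy.yy.yy.yy", "SAN BBBB / Storage Kkkk"),
    ],
    data_dependencies=["Oracle Enterprise on Storage Kkkk"],
    sso_dependencies=["corporate SSO (hypothetical)"],
)

# The same structure is repeated for Service 2, Service 3, ... and becomes the
# input of the data categorization and replication planning that follow.
print(f"{service1.name}: {len(service1.components)} components, "
      f"data dependencies: {service1.data_dependencies}")
```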

The evolution of the project: Data Categorization

Data categorization is a direct consequence of the categorization of services. It is very important because the first phase of the project will be the replication, in various modes, of the data bases.

Data bases use different forms of data storage, which can basically be divided into:

  1. Unstructured
  2. Structured

Unstructured data bases are all those that contain files or documents not traceable to an abstract data management model: Word documents, text files, PDFs, images, HTML, etc.

Structured data bases are all those that have an abstract data management model, such as relational DBs and Directory Servers.

For the duplication of unstructured data bases, storage replication technologies will be used, synchronously or asynchronously depending on the capacity of the connection between the two sites and on its latency.

For structured data bases, replication can occur in different ways:

A) through the replication tools of the individual products (e.g. Oracle Data Guard, replication of Active Directory or Directory Server repositories, MySQL, PostgreSQL, …)

B) through synchronous storage replication

C) through synchronous file system replication

In a Business Continuity context, synchronous replication techniques are preferable for both forms of data base, while for Disaster Recovery the limit is usually connectivity and latency, so asynchronous replication should be chosen. They are preferable but not mandatory, because some type (A) replication mechanisms are asynchronous by nature (see the replication of directory server repositories), while for type (B) and (C) replicas the synchronous mode should be considered where conditions allow it. The sketch below summarizes this choice.
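The guideline just described can be condensed into a small decision helper; an illustrative sketch, assuming only the two data categories and the two continuity scenarios introduced in this document.

```python
def replication_mode(data_category: str, scenario: str) -> str:
    """Suggest a replication approach for a data base.

    data_category: "unstructured" or "structured"
    scenario:      "BC" (adjacent semi-sites, low latency) or "DR" (remote site)

    The mapping follows the guideline above: synchronous replication is
    preferred for Business Continuity where bandwidth and latency allow it,
    asynchronous replication is the usual choice for Disaster Recovery, and
    product-level replication (Data Guard, directory replication, ...) may be
    asynchronous by nature regardless of the scenario.
    """
    if scenario == "DR":
        return "asynchronous replication (storage, file system or product-level)"
    if data_category == "unstructured":
        return "synchronous storage replication between the two semi-sites"
    return ("synchronous storage/file-system replication, or product-level "
            "replication (e.g. Oracle Data Guard) accepting its own semantics")


for category in ("unstructured", "structured"):
    for scenario in ("BC", "DR"):
        print(f"{category:12} / {scenario}: {replication_mode(category, scenario)}")
```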

Direct and indirect costs of replication:

When designing the replication systems, the costs of the choices being made must be assessed. In a campus context such as the one considered in this document, direct costs comprise the costs of acquiring hardware/software licenses with their maintenance. Indirect costs must account for the cost of managing the replication techniques.

As an example of how an architectural choice can affect direct costs, consider the replication of Oracle data bases. Two scenarios can be imagined: a first one in which Oracle Data Guard carries out the replication between the two semi-sites, and a second one in which the storage, in synchronous mode, carries out the data replication.

In the first case, since both semi-sites have active Oracle instances, licenses must be counted for both semi-sites, and both in the Enterprise edition with the related maintenance.

In the storage replication case, the Oracle licenses must be counted in Standard edition for a single semi-site (if the configuration does not exceed the limits of the Oracle Standard license), together with the synchronous replication licenses for the storage components. Obviously the maintenance of both products must also be counted.

The indirect costs of these two options mainly concern operator training: in the first case operators must be trained both on Oracle replication techniques and on storage replication techniques, in the second case only on the storage component. Training must be provided both to the personnel who will perform installations and configurations and, above all, to those who will have to manage the infrastructure.

For the LBL®Commander components, the choice of one architecture over the other is an invariant. In the first case LBL®Commander will have to reverse the Data Guard flow and make semi-site B primary; in the second case it will reverse the storage replicas and activate the stopped Oracle instance, making it operational.

With the advent of virtualization, replication systems are quickly moving toward replication of the guest operating system images, which provides a flexibility not possible in physical environments and allows the replicating sites to host different systems.

In this case too, for the LBL®Commander components the choice of one architecture over the other is an invariant. LBL®Commander will manage the coordinated startup and configuration of the replicated infrastructure during a crisis event. LBL®Commander contains a series of templates for the most popular virtualization environments, as well as templates for changing addresses, DNS zones, database activations…

The evolution of the project: SAN & Storage

Once the most suitable architectural choices have been made in terms of cost/benefit and of the high-reliability objectives to be reached, the first step in preparing the operational continuity system is establishing connectivity between the two semi-sites.

Immediately after arranging the connectivity, the next step is setting up the replicas: storage to storage, database to database, virtual guest to virtual guest, on the basis of the Services–Data association table prepared in advance.

It is advisable to run the first replication tests on an application set created specifically to verify behavior in the various usage situations. In particular, the replicas must first be set up with the tools made available by the platform, and then tests of interruption, resumption and inversion of the replicas, etc., can be carried out. It is important to check the caching capacity of the storage or replication systems in the absence of connectivity and to set, where possible, adequate log capacity before a total reconstruction becomes necessary.

Once the capabilities and behavior in the various realignment situations, due to temporary and/or prolonged disconnections between the two devices, have been determined, functional testing can proceed through the command line interface (CLI) that, with the advent of the cloud, all vendors make available.

To perform meaningful tests it is advisable to provide a volume similar in size to production, so that actual recovery times can be recorded under the worst conditions. If the platform allows it, also depending on the licenses acquired, take a snapshot of the volume to be replicated, make it readable and writable, and then set up replication of the volume obtained from the snapshot. The sketch after this paragraph outlines such a test sequence.
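A sketch of this test sequence, with every storage CLI call replaced by an echoed placeholder, since the real verbs and flags depend entirely on the storage platform in use and must be taken from its documentation.

```python
import subprocess

def storage(*args: str) -> None:
    """Placeholder wrapper around the storage platform's CLI.

    Every command is only echoed: the real verbs and flags differ from vendor
    to vendor and must be taken from the platform documentation.
    """
    subprocess.run(["echo", "storage-cli", *args], check=True)

# 1. Build a production-sized test volume from a snapshot, exposed read/write,
#    so recovery times can be measured without touching production.
storage("snapshot", "create", "vol_production_like", "--name", "bc_test_snap")
storage("snapshot", "expose", "bc_test_snap", "--access", "read-write", "--as", "vol_bc_test")

# 2. Set up replication of the test volume toward the other semi-site and wait for sync.
storage("replicate", "create", "vol_bc_test", "--target-site", "semi-site-B", "--mode", "sync")
storage("replicate", "wait-sync", "vol_bc_test")

# 3. Exercise the failure cases described above: interruption, resumption and
#    inversion of the replica, checking whether resume is a catch-up or a full rebuild.
storage("replicate", "pause",   "vol_bc_test")    # simulates loss of inter-site connectivity
storage("replicate", "resume",  "vol_bc_test")
storage("replicate", "reverse", "vol_bc_test")    # semi-site B becomes the replication source
```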

With some high-end storage systems it is possible to set up synchronous replication and simultaneously mount both volumes (primary and replica) for reading and writing. This option, usable only with global file system architectures and with applications that support it (e.g. Oracle RAC), will not be analyzed in this document and is left to the detailed implementation study.

Below is the diagram of the simultaneous mount of the replicating volume, normally possible with high-end storage. Note that the write flow is still handled by a single storage “controller”.

For the other tests to perform, refer to the detailed implementation study.

The evolution of the project: fixed addressing and DHCP

The addressing of services is a very serious problem, since replicated systems, especially in the Disaster Recovery case, must have differentiated routes. The technique commonly adopted in datacenters to date is to give services fixed addresses, while DHCP is normally used for clients. The extensive use of dynamic addressing (DHCP) for server services as well provides a flexibility that, in these cases, is worth considering for future implementations. This technique, already used for cloud services, is also the way to start using IPv6 addressing inside the datacenter, which, given its scarcely “human” representation, must be used through symbolic names to gain wide adoption.

In any case, the situation to aim for is the isolation of the application contexts, obtained via VLANs and VPNs, which is necessary to minimize interventions on addressing and to leave the freedom to carry out, for both BC and DR, tests that are non-invasive from an application point of view.

Referring the details of this important consideration to the detailed analysis, below is an example of encapsulated addressing that makes this objective possible.

In this case LBL®ADC decouples the incoming requests from the application requests, allowing non-invasive testing of the BC/DR site. The topic of global routing is referred to another document.

The evolution of the Project: Infrastructure splitting

Once the storage replication/reconciliation technologies have been tried and tested, the infrastructure can be split, identifying the following types of services:

Type – Layer – Infrastructure elements

Services in load balancing
  – AppServer: WebServer, Application Server, Exchange, SAS

Services in fail-over
  – Data: MS SQL Server, ORACLE, MySQL
  – AppServer: SAP Production
  – Routing: Router, Firewall, LBL®ADC

Services with autonomous synchronous replication
  – Data: Oracle Data Guard, PostgreSQL

Services with autonomous asynchronous replication
  – Data: typically Directory Server services
  – Routing: DNS Server

The splitting step, from one site into two semi-sites, is important and critical because it must be performed while causing the least possible disruption to production activities.

In this case, by way of example, we shall consider an SSR (semi-sites redundancy) architecture, thus with a shift of the current redundancy of the existing site (semi-site A) to the new site (semi-site B). The redundancy of services will therefore be the result of the sum of both semi-sites (semi-site A + semi-site B).

The first elements to be separated in order to operate across two semi-sites will be those defined in the table as Routing. The result to obtain, at first, is behavior similar to the current production one, with only the Routing services split across the two datacenters.

To obtain this result, the redundant systems must be relocated (physically, or logically in the case of virtualization): Router, Firewall, LBL®ADC. These systems must maintain local boot autonomy, either from their own internal disks or from a storage area network local to semi-site B.

The splitting of the Routing services is shown schematically below:


This step is essential in order to split the application layer with minimal impact on operations. This splitting activity should have zero impact, since it moves passive components. Obviously a detailed implementation study is assumed.

In this preliminary document we will now examine the splitting of one service, leaving the description of the splitting steps of the individual services to the detailed study, which is not the subject of this document.

Take, for example, a service composed of a database without autonomous replication and two application servers in load balancing.

Services in load balancing can be moved to semi-site B without too many problems, taking care to carry out the move at times of lower usage.

An important consideration is that the services to be moved must be autonomous with respect to boot. This aspect must always be kept in mind in all subsequent activities as well. Semi-site B must be completely autonomous and must be able to perform a full boot even in the absence of semi-site A. Any SAN mount areas must be located in semi-site B.

The split of the DB layer, considering here a DB without autonomous replication, will take place in two stages. The first stage concerns the replication of the data area, which must be carried out synchronously: the storage is configured to replicate the data area reserved for the DB. Before proceeding further, it is necessary to wait for the storage replica to reach synchronization with the primary storage (which can take several hours).

Once the replicas have reached sync status, the redundant DB component can be moved to semi-site B. In this case too, boot must be completely autonomous.

To carry out fail-over tests without repercussions on production, we will use snapshot technology, the same used in the preliminary tests of the storage component. This test mode is important because a replicating volume cannot be “mounted” by an operating system for reading and writing. (A read/write mount is possible only with a global file system and in any case with applications certified for this mode, such as Oracle RAC.)

The snapshot “mount” of the replicating DB volume allows the test to be carried out without interrupting production and without interrupting replication. Another consideration when moving the redundant DB service is IP addressing. Two methods are possible: the first with the movement of the virtual address, the second with routing through LBL®ADC.

The first solution has the disadvantage of reconciliation in case of fail-over: when connectivity returns after a temporary interruption there would in fact be an address conflict. This solution is not recommended in operational continuity environments, since the situations are profoundly different from those of a cluster built on a single site with shared data (see the paragraph Continuity of Operations: Scenario).

The second option has the advantage of allowing Business Continuity procedures to be tested periodically without affecting production. Moreover, in case of fail-over followed by a sudden restoration of connectivity between the two semi-sites, the routing layer will redirect traffic, without any ambiguity or conflict, toward the DB declared master. In LBL®ADC this mode was provided for in the design phase of the product, with a routing policy specifically designed for such architectures.

Once the tests of the DB instance in semi-site B have been carried out, the routing rules are set in the routing layer. The routing rules are set with a “failover” balancing policy so as to direct traffic to the right DB component. The AppServer components will now point to the routing layer and no longer directly to the DB. A sketch of this policy follows.
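As an illustration of such a “failover” policy (hypothetical Python logic, not the actual ADC configuration syntax), the routing layer only ever forwards to the endpoint currently flagged as master; the flag is switched by the fail-over workflow, never by the routing layer itself.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DbEndpoint:
    name: str
    address: str
    is_master: bool
    healthy: bool = True     # in reality updated by periodic health checks

def route_db_request(endpoints: List[DbEndpoint]) -> Optional[DbEndpoint]:
    """Failover policy: route only to the endpoint currently declared master.

    The 'is_master' flag is switched by the fail-over workflow / decision
    engine, never by the routing layer: if the master is unreachable the
    request is refused rather than guessed, so a returning semi-site can
    never create an address conflict or a logical split-brain.
    """
    for endpoint in endpoints:
        if endpoint.is_master and endpoint.healthy:
            return endpoint
    return None              # block access until the workflow promotes the other semi-site


endpoints = [
    DbEndpoint("db-semi-site-A", "yy.yy.yy.yy", is_master=True),
    DbEndpoint("db-semi-site-B", "zz.zz.zz.zz", is_master=False),
]
print(route_db_request(endpoints).name)           # db-semi-site-A
endpoints[0].healthy, endpoints[0].is_master = False, False
endpoints[1].is_master = True                     # promotion done by the fail-over workflow
print(route_db_request(endpoints).name)           # db-semi-site-B
```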

The infrastructure is now ready to deliver the service in Business Continuity. The operations that follow serve to automate the actions that, in case of failure of semi-site A, bring semi-site B to the primary state. The next step will be the construction of the workflows that automate these operations.

N.B. In the image, the [Routing] element between the [App Layer] and the [DB Layer] was introduced for the sake of simplicity. In reality these are routing rules in the balancing layer, and for this reason the element will not appear in the next images.

The evolution of the project: continuity of operation

The step of preparing the workflows serves to automate the actions that will bring semi-site B to the primary state in case of failure of semi-site A. The following workflows are then set up:

Workflow name: Graceful shutdown
Applies to: Semi-site A
Description: Performs a controlled shutdown of the surviving layers in semi-site A after a service failure event.

Workflow name: Restart
Applies to: Semi-site A
Description: (Optional) Performs a restart of the elements that are usually the cause of a failure in service delivery. This workflow makes it possible to avoid false fail-overs in the presence of software faults such as memory overflows, stack overflows, loops, etc.

Workflow name: Take Control
Applies to: Semi-site B
Description: Performs the operations that bring semi-site B to the primary state. It executes in sequence:
A) reversal of the storage replicas
B) start of the DB server
(Note: in this case the AppServer application layer requires no intervention from the workflow because it is re-routed automatically by LBL®ADC. In other cases it may be necessary to start virtual or physical machines, or processes within systems already running.)

Once the necessary workflows have been completed, the automatic fail-over decision makers, LBL®Commander Decision Engine, are configured.

The three LBL®Commander Decision Engine instances remain constant in number for all the services to be placed in automatic fail-over. LBL®Commander Decision Engine was designed to constantly check the status of service delivery and/or of its components.

LBL®ADC, instructed by LBL®Commander Decision Engine, will automatically direct service requests in a manner consistent with the state of the cluster. The sketch below illustrates this decision cycle.
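The interaction between the health checks, the optional Restart workflow and the Take Control workflow can be sketched as follows; this is illustrative decision logic based on the workflow table above, with placeholder health checks and echoed workflows, not a description of the product's internals.

```python
import subprocess

def workflow(name: str) -> bool:
    """Placeholder: launch a named workflow and report success.

    The workflows are simply echoed here; the real ones are the 'Restart',
    'Graceful shutdown' and 'Take Control' procedures described above.
    """
    return subprocess.run(["echo", "running workflow:", name]).returncode == 0

def service_healthy() -> bool:
    # Placeholder health check of the service delivered by semi-site A.
    return False

def decision_cycle() -> None:
    if service_healthy():
        return                                    # state 1: nothing to do
    # First try a software restart on the primary semi-site, to avoid a
    # false fail-over caused by memory overflows, loops, etc.
    if workflow("Restart (semi-site A)") and service_healthy():
        return
    # Restart was not enough: shut down what survives on A and promote B.
    workflow("Graceful shutdown (semi-site A)")
    workflow("Take Control (semi-site B)")        # reverse replicas, start the DB
    # From here on, the ADC routes requests toward semi-site B automatically.

decision_cycle()
```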

The centralization of the high-reliability components in the LBL®Commander Decision Engine module provides a single console to verify the high-reliability state of the whole datacenter, highlighting possible critical situations.

Example of the Decision Engine highlighting a critical event in a service.

Example of a workflow with its self-documenting procedural steps.

Once the fail-over procedures are complete, correct management requires outlining the fail-back procedures to return to the initial situation once the event that caused the crisis has passed.

The evolution of the project: fail-back

Fail-back procedures are very important in Operational Continuity and/or Disaster Recovery processes. With LBL®Commander Work Flow it is possible to act on all levels and on all the semi-sites concerned from a single console. It is therefore possible to arrange all the fail-back procedures in a coordinated manner, including actions to be carried out across multiple semi-sites. At the end of the operation, semi-site A returns to delivering the services as primary.

Below is an example of a fail-back procedure described in a workflow. The semi-sites are first brought to a known, reconciled state, and then the initial operation and usability of the service are restored.

Benefits

What is described in this document is largely independent of the LBL®S.A.A.I. components.

The dynamics, the case studies and the procedures are essential for placing services in operational continuity, regardless of the adoption of the LBL®S.A.A.I. infrastructure.

The adoption of the LBL®A.A.I. components makes the process simple, controllable and repeatable over time. Below are the benefits of adopting the LBL®A.A.I. components in an operational continuity environment.

LBL®A.A.I. strengths

  • Lower expenditure on cluster licenses (no longer usable in this context)
  • Reduction of the costs of enterprise operating system/virtualization licenses (often required when purchasing the cluster versions)
  • Elimination of virtual addresses in the datacenter (a consequence of eliminating first-generation clusters)
  • Centralization of high-reliability policies
  • Automation of procedures across different layers, operating systems, virtualization technologies and semi-sites
  • Reproducibility of high-reliability tests, safely and at lower cost
  • Integration with existing monitoring systems
  • Homogenization of Business Continuity / Disaster Recovery / Fast Disaster Recovery operations with a single tool
  • Automation of application restart before fail-over promotion
  • Automation of recovery policies in response to fail-over events
  • Application relocation per individual service, with automated testing and commissioning procedures
  • An industrial transition from implementation project to operational management through self-documented procedures
  • Savings in management operations through workflows

The LBL®S.A.A.I. components are also enablers for Disaster Recovery management, which can be realized with different degrees of automation.

Conclusions

The introduction of Operational Continuity and/or Disaster Recovery functionality in enterprise environments requires tools that facilitate centralization, operation and control.

Maintaining services in Business Continuity requires an “industrial” organization of processes and procedures, not unlike the introduction of robotics in manufacturing.

LBL®S.A.A.I. does not perform actions different from those a human would perform; it simply automates them and makes them indefinitely repeatable over time.