Commander setup


LBL®Commander is a product intended for mission-critical environments; therefore only staff who have attended the course and passed the examination are authorized to certify the installation and maintenance of products in operation. All certified personnel are issued a temporary license by TCOGROUP.

LBL®Commander can be installed in two modes, together with LBL®ADC or alone (standalone). In both cases the specific license, LBL®Commander Work Flow or LBL®Commander Decision Engine, is required to use it.

Installation

To install LBL®Commander you must first carry out the base installation described in the installation and setup manual. For MS Windows installations the reference is the manual LBL_GDG_WINPROTABLE_InstallUpdate_ita/eng.pdf; for installation on a Linux system the reference manual is LBL_GDG_InstallUpdateFrom_ita/eng.pdf.

LBL®Commander Introduction

LBL®Commander introduces a new concept of application high reliability, playing the role of coordinator of activities in a mission-critical data center. LBL®Commander is composed of two main modules: LBL®Commander Work Flow and LBL®Commander Decision Engine. The two modules have been designed to work in cooperation with each other or, if automatic operations are not required, the LBL®Commander Work Flow component can be used alone. LBL®Commander Decision Engine was designed to cooperate with LBL®ADC.

The general architecture can be summarized in the following scheme:

The flexibility of the instrument can be expressed in different architectures. Below is another example of a possible architecture where the LBL®Commander Decision Engine and Work Flow components are on the same machine in every HalfSite…

…or a Disaster Recovery scenario with manual take-over, and therefore with only the LBL®Commander Work Flow components.

Precisely this separation of tasks between LBL®Commander Work Flow (what must be done) and LBL®Commander Decision Engine (when it must be done) allows us to divide the installation into two stages: first checking all the flows manually and then, if automatic operation is required, proceeding with the installation of LBL®Commander Decision Engine.

For this reason the manual is divided into two chapters, one concerning the installation of LBL®Commander Work Flow and the other concerning the installation of LBL®Commander Decision Engine.

LBL®Commander Work Flow Introduction

The installation of the LBL®Commander Work Flow module requires a prior analysis of the objectives we wish to pursue since, by its nature, LBL®Commander Work Flow is a tool capable of carrying out any scheduled operation.

For this manual we will use a minimal example in which an Apache Tomcat process is managed through its life cycle: start and shutdown (NOTE: shutdown here means stopping the applications, not the operating system that hosts them). The same example can then also be reused for the LBL®Commander Decision Engine component. For the moment we will identify two types of job: normalPrimer (primary start trigger) and gracefulShutdown (controlled shutdown of the applications).

LBL®Commander Work Flow design

LBL®Commander Work Flow starts from the premise that any complex operation can be broken down into elementary objects.

The first operation is therefore the identification of the jobs to be carried out (Work Flows) and of the activities within each single job (steps).

If, for example, we wish to carry out the start of an Apache Tomcat process, we must first of all identify the job type with a name. In this case we will choose normalPrimer as the identification name of the Work Flow that triggers the start of the process. Such a start should surely also be matched by a controlled stop, so for this purpose we also identify a name for that job: gracefulShutdown.

Once the jobs have been identified, we determine for each job (Work Flow) its characteristic steps.

In this case, for the Work Flow normalPrimer we certainly need one step that will launch Tomcat (e.g. startupTomcat) and one step that will terminate the running Tomcat (e.g. shutdownTomcat).

Another operation is the identification of the termination steps of a Work Flow. By convention a Work Flow terminates with a step named normalEnd if the operation is successful and with a step named abEnd (abnormal end) if the operation has not ended successfully.

As we shall see below, the normalEnd and abEnd steps are fairly constant across all types of Work Flow, identifying the correct or abnormal end of a work flow.

Once the typical steps of our project have been identified, we can focus on the development of the commands that will lead to the definition of our work flow.

LBL®Commander Work Flow plan addresses

Since LBL®Commander Work Flow is a service available on the network, it is necessary to draw up the plan of addresses in order to identify which servers, and on which networks, the service will be made available. The service is delivered in the form of Remote Workflow Commands (RWC) in safe mode (HTTP-SSL, authenticated). It is possible to identify one or more LBL®Monitor instances that, through their Web Console, can interact with LBL®Commander Work Flow. LBL®Commander Work Flow is also used by LBL®Commander Decision Engine when automatic triggering in response to application events is provided.

The plan of addresses for LBL®Commander Work Flow is very simple and can be drawn up with a simple table such as the one proposed below:

Hostname/Address     Port number    Description

Legendonebackend     54444          HalfSite A.A

Legendtwobackend     54444          HalfSite B.B

Note: the nodes/locations are identified by letters separated by punctuation. The punctuation indicates the depth of the infrastructure. The first letter indicates the context with any relationships with LBL®Commander Decision Engine decision engines. The general architectural scheme is repeated here for convenience of exposition.
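As a quick manual verification of the plan, the reachability of each Remote Workflow Command endpoint can be probed from a console host. The sketch below is only illustrative: it assumes wget is available and uses the /HealthCheck URIPath shown later in this manual, and it only checks reachability, not authentication.

# probe the RWC endpoints listed in the table above (illustrative)
for host in legendonebackend legendtwobackend
do
  wget -q --no-check-certificate -O /dev/null https://$host:54444/HealthCheck \
    && echo "Work Flow on $host reachable" \
    || echo "Work Flow on $host NOT reachable"
done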

LBL®Commander Work Flow Web Console

LBL®Commander Work Flow requires one or more consoles, a minimum of 2 in mission-critical environments, in order to be managed. You can use the administration console of the same LBL®Commander Work Flow installation or, as in the drawing, use other LBL®Monitor instances located on different servers used as administration consoles.


 

 

LBL®Commander Work Flow directory and files

LBL®Commander Work Flow uses only two directories and one configuration file, plus the license file.

Default directories are:

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterWF/conf

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterWF/surfaceClusterWFCommandDir

————————————————————————————————————-

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterWF/conf

Contains the license and the configuration file surfaceclusterwf.xml

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterWF/surfaceClusterWFCommandDir

It is the default directory for the scripts/executables that will be launched by the steps of a work flow. The distribution provides in this directory a set of batch files with names that are typical of a site take-over flow:

For UNIX / Linux:

Unix/flushDisk.sh

Unix/fractureReplications.sh

Unix/fsck.sh

Unix/inversionReplication.sh

Unix/mountFileSystem.sh

Unix/restartApache.sh

Unix/selfTest.sh

Unix/shutdownApplicationOne.sh

Unix/shutdownApplicationThree.sh

Unix/shutdownApplicationTwo.sh

Unix/startApplicationOne.sh

Unix/startApplicationThree.sh

Unix/startApplicationTwo.sh

Unix/startDatabase.sh

Unix/umountFileSystem.sh

Unix/waitSynchronization.sh

Unix/checkApacheActivity.sh

For  MS Windows environments:

Windows\flushDisk.bat

Windows\fractureReplications.bat

Windows\fsck.bat

Windows\inversionReplication.bat

Windows\mountFileSystem.bat

Windows\restartApache.bat

Windows\selfTest.bat

Windows\shutdownApplicationOne.bat

Windows\shutdownApplicationThree.bat

Windows\shutdownApplicationTwo.bat

Windows\startApplicationOne.bat

Windows\startApplicationThree.bat

Windows\startApplicationTwo.bat

Windows\startDatabase.bat

Windows\umountFileSystem.bat

Windows\waitSynchronization.bat

Windows\checkApacheActivity.bat

Each of these files contains a minimal template ready to be populated with the appropriate commands for the platform concerned. By way of example, the content of the commands listed above is shown below. For reasons of practicality, especially for the MS Windows batch files, the first parameter passed to each command distinguishes an interactive mode from a batch mode. This is convenient during implementation because in interactive mode the exit is not performed, which on MS Windows would close the launch window; instead, the return code with which the batch file would have exited is displayed.

Obviously, as far as LBL®Commander Work Flow is concerned, the only value used is the return code, so these examples are intended only as a fully editable track.

MS Windows

@ECHO OFF
REM  LBL(tm) LoadBalancer
REM
REM  This is a commercial software
REM  You shall not disclose such Confidential Information and shall use it
REM  only in accordance with the terms of the license agreement
REM
REM  www.tcoproject.com
REM  www.lblloadbalancer.com
REM  mailto:info@tcoproject.com
REM
REM  LBL(tm) LoadBalancer is built on TCOProject(tm) SoftwareLibrary
REM  LBL and TCOProject are trademarks of F.Pieretti. All rights reserved.
REM
REM  LBL(r)Surface Cluster Template

set LBL_INTERACTIVE_CMD=%1
if "%LBL_INTERACTIVE_CMD%" == "" (set LBL_INTERACTIVE_CMD=false)
echo %LBL_INTERACTIVE_CMD%

set RETURN_OK=0
set RETURN_KO=99
set returnCode=%RETURN_OK%

REM
REM action
REM

REM
REM exit
REM
:end_procedures
if "%LBL_INTERACTIVE_CMD%" == "true" (goto end_echo_return_code)
:end_exit_return_code
exit %returnCode%
:end_echo_return_code
ECHO EXIT=%returnCode%
:end

Solaris / Linux

#!/bin/sh
# LBL(tm) LoadBalancer
#
# This is a commercial software
# You shall not disclose such Confidential Information and shall use
# it only in accordance with the terms of the license agreement
#
# www.tcoproject.com
# www.lblloadbalancer.com
# mailto:info@tcoproject.com
#
# LBL(tm) LoadBalancer is built on TCOProject(tm) SoftwareLibrary
# LBL and TCOProject are trademarks of F.Pieretti. All rights reserved.
#
# LBL(r)Surface Cluster Template

LBL_INTERACTIVE_CMD=${1:-'false'}

RETURN_OK=0
RETURN_KO=99
returnCode=$RETURN_OK

#
# action
#

#
# exit
#
test "$LBL_INTERACTIVE_CMD" != "true" && exit $returnCode
test "$LBL_INTERACTIVE_CMD" = "true" && echo exit=$returnCode

It is possible to carry out the same operations also directly from the web console through the following steps:


Commander -> Workflow Managers -> Choice of module

Choice of workflow

Choice of step to edit, e.g.:

Editing of the script:

Edit

LBL®Commander Work Flow implement batch files

The first step in the deployment and installation of a work flow is therefore the analysis of the steps to be performed, then the writing and testing of the individual steps, and finally their insertion into a workflow.

For our purposes we copy one of the batch files distributed in:

MS Windows

cd (LBL_HOME)\procsProfiles\A03_LBLGoSurfaceClusterWF\surfaceClusterWFCommandDir

copy selfTest.bat tomcatStartup.bat

Unix / Linux

cd (LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterWF/surfaceClusterWFCommandDir

cp selfTest.sh tomcatStartup.sh

For both platforms, but especially on MS Windows, it is important to make sure that the commands executed in the batch file do not perform an autonomous exit, whether implicit or explicit. To verify this it is sufficient to use a single exit point in the batch and to signal the successful passage explicitly (common good programming practice).
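A minimal sketch of this single-exit-point pattern, in the style of the Unix templates, is shown below; some_command is only a placeholder for the real action.

#!/bin/sh
# single exit point: each command only records its outcome, the script exits once at the end
RETURN_OK=0
RETURN_KO=99
returnCode=$RETURN_OK

some_command || returnCode=$RETURN_KO          # placeholder for the real action
echo "step completed, returnCode=$returnCode"  # explicit signal of the successful passage

exit $returnCode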

In our specific case we will populate our new batch files with the operations to start Tomcat.

Below are the files, with the command lines used to perform the start and the stop of the Tomcat instances highlighted.

MS Windows tomcat startup:

@ECHO OFF
REM  LBL(tm) LoadBalancer
REM
REM  This is a commercial software
REM  You shall not disclose such Confidential Information and shall use it
REM  only in accordance with the terms of the license agreement
REM
REM  www.tcoproject.com
REM  www.lblloadbalancer.com
REM  mailto:info@tcoproject.com
REM
REM  LBL(tm) LoadBalancer is built on TCOProject(tm) SoftwareLibrary
REM  LBL and TCOProject are trademarks of F.Pieretti. All rights reserved.
REM
REM  LBL(r)Surface Cluster Template

set LBL_INTERACTIVE_CMD=%1
if "%LBL_INTERACTIVE_CMD%" == "" (set LBL_INTERACTIVE_CMD=false)
echo %LBL_INTERACTIVE_CMD%

set RETURN_OK=0
set RETURN_KO=99
set returnCode=%RETURN_OK%

REM
REM action
REM
setlocal
set JAVA_HOME=C:\work1\bin\Java\jdk1.6.0_07
cd C:\work1\bin\Servers\apache-tomcat-6.0.14_8484\bin
call startup.bat
set returnCode=%ERRORLEVEL%
REM attention: returns 1 even when OK

REM
REM exit
REM
:end_procedures
if "%LBL_INTERACTIVE_CMD%" == "true" (goto end_echo_return_code)
:end_exit_return_code
exit %returnCode%
:end_echo_return_code
ECHO EXIT=%returnCode%
:end

In this case, during the tests we realized that the start of Tomcat in a Windows environment always returns 1, even if the operation is successful. We preferred not to modify this behavior but to highlight it in a comment and then handle the return logic, as described below, in the step that executes this batch. In this way we have documented this behavior in the flow and we have not added logic inside the launch program. Besides, not having modified the file belonging to the application will bring an undoubted advantage in subsequent maintenance, which will not have to take account of changes made to the application's own files at each update.

If we try to run this command from the command line we will get this result:



The same implementation was introduced in the tomcatShutdown.bat command with the shutdown command (again with the echo of the return code at the start and at the exit):


REM  LBL and TCOProject are trademarks of F.Pieretti. All rights reserved.
REM
REM  LBL(r)Surface Cluster Template

set LBL_INTERACTIVE_CMD=%1
if "%LBL_INTERACTIVE_CMD%" == "" (set LBL_INTERACTIVE_CMD=false)
echo %LBL_INTERACTIVE_CMD%

set RETURN_OK=0
set RETURN_KO=99
set returnCode=%RETURN_OK%

REM
REM action
REM
setlocal
set JAVA_HOME=C:\work1\bin\Java\jdk1.6.0_07
cd C:\work1\bin\Servers\apache-tomcat-6.0.14_8484\bin
call shutdown.bat
set returnCode=%ERRORLEVEL%
REM attention: returns 1 even when OK

REM
REM exit
REM
:end_procedures
if "%LBL_INTERACTIVE_CMD%" == "true" (goto end_echo_return_code)
:end_exit_return_code
exit %returnCode%
:end_echo_return_code
ECHO EXIT=%returnCode%
:end

Solaris / Linux

In Solaris and Linux the implementations to be carried out are the same, obviously with the differences due to the scripting language used and any differences in behavior.

Below is the example with the setup of the batch file tomcatStartup.sh:

#!/bin/sh
# LBL(tm) LoadBalancer
#
# This is a commercial software
# You shall not disclose such Confidential Information and shall use
# it only in accordance with the terms of the license agreement
#
# www.tcoproject.com
# www.lblloadbalancer.com
# mailto:info@tcoproject.com
#
# LBL(tm) LoadBalancer is built on TCOProject(tm) SoftwareLibrary
# LBL and TCOProject are trademarks of F.Pieretti. All rights reserved.
#
# LBL(r)Surface Cluster Template

LBL_INTERACTIVE_CMD=${1:-'false'}

RETURN_OK=0
RETURN_KO=99
returnCode=$RETURN_OK

#
# action
#
JAVA_HOME=/TCOProject/bin/java; export JAVA_HOME
cd /TCOProject/bin/tomcat/apache-tomcat-6.0.24/bin
./startup.sh
returnCode=$?

#
# exit
#
test "$LBL_INTERACTIVE_CMD" != "true" && exit $returnCode
test "$LBL_INTERACTIVE_CMD" = "true" && echo exit=$returnCode

tomcatShutdown.sh

…
# LBL(r)Surface Cluster Template

LBL_INTERACTIVE_CMD=${1:-'false'}

RETURN_OK=0
RETURN_KO=99
returnCode=$RETURN_OK

#
# action
#
JAVA_HOME=/TCOProject/bin/java; export JAVA_HOME
cd /TCOProject/bin/tomcat/apache-tomcat-6.0.24/bin
./shutdown.sh
returnCode=$?

#
# exit
#
test "$LBL_INTERACTIVE_CMD" != "true" && exit $returnCode
test "$LBL_INTERACTIVE_CMD" = "true" && echo exit=$returnCode

As can be noticed, one difference between the start on MS Windows and on Solaris / Linux is the return code, which in this case returns 0 if the operation is successful.

By performing the operations from the command line you will obtain the following result:
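A minimal sketch of such a run in interactive mode, assuming the paths configured above, could look like this (the output is indicative):

cd (LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterWF/surfaceClusterWFCommandDir
sh tomcatStartup.sh true     # "true" selects interactive mode: the script echoes the code instead of exiting
# expected last line when Tomcat starts correctly:
# exit=0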

With the limits of these examples, at the end of this chapter we can see that we have started to write our library of "objects" reusable across different platforms and for different projects. We now need to determine the use of these objects inside the logic of the work flows.

LBL®Commander Work Flow build Work Flows

Once the detailed testing of the basic commands for the realization of our workflows has been completed, it remains to implement the context logic in the Work Flows that we want to achieve.

Parameterisation of the workflow begins by setting the address on which commands will be accepted: typically no change has to be made, because the system is ready to accept commands on the management address.

Commander->WorkFlow Managers -> choice of workflow

Another interesting value is: exclusiveRunWorkFlow="false"

This parameter indicates whether the entire Work Flow Engine may have more than one Work Flow in a running state at the same time. For our purposes the setting "false", i.e. more Work Flows can be launched at the same time, is the correct option. It should be recalled that in any case the Work Flow Engine does not allow, independently of this parameter, the same Work Flow to be launched twice simultaneously. In other words, while the Work Flow named normalPrimer is in a running state it cannot be launched again, but the Work Flow gracefulShutdown can be launched at the same time. If exclusiveRunWorkFlow="true" had been set and the Work Flow normalPrimer were in a running state, no other Work Flow could be launched, including gracefulShutdown, until the end of normalPrimer.

Once this setting has been made so that the Workflow System is accessible from the outside, we move to the next paragraph, which contains the Work Flows with the actions divided into single steps.

In our case we should end up with two Work Flows in order to perform the start and the stop of the Tomcat process, as per the specifications we set initially.

The final result for the start will be similar to the one described below, where in the <workflow> paragraph you can see the parameter name="normalPrimer" (the name of that particular work flow), a brief description, and the name of the step with which the workflow starts, startName="startupTomcat":

The parameters related to each step are fairly intuitive, so we will focus only on some of them. A complete description of each parameter and its behavior can be found in the Reference Guide.

  <!-- ******************************************************************* **
  START WORK FLOW NORMAL PRIMER
  ** ******************************************************************* -->
  <workflow name="normalPrimer"
            description="Start Apache Tomcat Primary"
            startName="startupTomcat">
    <step enable="true"
          name="startupTomcat"
          description="Start Apache Tomcat primary"
          waitBeforeExecute="1000"
          sysCommandTimeOut="600000"
          evaluateReturnCode="true"
          command="tomcatStartup.bat"
          maxRetry="3"
          gotoWhenFalse="abEnd">
      <returncode value="0"
                  description="OK"
                  result="true"/>
      <returncode value="1"
                  description="OK"
                  result="true"/>
    </step>
    <!-- *********************************************************** **
    END FUNCTIONS
    ** *********************************************************** -->
    <step enable="true"
          commitWorkFlow="true"
          name="normalEnd"
          description="Normal end"
          waitBeforeExecute="10000"
          evaluateReturnCode="false"
          gotoWhenTrue="normalEnd"
          gotoWhenFalse="normalEnd">
    </step>
    <step enable="true"
          name="abEnd"
          description="Abnormal end"
          evaluateReturnCode="true"
          waitBeforeExecute="10000"
          gotoWhenTrue="abEnd"
          gotoWhenFalse="abEnd">
      <returncode value="0"
                  description="KO FOR ABNORMAL END"
                  result="false"/>
    </step>
  </workflow>

The significant parameters to dwell on are the command that will be run in the step, command="tomcatStartup.bat", and the step normalEnd with the parameter commitWorkFlow="true". The latter parameter indicates that the step is a point of consistency and therefore will not produce log entries after the first time. In this case the step enters a loop with no way out, and this behavior is intentional. As we will see later, this behavior is exploited so as not to allow the same workFlow to run twice at the same time. In production it is a good idea to keep the workFlow looping by carrying out a periodic HealthCheck of the service, for example through wget in this case, and to exit on the return code with any repair actions, such as a restart or the signalling of the problem.
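As an illustration of this suggestion, below is a minimal sketch of a looping HealthCheck step built around wget; the script name, hostname, port and sleep interval are assumptions for the example and are not part of the distribution.

#!/bin/sh
# healthCheckTomcat.sh - minimal sketch of a periodic HealthCheck step (illustrative)
RETURN_OK=0
RETURN_KO=99

while true
do
  # wget exits with 0 only if the HTTP GET completes successfully
  if wget -q -O /dev/null http://localhost:8080/
  then
    sleep 60                   # service healthy: check again later
  else
    echo "Tomcat HealthCheck failed"
    exit $RETURN_KO            # non-zero code lets the Work Flow route to a repair/restart step
  fi
done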

Already from this moment it is possible to start LBL®Commander Work Flow and check the effect. In order for it to be viewed from the Web Console, the monitor.xml file must have been set with the references to the LBL®Commander Work Flow server that you want to display, as specified in the paragraph seen previously.

Following the hyperlink you can check, step by step, the path traveled by the Work Flow once started:

It is recommended to run it already at this point to verify its effects and operation. In the case of complex Work Flows it is possible to run one step at a time using the "Step action" button on the right side of each step.

If the tests of this first Work Flow were successful, it is possible to move on to the development of the next work flow for the execution of the controlled shutdown of the Apache Tomcat service.

  <!-- ******************************************************************* **
  START WORK FLOW graceful shutdown
  ** ******************************************************************* -->
  <workflow name="gracefulShutdown"
            description="graceful shutdown Apache Tomcat"
            startName="shutdownTomcat">
    <step enable="true"
          name="shutdownTomcat"
          description="graceful shutdown Apache Tomcat"
          waitBeforeExecute="1000"
          sysCommandTimeOut="600000"
          evaluateReturnCode="true"
          command="tomcatShutdown.bat"
          maxRetry="3"
          gotoWhenFalse="abEnd">
      <returncode value="0"
                  description="OK"
                  result="true"/>
      <returncode value="1"
                  description="OK"
                  result="true"/>
    </step>
    <!-- *********************************************************** **
    END FUNCTIONS
    ** *********************************************************** -->
    <step enable="true"
          commitWorkFlow="true"
          name="normalEnd"
          description="Normal end"
          waitBeforeExecute="10000"
          evaluateReturnCode="false"
          gotoWhenTrue="normalEnd"
          gotoWhenFalse="normalEnd">
    </step>
    <step enable="true"
          name="abEnd"
          description="Abnormal end"
          evaluateReturnCode="true"
          waitBeforeExecute="10000"
          gotoWhenTrue="abEnd"
          gotoWhenFalse="abEnd">
      <returncode value="0"
                  description="KO FOR ABNORMAL END"
                  result="false"/>
    </step>
  </workflow>

By performing the stop and the start of the LBL®Commander Work Flow process through its administration interface, you can also manage this new Work Flow.

Following the hyperlink of the new Work Flow you can verify its path:

In a very simple manner you can now manage the start and the stop of the Apache Tomcat process remotely.

LBL®Commander Work Flow distributed events (@RWC)

From version 7.1 it is possible to execute, within a step of a workflow, a further command that starts a remote workflow (@RWC). This functionality allows LBL®Commander Work Flow to perform distributed operations. With this feature you can therefore propagate a workflow process from the highest layers to the lowest, or vice versa, on different platforms, and so run articulated operations on all the components that form the application. The application availability is in fact the sum of the availability of many layers: from data availability, mass storage systems, the availability of database services and directory services, down to the availability of the application services provided by the application servers and the availability of connectivity.

These services are on systems that are physically separated, sometimes even geographically, hence the need to "govern" all these elements in a distributed but coordinated manner. Think for example of inducing a targeted application restart. To perform an operation of this type we need to identify the resources that contribute to making the service available and then, once the component in "failure" has been determined, check the possibility of restoring operation. Activities articulated in this way must be governed by well-defined and tested procedures so that they are scientifically repeatable. With the introduction of the @RWC feature in a step of a workflow, it is possible from a main WorkFlow to perform these tasks in a manner that is centralized, targeted, localized and automatable.

  1. Centralized

    Centralized because in a single LBL®Commander Work Flow Server it is possible to indicate the management workflows that trigger WorkFlows on remote systems.
  2. Targeted

    It is possible to define specialized WorkFlows to trigger maintenance actions on individual services.
  3. Localized

    Each individual WorkFlow system localizes its own management of the application life cycle.
  4. Automated

    Through LBL®Commander Decision Engine it is possible to automate all operations to provide Business Continuity and/or Disaster Recovery services.

Also in this case particular attention must be paid to ease of deployment and use, with documented and implicit traceability.

To carry out a @RWC operation it is sufficient to insert in the "command" parameter of a "<step>" paragraph the reserved keyword @RWC, followed by the indication of where the command should be executed (hostName) and what should be done (workFlow). If you wanted to perform the RWC "startHealthCheckAndRestart" on the system "dbserver" it would be sufficient to indicate:

command="@RWC hostName=dbserver workFlow=startHealthCheckAndRestart"

The login and password are not present because LBL®Autonomous Delegated Authentication is used.

The login and password parameters will also be used in the future to perform new Inner Commands.

The complete parameterisation of the @RWC command allows all start operations to be performed: from the start of an entire workflow, to the start of a workflow from a predetermined step, to the start of a single step.

The full command with all parameters is as follows:

@RWC hostName=name or address of the target RWC system

  [portNumber=port number of the SCWF system] default 54444

  [uriPath=URIPath of the SCWF system] default /SCWFCommand

  workFlow=workflow name

  [step=name of the step to start from; if "" starts from the first] default ""

  [command=runWorkFlow|stopWorkFlow] default "runWorkFlow"

  [frhd=if true only the indicated step is performed] default false

Ex. 1: start the workflow from step secondStep

command="@RWC hostName=dbserver workFlow=startHealthCheckAndRestart step=secondStep"

Ex. 2: run only the step secondStep of the workflow

command="@RWC hostName=dbserver workFlow=startHealthCheckAndRestart step=secondStep frhd=true"

It is useful to observe that the launch of a @RWC on remote systems does not produce a significant return code. @RWC notifies an error only if it fails to start the procedure. Each remote procedure must be self-consistent by design, so that only the application availability is the discriminating factor of functionality.

LBL®Commander Work Flow Remote Batch

This service allows you to start remote batch files in order to perform distributed HealthChecks in safety. It consists of a Web service, which by default responds on port 5994, able to interpret an XML profile file for the launch of executables or batch files.

The installation must include the following steps:

1- Verify that the LBL®Commander Work Flow license has been set:

   (LBL_HOME)/lib/confSurfaceClusterWF/license.xml

2- Set the configuration of the service

3- Start the service, checking in the log that it is listening on the indicated address and port:

4- Write and verify the batch

  Example go.bat (Windows systems):

   echo %0 %1 %2 %3 %4 >>C:\logvai.txt
   set AA=%1
   exit 0

   Example go.sh (UNIX/Linux systems):

   echo startBatch $@ >>/tmp/logStartBatch.txt
   exit 0

5- Set the launch profile in

   (LBL_HOME)/lib/webroot_remotebatch/webapps/RemoteBatch/

The profile file has the following format:

<startBatch>
  <params remoteBatchTimeOut="10000"
          remoteBatchCheckRate="300"
          allowURIParams="false"
          debug="false"/>
  <legacyCommand>T:\work0\tmp\go.bat 11 </legacyCommand>
</startBatch>

Since it is interpreted at call time, exactly like an HTML page, its values can be changed dynamically on the fly.

The file name must be equal to the main paragraph. In the case described above the file must be called startBatch and have an extension such as .xml, .txt, etc. (MIME type text/html).

The parameters are few and fairly intuitive. A detailed description follows. For further information on each single parameter please refer to the Reference Guide documentation.

remoteBatchTimeOut=: default value "10000"

It is the waiting time before the batch/executable process is released.

remoteBatchCheckRate=: default value "300"

It is the waiting time between one attempt and the next after having waited remoteBatchTimeOut. After 3 attempts the launch of the batch is declared failed.

allowURIParams=: default value "false"

If true, allows additional parameters to be passed from the URI. The default is false for safety reasons. If not set, or set to false, any parameters set in the URIPath will be ignored.

debug=: default value "false"

If true, logs a warning with the values of the start of the process and of the obtained result.

<legacyCommand>=: default "null"

It is the batch to launch. For safety reasons this value cannot be empty.

The batch is launched through the request of a URL that names the descriptor file to interpret for the launch of the batch/executable.

It is possible to specify additional parameters for the launch of the program in the URL itself, as in the example below:

http://localhost:5994/RemoteBatch/startBatch.txt?parameters=2222 3333 4444

In this case the launch of the batch will be composed as follows in the parameters part:

T:\work0\tmp\go.bat 11 2222 3333 4444

To verify the execution of the batch you can use a simple browser and set the address, e.g.:

http://wileubuntudbenchbackend:5994/RemoteBatch/startBatch.xml

The XML interpretation error is related to the body of the message, which shows only the return code of the batch/executable. For a return code equal to 0 the HTTP response will be 200 OK. For return codes of the batch/executable different from 0 the HTTP response will be 500 Internal Server Error.
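The same verification can be scripted instead of using a browser, for example with wget, which returns a non-zero exit status on an HTTP 500 response; the hostname and profile name below are those of the example above and are purely illustrative.

# launch the remote batch and map the HTTP response to a shell exit status (illustrative)
if wget -q -O - "http://wileubuntudbenchbackend:5994/RemoteBatch/startBatch.xml"
then
  echo "remote batch returned 0 (HTTP 200 OK)"
else
  echo "remote batch returned a code different from 0 (HTTP 500)"
fi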

6- Set the automatic start of the service (from "manual" to "automatic")

WARNING: whatever is contained in the batch, or in the various compositions of the parameters, is the responsibility of the batch itself. The LBL service does not check the correctness of the batch or of the various compositions of the parameters, which remain completely under the care of the implementer.

LBL®Commander Work Flow with Decision Engine

LBL®Commander Work Flow can be coupled to the LBL®Commander Decision Engine functionality for automatic start in coincidence with application failure events.

By default LBL®Commander Decision Engine uses 3 Work Flows that distinguish the processes of start, stop and switch of an operating environment.

The three moments are identified as:

  1. normalPrimer

  Normal startup of the applications on the main node.

  2. gracefulShutdown

  Stop of the applications in controlled mode. This operation is triggered by LBL®Commander Decision Engine on the node/site where a failure event was detected, before giving control to another node/site. It is to be considered an attempt, since the node/site may no longer be available.

  3. takeControl

  It is the set of processes that need to be undertaken on a secondary node/site to take control of the application.

Note: From version 7.1 it is possible to identify an intermediate moment between the detection of the failure and the decision to move the services. This moment, called restartWorkFlow, was introduced to perform, before moving the services, a restart on the primary site of the components believed to be most subject to criticality in an infrastructure. Very often, in fact, the controlled restart of only the critical components allows a very low impact on operations and a better management of critical events, without resorting to moving the operations to another site/system with the consequent subsequent restore actions.

These moments, henceforth called workFlows, must be available in order to be able to use LBL®Commander Decision Engine. The names of the individual Work Flows can be changed according to need. It is also important to know that one LBL®Commander Work Flow instance can contain these actions several times, with different names, depending on the applications that you want to monitor with different LBL®Commander Decision Engines. In other words, if several application areas with different contexts coexist on one node and you want to drive their fail-over possibilities separately, it is sufficient to insert additional Work Flows with different names for the execution of the actions relating to each application area.

If you want to proceed with the installation of LBL®Commander Decision Engine it is therefore necessary to add a new Work Flow to the two already created in the previous chapters: takeControl.

In this case the minimal version of this new Work Flow will be exactly equal to normalPrimer, i.e. it will trigger the start of Apache Tomcat. The distribution contains some examples of this kind for Unix/Linux and MS Windows operating systems. The examples are a "template" on which to build your own implementations.

You must then proceed with the creation of the Work Flow "takeControl" so that it can be used as an event by LBL®Commander Decision Engine.

Below is the fragment of the file surfaceclusterwf.xml with the Work Flow takeControl:

  <!-- ******************************************************************* **
  START WORK FLOW TAKE CONTROL START
  ** ******************************************************************* -->
  <workflow name="takeControl"
            description="Take Control Start Apache Tomcat"
            startName="startupTomcat">
    <step enable="true"
          name="startupTomcat"
          description="Take Control Start Apache Tomcat"
          waitBeforeExecute="1000"
          sysCommandTimeOut="600000"
          evaluateReturnCode="true"
          command="tomcatStartup.bat"
          maxRetry="3"
          gotoWhenFalse="abEnd">
      <returncode value="0"
                  description="OK"
                  result="true"/>
      <returncode value="1"
                  description="OK"
                  result="true"/>
    </step>
    <!-- *********************************************************** **
    END FUNCTIONS
    ** *********************************************************** -->
    <step enable="true"
          commitWorkFlow="true"
          name="normalEnd"
          description="Normal end"
          waitBeforeExecute="10000"
          evaluateReturnCode="false"
          gotoWhenTrue="normalEnd"
          gotoWhenFalse="normalEnd">
    </step>
    <step enable="true"
          name="abEnd"
          description="Abnormal end"
          evaluateReturnCode="true"
          waitBeforeExecute="10000"
          gotoWhenTrue="abEnd"
          gotoWhenFalse="abEnd">
      <returncode value="0"
                  description="KO FOR ABNORMAL END"
                  result="false"/>
    </step>
  </workflow>

After stopping and starting the LBL®Commander Work Flow server, the final result should look similar to this example:

In an environment with two specular nodes it is sufficient to replicate the LBL®Commander Work Flow installation image by making a zip of the (LBL_HOME) directory and then changing only the references to the host.
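A minimal sketch of such a replication could look like the following; the hostname of the second node is illustrative, and the host references to adjust afterwards are those in surfaceclusterwf.xml and in any step scripts.

# on the first node: package the whole installation
cd (LBL_HOME)
zip -r /tmp/lblCommanderWF.zip .

# copy and unpack it on the second node
scp /tmp/lblCommanderWF.zip legendtwobackend:/tmp/
ssh legendtwobackend "mkdir -p (LBL_HOME) && cd (LBL_HOME) && unzip /tmp/lblCommanderWF.zip"
# then edit the configuration on the new node to change only the references to the host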

LBL®Commander Decision Engine Introduction

LBL®Commander Decision Engine is a geographical cluster designed to automate Fail-Over operations in mission-critical environments. This module can be considered the "thinking" module that automates the management of procedures (henceforth Work Flows) located on other servers, both local and geographical. LBL®Commander Decision Engine works in cooperation with LBL®Commander Work Flow to launch Remote Workflow Commands (RWC).

The general architecture can be summarized in the following scheme:

The flexibility of the instrument can be expressed in different architectures. Below is another example of a possible architecture where the LBL®Commander Decision Engine and Work Flow components are on the same machine in each node/site…

…or a Disaster Recovery scenario with automatic take-over.

LBL®Commander Decision Engine plan of addresses/resources

The design of an infrastructure based on LBL®Commander Decision Engine is structured around the following essential aspects:

  1. Identification of the resources that identify the public network
  2. Identification of the resources that identify the backend network
  3. Identification of the other decision engines which contribute to the quorum
  4. Identification of the application resources

These four elements lead us to draw up a plan of addresses/application resources that describes the operating context.

The first table of the plan of addresses serves to identify the addresses/services that may be elected as targets for determining the reachability of the public network. The suggested minimum number of targets is 3 (three); the optimum is 5 (five). The logic is that if none of the targets can be reached then the public network is declared "down". An example of a table of addresses/services is summarized below:

The HealthCheck manager was designed to manage different modes of health check on the basis of the parameters that are provided. If only the address parameter is provided it performs a "ping"; if the "port" parameter is also supplied it performs a TCP/IP connect and disconnect; if the URIPath is also provided it performs an HTTP GET and checks the response code, which must be 200 (OK) for the check to terminate correctly. A further parameter can also indicate whether the target HTTP service is in SSL mode.
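The different modes can be approximated manually from a shell for a quick check of the chosen targets; the commands below are only rough equivalents of what the Decision Engine does internally, and the addresses and ports are illustrative values taken from the tables in this chapter.

ping -c 1 192.168.43.143                                  # address only: ICMP check
nc -z 192.168.43.143 22                                   # address + port: TCP connect/disconnect
wget -q -O /dev/null http://192.168.43.143:8080/          # address + port + uriPath: HTTP GET, expects 200
wget -q --no-check-certificate -O /dev/null \
     https://legendonebackend:54444/HealthCheck           # same check with the SSL parameter set to true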

Public Network Health Check

Address           Port Number    URIPath    SSL    Description

192.168.43.143                                     System A1 public

192.168.43.146                                     System A2 public

192.168.43.151                                     System A3 public

A similar table must be compiled for the backend services:

The Backend Network Health Check

Address             Port Number    URIPath         SSL     Description

192.168.45.143                                             System A1 public

192.168.45.146                                             System A2 public

Legendonebackend    54444          /HealthCheck    True    Commander WF A.A

Legendtwobackend    54444          /HealthCheck    True    Commander WF B.B

In the latter example you can observe the more complex health checks on the HTTPS services belonging to LBL®Commander Work Flow that will be used for the automation of procedures.

The following table identifies the other decision engines which will contribute to the quorum. The decision-making quorum, i.e. the condition that allows an LBL®Commander Decision Engine instance to take a decision, is tied to the possibility of reaching at least one other LBL®Commander Decision Engine. If no other LBL®Commander Decision Engine can be reached, the node declares itself isolated and therefore unable to make an autonomous decision. The table describing the other decision engines (peers) will have the following form:

LBL®Commander Decision Engine peers

URL                                       Description

https://legendoneprivate:54445/           HalfSite A

https://legendtwoprivate:54445/           HalfSite B

https://legendquorumprivate:54445/        Quorum Site

The number of LBL®Commander Decision Engine instances must be three (3) for the determination of the quorum. A lower number of LBL®Commander Decision Engine instances would not be able to decide in the absence of one of the two instances, and a higher number could lead to decision-making islands (decision-making split brain). Another consideration concerns the Heart-Beat network, which does not necessarily have to be private but can be shared with other applications, and can also work in a geographical scope (internet).

The configuration files of each node will always contain only two (2) of the three references in the table, since a node never contains the reference to itself.
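Before starting the engines, the heartbeat ports of the other two peers can be checked for reachability from each node. The sketch below uses the hostnames of the table above and assumes nc is available; it only verifies that the port is open, not the authenticated service itself.

# on HalfSite A, for example, check the other two Decision Engine peers
for peer in legendtwoprivate legendquorumprivate
do
  nc -z $peer 54445 && echo "heartbeat port open on $peer" || echo "cannot reach $peer:54445"
done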

LBL®Commander Decision Engine Web Console

LBL®Commander Decision Engine requires one or more consoles, a minimum of 2 in mission-critical environments, in order to be managed. You can use the administration console of the same LBL®Commander Decision Engine installation or, as in the drawing, use other LBL®Monitor instances located on different servers used as administration consoles. From the same console both LBL®Commander Decision Engine instances and LBL®Commander Work Flow instances can be managed.



LBL®Commander Decision Engine directory and files

LBL®Commander Decision Engine uses only 3 directories and one configuration file, excluding the license file.

Default directories are:

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterDE/conf

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterDE/surfaceClusterDEStatus

(LBL_HOME)/lib/notificationDir

————————————————————————————————————-

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterDE/conf

Contains the license and the configuration file surfaceclusterde.xml

(LBL_HOME)/procsProfiles/A03_LBLGoSurfaceClusterDE/surfaceClusterDEStatus

It is the default directory for the persistence of the status flags. In this directory files are created that identify the resources declared down, and therefore no longer usable, even after a subsequent restart of the decision engine or of the whole system.

(LBL_HOME)/lib/notificationDir

This is the directory for notification to LBL®ADC of the unavailability of the service. It is managed automatically by LBL®Commander Decision Engine by creating a file with the following characteristics:

OutOfOrder.SCDEGroup_00000
     |        |       |
     |        |       +---------> Identifier number of the Fail-Over sequence,
     |        |                   in order of insertion. This value must be
     |        |                   entered in the appropriate parameter
     |        |                   <associateName endp="SCDEGroup_0000"…
     |        |                   in iproxy.xml
     |        +-----------------> Name of the cluster group controlling
     |                            the application
     +--------------------------> Out of Order notification identifier

LBL®Commander Decision Engine surfaceclusterde.xml

Once the data collection of the environment has been completed you can start to set the configuration file of the cluster.

The setting of the configuration file must be carried out on all three nodes where the LBL®Commander Decision Engine image has been installed through one of its distributions.


The first operation is then to put all the decision engine modules in a cluster and act on the variables to differentiate the behaviors:

The file contains two main sections: one for the management of the LBL®Commander Decision Engine server, while the others are functional to the management of the application clusters.

In the page below, the paragraphs relating to the management of the LBL®Commander Decision Engine server are highlighted with a blue frame, and the paragraphs relating to the management of a single Application Cluster in red. It is possible in the same LBL®Commander Decision Engine to manage multiple clusters by simply duplicating the paragraph <decisionEngine> and setting a different groupName with the name of the new cluster.

… LBL(tm) LoadBalancer is built on TCOProject(tm) SoftwareLibrary
LBL and TCOProject are trademarks of F.Pieretti. All rights reserved. </copyright>

<surfaceclusterde>
  <params frequency="10000"
          address="solu6bench001monitor"
          port="54445"
          addressHeartBeat="solu6bench001private"
          portHeartBeat="54445"
          sysCommandRemoteURL="https://localhost:5992/SysCommand">
  </params>

  <decisionEngineMgr>
    <decisionEngine enable="true"
                    groupName="SCDEGroup"
                    description="SCDE halfSite"
                    frequency="10000"
                    firstThinkingTime="45000">

      <!-- *******************************************************
      PEERS DECISION ENGINES nodes (Min 3 nodes)
      ** ******************************************************* -->
      <decisionEnginesPeers>
        <peer enable="true"
              description="HalfSite B"
              URL="https://solu6bench002private:54445/" login="admin" password="adminadmin"/>
        <peer enable="true"
              description="QuorumSite"
              URL="https://wilelbloneprivate:54445" login="admin" password="adminadmin"/>
      </decisionEnginesPeers>

      <!-- *******************************************************
      APPLICATIONS SERVICES AND ASSOCIATED SURFACE WORK FLOW SERVER
      ** ******************************************************* -->
      <healthCheckServicesPolicy description="Services switch policy"
                                 surfaceClusterWorkFlowLogin="admin"
                                 surfaceClusterWorkFlowPassword="adminadmin"
                                 waitTimeAfterNormalPrimer="180000"
                                 waitTimeAfterFlagBroken="60000"
                                 waitTimeBeforeTakeControl="180000"
                                 waitTimeAfterTakeControl="900000"
                                 applicationLostTime="30000">
        <failOverService enable="true"
                         description="HalfSite A.A"
                         surfaceClusterWorkFlowURL="https://wileubuntudbenchbackend:54444/">
          <healthCheck enable="true"
                       description="primary Application HalfSite A.A"
                       address="wileubuntudbenchbackend"
                       port="8080" uriPath="/" SSL="false"/>
        </failOverService>
        <failOverService enable="true"
                         description="HalfSite B.B"
                         surfaceClusterWorkFlowURL="https://roadubuntudbenchbackend:54444/">
          <healthCheck enable="true"
                       description="Secondary application HalfSite B.B"
                       address="roadubuntudbenchbackend"
                       port="8080" uriPath="/" SSL="false"/>
        </failOverService>
      </healthCheckServicesPolicy>

      <!-- *******************************************************
      PUBLIC HEALTH CHECK
      ** ******************************************************* -->
      <healthCheckPublicPolicy>
        <healthCheck address="192.168.43.104" port="22" description="wilecoyote"/>
        <healthCheck address="192.168.43.103" port="22" description="roadrunner"/>
        <healthCheck address="192.168.43.101" description="gundam"/>
      </healthCheckPublicPolicy>

      <!-- *******************************************************
      BACKEND HEALTH CHECK
      ** ******************************************************* -->
      <healthCheckBackendPolicy>
        <healthCheck address="192.168.45.104" port="22" description="wilecoyote"/>
        <healthCheck address="192.168.45.103" port="22" description="roadrunner"/>
        <healthCheck address="192.168.45.101" description="gundam"/>
        <healthCheck enable="true" address="wileubuntudbenchbackend" port="54444"
                     uriPath="/HealthCheck" SSL="true"
                     description="Surface Cluster Work Flow A.A"/>
        <healthCheck enable="true" address="roadubuntudbenchbackend" port="54444"
                     uriPath="/HealthCheck" SSL="true"
                     description="Surface Cluster Work Flow B.B"/>
      </healthCheckBackendPolicy>

    </decisionEngine>
  </decisionEngineMgr>

  <sysobserver>
    <service name="syslog" id="syslogsurfaceclusterde"/>
  </sysobserver>

</surfaceclusterde>
</serviceconf>

The setting of the first paragraph provides for the setting of the addresses for the management of the server and for the remote control.

Below is what is proposed in the distribution for the paragraph <params>:

  <params frequency="10000"
          address="___hostnameMonitor___"
          port="54445"
          addressHeartBeat="___hostnameHeartBeat___"
          portHeartBeat="54445"
          sysCommandRemoteURL="https://localhost:5992/SysCommand">
  </params>

frequency is the frequency of checking the state of the cluster.

address is the public address on which to perform control operations, while addressHeartBeat is a second address, normally belonging to the HeartBeat network, for transactions between peer nodes. The two addresses may coincide, and in this case it is necessary to change the port to allow coexistence.

Once the general parameters have been set, you move on to the application cluster configuration. The first paragraph to be set is certainly the paragraph relating to the identification of the cluster.

  <decisionEngineMgr>
    <decisionEngine enable="true"
                    groupName="SCDEGroup"
                    description="SCDE halfSite"
                    frequency="10000"
                    firstThinkingTime="45000"
                    applicationLostTime="30000">…

Each paragraph <decisionEngine> defines an Application Cluster. There may be more than one <decisionEngine>, and therefore more different application clusters managed at the same time.

enable=: default="false" UM=boolean

Enables or disables the interpretation of this paragraph in the instance.

groupName=: default value="SCDEGroup"

It is the name of the group of Decision Engines. It is very important because the decision engines contained in an instance are distinguished by this name.

description=: default value="description: groupName"

This is the description of this decision engine. It must be brief but comprehensive.

frequency=: default value="<params frequency>" UM=Milliseconds

It is the frequency of verification of status changes. If not specified, the frequency of the paragraph <params> is assumed.

firstThinkingTime=: default value="45000" UM=Milliseconds

It is the waiting time for initialisation and the first checks. Beyond this waiting time the decision engine will display a message. The decision engine does not however decide once this initial time is exceeded, but waits for the complete initialization of the checks.

applicationLostTime=: default value="30000" UM=Milliseconds

It is the waiting time from the moment the application under verification was declared in a down state. This value is very important because if this time is exceeded, the application is still not reachable and the quorum for the switch is verified, the recovery procedure will be triggered. 30" is the minimum to avoid false positives.

The paragraph <decisionEnginesPeers> delimits the information relating to the other two Decision Engines that contribute to the decisions.

   <decisionEnginesPeers>
     <peer enable="true"
           description="HalfSite ___A_or_B___"
           URL="https://___hostnameHeartBeat___:54445/"/>
     <peer enable="true"
           description="QuorumSite"
           URL="https://___hostnameHeartBeat___:54445"/>
   </decisionEnginesPeers>

The parameters are very simple and require only the information about the connection and the authentication.

enable=: default="false" UM=boolean

Enables or disables the interpretation of this paragraph in the instance.

URL=: default="" UM=W3C URL

It is the URL for connecting to the peer service through the heartbeat network.

description=: default="description peer: URL"

Brief but exhaustive description of the service.

The paragraph relating to application services is the heart of the cluster because it identifies the application services with detailed HealthCheck rules. This paragraph also includes the Work Flows that must be triggered for Fail-Over actions.

<!-- *******************************************************
  APPLICATIONS SERVICES AND ASSOCIATED SURFACE WORK FLOW SERVER
  ** ******************************************************* -->
<healthCheckServicesPolicy description="Services switch policy"
                           waitTimeAfterNormalPrimer="180000"
                           waitTimeAfterFlagBroken="60000"
                           waitTimeBeforeTakeControl="180000"
                           waitTimeAfterTakeControl="900000"
                           applicationLostTime="30000"
                           normalPrimerWorkFlow="normalPrimer"
                           gracefulShutdownWorkFlow="gracefulShutdown"
                           takeControlWorkFlow="takeControl">
  <failOverService enable="true"
                   description="HalfSite A.A"
                   surfaceClusterWorkFlowURL="https://___hostnameBackend_A_A___:54444/">
    <healthCheck enable="true"
                 description="primary application HalfSite A.A"
                 address="___hostnameBackend_A_a___"
                 port="80"
                 uriPath="/"
                 SSL="false"/>
  </failOverService>
  <failOverService enable="true"
                   description="HalfSite B.B"
                   surfaceClusterWorkFlowURL="https://___hostnameBackend_B_B___:54444/">
    <healthCheck enable="true"
                 description="Secondary application HalfSite B.B"
                 address="___hostnameBackend_B_b___"
                 port="80"
                 uriPath="/"
                 SSL="false"/>
  </failOverService>
</healthCheckServicesPolicy>

The parameters in the paragraph <healthCheckServicesPolicy> are related to its management. They are intuitive; the part taken from the Reference Guide document can be summarized as follows:

description=: default=""

General description of the service placed in high reliability.

waitTimeAfterNormalPrimer=: default="180000" UM=Milliseconds

It is the waiting time after starting the first available service (normal startup). Once this time has passed, the decision engine will begin to verify whether the service has reached the active state; otherwise the recovery procedures will be initiated. If the application comes up before this value, the Decision Engine checks immediately.

waitTimeAfterFlagBroken=: default="60000" UM=Milliseconds

It is the waiting time after setting the persistent flag, for propagation to the other peers.

waitTimeBeforeTakeControl=: default="180000" UM=Milliseconds

It is the waiting time before the take control. It serves to allow time for an eventual graceful shutdown of the previously active service, if the resource is still reachable.

waitTimeAfterTakeControl=: default="900000" (15') UM=Milliseconds

It is the waiting time after the take control. It serves to wait for the end of the complete take control before returning to check the activity status and then deciding whether it was successful, or retrying with another resource if available. If the application comes up before this value, the Decision Engine checks immediately. This time varies depending on the type of recovery and on the amount of data, if a database is present.

applicationLostTime=: default="<decisionEngine applicationLostTime>" UM=Milliseconds

It is the waiting time from the moment the application under verification was declared down, as described for the <decisionEngine> paragraph. If not specified here, the value of the <decisionEngine> paragraph is assumed.

normalPrimerWorkFlow=: default="normalPrimer" UM=Work Flow Name

It is the name of the Work Flow that will be triggered when the initial startup of the service is determined.

gracefulShutdownWorkFlow=: default="gracefulShutdown" UM=Work Flow Name

It is the name of the Work Flow that will be triggered when the down of the service is determined, immediately before performing the recovery action.

takeControlWorkFlow=: default="takeControl" UM=Work Flow Name

It is the name of the Work Flow that will be triggered after the start of the gracefulShutdownWorkFlow, to initiate the recovery action.
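As a worked example with the default values above: once the application is declared down, indicatively applicationLostTime (30 seconds) elapses before the engine considers it lost; the gracefulShutdown Work Flow is then attempted on the failing node and, after waitTimeBeforeTakeControl (180 seconds), takeControl is launched on the secondary node; the engine then allows up to waitTimeAfterTakeControl (900 seconds, i.e. 15 minutes) for the application to come back up before judging the take-over successful or retrying on another resource if available.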

<!-- *******************************************************
  APPLICATIONS SERVICES AND ASSOCIATED SURFACE WORK FLOW SERVER
  ** ******************************************************* -->
<healthCheckServicesPolicy description="Services switch policy"
                           waitTimeAfterNormalPrimer="180000"
                           waitTimeAfterFlagBroken="60000"
                           waitTimeBeforeTakeControl="180000"
                           waitTimeAfterTakeControl="900000"
                           applicationLostTime="30000"
                           normalPrimerWorkFlow="normalPrimer"
                           gracefulShutdownWorkFlow="gracefulShutdown"
                           takeControlWorkFlow="takeControl">
  <failOverService enable="true"
                   description="HalfSite A.A"
                   surfaceClusterWorkFlowURL="https://___hostnameBackend_A_A___:54444/">
    <healthCheck enable="true"
                 description="primary application HalfSite A.A"
                 address="___hostnameBackend_A_a___"
                 port="80"
                 uriPath="/"
                 SSL="false"/>
  </failOverService>
  <failOverService enable="true"
                   description="HalfSite B.B"
                   surfaceClusterWorkFlowURL="https://___hostnameBackend_B_B___:54444/">
    <healthCheck enable="true"
                 description="Secondary application HalfSite B.B"
                 address="___hostnameBackend_B_b___"
                 port="80"
                 uriPath="/"
                 SSL="false"/>
  </failOverService>
</healthCheckServicesPolicy>

<failOverService> identifies, in insertion sequence, the services and their servers. The order of insertion also identifies the priority of the services. The parameters proposed by the distribution template are few, but in reality many parameters typical of that application on that particular node/site can be set. Refer to the Reference Guide for a complete description.

Enable=:default=“false” mu=boolean

Enable or disable the interpretation of this paragraph in the instance.

Description=:default=“”

Detailed description of the service placed in high reliability. You recommend, as far as possible, use the nomenclature LBL®Surface Clsuter es.: HalfSite A.A rather than HalfSite B.B with the brief but comprehensive description of the service.

SurfaceClusterWorkFlowURL=:default=“https://”

The URL of the instance LBL®Commander Work Flow afferent to this  application service.

<!-- *******************************************************
  APPLICATIONS SERVICES AND ASSOCIATED SURFACE WORK FLOW SERVER
  ******************************************************* -->
<healthCheckServicesPolicy
    description="Services switch policy"
    WaitTimeAfterNormalPrimer="180000"
    WaitTimeAfterFlagBroken="60000"
    WaitTimeBeforeTakeControl="180000"
    WaitTimeAfterTakeControl="900000"
    ApplicationLostTime="30000"
    NormalPrimerWorkFlow="normalPrimer"
    GracefulShutdownWorkFlow="gracefulShutdown"
    TakeControlWorkFlow="takeControl">
  <failOverService enable="true"
      Description="HalfSite A.A"
      SurfaceClusterWorkFlowURL="https://___hostnameBackend_A_A___:54444/">
    <healthCheck enable="true"
        Description="primary application HalfSite A.A"
        Address="___hostnameBackend_A_a___"
        Port="80"
        UriPath="/"
        SSL="false"/>
  </failOverService>
  <failOverService enable="true"
      Description="HalfSite B.B"
      SurfaceClusterWorkFlowURL="https://___hostnameBackend_B_B___:54444/">
    <healthCheck enable="true"
        Description="Secondary application HalfSite B.B"
        Address="___hostnameBackend_B_b___"
        Port="80"
        UriPath="/"
        SSL="false"/>
  </failOverService>
</healthCheckServicesPolicy>

The parameters of the <healthCheck> paragraph indicate how the applications are health checked. There can be multiple <healthCheck> paragraphs, and the failure of even one of these health checks determines the verification status for reaching the quorum for switch decisions. Obviously, if at the next check, and within the limits of the applicationLostTime parameter, the health checks return positive again, all the Decision Engines return to normal monitoring and the critical state is cleared.

Enable=:default="false"

If true, this paragraph is active. If false, the paragraph is not taken into consideration.

Description=:default=""

Brief description of the health check.

Address=:default=""

It is the address on which the health check is performed.

Port=:default="0"

It is the port on which the health check service responds. If <=0, an ICMP check will be performed.

SSL=:default="false"

If set to true, the health check of the service is performed through an SSL (HTTPS) connection.

UriPath=:default=""

It is the URI path on which the health check service responds. If not present, a TCP connection health check will be performed.
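For example (indicative sketches only, using the attributes described above and the placeholder addresses of the template), a health check with Port set to 0 performs an ICMP check, while omitting the UriPath results in a simple TCP connection check:

  <!-- Indicative example only: ICMP reachability check (Port <= 0) -->
  <healthCheck enable="true"
      Description="ICMP check (example)"
      Address="___hostnameBackend_A_a___"
      Port="0"
      SSL="false"/>

  <!-- Indicative example only: plain TCP connection check (no UriPath) -->
  <healthCheck enable="true"
      Description="TCP connect check (example)"
      Address="___hostnameBackend_A_a___"
      Port="80"
      UriPath=""
      SSL="false"/>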

LBL®Commander Decision Engine Start/Stop/Manage Status

LBL®Commander Decision Engine is a decision engine designed to require very little maintenance and above all to have a very low operational impact. The management of operations is entirely entrusted to the Web interfaces, usable through any properly configured LBL®Monitor instance. After the context parameters have been set, at startup LBL®Commander Decision Engine performs the initialization of the module that checks the status conditions of the application and the environment.

It is possible to verify these states through the Web interface:

In this phase the application and environment checks are started in parallel. Once the checks are completed, the status of the LBL®Commander Decision Engine node changes, giving an exact picture of the state of the environment on which the application control has been set.

As soon as the initialization is finished, if the whole environment has been set up correctly and no faults have been found, at a subsequent refresh the following situation will be displayed:

No quadrant of the status must be colored red or yellow. Red indicates an important fault, while yellow indicates a fault that does not compromise decision-making capability.

To check whether the settings are correct, it is sufficient to follow the hyperlink, as in this case: groupName->SCDEGroup

This view highlights on which node the application is active. In this case the primary application site (A.A) is up and running properly, while the secondary site does not have the application active. This is obviously the normal condition; if the application were also active on the secondary site (B.B), we would be in a situation of application Split Brain. To simulate this condition, we try to start the Apache Tomcat instance on the secondary site as well…

On the LBL®Commander Work Flow of the secondary site we perform the manual start of Apache Tomcat…

Immediately, all the LBL®Commander Decision Engine instances detect the fault, reporting it in the logs and through e-mail and HTTP POST. In this situation LBL®Commander Decision Engine also places itself in a state of inconsistency and therefore will not take a decision.

LBL®Commander Decision Engine has been designed to work in geographical contexts and maintains a conservative decision-making behavior, identifying the situations in which it is not possible to take autonomous decisions for lack of information, or identifying situations of inconsistency.

Always through the Web interface, we operate on LBL®Commander Work Flow to restore the situation and clear the split brain status signal:

Execute the Work Flow "gracefulShutdown" on the secondary site…

…The LBL®Commander Decision Engine instances detect that a consistent situation has been restored…

Now try to generate a problem by performing a forced stop of the Apache Tomcat instance on the primary site…

Immediately the LBL®Commander Decision Engine instances will detect the problem…


Regulated by lease times, fully configurable in relation to the application and operating environment, the LBL®Commander Decision Engine instances verify whether the abnormal condition is an actual error event or was caused by a sporadic event. Once the persistence of the error event and the conditions for taking "autonomous" decisions have been verified, the LBL®Commander Decision Engine instances permanently place the node/site that is the source of the problem in a "Failure" state…

All decisions taken by the LBL®Commander Decision Engine instances are part of a sophisticated decision engine with parallel detection of the events coming from the surrounding environment; at every moment the actions take place in an autonomous but consistent way on all the decision-making nodes…

…Once the surviving decision-making nodes reach consistency of the information relating to the explicit declaration of application "down", they autonomously and concurrently seek to restore an operational application situation…

…The operations that follow the declaration of "down" of the primary site are a last attempt to carry out a controlled shutdown on the primary site, to try, if possible, to minimize the problems due to a forced take-over.

…After the controlled shutdown of the applications on the primary site, LBL®Commander Decision Engine waits a reasonable time so that the node/primary site has been able to carry out its controlled shutdown operations…

…And in any case, once the time limit is exceeded, the take-over of the services on the node/secondary site is forced… followed by the wait for the startup of the application on the secondary site.

Return to normal operation, obviously with the indication of the error status of the secondary site…

LBL®Commander Decision Engine guided application restore

LBL®Commander Decision Engine is a decision engine designed to require very little maintenance and above all to have a very low operational impact. The management of operations is entirely entrusted to the Web interfaces, usable through any properly configured LBL®Monitor instance. In the event of an application failure, LBL®Commander Decision Engine uses persistent indicators, distributed across multiple nodes, as the trace of the event.

To restore the situation to normal operation, with the main site active and the secondary site in replication, it is sufficient to operate through the Web interfaces of LBL®Commander Decision Engine and LBL®Commander Work Flow.

If, for example, you wish to restore the operation of the Apache Tomcat instance on the main site, it is sufficient to perform the logical stop of the decision engines for that group of applications…

The stop of the decision engines does not interfere with the LBL®Commander Work Flow instances, and therefore it is possible to stop the decision engines at any time.

With the decision engines stopped, it is possible to manage the recovery activities in a guided manner through the LBL®Commander Work Flow instances, without having to intervene directly on the nodes, by activating the procedures (Work Flows) provided for recovery.

Having to manually restore the initial situation, with Apache Tomcat running on the node/main site, once the decision engines have been logically stopped it is sufficient to go to the secondary site and carry out the operations in reverse: gracefulShutdown of the application on the secondary site and then takeControl on the main site.

Before carrying out the operations on the application nodes/sites, make sure that all the LBL®Commander Decision Engine instances related to the application group, SCDEGroup in this case, are in the stopped state:

At this point, once you are satisfied that the decision engines for the application group have been placed in the stopped state, it will be possible to proceed with the restoration of the initial condition.

From this moment onwards the operation has been restored from an operational point of view.

There remains only to return the decision engines to a state useful for the next failure event… Returning to the LBL®Commander Decision Engine instances stopped at the beginning of the recovery process, you will notice that an X button has been made available on the right. This button is used to delete all the persistent failure indication flags on the nodes/sites. Run the cleanup of the persistence flags on all LBL®Commander Decision Engine nodes.

The situation on all the nodes should now appear as follows: 0 BrokenApps and the X button no longer enabled. Caution: make sure that all the decision engines have been cleaned of the persistence flags. The persistence flags are automatically propagated to all nodes, so if even one of the three LBL®Commander Decision Engine nodes retained a flag, all the other nodes would inherit it at startup, causing a new "takeControl" on the secondary site.

Once you are satisfied that all the LBL®Commander Decision Engine nodes are aligned from the point of view of the persistence flags (zero), it is sufficient to restart them, without a particular boot order…


After the start of the first decision engine you will have an error condition, determined by the fact that it is the only decision engine operational…

Already from this start, however, the applications running on the main site are detected…

Then, starting the second decision engine, you will restore the situation in which the engines are able to take autonomous decisions…

The yellow color of the heartbeat status indicates that not all the decision engines are operational, but that already from this moment the two instances can take autonomous decisions…

The start of the third decision engine as well returns everything to the initial state…

LBL®Commander Restart phase

From release 7.1, LBL®Commander Decision Engine supports an intermediate phase between the detection of the failure and the switch of operations to another node/site. This optional intermediate step performs a restart to check whether, with an intervention less invasive than the switch-over, it is possible to restore the original operation.

This step was introduced after verifying in the field that very often a restart of the application server or of the database, or of both, is sufficient to resolve the problems that had manifested themselves.

The implementation of this phase involves, first of all, the development of the restart workflows and then their configuration on the decision engines.

  • Development of the Restart Workflow

The development of the Restart Workflow consists of carrying out a "graceful" stop of the operating environment, followed by a start, at the end of which the overall application operativity is verified again. This step is marked by two important considerations: the first is certainly that the restart must not already have been carried out at a relatively close moment, so as not to trigger chains of restarts; the other is that a fail-over must still be performed if all the restart procedures were unsuccessful.

The solution for not triggering chains of restarts is very easy to implement, because in a step of LBL®Commander Work Flow it is possible to set a time delay before the step is carried out. With this parameter it is sufficient to insert a wait in the last step of the restart Workflow in order to avoid multiple restarts. The calculation of the waiting time before making the workflow available again for another possible restart command must be carried out as follows:
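(The original formula is not reproduced here; the following is only a schematic restatement based on the description below, and the terms in parentheses are indicative, since the exact values depend on the lease times configured on the decision engines.)

  waitBeforeExecute (last step of the restart Workflow)  >
      time needed by LBL(r)Commander Decision Engine to declare the application
      down and promote the fail-over on the other node/site
      (indicatively: applicationLostTimeBeforeRestart + ApplicationLostTime
       + WaitTimeBeforeTakeControl, plus a safety margin)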

The formula is obviously indicative and tends to highlight that the total waiting time in the last step of a restart workflow should be greater than the waiting time for the promotion of the fail-over on the other node/site. Using this simple expedient you are sure not to incur a Split Brain caused by the simultaneous start of the applications on two sites, and you are also sure not to incur multiple chained restarts of an application that is by now compromised.

You can lengthen the total waiting time in the last step at will. For example, if the application that you want to verify normally requires a restart every two days, it can be inferred that a further restart, caused by a malfunction detected within 6 hours of the previous restart, can be considered a problem at the "system" level and not at the application level.

Upon detection of a further failure event, and the consequent restart command issued by the LBL®Commander Decision Engine server within the 6-hour window from the previous restart, the Workflow, being still waiting in its last step, will not be performed. Once the lease time set in LBL®Commander Decision Engine for the declaration of application failure has expired, the fail-over to the other node/site will be promoted.

To develop the restart Workflow it is probably sufficient to reuse part of the steps of the gracefulShutdown workflow and part of the steps of the normalPrimer Workflow, highlighting once more the reusability of the script modules previously created.

In the last step, normally normalEnd or abEnd, it will be sufficient to insert the parameter waitBeforeExecute="21600000".
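As a purely indicative sketch (the actual element and attribute names of the Work Flow steps must be verified against the Reference Guide; 21600000 ms corresponds to 6 hours, i.e. 6 x 60 x 60 x 1000):

  <!-- Hypothetical fragment of the restart Workflow: only the last step is shown,
       with the wait that prevents multiple chained restarts -->
  <step name="normalEnd"
      waitBeforeExecute="21600000"/>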

  • Setting the Restart in LBL®Commander Decision Engine

Setting the restart phase in the parameters of the decision engines is very simple, because it only requires adding two new parameters, applicationLostTimeBeforeRestart and restartWorkFlow, which are respectively the lease time between the detection of the failure and the restart, and the name of the restart Workflow to invoke. Below is a fragment of the surfaceclusterdeparameters.xml file (for details see the Reference Guide document):
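The original fragment is not reproduced here; the following is only an indicative sketch that reuses the <healthCheckServicesPolicy> paragraph shown earlier, with example values for the two new parameters (attribute casing, values and exact placement must be verified against the Reference Guide):

  <healthCheckServicesPolicy
      description="Services switch policy"
      WaitTimeAfterNormalPrimer="180000"
      WaitTimeAfterFlagBroken="60000"
      WaitTimeBeforeTakeControl="180000"
      WaitTimeAfterTakeControl="900000"
      ApplicationLostTime="30000"
      applicationLostTimeBeforeRestart="60000"      <==== example value (lease time before restart)
      restartWorkFlow="restart"                     <==== example name of the restart Workflow
      NormalPrimerWorkFlow="normalPrimer"
      GracefulShutdownWorkFlow="gracefulShutdown"
      TakeControlWorkFlow="takeControl">
    <!-- <failOverService> and <healthCheck> paragraphs as shown earlier -->
  </healthCheckServicesPolicy>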

LBL®Commander Decision Engine Split Brain Assassin

This service manages, in distributed and joined cluster environments (stretch cluster), any possibility of split brain.

LBL®SCDE Split Brain Assassin has very few parameters and can therefore be used with great simplicity. Simplicity that is required in these cases.

Normally LBL®SCDE Split Brain Assassin is configured on the nodes that host LBL®ADC instances. This is because the main purpose of LBL®SCDE Split Brain Assassin is to notify the ADC layer to exclude access to some services, up to excluding all traffic, eliminating any possible interaction with the backend.

The LBL®SCDE Split Brain Assassin service should be activated last, after the balancing layer, LBL®Commander Decision Engine and LBL®Commander Work Flow have been installed and configured, and everything has been tested together. Only then will the possibility of a geographical Split Brain be assessed (see the LBL®White Paper for a comprehensive explanation), proceeding with the installation and startup of LBL®SCDE Split Brain Assassin.

The necessary steps for configuring LBL®SCDE Split Brain Assassin are summarized as follows:

1- Verify that the usage license of LBL®Commander Decision Engine has been set

   (LBL_HOME)/lib/confSurfaceClusterDE/license.xml

2- Set the service configuration file

  (LBL_HOME)/lib/confSurfaceClusterDE/splitbrainassassin.xml

There are two main paragraphs; the first is, as usual, the <params> paragraph. This paragraph defines the generic behavior, such as the frequency of the health checks and some default parameters.

The first thing to identify is the targets of the health checks. These targets are easily identified in the two LBL®SCDE nodes, the joint node and the quorum node. Only in the absence of both targets (or of all targets, if there are more than two) will LBL®SCDE Split Brain Assassin notify the critical event to LBL®ADC by setting the notification files.

LBL®SCDE Split Brain Assassin can be set to notify LBL®ADC of a complete disconnection of all services, through the elimination of the virtual addresses, or to exclude only the part of the services affected by the event.

Example of configuration file: splitbrainassassin.xml

<serviceconf>
  <splitbrainassassin>
    <params
        Frequency="10000"
        CreateConnectionTimeOut="4000"
        NumRetryConnection="3">
    </params>
    <decisionEnginesPeers>
      <peer enable="true"
          Description="Sys 001"
          URL="https://peerOne:54445/"                                  <==== Change
          HealthCheckUriPath="/HealthCheck?decisionEngine=SCDEGroup"/>
      <peer enable="true"
          Description="Sys 002"
          URL="https://peerTwo:54445/"                                  <==== Change
          HealthCheckUriPath="/HealthCheck?decisionEngine=SCDEGroup"/>
      <notification fileName="lib/notificationDir/outOfOrder.systemsMonitorGroup" description="to vips"/>
      <notification fileName="lib/notificationDir/outOfOrder.gr1" enable="false"/>
      <notification fileName="lib/notificationDir/outOfOrder.gr2"/>
      <notification fileName="lib/notificationDir/outOfOrder.gr3"/>
      <notification fileName="lib/notificationDir/outOfOrder.gr4"/>
    </decisionEnginesPeers>
    <sysobserver>
      <service name="syslog" id="syslogsplitbrainassassin"/>
    </sysobserver>
  </splitbrainassassin>
</serviceconf>

3- Start the process and check the logs:

4- Set the automatic start of the service

   (LBL_HOME)/lib/confMonitor/A03_LBLGoSurfaceClusterWFRemoteBatch.xml

<A03_LBLGoSurfaceClusterDESplitBrainAssassin>
  <!--
    Start: automatic (default), manual
  -->
  <process enable="true"
      Description="LBL(r)Commander Split Brain Assassin"
      Start="automatic"                        <===== change from "manual" to "automatic"
      NumberTryStartOnFailure="-1"
      WaitBeforeKill="60000"
      WaitBeforeKillOnFailure="10000">
    <start osName="Windows">
      <env>CLASSPATH=.;lib;lib\LBLADC.jar</env>

5- Perform the same operation on the other balancing nodes…

Note:

In order to have a more reliable test, you can add health checks on other services. For this purpose, the example below also checks a health check service offered by LBL®Commander Work Flow Remote Batch.

<peer enable="true"
    Description="Sys 002"
    URL="https://peerTwo:54445/"
    HealthCheckUriPath="/HealthCheck?decisionEngine=SCDEGroup"/>

<peer enable="true"
    Description="Sys 002"
    URL="https://peerTwo:5994/"                              <==== Change
    HealthCheckUriPath="/RemoteBatch/startBartch.xml"/>      <=== change

WARNING: Once LBL®SCDE Split Brain Assassin has set the notification files, they are no longer deleted automatically. To delete the notification files it is necessary to turn off the LBL®SCDE Split Brain Assassin process through the WebConsole. This behavior is used to guarantee the consistency of the routing even after a possible restoration of the network.