MonALISA Extensions Guide


Chapter 1. Monitoring Modules Base

Monitoring information can be collected in several ways, using dynamically loadable modules.

It is possible to collect information using:

  • SNMP daemons;

  • Ganglia;

  • LSF or PBS batch queueing systems;

  • Local or remote procedures to read /proc files;

  • User modules based on dedicated scripts or procedures.

1. Local System Monitoring

1.1. Kernel /proc files

1.1.1. Description

The monProc* modules use the local /proc files to collect information about the CPU, load and IO. These modules are mainly designed to be used on the node where the MonALISA service itself is running (typically the master node of a cluster), but they may also be used on remote systems via rsh or ssh.

  • monProcLoad
  • monProcStat
  • monProcIO
  • monDiskIO

1.1.2. Activation

 *Master   
 >ramen gateway    
 monProcLoad%30   
 monProcIO%30   
 monProcStat%30   

The first line (*Master) defines a Functional Unit (or Cluster). The second line (>ramen gateway) adds a node to this Functional Unit class. In this case ramen is a host name; optionally, the user may add an alias (gateway) for this name.

The lines:

 monProcLoad%30  
 monProcIO%30   
 monProcStat%30

define three monitoring modules to be used on the node "ramen". These measurements are done every 30s. The monProc* modules use the local /proc files to collect information about the CPU, load and IO.

1.2. monStatusCmd

1.2.1. Description

This module can be used to run a command / script to get the status of one or several services. It expects the output of the command to look like this:

 Service1  Status  2 Memory  23
 Service2 Status  4 Memory  238
      ...
 SCRIPTRESULT status 3 Message error message
    

The first word is the name of the service. Then, on the same line, follows a variable number of (parameter name, value) pairs. Each value can be a number or a string; if it can be interpreted as a number it will produce a Result, otherwise an eResult will be created. The service name, parameter names and values must be TAB-separated. The last line must consist of only the word DONE to confirm the normal termination of the script. Alternatively, you can output a line starting with SCRIPTRESULT followed by some parameters describing the end status; such a line is treated like a DONE line.
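A minimal sketch of such a status script is shown below. The service names, values and message are hypothetical; printf is used so that the fields are separated by real TAB characters:

 #!/bin/sh
 # Hypothetical status script for monStatusCmd.
 # Each line: service name, then (parameter, value) pairs, TAB-separated.
 printf 'Service1\tStatus\t2\tMemory\t23\n'
 printf 'Service2\tStatus\t4\tMemory\t238\n'
 # Confirm normal termination (a line with only the word DONE would also work):
 printf 'SCRIPTRESULT\tstatus\t0\tMessage\tall services checked\n'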

1.2.2. Module Activation

The module can be put in myFarm.conf like this:

*ClusterName{monStatusCmd, localhost, "full command to be executed"}%RunInterval

or, if you want to change the default timeout from 120 sec to 300 sec:

*ClusterName{monStatusCmd, localhost, "full command to be executed, timeOut=300"}%RunInterval
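For example, assuming a hypothetical wrapper script installed at /opt/ml/scripts/check_services.sh and a cluster named Services, the activation line could be:

 *Services{monStatusCmd, localhost, "/opt/ml/scripts/check_services.sh, timeOut=300"}%60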

2. Cluster/Farm Monitoring

2.1. Ganglia

Ganglia is a well known monitoring system which uses a multicast messaging system to collect system information from large clusters. MonALISA can be easily interfaced with Ganglia, either through the multicast messaging system or through the gmon interface, which retrieves the cluster monitoring information in XML format. The MonALISA distribution provides modules for both possibilities. If the MonALISA service runs in the multicast range of the nodes sending monitoring data, we suggest using the Ganglia module which is a multicast listener. The code for interfacing MonALISA with Ganglia using gmon is in Service/usr_code/GangliaMod, and the code using the multicast messages is in Service/usr_code/GangliaMCAST. The user may modify these modules. Please look at the service configuration examples to see how these modules may be used.

2.1.1. Monitoring a Farm using Ganglia gmon module

The configuration file should look like this:

Example 1.1. Farm configuration with Ganglia gmon

 *PN_popcrn {IGanglia, popcrn01.fnal.gov, 8649}%30
     

The line:

 *PN_popcrn {IGanglia, popcrn01.fnal.gov, 8649}%30
    

defines a cluster named "PN_popcrn" for which the entire information is provided by the IGanglia module. This module uses a telnet-style TCP connection to get XML output from the Ganglia gmon. The request will be sent to node popcrn01.fnal.gov on port 8649.

All the nodes which report to Ganglia will be part of this cluster unit, and for all of them the parameters selected in the IGanglia module will be recorded. This measurement will be done every 30s.
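Since gmon simply dumps its XML state to any client that connects, a quick way to verify that the node and port (here the default 8649, as in the example above) are reachable is:

 telnet popcrn01.fnal.gov 8649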

The Ganglia module is located in Service/usr_code/GangliaMod. The user may edit the file and customize it. This module is NOT in the MonALISA jar files; to use it, the user MUST add the path to this module to the MonALISA loader. This can be done in ml.properties by adding this line:

 lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/GangliaMod/
    

2.1.2. Monitoring a Farm using Ganglia Multicast module

To get copies of the monitoring data sent by the nodes running the Ganglia daemons (using a multicast port), the system on which MonALISA is running must be in multicast range for these messages.

Adding a line such as:

 *PN_cit{monMcastGanglia, tier2, "GangliaMcastAddress=239.2.11.71; GangliaMcastPort=8649"}

to the configuration file will make the service use the Ganglia multicast module to listen to all the monitoring data and then select certain values to be recorded into MonALISA. The service will automatically create a configuration for all the nodes which report data in this way.

PN_cit is the name of the cluster of processing nodes. It is important for the cluster name of processing nodes to contain the "PN" string; this is used by farm filters to report global views for the farms.

The tier2 is the name of the system corresponding to the real IP address on which this MonALISA service is running. The second parameter defines the multicast address and port used by Ganglia.

The GangliaMCAST module is located in Service/usr_code/GangliaMCAST. The user may edit the file and customize it. This module is NOT in the MonALISA jar files; to use it, the user MUST add the path to this module to the MonALISA loader. This can be done in ml.properties by adding this line:

      lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/GangliaMCAST/ 
    

2.2. Monitoring a Farm using snmp

MonALISA provides snmp modules to collect:

  • IO traffic from nodes and network elements

  • CPU usage

  • System Load

  • Disk IO traffic

Here are the OIDs that must be "exported" by the snmpd daemon in order to allow the various dedicated MonALISA snmp modules to collect the data:

 snmp_IO 
 Incoming / outgoing network traffic:
 IN: .1.3.6.1.2.1.2.2.1.10 
 OUT: .1.3.6.1.2.1.2.2.1.16
 
 snmp_Load 
 Load5, Load10 and Load15:
 .1.3.6.1.4.1.2021.10.1.3
 
 snmp_CPU 
 CPU_usr, CPU_nice and CPU_idle:
 .1.3.6.1.4.1.2021.11
 
 snmp_MEM 
 MEM_free, Swap_MEM_Free
 .1.3.6.1.4.1.2021.4
 
 snmp_Disk 
 FreeDSK, UsedDsk:
 .1.3.6.1.4.1.2021.9
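Before activating the modules, it is worth checking that the snmpd daemon on a node actually exports these subtrees. Assuming the default "public" community, the load OIDs, for example, can be queried like this:

 snmpwalk -v1 -c public NodeIP .1.3.6.1.4.1.2021.10.1.3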

The service configuration file (i.e. Service/myFarm/myFarm.conf) should look like this:

Example 1.2. Farm configuration with SNMP

 *Master   
 >citgrid3.cacr.caltech.edu citgrid3   
 monProcLoad%30   
 monProcStat%30   
 monProcIO%30
 
 *PN_CIT
 >c0-0   
 snmp_Load%30   
 snmp_IO%30   
 snmp_CPU%30   
 >c0-1   
 snmp_Load%30   
 snmp_IO%30   
 snmp_CPU%30   
 >c0-2   
 snmp_Load%30   
 snmp_IO%30   
 snmp_CPU%30   
 >c0-3  
 snmp_Load%30   
 snmp_IO%30   
 snmp_CPU%30

The first line ( *Master ) defines a Functional Unit (or Cluster). The second line ( >citgrid3.cacr.caltech.edu citgrid3 ) adds a node in this Functional Unit class, with an optional alias. The lines:

 monProcLoad%30   
 monProcIO%30   
 monProcStat%30

define three monitoring modules to be used on the node "citgrid3". These measurements are done periodically, every 30s. The monProc* modules use the local /proc files to collect information about the CPU, load and IO; in this case citgrid3 is the master node of the cluster, where the MonALISA service itself is running, so simple modules reading the /proc files are used to collect data.

 *PN_CIT

defines a new cluster name. This one is for a set of processing nodes used by the site. The string "PN" in the name is necessary if the user wants to automatically use filters that generate global views for all these processing units.

The cluster definition then lists the nodes in the cluster and, for each node, the modules used for getting monitoring information from it. For each module a repetition time is defined (%30), meaning that the module is executed once every 30s. Defining the repetition time is optional; the default value is 30s.

2.3. Monitoring Applications

MonALISA can monitor external applications using the ApMon API. In order to configure MonALISA to listen on UDP port 8884 for incoming datagrams (XDR encoded, using ApMon) you should add the following line in your config file:

 ^monXDRUDP{ListenPort=8884}%30 

The Clusters, Nodes and Parameters are dynamically created in MonALISA's configuration tree every time a new one is received. It is also possible to dynamically remove "unused" Clusters/Nodes/Parameters if no datagrams match them for a period of time. The timeouts are given in seconds:

 ^monXDRUDP{ParamTimeout=10800,NodeTimeout=10800,ClusterTimeout=86400,ListenPort=8884}%30 

In the example above, the parameters and the nodes are automatically removed from the ML configuration tree if no data is received for 3 hours (10800 seconds). The Cluster is removed after one day (24 hours - 86400 seconds).

For further information on how to send data into MonALISA please see the ApMon API documentation.

3. Grid Monitoring

3.1. VO_JOBS

3.1.1. Description

The OsgVoJobs module collects information from different queue managers in order to obtain accounting statistics for VOs. The current version is able to work with Condor, PBS, LSF and SGE; if there are several queue managers on a cluster, the values obtained from them are summed.

The module parses the output of some specific commands that give information about the current jobs (like condor_q from Condor, qstat from PBS etc.) and produces results for each job (CPU time consumed, run time, the amount of memory etc.). In the case of Condor, the history file is also read in order to obtain more detailed information.

These results are then added to the statistics made per VO; the association between the Unix account from which a job is run and the VO to which the job belongs is made based on a grid map file which specifies the corresponding VO for each account.
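For illustration, such a map file typically associates one Unix account with one VO per line; the accounts and VO names below are purely hypothetical:

 # account   VO
 uscms01     uscms
 usatlas1    usatlas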

3.1.2. Results Provided by the Module

The module provides two categories of parameters: parameters specific to a single job and parameters for a VO.

Job Parameters: The job parameters provided by this module are:

  • CPUTime - the CPU time consumed so far by the job, in seconds (available in Condor, PBS, LSF, SGE)
  • RunTime - wall clock time, in minutes (available in Condor, LSF)
  • WallClockTime - wall clock time, in seconds (available in Condor, LSF and in PBS if PbsQuickMode is disabled)
  • Size - the size of the job, in MB (available in Condor, LSF, SGE)
  • DiskUsage - the disk usage of the job, in MB (available in Condor)

VO Parameters:

There are two categories of VO parameters: parameters that represent values obtained in the last time interval (between the previous run of the module and the current one) and parameters that represent rates (calculated as the difference between the current value of a parameter and the value obtained at the previous run, divided by the length of the time interval between runs).

The parameters that represent values obtained in the last time interval are:

  • RunningJobs - the number of running jobs owned by the VO
  • IdleJobs
  • HeldJobs
  • UnknownJobs
  • TotalJobs
  • SubmittedJobs - the number of jobs submitted in the last time interval
  • FinishedJobs
  • FinishedJobs_Success - the number of jobs finished successfully (with 0 as exit status); this parameter is provided only if the parsing of the Condor history file / PBS accounting logs is enabled.
  • FinishedJobs_Error - the number of jobs finished with error (with non-zero exit status); this parameter is provided only if the parsing of the Condor history file / PBS accounting logs is enabled.
  • CPUTime - CPU time in seconds (sum for all the VO's jobs)
  • CPUTimeCondorHist - the CPU time for Condor jobs, obtained from the history file
  • RunTime - wall clock time in minutes
  • JobsSize - the size of the jobs in MB
  • DiskUsage - disk usage for the VO, in MB
  • VO_Status - this is a flag that shows whether the VO currently has jobs (it is 0 if there are jobs and 1 otherwise).

These parameters are grouped under the main module's cluster (which is usually named "osgVO_JOBS" - see the section on module activation) and are reported for each VO, if the VO currently has jobs. If the VO does not have any jobs, only the VO_Status parameter will be reported.

The parameters that represent rates are:

  • SubmittedJobs_R - rate for the SubmittedJobs parameter
  • FinishedJobs_R
  • RunTime_R
  • CPUTime_R

These parameters are grouped in a Rates cluster, usually named "osgVO_JOBS_Rates" (instead of "osgVO_JOBS" there will be the main module's cluster name, if it is different), and are only reported for the VOs that currently have jobs.

There are also some "total" parameters , which represent the sum of the parameters above for all the VOs. Under the Totals cluster (usually named "osgVO_JOBS_Totals") there are also some nodes that give general information:

  • Status - indicates the status of the module (0 if the execution was correct and non-zero in case of error).
  • ExecTime_<manager_name> - the execution time, in ms, for the job manager command
  • TotalProcessingTime - the total amount of time, in ms, needed for the module's execution.
  • <manager_name> - the job manager's version

3.1.3. Module Activation

In order to use the VO accounting modules, you should have MonALISA 1.2.38 or newer. If you have the OSG distribution, it is necessary to source two scripts:

. /OSG/setup.sh
. /OSG/MonaLisa/Service/CMD/ml_env
      

(replace "/OSG" with the path to your OSG directory)

To compile, just run the "comp" script from the modules' directory:

./comp

Compiling is only necessary if you run the version of the module placed in the usr_code/ directory (Note: see the "Pluggable Components" section for more details on running modules from the usr_code directory).

When the module is run, there are some environment variables that should be set, which indicate the locations of the available queue managers; a sketch of how to set them is shown after the list below. Usually, the names of the variables are of the form <JOB_MANAGER>_LOCATION.

  • for PBS: if you have PBS, you should set the PBS_LOCATION variable; this variable should be set such that the path to the qstat command is ${PBS_LOCATION}/bin/qstat.
  • for Condor: if you have Condor, you should set the CONDOR_LOCATION variable; this variable should be set such that the path to the condor_q command is ${CONDOR_LOCATION}/bin/condor_q. The module also parses the Condor history file(s) - see the paragraph below, "Settings for the Condor history files".
  • for LSF: if you have LSF, you should set the LSF_LOCATION variable; this variable should be set such that the path to the bjobs command is ${LSF_LOCATION}/bin/bjobs.
  • for SGE: if you have SGE, you should set the SGE_LOCATION variable; this variable should be set such that the path to the qstat command is ${SGE_LOCATION}/bin/glinux/qstat.
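A minimal sketch of setting these variables by hand; all the install paths below are hypothetical and must be adjusted to your site:

 export PBS_LOCATION=/usr/local/pbs      # so that ${PBS_LOCATION}/bin/qstat exists
 export CONDOR_LOCATION=/opt/condor      # so that ${CONDOR_LOCATION}/bin/condor_q exists
 export LSF_LOCATION=/usr/local/lsf      # so that ${LSF_LOCATION}/bin/bjobs exists
 export SGE_LOCATION=/opt/sge            # so that ${SGE_LOCATION}/bin/glinux/qstat exists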

If you have the OSG distribution and you sourced the OSG/setup.sh script, all the needed variables are already set and it is not necessary to set any other environment variables. However, it may be necessary to specify the location of the Condor history files, as explained in the following paragraph.

Settings for the Condor history files:

  • If you want the module to gather data only from the Condor submit server on the local machine (i.e., the module is not configured with the parameters CondorUseGlobal or Server - see the following paragraph for parameter configuration), there is a single history file. The file is assumed to be in the default location, which is spool/history under the directory known in Condor as LOCAL_DIR. To find out which directory LOCAL_DIR is, use the command "condor_config_val LOCAL_DIR" (see the sketch after this list). If the Condor history file is in another location, you must add the CondorHistoryFile parameter to the module, in order to specify the location (see the following paragraph for parameter configuration).
  • If you configure the module to collect data from multiple submit servers (by adding the parameters CondorUseGlobal or Server), there will be multiple history files, one for each submit server. If there is a shared file system in the Condor pool, the history files are available, but you must specify the location of each file, by adding CondorHistoryFile parameters. If the CondorHistoryFile parameters are not specified or if there is no shared file system, history information will not be collected.
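For example, one might first check the default history location and then, if it differs, point the module at it explicitly; the path below is hypothetical:

 condor_config_val LOCAL_DIR
 # then, in the farm configuration file:
 *osgVO_JOBS{OsgVoJobs, localhost, CondorHistoryFile=/condor/local/spool/history}%120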

To enable the module you should add to the farm configuration file a line of the following form:

*<cluster_name>{OsgVoJobs, localhost [,arguments]}%<time_interval>

where cluster_name is the main module's cluster name; it is recommended that the cluster name be "osgVO_JOBS". If you use the module included in the MonALISA service and not the one from usr_code, you should replace "OsgVoJobs" with "monOsgVoJobs". The possible arguments for the module are:

  • doNotPublishJobInfo - the module will not produce results for each running job, but only VO statistics
  • mapfile=<mapfile> - the location of the mapfile which contains the associations between user accounts and VOs. By default it is considered to be in ${MONALISA_HOME}/../monitoring/grid3-user-vo-map.txt.
  • CheckCmdExitStatus = ON | OFF (default is ON) - flag that specifies whether the module should verify the exit status of the commands that it executes. If this is enabled, the commands' output is only taken into account if the exit status is 0.
  • CondorUseGlobal - if Condor is available, the information will be collected from all the submit machines in the pool (i.e., from all the machines on which there are condor_schedd daemons running). This is done by using the "-g" option for the "condor_q" command.
  • Server=<hostname> - if Condor is available, the information will be collected from the submit machine that has the specified hostname. This is done by using the "-name" option for the "condor_q" command. The "Server" argument may appear more than once, to specify multiple submit machines.
  • CondorFormatOption = ON | OFF | ALTERNATIVE (the default is ALTERNATIVE) - this argument specifies whether the condor_q command should be used with the -format option. If the argument's value is ALTERNATIVE, the -format option is used only if the output of the regular condor_q command cannot be parsed correctly.
  • CondorConstraints = <constraints> - with this argument you can specify, for Condor, a constraint expression that will be used with condor_q (for example, CondorConstraints = JobUniverse==1). Multiple Condor constraints can be specified with an expression containing "&&"-s, "||"-s, etc. (for example: CondorConstraints = JobUniverse==5&&TotalSuspensions<3). Using quoted strings in the constraint expressions is a little more complicated because the quotes should also be quoted with 3 backslashes: CondorConstraints = Owner==\\\\\\\"condor\\\\\\\"
  • CondorQuickMode - only the "condor_q -l" command will be used to obtain information on the running jobs. By default, "plain" condor_q is also used to obtain a more accurate value of the run time.
  • CondorHistoryCheck = ON | OFF - the module will parse the Condor history log, obtaining additional data like the exit status of the jobs (default is ON). You should not enable this option if you enabled CondorUseGlobal or if you specified a Server argument, and there is no shared file system in the pool.
  • CondorHistoryFile=<history_file_location> - this parameter specifies the exact locations of the Condor history file(s). Use it if the history file is in a non-default location, or if you wish to collect information from non-local submit servers (i.e., if you enabled CondorUseGlobal or added Server arguments to the module). In the latter case, history parsing can be done only if there is a shared file system in the Condor pool; in this case, the CondorHistoryFile parameter must be added multiple times, once for each Condor submit server in the pool (or once for each Condor server specified with the Server argument).
  • PBSHistoryCheck - the module will parse the PBS accounting logs
  • NoPBSHistoryCheck - the module will not parse the PBS accounting logs. This is the default behavior.
  • PBSLogDir=<pbs_log_dir_location> - this parameter specifies the exact location of the PBS accounting log directory. Use it if you enabled PBSHistoryCheck and the accounting logs are not in the default location /usr/spool/PBS/server_priv/accounting.
  • PBSQuickMode = ON | OFF (default is ON) - if this flag is set to ON, additional information (like job runtime) will be obtained for PBS jobs by running the "qstat -a" command.
  • MixedCaseVOs - if this argument is given, the names of the VOs will be displayed with mixed cases (by default, they are displayed in upper cases).
  • FinishedJobResults = ON | OFF (default is ON) - if this flag is ON, some additional results will be sent by the module when a job is finished; the results will be in a separate cluster, usually named "osgVO_JOBS_Finished".
  • IndividualUsersResults = ON | OFF (default is OFF) - if this flag is ON, the module will send additional results containing per-user statistics. The results will be grouped in a cluster usually named "osgVO_JOBS_Users".
  • AdjustJobStatistics = ON | OFF (default is OFF) - this flag determines the behaviour of the module when the job manager reports decreasing CPU time or run time for a job. If the value is OFF, no Results will be created for the job until the job manager again reports a correct value (greater than or equal to the previous one). If the value is ON, the module will add the new (smaller) value to the previous one.
  • NoVoStatistics = ON | OFF (default is ON) - if this flag is ON, the jobs of the users that do not belong to any VO will be reported under the name "NO_VO". If the flag is OFF, the jobs will not be reported.
  • VerboseLogging = ON | OFF (default is OFF) - if the flag is ON, all the error messages will be reported in the logs every time they occur with the WARNING level; otherwise, the same type of error messages will be reported from time to time with WARNING level and the other times with FINEST level.
  • CanSuspend = ON | OFF (default is OFF) - if this flag is enabled, the module is suspended for a period of time if there are 3 consecutive executions with errors.

Note: The parsing of the module's argument names is not case sensitive. The "boolean" arguments (the ones having ON/OFF as possible values) are considered to be ON if their names appear in the list without any associated value.

Examples:

*osgVO_JOBS{OsgVoJobs, localhost, mapfile=/mymapfile.txt}%120

Here, the module is initialized with a user-VO mapfile other than the default one and will be run every 120 seconds. Apart from the "non-standard" mapfile, the default settings are used (this means that information will be collected only from the local condor_schedd daemon).

*osgVO_JOBS{OsgVoJobs, localhost, mapfile=/mymapfile.txt, CondorUseGlobal}%180

In this example the module is initialized with a non-default mapfile, and it will execute the "condor_q" command with the "-g" option, thus providing information from all the submit machines in the Condor pool.

*osgVO_JOBS{OsgVoJobs, localhost, Server=tier2b.cacr.caltech.edu}%100

In this example the module will collect information from the submit machine tier2b.cacr.caltech.edu, not from the machine on which it is run.

*osgVO_JOBS{OsgVoJobs, localhost, Server=lcfg.rogrid.pub.ro, Server=wn1.rogrid.pub.ro}%60

Here, the module will collect information from two submit machines: lcfg.rogrid.pub.ro and wn1.rogrid.pub.ro.

*osgVO_JOBS{OsgVoJobs, localhost, CondorUseLocal, Server=tier2b.cacr.caltech.edu}%120

In this example, the module will gather data from tier2b.cacr.caltech.edu and also from the local submit machine (because the argument CondorUseLocal was given).

3.1.4. Logging levels

To change the logging level for this module's logger, add/modify the following line in the ml.properties file:

lia.Monitor.modules.monOsgVoJobs.level = LEVEL

Value for LEVEL can be: SEVERE, WARNING, INFO, FINE.

3.2. VO_IO

3.2.1. Description

The OsgVO_IO module holds statistical information about the ftp traffic for OSG. The input, output and rate values represent the value for the last time interval (this interval is set before you run the ML service). These values are displayed in the ML client and in the OSG repository (integrated values).

3.2.2. Results Provided by the Module

There are two categories of VO parameters: parameters that represent values obtained in the last time interval (between the previous run of the module and the current one) and parameters that represent rates (calculated as the difference between the current value of a parameter and the value obtained at the previous run, divided by the length of the time interval between runs).

The parameters that represent the values obtained in the last time interval are:

  • ftpInput and ftpOutput (in KB) - the total ftp transfer in the last time interval
  • ftpRateIn and ftpRateOut (in KB/s) - the rates for the ftp traffic
  • ftpInput_SITENAME and ftpOutput_SITENAME
  • ftpRateIn_SITENAME and ftpRateOut_SITENAME - the same meaning, but the values represent the ftp transfer for a single domain (SITENAME) (for example ftpInput_caltech.edu)

Another type of parameters are the ones representing rates for each ftp transfer (under the VO_IO_Transfers cluster). The value of such a parameter is the size of one transfer divided by the length of the time interval specified in the log file by the START and DATA fields (the difference between DATA and START). For example, a 10000 KB transfer with a DATA - START interval of 50 s would be reported as a 200 KB/s rate.

There are also some "total" parameters, which represent the sum of the parameters above for all the VOs. Under the VO_IO_Totals cluster there is "Status" node that indicates the status of the module (0 if the execution was correct and non-zero in case of error).

3.2.3. Module activation

In order to use the VO accounting modules, you should have MonALISA 1.2.38 or newer. If you have the OSG distribution, it is necessary to source two scripts:

 . $VDT_LOCATION/setup.sh
 . $VDT_LOCATION/MonaLisa/Service/CMD/ml_env

(replace "$VDT_LOCATION" with the path to your OSG directory)

To compile, just run the "comp" script from the modules' directory:

./comp

The VO_IO module is initialized with a node and arguments through a configuration file entry:

*osgVO_IO{OsgVO_IO, localhost, <arguments>}%<TIME>

where <arguments> is a comma separated list. Accepted arguments are:

  • ftplog=/path-to-ftplog - the location of the gridftp log file (gridftp.log)
  • mapfile=/path-to-mapfile - the location of the user-VO map file (grid3-user-vo-map.txt)
  • debug - optional argument for displaying debug information in the ML log file
  • MixedCaseVOs - if this argument is given, the names of the VOs will be displayed with mixed cases (by default, they are displayed in upper cases)

TIME represents the interval in seconds between two calls of the doProcess method.

The module needs two environment variables to be set:

  • for Globus: if you have Globus, you should set the GLOBUS_LOCATION variable. This environment variable should be set by sourcing the setup.sh file from your OSG/ folder.
  • for VDT: if you have VDT, you should set the VDT_LOCATION variable. This environment variable should be set by sourcing the setup.sh file from your OSG/ folder. For OSG, the vdt folder is in the OSG folder (OSG/vdt).

For example, in the OSG distribution ftplog and mapfile are:

ftplog=$VDT_LOCATION/globus/var/gridftp.log
mapfile=/OSG/monitoring/grid3-user-vo-map.txt

In the farm's config file you should put the line:

*osgVO_IO{OsgVO_IO, localhost, ftplog=$VDT_LOCATION/globus/var/gridftp.log, mapfile=$VDT_LOCATION/monitoring/grid3-user-vo-map.txt, debug}%180

It is not necessary to pass the first two arguments if the corresponding environment variables exist and are set. In this case you can initialize the module in this way:

*osgVO_IO{OsgVO_IO, localhost, }%180

3.2.4. Logging levels

To change the logging level for this module's logger, add/modify the following line in the ml.properties file:

lia.Monitor.modules.monOsgVO_IO.level = LEVEL

Value for LEVEL can be: SEVERE, WARNING, INFO, FINE.

3.3. PN_Condor, PN_PBS and PN_LSF

3.3.1. Description

The PN modules offer monitoring information about the processing nodes from a cluster. The metrics provided are a subset of the Ganglia metrics (see section 2), but the information is obtained from a job manager running on the cluster instead of Ganglia. Currently the modules work with Condor, OpenPBS/Torque and LSF and the commands used to obtain the nodes' status are:

For Condor:

condor_status [-pool <server_name>] [-constraint <constraint_expr>] -l
          

For OpenPBS/Torque:

pbsnodes [-s <server_name>] -a
           

For LSF:

bhosts -l
lshosts
          

3.3.2. Results Provided by the Modules

The parameters provided by the PN_Condor and PN_PBS modules are:

PN_Condor/PN_PBS
|____node1
|____node2
|        (parameters)
|        |____NoCPUs
|        |____VIRT_MEM_free
|        |____MEM_total
|        |____Load1
|......
|____nodeN
             

The parameters provided by the PN_LSF module are:

PN_LSF
|____node1
|____node2
|        (parameters)
|        |____NoCPUs
|        |____MEM_free
|        |____MEM_total
|        |____SWAP_free
|        |____SWAP_total
|        |____Load1
|        |____Load15
|......
|____nodeN
            

where:

  • NoCPUs - the number of CPUs on the node
  • MEM_free - the amount of free physical memory (in MB)
  • MEM_total - the total amount of physical memory, in MB
  • SWAP_free - the amount of free swap memory, in MB
  • SWAP_total - the total amount of swap memory, in MB
  • Load1 - load average for 1 minute on the node
  • Load15 - load average for 15 minutes on the node

If the modules are initialized with the Statistics argument, an additional cluster with statistical information about the number of nodes is provided:

For PBS:

PN_PBS_Statistics
|____Statistics
|         (parameters)
|         |____Total Nodes
|         |____Total Available Nodes
|         |____Total Free Nodes
|         |____Total Down Nodes
|____...
            

where:

  • Total Nodes - total number of nodes registered to the PBS server
  • Total Available Nodes - total number of nodes that are currently communicating with the server
  • Total Free Nodes - total number of nodes which are in the "free" state (can execute incoming jobs)
  • Total Down Nodes - number of nodes whose state is unknown to the server

For Condor:

PN_Condor_Statistics
|____Statistics
|         (parameters)
|         |____Total Nodes
|         |____Total Slots
|         |____Total CPUs
|         |____Total Available Slots
|         |____Total Free Slots
|         |____Total Owner Slots
|____...
          

where:

  • Total Nodes - total number of nodes from the Condor pool (a multi-processor machine counts as a single node)
  • Total Slots - total number of slots (virtual machines in Condor). For a multi-processor machine, separate virtual machines are usually created for each processor.
  • Total CPUs - total number of CPUs (should be equal with the number of slots; if that is not the case, you should add the SlotsFactor argument to the module - see below).
  • Total Available Slots - total number of slots that are available for Condor (i.e., the user is not executing his/her own jobs on them)
  • Total Free Slots - total number of nodes which are in the "free" state (can execute incoming jobs)
  • Total Owner Slots - number of slots in "Owner" state (the user is executing his/her own jobs on them)

The Statistics cluster also contains a "Status" node which indicates the module's status (0 if it was executed correctly and non-zero if there was an error).

For LSF:

PN_LSF_Statistics
|____Statistics
|         (parameters)
|         |____Total Nodes
|         |____Total Slots
|         |____Total Free Slots
|         |____Total Down Nodes
|____...
            

where:

  • Total Nodes - total number of nodes
  • Total Slots - total number of job slots
  • Total Free Slots - total number of free job slots
  • Total Down Nodes - the number of down nodes (nodes for which LSF bhosts does not report the "ok" status)

3.3.3. Modules activation

In order to use these modules, you should have MonALISA 1.2.38 or newer.

If you have the OSG distribution and you placed the modules in the usr_code folder from MonaLisa/Service, it is necessary to source two scripts:

. /OSG/setup.sh
. /OSG/MonaLisa/Service/CMD/ml_env                         
            

(replace "/OSG" with the path to your OSG directory)

To compile, just run the "comp" script from the modules' directory:

./comp

Compiling is only necessary if you use the version of the module from the usr_code/ directory.

To enable the modules you should add to the farm configuration file a line of the following form:

*<cluster_name>{moduleName, localhost, <arguments>}%<time_interval>

where:

  • cluster_name - the cluster name for the results that this module produces (PN_Condor, PN_PBS or PN_LSF)
  • moduleName - name of module: monPN_Condor, monPN_PBS, monPN_LSF (or, if running from usr_code: PN_Condor, PN_PBS, PN_LSF)
  • <arguments> - list of arguments. The arguments that may be passed to the modules are Statistics, Server, SlotsFactor and CondorConstraints (the latter two only for PN_Condor) and NodesLabel (only for PN_PBS).

If the Statistics argument appears in the list of arguments, the module will provide an additional "cluster" that contains statistics about the number of nodes in the cluster, as described above.

The Server argument indicates the name of the PBS server / Condor central manager that will be queried. For example:

Server=lcfg.rogrid.pub.ro

is a valid entry for this parameter. If this argument is used for the PN_Condor module, the "condor_status" command will be run with the "-pool" option, and for the PN_PBS module the "pbsnodes" command will be run with the "-s" option. The "Server" argument is optional and can appear more than once in the list, to specify multiple servers from which information should be collected; if it doesn't appear, the PBS server / Condor central manager corresponding to the local machine will be used.

The "SlotsFactor" argument can be used for Condor, in order to display correctly the number of CPUs (the Total CPUs result), if it is different from the number of Condor slots. The number of CPUs will be calculated as the number of Condor slots times the SlotsFactor; for example, if you have 100 CPUs and 400 Condor slots, you should set "SlotsFactor = 0.25".

CondorConstraints = <constraints> - with this argument you can specify a constraint expression that will be used with condor_status (for example, CondorConstraints = HasCheckpointing==TRUE). Multiple Condor constraints can be specified with an expression containing "&&"-s, "||"-s, etc. (for example: CondorConstraints = HasCheckpointing==TRUE&&TotalVirtualMachines<4). Using quoted strings in the constraint expressions is a little more complicated because the quotes should also be quoted with 3 backslashes: CondorConstraints = FileSystemDomain==\\\\\\\"cithep90.ultralight.org\\\\\\\"

NodesLabel=<label> - for PN_PBS, with this argument you can specify a property label; the module will create statistics only for the nodes that have this label. A colon must be placed at the beginning of the label string (e.g., NodesLabel=:mylabel).
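For instance, restricting the statistics to nodes carrying a hypothetical label mylabel could look like:

 *PN_PBS{monPN_PBS, localhost, Statistics, NodesLabel=:mylabel}%120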

Examples:

*PN_Condor{monPN_Condor, localhost}%120

Here, the PN_Condor module is used with the default settings. The information will be obtained from the local Condor central manager and no statistics about the number of nodes will be created.

*PN_Condor{monPN_Condor, localhost, Statistics}%240

In this example the module will provide statistical information about the number of nodes.

*PN_Condor{monPN_Condor, localhost, Server=pccil.cern.ch, Statistics}%80

Here, only information from the Condor manager running on pccil.cern.ch will be collected; statistical information about the number of nodes will also be provided.

*PN_Condor{monPN_Condor, localhost, Server=lcfg.rogrid.pub.ro, Server=wn1.rogrid.pub.ro}%180

In this example the module will provide information collected from the lcfg.rogrid.pub.ro and wn1.rogrid.pub.ro Condor managers.

*PN_Condor{monPN_Condor, localhost, Server=cithep90.ultralight.org, CondorConstraints = HasCheckpointing==TRUE}%80

In this example the module will provide information collected from the cithep90.ultralight.org Condor manager, restricted to the nodes that satisfy the condition HasCheckpointing==TRUE.

*PN_PBS{monPN_PBS, localhost}%120

In this example the PN_PBS module is used with the default settings. The information will be obtained from the local PBS server and no statistics about the number of nodes will be created.

*PN_PBS{monPN_PBS, localhost, Statistics}%120

In this example the module will provide statistical information about the number of nodes.

*PN_PBS{monPN_PBS, localhost, Server=pccil.cern.ch, Statistics}%90

Here, only information from the PBS server running on pccil.cern.ch will be collected; statistical information about the number of nodes will also be provided.

*PN_PBS{monPN_PBS, localhost, Statistics, Server=gw01.rogrid.pub.ro, server=lcfg.rogrid.pub.ro}%180

In this example, information is collected from the gw01.rogrid.pub.ro and lcfg.rogrid.pub.ro servers, and statistical data about the number of nodes is provided.

*PN_LSF{monPN_LSF, localhost, Statistics}%120

In this example the PN_LSF module will provide statistical information about the number of nodes.

Note: The verification of the parameter names for these modules is case insensitive (i.e., you can write "statistics" or "Statistics").

When the modules are run, there are some environment variables that should be set, which indicate the locations of the available queue managers:

  • for PBS: if you have PBS, you should set the PBS_LOCATION variable; this variable should be set such that the path to the pbsnodes command is ${PBS_LOCATION}/bin/pbsnodes.
  • for Condor: if you have Condor, you should set the CONDOR_LOCATION variable; this variable should be set such that the path to the condor_status command is ${CONDOR_LOCATION}/bin/condor_status.
  • for LSF: if you have LSF, you should set the LSF_LOCATION variable; this variable should be set such that the path to the lshosts command is ${LSF_LOCATION}/bin/lshosts.

If you have the OSG distribution and you sourced the OSG/setup.sh script, all the needed variables are already set and it is not necessary to set any other environment variables.

3.3.4. Logging levels

To change the logging level for this module's logger, add/modify the following line in the ml.properties file:

lia.Monitor.modules.<module_name>.level = LEVEL

Value for LEVEL can be: SEVERE, WARNING, INFO, FINE. Value for module_name can be: monPN_Condor, monPN_PBS, monPN_LSF.

4. Network Monitoring

4.1. Network Traffic Monitoring using SNMP

Network I/O traffic from network elements can be collected using one of the following modules:

  • snmp_IOpp - supports 32bit SNMP counters

  • snmp_IOpp_HC - supports 64bit SNMP counters (if the device supports HC counters)

  • snmp_IOpp_v2 - uses a new SNMP library, supports both 32bit and 64bit counters and is available since version 1.3.41

4.1.1. Configuration

  • snmp_IOpp and snmp_IOpp_HC

    First, you should figure out which interfaces (identified by IfIndex in SNMP) you want to monitor:
     $snmpwalk -v1 -c public RouterIP .1.3.6.1.2.1.2.2.1.2
     IF-MIB::ifDescr.1 = STRING: fxp0
     IF-MIB::ifDescr.2 = STRING: fxp1
     IF-MIB::ifDescr.3 = STRING: fxp2
     IF-MIB::ifDescr.4 = STRING: lsi
          
    
    In order to monitor, for example, the interfaces with SNMP indices (IfIndex) 1 and 4, the module configuration should look like this:
     >NetDevice_IPAddress
     snmp_IOpp{1=description_for_fxp0;4=description_for_lsi_interface}%60
          
    
    The second term in each mapping specifies the description of the monitored interface, which is also the name of the parameter displayed in ML clients.
  • snmp_IOpp_v2

    This module is available since version 1.2.41. The configuration is similar to the snmp_IOpp and snmp_IOpp_HC modules. New features were added:
    • the module configuration accepts an optional SNMP configuration parameter (the first one in the parameter list) used to override the general SNMP configuration.
    • it permits specifying the monitored interface in the farm configuration by its SNMP description (IfDescr).
    • it is able to autodetect the high-counter support in the SNMP agent, so it is not necessary to have two different modules for this purpose.
    • for every monitored interface the link SPEED is also reported.
     #Configuration example: 
     >NetDevice_IPAddress
     snmp_IOpp_v2 {
     [comma_separated_list_of_snmp_params];
     fxp0=description_for_fxp0;
     lsi=description_for_lsi_interface
     }%60
          
    
    where the parameters are:
     SNMP_community=mycommunity
     SNMP_Version=1|2c
     SNMP_RemoteAddress=x.x.x.x
     SNMP_RemotePort=port
     SNMP_LocalAddress=x.x.x.x
     SNMP_Timeout=xx #ms            
          
    

4.1.2. General SNMP Configuration

For all internal SNMP modules the following general parameters can be set in ml.properties:

 # if you want a different community than public 
 # Default "public" community is used
 lia.Monitor.SNMP_community=mycommunity
 
 #snmp version 
 #(default ver 1 is used)
 lia.Monitor.SNMP_version=2c
 
 # Port for SNMP queries
 # Default is 161
 lia.Monitor.SNMP_port=1611
 
 # UDP connections settings
 #local address for UDP connections (default is )
 #lia.Monitor.SNMP_localAddress=
 
 #receive timeout (in ms)
 #Default is 15000
 lia.Monitor.SNMP_timeout = 20000
     

4.1.3. Display routers on GUI Client/Repository map

In the GUI we use the cluster names for certain functions. The data from "WAN" clusters can be used to show WAN links on the map and the traffic in real time. In order to have the routers shown in the GUI and the Repository's interactive map you need to use a "WAN" cluster, so your myFarm.conf file should look like this:

*WAN
 >router1_IPaddress
 snmp_IOpp_v2{1=description_for_fxp0;4=description_for_lsi_interface}%60
 >router2_IPaddress
 snmp_IOpp_v2{1=description_for_fxp0;4=description_for_lsi_interface}%60

The configuration of the WAN links and how they appear in the clients' map is done by us for the moment, so let us know the endpoints of each connection (location and IP address) that should appear on the 3D / 2D maps.

4.2. Tracepath and Traceroute

4.2.1. Description

The Tracepath module collects topology information by using a list of hosts gathered by a central synchronization service. These hosts are usually the ones in the same group as the requester. This data is then aggregated by the MonALISA GUI to display a full view of the routers/hosts the data passes through.

Before using this module, please verify that other farms in your group are using this module as well. Otherwise, please write us an email so that your group can be added to the allowed tracepath groups.

4.2.2. Results Provided by the Module

The parameters provided by the Tracepath modules are:

Tracepath/Traceroute
|____node1_ip
|____node2_ip
|        (parameters)
|        |____1:ip_address
|        |____2:no reply
|        |____3:ip_address
|        |____...
|        |____15:ip_address
|        |____status
|......
|____nodeN_ip             

where:

  • x:ip_address - the IP address of the router found at hop x.

  • x:no_reply - designates a hop that did not reply to the ICMP request.

  • status - the exit status of the tracepath/traceroute measurement.

Status values are as follows:

Table 1.1. Tracepath status information

Value Name Description
-1 STATUS_NOT_TRACED_YET Not traced yet. The peer has just been added and there is no data about it yet.
0 STATUS_OK Current reported trace is ok.
1 STATUS_TRACEPATH_FAILED The tracepath has failed during the trace to the given node.
2 STATUS_TRACEROUTE_FAILED The traceroute has failed during trace to the given node.
3 STATUS_DESTINATION_UNREACHED The destination (given node) was unreachable.
4 STATUS_REMOVED This peer has been removed from the configuration and will be deleted from the clients.
5 STATUS_DISABLED Neither tracepath nor traceroute can be run - they either do not exist or are both disabled.
6 STATUS_INTERNAL_ERR There is an internal config problem with this peer; this should never appear.

4.2.3. Module activation

In order to activate this module please add the following line to the farm configuration file e.g. myFarm.conf.

*Tracepath{monTracepath, localhost, " "}

4.2.4. Configuration

There is no actual configuration needed for this module. Configuration updates, such as the service URL and the IPID services, are done using the LUS service.

The module needs outbound connectivity to the other hosts in its group using ICMP, and TCP access to the Tracepath services (URL service and IPID). Also, inbound connectivity for ICMP packets is required for traceroute to work. There are currently four topology services running at these sites:

Table 1.2. Current topology services

  Primary site Backup site
ML Farms monalisa.cern.ch:9095 monalisa-chi.uslhcnet.org:8095
Vrvs monalisa.cern.ch:9090 monalisa-chi.uslhcnet.org:8090

4.2.5. Logging Levels

To change the logging level for this module's logger, add/modify the following line in the ml.properties file:

lia.Monitor.modules.monTracepath.level = LEVEL

Value for level can be: SEVERE, WARNING, INFO, FINE.

4.3. ABPing

4.3.1. Description

This monitoring module is used to perform simple network measurements in a ping-like fashion. The difference between ping and ABPing is the use of UDP packets instead of ICMP. A synchronization service provides per-group and per-node configuration data.

4.3.2. Results provided by the Module

The parameters provided by the ABPing module are:

ABPing
|____node1_ip
|____node2_ip
|        (parameters)
|        |____Jitter
|        |____PacketLoss
|        |____RTT
|        |____RTime
|......
|____nodeN_ip

where:

  • Jitter - computed packet jitter,

  • PacketLoss - packet loss percentage,

  • RTT - round trip time,

  • RTime - overall link quality coefficient.

4.3.3. Module Activation

In order to activate this module please add the following line to the farm configuration file e.g. myFarm.conf.

 *ABPing{monABPing, localhost, " "}    

4.3.4. Configuration

There is no actual configuration needed for this module. Configuration updates are done by the synchronization service. Each group can host its own service; in this case, ml.properties needs to be modified:

lia.Monitor.ABPing.ConfigURL=http://<server.fdqn>:<port>/ABPingConfig/ABPingAutoConfig

This module needs inbound and outbound connectivity on UDP port 9000, and outbound connectivity to the synchronization service (by default over HTTP).

4.3.5. Logging Levels

To change the logging level for this module's logger, add/modify the following line in the ml.properties file:

lia.Monitor.modules.monABPing.level = LEVEL

Value for level can be: SEVERE, WARNING, INFO, FINE.

4.4. Pathload Monitoring Module

4.4.1. Service Module

Available bandwidth information may be collected and exported by MonALISA services by activating the monPathload module.

The measurements are controlled by a coordination service. By using a token passing algorithm, the service ensures measurement fairness within a given group of hosts. No parallel measurements that cross the same network segment are allowed to take place, and the measurements are also timed.

For detailed information on how the measurements are performed please refer to the Service Application/AvailableBandwidth section.

The resulting ML parameters are AwBandwidth_Low and AwBandwidth_High (the available bandwidth interval), MeasurementDuration, MeasurementStatus (used to check the sanity level of the module), MegaBytesReceived and FleetsSent. A positive MeasurementStatus indicates a successful Pathload measurement; negative values represent errors. A detailed explanation of each status code can be viewed here.

The measurement result is the available bandwidth from the Pathload sender (Pathload Node) to the Pathload receiver (Pathload Cluster owner). When measuring the available bandwidth from host A to B, the coordination service usually schedules the reverse trip B->A right after A->B.

DstFarmName (rcv) 
   ---- Pathload 
      -----SrcFarmName1(sender) 
      -----SrcFarmName2 
         ----- AwBandwidth_Low (sender to receiver) 
         ----- AwBandwidth_High 
         ----- MeasurementDuration 
         ----- MeasurementStatus 
         ----- FleetsSent 
         ----- MegaBytesReceived 
      -----SrcFarmName3
4.4.1.1.  Requirements

The module is available starting with version 1.4.10 of MonALISA Service.

Pathload needs in- and outbound connectivity on the following ports:

UDP port 55001           Pathload control connection  
TCP port 55002           Pathload data connection 
outbound connectivity to the Coordination Service.
4.4.1.2. Service Configuration

Edit ml.properties and add:

lia.util.Pathload.client.PathloadConnector=http://coordination.service.host:port/PathloadConfig/PathloadConnector

Edit the service configuration file (e.g. myFarm.conf) and add:

*Pathload{monPathload, localhost, "if=eth0; speed=1000"}%30

(Re)start the ML Service

monalisa@ML$ cd $MonaLisa_HOME/Service/CMD 
monalisa@ML$ ML_SER restart
4.4.1.3. Configuration options

There are some additional properties that may be set in the ml.properties file

Table 1.3. Pathload Configuration Options

Property name Default value Description
lia.util.Pathload.client.PathloadConnector (not set) This is a critical value. It sets the URL address of the Section 4.4.2, “Coordination Service”. Without it, Pathload won't start.
lia.util.Pathload.client.binDir Control/bin This is the place MonALISA Service will extract the Pathload executables. This path is relative to $MonaLisa_HOME. You can change this if the MonALISA user can't write to Control/bin.
lia.util.Pathload.client.senderLogFile (not set) If set, the Pathload module will save pathload_snd output to this file. The path can be relative to $MonaLisa_HOME or it can be an absolute path if it begins with /. Ex: Service/myFarm/pathload_snd.log
lia.util.Pathload.client.receiverLogFile (not set) If set, the Pathload module will save pathload_rcv output to this file. The path can be relative to $MonaLisa_HOME or it can be an absolute path if it begins with /. Ex: Service/myFarm/pathload_rcv.log
4.4.1.4.  Troubleshooting

Other common options in configuring monPathload are:

lia.Monitor.modules.monPathload.level = FINEST
lia.app.abping.pathload.level = FINEST

for monitoring monPathload module activity (Preliminary checks are done and if the environment is not right, the module won't start. If this is the case, please enable lia.app.abping.pathload logging, and check the logs).

lia.util.Pathload.client.receiverLogFile=Service/myFarm/pathload_rcv.log
lia.util.Pathload.client.senderLogFile=Service/myFarm/pathload_snd.log

for catching the output of the pathload_snd and rcv programs to a file.

The path can either be relative to the MonaLisa_HOME or it can be an absolute path by beginning with /. By default, no output is kept.

4.4.2. Coordination Service

4.4.2.1.  Requirements

The control service needs an Application Server (e.g. Tomcat) to run.

4.4.2.2.  Installation

Download the PathloadConfig web archive and copy it into your webapps directory.

wget http://monalisa.cern.ch/download/pathload/PathloadConfig-v1.0.2.war -O PathloadConfig.war
cp PathloadConfig.war $CATALINA_HOME/webapps

Add a role named pathloadadmins and a user with that role. Edit $CATALINA_HOME/conf/tomcat-users.xml:

<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
   <role rolename="pathloadadmins"/>
   <user username="pathloadadmin" password="pass" roles="pathloadadmins"/>
</tomcat-users>

Your new connector for the site will be: http://fully.qualified.hostname.com:port/PathloadConfig/PathloadConnector.

The admin page will be http://fully.qualified.hostname.com:port/PathloadConfig and will require you to authenticate with the username and password from above.

4.4.2.3.  Configuration

The servlet can be configured by selecting the Pathload Setup menu.

Table 1.4. Pathload Configuration Service Options

Property name Default Value Description
Peer minWaitingTime 30 (s) Time to wait before a Peer is allowed to acquire a token again. This is used to prevent peers from becoming intrusive.
Peer maxAgingTime 300 (s) Time of inactivity after which a peer is considered aged and is removed from the cached Peers. Each peer has to report to the servlet every 30 s.
Token maxTokenAgingTime 150 (s) Time after which a token will be declared lost and a new token will be released. This usually happens in the case of firewalled connections.
4.4.2.4.  Troubleshooting
1. I get no data from some hosts, and the servlet logs the following events:
[77]       INFO Wed Feb 01 00:54:40 PST 2006 A new token Token xXgmmKR3DO63ln6B4kos1Q== from [yyy/x.x.x.x] to [zzz/x.x.x.x]was created.
[76]       FINE Wed Feb 01 00:54:38 PST 2006 [xxx/x.x.x.x] refreshed its status.
[75]       FINE Wed Feb 01 00:54:08 PST 2006 [xxx/x.x.x.x] refreshed its status.
[74]       FINE Wed Feb 01 00:53:38 PST 2006 [xxx/x.x.x.x] refreshed its status.
[73]       FINE Wed Feb 01 00:53:07 PST 2006 [xxx/x.x.x.x] refreshed its status.
[72]       FINE Wed Feb 01 00:52:37 PST 2006 [xxx/x.x.x.x] refreshed its status.
[71]       INFO Wed Feb 01 00:52:07 PST 2006 Token Token C2BCLcaY/Dexx+SI1+nr4g== from [xxx/x.x.x.x] to [ttt/x.x.x.x] aquired by [xxx/x.x.x.x]

The most probable cause is a firewall. If you have access to one of the machines xxx or ttt, please look at the output logs of pathload (pathload_snd.log and pathload_rcv.log). If you see something like:

Sending fleet 1#
Waiting for connections ... =>

the connection is firewalled on one end. Both sender and receiver require inbound connections to ports TCP 55002 and UDP 55001.

2. I installed the WAR, the ML Service seems to get its configuration but the PathloadStatus page is blank.

You are probably running Tomcat 5.0.x and JDK 1.5. There is an issue regarding the Xalan XML Library explained here: http://forum.java.sun.com/thread.jspa?tstart=30&forumID=34&threadID=542044&trange=15.

A quick fix would be:

  • Run the service inside a Tomcat v.5.5.x, or

  • Run Java 1.4 and Tomcat 5.0.x, or

  • Remove the xml-apis.jar file from $CATALINA_HOME/common/endorsed.