Collecting the monitoring information can be done in
several ways using dynamically loadable modules.
It is possible to collect information using:
SNMP daemons;
Ganglia;
LSF or PBS batch queueing systems;
local or remote procedures to read /proc files;
user modules based on dedicated scripts or procedures.
1. Local System Monitoring
1.1. Kernel /proc files
1.1.1. Description
The monProc* modules use the local /proc files to collect
information about the CPU, load and IO. They are typically
run on the master node of a cluster, where the MonALISA
service itself is running, with simple modules reading the
/proc files to collect the data. These modules are mainly
designed to be used on the node the MonALISA service is
running on, but they may also be used on remote systems
via rsh or ssh.
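The example configuration this text refers to is not reproduced here; based on the description that follows, it has this form (the host name and alias are taken from the text):

*Master
>ramen gateway
monProcLoad%30
monProcIO%30
monProcStat%30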
The first line (*Master) defines a Functional Unit (or
Cluster). The second line (>ramen gateway) adds a node
to this Functional Unit: ramen is a computer name and,
optionally, the user may add an alias for it (here,
gateway).
The lines:
monProcLoad%30
monProcIO%30
monProcStat%30
define three monitoring modules to be used on the
node "ramen". These measurements are done every 30s.
1.2. monStatusCmd
1.2.1. Description
This module can be used to run a command or script that
reports the status of one or several services. It expects
the output of the command to look like this:
Service1 Status 2 Memory 23
Service2 Status 4 Memory 238
...
SCRIPTRESULT status 3 Message error message
The first word is the name of the service. It is followed,
on the same line, by a variable number of (parameter name,
value) pairs. The value can be a number or a string: if it
can be interpreted as a number, it will produce a Result,
otherwise an eResult will be created. The service name,
parameters and values must be TAB separated. The last line
must consist of only the word DONE to confirm the normal
termination of the script. Alternatively, you can output a
line starting with SCRIPTRESULT followed by some parameters
describing the end status; such a line is treated like a
DONE line.
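A minimal sketch of such a script (the service names and values are illustrative; note that the fields must be TAB separated):

#!/bin/sh
# Report the status of two hypothetical services; fields are TAB separated.
printf "Service1\tStatus\t2\tMemory\t23\n"
printf "Service2\tStatus\t4\tMemory\t238\n"
# Confirm normal termination.
echo "DONE"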
1.2.2. Module Activation
The module can be put in myFarm.conf like
this:
*ClusterName{monStatusCmd, localhost, "full command to be executed"}%RunInterval
or, if you want to change the default timeout from
120 sec to 300 sec:
*ClusterName{monStatusCmd, localhost, "full command to be executed, timeOut=300"}%RunInterval
2. Cluster/Farm Monitoring
2.1. Ganglia
Ganglia is a well known monitoring system which uses a
multicast messaging system to collect system information
from large clusters. MonALISA can easily be interfaced
with Ganglia, either through the multicast messaging
system or through the gmon interface, which retrieves the
cluster monitoring information in XML format. The MonALISA
distribution provides modules for both possibilities. If
the MonALISA service runs in the multicast range of the
nodes sending monitoring data, we suggest using the
Ganglia module which is a multicast listener. The code for
interfacing MonALISA with Ganglia using gmon is in
Service/usr_code/GangliaMod, and the code using the
multicast messages is in Service/usr_code/GangliaMCAST.
The user may modify these modules. Please look at the
service configuration examples to see how these modules
may be used.
2.1.1. Monitoring a Farm using the Ganglia gmon module
The configuration file should look like this:
Example 1.1. Farm configuration with Ganglia gmon
*PN_popcrn {IGanglia, popcrn01.fnal.gov, 8649}%30
The line:
*PN_popcrn {IGanglia, popcrn01.fnal.gov, 8649}%30
defines a cluster named "PN_popcrn" for which the entire
information is provided by the IGanglia module. This
module uses telnet to get XML output from the Ganglia
gmon; the request is sent to node popcrn01.fnal.gov on
port 8649. All the nodes which report to Ganglia will be
part of this cluster unit, and for all of them the
parameters selected in the IGanglia module will be
recorded. The measurement is done every 30s.
The Ganglia module is located in
Service/usr_code/GangliaMod. The user may edit the file
and customize it. This module is NOT in the MonALISA jar
files, so to use it the user MUST add the path to this
module to the MonALISA loader. This can be done in
ml.properties by adding this line:
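(The property line itself is not preserved in this text; assuming the standard MonALISA class-loader property, it would be something like:)

lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/GangliaMod/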
2.1.2. Monitoring a Farm using the Ganglia Multicast module
To get copies of the monitoring data sent by the nodes
running the Ganglia daemons (using a multicast port), the
system on which MonALISA is running must be in multicast
range for these messages.
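(The example configuration is not preserved in this text; the sketch below is consistent with the description that follows. The module name monGangliaMcast and the multicast address/port, Ganglia's defaults, are assumptions:)

*Master
>citgrid3.cacr.caltech.edu citgrid3
monProcLoad%30
monProcIO%30
monProcStat%30

*PN_cit{monGangliaMcast, tier2, "239.2.11.71:8649"}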
Such a configuration, placed in the configuration file,
will use the Ganglia multicast module to listen to all the
monitoring data and then select certain values to be
recorded into MonALISA. The service will automatically
create a configuration for all the nodes which report data
in this way.
PN_cit is the name of the cluster of processing nodes. It
is important for the cluster name of processing nodes to
contain the "PN" string; this is used by farm filters to
report global views for the farms.
tier2 is the name of the system corresponding to the real
IP address on which this MonALISA service is running. The
second parameter defines the multicast address and port
used by Ganglia.
The GangliaMCAST module is located in
Service/usr_code/GangliaMCAST. The user may edit the file
and customize it. This module is NOT in the MonALISA jar
files, so to use it the user MUST add the path to this
module to the MonALISA loader. This can be done in
ml.properties by adding this line:
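(Again, the property line is not preserved here; assuming the standard class-loader property:)

lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/GangliaMCAST/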
The first line (*Master) defines a Functional Unit (or
Cluster). The second line (>citgrid3.cacr.caltech.edu
citgrid3) adds a node to this Functional Unit class,
optionally with an alias. The lines:
monProcLoad%30
monProcIO%30
monProcStat%30
define three monitoring modules to be used on the node
"citgrid3". These measurements are done periodically,
every 30s. The monProc* modules use the local /proc files
to collect information about the CPU, load and IO. In this
case this is the master node of the cluster, where the
MonALISA service itself is running, and simple modules
reading the /proc files are used to collect the data.
*PN_CIT defines a new cluster name, for a set of
processing nodes used by the site. The string "PN" in the
name is necessary if the user wants to automatically use
filters to generate global views for all these processing
units.
The configuration then has a list of nodes in the cluster
and, for each node, a list of modules to be used for
getting monitoring information from it. For each module a
repetition time is defined (%30), meaning that the module
is executed once every 30s. Defining the repetition time
is optional; the default value is 30s.
2.3. Monitoring Applications
MonALISA can monitor external applications using the ApMon
API. In order to configure MonALISA to listen on UDP port
8884 for incoming datagrams (XDR encoded, sent using
ApMon), you should add the following line to your config
file:
^monXDRUDP{ListenPort=8884}%30
The Clusters, Nodes and Parameters are dynamically created
in MonALISA's configuration tree every time a new one is
received. It is also possible to dynamically remove
"unused" Clusters/Nodes/Parameters if no datagrams match
them for a period of time. The timeouts are given in
seconds:
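(The example line is not preserved in this text; the timeout parameter names below are hypothetical placeholders, chosen only to match the values discussed next:)

# RemoveParamsTimeout/RemoveNodesTimeout/RemoveClustersTimeout are assumed names
^monXDRUDP{ListenPort=8884, RemoveParamsTimeout=10800, RemoveNodesTimeout=10800, RemoveClustersTimeout=86400}%30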
In the example above, the parameters and the nodes are
automatically removed from the ML configuration tree if no
data is received for 3 hours (10800 seconds). The Cluster
is removed after one day (24 hours = 86400 seconds).
For further information on how to send data into MonALISA,
please see the ApMon API documentation.
3. Grid Monitoring
3.1. VO_JOBS
3.1.1. Description
The OsgVoJobs module collects information from
different queue managers in order to obtain
accounting statistics for VOs. The current version is
able to work with Condor, PBS, LSF and SGE; if there
are several queue managers on a cluster, the values
obtained from them are summed.
The module parses the output of some specific
commands that give information about the current jobs
(like condor_q from Condor, qstat from PBS etc.) and
produces results for each job (CPU time consumed, run
time, the amount of memory etc.). In the case of
Condor, the history file is also read in order to
obtain more detailed information.
These results are then added to the per-VO statistics; the
association between the Unix account from which a job is
run and the VO to which the job belongs is made on the
basis of a grid map file which specifies the corresponding
VO for each account.
3.1.2. Results Provided by the Module
The module provides two categories of parameters:
parameters specific to a single job and parameters
for a VO.
Job Parameters: The job parameters provided by
this module are:
CPUTime - the CPU time consumed so far by the
job, in seconds (available in Condor, PBS, LSF,
SGE)
RunTime - wall clock time, in minutes
(available in Condor, LSF)
WallClockTime - wall clock time, in seconds
(available in Condor, LSF and in PBS if
PbsQuickMode is disabled)
Size - the size of the job, in MB (available
in Condor, LSF, SGE)
DiskUsage - the disk usage of the job, in MB
(available in Condor)
VO Parameters:
There are two categories of VO parameters:
parameters that represent values obtained in the last
time interval (between the previous run of the module
and the current one) and parameters that represent
rates (calculated as the difference between the
current value of a parameter and the value obtained
at the previous run, divided by the length of the
time interval between runs).
The parameters that represent values obtained in
the last time interval are:
RunningJobs - the number of running jobs
owned by the VO
IdleJobs
HeldJobs
UnknownJobs
TotalJobs
SubmittedJobs - the number of jobs submitted
in the last time interval
FinishedJobs
FinishedJobs_Success - the number of jobs
finished successfully (with 0 as exit status);
this parameter is provided only if the parsing of
the Condor history file / PBS accounting logs is
enabled.
FinishedJobs_Error - the number of jobs
finished with error (with non-zero exit status);
this parameter is provided only if the parsing of
the Condor history file / PBS accounting logs is
enabled.
CPUTime - CPU time in seconds (sum for all
the VO's jobs)
CPUTimeCondorHist - the CPU time for Condor
jobs, obtained from the history file
RunTime - wall clock time in minutes
JobsSize - the size of the jobs in MB
DiskUsage - disk usage for the VO, in MB
VO_Status - this is a flag that shows whether
the VO currently has jobs (it is 0 if there are
jobs and 1 otherwise).
These parameters are grouped under the main module's
cluster (usually named "osgVO_JOBS" - see the section on
module activation) and are reported for each VO, if the VO
currently has jobs. If the VO does not have any jobs, only
the VO_Status parameter will be reported.
The parameters that represent rates are:
SubmittedJobs_R - rate for the SubmittedJobs
parameter
FinishedJobs_R
RunTime_R
CPUTime_R
These parameters are grouped in a Rates cluster,
usually named "osgVO_JOBS_Rates" (instead of
"osgVO_JOBS" there will be the main module's cluster
name, if it is different), and are only reported for
the VOs that currently have jobs.
There are also some "total" parameters, which represent
the sum of the parameters above over all the VOs. Under
the Totals cluster (usually named "osgVO_JOBS_Totals")
there are also some nodes that give general information:
Status - indicates the status of the module (0 if the
execution was correct and non-zero in case of error).
ExecTime_<manager_name> - the execution time, in ms,
of the job manager command.
TotalProcessingTime - the total amount of time, in ms,
needed for the module's execution.
<manager_name> - the job manager's version.
3.1.3. Module Activation
In order to use the VO accounting modules, you should have
MonALISA 1.2.38 or newer. If you have the OSG
distribution, it is necessary to source two scripts:
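(The script paths are not preserved in this text. The first is referenced later in this section; the second, a MonALISA environment script, is an assumption:)

source /OSG/setup.sh
source /OSG/MonaLisa/Service/CMD/ml_env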
(replace "/OSG" with
the path to your OSG directory)
To compile, just run the "comp" script from the modules'
directory:
./comp
Compiling is only necessary if you run the version of the
module placed in the usr_code/ directory (Note: see the
"Pluggable Components" section for more details on running
modules from the usr_code directory).
When the module is run, some environment variables should
be set to indicate the locations of the available queue
managers. Usually the names of the variables are of the
form <JOB_MANAGER>_LOCATION (an illustrative export
block follows the list):
for PBS: if you have PBS, you should set the
PBS_LOCATION variable; this variable should be
set such that the path to the qstat command is
${PBS_LOCATION}/bin/qstat.
for Condor: if you have Condor, you should
set the CONDOR_LOCATION variable; this variable
should be set such that the path to the condor_q
command is ${CONDOR_LOCATION}/bin/condor_q. The
module also parses the Condor history file(s) -
see the paragraph below, "Settings for the Condor
history files".
for LSF: if you have LSF, you should set the
LSF_LOCATION variable; this variable should be
set such that the path to the bjobs command is
${LSF_LOCATION}/bin/bjobs.
for SGE: if you have SGE, you should set the
SGE_LOCATION variable; this variable should be
set such that the path to the qstat command is
${SGE_LOCATION}/bin/glinux/qstat.
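For example (the installation paths below are purely illustrative):

export PBS_LOCATION=/usr/local/pbs
export CONDOR_LOCATION=/opt/condor
export LSF_LOCATION=/usr/local/lsf
export SGE_LOCATION=/opt/sge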
If you have the OSG distribution and you sourced the
OSG/setup.sh script, all the needed variables are already
set and it is not necessary to set any other environment
variables. However, it may be necessary to specify the
location of the Condor history files, as explained in the
following paragraph.
Settings for the Condor
history files:
If you want the module to gather data only
from the Condor submit server on the local
machine (i.e., the module is not configured with
the parameters CondorUseGlobal or Server - see
the following paragraph for parameters
configuration), there is a single history file.
The file is assumed to be in the default
location, which is spool/history under the
directory known in Condor as LOCAL_DIR. If you
want to see which is the LOCAL_DIR directory, use
the command "condor_config_val LOCAL_DIR". If the
Condor file is in another location, you must add
the CondorHistoryFile parameter to the module, in
order to specify the location (see the following
paragraph for parameters configuration).
If you configure the module to collect data
from multiple submit servers (by adding the
parameters CondorUseGlobal or Server), there will
be multiple history files, one for each submit
server. If there is a shared file system in the
Condor pool, the history files are available, but
you must specify the location of each file, by
adding CondorHistoryFile parameters. If the
CondorHistoryFile parameters are not specified or
if there is no shared file system, history
information will not be collected.
To enable the module you should add to the farm
configuration file a line of the following form:
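(The exact line is not preserved in this text; a sketch consistent with the examples later in this section:)

*cluster_name{OsgVoJobs, localhost, <arguments>}%<run_interval>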
where cluster_name is the main module's cluster name; it
is recommended that the cluster name be "osgVO_JOBS". If
you use the module included in the MonALISA service and
not the one from usr_code, you should replace "OsgVoJobs"
with "monOsgVoJobs". The possible arguments for the module
are:
doNotPublishJobInfo - the module will not
produce results for each running job, but only VO
statistics
mapfile=<mapfile> - the location of the
mapfile which contains the associations between
user accounts and VOs. By default it is
considered to be in
${MONALISA_HOME}/../monitoring/grid3-user-vo-map.txt.
CheckCmdExitStatus = ON | OFF (default is ON)
- flag that specifies whether the module should
verify the exit status of the commands that it
executes. If this is enabled, the commands'
output is only taken into account if the exit
status is 0.
CondorUseGlobal - if Condor is available, the
information will be collected from all the submit
machines in the pool (i.e., from all the machines
on which there are condor_schedd daemons
running). This is done by using the "-g" option
for the "condor_q" command.
Server=<hostname> - if Condor is
available, the information will be collected from
the submit machine that has the specified
hostname. This is done by using the "-name"
option for the "condor_q" command. The "Server"
argument may appear more than once, to specify
multiple submit machines.
CondorFormatOption = ON | OFF | ALTERNATIVE
(the default is ALTERNATIVE) - this argument
specifies whether the condor_q command should be
used with the -format option. If the argument's
value is ALTERNATIVE, the -format option is used
only if the output of the regular condor_q
command cannot be parsed correctly.
CondorConstraints = <constraints> - with this
argument you can specify, for Condor, a constraint
expression that will be used with condor_q (for example,
CondorConstraints = JobUniverse==1). Multiple Condor
constraints can be specified with an expression containing
"&&"-s, "||"-s, etc. (for example:
CondorConstraints =
JobUniverse==5&&TotalSuspensions<3). Using
quoted strings in the constraint expressions is a little
more complicated because the quotes must themselves be
quoted with 3 backslashes: CondorConstraints =
Owner==\\\"condor\\\"
CondorQuickMode - only the "condor_q -l"
command will be used to obtain information on the
running jobs. By default, "plain" condor_q is
also used to obtain a more accurate value of the
run time.
CondorHistoryCheck = ON | OFF - the module
will parse the Condor history log, obtaining
additional data like the exit status of the jobs
(default is ON). You should not enable this
option if you enabled CondorUseGlobal or if you
specified a Server argument, and there is no
shared file system in the pool.
CondorHistoryFile=<history_file_location> -
this parameter specifies the exact locations of
the Condor history file(s). Use it if the history
file is in a non-default location, or if you wish
to collect information from non-local submit
servers (i.e., if you enabled CondorUseGlobal or
added Server arguments to the module). In the
latter case, history parsing can be done only if
there is a shared file system in the Condor pool;
in this case, the CondorHistoryFile parameter
must be added multiple times, once for each Condor
submit server in the pool (or once for each
Condor server specified with the Server
argument).
PBSHistoryCheck - the module will parse the
PBS accounting logs
NoPBSHistoryCheck - the module will not parse
the PBS accounting logs. This is the default
behavior.
PBSLogDir=<pbs_log_dir_location> - this parameter
specifies the exact location of the PBS accounting log
directory. Use it if you enabled PBSHistoryCheck and the
accounting logs are not in the default location,
/usr/spool/PBS/server_priv/accounting.
PBSQuickMode = ON | OFF (default is ON) - if
this flag is set to ON, additional information
(like job runtime) will be obtained for PBS jobs
by running the "qstat -a" command.
MixedCaseVOs - if this argument is given, the names of
the VOs will be displayed in mixed case (by default,
they are displayed in upper case).
FinishedJobResults = ON | OFF (default is ON)
- if this flag is ON, some additional results
will be sent by the module when a job is
finished; the results will be in a separate
cluster, usually named
"osgVO_JOBS_Finished".
IndividualUsersResults = ON | OFF (default is
OFF) - if this flag is ON, the module will send
additional results containing per-user
statistics. The results will be grouped in a
cluster usually named "osgVO_JOBS_Users".
AdjustJobStatistics = ON | OFF (default is OFF) - this
flag determines the behaviour of the module when the job
manager reports decreasing CPU time or run time for a
job. If the value is OFF, no Results will be created for
the job until the job manager again reports a correct
value (greater than or equal to the previous one). If
the value is ON, the module will add the new (smaller)
value to the previous one.
NoVoStatistics = ON | OFF (default is ON) -
if this flag is ON, the jobs of the users that do
not belong to any VO will be reported under the
name "NO_VO". If the flag is OFF, the jobs will
not be reported.
VerboseLogging = ON | OFF (default is OFF) -
if the flag is ON, all the error messages will be
reported in the logs every time they occur with
the WARNING level; otherwise, the same type of
error messages will be reported from time to time
with WARNING level and the other times with
FINEST level.
CanSuspend = ON | OFF (default is OFF) - if
this flag is enabled, the module is suspended for
a period of time if there are 3 consecutive
executions with errors.
Note: The
parsing of the module's argument names is not case
sensitive. The "boolean" arguments (the ones having
ON/OFF as possible values) are considered to be ON if
their names appear in the list without any associated
value.
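(The example lines themselves are not preserved in this text; a sketch is shown before each explanation. The mapfile path is illustrative:)

*osgVO_JOBS{OsgVoJobs, localhost, mapfile=/opt/monitoring/user-vo-map.txt}%120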
Here, the module is initialized with a user-VO mapfile
other than the default one and will be run every 120
seconds. Apart from the "non-standard" mapfile, the
default settings are used (this means that information
will be collected only from the local condor_schedd
daemon).
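(Again a sketch, with an illustrative mapfile path:)

*osgVO_JOBS{OsgVoJobs, localhost, mapfile=/opt/monitoring/user-vo-map.txt, CondorUseGlobal}%120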
In this example the module is initialized with a
non-default mapfile, and it will execute the
"condor_q" command with the "-g" option, thus
providing information from all the submit machines in
the Condor pool.
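(A sketch; the Server value and the CondorUseLocal argument are taken from the explanation below:)

*osgVO_JOBS{OsgVoJobs, localhost, mapfile=/opt/monitoring/user-vo-map.txt, Server=tier2b.cacr.caltech.edu, CondorUseLocal}%120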
In this example, the module will gather data from
tier2b.cacr.caltech.edu, and also from the local
submit machine (because the argument CondorUseLocal
was given).
3.1.4. Logging levels
To change the logging level for this module's logger,
add/modify the following line in the ml.properties file:
lia.Monitor.modules.monOsgVoJobs.level = LEVEL
Value for LEVEL can be: SEVERE, WARNING, INFO,
FINE.
3.2. VO_IO
3.2.1. Description
The OsgVO_IO module holds statistical information about
the ftp traffic for OSG. The input and output values and
the rates represent values for the last time interval
(this interval is set before you run the ML service).
These values are displayed in the ML client and in the OSG
repository (integrated values).
3.2.2. Results Provided by the Module
There are two categories of VO parameters:
parameters that represent values obtained in the last
time interval (between the previous run of the module
and the current one) and parameters that represent
rates (calculated as the difference between the
current value of a parameter and the value obtained
at the previous run, divided by the length of the
time interval between runs).
The parameters that represent the values obtained in the
last time interval are:
ftpInput and ftpOutput (in KB) - the total ftp transfer
in the last time interval
ftpRateIn and ftpRateOut (in KB/s) - the rates for the
ftp traffic
ftpInput_SITENAME, ftpOutput_SITENAME, ftpRateIn_SITENAME
and ftpRateOut_SITENAME - the same meaning, but the
values represent the ftp transfer for a single domain
(SITENAME), for example ftpInput_caltech.edu
Another type of parameter represents the rate of each
individual ftp transfer (under the VO_IO_Transfers
cluster). The value of such a parameter is the size of one
transfer divided by the length of the time interval
specified in the log file by the START and DATA fields
(the difference between DATA and START).
There are also some "total" parameters, which
represent the sum of the parameters above for all the
VOs. Under the VO_IO_Totals cluster there is "Status"
node that indicates the status of the module (0 if
the execution was correct and non-zero in case of
error).
3.2.3. Module activation
In order to use the VO accounting modules, you should have
MonALISA 1.2.38 or newer. If you have the OSG
distribution, it is necessary to source the same two
scripts as for the VO_JOBS module. The module is then
activated with a line of the following form:
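(The activation line is not preserved in this text; a sketch consistent with the example at the end of this section:)

*osgVO_IO{OsgVO_IO, localhost, <arguments>}%TIME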
where <arguments> is a comma separated list.
Accepted arguments are:
ftplog=/path-to-ftplog - the location of the gridftp.log
file
mapfile=/path-to-mapfile - the location of the user-VO
map file (grid3-user-vo-map.txt)
debug - optional argument for displaying debug
information in the ML log file.
MixedCaseVOs - if this argument is given, the names of
the VOs will be displayed in mixed case (by default,
they are displayed in upper case)
TIME represents the interval in seconds between two calls
of the doProcess method.
The module needs two environment variables to be set:
for Globus: if you have Globus, you should set the
GLOBUS_LOCATION variable. This environment variable
should be set by sourcing the setup.sh file from your
OSG/ folder.
for VDT: if you have VDT, you should set the
VDT_LOCATION variable. This environment variable should
be set by sourcing the setup.sh file from your OSG/
folder. For OSG, the vdt folder is in the OSG folder
(OSG/vdt).
For example, in the OSG distribution ftplog and
mapfile are:
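(The values are not preserved in this text; illustrative forms, assuming a standard OSG/VDT layout and the default mapfile location mentioned earlier, might be:)

ftplog=${GLOBUS_LOCATION}/var/gridftp.log
mapfile=${MONALISA_HOME}/../monitoring/grid3-user-vo-map.txt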
It is not necessary to initialize the first two
arguments if the environment variables exist and are
set. In this case you can initialize the module in
this way:
*osgVO_IO{OsgVO_IO, localhost, }%180
3.2.4. Logging levels
To change the logging level for this module's logger,
add/modify the following line in the ml.properties file:
lia.Monitor.modules.monOsgVO_IO.level = LEVEL
Value for LEVEL can be: SEVERE, WARNING, INFO,
FINE.
3.3. PN_Condor, PN_PBS and PN_LSF
3.3.1. Description
The PN modules offer monitoring information about the
processing nodes of a cluster. The metrics provided are a
subset of the Ganglia metrics (see section 2), but the
information is obtained from a job manager running on the
cluster instead of Ganglia. Currently the modules work
with Condor, OpenPBS/Torque and LSF; the commands used to
obtain the nodes' status are condor_status, pbsnodes and
lshosts, respectively. When node statistics are enabled
(see the Statistics argument below), the following
parameters are reported:
Total Nodes - total number of nodes from the
Condor pool (a multi-processor machine counts as
a single node)
Total Slots - total number of slots (virtual
machines in Condor). For a multi-processor
machine, separate virtual machines are usually
created for each processor.
Total CPUs - total number of CPUs (should be equal to
the number of slots; if it is not the case, you should
add the SlotsFactor argument to the module - see
below).
Total Available Slots - total number of slots
that are available for Condor (i.e., the user is
not executing his/her own jobs on them)
Total Free Slots - total number of nodes
which are in the "free" state (can execute
incoming jobs)
Total Owner Slots - number of slots in "Owner" state (the user
is executing his/her own jobs on them)
The Statistics cluster also contains a "Status"
node which indicates the module's status (0 if it was
executed correctly and non-zero if there was an
error).
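The modules are activated with a line of the following form (a sketch, reconstructed from the examples later in this section):

*cluster_name{moduleName, localhost, <arguments>}%interval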
cluster_name - the cluster name for the
results that this module produces (PN_Condor,
PN_PBS or PN_LSF)
moduleName - name of module: monPN_Condor,
monPN_PBS, monPN_LSF (or, if running from usr_code: PN_Condor,
PN_PBS, PN_LSF)
<arguments> - list of arguments. The
arguments that may be passed to the modules are
Statistics, Server, SlotsFactor (only for
PN_Condor) and NodesLabel (only for PN_PBS).
If the Statistics argument appears in the list of
arguments, the module will provide an additional "cluster"
that contains statistics about the number of nodes in the
cluster, as described above.
The Server argument indicates the name of the PBS server /
Condor central manager that will be queried. For example,
Server=lcfg.rogrid.pub.ro
is a valid entry for this parameter. If this argument is
used for the PN_Condor module, the "condor_status" command
will be run with the "-pool" option, and for the PN_PBS
module the "pbsnodes" command will be run with the
"-server" option.
The "Server" argument
is optional and it can appear more than once in the
list, to specifiy multiple servers from which
information should be collected; if it doesn't
appear, the PBS server / Condor central manager
corresponding to the local machine will be used.
The "SlotsFactor"
argument can be used for Condor, in order to display
correctly the number of CPUs (the Total CPUs result),
if it is different from the number of Condor slots.
The number of CPUs will be calculated as the number
of Condor slots times the SlotsFactor; for example,
if you have 100 CPUs and 400 Condor slots, you should
set "SlotsFactor =
0.25".
CondorConstraints = <constraints> - with this
argument you can specify a constraint expression that will
be used with condor_status (for example, CondorConstraints
= HasCheckpointing==TRUE). Multiple Condor constraints can
be specified with an expression containing "&&"-s,
"||"-s, etc. (for example: CondorConstraints =
HasCheckpointing==TRUE&&TotalVirtualMachines<4).
Using quoted strings in the constraint expressions is a
little more complicated because the quotes must themselves
be quoted with 3 backslashes: CondorConstraints =
FileSystemDomain==\\\"cithep90.ultralight.org\\\"
NodesLabel=<label> - for PN_PBS, with this argument
you can specify a property label; the module will create
statistics only for the nodes that have this label. A
colon must be placed at the beginning of the label string
(e.g., NodesLabel=:mylabel).
Examples:
*PN_Condor{monPN_Condor, localhost}%120
Here, the PN_Condor module is used with the
default settings. The information will be obtained
from the local Condor central manager and no
statistics about the number of nodes will be
created.
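(The example line is not preserved in this text; a sketch consistent with the explanation:)

*PN_Condor{monPN_Condor, localhost, Server=pccil.cern.ch, Statistics}%120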
Here, only information from the Condor manager
running on pccil.cern.ch will be collected;
statistical information about the number of nodes
will also be provided.
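(Again a sketch:)

*PN_Condor{monPN_Condor, localhost, Server=cithep90.ultralight.org, CondorConstraints=HasCheckpointing==TRUE}%120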
In this example the module will provide information
collected from the cithep90.ultralight.org Condor manager,
restricted to the nodes that satisfy the condition
HasCheckpointing==TRUE.
*PN_PBS{monPN_PBS, localhost}%120
In this example the PN_PBS module is used with the
default settings. The information will be obtained
from the local PBS server and no statistics about the
number of nodes will be created.
*PN_PBS{monPN_PBS, localhost, Statistics}%120
In this example the module will provide
statistical information about the number of
nodes.
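(A sketch:)

*PN_PBS{monPN_PBS, localhost, Server=pccil.cern.ch, Statistics}%120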
Here, only information from the PBS server running
on pccil.cern.ch will be collected; statistical
information about the number of nodes will also be
provided.
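(A sketch:)

*PN_PBS{monPN_PBS, localhost, Server=gw01.rogrid.pub.ro, Server=lcfg.rogrid.pub.ro, Statistics}%120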
In this example, information is collected from the
gw01.rogrid.pub.ro and lcfg.rogrid.pub.ro servers,
and statistical data about the number of nodes is
provided.
*PN_LSF{monPN_LSF, localhost, Statistics}%120
In this example the PN_LSF module will provide
statistical information about the number of
nodes.
Note: The
verification of the parameter names for these modules
is case insensitive (i.e., you can write "statistics"
or "Statistics").
When the modules are run, some environment variables
should be set to indicate the locations of the available
queue managers:
for PBS: if you have PBS, you should set the
PBS_LOCATION variable; this variable should be
set such that the path to the pbsnodes command is
${PBS_LOCATION}/bin/pbsnodes.
for Condor: if you have Condor, you should
set the CONDOR_LOCATION variable; this variable
should be set such that the path to the
condor_status command is ${CONDOR_LOCATION}/bin/condor_status.
for LSF: if you have LSF, you should set the
LSF_LOCATION variable; this variable should be
set such that the path to the lshosts command is
${LSF_LOCATION}/bin/lshosts.
If you have the OSG distribution and you sourced the
OSG/setup.sh script, all the needed variables are already
set and it is not necessary to set any other environment
variables.
3.3.4. Logging levels
To change the logging level for these modules' loggers,
add/modify the following line in the ml.properties file:
lia.Monitor.modules.<module_name>.level = LEVEL
Value for LEVEL can be: SEVERE, WARNING, INFO,
FINE. Value for module_name can be: monPN_Condor,
monPN_PBS, monPN_LSF.
4. Network Monitoring
4.1. Network Traffic Monitoring using SNMP
Network I/O traffic from network elements can be
collected using one of the following modules:
snmp_IOpp - supports 32bit SNMP counters
snmp_IOpp_HC - supports 64bit SNMP counters (if the
device supports HC counters)
snmp_IOpp_v2 - uses a new SNMP library, supports both
32bit and 64bit counters, and is available since version
1.3.41
4.1.1. Configuration
snmp_IOpp and snmp_IOpp_HC
First, you should figure out which interfaces (identified
by IfIndex in SNMP) you want to monitor:
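(The original command and example are not preserved in this text. A common way to list the interface indexes and descriptions is snmpwalk; the router name and community are illustrative:)

snmpwalk -v 1 -c public myrouter.mysite.org ifDescr

A module line with mappings might then look like this sketch (the "ifIndex:description" mapping form is an assumption):

*WAN{snmp_IOpp, myrouter.mysite.org, "2:link-to-cern"}%30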
The second term in the mappings specifies the description
of the monitored interface and is also the name of the
parameter displayed in the ML clients.
snmp_IOpp_v2
This module is available since version 1.3.41. The
configuration is similar to that of the snmp_IOpp and
snmp_IOpp_HC modules. New features were added:
the module configuration accepts an optional SNMP
configuration parameter (the first one in the parameter
list) used to override the general SNMP
configuration.
it permits specifying the monitored interface in the
farm configuration by its SNMP description
(IfDescr).
it is able to autodetect high-counter support in the
SNMP agent, so it's not necessary to have two different
modules for this purpose.
for every monitored interface the link SPEED is also
reported.
For all internal SNMP modules the following general
parameters can be set in ml.properties:
# if you want a different community than public
# Default "public" community is used
lia.Monitor.SNMP_community=mycommunity
#snmp version
#(default ver 1 is used)
lia.Monitor.SNMP_version=2c
# Port for SNMP queries
# Default is 161
lia.Monitor.SNMP_port=1611
# UDP connections settings
#local address for UDP connections (default is )
#lia.Monitor.SNMP_localAddress=
#receive timeout (in ms)
#Default is 15000
lia.Monitor.SNMP_timeout = 20000
4.1.3. Display routers on GUI Client/Repository map
In the GUI we use the cluster names for certain functions.
The data from "WAN" clusters can be used to show the WAN
links on the map and their traffic in real time. In order
to have the routers shown on the GUI's and the
Repository's interactive map you need to use a "WAN"
cluster, so your myFarm.conf file should look like this:
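(The example is not preserved in this text; a sketch, with an illustrative router name and interface mapping, using the snmp_IOpp_v2 module:)

*WAN{snmp_IOpp_v2, myrouter.mysite.org, "link-to-cern"}%30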
The WAN links and how they appear in the clients' map are
configured by us, for the moment, so let us know the
endpoints of each connection (location and IP address)
that will appear on the 3D / 2D maps.
4.2. Tracepath and Traceroute
4.2.1. Description
The Tracepath module collects topology information by
using a list of hosts gathered from a central
synchronization service. These hosts are usually the ones
in the same group as the requester. This data is then
aggregated by the MonALISA GUI to display a full view of
the routers/hosts the data passes through.
Before using this module, please verify that other farms
in your group are using this module as well. Otherwise,
please write us an email at <support@monalisa.cern.ch>
to add your group to the allowed tracepath groups.
4.2.2. Results Provided by the Module
The parameters provided by the Tracepath module are:
x:ip_address - the IP address of the router found at
hop x.
x:no_reply - designates a hop that did not reply to the
ICMP request.
status - the exit status of the tracepath/traceroute
measurement.
Status values are as follows:
Table 1.1. Tracepath status information

-1 / STATUS_NOT_TRACED_YET - Not traced yet. The peer has
just been added and there is no data about it yet.
0 / STATUS_OK - The currently reported trace is ok.
1 / STATUS_TRACEPATH_FAILED - Tracepath failed during the
trace to the given node.
2 / STATUS_TRACEROUTE_FAILED - Traceroute failed during
the trace to the given node.
3 / STATUS_DESTINATION_UNREACHED - The destination (given
node) was unreachable.
4 / STATUS_REMOVED - The peer has been removed from the
configuration and will be deleted from the clients.
5 / STATUS_DISABLED - Neither tracepath nor traceroute
can be run - they either do not exist or are both
disabled.
6 / STATUS_INTERNAL_ERR - There is an internal
configuration problem with this peer; this should never
appear.
4.2.3. Module activation
In order to activate this module please add the following
line to the farm configuration file, e.g. myFarm.conf:
*Tracepath{monTracepath, localhost, " "}
4.2.4. Configuration
There is no actual configuration needed for this module.
Configuration updates, like the service URL and IPID
services, are done by using the LUS service.
The module needs outbound connectivity to the other hosts
in its group using ICMP, and TCP access to the Tracepath
services (URL service and IPID). Also, inbound
connectivity for ICMP packets is required for traceroute
to work. There are currently four topology services
running at these sites:
Table 1.2. Current topology services

ML Farms: primary monalisa.cern.ch:9095,
backup monalisa-chi.uslhcnet.org:8095
Vrvs: primary monalisa.cern.ch:9090,
backup monalisa-chi.uslhcnet.org:8090
4.2.5. Logging Levels
To change the logging level for this module's logger,
add/modify the following line in the ml.properties file:
lia.Monitor.modules.monTracepath.level = LEVEL
Value for level can be: SEVERE, WARNING, INFO,
FINE.
4.3. ABPing
4.3.1. Description
This monitoring module is used to perform simple network
measurements in a ping-like fashion. The difference
between ping and ABPing is the use of UDP packets instead
of ICMP. A synchronization service provides per-group and
per-node configuration data.
In order to activate this module please add the following
line to the farm configuration file, e.g. myFarm.conf:
*ABPing{monABPing, localhost, " "}
4.3.4. Configuration
There is no actual configuration needed for this module.
Configuration updates are done by the synchronization
service. Each group can host its own service; in this
case, ml.properties needs to be modified.
This module needs inbound and outbound connectivity on UDP
port 9000, and outbound connectivity to the
synchronization service (by default HTTP on
<monalisa.cern.ch>).
4.3.5. Logging Levels
To change the logging level for this module's logger,
add/modify the following line in the ml.properties file:
lia.Monitor.modules.monABPing.level = LEVEL
Value for LEVEL can be: SEVERE, WARNING, INFO, FINE.
4.4. Pathload Monitoring Module
4.4.1. Service Module
Available bandwidth information may be collected and
exported by MonALISA services by activating the
monPathload module. The measurements are controlled by a
coordination service. By using a token passing algorithm,
the service ensures measurement fairness within a given
group of hosts: no parallel measurements that cross the
same network segment are allowed to take place, and the
measurements are also timed.
For detailed information on how the measurements are
performed please refer to the Service
Application/AvailableBandwidth section.
The resulting ML parameters are AwBandwidth_Low and
AwBandwidth_High - the available bandwidth interval;
MeasurementDuration and MeasurementStatus - used to check
the sanity of the module; MegaBytesReceived and
FleetsSent. A positive MeasurementStatus indicates a
successful Pathload measurement; negative values represent
errors. A detailed explanation of each status code can be
viewed here.
The measurement result is the available bandwidth from the
Pathload sender (Pathload Node) to the Pathload receiver
(Pathload Cluster owner). For measuring the available
bandwidth from host A to B, the coordination service
usually schedules the reverse trip B->A right after
A->B.
This is the place where the MonALISA service will extract
the Pathload executables. The path is relative to
$MonaLisa_HOME. You can change this if the MonALISA user
can't write to Control/bin.
lia.util.Pathload.client.senderLogFile
(not set)
If set, the Pathload module will save
pathload_snd output to this file. The path
can be relative to $MonaLisa_HOME or it can
be an absolute path if it begins with /.
Ex: Service/myFarm/pathload_snd.log
lia.util.Pathload.client.receiverLogFile
(not set)
If set, the Pathload module will save
pathload_rcv output to this file. The path
can be relative to $MonaLisa_HOME or it can
be an absolute path if it begins with /.
Ex: Service/myFarm/pathload_rcv.log
4.4.1.4. Troubleshooting
Other common options in configuring monPathload are:
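(The option lines themselves are not preserved in this text; presumably logging-level properties such as the assumed line below:)

lia.app.abping.pathload.level = FINE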
Such options can be used for monitoring the monPathload
module's activity. (Preliminary checks are done and, if
the environment is not right, the module won't start. If
this is the case, please enable lia.app.abping.pathload
logging and check the logs.)
Your new connector for the site will be:
http://fully.qualified.hostname.com:port/PathloadConfig/PathloadConnector.
The admin page will be
http://fully.qualified.hostname.com:port/PathloadConfig
and will require you to authenticate with the username and
password from above.
4.4.2.3. Configuration
The servlet can be configured by selecting the Pathload
Setup menu.
Table 1.4. Pathload Configuration Service Options

Peer minWaitingTime
30 (s)
Time to wait before a peer is allowed to acquire a token
again. This is used to prevent peers from becoming
intrusive.

Peer maxAgingTime
300 (s)
Time of inactivity after which a peer is considered aged
and is removed from the cached peers. Each peer has to
report to the servlet every 30 s.

Token maxTokenAgingTime
150 (s)
Time after which a token will be declared lost and a new
token released. This usually happens in the case of
firewalled connections.
4.4.2.4. Troubleshooting
1. I get no data from some hosts, and the servlet logs
the following events:
[77] INFO Wed Feb 01 00:54:40 PST 2006 A new token Token xXgmmKR3DO63ln6B4kos1Q== from [yyy/x.x.x.x] to [zzz/x.x.x.x]was created.
[76] FINE Wed Feb 01 00:54:38 PST 2006 [xxx/x.x.x.x] refreshed its status.
[75] FINE Wed Feb 01 00:54:08 PST 2006 [xxx/x.x.x.x] refreshed its status.
[74] FINE Wed Feb 01 00:53:38 PST 2006 [xxx/x.x.x.x] refreshed its status.
[73] FINE Wed Feb 01 00:53:07 PST 2006 [xxx/x.x.x.x] refreshed its status.
[72] FINE Wed Feb 01 00:52:37 PST 2006 [xxx/x.x.x.x] refreshed its status.
[71] INFO Wed Feb 01 00:52:07 PST 2006 Token Token C2BCLcaY/Dexx+SI1+nr4g== from [xxx/x.x.x.x] to [ttt/x.x.x.x] aquired by [xxx/x.x.x.x]
The most probable cause is a firewall. If you have access
to one of the machines xxx or ttt, please look at the
output logs of Pathload (pathload_snd.log and
pathload_rcv.log). If you see something like:
Sending fleet 1#
Waiting for connections ...
then the connection is firewalled on one end. Both sender
and receiver require inbound connections to ports TCP
55002 and UDP 55001.
2. I installed the WAR, the ML Service
seems to get its configuration but the
PathloadStatus page is blank.