MonALISA Service Configuration Guide


Chapter 1. MonALISA Configuration Guide

There are three configuration files that the user can modify to specify the farm service environment and characteristics: a global configuration file ($MonaLisa_HOME/Service/CMD/ml_env) and two files used by MonALISA itself ($MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/ml.properties and $MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/<YOUR_FARM_CONF_FILE>.conf).

1. Global configuration file

The global configuration file is $MonaLisa_HOME/Service/CMD/ml_env. The variables that the user must or may set are:

MONALISA_USER

the name of the user account that runs the service. The service will not start from any other account, nor from the root account.

JAVA_HOME

the path to your current JDK.

SHOULD_UPDATE

whether MonALISA should check for updates at startup. If set to "true", MonALISA first checks for updates and then starts; if set to "false", it starts without checking. The same parameter also controls automatic updates while the service is running. Please see Section 8, “How to start a Monitoring Service with Autoupdate”, in this guide.

MonaLisa_HOME

path to your MonALISA installation directory. Environment variables can also be used (e.g. ${HOME}/MonaLisa).

FARM_HOME

path to the directory where your farm-specific files reside. It is best to place this directory inside the Service directory; you can use the MonaLisa_HOME variable defined above (e.g. FARM_HOME="${MonaLisa_HOME}/Service/MyTest"). MonALISA comes with a simple example in ${MonaLisa_HOME}/Service/myFarm.

FARM_CONF_FILE

the file used at service startup to define the clusters, nodes and monitoring modules to be used. It should be in the ${FARM_HOME} directory (e.g. FARM_CONF_FILE="${FARM_HOME}/mytest.conf").

FARM_NAME

the name of your farm (e.g. FARM_NAME="MyTest"). Please use a short name that describes the site on which MonALISA is running.

JAVA_OPTS

an optional variable used to pass options directly to the Java Virtual Machine (e.g. JAVA_OPTS="-Xmx128m").
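
Putting these together, a minimal ml_env sketch could look like the following (all values are illustrative placeholders based on the examples above, not defaults shipped with MonALISA):

    # $MonaLisa_HOME/Service/CMD/ml_env -- illustrative values only
    MONALISA_USER="monalisa"                      # account that runs the service (not root)
    JAVA_HOME="/usr/java/latest"                  # path to your JDK
    SHOULD_UPDATE="true"                          # check for updates at startup / autoupdate
    MonaLisa_HOME="${HOME}/MonaLisa"              # MonALISA installation directory
    FARM_HOME="${MonaLisa_HOME}/Service/MyTest"   # directory with farm-specific files
    FARM_CONF_FILE="${FARM_HOME}/mytest.conf"     # cluster/node/module definitions
    FARM_NAME="MyTest"                            # short site name
    JAVA_OPTS="-Xmx128m"                          # extra JVM options (optional)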

2. The MonALISA properties

The file $MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/ml.properties is specific to your farm configuration.

You can specify here:

  • what lookup services to use (lia.Monitor.LUSs);

  • the jini groups that your service should join (lia.Monitor.group);

  • the location of the farm server (MonaLisa.LAT, MonaLisa.LONG, MonaLisa.Country);

  • Web Services settings (lia.Monitor.startWSDL=true starts the MonALISA web service, lia.Monitor.wsdl_port);

  • database configuration (lia.Monitor.keep_history - how long to keep data in the farm database, parameters to configure the database tables, etc.);

  • parameters for logging (.level - the logging level - defaults to INFO, etc.)

You will find explanatory comments before every field describing how to set it correctly.
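
As a short illustration, a hypothetical ml.properties excerpt covering some of the settings above might look like this (hostnames, coordinates and the WSDL port are placeholders; the comments in the shipped file describe the correct values for your site):

    # Lookup services and jini group (placeholder values)
    lia.Monitor.LUSs=monalisa.cern.ch,monalisa.cacr.caltech.edu
    lia.Monitor.group=your_group
    # Geographic location of the farm server (placeholder coordinates)
    MonaLisa.LAT=46.2
    MonaLisa.LONG=6.0
    MonaLisa.Country=CH
    # Web Services
    lia.Monitor.startWSDL=true
    lia.Monitor.wsdl_port=6004
    # Logging level
    .level=INFO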

3. The MonALISA Service Configuration

The MonALISA service uses a very simple configuration file to generate the site configuration and the modules to be used for collecting monitoring information. Through the administrative interface (over an SSL connection) the user can dynamically change the configuration and the modules used to collect data.

It is possible to use the built-in modules (for SNMP, local or remote /proc files, ...) or external modules. We provide several modules which allow exchanging information with other monitoring tools. These modules are really very simple and users can also develop their own.

Below we present a simple configuration example. This is the .conf file from your Service/<FARM_DIRECTORY> directory.

For a complete list of the available monitoring modules please refer to Monitoring Modules Base.

3.1. Service monitoring configuration

Example 1.1. 

        *Master   
        >citgrid3.cacr.caltech.edu citgrid3   
        monProcLoad%30   
        monProcStat%30   
        monProcIO%30

        *ABPing{monABPing, citgrid3.cacr.caltech.edu, " "}   

        *PN_CIT
        >c0-0   
        snmp_Load%30   
        snmp_IO%30   
        snmp_CPU%30   
        >c0-1   
        snmp_Load%30   
        snmp_IO%30   
        snmp_CPU%30   
        >c0-2   
        snmp_Load%30   
        snmp_IO%30   
        snmp_CPU%30   
        >c0-3  
        snmp_Load%30   
        snmp_IO%30   
        snmp_CPU%30

The first line (*Master) defines a Functional Unit (or Cluster). The second line (>citgrid3.cacr.caltech.edu citgrid3) adds a node to this Functional Unit and, optionally, an alias for it. The lines:

        monProcLoad%30   
        monProcIO%30   
        monProcStat%30

define three monitoring modules to be used on the node "citgrid3". These measurements are performed periodically, every 30s. The monProc* modules use the local /proc files to collect information about the CPU, load and IO. In this case the node is the master node of the cluster, where the MonALISA service itself runs, and simple modules based on the /proc files are used to collect data.

The line:

        *ABPing{monABPing, citgrid3.cacr.caltech.edu, " "}

defines a Functional Unit named "ABPing" which uses the internal module monABPing. This module performs simple network measurements using small UDP packets. It requires as its first parameter the full name of the system corresponding to the real IP address on which the ABPing server is running (as part of the MonALISA service). The second parameter is not used. These ABPing measurements provide information about the quality of connectivity among different centers and are also used to dynamically compute optimal connectivity trees (minimum spanning tree, minimum path from any node to all the others, ...).

    *PN_CIT

defines a new cluster name, used for a set of processing nodes at the site. The string "PN" in the name is necessary if the user wants to automatically use filters that generate global views for all these processing units.

The cluster is followed by a list of nodes and, for each node, a list of modules used to collect monitoring information from it. For each module a repetition time is defined (%30), which means that the module is executed once every 30s. Defining the repetition time is optional; the default value is 30s.
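
For example, a hypothetical additional node entry could rely on the default repetition time for one module and override it for another:

        >c0-4
        snmp_Load
        snmp_IO%60

Here snmp_Load runs every 30s (the default) while snmp_IO runs every 60s.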

4. Database support configuration

The configuration options relevant to the storage are set in the FARMNAME/ml.properties file:

    lia.Monitor.use_emysqldb=true|false

this will unpack and use the embedded MySQL (if present)

    lia.Monitor.use_epgsqldb=true|false

the same, for the embedded PostgreSQL

If none of these options is enabled then the following options are relevant for the database server selection:

    lia.Monitor.jdbcDriverString=
    com.mckoi.JDBCDriver      or
    com.mysql.jdbc.Driver     or
    org.postgresql.Driver

McKoi is the default database if nothing else is available, but we do not recommend it for storing large data volumes. If you have a standalone database server you should disable the embedded databases and specify the MySQL or PostgreSQL driver here accordingly. The following parameters are the connection parameters for the JDBC driver.

    lia.Monitor.ServerName=IP_ADDRESS
    lia.Monitor.DatabasePort=TCP_PORT
    lia.Monitor.DatabaseName=DB_NAME
    lia.Monitor.UserName=DB_USERNAME
    lia.Monitor.Pass=DB_PASSWORD
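
Putting these options together, a hypothetical configuration for a standalone PostgreSQL server might look like the following (host, port, database name and credentials are placeholders for your own values):

    lia.Monitor.use_emysqldb=false
    lia.Monitor.use_epgsqldb=false
    lia.Monitor.jdbcDriverString=org.postgresql.Driver
    lia.Monitor.ServerName=127.0.0.1
    lia.Monitor.DatabasePort=5432
    lia.Monitor.DatabaseName=mon_data
    lia.Monitor.UserName=mon_user
    lia.Monitor.Pass=db_password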

The actual database structure is determined by the following options:

    lia.Monitor.Store.TransparentStoreFast.web_writes=N

this option specifies the number of tables that are used. For each X = 0 .. N-1 you should have:

    lia.Monitor.Store.TransparentStoreFast.writer_X.total_time=SECONDS
    lia.Monitor.Store.TransparentStoreFast.writer_X.table_name=UNIQUE_NAME
    lia.Monitor.Store.TransparentStoreFast.writer_X.writemode=MODE
    lia.Monitor.Store.TransparentStoreFast.writer_X.samples=SAMPLES
    lia.Monitor.Store.TransparentStoreFast.writer_X.descr=UNIQUE_STRING

SECONDS specifies the time period for which the data is kept in the database. Data older than now()-SECONDS will be automatically deleted.

The "table_name" and "descr" should be unique among the other options of the same kind. "table_name" must be a valid database table name (no spaces and so on), "descr" can be any string you like.

You can store data in either averaged or raw mode. When using an averaged mode, the SAMPLES value determines the number of values that are kept for the specified interval. For example, if you want to store a single value each minute for a year you should specify SECONDS=31536000 and SAMPLES=SECONDS/60=525600. This is applied separately for each parameter that you store, so such a database can become rather large.
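
As another worked example (with a hypothetical writer index), keeping one averaged value every 5 minutes for 30 days means SECONDS = 30*86400 = 2592000 and SAMPLES = 2592000/300 = 8640:

    lia.Monitor.Store.TransparentStoreFast.writer_0.total_time=2592000
    lia.Monitor.Store.TransparentStoreFast.writer_0.samples=8640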

MODE has these possible values:

0: averaged mode

the table structure will be

        rectime | farm | cluster | node | function | mval   | mmin   | mmax
        long    | text | text    | text | text     | double | double | double
1: raw mode

same structure as 0

2: raw mode for storing abstract Object values

seldom used

3: averaged mode, data is only kept in memory

to control the maximum size of the in-memory buffer use:

    lia.Monitor.Store.TransparentStoreFast.writer_X.countLimit

if set to -1 then only the time limit given by SECONDS is relevant

4: raw mode, in memory, same as 3 but without data averaging
5, 6 : averaged / raw modes for an optimized table structure

each farm/cluster/node/function combination is given a unique ID, stored in the monitor_ids table, and the database structure becomes:

    rectime | id | mval | mmin | mmax

this option is the best for large data volumes with ever-changing parameter names (for example netflow data acquisition)

7,8 : averaged / raw modes for another ID-related structure

for each unique ID a separate table is kept with the data from that series only, the table name will be UNIQUE_NAME_id and the structure is

    rectime | mval | mmin | mmax

This option is the best one when the data series are constant in time; it works well with up to 10000 table names (10000 unique IDs if you have a single table writer, 5000 unique IDs if you define 2 separate writers, and so on).

Important

Modes 7 and 8 only work with PostgreSQL because of some stored procedures needed to improve response times.

For a large data repository we would recommend using PostgreSQL with something like:

    lia.Monitor.Store.TransparentStoreFast.web_writes = 2
    
    lia.Monitor.Store.TransparentStoreFast.writer_0.total_time=31536000
    lia.Monitor.Store.TransparentStoreFast.writer_0.samples=525600
    lia.Monitor.Store.TransparentStoreFast.writer_0.table_name=monitor_1y_1min
    lia.Monitor.Store.TransparentStoreFast.writer_0.descr=1y 1min
    lia.Monitor.Store.TransparentStoreFast.writer_0.writemode=7
        
    lia.Monitor.Store.TransparentStoreFast.writer_1.total_time=31536000
    lia.Monitor.Store.TransparentStoreFast.writer_1.samples=5256
    lia.Monitor.Store.TransparentStoreFast.writer_1.table_name=monitor_1y_100min
    lia.Monitor.Store.TransparentStoreFast.writer_1.descr=1y 100min
    lia.Monitor.Store.TransparentStoreFast.writer_1.writemode=7

We define 2 separate writers with different averaging intervals (1min and 100min) so the repository can use the appropriate one in different situations. For example, when plotting a 1-hour chart it will choose the 1min table, but when plotting a 6-month chart it will choose the 100min one, reducing the number of operations needed to plot the data. A single writer would limit either the data resolution or the response speed, while more than 2 writers add significant overhead and extra disk usage without much benefit.

Whatever storage type you use, there is a memory buffer that is used in parallel with the disk storage (if any). Its size depends on the maximum JVM memory (the -Xmx parameter) and is dynamically adjusted so that it does not use all of the available memory. When a history query is made this buffer is the first source of data; if more data is needed, a separate database query is executed to retrieve the remaining interval. In a repository you can see the current buffer status in http://......./info.jsp, looking for something like:

    Data cache: values: 252275/262144 (max 262144), time frame: 2:13:29, served  requests: 16490

This tells you the number of values in the buffer, the period of time it holds and how many requests were served from it.

5. How to setup the configuration files for your site

  • Go to the "MonaLisa"/Service directory and create a directory for your site (e.g. MySite). You may copy the configuration files from one of the available site directories (e.g. from the "MonaLisa"/Service/TEST directory). You must include the following files in your new farm directory: ml.properties, db.conf.embedded and my_test.conf. A shell sketch of these steps is given after this list.

  • Edit the configuration file (my_site.conf) to reflect the environment you want to monitor.

  • Edit ml.properties if you would like to change the Lookup Discovery Services that will be used or if you would like to use another DB System.

  • You may add a myIcon.gif file with an icon of your organization in "MonaLisa"/Service/ml_dl.
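
A minimal shell sketch of the steps above, assuming the TEST site directory is used as the template and the new site is called MySite:

    # create the new farm directory and copy the template configuration files
    cd ${MonaLisa_HOME}/Service
    mkdir MySite
    cp TEST/ml.properties TEST/db.conf.embedded TEST/my_test.conf MySite/
    # then edit MySite/my_test.conf and MySite/ml.properties as described above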

The only script used to start/stop/restart "MonaLisa" is ML_SER, located in the Service/CMD directory. After completing the steps described in Section 3, “The MonALISA Service Configuration”, you can start MonALISA:

    Service/CMD/ML_SER start

6. How to start a Monitoring Service from init.d

Please set the MonaLisa_HOME and MONALISA_USER variables correctly in ${MonaLisa_HOME}/Service/CMD/MLD.

For Red Hat-like systems:
    #cp ${MonaLisa_HOME}/Service/CMD/MLD /etc/init.d
    #chkconfig --add MLD
    #chkconfig --level 345 MLD on
For Debian:
    #cp ${MonaLisa_HOME}/Service/CMD/MLD /etc/init.d
    #update-rc.d MLD start 80 3 4 5 .
    #update-rc.d MLD stop 86 3 4 5 .

7. Connectivity requirements for Monitoring Service

The MonALISA service needs only outbound TCP connectivity to the following hosts:

LUS servers: TCP ports 4160, 8765 and 8288
  - monalisa.cern.ch
  - monalisa.cacr.caltech.edu

Port 8765 is used for lease renewal and service discovery. When the service is started, or when network problems cause it to lose its registration with the LUSs, it must also be able to reach the other two ports, 4160 and 8288. These are very short-lived TCP connections, used only to bootstrap the registration mechanism.

Proxy servers: TCP ports 6001, 6002, 6003
  - monalisa.cern.ch
  - monalisa2.cern.ch
  - monalisa.caltech.edu
  - monalisa-ul.caltech.edu
  - monalisa.cacr.caltech.edu

The TCP connection to port 6001 is long-lived and is used for communication between ML services and proxy services. Very short-lived TCP connections are also made to ports 6002 and 6003 whenever a ML service first discovers a proxy service.

Web servers: TCP port 80
  - monalisa.cern.ch
  - monalisa.cacr.caltech.edu
  - monalisa.caltech.edu

These are used to autoupdate the service. Before starting the service you should check that the above hosts/ports can be reached.
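
One way to verify this beforehand is a quick loop over the hosts and ports; this is not part of MonALISA itself and assumes the nc (netcat) utility is available:

    # check outbound TCP reachability of a few of the hosts/ports listed above (illustrative)
    for hp in monalisa.cern.ch:4160 monalisa.cern.ch:8765 monalisa.cern.ch:6001 \
              monalisa.cacr.caltech.edu:8288 monalisa.caltech.edu:80; do
        host=${hp%:*}; port=${hp#*:}
        nc -z -w 5 "$host" "$port" && echo "$host:$port OK" || echo "$host:$port unreachable"
    done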

8. How to start a Monitoring Service with Autoupdate

This allows your Monitoring Service to be updated automatically. A cron script periodically checks for updates using a list of URLs. When a new version is published, the system verifies its digital signature and downloads the new distribution as a set of signed jar files. When this operation is completed the MonALISA service restarts automatically. The dependencies and configurations related to the service are handled in a way very similar to the Web Start technology.

This functionality makes it very easy to maintain and run a MonALISA service. We recommend using it!

To use it, add "MonaLisa"/Service/CMD/CHECK_UPDATE to the crontab of the user that runs MonALISA. To edit your crontab use: $ crontab -e

Add the following line:

    */20 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE

This checks for updates every twenty minutes; a reasonable interval is twenty minutes or more. To check for updates every 30 minutes, add the following line instead of the one above:

    */30 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE

To disable autoupdate, edit the ml_env file in "MonaLisa"/Service/CMD and set SHOULD_UPDATE="false". There is no need to remove the CHECK_UPDATE script from your crontab.

Launch "MonaLisa"/Service/CMD/ML_SER start. MonALISA should check for updates now.