There are three configuration files that the user can
modify to specify the farm service environment and
characteristics: a global configuration file
($MonaLisa_HOME/Service/CMD/ml_env), and two others used by
MonALISA itself
($MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/ml.properties
and
$MonaLisa_HOME/Service/<YOUR_FARM_DIRECTORY>/<YOUR_FARM_CONF_FILE>.conf).
3. The MonALISA Service Configuration
The MonALISA service uses a very simple configuration file
to generate the site configuration and to select the
modules used for collecting monitoring information. Using
the administrative interface over an SSL connection, the
user may dynamically change the configuration and the
modules used to collect data.
It is possible to use the built-in modules (for snmp,
local or remote /proc files...) or external modules. We
provide several modules which allow exchanging information
with other monitoring tools. These modules are very simple
and users can also develop their own modules.
Below we present a simple configuration example. This file
is the .conf file from your Service/<FARM_DIRECTORY>
directory.
For a complete list of the available monitoring modules
please refer to Monitoring Modules Base.
3.1. Service monitoring configuration
Example 1.1.
*Master
>citgrid3.cacr.caltech.edu citgrid3
monProcLoad%30
monProcStat%30
monProcIO%30
*ABPing{monABPing, citgrid3.cacr.caltech.edu, " "}
*PN_CIT
>c0-0
snmp_Load%30
snmp_IO%30
snmp_CPU%30
>c0-1
snmp_Load%30
snmp_IO%30
snmp_CPU%30
>c0-2
snmp_Load%30
snmp_IO%30
snmp_CPU%30
>c0-3
snmp_Load%30
snmp_IO%30
snmp_CPU%30
The first line (*Master) defines a Functional Unit (or
Cluster). The second line
(>citgrid3.cacr.caltech.edu citgrid3) adds a node to this
Functional Unit and, optionally, an alias for it. The
lines:
monProcLoad%30
monProcIO%30
monProcStat%30
define three monitoring modules to be used on the node
"citgrid3". These measurements are done periodically,
every 30s. The monProc* modules use the local /proc files
to collect information about the cpu, load and IO. In this
case the node is the master node of a cluster, where the
MonALISA service itself is running, so simple modules that
read the local /proc files are enough to collect the data.
The line:
*ABPing{monABPing, citgrid3.cacr.caltech.edu, " "}
defines a Functional Unit named "ABPing" which uses the
internal module monABPing. This module performs simple
network measurements using small UDP packets. Its first
parameter must be the full name of the system,
corresponding to the real IP on which the ABPing server is
running (as part of the MonALISA service). The second
parameter is not used. These ABPing measurements provide
information about the quality of connectivity among
different centers and are also used for dynamically
computing optimal trees for connectivity (minimum spanning
tree, minimum path from any node to all the others...).
*PN_CIT
defines a new cluster name. This is for a set of
processing nodes used by the site. The string "PN" in the
name is necessary if the user wants to automatically use
filters to generate global views for all these processing
units.
The cluster line is followed by the list of nodes in the
cluster and, for each node, the list of modules used to
get monitoring information from that node. For each module
a repetition time is defined (%30), which means that the
module is executed once every 30s. Defining the repetition
time is optional; the default value is 30s.
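To make the layout of the .conf file concrete, here is a minimal Python sketch that reads the structure described above (cluster lines starting with "*", node lines starting with ">", and module lines with an optional "%seconds" suffix). It is only an illustration of the format as explained in this section, not the parser used by MonALISA; the function name parse_farm_conf is hypothetical, and the optional alias and the "{...}" arguments of a cluster line are simply ignored.

```python
DEFAULT_INTERVAL = 30  # seconds; the default repetition time stated in the text


def parse_farm_conf(text):
    """Return {cluster: {node: [(module, interval_seconds), ...]}}.

    Illustrative only: aliases and "{module, host, args}" suffixes are dropped.
    """
    farm = {}
    cluster = node = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            # Cluster (Functional Unit) line, e.g. "*Master" or "*ABPing{...}".
            cluster = line[1:].split("{", 1)[0].strip()
            farm[cluster] = {}
            node = None
        elif line.startswith(">"):
            # Node line: host name followed by an optional alias.
            if cluster is None:
                raise ValueError("node defined before any cluster: " + line)
            parts = line[1:].split()
            if not parts:
                raise ValueError("node line without a host name")
            node = parts[0]
            farm[cluster][node] = []
        else:
            # Module line, e.g. "snmp_Load%30"; the interval is optional.
            if node is None:
                raise ValueError("module defined before any node: " + line)
            name, _, interval = line.partition("%")
            seconds = int(interval) if interval.strip() else DEFAULT_INTERVAL
            farm[cluster][node].append((name.strip(), seconds))
    return farm


example = """*Master
>citgrid3.cacr.caltech.edu citgrid3
monProcLoad%30
monProcStat
*PN_CIT
>c0-0
snmp_Load%30
"""
for cluster_name, nodes in parse_farm_conf(example).items():
    print(cluster_name, nodes)
```

A module line without "%" falls back to the 30s default, which mirrors the behaviour described in the paragraph above.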
4. Database support configuration
The configuration options relevant to the storage are set
in the FARMNAME/ml.properties file:
lia.Monitor.use_emysqldb=true|false
this will unpack the embedded MySQL (if any)
lia.Monitor.use_epgsqldb=true|false
the same, for the embedded PostgreSQL
If neither of these options is enabled then the following
options are relevant for the database server selection:
lia.Monitor.jdbcDriverString=com.mckoi.JDBCDriver|com.mysql.jdbc.Driver|org.postgresql.Driver
McKoi is the default database if nothing else is
available, but we don't recommend using it for storing
large data structures. If you have a standalone database
server you should disable the embedded databases and
specify the mysql or postgresql driver here accordingly.
The following parameters are the connection parameters
for the JDBC driver.
lia.Monitor.ServerName=IP_ADDRESS
lia.Monitor.DatabasePort=TCP_PORT
lia.Monitor.DatabaseName=DB_NAME
lia.Monitor.UserName=DB_USERNAME
lia.Monitor.Pass=DB_PASSWORD
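The selection rules above can be summarised with a short Python sketch. It is a reading aid only and makes several assumptions that the text does not confirm: a missing option is treated as "false", the embedded MySQL flag is checked before the embedded PostgreSQL flag when both are set, and the properties file is parsed as plain key=value lines (the real Java properties syntax is richer). The function names are hypothetical.

```python
def load_properties(text):
    """Parse simple key=value lines; comments and blank lines are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def storage_backend(props):
    """Describe which database the options above would select (assumed order)."""
    def enabled(key):
        # Assumption: an absent option behaves like "false".
        return props.get(key, "false").lower() == "true"

    if enabled("lia.Monitor.use_emysqldb"):
        return "embedded mysql"
    if enabled("lia.Monitor.use_epgsqldb"):
        return "embedded postgresql"
    # McKoi is the default when no driver is given, as stated in the text.
    return props.get("lia.Monitor.jdbcDriverString", "com.mckoi.JDBCDriver")


sample = """
lia.Monitor.use_emysqldb=false
lia.Monitor.use_epgsqldb=false
lia.Monitor.jdbcDriverString=org.postgresql.Driver
"""
print(storage_backend(load_properties(sample)))
```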
The actual database structure is determined by the
following options:
lia.Monitor.Store.TransparentStoreFast.web_writes=N
this option specifies the number of tables that are
used. For each X=0..N-1 you should have:
lia.Monitor.Store.TransparentStoreFast.writer_X.total_time=SECONDS
lia.Monitor.Store.TransparentStoreFast.writer_X.table_name=UNIQUE_NAME
lia.Monitor.Store.TransparentStoreFast.writer_X.writemode=MODE
lia.Monitor.Store.TransparentStoreFast.writer_X.samples=SAMPLES
lia.Monitor.Store.TransparentStoreFast.writer_X.descr=UNIQUE_STRING
SECONDS specifies the time period for which the data is
stored in the database. Data older than now()-SECONDS
will be automatically deleted.
The "table_name" and "descr" values must each be unique
among the writers. "table_name" must be a valid database
table name (no spaces and so on); "descr" can be any
string you like.
You can store data in either averaged or raw mode. When
using an averaged mode, the SAMPLES value determines the
number of values that are kept for the specified interval.
For example, if you want to store a single value each
minute for a year you should specify SECONDS=31536000 and
SAMPLES=SECONDS/60=525600. This is applied separately for
each parameter that you store, so such a database can
become rather large.
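The arithmetic behind SAMPLES can be checked with a few lines of Python. This is a minimal sketch; the helper name samples_for is hypothetical and simply divides the retention period by the desired averaging interval, as in the example above.

```python
def samples_for(total_time_s, interval_s):
    """Number of averaged values kept for a retention period.

    total_time_s: retention period in seconds (the SECONDS value)
    interval_s:   desired spacing between stored values, in seconds
    """
    if total_time_s <= 0 or interval_s <= 0:
        raise ValueError("both arguments must be positive")
    return total_time_s // interval_s


ONE_YEAR = 365 * 24 * 3600          # 31536000 seconds
print(samples_for(ONE_YEAR, 60))    # one value per minute  -> 525600
print(samples_for(ONE_YEAR, 6000))  # one value per 100 min -> 5256
```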
MODE has these possible values:
- 0: averaged mode
  the table structure will be
  rectime | farm | cluster | node | function | mval   | mmin   | mmax
  long    | text | text    | text | text     | double | double | double
- 1: raw mode
  same structure as 0
- 2: raw mode for storing abstract Object values
  seldom used
- 3: averaged mode, data is only kept in memory
  to control the maximum size of the in-memory buffer use:
  lia.Monitor.Store.TransparentStoreFast.writer_X.countLimit
  if set to -1 then only the time limit given by SECONDS
  is relevant
- 4: raw mode, in memory, same as 3 but without data
  averaging
- 5, 6: averaged / raw modes for an optimized table
  structure
  each farm/cluster/node/function combination is given a
  unique ID, stored in the monitor_ids table, and the
  database structure is now:
  rectime | id | mval | mmin | mmax
  this option is the best for large data volumes with
  always-changing parameter names (for example netflow
  data acquisition)
- 7, 8: averaged / raw modes for another ID-related
  structure
  for each unique ID a separate table is kept with the
  data from that series only; the table name will be
  UNIQUE_NAME_id and the structure is
  rectime | mval | mmin | mmax
  this option is the best one when the data series are
  constant in time; it works well with up to 10000 table
  names (10000 unique ids if you have a single table
  writer, 5000 unique ids if you define 2 separate writers
  and so on).
Important
Modes 7 and 8 only work with PostgreSQL because of
some stored procedures needed to improve response
times.
For a large data repository we would recommend using
PostgreSQL with something like:
lia.Monitor.Store.TransparentStoreFast.web_writes = 2
lia.Monitor.Store.TransparentStoreFast.writer_0.total_time=31536000
lia.Monitor.Store.TransparentStoreFast.writer_0.samples=525600
lia.Monitor.Store.TransparentStoreFast.writer_0.table_name=monitor_1y_1min
lia.Monitor.Store.TransparentStoreFast.writer_0.descr=1y 1min
lia.Monitor.Store.TransparentStoreFast.writer_0.writemode=7
lia.Monitor.Store.TransparentStoreFast.writer_1.total_time=31536000
lia.Monitor.Store.TransparentStoreFast.writer_1.samples=5256
lia.Monitor.Store.TransparentStoreFast.writer_1.table_name=monitor_1y_100min
lia.Monitor.Store.TransparentStoreFast.writer_1.descr=1y 100min
lia.Monitor.Store.TransparentStoreFast.writer_1.writemode=7
We define 2 separate writers with different averaging
intervals (1min and 100min) so the repository can use the
proper one in different situations. For example, when
plotting a 1-hour chart it will choose the 1min table,
but if you plot a 6-month chart it will choose the 100min
one, reducing the number of operations needed to plot
that data. A single writer would limit either the data
resolution or the response speed, while more than 2
writers add considerable overhead and extra disk usage
without much benefit.
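Writing the writer_X options by hand is error-prone, because "table_name" and "descr" must be unique and web_writes must match the number of writers. The Python sketch below renders the property lines from a list of writer descriptions and checks those two constraints. It only uses the option names shown in this section; the function name writer_properties and the dictionary layout are hypothetical.

```python
PREFIX = "lia.Monitor.Store.TransparentStoreFast"
FIELDS = ("total_time", "samples", "table_name", "descr", "writemode")


def writer_properties(writers):
    """Render ml.properties lines for a list of writer dicts (keys = FIELDS)."""
    for field in ("table_name", "descr"):
        values = [w[field] for w in writers]
        if len(set(values)) != len(values):
            raise ValueError(field + " must be unique among the writers")
    # web_writes is the number of writers; writers are numbered from 0.
    lines = ["%s.web_writes=%d" % (PREFIX, len(writers))]
    for index, writer in enumerate(writers):
        for field in FIELDS:
            lines.append("%s.writer_%d.%s=%s" % (PREFIX, index, field, writer[field]))
    return lines


writers = [
    {"total_time": 31536000, "samples": 525600,
     "table_name": "monitor_1y_1min", "descr": "1y 1min", "writemode": 7},
    {"total_time": 31536000, "samples": 5256,
     "table_name": "monitor_1y_100min", "descr": "1y 100min", "writemode": 7},
]
print("\n".join(writer_properties(writers)))
```

The output reproduces the same options as the recommended configuration above, which makes it easy to experiment with other retention periods before editing ml.properties.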
Whatever storage type you use, there is a memory buffer
that is used in parallel with the disk storage (if any).
Its size depends on the maximum JVM memory (-Xmx
parameter) and is dynamically adjusted so that it doesn't
use all of the available memory. When making a history
query this buffer is the first source of data; if more
data is needed, a separate database query is executed to
retrieve the remaining interval. In a repository you can
see the current buffer status in http://......./info.jsp;
look for something like:
Data cache: values: 252275/262144 (max 262144), time frame: 2:13:29, served requests: 16490
This tells you the number of values in the buffer, what
period of time it holds and how many requests were served
from this buffer.
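If you want to track the buffer status over time, the status line can be picked apart with a regular expression. The Python sketch below is based only on the sample line shown above; the exact wording of the line in other versions, the meaning of the two limits after the slash, and the H:MM:SS reading of the time frame are assumptions, and the function name parse_cache_status is hypothetical.

```python
import re

CACHE_LINE = re.compile(
    r"Data cache: values: (\d+)/(\d+) \(max (\d+)\), "
    r"time frame: ([\d:]+), served requests: (\d+)"
)


def parse_cache_status(line):
    """Return the buffer figures as a dict, or None if the line does not match."""
    match = CACHE_LINE.search(line)
    if match is None:
        return None
    values, limit, max_limit, frame, served = match.groups()
    # Assumption: the time frame is colon-separated, e.g. hours:minutes:seconds.
    seconds = 0
    for part in frame.split(":"):
        seconds = seconds * 60 + int(part)
    return {
        "values": int(values),
        "limit": int(limit),          # assumed: current buffer limit
        "max_limit": int(max_limit),  # assumed: upper bound for the limit
        "time_frame_s": seconds,
        "served_requests": int(served),
    }


status = parse_cache_status(
    "Data cache: values: 252275/262144 (max 262144), "
    "time frame: 2:13:29, served requests: 16490"
)
print(status)
```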
8. How to start a Monitoring Service with Autoupdate
This allows your Monitoring Service to be updated
automatically. The cron script periodically checks for
updates using a list of URLs. When a new version is
published, the system checks its digital signature and
then downloads the new distribution as a set of signed jar
files. When this operation is completed, the MonALISA
service restarts automatically. The dependencies and the
configuration of the service are handled in a way very
similar to the Web Start technology.
This functionality makes it very easy to maintain and run
a MonALISA service. We recommend using it!
In this case you should add
"MonaLisa"/Service/CMD/CHECK_UPDATE to the crontab of the
user that runs MonALISA. To edit your crontab use:
$ crontab -e
Add the following line:
*/20 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE
This checks for updates every twenty minutes. A reasonable
interval is twenty minutes or more. To check for updates
every 30 minutes, add the following line instead of the
one above:
*/30 * * * * /<path_to_your_MonaLisa>/Service/CMD/CHECK_UPDATE
To disable autoupdate you can edit the ml_env file in
"MonaLisa"/Service/CMD and set SHOULD_UPDATE="false".
There is no need to remove the CHECK_UPDATE script from
your crontab.
Launch "MonaLisa"/Service/CMD/ML_SER start. MonALISA
should now check for updates.