If you want full host monitoring and control from a single point using MonALISA, this is what you have to do. Careful: if you only need to monitor a bunch of hosts and devices there is a much simpler approach, MLSensor, but more about that in the next episode.

Preparing the host
Please make sure you have a correctly configured system. One important point here is /etc/hosts: you should find there a line of the form `ip-address hostname.f.q.d.n hostname`, with the canonical FQDN right after the address and the short name as an alias.

If you don’t have this, please take the time now to edit the file and set it correctly; you will need it later on. Also make sure `hostname -f` returns the FQDN of your machine, the same name that is known as forward and reverse name for your IP, and so on. Please keep your system consistent 🙂
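To illustrate the expected layout, here is a hypothetical /etc/hosts entry (the IP address and names are placeholders, adapt them to your host):

```
# /etc/hosts: address, canonical FQDN, then the short alias
192.168.1.10   myhost.example.com   myhost
```

With such an entry in place, `hostname -f` should print myhost.example.com.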

Installing the MonALISA Service
This piece has to run on every host. You need a Java Runtime, version 6 or newer; it’s probably best to install it from your system repository, but you can also install it manually from the official package. For Ubuntu you are better off doing:

sudo add-apt-repository 'deb http://archive.canonical.com/ lucid partner'
sudo apt-get update
sudo apt-get install sun-java6-jdk sun-java6-plugin sun-java6-fonts
sudo update-java-alternatives -s java-6-sun

Once `java -version` reports version 1.6.0 or newer you can download and install the package:
tar -xzf MonaLisa.v<major>.<minor>.tar.gz
cd MonaLisa.v<major>

Run the installer from the unpacked directory. You will be asked a few questions; if the default is good for you, just hit Enter. For the farm name, indicate the FQDN of the machine if in doubt or if the installer could not determine the correct name.

When asked about ApMon support, it’s a good idea to enable it from the beginning, in case you want to gather information from more sources / applications.

By default the service will start in the ‘test’ group, so don’t start it yet; configure it first. Go to ~/MonaLisa/Service/myFarm (the default location, or wherever you have installed it) and edit ml.properties:

lia.Monitor.group=myClusterName # a simple string, identical on all your installations
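Since this value has to be identical on all your installations, it is convenient to apply it with a one-liner instead of editing by hand. A small sketch, demonstrated on a scratch copy; on a real host, point `cfg` at ~/MonaLisa/Service/myFarm/ml.properties instead:

```shell
# Set the group name in ml.properties (demo on a temporary copy)
cfg=$(mktemp)
printf 'lia.Monitor.group=test\n' > "$cfg"
sed -i 's/^lia\.Monitor\.group=.*/lia.Monitor.group=myClusterName/' "$cfg"
cat "$cfg"   # -> lia.Monitor.group=myClusterName
```

The same `sed` invocation can then be pushed to every host over ssh.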

By default each service runs a PostgreSQL database instance to keep history, in case you query it directly. Recent history is kept in memory anyway, so you may want to disable the database if the monitoring information is not critical. To do this simply set:
lia.Monitor.Store.TransparentStoreFast.web_writes = 0

Then, if you want more than the default set of modules (CPU usage, load, swap activity, network I/O, disk I/O, LM sensors), you can enable extra ones in myFarm.conf in the same directory. Below are a few that give you a more complete picture of your host:

*IPs{monIPAddresses, localhost, ""}%60
*MonaLisa_DiskDF{DiskDF, localhost, ""}%60
*MonaLisa_MemInfo{MemInfo, localhost, ""}%60
*MonaLisa_Netstat{Netstat, localhost, ""}%60
*MonaLisa_ProcessesStatus{ProcessesStatus, localhost, ""}%60
*MonaLisa_SysInfo{SysInfo, localhost, ""}%300
*MonaLisa_NetworkConfiguration{NetworkConfiguration, localhost, ""}%300

If you have an APC UPS attached to the box you can also enable the module for this:
*UPS{monAPCUPS, localhost, "/sbin/apcaccess"}%60

Or you can query a nearby switch over SNMP with the dedicated modules.

And there are many other modules that you can readily instantiate. Or you can write your own module, or use a very simple module that calls a system command and parses its tab-separated output:
*ExternalScripts{monStatusCmd, localhost, "/path/to/the/command.sh"}%60
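As a sketch of such a script, here is a hypothetical /path/to/the/command.sh. The assumption here is that the module expects one parameter per line as name<TAB>value; verify the exact format monStatusCmd expects before relying on this:

```shell
# Write a minimal (hypothetical) command.sh and try it out
cat > command.sh <<'EOF'
#!/bin/sh
# One parameter per line, tab-separated name/value (assumed format)
printf 'open_files\t%s\n'      "$(ls /proc/self/fd | wc -l)"
printf 'logged_in_users\t%s\n' "$(who | wc -l)"
EOF
chmod +x command.sh
./command.sh
```

Anything you can measure from a shell command can be published this way.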

Now you can start the service:
~/MonaLisa/Service/CMD/ML_SER start

Then start the interactive ML client and browse the parameters that are now available. Make sure you enable your newly created group in the Groups menu. If you still don’t see the services there, check why the service is not running; the logs in MonaLisa/Service/myFarm/*.log are a good starting point. An overly strict firewall can also be a problem, although ML only needs outgoing connectivity; the log messages should point out such problems too.

Installing the Repository
The Services above collect and expose the data to the world. Another MonALISA client, the Repository, can subscribe to subsets of this monitoring data, store the values in a local database and present them on the embedded web interface. One of the largest deployments is the ALICE repository, which you can browse for inspiration on the views that can be created from the collected data.

For the Repository you will need the full JDK (for later development of the web interface, which you will very probably want to do). Once you have it installed, download the repository package: the stable release is on the official site, or you can take the nightly build (which I always use 🙂).

When installing the Repository, take a moment to evaluate which parameters you want monitored and at what update frequency. For example, 20 hosts with 60 parameters each, updating once a minute, produce a continuous flux of 20 updates/s to the database. This is very easy for any system to handle, but 10x that is probably at the limit of SATA disks, and you need better hardware (fast disks, RAID, SSD etc.) to handle it. Memory is also important for a heavily used system, and the Repository can make use of as many cores as you have, so give it as many resources as possible.
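The arithmetic above can be checked with a quick shell calculation (numbers taken from the example):

```shell
# Back-of-the-envelope database write rate:
# hosts * parameters per host / update interval (seconds)
hosts=20
params_per_host=60
update_interval_s=60
echo $(( hosts * params_per_host / update_interval_s ))   # -> 20 updates/s
```

Plug in your own numbers to see where you land relative to the ~200 updates/s figure mentioned above.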

The installation is straightforward:

tar -xjf MLrepository.tar.bz2
cd MLrepository

If the installer finds Java in the PATH and you have a reasonably recent 64-bit Linux (SLC5 or anything newer) you’re set. Otherwise you’ll have to compile PostgreSQL with --prefix=/where/is/MLrepository/pgsql_store, or point the client to a standalone database instance instead of the embedded one.

Now, you have to decide what you want to collect and store, and what and how to display it.

Configuring data collection

There is only one configuration file where you define what data is collected and how it is stored: MLrepository/JStoreClient/conf/App.properties

Here you will certainly want to change the following variables:

lia.Monitor.group = myClusterName # the same that you chose for the services

lia.Monitor.JiniClient.Store.predicates = the list of predicates, more about this below

The predicates parameter is a comma-separated list of filters that define cuts in the monitoring parameters to which the client will subscribe and (by default) store. A predicate has 4 fields (corresponding to the 4 hierarchical levels of MonALISA parameters), separated by forward slashes (/):

  • Farm (MonALISA Service name)
  • Cluster
  • Node
  • Function (parameter name)

Any of these fields supports “*” as a wildcard, either for the entire name or for substrings of it. Here are some examples based on the modules instantiated above at the Service level:

*/MonaLisa/*/CPU_*|Load* # CPU usage per component and machine load
*/MonaLisa_LocalSysMon/*/MAXIOUTIL_* # the busiest device at every iteration
*/MonaLisa_MemInfo/*/* # all memory parameters
*/MonaLisa_DiskDF/*/TOTAL_*|Status|Message # `df` totals plus the status and warning message (if any)

So, putting everything together:

lia.Monitor.JiniClient.Store.predicates = */MonaLisa/*/CPU_*|Load*,\
*/MonaLisa_LocalSysMon/*/MAXIOUTIL_*,\
*/MonaLisa_MemInfo/*/*,\
*/MonaLisa_DiskDF/*/TOTAL_*|Status|Message

To browse the parameters use the interactive client, and subscribe only to the minimal set of parameters that your views need; this saves disk space and IO. You can dynamically add parameters while the repository is running: changes are applied on the fly, without even restarting it. So start small, and add when you must.

Now start the repository (see the scripts in MLrepository/scripts/, for example restart_jstoreclient.sh) so we can move on.

Displaying the values

You can quickly check whether the values are arriving at the client by accessing http://localhost:8080/dump_cache.jsp (change the address if the repository is not running on the local machine; to change the default port see MLrepository/tomcat/conf/server.xml). Enter here one of the predicates from the configuration file (or simply a * to dump all collected series for a quick check).

If you don’t see here what you expect, check MLrepository/JStoreClient/log.out for errors.

Creating new views means creating new configuration files in MLrepository/tomcat/webapps/ROOT/WEB-INF/conf. A number of examples are already there, and you can see some of them in the left-side menu of the site. In general, if you add a file called, for example, test.properties, you can see it at http://localhost:8080/display?page=test&dont_cache=true (the name without .properties; the last parameter forces a reload of the configuration file, useful while you are tuning it).

Here is a simple example that displays the load on all machines in the cluster:

# history chart

# all machines for which we have this parameter
Farms=$QSELECT distinct split_part(mi_key,'/',1) FROM monitor_ids WHERE mi_key LIKE '%/MonaLisa/localhost/Load1';

# one series per farm name

# clarify a bit the contents
title = Load1 on the machines
ylabel = Load1

[Sample chart: a simple example of a history chart]

Take a look at the other examples, which only require configuring existing servlets, and keep in mind that if custom views are needed you can write your own JSP pages and use the data API to create dashboards like this or this.

Establishing the trust between Repository and Services
Only trusted clients can control the services, so if you need to propagate commands from the client (Repository or interactive client alike) you need to establish this trust relation. For this, generate a public/private key pair for the client, extract the public key and give it to the service(s). The detailed sequence is:
1. on the client side:
cd MLrepository
keytool -genkey -keystore control.jks -alias repository -validity 10000
(answer truthfully to the questions, and remember the password 🙂 )

2. still on the client side:
keytool -export -keystore control.jks -alias repository -file repository.pub

3. configure the client to use this keystore for identification, by setting in the client configuration (MLrepository/JStoreClient/conf/App.properties):
keystore_pwd=your secret password

Then restart the client with MLrepository/scripts/restart_jstoreclient.sh

4. import the public key of the repository into the service:
cd MonaLisa/Service/SSecurity
./importCert repository repository.pub
cd ../CMD
./ML_SER restart
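For scripted deployments, the client-side steps (1 and 2) can also be run non-interactively; a sketch, assuming the JDK’s keytool is on the PATH, with a placeholder password and DN that you should replace with your own:

```shell
# Non-interactive key pair generation and public key export
# (guarded so it is a no-op on machines without a JDK)
if command -v keytool >/dev/null 2>&1; then
    keytool -genkey -keystore control.jks -alias repository -validity 10000 \
            -keyalg RSA -dname "CN=repository" \
            -storepass MySecretPwd -keypass MySecretPwd
    keytool -export -keystore control.jks -alias repository \
            -file repository.pub -storepass MySecretPwd
fi
```

The resulting repository.pub is what you hand to importCert on each service, exactly as in step 4.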

That’s it! Now you can instruct the repository to automatically restart applications that are reported as failed, or implement other automatic procedures orchestrated centrally. More about automatic actions another time.