Introduction

When an integration layer is in place, not every user has an overview of where their request is heading, which layers it passes through, or where it ran into an error.

In this article I will describe a basic solution to monitor, debug and alert on an integration layer built with Red Hat Fuse on premise, using the TICK stack, Graylog and Grafana. I am assuming the OS is CentOS.

The first part describes the architecture, the second a more technical approach.

I will not discuss the selection of the technical stack at length. It is mainly because I have experience with it and it has proven stable for monitoring a large infrastructure.

It is also a logical choice in an open source, Java-oriented environment.

Simple Use Case as a base example

Two backend applications are integrated with a partner over a public network.

An event triggers a synchronous request to a web service at the partner.

The partner’s web service is exposed internally over the integration layer.

The integration layer is Highly Available (HA).

Basic infrastructure design diagram

KPIs and debug requirements

Here is the list of what we want at the end.

KPIs monitoring

  • Throughput capacity has to be at least X requests per second
    = the amount of requests hitting the integration layer
  • Average response time less than X
    = how much time does the integration layer take to answer the backend application request
  • Success rate has to be at least X%
    = a failure is a request that had a connection issue towards the partner or with an http response code >= 400

Debug requirements

  • Where did my request fail?
    If I have an error, where did it occur and what is the error at that level
  • Where did my request end?
    If I send a request over the integration layer, what is the endpoint at the partner API?
  • What path did my request follow toward the partner?
    Which intranet elements did the request pass through?
    What log was produced at each element?
  • What was the result of the request on the integration layer?
    Potentially, the request handling between the integration layer and the partner is different than between the backend and the integration layer.

Tooling solutions

Red Hat Fuse out of the box monitoring

Red Hat Fuse comes with monitoring.

It is accessible via the so-called Fuse Console, a Hawtio web interface on top of the out-of-the-box JMX capabilities of Apache Camel, the core routing engine of Red Hat Fuse.

To be honest, this is good enough for a developer to check what is happening on a single Fuse instance, but not to use as an enterprise tool.

If you are running Red Hat Fuse in a cluster, you have to know which instance your message did or will pass through before you can log onto the correct console.

On the other hand, the fact that all the metrics are exposed over JMX allows you to collect them on the VM.

Metrics collection solution

I will be collecting the metrics in InfluxDB.

InfluxDB is a very performant open source time series database.
InfluxDB comes with the Telegraf agent to collect metrics and send them to InfluxDB.
Telegraf has plenty of plugins to collect metrics. Out of the box it can collect data from Jolokia, a tool that exposes JMX over HTTP.
In order to monitor our KPIs we will collect, every 10 seconds and for each interesting Camel route, the total exchanges count, the total failed exchanges count and the last exchange processing time. An exchange is one request inside a Camel route. A Camel route is the logic plugged on the endpoint exposed on Red Hat Fuse (= the consumer in Camel terminology) to produce a request towards the partner endpoint.
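
As a quick sanity check of those JMX attributes, you can read them with a few lines of plain Java before any tooling is involved. The sketch below is only illustrative: the JMX service URL and the absence of credentials are assumptions you have to adapt to your own Fuse instance (querying the Jolokia HTTP endpoint works just as well).

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CamelRouteMetricsCheck {
    public static void main(String[] args) throws Exception {
        // Assumed Karaf/Fuse JMX URL; adapt host, port and credentials to your instance.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/karaf-root");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Same MBean pattern that the Telegraf jolokia2_agent input uses later in this article.
            Set<ObjectName> routes = mbs.queryNames(
                    new ObjectName("org.apache.camel:context=*,type=routes,name=*"), null);
            for (ObjectName route : routes) {
                long total = (Long) mbs.getAttribute(route, "ExchangesTotal");
                long failed = (Long) mbs.getAttribute(route, "ExchangesFailed");
                long lastMs = (Long) mbs.getAttribute(route, "LastProcessingTime");
                System.out.printf("%s total=%d failed=%d lastProcessingTime=%dms%n",
                        route.getKeyProperty("name"), total, failed, lastMs);
            }
        }
    }
}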

Logs Collection Solution

All the logs from the important infrastructure and network elements will be collected on Graylog. Graylog is an open source log management software. It uses Elasticsearch to store the logs and MongoDB to store its metadata.

We will send it the logs from the backend applications, the firewalls, the load balancer, the forward proxy and Red Hat Fuse.

Logs can be pushed to Graylog by various means including Syslog, GELF, TCP, UDP.
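
To give an idea of the GELF option, here is a minimal sketch that pushes a single GELF message to Graylog over UDP using nothing but the JDK. The Graylog hostname, the default GELF UDP port 12201 and the field values are assumptions; in a real setup you would rather plug a GELF or Syslog appender into your logging framework.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class GelfUdpSketch {
    public static void main(String[] args) throws Exception {
        // Hand-written GELF 1.1 payload; "_request_id" is a custom field (see the tracing section).
        String gelf = "{\"version\":\"1.1\","
                + "\"host\":\"fuse-node-1\","
                + "\"short_message\":\"request processed\","
                + "\"level\":6,"
                + "\"_request_id\":\"9f8b1c2e\"}";
        byte[] payload = gelf.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            // Assumed Graylog GELF UDP input listening on the default port 12201.
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("graylog.yourprivatedomain"), 12201));
        }
    }
}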

Monitoring Solution

To achieve monitoring I will use Grafana, a flexible and powerful open source monitoring software that can read from many sources including InfluxDB.

Once you connect it to your InfluxDB database you can immediately start to create monitoring dashboards using InfluxQL.

Alerting Solution

The base engine that I will use is Kapacitor. It is part of the open source TICK stack from InfluxData.

Kapacitor publishes alerts on topics consumed by handlers to send them out.

Kapacitor basically analyses streams of metrics out of InfluxDB, enabling more than alerting, such as aggregating metrics or anomaly detection, but that is out of the scope of this article.
In order to simplify the definition of alerts, we are going to add Chronograf, which in combination with Kapacitor offers an easy-to-set-up basic alerting capability. It makes it easy to define a Kapacitor alert on threshold checks (ceiling, deviation %, …).

Putting it all together into practice

Now that the technology stack is defined I will detail how to use it.

It is a high-level description peppered with some technical input, assuming that the components are installed.

Adding the monitoring and alerting elements to the infrastructure design diagram

Tracing your request

The idea is to generate a request id as early as possible, add it to an HTTP header and carry it along all layers in requests and responses.
That request id has to be logged in the logs sent to Graylog, where searching for it returns the logs of every layer that processed the request.

The request id will be generated by the load balancer and put in the X-RequestId HTTP header. That header will be propagated towards the partner through all the layers and added in the response to the backend.

Each layer will then log it and send it to Graylog.
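
To make this concrete on the Red Hat Fuse side, here is a minimal Camel route sketch in Java DSL that logs the X-RequestId header on the way in and on the way out. The consumer and producer endpoint URIs and the route id are placeholders, not the actual integration.

import org.apache.camel.builder.RouteBuilder;

public class PartnerProxyRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Placeholder consumer endpoint exposing the partner web service internally.
        from("jetty:http://0.0.0.0:8181/partner/service")
            .routeId("partner-service-proxy")
            .log("X-RequestId=${header[X-RequestId]} received request")
            // Camel propagates the incoming HTTP headers to the outgoing request,
            // so X-RequestId travels on towards the partner and back in the response.
            .to("{{partner.endpoint.uri}}") // placeholder, resolved from properties
            .log("X-RequestId=${header[X-RequestId]} partner answered with code ${header[CamelHttpResponseCode]}");
    }
}

These log lines reach Graylog through the regular log appender of the Fuse instance.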

With that in place you can answer all your debug requirements.

  • Where did my request fail?
    • If you don't have a request id, the request failed somewhere between the backend application and the load balancer (included). The error should be visible in Graylog around the time of the failure, in the log of the backend application, the firewall or the load balancer.
    • If you have a request id, search for it in Graylog and you will see all the layers the request passed through. With that you know between which layers it errored.
  • Where did my request end?
    Search Graylog for the forward proxy log entry with your request id. That log will contain the destination URI.
  • What path did my request follow toward the partner?
    Which intranet elements did the request pass through?
    What log was produced at each element?
    Again, search Graylog for the request id and you’ll find it.
  • What was the result of the request on the integration layer?
    Search the integration layer logs in Graylog with your request id to collect the details.

The nice thing is that your partner will have the request id too and can use it as a reference to ask for details.

It is important to analyse your debugging requirements in order to log what is needed on the right element.

Monitoring your integration

The examples I am providing below are deliberately basic and not fully detailed; they demonstrate the theory but can and have to be fine-tuned. You could break down your graphs per route or per VM. You could and should filter out sub-routes (a route is often composed of a chain of multiple routes, but you only want to monitor the wrapping one).

Collecting the metrics

All the monitoring is based on collecting the JMX data of each Red Hat Fuse VM into InfluxDB. By default Red Hat Fuse exposes JMX through Jolokia.

To collect the metrics, install a Telegraf agent on each Red Hat Fuse VM and configure it to send the metrics to the right port and database of your InfluxDB.

Ensure that you have the Telegraf jolokia2_agent input plugin available. You can check it with the command: telegraf --input-list | grep jolokia2_agent

In your Telegraf configuration at /etc/telegraf/telegraf.conf, add the relevant pieces of configuration below and restart your Telegraf agent:

[agent]
  interval = "10s"

[[outputs.influxdb]]
  urls = ["https://influxdb.yourprivatedomain:8086"]
  database = "telegraf"

[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8080/hawtio/jolokia"]
  username = "xxx"
  password = "xxx"

[[inputs.jolokia2_agent.metric]]
  name = "camel_routes"
  mbean = "org.apache.camel:context=*,type=routes,name=*"
  paths = ["ExchangesTotal", "ExchangesCompleted", "ExchangesFailed", "RouteId", "EndpointUri", "LastProcessingTime"]
  tag_keys = ["name"]

This will collect the "ExchangesTotal", "ExchangesCompleted", "ExchangesFailed", "RouteId", "EndpointUri" and "LastProcessingTime" attributes of all your Camel routes every 10 seconds and send them to the "telegraf" database on your InfluxDB hosted at https://influxdb.yourprivatedomain:8086.

Creating graphics of your metrics in Grafana

On your Grafana instance, configure an InfluxDB datasource for your telegraf database. In the Grafana web interface, create a dashboard for your monitoring and add the graphs below.

To monitor the requests per second

Create a “Graph” panel, name it ‘requests per second’, toggle text edit mode on the ‘A’ query and put this as the query:

SELECT non_negative_derivative(first("ExchangesCompleted"), 1s) AS "OK", non_negative_derivative(first("ExchangesFailed"), 1s) AS "NOK" FROM "autogen"."camel_routes" WHERE $timeFilter GROUP BY time(1m) fill(null)

This will produce a graphic like this

The non_negative_derivative function calculates the rate of change between subsequent values. You have to understand that the metrics coming back from Fuse are running totals, not instant values, hence you have to take the difference between consecutive samples to build a graph of the evolution of the counts over time. For example, two consecutive ExchangesCompleted samples of 1200 and 1230 taken 10 seconds apart translate into 3 requests per second.

To monitor the processing time

Create a “Graph” panel, name it ‘processing time’, toggle text edit mode on the ‘A’ query and put this as the query:

SELECT last("LastProcessingTime") AS "Last processing time", max("LastProcessingTime") AS "Max processing time", min("LastProcessingTime") AS "Min processing time", mean("LastProcessingTime") AS "Mean processing time" FROM "camel_routes" WHERE "LastProcessingTime" >= 0 AND $timeFilter GROUP BY time(1m) fill(0)

Here you see some aggregation functions (last, mean, max, min).

To monitor the successful/failed request counts

Create a “Graph” panel, name it ‘Total requests count’, toggle text edit mode on the ‘A’ query and put this as the query:

SELECT CUMULATIVE_SUM("Exchanges completed") AS "Total OK", CUMULATIVE_SUM("Exchanges failed") AS "Total NOK" FROM (SELECT NON_NEGATIVE_DIFFERENCE(last("ExchangesCompleted")) AS "Exchanges completed", NON_NEGATIVE_DIFFERENCE(last("ExchangesFailed")) AS "Exchanges failed" FROM "camel_routes" WHERE $timeFilter GROUP BY "host", "name" fill(null))

There is a trick here because you only receive running total counts from Red Hat Fuse. The subquery transforms the totals over time into differences between consecutive totals, which the outer query then sums cumulatively to render the evolution of the counts.

If you add GROUP BY "name" at the end of the query, you will see a graph per Camel route instead of the total over all the routes:

How to monitor your KPIs

  • Throughput capacity has to be at least X requests per second
    This is what your “requests per second” graph is showing.
  • Average response time less than X
    This is what your “processing time” graph is showing.
  • Success rate has to be at least X%
    This is partially covered by the “Total requests count” graph.
    To get it exactly, you should create an additional graph that returns the failure ratio computed from the same counters.
    FYI, a connection failure triggers an error in the route out of the box, but an HTTP response code >= 400 potentially does not. That has to be implemented in the route by raising an error in that case (see the sketch below).
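
Here is a minimal sketch, in Camel Java DSL, of one way to raise that error. The endpoint URIs are placeholders, and whether the check is needed at all depends on the HTTP component configuration (with the default throwExceptionOnFailure=true the producer already throws on error response codes).

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class PartnerCallWithErrorCheck extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:callPartner")
            .routeId("partner-call")
            .to("{{partner.endpoint.uri}}") // placeholder partner endpoint
            .process(exchange -> {
                Integer code = exchange.getIn()
                        .getHeader(Exchange.HTTP_RESPONSE_CODE, Integer.class);
                if (code != null && code >= 400) {
                    // Failing the exchange makes the JMX ExchangesFailed counter increase,
                    // so the KPI graphs and alerts see it as an error.
                    throw new IllegalStateException("Partner answered HTTP " + code);
                }
            });
    }
}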

Sending alerts

On your InfluxDB VMs, install Chronograf and Kapacitor.

Kapacitor will be used to produce alerts and send them out. Once installed on the InfluxDB VM with the standard ports, it will automatically start reading the data.

Chronograf is the tool from the TICK stack that, in theory, provides the monitoring graphics element and a DB query client.

In this case we will use it to create basic alerts on Kapacitor.

Once installed next to Kapacitor it will enable alert definitions.

To create an alert, browse to Chronograf on its default port 8888 on your InfluxDB host: https://influxdb.yourprivatedomain:8888.

Surf to the Alerting > Manage Tasks menu and click “+ Build Alert Rule”.

In the alert creation page, under the “Time Series” section select the “telegraf.autogen” database, the “camel_routes” measurement and finally the “ExchangesFailed” field.

Add the “last” function on the “ExchangesFailed” field.

In the “Conditions” section select: Send Alert when “change” compared to previous “5 m” is greater than 5.

This means that an alert will be published when there is an increase of 5 errors compared to the measure 5 minutes before.

Set up an “Alert Handler”. I used the mail handler because we deploy an SMTP server on all our VMs, but there are many other handlers.

Define a message to add in the alert. It can optionally contain information collected during the alert.

Click “Save Rule” at the top of the page.

Kapacitor on its own has more alerting capabilities if you like to investigate further.

Time synchronization is crucial

This only works if the time on all the elements inside your private network, and ideally at your partner as well, is aligned and synchronized.

At least it is required for your logs; for the metrics, InfluxDB can generate the timestamp when the point is received.

Conclusion

With this architecture you have the capacity to monitor, debug and create alerts on your integration layer. It is based on best-in-class open source solutions backed by private companies and/or big communities.

Though I have documented the basics you can obviously fine tune it at will depending on your needs.

Everything is custom, which is both the power and the burden of this solution.

The cost is that you have to set up everything yourself and specify your needs.

The benefit is monitoring, debugging and alerting customization to a great extent.

Author: Antoine Wils