Service Assurance

This section covers the basics of how Horizon tests whether a service or device is available and measures its latency.

In Horizon, this task is handled by the Service Monitor framework. Its main component is Pollerd, which provides the following functionality:

Track the status of a management resource or an application for availability calculations
Measure response times for service quality
Correlate node and interface outages based on a Critical Service

The following image shows the model and representation of availability and response time.

Figure 1. Representation of latency measurement and availability

This information is based on Service Monitors, which are scheduled and executed by Pollerd. A Service can have any arbitrary name and is associated with a Service Monitor. For example, we can define two Services named HTTP and HTTP-8080. Both are associated with the HTTP Service Monitor but use a different TCP port configuration parameter. Figure 2 shows how Pollerd interacts with other components in OpenNMS and with the applications or agents to be monitored.
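The relationship between Service names and a Service Monitor in the HTTP and HTTP-8080 example can be illustrated with a small Python sketch. This is not Horizon code or configuration: `PollResult`, `Service`, and `tcp_port_monitor` are hypothetical names, and the monitor here only checks that a TCP port accepts a connection, standing in for the real HTTP Service Monitor.

```python
import socket
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class PollResult:
    status: str                        # "Up" or "Down"
    response_time_ms: Optional[float]  # None when the service is Down


def tcp_port_monitor(host: str, port: int, timeout: float = 3.0) -> PollResult:
    """Hypothetical stand-in for a Service Monitor: time a TCP connect."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000.0
            return PollResult("Up", elapsed_ms)
    except OSError:
        # Refused connections and timeouts both count as an outage here.
        return PollResult("Down", None)


@dataclass(frozen=True)
class Service:
    name: str                               # arbitrary Service name
    monitor: Callable[..., PollResult]      # the associated monitor
    parameters: Dict[str, int] = field(default_factory=dict)

    def poll(self, host: str) -> PollResult:
        return self.monitor(host, **self.parameters)


# Two Services share one monitor and differ only in the port parameter.
SERVICES = [
    Service("HTTP", tcp_port_monitor, {"port": 80}),
    Service("HTTP-8080", tcp_port_monitor, {"port": 8080}),
]
```

Each call to `poll` yields both pieces of information the framework tracks: a status for availability and a response time for service quality.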

Availability is calculated over the last 24 hours and is shown in the Surveillance Views, SLA Categories, and the Node Detail Page. Response times are displayed as Resource Graphs of the IP Interface on the Node Detail Page. Configuration parameters of the Service Monitor can be viewed on the Service Page by clicking the Service name on the Node Detail Page. The status of a Service is either Up or Down.
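As a rough illustration of an availability figure over a 24-hour window, the following Python sketch derives a percentage from a list of outages. It is an assumption about how such a number can be computed, not Horizon's actual implementation. The function name `availability_percent` and the outage representation are hypothetical, and outages are assumed not to overlap.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# (time the service was lost, time it was regained); None means still Down.
Outage = Tuple[datetime, Optional[datetime]]


def availability_percent(outages: Iterable[Outage], now: datetime,
                         window: timedelta = timedelta(hours=24)) -> float:
    """Percentage of the window during which the service was Up."""
    window_start = now - window
    downtime = timedelta(0)
    for lost, regained in outages:
        # Clip each outage to the window so older downtime is ignored.
        start = max(lost, window_start)
        end = min(regained if regained is not None else now, now)
        if end > start:
            downtime += end - start
    return 100.0 * ((window - downtime) / window)
```

For example, a single 36-minute outage within the last 24 hours leaves 1404 of 1440 minutes Up, which is an availability of about 97.5%.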

The Service Page also includes timestamps indicating the last time the service was polled and found to be Up (Last Good) or Down (Last Fail). These fields can be used to validate that Pollerd is polling the services as expected.

When a Service Monitor detects an outage, Pollerd sends an Event, which is used to create an Alarm. Events can also be used to generate Notifications for on-call network or server administrators. The following image shows the interaction of Pollerd in Horizon.

Figure 2. Service assurance with Pollerd in OpenNMS platform

Pollerd can generate the following Events in Horizon:

Event name	Description
uei.opennms.org/nodes/nodeLostService	Critical Services are still up; only this service is lost.
uei.opennms.org/nodes/nodeRegainedService	The service came back up.
uei.opennms.org/nodes/interfaceDown	The Critical Service on an IP interface is down, or all services on it are down.
uei.opennms.org/nodes/interfaceUp	The Critical Service on that interface came back up.
uei.opennms.org/nodes/nodeDown	All Critical Services on all IP interfaces of the node are down. The whole host is unreachable over the network.
uei.opennms.org/nodes/nodeUp	Some of the Critical Services came back online.

The behavior to generate interfaceDown and nodeDown events is described in the Critical Service section.

This assumes that node-outage processing is enabled.
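The event table can be read as a simple decision rule, sketched below in Python. This is an illustration derived only from the descriptions in the table, not Pollerd's actual logic, which is covered in the Critical Service section. The names `NodeState`, `interface_is_down`, and `outage_event` are hypothetical, and ICMP as the Critical Service is an assumption made for the example.

```python
from typing import Dict

# IP interface -> service name -> True if the service is Up.
NodeState = Dict[str, Dict[str, bool]]

UEI_PREFIX = "uei.opennms.org/nodes/"


def interface_is_down(services: Dict[str, bool], critical: str) -> bool:
    """An interface counts as down if its Critical Service is down
    or if all of its services are down."""
    critical_down = critical in services and not services[critical]
    return critical_down or not any(services.values())


def outage_event(node: NodeState, interface: str, critical: str = "ICMP") -> str:
    """Pick the outage event for a service that just failed on `interface`.

    `node` must already reflect the failure. The default Critical Service
    name is an assumption for this sketch.
    """
    if not interface_is_down(node[interface], critical):
        # Critical Service still up: only the one service is lost.
        return UEI_PREFIX + "nodeLostService"
    if all(interface_is_down(s, critical) for s in node.values()):
        # Every interface of the node is down.
        return UEI_PREFIX + "nodeDown"
    return UEI_PREFIX + "interfaceDown"
```

With this rule, a failed HTTP service on an interface whose Critical Service still responds yields nodeLostService, while the same failure on a node whose interfaces are all down yields a single nodeDown.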