Thresholding

Thresholding lets you define limits against network performance metrics of a managed entity to trigger an event when a value goes above or below the specified limit.

High
Low
Absolute Value
Relative Change

How thresholding works in Horizon

Horizon uses collectors to implement data collection for a particular protocol or family of protocols (SNMP, JMX, HTTP, XML/JSON, WS-Management/WinRM, JDBC). You can specify configuration for a particular collector in a collection package: essentially the set of instructions that drives the behavior of the collector.

The Collectd daemon gathers and stores performance data from these collectors. This is the data against which Horizon applies thresholds. Thresholds trigger events when a specified threshold value is met. You can further create notifications and alarms for threshold events.

thresholding flow

What triggers a thresholding event?

Horizon uses four thresholding algorithms that trigger an event when the data source value:

Low - equals or drops below the threshold value and re-arms when it equals or comes back up above the re-arm value (for example, available disk space falls under the specified value)
High - equals or exceeds the threshold value, and re-arms when it equals or drops below the re-arm value (for example, bandwidth use exceeds the specified amount)
Absolute - changes by the specified amount (for example, on a fiber-optic link, a change in loss of anything greater than 3 dB is a problem regardless of what the original or final value is)
Relative - changes by percent (for example, available disk space changes more than 5% from the last poll)

These thresholds can be basic (tested against a single value) or an expression (evaluated against multiple values in an expression).

Horizon applies these algorithms against any performance data (telemetry) collected by collectd or pushed to telemetryd. This includes, but is not limited to, metrics such as CPU load, bandwidth, disk space, and so on.

The basic walk-through focuses on how to set simple thresholds using default values in the Horizon setup. For information on setting and configuring collectors, collectd, and the collectd-configuration.xml file, see Performance Management.

Basic walk-through – thresholding

This section describes how to create a basic threshold for a single, system-wide variable: the number of logged-in users. Our threshold will tell Horizon to create an event when the number of logged-in users on the device exceeds two, and re-arm when it falls below two.

Before creating a threshold, you need to make sure you are collecting the metric against which you want to threshold.

Determine you are collecting metric

In this case, we have chosen a metric (number of logged-in users) that is collected by default. We are also using data collected via SNMP. (For information on other collectors, see Collectors.)

In the Horizon UI, choose Reports>Resource Graphs.
Select one of the listed resources.
Under SNMP Node Data, select Node-level Performance Data and choose Graph Selection.
Scroll to find the Number of Users graph.
1. You can click the binoculars icon to display only this graph.

Create a threshold

Click the gear icon in the top-right.
Under Performance Measurement, choose Configure Thresholds.
1. A screen with a list of preconfigured threshold groups appears. We will work with netsnmp. For information on how to create a threshold group, see Creating a Threshold Group.
Click Edit beside the netsnmp group.
Click Create New Threshold at the bottom of the Basic Thresholds area of the screen.
Set the following information and click Save:

Field	Description	Default
Type	Triggers an event when the data source value equals or exceeds the threshold value, and re-arms when it equals or drops below the re-arm value.	high
Datasource	Name of the data source you want to threshold against. For this tutorial, we have provided the data source for logged-in users. For information on how to determine a metric’s data source, see Determine the data source.	hrSystemNumUsers
Datasource label	Optional text label. Not required for this tutorial.	leave blank
Value	The value above which we want to trigger an event. In this case, we want to trigger an event when the number of logged-in users exceeds two.	2
Re-arm	The value below which we want the system to re-arm. In this case, once the number of logged-in users falls below two.	2
Trigger	The number of consecutive times the threshold value can occur before the system triggers an event. Since our default polling period is 5 minutes, a value of 3 means Horizon would create a threshold event if there are more than 2 users for 15 minutes.	3
Description	Optional text to describe your threshold.	leave blank
Triggered UEI	A custom uniform event identifier (UEI) sent into the events system when the threshold is triggered. A custom UEI for each threshold makes it easier to create notifications. If left blank, it defaults to the standard thresholds UEIs.	leave blank
Re-armed UEI	A custom uniform event identifier (UEI) sent into the events system when the threshold is re-armed.	leave blank

Field

Description

Default

Type

Triggers an event when the data source value equals or exceeds the threshold value, and re-arms when it equals or drops below the re-arm value.

high

Datasource

Name of the data source you want to threshold against. For this tutorial, we have provided the data source for logged-in users. For information on how to determine a metric’s data source, see Determine the data source.

hrSystemNumUsers

Datasource label

Optional text label. Not required for this tutorial.

leave blank

Value

The value above which we want to trigger an event. In this case, we want to trigger an event when the number of logged-in users exceeds two.

Re-arm

The value below which we want the system to re-arm. In this case, once the number of logged-in users falls below two.

Trigger

The number of consecutive times the threshold value can occur before the system triggers an event. Since our default polling period is 5 minutes, a value of 3 means Horizon would create a threshold event if there are more than 2 users for 15 minutes.

Description

Optional text to describe your threshold.

leave blank

Triggered UEI

A custom uniform event identifier (UEI) sent into the events system when the threshold is triggered. A custom UEI for each threshold makes it easier to create notifications. If left blank, it defaults to the standard thresholds UEIs.

leave blank

Re-armed UEI

A custom uniform event identifier (UEI) sent into the events system when the threshold is re-armed.

leave blank

Testing the threshold

To test the threshold we just created, log a second person into the node you are monitoring. Navigate to the Events page. You should see an event that indicates your threshold triggered when more than one user logged in.

Log out the second user. The Events page should indicate that the system has re-armed.

Creating a threshold for CPU use

This procedure describes how to create an expression-based threshold when the five-minute CPU load average metric reaches or goes above 70% for two consecutive measurement intervals. Expression-based thresholds are useful when you need to threshold on a percentage, not the actual value of the data collected.

Expression-based thresholds work only if the data sources in question are in the same directory.

Click the gear icon in the top-right.
Under Performance Measurement, choose Configure Thresholds.
Click Edit beside the netsnmp group.
Click Create New Expression-based Threshold.

Fill in the following information:

Field	Description	Default
Type	Triggers an event when the data source value equals or exceeds the threshold value, and re-arms when it equals or drops below the re-arm value.	high
Expression	Divides the five-minute CPU load average by 100 (to obtain the effective load average), which is then divided by the number of CPUs. This value is then multiplied by 100 to provide a percentage. ( SNMP does not report in decimals, which is why the expression divides the loadavg5 by 100.)	((loadavg5 / 100) / CpuNumCpus) * 100
Datasource type	The type of data source from which you are collecting data.	node
Datasource label	Optional text label. Not required for this tutorial.	leave blank
Value	Trigger an event when the five-minute CPU load average goes above 70%.	70
Re-arm	Re-arm the system when the five-minute CPU load average drops below 50%.	50
Trigger	The number of consecutive times the threshold value can occur before the system triggers an event. In this case, when the five-minute CPU load average goes above 70% for two consecutive polling periods.	2
Description	Optional text to describe your threshold.	Trigger an alert when the five-minute CPU load average metric reaches or goes above 70% for two consecutive measurement intervals.
Triggered UEI	See the table in Create a threshold for details.	leave blank
Re-armed UEI	See the table in Create a threshold for details.	leave blank

Field

Description

Default

Type

Triggers an event when the data source value equals or exceeds the threshold value, and re-arms when it equals or drops below the re-arm value.

high

Expression

Divides the five-minute CPU load average by 100 (to obtain the effective load average*), which is then divided by the number of CPUs. This value is then multiplied by 100 to provide a percentage.

(* SNMP does not report in decimals, which is why the expression divides the loadavg5 by 100.)

((loadavg5 / 100) / CpuNumCpus) * 100

Datasource type

The type of data source from which you are collecting data.

node

Datasource label

Optional text label. Not required for this tutorial.

leave blank

Value

Trigger an event when the five-minute CPU load average goes above 70%.

Re-arm

Re-arm the system when the five-minute CPU load average drops below 50%.

Trigger

The number of consecutive times the threshold value can occur before the system triggers an event. In this case, when the five-minute CPU load average goes above 70% for two consecutive polling periods.

Description

Optional text to describe your threshold.

Trigger an alert when the five-minute CPU load average metric reaches or goes above 70% for two consecutive measurement intervals.

Triggered UEI

See the table in Create a threshold for details.

leave blank

Re-armed UEI

See the table in Create a threshold for details.

leave blank

Click Save.

Using metadata in a threshold

Metadata in expression-based thresholds can streamline threshold creation. The Metadata DSL (domain specific language) lets you use patterns in an expression, whereby the metadata is replaced with a corresponding value during the collection process. A single expression can behave differently based on the node being tested against.

During evaluation of an expression, the following scopes are available:

Node metadata
Interface metadata
Service metadata

Metadata is also supported in Value, Re-arm, and Trigger fields for Single-DS and expression-based thresholds.

For more information on metadata and how to define it, see Metadata.

This procedure uses metadata to trigger an event when the number of logged-in users exceeds 1.

The expression is in the form ${context:key|context_fallback:key_fallback|…|default}.

Before using metatdata in a threshold, you need to add the metatdata context pair, in this case, a requisition key called userLimit (see Adding metadata through the web UI).

Click the gear icon in the top-right menu.
Under Performance Measurement, choose Configure Thresholds.
Click Edit beside the netsnmp group.
Click Create New Expression-based Threshold.
Fill in the following information:
- Type: High
- Expression: hrSystemNumUsers / ${requisition:userLimit|1}
- Datasource type: Node
- Value: 1
- Rearm: 1
- Description: Too many logged-in users
Click Save.

This expression will trigger an event when the number of logged-in users exceeds 1.

meta expression2

Determining the data source

Creating a threshold requires the name of the data source generating the metrics on which you want to threshold. Data source names for the SNMP protocol appear in etc/snmp-graph.properties.d/.

To determine the name of the data source, navigate to the Resource Graphs screen. For example,
1. Reports>Resource Graphs.
2. Select one of the listed resources.
3. Under SNMP Node Data, select Node-level Performance Data and choose Graph Selection.
Scroll through the graphs to find the title of the graph that displays the metric on which you want to threshold. For example, "Number of Processes" or "System Uptime":
Go to etc/snmp-graph.properties.d/ and search for the title of the graph (for example, "System Uptime").
Note the name of the data source, and type it in the Datasource field when you create your threshold.

Create a threshold group

A threshold group associates a set of thresholds to a service (for example, thresholds that apply to all Cisco devices). Horizon includes seven preconfigured, editable threshold groups:

mib2
cisco
hrstorage
netsnmp
juniper-srx
netsnmp-memory-linux
netsnmp-memory-nonlinux

You can edit an existing group (through the UI) or create a new one (in the thresholds.xml file located in $OPENNMS_HOME/etc/thresholds.xml). Once you create the group, you can then define it in the thresholds.xml file or define it in the UI.

We will create a threshold group called "demo_group".

Type the following in the thresholds.xml file.

<group name="demo_group" rrdRepository="/opt/opennms/share/rrd/snmp/">
</group>

Once you have created the group in the thresholds.xml file, switch to the UI, go to the threshold screen and click Reload Threshold Configuration.
1. The group you created should appear in the UI.
Click Edit to edit it.

The following is a sample of how the threshold appears in the thresholds.xml file:

<group name="demo_group" rrdRepository="/opt/opennms/share/rrd/snmp/"> (1)
  <expression type="high" ds-type="hrStorageIndex" value="90.0"
    rearm="75.0" trigger="2" ds-label="hrStorageDescr"
    filterOperator="or" expression="hrStorageUsed / hrStorageSize * 100.0">
    <resource-filter field="hrStorageType">^\.1\.3\.6\.1\.2\.1\.25\.2\.1\.4$</resource-filter> (2)
  </expression>
</group>

1	The name of the group and the directory of the stored data.
2	The details of the threshold including type, data source type, threshold value, rearm value, and so on.

Create a notification on a threshold event

A custom UEI for each threshold makes it easier to create notifications.

Thresholding Service

The Thresholding Service is the component responsible for maintaining the state of the performance metrics and for generating alarms from these when thresholds are triggered (armed) or cleared (unarmed). The service listens for and compares performance metrics after they are persisted to the time-series database. The state of the thresholds are held in memory and pushed to persistent storage only when they are changed.

Distributed thresholding with Sentinel

Thresholding for streaming telemetry with telemetryd is supported on Sentinel when using Newts. When running on Sentinel, the thresholding state can be stored in either Cassandra or PostgreSQL. Given that Newts already requires Cassandra, we recommend using Cassandra to minimize the load on PostgreSQL.

Thresholding on Sentinel uses the same configuration files as Horizon and operates similarly. When a thresholding changes to/from trigger or cleared, an event is published which is processed by Horizon and the alarm is created or updated.

Shell commands

The following shell commands help debug and manage thresholding.

Enumerate the persisted threshold states using opennms:threshold-enumerate:

admin@opennms> opennms:threshold-enumerate
Index   State Key
1       23-127.0.0.1-hrStorageIndex-hrStorageUsed / hrStorageSize * 100.0-/opt/opennms/share/rrd/snmp-RELATIVE_CHANGE
2       23-127.0.0.1-if-ifHCInOctets * 8 / 1000000 / ifHighSpeed * 100-/opt/opennms/share/rrd/snmp-HIGH
3       23-127.0.0.1-node-((loadavg5 / 100) / CpuNumCpus) * 100.0-/opt/opennms/share/rrd/snmp-HIGH
4       23-127.0.0.1-if-ifInDiscards + ifOutDiscards-/opt/opennms/share/rrd/snmp-HIGH

Each state is uniquely identified by a state key and aliased by the given index. Indexes are scoped to the particular shell session and provided as an alternative to specifying the complete state key in subsequent commands.

Display state details using opennms:threshold-details:

admin@opennms> opennms:threshold-details 1
multiplier=1.333
lastSample=64.77758166043765
previousTriggeringSample=28.862826722171075
interpolatedExpression='hrStorageUsed / hrStorageSize * 100.0'

admin@opennms> opennms:threshold-details 2
exceededCount=0
armed=true
interpolatedExpression='ifHCInOctets * 8 / 1000000 / ifHighSpeed * 100'

Different types of thresholds display different properties.

Clear a particular persisted state using opennms:threshold-clear:

admin@opennms> opennms:threshold-clear 2

Or clear all the persisted states with opennms:threshold-clear-all:

admin@opennms> opennms:threshold-clear-all
Clearing all thresholding states....done