Cassandra Monitoring

This section describes some of the metrics Horizon collects from a Cassandra cluster. JMX must be enabled on the Cassandra nodes and made accessible from Horizon in order to collect these metrics. See Enabling JMX authentication and authorization for details.

The data collection is bound to the agent IP interface with the service name JMX-Cassandra. The JMXCollector retrieves the MBean entities from the Cassandra node. A minimal sketch of querying these MBeans directly over JMX follows the thread pool table below.

Client connections

Collects the number of active client connections from org.apache.cassandra.metrics.Client:

  Name                    Description
  connectedNativeClients  Metrics for connected native clients
  connectedThriftClients  Metrics for connected Thrift clients

Compaction bytes

Collects the following compaction manager metric from org.apache.cassandra.metrics.Compaction:

  Name            Description
  BytesCompacted  Number of bytes compacted since the node started

Compaction tasks

Collects the following compaction manager metrics from org.apache.cassandra.metrics.Compaction:

  Name            Description
  CompletedTasks  Estimated number of completed compaction tasks
  PendingTasks    Estimated number of pending compaction tasks

Storage load

Collects the following storage load metric from org.apache.cassandra.metrics.Storage:

  Name  Description
  Load  Total disk space (in bytes) this node uses

Storage exceptions

Collects the following storage exception metric from org.apache.cassandra.metrics.Storage:

  Name        Description
  Exceptions  Number of unhandled exceptions since the start of this Cassandra instance

Dropped messages

Measurement of messages that were droppable: they ran after the timeout configured for their message type and were therefore discarded. In JMX they are accessible via org.apache.cassandra.metrics.DroppedMessage. The number of dropped messages in the different message queues is a good indication of whether a cluster can handle its load.

  Mutation (stage: MutationStage)
      If a write message is processed after its timeout (write_request_timeout_in_ms), it either sent a failure to the client or it met its requested consistency level and will rely on hinted handoff and read repair to complete the mutation if it succeeded.
  Counter_Mutation (stage: MutationStage)
      If a write message is processed after its timeout (write_request_timeout_in_ms), it either sent a failure to the client or it met its requested consistency level and will rely on hinted handoff and read repair to complete the mutation if it succeeded.
  Read_Repair (stage: MutationStage)
      Times out after write_request_timeout_in_ms.
  Read (stage: ReadStage)
      Times out after read_request_timeout_in_ms. There is no point in servicing the read after that, since an error has already been returned to the client.
  Range_Slice (stage: ReadStage)
      Times out after range_request_timeout_in_ms.
  Request_Response (stage: RequestResponseStage)
      Times out after request_timeout_in_ms. The response was completed and sent back, but not before the timeout.

Thread pools

Apache Cassandra is based on a staged event-driven architecture (SEDA), which separates different operations into stages. These stages are loosely coupled through a messaging service, and each component uses queues and thread pools to group and execute its tasks. The documentation for Cassandra thread pool monitoring originated from the Pythian Guide to Cassandra Thread Pools.

Table 1. Collected metrics for Thread Pools

  Name                   Description
  ActiveTasks            Tasks that are currently running
  CompletedTasks         Tasks that have been completed
  CurrentlyBlockedTasks  Tasks that have been blocked due to a full queue
  PendingTasks           Tasks queued for execution
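The following is a minimal sketch of how these MBeans could be read directly over JMX, for example to spot-check values outside of Horizon. It assumes a hypothetical node cassandra-node-1 reachable on Cassandra's default JMX port 7199 without authentication (with authentication enabled, credentials would be passed in the environment map of JMXConnectorFactory.connect). The object names shown match recent Cassandra releases but can differ between versions; Cassandra gauges expose a Value attribute, while counters and meters expose Count.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CassandraJmxSample {
        public static void main(String[] args) throws Exception {
            // Hypothetical node address; 7199 is Cassandra's default JMX port.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // Gauges exported by Cassandra's metrics library expose a "Value" attribute.
                Object nativeClients = mbsc.getAttribute(new ObjectName(
                        "org.apache.cassandra.metrics:type=Client,name=connectedNativeClients"), "Value");

                // Counters and meters expose a "Count" attribute.
                Object bytesCompacted = mbsc.getAttribute(new ObjectName(
                        "org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted"), "Count");
                Object droppedMutations = mbsc.getAttribute(new ObjectName(
                        "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped"), "Count");

                // Thread pool metrics are scoped per pool, here the read stage.
                Number pendingReads = (Number) mbsc.getAttribute(new ObjectName(
                        "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks"), "Value");
                Number blockedReads = (Number) mbsc.getAttribute(new ObjectName(
                        "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=CurrentlyBlockedTasks"), "Count");

                System.out.printf("native clients=%s, bytes compacted=%s, dropped mutations=%s%n",
                        nativeClients, bytesCompacted, droppedMutations);

                // The same "pending > 15 || blocked > 0" rule suggested for the per-pool alerts below.
                if (pendingReads.longValue() > 15 || blockedReads.longValue() > 0) {
                    System.out.println("ReadStage looks overloaded: pending=" + pendingReads
                            + ", blocked=" + blockedReads);
                }
            }
        }
    }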
Memtable FlushWriter

Sorts and writes memtables to disk (from org.apache.cassandra.metrics.ThreadPools). Most of the time, a backlog here comes from exceeding what the disk can handle. Sorting can cause issues as well, usually accompanied by high load but only a small number of actual flushes (seen in cfstats); the cause can be huge rows with large column names, for example something inserting many large values into a CQL collection. If the disk is the bottleneck, add nodes or tune the configuration.

Alerts: pending > 15 || blocked > 0

Memtable post flusher

Operations after flushing the memtable: discarding commit log files once all of their data has been flushed to sstables, and flushing non-cf-backed secondary indexes.

Alerts: pending > 15 || blocked > 0

Anti-entropy stage

Repairing consistency. Handles repair messages such as Merkle tree transfer (from validation compaction) and streaming.

Alerts: pending > 15 || blocked > 0

Gossip stage

If you see issues with pending tasks, monitor the logs for a message like:

  Gossip stage has {} pending tasks; skipping status check ...

Check that NTP works correctly and attempt nodetool resetlocalschema, or the more drastic deletion of the system column family folder.

Alerts: pending > 15 || blocked > 0

Migration stage

Making schema changes.

Alerts: pending > 15 || blocked > 0

MiscStage

Snapshotting, and replicating data after a node removal has completed.

Alerts: pending > 15 || blocked > 0

Mutation stage

Performing a local insert/deletion, including:

  insert/updates
  schema merges
  commit log replays
  hints in progress

Similar to ReadStage, an increase in pending tasks here can be caused by disk issues, an overloaded system, or poor tuning. If messages are backed up in this stage, you can add nodes, tune hardware and configuration, or update the data model and use case.

Alerts: pending > 15 || blocked > 0

Read stage

Performing a local read, which also includes deserializing data from the row cache. Growing pending values can cause increased read latency. This can spike due to disk problems, poor tuning, or an overloaded cluster. In many cases (other than disk failure) you can resolve this by adding nodes or tuning the system.

Alerts: pending > 15 || blocked > 0

Request response stage

When a response to a request is received, this is the stage used to execute any callbacks that were created with the original request.

Alerts: pending > 15 || blocked > 0

Read repair stage

Performing read repairs. The chance of them occurring is configurable per column family with read_repair_chance. A backlog is more likely if you use CL.ONE (and, to a lesser degree, other non-CL.ALL consistency levels) for reads and use multiple data centers, because the repair is then kicked off asynchronously, outside of the query's feedback loop. Note that this is not likely to be a problem, since it does not happen on all queries and is fast, provided there is good connectivity between replicas. The repair messages are also droppable, so they are discarded after write_request_timeout_in_ms, which further mitigates this. If pending grows, try lowering read_repair_chance for high-read column families.

Alerts: pending > 15 || blocked > 0

JVM metrics

Horizon also collects some key metrics from the running Java virtual machine:

  java.lang:type=Memory
      The memory system of the Java virtual machine, including heap and non-heap memory.
  java.lang:type=GarbageCollector,name=ConcurrentMarkSweep
      Metrics for the garbage collection process of the Java virtual machine.

If you use Apache Cassandra for running Newts, you can also enable additional metrics for the Newts keyspace.
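As an illustration, these java.lang MBeans can be read with the standard JMX API. The sketch below queries the local platform MBean server so it runs against any JVM; a remote MBeanServerConnection obtained as in the earlier example is queried the same way. Note that the ConcurrentMarkSweep bean only exists when the node actually runs the CMS collector, so the example matches whichever garbage collectors the JVM reports.

    import java.lang.management.ManagementFactory;
    import java.util.Set;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;
    import javax.management.openmbean.CompositeData;

    public class JvmMemoryAndGcSample {
        public static void main(String[] args) throws Exception {
            // Local platform MBean server; a remote MBeanServerConnection works identically.
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

            // java.lang:type=Memory exposes heap usage as CompositeData with
            // "used", "committed", and "max" keys.
            CompositeData heap = (CompositeData) mbs.getAttribute(
                    new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
            System.out.printf("heap used=%s max=%s%n", heap.get("used"), heap.get("max"));

            // On a Cassandra node running CMS the collector bean is
            // java.lang:type=GarbageCollector,name=ConcurrentMarkSweep; the pattern
            // below matches whatever collectors this JVM exposes.
            Set<ObjectName> gcs = mbs.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,name=*"), null);
            for (ObjectName gc : gcs) {
                System.out.printf("%s: collections=%s time=%s ms%n",
                        gc.getKeyProperty("name"),
                        mbs.getAttribute(gc, "CollectionCount"),
                        mbs.getAttribute(gc, "CollectionTime"));
            }
        }
    }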