Tenant Data Collection Unit (DCU) metrics and alerts
Find the metrics that Tenant Data Collection Unit (DCU) exposes and the alerts defined for it.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
Tenant Data Collection Unit (DCU) | PodMonitor | 9091 | Selector: see the template that follows this table. Endpoint to query: /metrics/ | 30 seconds |

Selector:

    selector:
      matchLabels:
        app.kubernetes.io/name: {{ include "common.util.chart.name" . }}
        app.kubernetes.io/instance: {{ include "common.util.chart.fullname" . }}
        service: {{ .Release.Namespace }}
        servicename: {{ include "common.util.chart.name" . }}
        tenant: {{ .Values.tenant.sid }}
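
The scrape settings in the table above could correspond to a PodMonitor resource along the following lines. This is only a sketch of a rendered manifest, not the chart's actual output: the resource name, namespace, label values, and the port name `metrics` are all hypothetical placeholders, and only the port number (9091), the path (`/metrics/`), and the 30-second interval come from the table.

```yaml
# Sketch of a rendered PodMonitor for DCU (Prometheus Operator CRD).
# Every concrete name and label value below is a placeholder.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pulse-dcu            # hypothetical resource name
  namespace: pulse           # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcu          # assumed rendering of "common.util.chart.name"
      app.kubernetes.io/instance: dcu-100  # assumed rendering of "common.util.chart.fullname"
      service: pulse                       # assumed rendering of .Release.Namespace
      servicename: dcu
      tenant: "100"                        # assumed rendering of .Values.tenant.sid
  podMetricsEndpoints:
    - port: metrics      # assumed name of the container port that exposes 9091
      path: /metrics/
      interval: 30s
```

A PodMonitor references the container port by name, so the DCU Pod would need a named port that maps to 9091 for this sketch to select it.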
See details about the metrics and alerts in the sections below.
Metrics
Metric and description | Metric details | Indicator of |
---|---|---|
pulse_ The duration in seconds of the last health check performed by Monitor Agent. | Unit: seconds; Type: Gauge | Error |
pulse_ The Collector container uptime in seconds. | Unit: seconds; Type: Gauge | Error |
pulse_ The status of writing Collector snapshots to Redis. | Unit: not specified; Type: Gauge | Error |
pulse_ The number of active layouts. | Unit: not specified; Type: Gauge | Saturation |
pulse_ The status of the Collector connection to the upstream server. | Unit: not specified; Type: Gauge | Error |
pulse_ The duration in seconds of the Collector connection to the upstream server. | Unit: seconds; Type: Gauge | Error |
pulse_ The duration in seconds of the Collector disconnection from the upstream server. | Unit: seconds; Type: Gauge | Error |
pulse_ The total number of Collector statistics. | Unit: not specified; Type: Gauge | Saturation |
pulse_ The number of successfully opened Collector statistics. | Unit: not specified; Type: Gauge | Saturation |
pulse_ The number of Collector statistics that failed to open. | Unit: not specified; Type: Gauge | Error |
pulse_ The Stat Server container uptime in seconds. | Unit: seconds; Type: Gauge | Error |
pulse_ The number of clients connected to the Stat Server. | Unit: not specified; Type: Gauge | Error |
pulse_ The total number of messages received by the Stat Server. | Unit: not specified; Type: Gauge | Traffic |
pulse_ The total number of messages sent by the Stat Server. | Unit: not specified; Type: Gauge | Traffic |
pulse_ The number of Stat Server connections to upstream servers. | Unit: not specified; Type: Gauge | Error |
pulse_ The total number of messages received by the Stat Server from the upstream server. | Unit: not specified; Type: Gauge | Traffic |
pulse_ The duration in seconds of the Stat Server connection to the upstream server. | Unit: seconds; Type: Gauge | Error |
pulse_ The duration in seconds of the Stat Server disconnection from the upstream server. | Unit: seconds; Type: Gauge | Error |
pulse_ The number of successful registration attempts with the upstream T-Server during the current session. | Unit: not specified; Type: Gauge | Saturation |
pulse_ The number of DNs for which registration failed after a predefined number of attempts. | Unit: not specified; Type: Gauge | Error |
pulse_ The average Stat Server latency in seconds. | Unit: seconds; Type: Gauge | Latency |
pulse_ The minimum Stat Server latency in seconds. | Unit: seconds; Type: Gauge | Latency |
pulse_ The maximum Stat Server latency in seconds. | Unit: seconds; Type: Gauge | Latency |
pulse_ The total number of T-Events received by the Stat Server from the upstream T-Server. | Unit: not specified; Type: Gauge | Traffic |
Alerts
Alerts are based on Collector, Stat Server, and Kubernetes cluster metrics.
The following alerts are defined for Tenant Data Collection Unit (DCU).
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
pulse_dcu_monitor_data_unavailable | Critical | Pulse DCU Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
pulse_dcu_critical_nonrunning_instances | Critical | Triggered when Pulse DCU instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
pulse_dcu_too_frequent_restarts | Critical | Detected too frequent restarts of DCU Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
pulse_dcu_critical_cpu | Critical | Detected critical CPU usage by Pulse DCU Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
pulse_dcu_critical_memory | Critical | Detected critical memory usage by Pulse DCU Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
pulse_dcu_critical_disk | Critical | Detected critical disk usage by Pulse DCU Pod. | kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes | 90% |
pulse_dcu_critical_col_snapshot_writing | Critical | Pulse DCU Collector does not write snapshots. | pulse_collector_snapshot_writing_status | for 15 minutes |
pulse_dcu_critical_col_connected_configservers | Critical | Pulse DCU Collector is not connected to ConfigServer. | pulse_collector_connection_status | for 15 minutes |
pulse_dcu_critical_col_connected_dbservers | Critical | Pulse DCU Collector is not connected to DbServer. | pulse_collector_connection_status | for 15 minutes |
pulse_dcu_critical_col_connected_statservers | Critical | Pulse DCU Collector is not connected to Stat Server. | pulse_collector_connection_status | for 15 minutes |
pulse_dcu_critical_ss_failed_dn_registrations | Critical | Detected critical DN registration failures on Pulse DCU Stat Server. | pulse_statserver_dn_failed, pulse_statserver_dn_registered | 0.5% |
pulse_dcu_critical_ss_connected_configservers | Critical | Pulse DCU Stat Server is not connected to ConfigServer. | pulse_statserver_server_connected_seconds | for 15 minutes |
pulse_dcu_critical_ss_connected_tservers | Critical | Pulse DCU Stat Server is not connected to T-Servers. | pulse_statserver_server_connected_number | 2 |
pulse_dcu_critical_ss_connected_ixnservers | Critical | Pulse DCU Stat Server is not connected to IxnServers. | pulse_statserver_server_connected_seconds | 2 |
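
To illustrate how rows in the table above might translate into Prometheus alerting rules, here is a hedged sketch of a PrometheusRule with three of the alerts. The alert names, source metrics, and thresholds come from the table; the PromQL expressions, the Pod name pattern `pulse-dcu-.*`, the resource name, and the way the DN failure ratio is computed are assumptions made for illustration and may differ from the rules actually shipped with DCU.

```yaml
# Illustrative PrometheusRule; expressions are assumptions, not the shipped rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pulse-dcu-alerts-sketch   # hypothetical resource name
spec:
  groups:
    - name: pulse-dcu.sketch
      rules:
        - alert: pulse_dcu_too_frequent_restarts
          # Assumed reading of "2 for 1 hour": more than 2 restarts within 1 hour.
          expr: |
            increase(kube_pod_container_status_restarts_total{pod=~"pulse-dcu-.*"}[1h]) > 2
          labels:
            severity: critical
          annotations:
            description: Detected too frequent restarts of DCU Pod container.
        - alert: pulse_dcu_critical_memory
          # Working-set memory as a fraction of the memory limit, threshold 90%.
          expr: |
            sum by (namespace, pod) (container_memory_working_set_bytes{pod=~"pulse-dcu-.*", container!=""})
              /
            sum by (namespace, pod) (kube_pod_container_resource_limits{pod=~"pulse-dcu-.*", resource="memory"})
              > 0.9
          labels:
            severity: critical
          annotations:
            description: Detected critical memory usage by Pulse DCU Pod.
        - alert: pulse_dcu_critical_ss_failed_dn_registrations
          # Assumed ratio: failed DNs relative to registered DNs, threshold 0.5%.
          expr: |
            pulse_statserver_dn_failed / pulse_statserver_dn_registered > 0.005
          labels:
            severity: critical
          annotations:
            description: Detected critical DN registration failures on Pulse DCU Stat Server.
```

The alerts whose threshold is listed as "for 15 minutes" would typically add a `for: 15m` clause to a condition on the corresponding status metric; the sketch omits them because the source does not state which metric values indicate a failure.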