Tenant Data Collection Unit (DCU) metrics and alerts
Find the metrics that Tenant Data Collection Unit (DCU) exposes and the alerts defined for it.
| Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval | 
|---|---|---|---|---|
| Tenant Data Collection Unit (DCU) | PodMonitor | 9091 | Selector labels (`selector.matchLabels`):<br>`app.kubernetes.io/name: {{include "common.util.chart.name" . }}`<br>`app.kubernetes.io/instance: {{include "common.util.chart.fullname" . }}`<br>`service: {{.Release.Namespace }}`<br>`servicename: {{include "common.util.chart.name" . }}`<br>`tenant: {{.Values.tenant.sid }}`<br>Endpoints to query: `/metrics/` | 30 seconds |
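
The scrape settings in the table above could be expressed as a Prometheus Operator `PodMonitor` along the following lines. This is a minimal sketch, not the manifest shipped with the chart: the resource name, the port name `metrics` (assumed to be a named container port that maps to 9091), and the quoted placeholder label values are all assumptions for illustration. In the real chart the label values are rendered by Helm from the templates listed in the table.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pulse-dcu-podmonitor          # hypothetical name
spec:
  selector:
    matchLabels:
      # Placeholder values; the chart renders these from Helm templates.
      app.kubernetes.io/name: "<chart-name>"
      app.kubernetes.io/instance: "<chart-fullname>"
      service: "<release-namespace>"
      servicename: "<chart-name>"
      tenant: "<tenant-sid>"
  podMetricsEndpoints:
    - port: metrics                   # assumed name of the container port exposing 9091
      path: /metrics/                 # endpoint listed in the table
      interval: 30s                   # metrics update interval listed in the table
```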
Metrics
| Metric and description | Metric details | Indicator of | 
|---|---|---|
| pulse_ The duration in seconds of the last health check performed by Monitor Agent. | Unit: seconds Type: Gauge | Error | 
| pulse_ The Collector container uptime in seconds. | Unit: seconds Type: Gauge | Error | 
| pulse_ The status of writing Collector snapshots to Redis. | Unit: Type: Gauge | Error | 
| pulse_ The number of active layouts. | Unit: Type: Gauge | Saturation | 
| pulse_ The status of the Collector connection to the upstream server. | Unit: Type: Gauge | Error | 
| pulse_ Duration in seconds of connection to the upstream server. | Unit: seconds Type: Gauge | Error | 
| pulse_ Duration in seconds of disconnection from the upstream server. | Unit: seconds Type: Gauge | Error | 
| pulse_ The total number of Collector statistics. | Unit: Type: Gauge | Saturation | 
| pulse_ The number of successfully open Collector statistics. | Unit: Type: Gauge | Saturation | 
| pulse_ The number of Collector statistics that failed to open. | Unit: Type: Gauge | Error | 
| pulse_ The Stat Server container uptime in seconds. | Unit: seconds Type: Gauge | Error | 
| pulse_ The number of clients connected to the Stat Server. | Unit: Type: Gauge | Error | 
| pulse_ The total number of messages received by the Stat Server. | Unit: Type: Gauge | Traffic | 
| pulse_ The total number of messages sent by the Stat Server. | Unit: Type: Gauge | Traffic | 
| pulse_ The number of Stat Server connections to upstream servers. | Unit: Type: Gauge | Error | 
| pulse_ The total number of messages received by the Stat Server from the upstream server. | Unit: Type: Gauge | Traffic | 
| pulse_ Duration in seconds of the Stat Server connection to the upstream server. | Unit: seconds Type: Gauge | Error | 
| pulse_ Duration in seconds of the Stat Server disconnection from the upstream server. | Unit: seconds Type: Gauge | Error | 
| pulse_ The number of successful registration attempts during the current session with the upstream T-Server. | Unit: Type: Gauge | Saturation | 
| pulse_ The number of DNs for which registration failed after a predefined number of attempts. | Unit: Type: Gauge | Error | 
| pulse_ The average Stat Server server latency in seconds. | Unit: seconds Type: Gauge | Latency | 
| pulse_ The minimum Stat Server server latency in seconds. | Unit: seconds Type: Gauge | Latency | 
| pulse_ The maximum Stat Server server latency in seconds. | Unit: seconds Type: Gauge | Latency | 
| pulse_ The total number of T-Events received by the Stat Server from the upstream T-Server. | Unit: Type: Gauge | Traffic | 
Alerts
Alerts are based on Collector, Stat Server, and Kubernetes cluster metrics.
The following alerts are defined for Tenant Data Collection Unit (DCU).
| Alert | Severity | Description | Based on | Threshold | 
|---|---|---|---|---|
| pulse_dcu_monitor_data_unavailable | Critical | Pulse DCU Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
| pulse_dcu_critical_nonrunning_instances | Critical | Triggered when Pulse DCU instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
| pulse_dcu_too_frequent_restarts | Critical | Detected too frequent restarts of DCU Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
| pulse_dcu_critical_cpu | Critical | Detected critical CPU usage by Pulse DCU Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
| pulse_dcu_critical_memory | Critical | Detected critical memory usage by Pulse DCU Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
| pulse_dcu_critical_disk | Critical | Detected critical disk usage by Pulse DCU Pod. | kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes | 90% |
| pulse_dcu_critical_col_snapshot_writing | Critical | Pulse DCU Collector does not write snapshots. | pulse_collector_snapshot_writing_status | for 15 minutes |
| pulse_dcu_critical_col_connected_configservers | Critical | Pulse DCU Collector is not connected to ConfigServer. | pulse_collector_connection_status | for 15 minutes |
| pulse_dcu_critical_col_connected_dbservers | Critical | Pulse DCU Collector is not connected to DbServer. | pulse_collector_connection_status | for 15 minutes |
| pulse_dcu_critical_col_connected_statservers | Critical | Pulse DCU Collector is not connected to Stat Server. | pulse_collector_connection_status | for 15 minutes |
| pulse_dcu_critical_ss_failed_dn_registrations | Critical | Detected critical DN registration failures on Pulse DCU Stat Server. | pulse_statserver_dn_failed, pulse_statserver_dn_registered | 0.5% |
| pulse_dcu_critical_ss_connected_configservers | Critical | Pulse DCU Stat Server is not connected to ConfigServer. | pulse_statserver_server_connected_seconds | for 15 minutes |
| pulse_dcu_critical_ss_connected_tservers | Critical | Pulse DCU Stat Server is not connected to T-Servers. | pulse_statserver_server_connected_number | 2 |
| pulse_dcu_critical_ss_connected_ixnservers | Critical | Pulse DCU Stat Server is not connected to IxnServers. | pulse_statserver_server_connected_seconds | 2 |
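
To illustrate how a table row maps onto a rule, the `pulse_dcu_critical_memory` alert could be written as a Prometheus Operator `PrometheusRule` roughly as follows. This is a sketch under stated assumptions, not the rule shipped with the product. The resource and group names and the pod name pattern `pulse-dcu-.*` are hypothetical. The `resource="memory"` label assumes a kube-state-metrics version that exposes it on `kube_pod_container_resource_limits`. The exact expression and any `for` duration used by the real alert are not given in the table.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pulse-dcu-alerts-sketch       # hypothetical name
spec:
  groups:
    - name: pulse-dcu.sketch          # hypothetical group name
      rules:
        - alert: pulse_dcu_critical_memory
          # Working-set memory divided by the memory limit, per container.
          # The pod name pattern is an assumption; adjust it to the deployment.
          expr: |
            sum by (namespace, pod, container) (
              container_memory_working_set_bytes{pod=~"pulse-dcu-.*", container!=""}
            )
            /
            sum by (namespace, pod, container) (
              kube_pod_container_resource_limits{resource="memory", pod=~"pulse-dcu-.*"}
            )
            > 0.9
          labels:
            severity: critical
          annotations:
            summary: Detected critical memory usage by Pulse DCU Pod.
```

The 0.9 comparison corresponds to the 90% threshold in the table; the other ratio-based alerts (CPU, disk) would follow the same pattern with their own metric pairs.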
