Tenant Data Collection Unit (DCU) metrics and alerts

This topic is part of the manual Genesys Pulse Private Edition Guide for version Current of Reporting.


Find the metrics that Tenant Data Collection Unit (DCU) exposes and the alerts defined for it.

Service: Tenant Data Collection Unit (DCU)
CRD or annotations?: PodMonitor
Port: 9091
Endpoint/Selector:
selector:
  matchLabels:
    app.kubernetes.io/name: {{include "common.util.chart.name" . }}
    app.kubernetes.io/instance: {{include "common.util.chart.fullname" . }}
    service: {{.Release.Namespace }}
    servicename: {{include "common.util.chart.name" . }}
    tenant: {{.Values.tenant.sid }}

Endpoints to query: /metrics/
Metrics update interval: 30 seconds
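
The scrape settings above could be combined into a PodMonitor along the following lines. This is a minimal sketch of a Helm template, not the manifest shipped with the chart: the resource name and the port name `metrics` are assumptions, and the selector labels are copied from the table above.

```yaml
# Sketch of a PodMonitor template; not the manifest shipped with the chart.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  # Hypothetical resource name.
  name: pulse-dcu-podmonitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "common.util.chart.name" . }}
      app.kubernetes.io/instance: {{ include "common.util.chart.fullname" . }}
      service: {{ .Release.Namespace }}
      servicename: {{ include "common.util.chart.name" . }}
      tenant: {{ .Values.tenant.sid }}
  podMetricsEndpoints:
    # Assumption: the container port that listens on 9091 is named "metrics".
    - port: metrics
      path: /metrics/
      # Matches the 30-second metrics update interval.
      interval: 30s
```

A scrape interval shorter than the metrics update interval would most likely return repeated values, so 30 seconds is a sensible lower bound here.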

Metrics

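For each metric, the entries below give the name and description, the metric details (unit, type, labels, and sample value), and the category that the metric is an indicator of.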
pulse_monitor_check_duration_seconds

The duration in seconds of the last health check performed by Monitor Agent.

Unit: seconds

Type: Gauge
Label: tenant
Sample value:

Indicator of: Error

pulse_collector_uptime_seconds

The Collector container uptime in seconds.

Unit: seconds

Type: Gauge
Label: tenant
Sample value:

Indicator of: Error

pulse_collector_snapshot_writing_status

The status of writing Collector snapshots to Redis.

Unit:

Type: Gauge
Label: tenant
Sample value: 1

Indicator of: Error

pulse_collector_active_layouts_count

The number of active layouts.

Unit:

Type: Gauge
Label: tenant
Sample value: 100

Indicator of: Saturation

pulse_collector_connection_status

The status of the Collector connection to the upstream server.

Unit:

Type: Gauge
Label: tenant, connection
Sample value: 1

Indicator of: Error

pulse_collector_connection_connected_seconds

Duration in seconds of connection to the upstream server.

Unit: seconds

Type: Gauge
Label: tenant, connection
Sample value:

Indicator of: Error

pulse_collector_connection_disconnected_seconds

Duration in seconds of disconnection from the upstream server.

Unit: seconds

Type: Gauge
Label: tenant, connection
Sample value:

Indicator of: Error

pulse_collector_statistics_total_count

The total number of Collector statistics.

Unit:

Type: Gauge
Label: tenant, connection
Sample value: 1000

Indicator of: Saturation

pulse_collector_statistics_opened_count

The number of successfully opened Collector statistics.

Unit:

Type: Gauge
Label: tenant, connection
Sample value: 1000

Indicator of: Saturation

pulse_collector_statistics_failed_count

The number of Collector statistics that failed to open.

Unit:

Type: Gauge
Label: tenant, connection
Sample value: 0

Indicator of: Error

pulse_statserver_uptime_seconds

The Stat Server container uptime in seconds.

Unit: seconds

Type: Gauge
Label: tenant
Sample value:

Indicator of: Error

pulse_statserver_clients_number

The number of clients connected to the Stat Server.

Unit:

Type: Gauge
Label: tenant
Sample value: 1

Indicator of: Error

pulse_statserver_messages_received_total_count

The total number of messages received by the Stat Server.

Unit:

Type: Gauge
Label: tenant
Sample value: 10000

Indicator of: Traffic

pulse_statserver_messages_sent_total_count

The total number of messages sent by the Stat Server.

Unit:

Type: Gauge
Label: tenant
Sample value: 10000

Indicator of: Traffic

pulse_statserver_server_connected_number

The number of Stat Server connections to upstream servers.

Unit:

Type: Gauge
Label: tenant, type
Sample value: 1

Indicator of: Error

pulse_statserver_server_messages_received_total_count

The total number of messages received by the Stat Server from the upstream server.

Unit:

Type: Gauge
Label: tenant, server, type
Sample value: 1

Indicator of: Traffic

pulse_statserver_server_connected_seconds

Duration in seconds of the Stat Server connection to the upstream server.

Unit: seconds

Type: Gauge
Label: tenant, server, type
Sample value:

Indicator of: Error

pulse_statserver_server_disconnects_count

Duration in seconds of the Stat Server disconnection from the upstream server.

Unit: seconds

Type: Gauge
Label: tenant, server, type
Sample value:

Indicator of: Error

pulse_statserver_dn_registered

The number of successful registration attempts with the upstream T-Server during the current session.

Unit:

Type: Gauge
Label: tenant, server, type
Sample value: 1000

Indicator of: Saturation

pulse_statserver_dn_failed

The number of DNs for which registration failed after a predefined number of attempts.

Unit:

Type: Gauge
Label: tenant, server, type
Sample value: 0

Indicator of: Error

pulse_statserver_server_latency_avg

The average Stat Server server latency in seconds.

Unit: seconds

Type: Gauge
Label: tenant, server, type
Sample value: 0.25

Indicator of: Latency

pulse_statserver_server_latency_min

The minimum Stat Server server latency in seconds.

Unit: seconds

Type: Gauge
Label: tenant, server, type
Sample value: 0.25

Indicator of: Latency

pulse_statserver_server_latency_max

The maximum Stat Server server latency in seconds.

Unit: seconds

Type: Gauge
Label: tenant, server, type
Sample value: 0.25

Indicator of: Latency

pulse_statserver_server_tevents_received_total_count

The total number of T-Events received by the Stat Server from the upstream T-Server.

Unit:

Type: Gauge
Label: tenant, server, type
Sample value: 10000

Indicator of: Traffic


Alerts

Alerts are based on Collector, Stat Server, and Kubernetes cluster metrics.

The following alerts are defined for Tenant Data Collection Unit (DCU).

pulse_dcu_monitor_data_unavailable

Pulse DCU Monitor Agents do not provide data.

Severity: Critical
Based on: pulse_monitor_check_duration_seconds, kube_statefulset_replicas
Threshold: for 15 minutes

pulse_dcu_critical_nonrunning_instances

Triggered when Pulse DCU instances are down.

Severity: Critical
Based on: kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas
Threshold: for 15 minutes

pulse_dcu_too_frequent_restarts

Detected too frequent restarts of DCU Pod container.

Severity: Critical
Based on: kube_pod_container_status_restarts_total
Threshold: 2 for 1 hour

pulse_dcu_critical_cpu

Detected critical CPU usage by Pulse DCU Pod.

Severity: Critical
Based on: container_cpu_usage_seconds_total, kube_pod_container_resource_limits
Threshold: 90%

pulse_dcu_critical_memory

Detected critical memory usage by Pulse DCU Pod.

Severity: Critical
Based on: container_memory_working_set_bytes, kube_pod_container_resource_limits
Threshold: 90%

pulse_dcu_critical_disk

Detected critical disk usage by Pulse DCU Pod.

Severity: Critical
Based on: kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes
Threshold: 90%

pulse_dcu_critical_col_snapshot_writing

Pulse DCU Collector does not write snapshots.

Severity: Critical
Based on: pulse_collector_snapshot_writing_status
Threshold: for 15 minutes

pulse_dcu_critical_col_connected_configservers

Pulse DCU Collector is not connected to ConfigServer.

Severity: Critical
Based on: pulse_collector_connection_status
Threshold: for 15 minutes

pulse_dcu_critical_col_connected_dbservers

Pulse DCU Collector is not connected to DbServer.

Severity: Critical
Based on: pulse_collector_connection_status
Threshold: for 15 minutes

pulse_dcu_critical_col_connected_statservers

Pulse DCU Collector is not connected to Stat Server.

Severity: Critical
Based on: pulse_collector_connection_status
Threshold: for 15 minutes

pulse_dcu_critical_ss_failed_dn_registrations

Detected critical DN registration failures on Pulse DCU Stat Server.

Severity: Critical
Based on: pulse_statserver_dn_failed, pulse_statserver_dn_registered
Threshold: 0.5%

pulse_dcu_critical_ss_connected_configservers

Pulse DCU Stat Server is not connected to ConfigServer.

Severity: Critical
Based on: pulse_statserver_server_connected_seconds
Threshold: for 15 minutes

pulse_dcu_critical_ss_connected_tservers

Pulse DCU Stat Server is not connected to T-Servers.

Severity: Critical
Based on: pulse_statserver_server_connected_number
Threshold: 2

pulse_dcu_critical_ss_connected_ixnservers

Pulse DCU Stat Server is not connected to IxnServers.

Severity: Critical
Based on: pulse_statserver_server_connected_seconds
Threshold: 2
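
To illustrate how a "Based on" metric and a threshold might translate into a Prometheus alerting rule, here is a hedged sketch of a PrometheusRule covering two of the alerts above. The expressions are assumptions and not the rules shipped with Pulse: the sketch assumes that a value of 1 for pulse_collector_snapshot_writing_status means snapshots are being written, and that the DN failure ratio is computed as failed / (failed + registered). The resource and group names are hypothetical.

```yaml
# Sketch only: expressions are assumptions, not the rules shipped with Pulse.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pulse-dcu-example-rules   # hypothetical name
spec:
  groups:
    - name: pulse-dcu-example     # hypothetical group name
      rules:
        - alert: pulse_dcu_critical_col_snapshot_writing
          # Assumption: 1 means snapshots are being written.
          expr: pulse_collector_snapshot_writing_status != 1
          # Fire only if the condition holds for 15 minutes.
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Pulse DCU Collector does not write snapshots.
        - alert: pulse_dcu_critical_ss_failed_dn_registrations
          # Assumption: ratio = failed / (failed + registered); 0.5% = 0.005.
          expr: |
            pulse_statserver_dn_failed
              / (pulse_statserver_dn_failed + pulse_statserver_dn_registered)
              > 0.005
          labels:
            severity: critical
          annotations:
            summary: Detected critical DN registration failures on Pulse DCU Stat Server.
```

Both DN metrics carry the same labels (tenant, server, type), so the division matches series one-to-one without extra `on()` or `ignoring()` clauses.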