Agent State Service metrics and alerts
Find the metrics Agent State Service exposes and the alerts defined for Agent State Service.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
Agent State Service | PodMonitor | 11000 | http://<pod-ipaddress>:11000/metrics | 30 seconds |
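To check that the endpoint listed above responds, you can fetch and parse it directly. The following is a minimal Python sketch, assuming the pod IP address is reachable from where the script runs. The `pod_ip` argument is a placeholder you must supply, and the parser handles only simple `name{labels} value` sample lines, not the full exposition format.

```python
import re
import urllib.request

# Matches simple sample lines such as: name{label="x"} 1.0  or  name 1.0
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Return a list of (name, labels_text, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


def fetch_metrics(pod_ip, port=11000, timeout=5):
    """Fetch /metrics from one pod; pod_ip is a placeholder you must supply."""
    url = f"http://{pod_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics("<pod-ipaddress>")` with a real pod address returns every sample the pod currently exposes, which is a quick way to confirm that scraping can work before checking the PodMonitor itself.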
Metrics
Voice Agent State Service exposes Genesys-defined, Agent State Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the Agent State Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available Agent State Service metrics that are not documented on this page.
Metric and description | Metric details | Indicator of |
---|---|---|
agent_ Current Redis connection state: -1 - error | Unit: N/A Type: gauge | |
agent_ Current Tenant Redis connection state: 0 - disconnected | Unit: N/A Type: gauge | |
agent_ Total number of agent sessions. | Unit: N/A Type: gauge | Saturation |
agent_ Total number of received call events. | Unit: N/A Type: counter | Traffic |
agent_ Number of logged-in agents. | Unit: N/A Type: gauge | Saturation |
agent_ Health level of the agent node: -1 - error | Unit: N/A Type: gauge | Traffic |
agent_ Status of the Envoy proxy: -1 - error | Unit: N/A Type: gauge | |
agent_ Status of the config node connection: 0 - disconnected | Unit: N/A Type: gauge | |
http_ HTTP client time from request to response, in seconds. | Unit: seconds Type: histogram | |
http_ HTTP client responses received. | Unit: N/A Type: counter | Traffic |
kafka_ Number of messages received from Kafka. | Unit: N/A Type: counter | Traffic |
kafka_ Number of Kafka consumer errors. | Unit: N/A Type: counter | Errors |
kafka_ Consumer latency is the time difference between when the message is produced and when the message is consumed. That is, the time when the consumer received the message minus the time when the producer produced the message. | Unit: Type: histogram | Latency |
kafka_ Number of Kafka consumer re-balance events. | Unit: N/A Type: counter | |
kafka_ Current state of the Kafka consumer. | Unit: N/A Type: gauge | |
kafka_ Number of messages received from Kafka. | Unit: N/A Type: counter | |
kafka_ Number of Kafka producer pending events. | Unit: N/A Type: gauge | Saturation |
kafka_ Age of the oldest producer pending event in seconds. | Unit: seconds Type: gauge | |
kafka_ Number of Kafka producer errors. | Unit: N/A Type: counter | |
kafka_ Current state of the Kafka producer. | Unit: N/A Type: gauge | |
log_ Total amount of log output, in bytes. | Unit: bytes Type: counter | |
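To list everything the service exposes, you can use the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The following Python sketch shows one way to build the request and read the result. The Prometheus base URL is deployment-specific and is shown here only as a placeholder, and the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{query}"


def extract_samples(payload):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    # Each result carries "value": [unix_timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in results]


def query_prometheus(prometheus_url, promql, timeout=10):
    """Run one instant query; prometheus_url is deployment-specific."""
    url = build_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_samples(json.load(response))
```

For example, `query_prometheus("http://prometheus:9090", '{__name__=~"agent_.*"}')` would return every series whose name starts with `agent_`, assuming a Prometheus server is reachable at that hypothetical address.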
Alerts
The following alerts are defined for Agent State Service.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
Kafka events latency is too high | Warning | Actions: | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for topic {{ $labels.topic }}. |
Possible messages lost | Critical | Actions: | kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total | Number of sent requests is two times higher than received for topic {{ $labels.topic }}. |
Too many Kafka consumer failed health checks | Warning | Actions: | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for Kafka consumer for topic {{ $labels.topic }}. |
Too many Kafka consumer request timeouts | Warning | Actions: | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for Kafka consumer for topic {{ $labels.topic }}. |
Too many Kafka consumer crashes | Critical | Actions: | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for service {{ $labels.container }}. |
Pod status Failed | Warning | Actions: | kube_pod_status_phase | Pod {{ $labels.pod }} is in Failed state. |
Pod status Unknown | Warning | Actions: | kube_pod_status_phase | Pod {{ $labels.pod }} is in Unknown state for 5 minutes. |
Pod status Pending | Warning | Actions: | kube_pod_status_phase | Pod {{ $labels.pod }} is in Pending state for 5 minutes. |
Pod status NotReady | Critical | Actions: | kube_pod_status_ready | Pod {{ $labels.pod }} is in NotReady status for 5 minutes. |
Container restarted repeatedly | Critical | Actions: | kube_pod_container_status_restarts_total | Container {{ $labels.container }} was restarted 5 or more times within 15 minutes. |
Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Kafka not available | Critical | Actions: | kafka_producer_state, kafka_consumer_state | Kafka is not available for pod {{ $labels.pod }} for 5 consecutive minutes. |
Redis not available | Critical | Actions: | agent_redis_state, agent_stream_redis_state | Redis is not available for pod {{ $labels.pod }} for 5 consecutive minutes. |
Agent service fail | Critical | Actions: | agent_health_level | Agent health level is Fail for pod {{ $labels.pod }} for 5 consecutive minutes. |
Config node fail | Warning | Actions: | http_client_response_count | Requests to the config node fail for 5 consecutive minutes. |
Pod CPU greater than 65% | Warning | High CPU load for pod {{ $labels.pod }}. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes. |
Pod CPU greater than 80% | Critical | Critical CPU load for pod {{ $labels.pod }}. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes. |
Pod memory greater than 65% | Warning | High memory usage for pod {{ $labels.pod }}. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes. |
Pod memory greater than 80% | Critical | Critical memory usage for pod {{ $labels.pod }}. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes. |
Too many Kafka pending events | Critical | Actions: | kafka_producer_queue_depth | Too many Kafka producer pending events for pod {{ $labels.pod }} (more than 100 in 5 minutes). |
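The latency alert above is based on a histogram's `_bucket` series, where each bucket holds a cumulative count of observations at or below its upper bound. The following Python sketch illustrates how a condition such as "more than 5% of messages slower than 0.5 seconds" could be evaluated from those counts. It is not the actual alert expression, which this page does not show. It assumes the histogram has a bucket boundary at exactly 0.5 seconds and that the counts passed in are increases over the evaluation window; the function names are illustrative.

```python
def slow_fraction(buckets, threshold_seconds=0.5):
    """Fraction of observations slower than threshold_seconds.

    buckets maps each upper bound (the "le" label, as a float) to its
    cumulative count, and must include float("inf") for the total.
    """
    total = buckets[float("inf")]
    if total == 0:
        return 0.0  # no messages observed, so nothing is slow
    within = buckets[threshold_seconds]  # assumes a bucket at this exact bound
    return (total - within) / total


def latency_alert_fires(buckets, threshold_seconds=0.5, max_slow_fraction=0.05):
    """True when more than max_slow_fraction of messages exceed the threshold."""
    return slow_fraction(buckets, threshold_seconds) > max_slow_fraction
```

With 100 messages in the window, of which 90 completed within 0.5 seconds, the slow fraction is 10%, so the condition holds. With 96 of 100 within 0.5 seconds, it does not.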