Corinneh: Published

2022-02-23T20:56:47Z

{{ArticlePEServiceMetrics
|IncludedServiceId=fe5268d0-df4c-4c25-bb9a-283f94d25d49
|CRD=PodMonitor
|Port=11000
|Endpoint=http://<pod-ipaddress>:11000/metrics
|MetricsUpdateInterval=30 seconds
|MetricsDefined=Yes
|MetricsIntro=The Voice Agent State Service exposes Genesys-defined, Agent State Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the Agent State Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available Agent State Service metrics that are not documented on this page.
|PEMetric={{PEMetric
|Metric=agent_redis_state
|Type=gauge
|Unit=N/A
|Label=location, redis_cluster_name
|MetricDescription=Current Redis connection state:

-1 - error 
0 - disconnected 
1 - connected 
2 - ready
|SampleValue=2
}}{{PEMetric
|Metric=agent_stream_redis_state
|Type=gauge
|Unit=N/A
|Label=location, redis_cluster_name
|MetricDescription=Current Tenant Redis connection state:

0 - disconnected 
1 - connected
|SampleValue=1
}}{{PEMetric
|Metric=agent_total_sessions
|Type=gauge
|Unit=N/A
|Label=tenant
|MetricDescription=Total number of agent sessions.
|UsedFor=Saturation
}}{{PEMetric
|Metric=agent_callevents
|Type=counter
|Unit=N/A
|Label=tenant
|MetricDescription=Total number of received call events.
|UsedFor=Traffic
}}{{PEMetric
|Metric=agent_logged_in_agents
|Type=gauge
|Unit=N/A
|Label=tenant
|MetricDescription=Number of logged-in agents.
|UsedFor=Saturation
}}{{PEMetric
|Metric=agent_health_level
|Type=gauge
|Unit=N/A
|Label=tenant
|MetricDescription=Health level of the agent node:

-1 - error 
0 - fail 
1 - degraded 
2 - pass
|SampleValue=2
|UsedFor=Traffic
}}{{PEMetric
|Metric=agent_envoy_proxy_status
|Type=gauge
|Unit=N/A
|MetricDescription=Status of the Envoy proxy:

-1 - error 
0 - disconnected 
1 - connected
|SampleValue=1
}}{{PEMetric
|Metric=agent_config_node_status
|Type=gauge
|Unit=N/A
|MetricDescription=Status of the config node connection:

0 - disconnected 
1 - connected
|SampleValue=1
}}{{PEMetric
|Metric=http_client_request_duration_seconds
|Type=histogram
|Unit=seconds
|Label=target_service_name
|MetricDescription=HTTP client time from request to response, in seconds.
}}{{PEMetric
|Metric=http_client_response_count
|Type=counter
|Unit=N/A
|Label=target_service_name, tenant, status
|MetricDescription=HTTP client responses received.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_consumer_recv_messages_total
|Type=counter
|Unit=N/A
|Label=topic, tenant, kafka_location
|MetricDescription=Number of messages received from Kafka.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_consumer_error_total
|Type=counter
|Unit=N/A
|Label=topic, kafka_location
|MetricDescription=Number of Kafka consumer errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=kafka_consumer_latency
|Type=histogram
|Unit=seconds
|Label=topic, tenant, kafka_location
|MetricDescription=Consumer latency: the time between when a message is produced and when it is consumed. That is, the time at which the consumer received the message minus the time at which the producer produced it.
|UsedFor=Latency
}}{{PEMetric
|Metric=kafka_consumer_rebalance_total
|Type=counter
|Unit=N/A
|Label=topic, kafka_location
|MetricDescription=Number of Kafka consumer rebalance events.
}}{{PEMetric
|Metric=kafka_consumer_state
|Type=gauge
|Unit=N/A
|Label=topic, kafka_location
|MetricDescription=Current state of the Kafka consumer.
}}{{PEMetric
|Metric=kafka_producer_sent_messages_total
|Type=counter
|Unit=N/A
|Label=topic, tenant, kafka_location
|MetricDescription=Number of messages sent to Kafka.
}}{{PEMetric
|Metric=kafka_producer_queue_depth
|Type=gauge
|Unit=N/A
|Label=kafka_location
|MetricDescription=Number of Kafka producer pending events.
|UsedFor=Saturation
}}{{PEMetric
|Metric=kafka_producer_queue_age_seconds
|Type=gauge
|Unit=seconds
|Label=kafka_location
|MetricDescription=Age of the oldest producer pending event in seconds.
}}{{PEMetric
|Metric=kafka_producer_error_total
|Type=counter
|Unit=N/A
|Label=kafka_location
|MetricDescription=Number of Kafka producer errors.
}}{{PEMetric
|Metric=kafka_producer_state
|Type=gauge
|Unit=N/A
|Label=kafka_location
|MetricDescription=Current state of the Kafka producer.
}}{{PEMetric
|Metric=log_output_bytes_total
|Type=counter
|Unit=bytes
|Label=level, format, module
|MetricDescription=Total amount of log output, in bytes.
}}
|AlertsDefined=Yes
|PEAlert={{PEAlert
|Alert=Kafka events latency is too high
|Severity=Warning
|AlertDescription=Actions:

*If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload).
*If the alarm is triggered only for topic <nowiki>{{ $labels.topic }}</nowiki>, check if there is an issue with the service related to the topic (CPU, memory, or network overload).
|BasedOn=kafka_consumer_latency_bucket
|Threshold=Latency for more than 5% of messages is more than 0.5 seconds for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Possible messages lost
|Severity=Critical
|AlertDescription=Actions:

*Check for overload of Kafka or the <nowiki>{{ $labels.job }}</nowiki> service, and for network degradation.
|BasedOn=kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total
|Threshold=Number of sent messages is two times higher than the number of received messages for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Too many Kafka consumer failed health checks
|Severity=Warning
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka.
*If the alarm is triggered only for container <nowiki>{{ $labels.container }}</nowiki>, check if there is an issue with the service.
|BasedOn=kafka_consumer_error_total
|Threshold=Health check failed more than 10 times in 5 minutes for Kafka consumer for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Too many Kafka consumer request timeouts
|Severity=Warning
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka.
*If the alarm is triggered only for container <nowiki>{{ $labels.container }}</nowiki>, check if there is an issue with the service.
|BasedOn=kafka_consumer_error_total
|Threshold=More than 10 request timeouts appeared in 5 minutes for Kafka consumer for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Too many Kafka consumer crashes
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka.
*If the alarm is triggered only for container <nowiki>{{ $labels.container }}</nowiki>, check if there is an issue with the service.
|BasedOn=kafka_consumer_error_total
|Threshold=More than 3 Kafka consumer crashes in 5 minutes for service <nowiki>{{ $labels.container }}</nowiki>.
}}{{PEAlert
|Alert=Pod status Failed
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Failed state.
}}{{PEAlert
|Alert=Pod status Unknown
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Unknown state for 5 minutes.
}}{{PEAlert
|Alert=Pod status Pending
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Pending state for 5 minutes.
}}{{PEAlert
|Alert=Pod status NotReady
|Severity=Critical
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_ready
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in NotReady status for 5 minutes.
}}{{PEAlert
|Alert=Container restarted repeatedly
|Severity=Critical
|AlertDescription=Actions:

*Check whether a new version of the image was deployed.
*Check for issues with the Kubernetes cluster.
|BasedOn=kube_pod_container_status_restarts_total
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> was restarted 5 or more times within 15 minutes.
}}{{PEAlert
|Alert=Max replicas is not sufficient for 5 mins
|Severity=Critical
|AlertDescription=The desired number of replicas is higher than the current available replicas for the past 5 minutes.
|BasedOn=kube_statefulset_replicas, kube_statefulset_status_replicas
|Threshold=The desired number of replicas is higher than the current available replicas for the past 5 minutes.
}}{{PEAlert
|Alert=Kafka not available
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka.
*If the alarm is triggered only for pod <nowiki>{{ $labels.pod }}</nowiki>, check if there is an issue with the pod.
|BasedOn=kafka_producer_state, kafka_consumer_state
|Threshold=Kafka is not available for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 consecutive minutes.
}}{{PEAlert
|Alert=Redis not available
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis.
*If the alarm is triggered only for pod <nowiki>{{ $labels.pod }}</nowiki>, check if there is an issue with the pod.
|BasedOn=agent_redis_state, agent_stream_redis_state
|Threshold=Redis is not available for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 consecutive minutes.
}}{{PEAlert
|Alert=Agent service fail
|Severity=Critical
|AlertDescription=Actions:

*Check if there is any problem with pod <nowiki>{{ $labels.pod }}</nowiki>, then restart the pod.
|BasedOn=agent_health_level
|Threshold=Agent health level is Fail for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 consecutive minutes.
}}{{PEAlert
|Alert=Config node fail
|Severity=Warning
|AlertDescription=Actions:

*Check if there is any problem with pod <nowiki>{{ $labels.pod }}</nowiki> and the config node.
|BasedOn=http_client_response_count
|Threshold=Requests to the config node fail for 5 consecutive minutes.
}}{{PEAlert
|Alert=Pod CPU greater than 65%
|Severity=Warning
|AlertDescription=High CPU load for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> CPU usage exceeded 65% for 5 minutes.
}}{{PEAlert
|Alert=Pod CPU greater than 80%
|Severity=Critical
|AlertDescription=Critical CPU load for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> CPU usage exceeded 80% for 5 minutes.
}}{{PEAlert
|Alert=Pod memory greater than 65%
|Severity=Warning
|AlertDescription=High memory usage for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> memory usage exceeded 65% for 5 minutes.
}}{{PEAlert
|Alert=Pod memory greater than 80%
|Severity=Critical
|AlertDescription=Critical memory usage for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> memory usage exceeded 80% for 5 minutes.
}}{{PEAlert
|Alert=Too many Kafka pending events
|Severity=Critical
|AlertDescription=Actions:

*Ensure there are no issues with Kafka or with the CPU and network of pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=kafka_producer_queue_depth
|Threshold=Too many Kafka producer pending events for pod <nowiki>{{ $labels.pod }}</nowiki> (more than 100 in 5 minutes).
}}
}}
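
Several gauges in the metrics table report numeric state codes rather than self-describing values. The following is a minimal Python sketch of decoding those codes from the text returned by the metrics endpoint. The code-to-name mappings are copied from the metric descriptions on this page; the sample payload, its label values, and the helper names <code>parse_samples</code> and <code>describe</code> are hypothetical, and the parser is deliberately simplified (it assumes label values contain no closing brace).

```python
import re

# Code-to-name mappings copied from the metric descriptions on this page.
STATE_NAMES = {
    "agent_redis_state": {-1: "error", 0: "disconnected", 1: "connected", 2: "ready"},
    "agent_stream_redis_state": {0: "disconnected", 1: "connected"},
    "agent_health_level": {-1: "error", 0: "fail", 1: "degraded", 2: "pass"},
    "agent_envoy_proxy_status": {-1: "error", 0: "disconnected", 1: "connected"},
    "agent_config_node_status": {0: "disconnected", 1: "connected"},
}

# Simplified matcher for one sample line: name, optional label set, value.
SAMPLE_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)")


def parse_samples(text):
    """Yield (metric, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, labels or "", float(value)


def describe(name, value):
    """Return the documented state name, or None for non-state metrics."""
    names = STATE_NAMES.get(name)
    if names is None:
        return None
    return names.get(int(value), "unknown")


# Hypothetical scrape output; real label values will differ.
SAMPLE = """\
# TYPE agent_redis_state gauge
agent_redis_state{location="loc1",redis_cluster_name="rc1"} 2
agent_health_level{tenant="t1"} 1
agent_total_sessions{tenant="t1"} 42
"""

for name, labels, value in parse_samples(SAMPLE):
    state = describe(name, value)
    if state is not None:
        print(f"{name} [{labels}] = {state}")
```

To try this against a running pod, fetch the body of <code>http://<pod-ipaddress>:11000/metrics</code> and pass the decoded text to <code>parse_samples</code> in place of the sample payload.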

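The threshold for the "Kafka events latency is too high" alert (latency above 0.5 seconds for more than 5% of messages) can be illustrated with a small worked calculation over histogram buckets. The actual alert rule expression is not shown on this page, so this is only a sketch under stated assumptions: that <code>kafka_consumer_latency_bucket</code> has a bucket boundary at 0.5, and that the rule compares the growth of cumulative bucket counts over a time window. The function names and bucket counts are hypothetical.

```python
def slow_fraction(prev, curr, threshold="0.5"):
    """Fraction of messages in the window whose latency exceeded `threshold`.

    `prev` and `curr` map bucket upper bounds (the `le` label) to cumulative
    counts from two scrapes of kafka_consumer_latency_bucket for one topic.
    """
    total = curr["+Inf"] - prev["+Inf"]
    if total <= 0:
        return 0.0  # no new messages (or a counter reset) in the window
    fast = curr[threshold] - prev[threshold]
    return 1.0 - fast / total


def latency_alert(prev, curr, max_slow_fraction=0.05):
    """True when more than 5% of messages were slower than 0.5 seconds."""
    return slow_fraction(prev, curr) > max_slow_fraction


# Hypothetical cumulative bucket counts taken five minutes apart.
earlier = {"0.1": 900, "0.5": 980, "1": 995, "+Inf": 1000}
later = {"0.1": 1700, "0.5": 1860, "1": 1950, "+Inf": 2000}

# 1000 new messages, 880 of them at or under 0.5 s, so about 12% were slow.
print(slow_fraction(earlier, later))
print(latency_alert(earlier, later))
```

Because histogram buckets are cumulative, only the 0.5 bucket and the <code>+Inf</code> bucket (the total count) are needed for this check.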
VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics - Revision history
