Corinneh: Published

2022-02-23T20:56:27Z

{{ArticlePEServiceMetrics
|IncludedServiceId=e6a28c22-4cf7-4037-b117-2f7c5b35d8f5
|CRD=Supports both CRD and annotations
|Port=11900
|Endpoint=http://<pod-ipaddress>:11900/metrics
|MetricsUpdateInterval=30 seconds
|MetricsDefined=Yes
|MetricsIntro=Voice Call State Service exposes Genesys-defined, Call State Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the Call State Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available Call State Service metrics that are not documented on this page.
|PEMetric={{PEMetric
|Metric=callthread_call_threads
|Type=counter
|Unit=N/A
|MetricDescription=Number of monitored call threads.
|UsedFor=Saturation
}}{{PEMetric
|Metric=callthread_envoy_proxy_status
|Type=gauge
|Unit=N/A
|MetricDescription=Status of the Envoy proxy:

-1 - error
0 - disconnected
1 - connected
}}{{PEMetric
|Metric=callthread_health_level
|Type=gauge
|Unit=N/A
|MetricDescription=Health level of the agent node:

-1 - error 
0 - fail 
1 - degraded 
2 - pass
}}{{PEMetric
|Metric=callthread_healthcheck_generic_exception
|Type=gauge
|Unit=N/A
|MetricDescription=Generic error during health check.
}}{{PEMetric
|Metric=callthread_redis_state
|Type=gauge
|Unit=N/A
|MetricDescription=Current Redis connection state:

-1 - error
0 - disconnected
1 - connected
2 - ready
|UsedFor=Errors
}}{{PEMetric
|Metric=http_client_request_duration_seconds
|Type=histogram
|Unit=seconds
|Label=target_service_name
|MetricDescription=HTTP client time from request to response, in seconds.
}}{{PEMetric
|Metric=http_client_response_count
|Type=counter
|Unit=N/A
|Label=target_service_name, tenant, status
|MetricDescription=The number of HTTP client responses received.
}}{{PEMetric
|Metric=kafka_consumer_recv_messages_total
|Type=counter
|Unit=N/A
|Label=topic, tenant, kafka_location
|MetricDescription=Number of messages received from Kafka.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_consumer_error_total
|Type=counter
|Unit=N/A
|Label=topic, kafka_location
|MetricDescription=Number of Kafka consumer errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=kafka_consumer_latency
|Type=histogram
|Label=topic, tenant, kafka_location
|MetricDescription=Consumer latency: the time at which the consumer received the message minus the time at which the producer produced it.
|UsedFor=Latency
}}{{PEMetric
|Metric=kafka_consumer_rebalance_total
|Type=counter
|Unit=N/A
|Label=topic, kafka_location
|MetricDescription=Number of Kafka consumer rebalance events.
}}{{PEMetric
|Metric=kafka_consumer_state
|Type=gauge
|Unit=N/A
|Label=topic, kafka_location
|MetricDescription=Current state of the Kafka consumer.
}}{{PEMetric
|Metric=kafka_producer__messages_total
|Type=counter
|Unit=N/A
|Label=topic, tenant, kafka_location
|MetricDescription=Number of messages sent to Kafka.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_producer_queue_depth
|Type=gauge
|Unit=N/A
|Label=kafka_location
|MetricDescription=Number of Kafka producer pending events.
|UsedFor=Saturation
}}{{PEMetric
|Metric=kafka_producer_queue_age_seconds
|Type=gauge
|Unit=seconds
|Label=kafka_location
|MetricDescription=Age of the oldest producer pending event, in seconds.
}}{{PEMetric
|Metric=kafka_producer_error_total
|Type=counter
|Unit=N/A
|Label=kafka_location
|MetricDescription=Number of Kafka producer errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=kafka_producer_state
|Type=gauge
|Unit=N/A
|Label=kafka_location
|MetricDescription=Current state of the Kafka producer.
}}{{PEMetric
|Metric=log_output_bytes_total
|Type=counter
|Unit=bytes
|Label=level, format, module
|MetricDescription=Total amount of log output, in bytes.
}}
|AlertsDefined=Yes
|PEAlert={{PEAlert
|Alert=Kafka events latency is too high
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload).
*If the alarm is triggered only for topic <nowiki>{{ $labels.topic }}</nowiki>, check if there is an issue with the service related to the topic (CPU, memory, or network overload).
|BasedOn=kafka_consumer_latency_bucket
|Threshold=Latency for more than 5% of messages is more than 0.5 seconds for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Too many Kafka consumer failed health checks
|Severity=Warning
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka.
*If the alarm is triggered only for <nowiki>{{ $labels.container }}</nowiki>, check if there is an issue with the service.
|BasedOn=kafka_consumer_error_total
|Threshold=Health check failed more than 10 times in 5 minutes for Kafka consumer for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Too many Kafka consumer request timeouts
|Severity=Warning
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka.
*If the alarm is triggered only for <nowiki>{{ $labels.container }}</nowiki>, check if there is an issue with the service.
|BasedOn=kafka_consumer_error_total
|Threshold=More than 10 request timeouts appeared in 5 minutes for Kafka consumer for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Too many Kafka consumer crashes
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka.
*If the alarm is triggered only for <nowiki>{{ $labels.container }}</nowiki>, check if there is an issue with the service.
|BasedOn=kafka_consumer_error_total
|Threshold=More than 3 Kafka consumer crashes in 5 minutes for topic <nowiki>{{ $labels.topic }}</nowiki>.
}}{{PEAlert
|Alert=Pod status Failed
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Failed state.
}}{{PEAlert
|Alert=Pod status Unknown
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Unknown state for 5 minutes.
}}{{PEAlert
|Alert=Pod status Pending
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Pending state for 5 minutes.
}}{{PEAlert
|Alert=Pod status NotReady
|Severity=Critical
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_ready
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in NotReady status for 5 minutes.
}}{{PEAlert
|Alert=Container restarted repeatedly
|Severity=Critical
|AlertDescription=Actions:

*Check whether a new version of the image was deployed.
*Check for issues with the Kubernetes cluster.
|BasedOn=kube_pod_container_status_restarts_total
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> was restarted 5 or more times within 15 minutes.
}}{{PEAlert
|Alert=Max replicas is not sufficient for 5 mins
|Severity=Critical
|AlertDescription=The desired number of replicas is higher than the current available replicas for the past 5 minutes.
|BasedOn=kube_statefulset_replicas, kube_statefulset_status_replicas
|Threshold=The desired number of replicas is higher than the current available replicas for the past 5 minutes.
}}{{PEAlert
|Alert=Kafka not available
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka.
*If the alarm is triggered only for pod <nowiki>{{ $labels.pod }}</nowiki>, check if there is an issue with the pod.
|BasedOn=kafka_producer_state, kafka_consumer_state
|Threshold=Kafka is not available for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 consecutive minutes.
}}{{PEAlert
|Alert=Redis not available
|Severity=Critical
|AlertDescription=Actions:

*If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis.
*If the alarm is triggered only for pod <nowiki>{{ $labels.pod }}</nowiki>, check if there is an issue with the pod.
|BasedOn=callthread_redis_state
|Threshold=Redis is not available for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 consecutive minutes.
}}{{PEAlert
|Alert=Pod CPU greater than 65%
|Severity=Warning
|AlertDescription=High CPU load for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> CPU usage exceeded 65% for 5 minutes.
}}{{PEAlert
|Alert=Pod CPU greater than 80%
|Severity=Critical
|AlertDescription=Critical CPU load for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> CPU usage exceeded 80% for 5 minutes.
}}{{PEAlert
|Alert=Pod memory greater than 65%
|Severity=Warning
|AlertDescription=High memory usage for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> memory usage exceeded 65% for 5 minutes.
}}{{PEAlert
|Alert=Pod memory greater than 80%
|Severity=Critical
|AlertDescription=Critical memory usage for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> memory usage exceeded 80% for 5 minutes.
}}{{PEAlert
|Alert=Too many Kafka pending events
|Severity=Critical
|AlertDescription=Actions:

*Ensure there are no issues with Kafka or with the CPU and network of the <nowiki>{{ $labels.container }}</nowiki> service.
|BasedOn=kafka_producer_queue_depth
|Threshold=Too many Kafka producer pending events for service <nowiki>{{ $labels.container }}</nowiki> (more than 100 in 5 minutes).
}}
}}
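
Several of the gauges listed above report a numeric code rather than a measurement. As a rough illustration only, the following Python sketch decodes those codes from text scraped from the metrics endpoint. The helper name <code>decode_states</code> and the sample text are hypothetical and not part of the service; the code-to-name mappings are copied from the metric descriptions on this page, and the parsing is deliberately minimal.

```python
# Hypothetical helper; not part of Voice Call State Service or any Genesys tooling.
# The value meanings below are copied from the metric descriptions on this page.
STATE_NAMES = {
    "callthread_redis_state": {-1: "error", 0: "disconnected", 1: "connected", 2: "ready"},
    "callthread_envoy_proxy_status": {-1: "error", 0: "disconnected", 1: "connected"},
    "callthread_health_level": {-1: "error", 0: "fail", 1: "degraded", 2: "pass"},
}


def decode_states(exposition_text):
    """Return {metric_name: state_name} for the status gauges found in the text."""
    decoded = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        name = line.split("{", 1)[0].split()[0]  # metric name without labels
        if name not in STATE_NAMES:
            continue
        # The sample value follows the label set, if there is one.
        if "}" in line:
            fields = line.rsplit("}", 1)[1].split()
        else:
            fields = line.split()[1:]
        if not fields:
            continue
        try:
            value = int(float(fields[0]))
        except (ValueError, OverflowError):
            continue  # NaN or infinite samples carry no state code
        decoded[name] = STATE_NAMES[name].get(value, "unknown (%d)" % value)
    return decoded


# Sample text invented for illustration; real output contains many more series.
SAMPLE = """\
# TYPE callthread_redis_state gauge
callthread_redis_state 2
callthread_envoy_proxy_status 1
callthread_health_level 1
callthread_call_threads 17
"""

if __name__ == "__main__":
    for metric, state in decode_states(SAMPLE).items():
        print(metric, "=", state)
```

In practice the input text would come from an HTTP GET against the endpoint shown at the top of this page, or the same values could be read from Prometheus directly.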

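The latency alert above fires when more than 5% of messages take longer than 0.5 seconds, based on <code>kafka_consumer_latency_bucket</code>. The Python sketch below shows the arithmetic behind such a check on cumulative histogram buckets. It is an assumption-laden illustration, not the actual alert expression, which this page does not publish: the function name, the bucket boundaries, and the counts are all invented, and a real rule would most likely evaluate bucket rates over a time window.

```python
# Hypothetical illustration of the "more than 5% slower than 0.5 seconds" check.
INF = float("inf")


def fraction_slower_than(buckets, threshold):
    """Fraction of observations above `threshold`.

    `buckets` maps each upper bound (the `le` label) to its cumulative count
    and must include the +Inf bucket, as Prometheus histograms do.
    """
    total = buckets[INF]
    if total == 0:
        return 0.0  # no messages observed, so nothing is slow
    if threshold not in buckets:
        raise ValueError("no bucket boundary at %r" % threshold)
    return (total - buckets[threshold]) / total


# Invented bucket counts for one topic; boundaries are assumptions.
sample_buckets = {0.1: 800, 0.5: 940, 1.0: 990, INF: 1000}

slow = fraction_slower_than(sample_buckets, 0.5)
would_fire = slow > 0.05  # threshold taken from the alert description

if __name__ == "__main__":
    print("fraction slower than 0.5 s:", slow)
    print("exceeds 5%:", would_fire)
```

With these invented counts, 60 of 1000 messages exceed 0.5 seconds, which is above the 5% threshold stated for the alert.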
VM/Current/VMPEGuide/VoiceCallStateServiceMetrics - Revision history
