Dial Plan Service metrics and alerts

This topic is part of the Voice Microservices Private Edition Guide for the current version of Voice Microservices.

Metrics

You can query Prometheus directly to see all the metrics that the Voice Dial Plan Service exposes. The metrics listed below are likely to be the most useful. Genesys does not commit to maintaining other currently available Dial Plan Service metrics that are not documented on this page.
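As an illustration of querying Prometheus directly, the following minimal Python sketch builds an instant-query URL for the standard Prometheus HTTP API endpoint (/api/v1/query) and extracts samples from the JSON response. The base URL, the pod label value, and the helper names build_instant_query_url and extract_samples are assumptions made for this example; they are not part of the Dial Plan Service.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_samples(payload: dict) -> list:
    """Return (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Each result item carries its labels and a [timestamp, "value"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


# Hypothetical Prometheus address; replace it with the one used in your cluster.
url = build_instant_query_url("http://prometheus.example:9090", "dialplan_health_level")

# A response shaped like the Prometheus HTTP API output (values are made up).
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"pod": "dialplan-0"}, "value": [1700000000, "2"]}],
    },
}
samples = extract_samples(sample_payload)
```

The URL can be fetched with any HTTP client. The sketch leaves the network call out so that it stays independent of how authentication is set up in a particular cluster.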

Metric and description	Metric details	Indicator of
dialplan_health_level: Aggregated health level of the dialplan node for dependent services such as Redis and the Envoy sidecar connection. Values: -1 (fail), 0 (starting), 1 (degraded), 2 (pass).	Unit: N/A Type: gauge Label: Sample value: 2	Health
dialplan_redis_state: Current Redis connection state. Values: 0 (disconnected), 1 (connecting), 2 (connected).	Unit: N/A Type: gauge Label: redis_cluster_name Sample value: 2	Health
dialplan_total_request: Number of dialplan requests received.	Unit: N/A Type: counter Label: tenant, pod, operation_type Sample value:	Traffic
dialplan_failure_response: Number of dialplan failure responses.	Unit: N/A Type: counter Label: tenant, pod, operation_type, status, reason Sample value:	Traffic
dialplan_response_time: Dialplan request processing duration histogram, in milliseconds.	Unit: milliseconds Type: histogram Label: Sample value:	Latency
dialplan_redis_cache_latency_msec: Redis fetch latency, in milliseconds.	Unit: milliseconds Type: histogram Label: tenant Sample value:	Latency
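The two health gauges in the table report numeric codes. The following Python sketch shows one way to turn those codes into readable states when processing query results. The numeric values come from the table above; the class and function names are hypothetical and chosen only for this example.

```python
from enum import IntEnum


class HealthLevel(IntEnum):
    """Values documented for dialplan_health_level."""
    FAIL = -1
    STARTING = 0
    DEGRADED = 1
    PASS = 2


class RedisState(IntEnum):
    """Values documented for dialplan_redis_state."""
    DISCONNECTED = 0
    CONNECTING = 1
    CONNECTED = 2


def describe_health(sample) -> str:
    """Map a dialplan_health_level sample (number or numeric string) to a name."""
    # Raises ValueError if the sample is not one of the documented codes.
    return HealthLevel(int(float(sample))).name.lower()


def describe_redis(sample) -> str:
    """Map a dialplan_redis_state sample (number or numeric string) to a name."""
    return RedisState(int(float(sample))).name.lower()


# Prometheus returns sample values as strings, so both forms are accepted.
health = describe_health("2")
redis = describe_redis(0)
```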

Alerts

The following alerts are defined for Dial Plan Service.

Alert	Severity	Description	Based on	Threshold
DialPlan processing time > 0.5 seconds	Warning	Actions: If the alarm is generated for all dialplan pods, then Redis or network delay might be the most probable cause. If the alarm is generated in a single dialplan pod, then it might be due to Envoy or a network issue.	dialplan_response_time	When the latency for 95% of the dial plan messages is more than 0.5 seconds for a duration of 5 minutes, then this warning alarm is raised for the {{ $labels.container }}.
DialPlan processing time > 2 seconds	Critical	Actions: If the alarm is generated for all dialplan pods, then Redis or network delay might be the most probable cause. If the alarm is generated in a single dialplan pod, then it might be due to Envoy or a network issue.	dialplan_response_time	When the latency for 95% of the dial plan messages is more than 2 seconds for a duration of 5 minutes, then this critical alarm is raised for the {{ $labels.container }}.
Aggregated service health failing for 5 minutes	Critical	Actions: Check the dialplan dashboard for Aggregated Service Health errors and, in case of a Redis error, first check for any issues/crashes in the pod and then restart Redis. In the case of an Envoy error, the dialplan container will be restarted by the liveness probe. If the issue still exists after that, restart the pod.	dialplan_health_level	Dependent services or the Envoy sidecar is not available for 5 minutes in the pod {{ $labels.pod }}.
Redis disconnected for 5 minutes	Warning	Actions: If the alarm is triggered for multiple services, make sure there are no issues with Redis and then restart Redis. If the alarm is triggered only for the pod {{ $labels.pod }}, check to see if there is an issue with the pod.	dialplan_redis_state	Redis is not available for the pod {{ $labels.pod }} for 5 minutes.
Redis disconnected for 10 minutes	Critical	Actions: If the alarm is triggered for multiple services, make sure there are no issues with Redis and then restart Redis. If the alarm is triggered only for the pod {{ $labels.pod }}, check to see if there is an issue with the pod.	dialplan_redis_state	Redis is not available for the pod {{ $labels.pod }} for 10 minutes.
Pod Failed	Warning	Actions: One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason.	kube_pod_status_phase	Pod {{ $labels.pod }} failed.
Pod Unknown state	Warning	Actions: If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. If the alarm is triggered only for the pod {{ $labels.pod }}, check whether the image is correct and if the container is starting up.	kube_pod_status_phase	Pod {{ $labels.pod }} is in Unknown state for 5 minutes.
Pod Pending state	Warning	Actions: If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. If the alarm is triggered only for the pod {{ $labels.pod }}, check the health of the pod.	kube_pod_status_phase	Pod {{ $labels.pod }} is in the Pending state for 5 minutes.
Pod Not ready for 10 minutes	Critical	Actions: If this alarm is triggered, check whether the CPU is available for the pods. Check whether the port of the pod is running and serving the request.	kube_pod_status_ready	Pod {{ $labels.pod }} is in the NotReady state for 10 minutes.
Pod memory greater than 65%	Warning	High memory usage for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Collect the service logs; raise an investigation ticket.	container_memory_working_set_bytes, kube_pod_container_resource_limits	Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes.
Pod memory greater than 80%	Critical	Critical memory usage for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Restart the service. Collect the service logs; raise an investigation ticket.	container_memory_working_set_bytes, kube_pod_container_resource_limits	Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes.
Pod CPU greater than 65%	Warning	High CPU load for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Collect the service logs; raise an investigation ticket.	container_cpu_usage_seconds_total, kube_pod_container_resource_limits	Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes.
Pod CPU greater than 80%	Critical	Critical CPU load for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Restart the service. Collect the service logs; raise an investigation ticket.	container_cpu_usage_seconds_total, kube_pod_container_resource_limits	Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes.
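To make the thresholds in the table concrete, the following Python sketch estimates the 95th percentile from cumulative histogram buckets by linear interpolation, in the same spirit as the PromQL histogram_quantile function, and then compares a value against a warning and a critical threshold. It is a sketch under stated assumptions and not the alert rule that ships with the service: the bucket boundaries and counts are invented, the function names are hypothetical, and the "for 5 minutes" duration condition is not modeled.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound, and the last bound must be math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def severity(value: float, warning: float, critical: float) -> str:
    """Classify a value against a warning and a critical threshold."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


# Invented dialplan_response_time buckets: (upper bound in ms, cumulative count).
buckets = [(100, 50), (500, 90), (1000, 98), (math.inf, 100)]
p95_ms = estimate_quantile(0.95, buckets)

# Latency thresholds from the table: 0.5 seconds and 2 seconds, in milliseconds.
latency_state = severity(p95_ms, warning=500, critical=2000)

# The same helper covers the 65% and 80% memory and CPU thresholds.
memory_state = severity(82.0, warning=65, critical=80)
```

With the invented buckets above, the 95th percentile falls between 500 ms and 1000 ms, so the sketch reports a warning rather than a critical state.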
