ORS metrics and alerts

This topic is part of the manual Voice Microservices Private Edition Guide for version Current of Voice Microservices.

Metrics[edit source]

You can query Prometheus directly to see all the metrics that the Voice Orchestration Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintain other currently available Orchestration Service metrics not documented on this page.

Metric and description	Metric details	Indicator of
orsnode_callevents Total number of received call events.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_ha_writes The number of HA writes to Redis.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_ha_reads The number of HA reads from Redis.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_interactions The number of active interactions.	Unit: N/A Type: gauge Label: Sample value:	Traffic
orsnode_total_interactions The total number of interactions that have been created.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_cleared_interactions The total number of call interactions that have been cleared.	Unit: N/A Type: counter Label: Sample value:
orsnode_strategies The number of strategies that are running.	Unit: N/A Type: gauge Label: Sample value:	Traffic
orsnode_total_strategies The total number of strategies that have been created.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_load_errors The total number of strategy load errors.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_fetch_errors The total number of errors encountered when a strategy tried to fetch data from a Designer Application Server (DAS).	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_config_errors The total number of strategy configuration errors.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_invoke_errors The total number of strategy invoke errors.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_treatments The total number of strategy treatments.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_failed_treatments The total number of failed strategy treatments.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_userdata_updates The total number of times that a strategy updated user data.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_scxml_transitions The total number of SCXML transitions.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_scxml_events The total number of SCXML events.	Unit: N/A Type: counter Label: Sample value:	Traffic
orsnode_scxml_error_events The total number of SCXML `error.*` events.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_http_fetch_requests The total number of HTTP `fetch` requests.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_http_fetch_duration The HTTP fetch time, measured in milliseconds (ms).	Unit: milliseconds Type: histogram Label: Sample value:	Latency
orsnode_http_fetch_errors The total number of HTTP `fetch` errors.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_http_fetch_error_status Status of the HTTP `fetch` error.	Unit: Type: histogram Label: Sample value:	Errors
orsnode_urs_rlib_latency_msec The Universal Routing Server (URS) `rlib` latency, measured in milliseconds (ms).	Unit: milliseconds Type: histogram Label: Sample value:	Latency
orsnode_urs_rlib_errors The total number of URS `rlib` errors.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_urs_rlib_requests The total number of URS `rlib` requests.	Unit: N/A Type: counter Label: Sample value:
orsnode_urs_rlib_events The total number of URS `rlib` events.	Unit: N/A Type: counter Label: Sample value:
orsnode_urs_rlib_timeouts The total number of URS `rlib` timeouts.	Unit: N/A Type: counter Label: Sample value:
orsnode_redis_state Current Redis connection state.	Unit: N/A Type: gauge Label: redis_cluster_name Sample value:
orsnode_redis_disconnect The number of times that the ORS node disconnected from Redis.	Unit: N/A Type: counter Label: Sample value:
orsnode_sdr_messages_sent The number of SDR messages that have been sent.	Unit: N/A Type: counter Label: Sample value:
orsnode_rq_latency_msec Redis queue latency, measured in milliseconds (ms).	Unit: milliseconds Type: histogram Label: le, service Sample value:	Latency
orsnode_routing_latency_msec Routing latency, measured in milliseconds (ms).	Unit: milliseconds Type: histogram Label: Sample value:	Latency
orsnode_rstream_latency_msec Redis stream latency, measured in (ms).	Unit: milliseconds Type: histogram Label: le, node Sample value:	Latency
orsnode_digital_latency_msec Digital stream latency, measured in milliseconds (ms).	Unit: milliseconds Type: histogram Label: Sample value:	Latency
orsnode_sip_health_check ORS health check.	Unit: N/A Type: gauge Label: node Sample value:
orsnode_ixn_health_check Interaction health check.	Unit: N/A Type: gauge Label: Sample value:
orsnode_rq_state Current Redis queue connection state.	Unit: N/A Type: gauge Label: Sample value:
orsnode_ixn_events Total number of interaction stream events received.	Unit: N/A Type: counter Label: Sample value:
orsnode_rq_disconnect Number of times the ORS node disconnected from the RQ Service.	Unit: N/A Type: counter Label: Sample value:
service_version_info Displays the version of Voice Orchestration Service that is currently running. In the case of this metric, the labels provide the important information. The metric value is always 1 and does not provide any information.	Unit: N/A Type: gauge Label: version Sample value: service_version_info{version="100.0.1000006"} 1
orsnode_route_redirected Total number of EventRouteUsed events without a ReferenceID.	Unit: N/A Type: counter Label: Sample value:
orsnode_balancer_stream_state The state of the voice balancer stream.	Unit: N/A Type: gauge Label: balancer_stream_type Sample value:
orsnode_high_memory Indicates when the ORS node is using a lot of memory.	Unit: N/A Type: gauge Label: Sample value:
orsnode_urs_rlib_state Indicates a Tenant `rlib` request timeout.	Unit: N/A Type: gauge Label: Sample value:
orsnode_stuck_interactions The number of stuck interactions.	Unit: N/A Type: gauge Label: Sample value:
orsnode_urs_scxml_submit_requests The total number of URS `SCXMLSubmit` requests.	Unit: N/A Type: counter Label: Sample value:
orsnode_urs_scxml_cancel_requests The total number of URS `SCXMLQueueCancel` requests.	Unit: N/A Type: counter Label: Sample value:
orsnode_urs_queue_submit_done_events Total number of URS `queue.submit.done` events.	Unit: N/A Type: counter Label: Sample value:
orsnode_health_level Summarized health level of the ORS node: -1 – fail 0 – starting 1 – degraded 2 – pass	Unit: N/A Type: gauge Label: Sample value:
orsnode_health_check_error Health check errors for the ORS node: 1 – has error 0 – no error	Unit: N/A Type: gauge Label: reason Sample value:	Errors
orsnode_running_applications The number of active sessions for each Designer application.	Unit: N/A Type: gauge Label: Sample value:
orsnode_failed_applications The number of failed sessions for each Designer application.	Unit: N/A Type: gauge Label: Sample value:
orsnode_total_applications The total number of sessions created for each Designer application.	Unit: N/A Type: gauge Label: Sample value:
orsnode_failed_scripts The number of scripts that failed to load in the Tenant Service configuration management environment.	Unit: N/A Type: gauge Label: Sample value:
orsnode_session_load_time_msec The time it takes for the strategy to be compiled and go through its initial states.	Unit: milliseconds Type: histogram Label: Sample value:
orsnode_service_started Timestamp when the ORS node started.	Unit: N/A Type: gauge Label: started Sample value:
orsnode_total_terminal_requests Total number of terminal requests (like Deliver, PlaceInQueue, StopProcessing for Digital and RequestClearCall, RequestRouteCall for Voice).	Unit: N/A Type: counter Label: Sample value:
orsnode_total_non_terminal_requests Total number of non-terminal requests to the Interaction Server (for Digital) or SIP Server (for Voice).	Unit: N/A Type: counter Label: Sample value:
orsnode_sip_post_errors Total number of errors encountered in POST requests to the SIP node.	Unit: N/A Type: counter Label: Sample value:	Errors
orsnode_pending_tlib_requests Total number of pending TLib requests.	Unit: N/A Type: counter Label: Sample value:
orsnode_sips_rest_connections The number of active REST connections with SIP Cluster Service.	Unit: N/A Type: gauge Label: Sample value:
orsnode_number_compiled_applications The number of compiled applications in the cache.	Unit: N/A Type: counter Label: Sample value:
orsnode_cached_applications_size The sum of the sizes of the cached applications.	Unit: Type: gauge Label: Sample value:
orsnode_tlib_latency_msec The TLib Rest API request latency, measured in (ms).	Unit: milliseconds Type: histogram Label: le Sample value:	Latency
orsnode_application_size The compiled size of the Designer application.	Unit: Type: gauge Label: Sample value:
orsnode_application_microstep_count The number of microsteps while executing the Designer application.	Unit: N/A Type: gauge Label: Sample value:
orsnode_application_run_time_msec The length of time the Designer application was running, measured in milliseconds (ms).	Unit: milliseconds Type: gauge Label: Sample value:
orsnode_application_compiled_date The date on which the Designer application was compiled.	Unit: N/A Type: gauge Label: Sample value:
orsnode_application_last_invoked_date The date when the Designer application was last invoked.	Unit: N/A Type: gauge Label: Sample value:

Alerts[edit source]

The following alerts are defined for ORS.

Alert	Severity	Description	Based on	Threshold
Number of running strategies is too high	Warning	Too many active sessions. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check the number of voice, digital, and callback calls in the system.	orsnode_strategies	More than 400 strategies running in 5 consecutive seconds.
Number of running strategies is critical	Critical	Too many active sessions. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check the number of voice, digital, and callback calls in the system.	orsnode_strategies	More than 600 strategies running in 5 consecutive seconds.
Redis disconnected for 5 minutes	Warning	Actions: If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. If alarm is triggered only for pod {{ $labels.pod }}, check if there is an issue with the pod.	redis_state	Redis is not available for the pod {{ $labels.pod }} for 5 minutes.
Redis disconnected for 10 minutes	Critical	Actions: If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. If the alarm is triggered only for pod {{ $labels.pod }}, check if there is an issue with the pod.	redis_state	Redis is not available for the pod {{ $labels.pod }} for 10 minutes.
Pod status Failed	Warning	Pod {{ $labels.pod }} failed. Actions: One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason.	kube_pod_status_phase	Pod {{ $labels.pod }} is in Failed state.
Pod in Unknown state	Warning	Pod {{ $labels.pod }} is in Unknown state. Actions: If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. If the alarm is triggered only for pod {{ $labels.pod }}, check whether the image is correct and if the container is starting up.	kube_pod_status_phase	Pod {{ $labels.pod }} is in Unknown state for 5 minutes.
Pod in Pending state	Warning	Pod {{ $labels.pod }} is in Pending state. Actions: If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. If the alarm is triggered only for the pod {{ $labels.pod }}, check the health of the pod.	kube_pod_status_phase	Pod {{ $labels.pod }} is in Pending state for 5 minutes.
Pod Not ready for 10 minutes	Critical	Pod {{ $labels.pod }} in NotReady state. Actions: If this alarm is triggered, check whether the CPU is available for the pods. Check whether the port of the pod is running and serving the request.	kube_pod_status_ready	Pod {{ $labels.pod }} in NotReady state for 10 minutes.
Container restored repeatedly	Critical	Actions: One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason.	kube_pod_container_status_restarts_total	Container {{ $labels.container }} was restarted 5 or more times within 15 minutes.
Pod memory greater than 65%	Warning	High memory usage for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Collect the service logs; raise an investigation ticket.	container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes	Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes.
Pod memory greater than 80%	Critical	Critical memory usage for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Restart the service. Collect the service logs; raise an investigation ticket.	container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes	Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes.
Pod CPU greater than 65%	Warning	High CPU load for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Collect the service logs; raise an investigation ticket.	container_cpu_usage_seconds_total, container_spec_cpu_period	Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes.
Pod CPU greater than 80%	Critical	Critical CPU load for pod {{ $labels.pod }}. Actions: Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. Check Grafana for abnormal load. Restart the service. Collect the service logs; raise an investigation ticket.	container_cpu_usage_seconds_total, container_spec_cpu_period	Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes.

Voice Microservices Private Edition Guide

Overview

Configure and deploy

Configure and deploy Voicemail

Observability

Functionality

ORS metrics and alerts

Contents

Metrics[edit source]

Alerts[edit source]