ORS metrics and alerts
Find the metrics ORS exposes and the alerts defined for ORS.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
ORS | Supports both CRD and annotations | 11200 | http://<pod-ipaddress>:11200/metrics | 30 seconds |
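The endpoint listed above serves metrics in the Prometheus text exposition format. As a rough sketch (not part of the ORS product), the following Python shows how such a payload could be parsed into a dictionary. The sample payload, the `pod` label, and its value are made up for illustration. Only `orsnode_strategies` is a name that appears on this page. The parser is deliberately simplified and does not handle label values that contain `}`.

```python
import re

# Matches "name{labels} value" or "name value".
# Simplification: label values containing "}" are not supported.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {(metric_name, label_string): float_value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, labels, value = match.groups()
        samples[(name, labels or "")] = float(value)
    return samples


# Hypothetical payload, shaped like a response from
# http://<pod-ipaddress>:11200/metrics (label and value are invented).
SAMPLE_PAYLOAD = """\
# HELP orsnode_strategies The number of strategies that are running.
# TYPE orsnode_strategies gauge
orsnode_strategies{pod="ors-0"} 42
"""

parsed = parse_metrics(SAMPLE_PAYLOAD)
```

In practice the text would come from an HTTP GET against the pod endpoint. Because the table gives a 30-second update interval, polling more often than that is unlikely to yield new values.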
Metrics
You can query Prometheus directly to see all the metrics that the Voice Orchestration Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available Orchestration Service metrics that are not documented on this page.
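One way to query Prometheus directly is through its HTTP API instant-query endpoint, `/api/v1/query`. The sketch below only builds the request URL. The Prometheus base URL is a placeholder, and the regular-expression selector on `__name__` assumes that the ORS series share the `orsnode_` prefix shown on this page.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder Prometheus address; replace it with the one used in your deployment.
PROMETHEUS_URL = "http://prometheus.example:9090"

# Selects every series whose metric name starts with "orsnode_".
url = instant_query_url(PROMETHEUS_URL, '{__name__=~"orsnode_.*"}')
```

Fetching this URL with any HTTP client returns a JSON document whose `data.result` array lists the matching series.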
Metric and description | Metric details | Indicator of |
---|---|---|
orsnode_ Total number of received call events. | Unit: N/A Type: counter | Traffic |
orsnode_ The number of HA writes to Redis. | Unit: N/A Type: counter | Traffic |
orsnode_ The number of HA reads from Redis. | Unit: N/A Type: counter | Traffic |
orsnode_ The number of active interactions. | Unit: N/A Type: gauge | Traffic |
orsnode_ The total number of interactions that have been created. | Unit: N/A Type: counter | Traffic |
orsnode_ The total number of call interactions that have been cleared. | Unit: N/A Type: counter | |
orsnode_ The number of strategies that are running. | Unit: N/A Type: gauge | Traffic |
orsnode_ The total number of strategies that have been created. | Unit: N/A Type: counter | Traffic |
orsnode_ The total number of strategy load errors. | Unit: N/A Type: counter | Errors |
orsnode_ The total number of errors encountered when a strategy tried to fetch data from a Designer Application Server (DAS). | Unit: N/A Type: counter | Errors |
orsnode_ The total number of strategy configuration errors. | Unit: N/A Type: counter | Errors |
orsnode_ The total number of strategy invoke errors. | Unit: N/A Type: counter | Errors |
orsnode_ The total number of strategy treatments. | Unit: N/A Type: counter | Traffic |
orsnode_ The total number of failed strategy treatments. | Unit: N/A Type: counter | Errors |
orsnode_ The total number of times that a strategy updated user data. | Unit: N/A Type: counter | Traffic |
orsnode_ The total number of SCXML transitions. | Unit: N/A Type: counter | Traffic |
orsnode_ The total number of SCXML events. | Unit: N/A Type: counter | Traffic |
orsnode_ The total number of SCXML error.* events. | Unit: N/A Type: counter | Errors |
orsnode_ The total number of HTTP fetch requests. | Unit: N/A Type: counter | Errors |
orsnode_ The HTTP fetch time, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ The total number of HTTP fetch errors. | Unit: N/A Type: counter | Errors |
orsnode_ Status of the HTTP fetch error. | Unit: Type: histogram | Errors |
orsnode_ The Universal Routing Server (URS) rlib latency, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ The total number of URS rlib errors. | Unit: N/A Type: counter | Errors |
orsnode_ The total number of URS rlib requests. | Unit: N/A Type: counter | |
orsnode_ The total number of URS rlib events. | Unit: N/A Type: counter | |
orsnode_ The total number of URS rlib timeouts. | Unit: N/A Type: counter | |
orsnode_ Current Redis connection state. | Unit: N/A Type: gauge | |
orsnode_ The number of times that the ORS node disconnected from Redis. | Unit: N/A Type: counter | |
orsnode_ The number of SDR messages that have been sent. | Unit: N/A Type: counter | |
orsnode_ Redis queue latency, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ Routing latency, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ Redis stream latency, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ Digital stream latency, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ ORS health check. | Unit: N/A Type: gauge | |
orsnode_ Interaction health check. | Unit: N/A Type: gauge | |
orsnode_ Current Redis queue connection state. | Unit: N/A Type: gauge | |
orsnode_ Total number of interaction stream events received. | Unit: N/A Type: counter | |
orsnode_ Number of times the ORS node disconnected from the RQ Service. | Unit: N/A Type: counter | |
service_ Displays the version of Voice Orchestration Service that is currently running. In the case of this metric, the labels provide the important information. The metric value is always 1 and does not provide any information. | Unit: N/A Type: gauge | |
orsnode_ Total number of EventRouteUsed events without a ReferenceID. | Unit: N/A Type: counter | |
orsnode_ The state of the voice balancer stream. | Unit: N/A Type: gauge | |
orsnode_ Indicates when the ORS node is using a lot of memory. | Unit: N/A Type: gauge | |
orsnode_ Indicates a Tenant rlib request timeout. | Unit: N/A Type: gauge | |
orsnode_ The number of stuck interactions. | Unit: N/A Type: gauge | |
orsnode_ The total number of URS SCXMLSubmit requests. | Unit: N/A Type: counter | |
orsnode_ The total number of URS SCXMLQueueCancel requests. | Unit: N/A Type: counter | |
orsnode_ Total number of URS queue.submit.done events. | Unit: N/A Type: counter | |
orsnode_ Summarized health level of the ORS node: -1 – fail | Unit: N/A Type: gauge | |
orsnode_ Health check errors for the ORS node: 1 – has error | Unit: N/A Type: gauge | Errors |
orsnode_ The number of active sessions for each Designer application. | Unit: N/A Type: gauge | |
orsnode_ The number of failed sessions for each Designer application. | Unit: N/A Type: gauge | |
orsnode_ The total number of sessions created for each Designer application. | Unit: N/A Type: gauge | |
orsnode_ The number of scripts that failed to load in the Tenant Service configuration management environment. | Unit: N/A Type: gauge | |
orsnode_ The time it takes for the strategy to be compiled and go through its initial states. | Unit: milliseconds Type: histogram | |
orsnode_ Timestamp when the ORS node started. | Unit: N/A Type: gauge | |
orsnode_ Total number of terminal requests (like Deliver, PlaceInQueue, StopProcessing for Digital and RequestClearCall, RequestRouteCall for Voice). | Unit: N/A Type: counter | |
orsnode_ Total number of non-terminal requests to the Interaction Server (for Digital) or SIP Server (for Voice). | Unit: N/A Type: counter | |
orsnode_ Total number of errors encountered in POST requests to the SIP node. | Unit: N/A Type: counter | Errors |
orsnode_ Total number of pending TLib requests. | Unit: N/A Type: counter | |
orsnode_ The number of active REST connections with SIP Cluster Service. | Unit: N/A Type: gauge | |
orsnode_ The number of compiled applications in the cache. | Unit: N/A Type: counter | |
orsnode_ The sum of the sizes of the cached applications. | Unit: Type: gauge | |
orsnode_ The TLib REST API request latency, measured in milliseconds (ms). | Unit: milliseconds Type: histogram | Latency |
orsnode_ The compiled size of the Designer application. | Unit: Type: gauge | |
orsnode_ The number of microsteps while executing the Designer application. | Unit: N/A Type: gauge | |
orsnode_ The length of time the Designer application was running, measured in milliseconds (ms). | Unit: milliseconds Type: gauge | |
orsnode_ The date on which the Designer application was compiled. | Unit: N/A Type: gauge | |
orsnode_ The date when the Designer application was last invoked. | Unit: N/A Type: gauge | |
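Several metrics in the table above are histograms measured in milliseconds. Prometheus histograms are exposed as cumulative bucket counts, and a percentile is usually estimated from them by linear interpolation, which is the approach the PromQL `histogram_quantile()` function takes. The following sketch illustrates that idea in simplified form. The bucket bounds and counts are invented, because this page does not document the bucket layout of any ORS histogram.

```python
def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound_ms, count) pairs.

    `buckets` must be sorted by upper bound; the last bound may be float("inf").
    The lowest bucket is assumed to start at 0 ms.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations, so no quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf") or count == lower_count:
                # Cannot interpolate inside the +Inf bucket or an empty bucket.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative buckets: 50 observations <= 10 ms, 90 <= 50 ms, 100 <= 100 ms.
EXAMPLE_BUCKETS = [(10.0, 50), (50.0, 90), (100.0, 100), (float("inf"), 100)]

p75 = estimate_quantile(0.75, EXAMPLE_BUCKETS)
```

With these invented buckets, the 75th observation falls in the 10 to 50 ms bucket, so the estimate is interpolated between those two bounds. The accuracy of any such estimate depends on how finely the buckets are spaced.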
Alerts
The following alerts are defined for ORS.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
Number of running strategies is too high | Warning | Too many active sessions. | orsnode_strategies | More than 400 strategies running in 5 consecutive seconds. |
Number of running strategies is critical | Critical | Too many active sessions. | orsnode_strategies | More than 600 strategies running in 5 consecutive seconds. |
Redis disconnected for 5 minutes | Warning | | redis_state | Redis is not available for the pod {{ $labels.pod }} for 5 minutes. |
Redis disconnected for 10 minutes | Critical | | redis_state | Redis is not available for the pod {{ $labels.pod }} for 10 minutes. |
Pod status Failed | Warning | Pod {{ $labels.pod }} failed. | kube_pod_status_phase | Pod {{ $labels.pod }} is in Failed state. |
Pod in Unknown state | Warning | Pod {{ $labels.pod }} is in Unknown state. | kube_pod_status_phase | Pod {{ $labels.pod }} is in Unknown state for 5 minutes. |
Pod in Pending state | Warning | Pod {{ $labels.pod }} is in Pending state. | kube_pod_status_phase | Pod {{ $labels.pod }} is in Pending state for 5 minutes. |
Pod Not ready for 10 minutes | Critical | Pod {{ $labels.pod }} in NotReady state. | kube_pod_status_ready | Pod {{ $labels.pod }} in NotReady state for 10 minutes. |
Container restarted repeatedly | Critical | | kube_pod_container_status_restarts_total | Container {{ $labels.container }} was restarted 5 or more times within 15 minutes. |
Pod memory greater than 65% | Warning | High memory usage for pod {{ $labels.pod }}. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes. |
Pod memory greater than 80% | Critical | Critical memory usage for pod {{ $labels.pod }}. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes. |
Pod CPU greater than 65% | Warning | High CPU load for pod {{ $labels.pod }}. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes. |
Pod CPU greater than 80% | Critical | Critical CPU load for pod {{ $labels.pod }}. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes. |
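Most thresholds above combine a limit with a duration, for example memory usage above 65% for 5 minutes. The sketch below illustrates that "threshold held for a duration" logic for the two memory alerts. It is not the actual alert rule, which this page does not show. It assumes that memory usage is the ratio of `container_memory_working_set_bytes` to `kube_pod_container_resource_requests_memory_bytes`, and the sample timestamps and values are invented.

```python
def exceeded_for(samples, threshold, duration_s):
    """Return True if the latest unbroken run of values above `threshold`
    spans at least `duration_s` seconds.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    run_start = None
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp  # a new run above the threshold begins
        else:
            run_start = None  # the run is broken
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= duration_s


def memory_severity(samples):
    """Map memory-usage ratios to the severity levels used in the table."""
    if exceeded_for(samples, 0.80, 300):
        return "Critical"
    if exceeded_for(samples, 0.65, 300):
        return "Warning"
    return None


# Invented samples: usage ratio 0.70, one sample every 30 seconds for 5 minutes.
example_samples = [(t, 0.70) for t in range(0, 301, 30)]
severity = memory_severity(example_samples)
```

Checking the higher threshold first ensures that a pod above 80% is reported once as Critical, not as both Warning and Critical.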