Voice RQ Service metrics and alerts

From Genesys Documentation
This topic is part of the manual Voice Microservices Private Edition Guide for the Current version of Voice Microservices.


Find the metrics Voice RQ Service exposes and the alerts defined for Voice RQ Service.

Service: Voice RQ Service
CRD or annotations? Supports both CRD and annotations
Port: 12000
Endpoint/Selector: http://<pod-ipaddress>:12000/metrics
Metrics update interval: 30 seconds
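If you scrape with annotations rather than the CRD, a pod annotation block along the following lines points Prometheus at the endpoint above. This is an illustrative sketch: the annotation keys assume the common prometheus.io convention used by many scrape configurations, not values confirmed for this chart.

```yaml
# Illustrative only: assumes your Prometheus scrape config honors
# the common prometheus.io annotation convention.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "12000"
    prometheus.io/path: "/metrics"
```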

See details about:

  • Metrics
  • Alerts

Metrics

You can query Prometheus directly to see all the metrics that the Voice RQ Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available Voice RQ Service metrics that are not documented on this page.

rqnode_clients
Number of clients connected.
Unit: N/A
Type: gauge
Indicator of: Traffic

rqnode_streams
Number of active streams.
Unit: N/A
Type: gauge
Indicator of: Traffic

rqnode_xreads
Number of XREAD requests received.
Unit: N/A
Type: counter
Indicator of: Traffic

rqnode_xadds
Number of XADD requests received.
Unit: N/A
Type: counter
Indicator of: Traffic

rqnode_redis_state
Current Redis connection state.
Unit: N/A
Type: gauge
Indicator of: Errors

rqnode_redis_disconnects
Number of Redis disconnects that occurred for the RQ node.
Type: counter
Indicator of: Errors

rqnode_consul_leader_error
Number of errors received from Consul during the leadership process.
Unit: N/A
Type: counter
Indicator of: Errors

rqnode_active_master
Indicates whether the service master role is active.
Unit: N/A
Type: gauge
Indicator of: Saturation

rqnode_active_backup
Indicates whether the service backup role is active.
Unit: N/A
Type: gauge
Indicator of: Saturation

rqnode_read_latency
RQ latency: the time between when an event is added to Redis and when it is read via XREAD.
Type: histogram
Label: le, healthcheck
Indicator of: Latency

rqnode_add_latency
RQ latency: the time between when a message is received and when it is added to the list.
Type: histogram
Label: le, healthcheck
Indicator of: Latency

rqnode_redis_latency
Latency caused by Redis reads and writes.
Type: histogram
Label: le
Indicator of: Latency
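As an illustration of querying these metrics directly in Prometheus, the PromQL sketches below use the metric names documented above; the rate windows and the `_bucket` suffix (the standard Prometheus histogram convention) are assumptions, not values prescribed by the service.

```promql
# Current number of active streams across all RQ pods
sum(rqnode_streams)

# Per-pod XADD request rate over the last 5 minutes
rate(rqnode_xadds[5m])

# Approximate 95th-percentile read latency from the histogram buckets
histogram_quantile(0.95, sum(rate(rqnode_read_latency_bucket[5m])) by (le))
```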


Alerts

The following alerts are defined for Voice RQ Service.

Alert: Number of Redis streams is too high
Severity: Warning
Description: Too many active sessions.
Actions:
  • Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached.
  • Check the number of voice, digital, and callback calls in the system.
Based on: rqnode_streams
Threshold: More than 10000 active streams running.


Alert: Redis disconnected for 5 minutes
Severity: Warning
Description: Redis is not available for the pod {{ $labels.pod }}.
Actions:
  • If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis.
  • If the alarm is triggered only for the pod {{ $labels.pod }}, check whether there is an issue with the pod.
Based on: redis_state
Threshold: Redis is not available for the pod {{ $labels.pod }} for 5 minutes.


Alert: Redis disconnected for 10 minutes
Severity: Critical
Description: Redis is not available for the pod {{ $labels.pod }}.
Actions:
  • If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis.
  • If the alarm is triggered only for the pod {{ $labels.pod }}, check whether there is an issue with the pod.
Based on: redis_state
Threshold: Redis is not available for the pod {{ $labels.pod }} for 10 minutes.


Alert: Pod failed
Severity: Warning
Description: Pod {{ $labels.pod }} failed.
Actions:
  • One of the containers in the pod has entered the Failed state. Check the Kibana logs for the reason.
Based on: kube_pod_status_phase
Threshold: Pod {{ $labels.pod }} is in the Failed state.


Alert: Pod Unknown state
Severity: Warning
Description: Pod {{ $labels.pod }} is in the Unknown state.
Actions:
  • If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster.
  • If the alarm is triggered only for the pod {{ $labels.pod }}, check whether the image is correct and whether the container is starting up.
Based on: kube_pod_status_phase
Threshold: Pod {{ $labels.pod }} is in the Unknown state for 5 minutes.


Alert: Pod Pending state
Severity: Warning
Description: Pod {{ $labels.pod }} is in the Pending state.
Actions:
  • If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster.
  • If the alarm is triggered only for the pod {{ $labels.pod }}, check the health of the pod.
Based on: kube_pod_status_phase
Threshold: Pod {{ $labels.pod }} is in the Pending state for 5 minutes.


Alert: Pod not ready for 10 minutes
Severity: Critical
Description: Pod {{ $labels.pod }} is in the NotReady state.
Actions:
  • If this alarm is triggered, check whether CPU is available for the pods.
  • Check whether the pod's port is up and serving requests.
Based on: kube_pod_status_ready
Threshold: Pod {{ $labels.pod }} is in the NotReady state for 10 minutes.


Alert: Container restarted repeatedly
Severity: Critical
Description: Container {{ $labels.container }} was repeatedly restarted.
Actions:
  • One of the containers in the pod has entered the Failed state. Check the Kibana logs for the reason.
Based on: kube_pod_container_status_restarts_total
Threshold: Container {{ $labels.container }} was restarted 5 or more times within 15 minutes.


Alert: Pod memory greater than 65%
Severity: Warning
Description: High memory usage for pod {{ $labels.pod }}.
Actions:
  • Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Collect the service logs and raise an investigation ticket.
Based on: container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
Threshold: Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes.


Alert: Pod memory greater than 80%
Severity: Critical
Description: Critical memory usage for pod {{ $labels.pod }}.
Actions:
  • Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Restart the service.
  • Collect the service logs and raise an investigation ticket.
Based on: container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
Threshold: Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes.


Alert: Pod CPU greater than 65%
Severity: Warning
Description: High CPU load for pod {{ $labels.pod }}.
Actions:
  • Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Collect the service logs and raise an investigation ticket.
Based on: container_cpu_usage_seconds_total, container_spec_cpu_period
Threshold: Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes.


Alert: Pod CPU greater than 80%
Severity: Critical
Description: Critical CPU load for pod {{ $labels.pod }}.
Actions:
  • Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Restart the service.
  • Collect the service logs and raise an investigation ticket.
Based on: container_cpu_usage_seconds_total, container_spec_cpu_period
Threshold: Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes.
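As a sketch of how the first alert in this table might be expressed as a Prometheus alerting rule: the group name, alert name, `for` duration, and annotation text below are illustrative assumptions, not the rule shipped with Voice RQ Service; only the expression and severity come from the table above.

```yaml
# Illustrative sketch only; not the rule shipped with Voice RQ Service.
groups:
  - name: voice-rq-service        # hypothetical group name
    rules:
      - alert: RQStreamsTooHigh   # hypothetical alert name
        expr: rqnode_streams > 10000
        for: 5m                   # assumed evaluation duration
        labels:
          severity: warning
        annotations:
          summary: "Too many active streams ({{ $value }}) on {{ $labels.pod }}"
```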