<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://all.docs.genesys.com/index.php?action=history&amp;feed=atom&amp;title=VM%2FCurrent%2FVMPEGuide%2FVoiceRQServiceMetrics</id>
	<title>VM/Current/VMPEGuide/VoiceRQServiceMetrics - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://all.docs.genesys.com/index.php?action=history&amp;feed=atom&amp;title=VM%2FCurrent%2FVMPEGuide%2FVoiceRQServiceMetrics"/>
	<link rel="alternate" type="text/html" href="https://all.docs.genesys.com/index.php?title=VM/Current/VMPEGuide/VoiceRQServiceMetrics&amp;action=history"/>
	<updated>2026-04-14T23:47:15Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://all.docs.genesys.com/index.php?title=VM/Current/VMPEGuide/VoiceRQServiceMetrics&amp;diff=116228&amp;oldid=prev</id>
		<title>Corinneh: Published</title>
		<link rel="alternate" type="text/html" href="https://all.docs.genesys.com/index.php?title=VM/Current/VMPEGuide/VoiceRQServiceMetrics&amp;diff=116228&amp;oldid=prev"/>
		<updated>2022-02-23T20:56:30Z</updated>

		<summary type="html">&lt;p&gt;Published&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{ArticlePEServiceMetrics&lt;br /&gt;
|IncludedServiceId=25aca843-1ef0-44f4-9d67-b2a215dcf082&lt;br /&gt;
|CRD=Supports both CRD and annotations&lt;br /&gt;
|Port=12000&lt;br /&gt;
|Endpoint=http://&amp;lt;pod-ipaddress&amp;gt;:12000/metrics&lt;br /&gt;
|MetricsUpdateInterval=30 seconds&lt;br /&gt;
|MetricsDefined=Yes&lt;br /&gt;
|MetricsIntro=You can query Prometheus directly to see all the metrics that the Voice RQ Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available Voice RQ Service metrics that are not documented on this page.&lt;br /&gt;
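For a quick spot check outside Prometheus, you can also scrape the endpoint directly. The following is a minimal sketch in Python, assuming the pod IP address is reachable from where you run it and that the requests library is installed; the placeholder address is hypothetical.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch: fetch the raw Prometheus exposition text from a Voice RQ Service pod.&lt;br /&gt;
# POD_IP is a hypothetical placeholder; substitute the actual pod IP address.&lt;br /&gt;
import requests&lt;br /&gt;
&lt;br /&gt;
POD_IP = "10.0.0.1"  # hypothetical example value&lt;br /&gt;
resp = requests.get(f"http://{POD_IP}:12000/metrics", timeout=5)&lt;br /&gt;
resp.raise_for_status()&lt;br /&gt;
# Print only the rqnode_ metric lines documented on this page.&lt;br /&gt;
for line in resp.text.splitlines():&lt;br /&gt;
    if line.startswith("rqnode_"):&lt;br /&gt;
        print(line)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;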
|PEMetric={{PEMetric&lt;br /&gt;
|Metric=rqnode_clients&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of clients connected.&lt;br /&gt;
|UsedFor=Traffic&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_streams&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of active streams present.&lt;br /&gt;
|UsedFor=Traffic&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_xreads&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of XREAD requests received.&lt;br /&gt;
|UsedFor=Traffic&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_xadds&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of XADD requests received.&lt;br /&gt;
|UsedFor=Traffic&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_redis_state&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Current Redis connection state.&lt;br /&gt;
|UsedFor=Errors&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_redis_disconnects&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of Redis disconnects that occurred for the RQ node.&lt;br /&gt;
|UsedFor=Errors&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_consul_leader_error&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of errors received from Consul during the leadership process.&lt;br /&gt;
|UsedFor=Errors&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_active_master&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Indicates whether the service master role is active.&lt;br /&gt;
|UsedFor=Saturation&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_active_backup&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Indicates whether the service backup role is active.&lt;br /&gt;
|UsedFor=Saturation&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_read_latency&lt;br /&gt;
|Type=histogram&lt;br /&gt;
|Label=le, healthcheck&lt;br /&gt;
|MetricDescription=RQ latency; that is, the time between when an event is added to Redis and when it's read via XREAD.&lt;br /&gt;
|UsedFor=Latency&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_add_latency&lt;br /&gt;
|Type=histogram&lt;br /&gt;
|Label=le, healthcheck&lt;br /&gt;
|MetricDescription=RQ latency; that is, the time between when a message is received and when it's added to the list.&lt;br /&gt;
|UsedFor=Latency&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=rqnode_redis_latency&lt;br /&gt;
|Type=histogram&lt;br /&gt;
|Label=le&lt;br /&gt;
|MetricDescription=Latency caused by Redis read/write.&lt;br /&gt;
|UsedFor=Latency&lt;br /&gt;
}}&lt;br /&gt;
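Because the latency metrics above are exposed as histograms, a common way to read them is as a quantile computed from their per-bucket series. The following is a minimal Python sketch that asks the Prometheus HTTP API for an approximate 95th-percentile read latency; the Prometheus server URL is a hypothetical placeholder, and the sketch assumes the histogram follows the standard Prometheus _bucket naming convention.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch: query Prometheus for the p95 of rqnode_read_latency over 5 minutes.&lt;br /&gt;
# PROM_URL is a hypothetical placeholder; point it at your Prometheus server.&lt;br /&gt;
import requests&lt;br /&gt;
&lt;br /&gt;
PROM_URL = "http://prometheus.example:9090"  # hypothetical example value&lt;br /&gt;
query = "histogram_quantile(0.95, sum(rate(rqnode_read_latency_bucket[5m])) by (le))"&lt;br /&gt;
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)&lt;br /&gt;
resp.raise_for_status()&lt;br /&gt;
for result in resp.json()["data"]["result"]:&lt;br /&gt;
    print(result["metric"], result["value"])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;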
|AlertsDefined=Yes&lt;br /&gt;
|PEAlert={{PEAlert&lt;br /&gt;
|Alert=Number of Redis streams is too high&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Too many active sessions.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check the number of voice, digital, and callback calls in the system.&lt;br /&gt;
|BasedOn=rqnode_streams&lt;br /&gt;
|Threshold=More than 10000 active streams are running.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Redis disconnected for 5 minutes&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Redis is not available for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis.&lt;br /&gt;
*If the alarm is triggered only for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check to see if there is any issue with the pod.&lt;br /&gt;
|BasedOn=rqnode_redis_state&lt;br /&gt;
|Threshold=Redis is not available for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Redis disconnected for 10 minutes&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Redis is not available for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis.&lt;br /&gt;
*If the alarm is triggered only for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check to see if there is any issue with the pod.&lt;br /&gt;
|BasedOn=rqnode_redis_state&lt;br /&gt;
|Threshold=Redis is not available for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; for 10 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod failed&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; failed.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason.&lt;br /&gt;
|BasedOn=kube_pod_status_phase&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the Failed state.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod Unknown state&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the Unknown state.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster.&lt;br /&gt;
*If the alarm is triggered only for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check whether the image is correct and whether the container is starting up.&lt;br /&gt;
|BasedOn=kube_pod_status_phase&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the Unknown state for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod Pending state&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the Pending state.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster.&lt;br /&gt;
*If the alarm is triggered only for the pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check the health of the pod.&lt;br /&gt;
|BasedOn=kube_pod_status_phase&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the Pending state for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod not ready for 10 minutes&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the NotReady state.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If this alarm is triggered, check whether sufficient CPU is available for the pods.&lt;br /&gt;
*Check whether the pod's port is open and serving requests.&lt;br /&gt;
|BasedOn=kube_pod_status_ready&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the NotReady state for 10 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Container restarted repeatedly&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; was repeatedly restarted.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason.&lt;br /&gt;
|BasedOn=kube_pod_container_status_restarts_total&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; was restarted 5 or more times within 15 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod memory greater than 65%&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=High memory usage for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Collect the service logs; raise an investigation ticket.&lt;br /&gt;
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; memory usage exceeded 65% for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod memory greater than 80%&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Critical memory usage for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Restart the service.&lt;br /&gt;
*Collect the service logs; raise an investigation ticket.&lt;br /&gt;
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; memory usage exceeded 80% for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod CPU greater than 65%&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=High CPU load for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Collect the service logs; raise an investigation ticket.&lt;br /&gt;
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; CPU usage exceeded 65% for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod CPU greater than 80%&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Critical CPU load for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Restart the service.&lt;br /&gt;
*Collect the service logs; raise an investigation ticket.&lt;br /&gt;
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; CPU usage exceeded 80% for 5 minutes.&lt;br /&gt;
}}&lt;br /&gt;
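As an illustration of how the resource alerts above combine their source metrics, the following hedged Python sketch evaluates a memory-usage ratio against the 65% warning threshold through the Prometheus HTTP API. This is a reconstruction for illustration only; the alert rules actually shipped with the service may use different label matching or aggregation, and the Prometheus server URL is a hypothetical placeholder.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hedged sketch: approximate the "Pod memory greater than 65%" check as a PromQL ratio.&lt;br /&gt;
# The shipped alert rules may differ; joining these two metrics can require explicit&lt;br /&gt;
# label matching (e.g. on(pod, container)) depending on your monitoring setup.&lt;br /&gt;
import requests&lt;br /&gt;
&lt;br /&gt;
PROM_URL = "http://prometheus.example:9090"  # hypothetical example value&lt;br /&gt;
query = (&lt;br /&gt;
    "container_memory_working_set_bytes"&lt;br /&gt;
    " / kube_pod_container_resource_requests_memory_bytes"&lt;br /&gt;
    " &amp;gt; 0.65"&lt;br /&gt;
)&lt;br /&gt;
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)&lt;br /&gt;
resp.raise_for_status()&lt;br /&gt;
# Any returned series are containers currently above the warning threshold.&lt;br /&gt;
for result in resp.json()["data"]["result"]:&lt;br /&gt;
    print(result["metric"].get("pod"), result["value"][1])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;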
}}&lt;/div&gt;</summary>
		<author><name>Corinneh</name></author>
		
	</entry>
</feed>