<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://all.docs.genesys.com/index.php?action=history&amp;feed=atom&amp;title=VM%2FCurrent%2FVMPEGuide%2FVoiceFrontEndServiceMetrics</id>
	<title>VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://all.docs.genesys.com/index.php?action=history&amp;feed=atom&amp;title=VM%2FCurrent%2FVMPEGuide%2FVoiceFrontEndServiceMetrics"/>
	<link rel="alternate" type="text/html" href="https://all.docs.genesys.com/index.php?title=VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics&amp;action=history"/>
	<updated>2026-04-14T23:39:21Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.31.1</generator>
	<entry>
		<id>https://all.docs.genesys.com/index.php?title=VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics&amp;diff=116230&amp;oldid=prev</id>
		<title>Corinneh: Published</title>
		<link rel="alternate" type="text/html" href="https://all.docs.genesys.com/index.php?title=VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics&amp;diff=116230&amp;oldid=prev"/>
		<updated>2022-02-23T20:56:39Z</updated>

		<summary type="html">&lt;p&gt;Published&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{ArticlePEServiceMetrics&lt;br /&gt;
|IncludedServiceId=fa42b327-7d9a-43c9-b13d-c33ec96146eb&lt;br /&gt;
|CRD=Supports both CRD and annotations&lt;br /&gt;
|Port=9101&lt;br /&gt;
|Endpoint=http://&amp;lt;pod-ipaddress&amp;gt;:9101/metrics&lt;br /&gt;
|MetricsUpdateInterval=30 seconds&lt;br /&gt;
|MetricsDefined=Yes&lt;br /&gt;
|MetricsIntro=Voice FrontEnd Service exposes Genesys-defined, FrontEnd Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the FrontEnd Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available FrontEnd Service metrics not documented on this page.&lt;br /&gt;
|PEMetric={{PEMetric&lt;br /&gt;
|Metric=kafka_producer_queue_depth&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=kafka_location&lt;br /&gt;
|MetricDescription=Number of pending Kafka producer events.&lt;br /&gt;
|SampleValue=0&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=kafka_producer_queue_age_seconds&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=seconds&lt;br /&gt;
|Label=kafka_location&lt;br /&gt;
|MetricDescription=Age of the oldest producer pending event, in seconds.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=kafka_producer_error_total&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=kafka_location&lt;br /&gt;
|MetricDescription=Number of Kafka producer errors.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=kafka_producer_state&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=kafka_location&lt;br /&gt;
|MetricDescription=Current state of the Kafka producer.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=kafka_producer_biggest_event_size&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Label=kafka_location, topic&lt;br /&gt;
|MetricDescription=Size of the largest event produced so far.&lt;br /&gt;
|SampleValue=515&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=kafka_max_request_size&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Label=kafka_location&lt;br /&gt;
|MetricDescription=Exposes the configured maximum Kafka request size, for comparison with the biggest event size (kafka_producer_biggest_event_size).&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=log_output_bytes_total&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=bytes&lt;br /&gt;
|Label=level, format, module&lt;br /&gt;
|MetricDescription=Total amount of log output, in bytes.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_requests_total&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=tenant&lt;br /&gt;
|MetricDescription=Number of requests.&lt;br /&gt;
|UsedFor=Traffic&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_responses_total&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=tenant&lt;br /&gt;
|MetricDescription=Number of responses for the requests.&lt;br /&gt;
|UsedFor=Traffic&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_sip_nodes_total&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Number of SIP nodes that are alive.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_sip_node_requests_total&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=sip_node_id, tenant&lt;br /&gt;
|MetricDescription=Number of requests to the SIP nodes.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_sip_node_responses_total&lt;br /&gt;
|Type=counter&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=sip_node_id, tenant, status&lt;br /&gt;
|MetricDescription=Number of responses from the SIP nodes for the requests.&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_sip_node_request_duration_seconds&lt;br /&gt;
|Type=histogram&lt;br /&gt;
|Unit=seconds&lt;br /&gt;
|Label=le, sip_node_id, tenant, status&lt;br /&gt;
|MetricDescription=The time between the SIP node request and the response, measured in seconds.&lt;br /&gt;
|UsedFor=Latency&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=service_version_info&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Label=version&lt;br /&gt;
|MetricDescription=Displays the version of Voice FrontEnd Service that is currently running. In the case of this metric, the labels provide the important information. The metric value is always 1 and does not provide any information.&lt;br /&gt;
|SampleValue=service_version_info{version=&amp;quot;100.0.1000006&amp;quot;} 1&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_health_level&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|MetricDescription=Health level of the sipfe node:&lt;br /&gt;
&lt;br /&gt;
-1 – fail&amp;lt;br /&amp;gt;&lt;br /&gt;
0 – starting&amp;lt;br /&amp;gt;&lt;br /&gt;
1 – degraded&amp;lt;br /&amp;gt;&lt;br /&gt;
2 – pass&lt;br /&gt;
|SampleValue=2&lt;br /&gt;
|UsedFor=Errors&lt;br /&gt;
}}{{PEMetric&lt;br /&gt;
|Metric=sipfe_health_check_error&lt;br /&gt;
|Type=gauge&lt;br /&gt;
|Unit=N/A&lt;br /&gt;
|Label=reason&lt;br /&gt;
|MetricDescription=Health check errors for the sipfe node:&lt;br /&gt;
&lt;br /&gt;
1 – has error&amp;lt;br /&amp;gt;&lt;br /&gt;
0 – no error&lt;br /&gt;
|SampleValue=0&lt;br /&gt;
|UsedFor=Errors&lt;br /&gt;
}}&lt;br /&gt;
|AlertsDefined=Yes&lt;br /&gt;
|PEAlert={{PEAlert&lt;br /&gt;
|Alert=Too many Kafka pending producer events&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Actions:&lt;br /&gt;
&lt;br /&gt;
*Make sure there are no issues with Kafka or &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; pod's CPU and network.&lt;br /&gt;
|BasedOn=kafka_producer_queue_depth&lt;br /&gt;
|Threshold=Too many Kafka producer pending events for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; (more than 100 in 5 minutes).&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Too many received requests without a response&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Actions:&lt;br /&gt;
&lt;br /&gt;
*Collect the service logs for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;; raise an investigation ticket.&lt;br /&gt;
*Restart the service.&lt;br /&gt;
|BasedOn=sipfe_requests_total&lt;br /&gt;
|Threshold=The Front End service at pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; did not respond to too many requests (more than 100 requests without a response, measured over 5 minutes).&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=SIP Cluster Service response latency is too high&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple pods, make sure there are no issues with the SIP Cluster Service (CPU, memory, or network overload).&lt;br /&gt;
*If the alarm is triggered only for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check if there is an issue with the pod (CPU, memory, or network overload).&lt;br /&gt;
|BasedOn=sipfe_sip_node_request_duration_seconds_bucket&lt;br /&gt;
|Threshold=Latency for 95% of messages is more than 0.5 seconds for service &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=No requests received&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Absence of received requests for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*For pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, make sure there are no issues with Orchestration Service and Tenant Service or the network to them.&lt;br /&gt;
|BasedOn=sipfe_requests_total&lt;br /&gt;
|Threshold=increase(sipfe_requests_total{pod=~&amp;quot;sipfe-.+&amp;quot;}[5m]) &amp;lt;= 0 and increase(sipfe_requests_total{pod=~&amp;quot;sipfe-.+&amp;quot;}[10m]) &amp;gt; 100&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Too many failure responses sent&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Too many failure responses are sent by the Front End service at pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*For pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, make sure received requests are valid.&lt;br /&gt;
|BasedOn=sipfe_responses_total&lt;br /&gt;
|Threshold=More than 100 failure responses in 5 consecutive minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Too many Kafka producer errors&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Kafka responds with errors at pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*For pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, make sure there are no issues with Kafka.&lt;br /&gt;
|BasedOn=kafka_producer_error_total&lt;br /&gt;
|Threshold=More than 100 errors in 5 consecutive minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Too many SIP Cluster Service error responses&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=SIP Cluster Service responds with errors at pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple pods, make sure there are no issues with the SIP Cluster Service (CPU, memory, or network overload).&lt;br /&gt;
*If the alarm is triggered only for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check if there are issues with requests sent by the pod.&lt;br /&gt;
|BasedOn=sipfe_sip_node_responses_total&lt;br /&gt;
|Threshold=More than 100 errors in 5 consecutive minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Kafka not available&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Kafka is not available for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka.&lt;br /&gt;
*If the alarm is triggered only for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check if there is an issue with the pod.&lt;br /&gt;
|BasedOn=kafka_producer_state&lt;br /&gt;
|Threshold=Kafka is not available for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; for 5 consecutive minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=SIP Node(s) is not available&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=No available SIP Nodes for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*If the alarm is triggered for multiple services, make sure there are no issues with SIP Nodes, and then restart SIP Nodes.&lt;br /&gt;
*If the alarm is triggered only for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;, check if there is an issue with the pod or the network to SIP Nodes.&lt;br /&gt;
|BasedOn=sipfe_sip_nodes_total&lt;br /&gt;
|Threshold=No available SIP Nodes for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; for 5 consecutive minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod status Failed&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in Failed state.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Restart the pod. Check to see if there are any issues with the pod after restart.&lt;br /&gt;
|BasedOn=kube_pod_status_phase&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in Failed state.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod status Unknown&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in Unknown state for 5 minutes.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Restart the pod. Check to see if there are any issues with the pod after restart.&lt;br /&gt;
|BasedOn=kube_pod_status_phase&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in Unknown state for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod status Pending&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in Pending state for 5 minutes.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Restart the pod. Check to see if there are any issues with the pod after restart.&lt;br /&gt;
|BasedOn=kube_pod_status_phase&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in Pending state for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod status NotReady&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the NotReady state for 5 minutes.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Restart the pod. Check to see if there are any issues with the pod after restart.&lt;br /&gt;
|BasedOn=kube_pod_status_ready&lt;br /&gt;
|Threshold=Pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt; is in the NotReady state for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Container restarted repeatedly&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; was restarted 5 or more times within 15 minutes.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check if a new version of the image was deployed.&lt;br /&gt;
*Check for issues with the Kubernetes cluster.&lt;br /&gt;
|BasedOn=kube_pod_container_status_restarts_total&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; was restarted 5 or more times within 15 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Max replicas is not sufficient for 5 mins&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=For the past 5 minutes, the desired number of replicas is higher than the number of replicas currently available.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check resources available for Kubernetes. Increase resources, if necessary.&lt;br /&gt;
|BasedOn=kube_statefulset_replicas, kube_statefulset_status_replicas&lt;br /&gt;
|Threshold=Desired number of replicas is higher than current available replicas for the past 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pods scaled up greater than 80%&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=For the past 5 minutes, the number of running replicas has exceeded 80% of the configured maximum number of replicas.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check resources available for Kubernetes. Increase resources, if necessary.&lt;br /&gt;
|BasedOn=kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas&lt;br /&gt;
|Threshold=(kube_hpa_status_current_replicas{namespace=&amp;quot;voice&amp;quot;,hpa=&amp;quot;sipfe-node-hpa&amp;quot;} * 100) / kube_hpa_spec_max_replicas{namespace=&amp;quot;voice&amp;quot;,hpa=&amp;quot;sipfe-node-hpa&amp;quot;} &amp;gt; 80  for: 5m&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pods less than Min Replicas&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=The current number of replicas is lower than the minimum number of replicas that should be available.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether Kubernetes cannot deploy new pods or whether pods are failing to become active/ready.&lt;br /&gt;
|BasedOn=kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas&lt;br /&gt;
|Threshold=For the past 5 minutes, the current number of replicas is lower than the minimum number of replicas that should be available.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod CPU greater than 65%&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=High CPU load for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Collect the service logs for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;; raise an investigation ticket.&lt;br /&gt;
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; CPU usage exceeded 65% for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod CPU greater than 80%&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Critical CPU load for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Restart the service.&lt;br /&gt;
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; CPU usage exceeded 80% for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod memory greater than 65%&lt;br /&gt;
|Severity=Warning&lt;br /&gt;
|AlertDescription=High memory usage for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Collect the service logs for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;; raise an investigation ticket.&lt;br /&gt;
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; memory usage exceeded 65% for 5 minutes.&lt;br /&gt;
}}{{PEAlert&lt;br /&gt;
|Alert=Pod memory greater than 80%&lt;br /&gt;
|Severity=Critical&lt;br /&gt;
|AlertDescription=Critical memory usage for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Actions:&lt;br /&gt;
&lt;br /&gt;
*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.&lt;br /&gt;
*Check Grafana for abnormal load.&lt;br /&gt;
*Restart the service for pod &amp;lt;nowiki&amp;gt;{{ $labels.pod }}&amp;lt;/nowiki&amp;gt;.&lt;br /&gt;
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes&lt;br /&gt;
|Threshold=Container &amp;lt;nowiki&amp;gt;{{ $labels.container }}&amp;lt;/nowiki&amp;gt; memory usage exceeded 80% for 5 minutes.&lt;br /&gt;
}}&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Corinneh</name></author>
		
	</entry>
</feed>