Corinneh: Published

2022-02-23T20:56:46Z

{{ArticlePEServiceMetrics
|IncludedServiceId=66502b9a-041d-42d7-b9de-7a8cce2a5e5d
|CRD=Supports both CRD and annotations
|Port=11300
|Endpoint=http://<pod-ipaddress>:11300/metrics
|MetricsUpdateInterval=30 seconds
|MetricsDefined=Yes
|MetricsIntro=Voice SIP Cluster Service exposes Genesys-defined, SIP Cluster Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the SIP Cluster Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available SIP Cluster Service metrics that are not documented on this page.
|PEMetric={{PEMetric
|Metric=http_client_request_duration_seconds
|Type=histogram
|Unit=seconds
|Label=le, target_service_name
|MetricDescription=HTTP client time from request to response, measured in seconds.
|UsedFor=Latency
}}{{PEMetric
|Metric=http_client_response_count
|Type=counter
|Unit=N/A
|Label=target_service_name
|MetricDescription=Number of received HTTP client responses.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_producer_queue_depth
|Type=gauge
|Unit=N/A
|Label=kafka_location
|MetricDescription=Number of Kafka producer pending events.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_producer_queue_age_seconds
|Type=gauge
|Unit=seconds
|Label=kafka_location
|MetricDescription=Age of the oldest producer pending event, measured in seconds.
|UsedFor=Traffic
}}{{PEMetric
|Metric=kafka_producer_error_total
|Type=counter
|Unit=N/A
|Label=kafka_location
|MetricDescription=Number of Kafka producer errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=log_output_bytes_total
|Type=counter
|Unit=bytes
|Label=level, format, module
|MetricDescription=Total amount of log output in bytes.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_requests_total
|Type=counter
|Unit=N/A
|Label=tenant, request
|MetricDescription=Number of processed requests.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_pending_requests_current
|Type=gauge
|Unit=N/A
|Label=tenant, request
|MetricDescription=Number of pending requests.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_requests_queue_size
|Type=gauge
|Unit=N/A
|MetricDescription=Number of postponed requests.
|UsedFor=Saturation
}}{{PEMetric
|Metric=sipnode_sips_request_duration_seconds
|Type=histogram
|Unit=seconds
|Label=le, tenant, request
|MetricDescription=Duration of the request processed by SIP Cluster Service, measured in seconds.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_events_total
|Type=counter
|Unit=N/A
|Label=tenant, event
|MetricDescription=Call events streamed to Redis.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_ha_writes_total
|Type=counter
|Unit=N/A
|MetricDescription=Number of HA writes to Redis.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_ha_reads_total
|Type=counter
|Unit=N/A
|MetricDescription=Number of HA reads from Redis.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_monitoring_events_total
|Type=counter
|Unit=N/A
|Label=tenant
|MetricDescription=Number of monitoring events submitted to Kafka.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_redis_restored_calls_total
|Type=counter
|Unit=N/A
|MetricDescription=Total number of restored calls from Redis cache.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_sips_restarts_total
|Type=counter
|Unit=N/A
|MetricDescription=Total number of SIP Server restarts.
|UsedFor=Errors
}}{{PEMetric
|Metric=sipnode_sips_disconnects_total
|Type=counter
|Unit=N/A
|MetricDescription=Total number of SIP Cluster Service disconnections from SIP Server.
|UsedFor=Errors
}}{{PEMetric
|Metric=sipnode_redis_state
|Type=gauge
|Unit=N/A
|Label=redis_cluster_name
|MetricDescription=Current Redis connection state.
|UsedFor=Errors
}}{{PEMetric
|Metric=sipnode_ors_tlib_latency_msec
|Type=histogram
|Unit=milliseconds
|Label=le, ors
|MetricDescription=T-Library latency from Orchestration Service to SIP Cluster, measured in milliseconds.
|UsedFor=Latency
}}{{PEMetric
|Metric=sipnode_ors_health_check
|Type=gauge
|Unit=N/A
|MetricDescription=SIP Cluster Service to Orchestration Service health check.
|UsedFor=Traffic
}}{{PEMetric
|Metric=service_version_info
|Type=gauge
|Unit=N/A
|Label=version
|MetricDescription=Displays the version of Voice SIP Cluster Service that is currently running. In the case of this metric, the labels provide the important information. The metric value is always 1 and does not provide any information.
|SampleValue=service_version_info{version="100.0.1000006"} 1
}}{{PEMetric
|Metric=sipnode_treatment_not_applied
|Type=counter
|Unit=N/A
|Label=tenant
|MetricDescription=Number of unsuccessful treatments.
|UsedFor=Errors
}}{{PEMetric
|Metric=sipnode_default_routing_total
|Type=counter
|Unit=N/A
|Label=tenant
|MetricDescription=Total number of default routed calls.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_envoy_proxy_status
|Type=gauge
|Unit=N/A
|MetricDescription=Status of the Envoy proxy:

-1 – error 
0 – disconnected 
1 – connected
|SampleValue=1
|UsedFor=Health
}}{{PEMetric
|Metric=sipnode_config_node_status
|Type=gauge
|Unit=N/A
|MetricDescription=Status of the config node connection:

0 – disconnected 
1 – connected
|SampleValue=1
|UsedFor=Health
}}{{PEMetric
|Metric=sipnode_health_level
|Type=gauge
|Unit=N/A
|MetricDescription=Health level of the SIP node (SIP Cluster Service):

-1 – fail 
0 – starting 
1 – degraded 
2 – pass
|SampleValue=2
|UsedFor=Traffic
}}{{PEMetric
|Metric=sipnode_call_state_health_check
|Type=gauge
|Unit=N/A
|Label=memberId
|MetricDescription=SIP Cluster Service to Call State Service health check.
|UsedFor=Health
}}{{PEMetric
|Metric=sips_hastate
|Type=gauge
|Unit=N/A
|MetricDescription=Current HA state of SIP Server:

0 – unknown 
1 – backup 
2 – primary
|SampleValue=2
}}{{PEMetric
|Metric=sips_calls
|Type=gauge
|Unit=N/A
|MetricDescription=Current number of calls.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_call_rate
|Type=gauge
|Unit=N/A
|MetricDescription=Call rate.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_cpu_usage_sips
|Type=gauge
|Unit=N/A
|MetricDescription=SIP Server CPU usage.
|UsedFor=Saturation
}}{{PEMetric
|Metric=sips_cpu_usage_main
|Type=gauge
|Unit=N/A
|MetricDescription=SIP Server main thread CPU usage.
|UsedFor=Saturation
}}{{PEMetric
|Metric=sips_cpu_usage_cm
|Type=gauge
|Unit=N/A
|MetricDescription=CPU usage of the call manager thread.
|UsedFor=Saturation
}}{{PEMetric
|Metric=sips_calls_created
|Type=gauge
|Unit=N/A
|MetricDescription=Total number of created calls.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_abandoned_calls
|Type=gauge
|Unit=N/A
|MetricDescription=Total number of abandoned calls.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_rejected_calls
|Type=gauge
|Unit=N/A
|MetricDescription=Total number of rejected calls.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dialogs_created
|Type=gauge
|Unit=N/A
|MetricDescription=Total number of created SIP dialogs.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_call_recording_failed
|Type=gauge
|Unit=N/A
|MetricDescription=Number of failed call recording sessions.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_urs_response_1_to_5_sec
|Type=gauge
|Unit=N/A
|MetricDescription=Number of URS responses that took from 1 to 5 seconds.
|UsedFor=Latency
}}{{PEMetric
|Metric=sips_urs_response_more_5_sec
|Type=gauge
|Unit=N/A
|MetricDescription=Number of URS responses that took more than 5 seconds.
|UsedFor=Latency
}}{{PEMetric
|Metric=sips_user_data_updates
|Type=gauge
|Unit=N/A
|MetricDescription=Number of UserData updates.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_routing_timeouts
|Type=gauge
|Unit=N/A
|MetricDescription=Number of routing timeouts.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_trequest_rate
|Type=gauge
|Unit=N/A
|MetricDescription=T-Requests rate.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_treatment_rate
|Type=gauge
|Unit=N/A
|MetricDescription=TApplyTreatment requests rate.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_userdata_rate
|Type=gauge
|Unit=N/A
|MetricDescription=UserData change rate.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_sips_memory_usage
|Type=gauge
|Unit=N/A
|MetricDescription=Memory usage of the SIP Server process.
|UsedFor=Saturation
}}{{PEMetric
|Metric=sips_stat_fetch_total
|Type=counter
|Unit=N/A
|MetricDescription=Number of successful SIP Server statistic fetches.
|UsedFor=Other
}}{{PEMetric
|Metric=sips_sip_response_time_ms
|Type=histogram
|Unit=milliseconds
|Label=le
|MetricDescription=SIP Server metric of response time, measured in milliseconds.
|UsedFor=Latency
}}{{PEMetric
|Metric=sips_trunk_in_service
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Trunk devices that are in service.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_trunk_ncallscreated
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of created calls per trunk.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_trunk_noos_detected
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of trunks that are out of service.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_trunk_n4xx_received
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of received 4xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_trunk_n5xx_received
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of received 5xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_trunk_n6xx_received
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of received 6xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_softswitch_in_service
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Softswitch devices that are in service.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_softswitch_ncallscreated
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of created calls per softswitch device.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_softswitch_noos_detected
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of softswitch devices that are out of service.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_softswitch_n4xx_received
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of received 4xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_softswitch_n5xx_received
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of received 5xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_softswitch_n6xx_received
|Type=gauge
|Unit=N/A
|Label=device_name, tenant
|MetricDescription=Number of received 6xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_msml_in_service
|Type=gauge
|Unit=N/A
|Label=device_name
|MetricDescription=MSML devices that are in service.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_msml_ncallscreated
|Type=gauge
|Unit=N/A
|Label=device_name
|MetricDescription=Number of created calls per MSML device.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_msml_noos_detected
|Type=gauge
|Unit=N/A
|Label=device_name
|MetricDescription=Number of MSML devices that are out of service.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_msml_n4xx_received
|Type=gauge
|Unit=N/A
|Label=device_name
|MetricDescription=Number of received 4xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_msml_n5xx_received
|Type=gauge
|Unit=N/A
|Label=device_name
|MetricDescription=Number of received 5xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_msml_n6xx_received
|Type=gauge
|Unit=N/A
|Label=device_name
|MetricDescription=Number of received 6xx messages.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_state
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service state:

0 – Out-Of-Service 
1 – In-Service
|SampleValue=1
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_dp_queue_size
|Type=gauge
|Unit=N/A
|MetricDescription=Size of the request queue to Dial Plan Service.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_dp_avg_queue_time
|Type=gauge
|Unit=milliseconds
|MetricDescription=Average queue time (msec) of requests to Dial Plan Service.
|UsedFor=Latency
}}{{PEMetric
|Metric=sips_dp_connections
|Type=gauge
|Unit=N/A
|MetricDescription=Number of connections to Dial Plan Service per URL.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_dp_active_connections
|Type=gauge
|Unit=N/A
|MetricDescription=Number of active connections to Dial Plan Service.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_dp_req_rate
|Type=gauge
|Unit=N/A
|MetricDescription=Request rate to Dial Plan Service.
|UsedFor=Traffic
}}{{PEMetric
|Metric=sips_dp_400_errors
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service 400 type of errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_404_errors
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service 404 type of errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_4xx_errors
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service 4xx type of errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_500_errors
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service 500 type of errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_501_errors
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service 501 type of errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_5xx_errors
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service 5xx type of errors.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_timeouts
|Type=gauge
|Unit=N/A
|MetricDescription=Dial Plan Service timeouts.
|UsedFor=Errors
}}{{PEMetric
|Metric=sips_dp_average_response_latency
|Type=gauge
|MetricDescription=Dial Plan Service average response latency.
|UsedFor=Latency
}}{{PEMetric
|Metric=sips_sipproxy_in_service
|Type=gauge
|Unit=N/A
|MetricDescription=SIP Proxy Service state:

0 – Out-Of-Service 
1 – In-Service
|SampleValue=1
|UsedFor=Traffic
}}{{PEMetric
|Metric=trunk_config_synced_count
|Type=gauge
|Unit=N/A
|MetricDescription=Number of trunks synchronized with SIP Server.
}}{{PEMetric
|Metric=trunk_config_cached_count
|Type=gauge
|Unit=N/A
|MetricDescription=Number of trunks obtained from the config node.
}}{{PEMetric
|Metric=trunk_config_cfg_node_error_count
|Type=counter
|Unit=N/A
|MetricDescription=Number of failed attempts to read from the config node.
}}{{PEMetric
|Metric=trunk_config_tlib_connection
|Type=gauge
|Unit=N/A
|MetricDescription=Number of trunks with the T-Library connection.
}}
|AlertsDefined=Yes
|PEAlert={{PEAlert
|Alert=Too many Kafka pending events
|Severity=Critical
|AlertDescription=Too many Kafka producer pending events for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*Ensure there are no issues with Kafka or with the CPU and network of pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=kafka_producer_queue_depth
|Threshold=Too many Kafka producer pending events for service <nowiki>{{ $labels.container }}</nowiki> (more than 100 in 5 minutes).
}}{{PEAlert
|Alert=Dial Plan node is overloaded
|Severity=Critical
|AlertDescription=Dial Plan node is overloaded as the response latency increases.

Actions:

*Check that the inbound call rate to SIP Server is not too high.
*Check the Dial Plan node CPU and memory usage.
*Check the network connection between SIP Server and Dial Plan nodes.
|BasedOn=sips_dp_average_response_latency
|Threshold=Dial Plan node is overloaded as the response latency increases (more than 1000).
}}{{PEAlert
|Alert=Dial Plan Queue Increase
|Severity=Critical
|AlertDescription=The processing queue grows because Dial Plan requests are very large or there is a connection issue with the Dial Plan node.

Actions:

*Check SIP Server inbound call rate.
*Check the connection between SIP Server and the Dial Plan node.
|BasedOn=sips_dp_queue_size
|Threshold=The processing queue size is greater than 10 requests for 1 minute.
}}{{PEAlert
|Alert=SIP Proxy overloaded
|Severity=Critical
|AlertDescription=SIP Proxy is overloaded.

Actions:

*Check SIP Proxy nodes for CPU and memory usage.
*If SIP Proxy nodes have acceptable CPU and memory usage, then check for errors or a "hang-up" state that could delay SIP Proxy in forwarding.
*Check the SBC side for network delays.
|BasedOn=sips_sip_response_time_ms_sum, sips_sip_response_time_ms_count
|Threshold=Response time is greater than 20 milliseconds for 1 minute.
}}{{PEAlert
|Alert=SIP Node HealthCheck Fail
|Severity=Critical
|AlertDescription=SIP Node health level fails for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*Check for failure of dependent services (Redis/Kafka/SIP Proxy/GVP/Dial Plan).
*Check for Envoy proxy failure, then restart the pod.
|BasedOn=sipnode_health_level
|Threshold=SIP Node health level fails for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 minutes.
}}{{PEAlert
|Alert=Kafka not available
|Severity=Critical
|AlertDescription=Kafka is not available for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka.
*If the alarm is triggered only for pod <nowiki>{{ $labels.pod }}</nowiki>, check if there is an issue with the pod.
|BasedOn=kafka_producer_state
|Threshold=Kafka is not available for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 minutes.
}}{{PEAlert
|Alert=Pod Status Error
|Severity=Warning
|AlertDescription=Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_phase
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in Failed, Unknown, or Pending state.
}}{{PEAlert
|Alert=Pod Status NotReady
|Severity=Warning
|AlertDescription=Pod <nowiki>{{ $labels.pod }}</nowiki> is in NotReady state.

Actions:

*Restart the pod. Check if there are any issues with the pod after restart.
|BasedOn=kube_pod_status_ready
|Threshold=Pod <nowiki>{{ $labels.pod }}</nowiki> is in NotReady state for 5 minutes.
}}{{PEAlert
|Alert=Container Restarted Repeatedly
|Severity=Critical
|AlertDescription=Container <nowiki>{{ $labels.container }}</nowiki> was repeatedly restarted.

Actions:

*Check if the new version of the image was deployed.
*Check for issues with the Kubernetes cluster.
|BasedOn=kube_pod_container_status_restarts_total
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> was restarted 5 or more times within 15 minutes.
}}{{PEAlert
|Alert=Ready Pods below 60%
|Severity=Critical
|AlertDescription=The number of statefulset <nowiki>{{ $labels.statefulset }}</nowiki> pods in the Ready state has dropped below 60%.

Actions:

*Check if the new version of the image was deployed.
*Check for issues with the Kubernetes cluster.
|BasedOn=kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas_current
|Threshold=For the last 5 minutes, fewer than 60% of the currently available statefulset <nowiki>{{ $labels.statefulset }}</nowiki> pods have been in the Ready state.
}}{{PEAlert
|Alert=Pods scaled up greater than 80%
|Severity=Critical
|AlertDescription=The current number of replicas is more than 80% of the maximum number of replicas.

Actions:

*Check if max replicas must be modified based on load.
|BasedOn=kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas
|Threshold=For 5 consecutive minutes, the number of replicas is more than 80% of the maximum number of replicas.
}}{{PEAlert
|Alert=Pods less than Min Replicas
|Severity=Critical
|AlertDescription=The current number of replicas is less than the minimum replicas that should be available. This might be because Kubernetes cannot deploy a new pod or pods are failing to be active/ready.

Actions:

*If all services have the same issue, then check Kubernetes nodes and Consul health.
*If the issue is only with the SIP Cluster service, then check pod logs or the deployment manifest/helm errors.
|BasedOn=kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas
|Threshold=For 5 consecutive minutes, the number of replicas is less than the minimum replicas that should be available.
}}{{PEAlert
|Alert=Pod CPU greater than 80%
|Severity=Critical
|AlertDescription=Critical CPU load for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
*Check Grafana for abnormal load.
*Collect the service logs for pod <nowiki>{{ $labels.pod }}</nowiki>; raise an investigation ticket.
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> CPU usage exceeded 80% for 5 minutes.
}}{{PEAlert
|Alert=Pod CPU greater than 65%
|Severity=Warning
|AlertDescription=High CPU load for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
*Check Grafana for abnormal load.
*Collect the service logs for pod <nowiki>{{ $labels.pod }}</nowiki>; raise an investigation ticket.
|BasedOn=container_cpu_usage_seconds_total, container_spec_cpu_period
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> CPU usage exceeded 65% for 5 minutes.
}}{{PEAlert
|Alert=Pod memory greater than 80%
|Severity=Critical
|AlertDescription=Critical memory usage for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
*Check Grafana for abnormal load.
*Restart the service for pod <nowiki>{{ $labels.pod }}</nowiki>.
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> memory usage exceeded 80% for 5 minutes.
}}{{PEAlert
|Alert=Pod memory greater than 65%
|Severity=Warning
|AlertDescription=High memory usage for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
*Check Grafana for abnormal load.
*Collect the service logs for pod <nowiki>{{ $labels.pod }}</nowiki>; raise an investigation ticket.
|BasedOn=container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes
|Threshold=Container <nowiki>{{ $labels.container }}</nowiki> memory usage exceeded 65% for 5 minutes.
}}{{PEAlert
|Alert=Redis not available
|Severity=Critical
|AlertDescription=Redis is not available for pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis.
*If the alarm is triggered only for pod <nowiki>{{ $labels.pod }}</nowiki>, check if there is an issue with the pod.
|BasedOn=redis_state
|Threshold=Redis is not available for pod <nowiki>{{ $labels.pod }}</nowiki> for 5 consecutive minutes.
}}{{PEAlert
|Alert=Too many Kafka producer errors
|Severity=Critical
|AlertDescription=Kafka responds with errors at pod <nowiki>{{ $labels.pod }}</nowiki>.

Actions:

*For pod <nowiki>{{ $labels.pod }}</nowiki>, ensure there are no issues with Kafka.
|BasedOn=kafka_producer_error_total
|Threshold=More than 100 errors for 5 consecutive minutes.
}}{{PEAlert
|Alert=SIP Server main thread consuming more than 65% CPU for 5 mins
|Severity=Warning
|AlertDescription=Main thread consumes too much CPU.

Actions:

*Collect SIP Server Main thread logs; that is, log files without index in the file name (appname_date.log files). Raise an investigation ticket.
|BasedOn=sips_cpu_usage_main
|Threshold=Main thread consumes too much CPU (more than 65% for 5 consecutive minutes).
}}{{PEAlert
|Alert=Calls activity drop
|Severity=Warning
|AlertDescription=The number of active calls on a specific SIP Server has dropped noticeably and no new calls are arriving for processing.

Actions:

*If the problematic SIP Server is the primary, do a switchover, and then restart the former primary server.
*If the problematic SIP Server is the backup, restart the backup server. Collect SIP Server Main thread logs; that is, log files without index in the file name (appname_date.log files). Raise an investigation ticket.
|BasedOn=sips_calls, sips_calls_created
|Threshold=The absolute value of active calls on a specific SIP Server dropped by more than 30 calls in 2 minutes and no new calls are arriving at the SIP Server for processing.
}}{{PEAlert
|Alert=Dial Plan Node Down
|Severity=Critical
|AlertDescription=No Dial Plan nodes are reachable from SIP Server and all connections to Dial Plan nodes are down.

Actions:

*Check the network connection between SIP Server and the Dial Plan node host.
*Check the Dial Plan node CPU and memory usage.
|BasedOn=sips_dp_active_connections
|Threshold=All connections to Dial Plan nodes are down.
}}{{PEAlert
|Alert=Dialplan Node problem
|Severity=Warning
|AlertDescription=The Dial Plan node rejects requests with an error, or it does not respond and the requests time out.

Actions:

*Check the network connection between SIP Server and the Dial Plan host.
*Check that Dial Plan nodes are running.
|BasedOn=sips_dp_timeouts
|Threshold=During 1 minute, the Dial Plan node rejects more than 5 requests with an error or more than 5 requests time out because the Dial Plan node fails to respond.
}}{{PEAlert
|Alert=Routing timeout counter growth
|Severity=Warning
|AlertDescription=The trigger detects that routing timeouts are increasing.

Actions:

*Check the URS_RESPONSE_MORE5SEC stat value. If it's increasing, then investigate why URS doesn't respond to SIP Server in time.
*Check SIPS-to-URS network connectivity.
|BasedOn=sips_routing_timeouts
|Threshold=The absolute value of NROUTINGTIMEOUTS on a specific SIP Server increased by more than 20 in 2 minutes.
}}{{PEAlert
|Alert=SIP trunk is out of service
|Severity=Critical
|AlertDescription=SIP trunk is out of service.

Actions:

*For Primary and Secondary trunks:
**Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
**Troubleshoot the SBC.
*For Inter-SIP Server trunks:
**Troubleshoot the SIP Server-to-SIP Server network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
|BasedOn=sips_trunk_in_service
|Threshold=SIP trunk is out of service for more than 1 minute.
}}{{PEAlert
|Alert=Media service is out of service
|Severity=Critical
|AlertDescription=Media service is out of service.

Actions:

*Troubleshoot the SIP Server-to-Resource Manager (RM) network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
*Troubleshoot RM; consider restarting RM.
*After 5 minutes, redirect traffic to another site.
|BasedOn=sips_msml_in_service
|Threshold=Media service is out of service for more than 1 minute.
}}{{PEAlert
|Alert=SIP softswitch is out of service
|Severity=Critical
|AlertDescription=Actions:

*Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
*Troubleshoot the SBC.
|BasedOn=sips_softswitch_in_service
|Threshold=SIP softswitch is out of service.
}}{{PEAlert
|Alert=SIP Proxy is out of service
|Severity=Critical
|AlertDescription=Actions:

*Troubleshoot the SIP Server-to-SIP Proxy nodes network connections. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
*Troubleshoot SIP Proxy nodes.
|BasedOn=sips_sipproxy_in_service
|Threshold=SIP Proxy is out of service.
}}
}}
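
The metrics endpoint listed above returns samples in the Prometheus text exposition format. The following is a minimal sketch, not a Genesys-provided tool, of how such output could be read to decode the <code>sipnode_health_level</code> gauge and to derive a mean SIP response time from <code>sips_sip_response_time_ms_sum</code> and <code>sips_sip_response_time_ms_count</code>. The sample text, the helper names (<code>parse_metrics</code>, <code>mean_response_ms</code>), and the values are assumptions made for illustration. The parser is deliberately simplified and does not handle label values that contain a closing brace.

```python
import re

# Hypothetical scrape output; real values come from
# http://<pod-ipaddress>:11300/metrics and will differ.
SAMPLE = """\
# TYPE sipnode_health_level gauge
sipnode_health_level 2
# TYPE sips_sip_response_time_ms histogram
sips_sip_response_time_ms_bucket{le="10"} 40
sips_sip_response_time_ms_bucket{le="+Inf"} 50
sips_sip_response_time_ms_sum 600
sips_sip_response_time_ms_count 50
service_version_info{version="100.0.1000006"} 1
"""

# Value meanings as documented for sipnode_health_level on this page.
HEALTH_LEVELS = {-1: "fail", 0: "starting", 1: "degraded", 2: "pass"}

# metric name, optional {labels} block, then the sample value
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {(name, raw_label_block): value} for every sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def mean_response_ms(samples):
    """Mean SIP response time since process start (sum divided by count)."""
    total = samples[("sips_sip_response_time_ms_sum", "")]
    count = samples[("sips_sip_response_time_ms_count", "")]
    return total / count if count else 0.0


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    level = int(parsed[("sipnode_health_level", "")])
    print("health:", HEALTH_LEVELS.get(level, "unrecognized"))
    print("mean response (ms):", mean_response_ms(parsed))
```

Note that this mean covers the whole lifetime of the process. An alert such as "SIP Proxy overloaded" evaluates the response time over a 1-minute window, which would require comparing two scrapes or using a Prometheus rate-based query instead.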
