Voice SIP Cluster Service metrics and alerts

From Genesys Documentation
Revision as of 20:56, February 23, 2022 by Corinneh (talk | contribs) (Published)
This topic is part of the manual Voice Microservices Private Edition Guide for version Current of Voice Microservices.

Find the metrics Voice SIP Cluster Service exposes and the alerts defined for Voice SIP Cluster Service.

Service: Voice SIP Cluster Service
CRD or annotations?: Supports both CRD and annotations
Port: 11300
Endpoint/Selector: http://<pod-ipaddress>:11300/metrics
Metrics update interval: 30 seconds
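
As an illustration of reading the endpoint listed above, the following is a minimal Python sketch that fetches the /metrics page from a pod and parses the Prometheus text format into a dictionary. The function names parse_metrics and scrape are hypothetical, and the parser assumes that samples carry no trailing timestamp; it is a sketch for ad hoc inspection, not a replacement for a Prometheus scrape job.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {"name{labels}": value}.

    Assumption: each sample line ends with the value (no timestamp field).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def scrape(pod_ip, port=11300, timeout=5.0):
    """Fetch and parse http://<pod-ipaddress>:11300/metrics for one pod."""
    url = f"http://{pod_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, scrape("10.0.0.12") would return the samples of a pod at that (made-up) address, provided the address is reachable from where the script runs.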


Metrics

Voice SIP Cluster Service exposes Genesys-defined, SIP Cluster Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the SIP Cluster Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining SIP Cluster Service metrics that are currently available but not documented on this page.
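
One way to query Prometheus directly is through its HTTP API. The following Python sketch builds an instant-query URL for the /api/v1/query endpoint and extracts label sets and values from the JSON response. The base URL, the helper names, and the choice of metric are assumptions for illustration; the Prometheus address depends on your deployment.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def extract_samples(payload):
    """Return [(labels, value)] from an instant-vector query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    # Each result item carries "metric" (labels) and "value" ([timestamp, "value"]).
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def instant_query(base_url, promql, timeout=10.0):
    """Run an instant query against Prometheus and return parsed samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_samples(json.load(resp))
```

Calling instant_query with a hypothetical base URL such as "http://prometheus.example:9090" and the expression "sipnode_health_level" would return one entry per time series that currently matches.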

Metric and description Metric details Indicator of
http_client_request_duration_seconds

HTTP client time from request to response, measured in seconds.

Unit: seconds

Type: histogram
Label: le, target_service_name
Sample value:

Latency
http_client_response_count

Number of received HTTP client responses.

Unit: N/A

Type: counter
Label: target_service_name
Sample value:

Traffic
kafka_producer_queue_depth

Number of Kafka producer pending events.

Unit: N/A

Type: gauge
Label: kafka_location
Sample value:

Traffic
kafka_producer_queue_age_seconds

Age of the oldest producer pending event, measured in seconds.

Unit: seconds

Type: gauge
Label: kafka_location
Sample value:

Traffic
kafka_producer_error_total

Number of Kafka producer errors.

Unit: N/A

Type: counter
Label: kafka_location
Sample value:

Errors
log_output_bytes_total

Total amount of log output in bytes.

Unit: bytes

Type: counter
Label: level, format, module
Sample value:

Traffic
sipnode_requests_total

Number of processed requests.

Unit: N/A

Type: counter
Label: tenant, request
Sample value:

Traffic
sipnode_pending_requests_current

Number of pending requests.

Unit: N/A

Type: gauge
Label: tenant, request
Sample value:

Traffic
sipnode_requests_queue_size

Number of postponed requests.

Unit: N/A

Type: gauge
Label:
Sample value:

Saturation
sipnode_sips_request_duration_seconds

Duration of the request processed by SIP Cluster Service, measured in seconds.

Unit: seconds

Type: histogram
Label: le, tenant, request
Sample value:

Traffic
sipnode_events_total

Call events streamed to Redis.

Unit: N/A

Type: counter
Label: tenant, event
Sample value:

Traffic
sipnode_ha_writes_total

Number of HA writes to Redis.

Unit: N/A

Type: counter
Label:
Sample value:

Traffic
sipnode_ha_reads_total

Number of HA reads from Redis.

Unit: N/A

Type: counter
Label:
Sample value:

Traffic
sipnode_monitoring_events_total

Number of monitoring events submitted to Kafka.

Unit: N/A

Type: counter
Label: tenant
Sample value:

Traffic
sipnode_redis_restored_calls_total

Total number of restored calls from Redis cache.

Unit: N/A

Type: counter
Label:
Sample value:

Traffic
sipnode_sips_restarts_total

Total number of SIP Server restarts.

Unit: N/A

Type: counter
Label:
Sample value:

Errors
sipnode_sips_disconnects_total

Total number of SIP Cluster Service disconnections from SIP Server.

Unit: N/A

Type: counter
Label:
Sample value:

Errors
sipnode_redis_state

Current Redis connection state.

Unit: N/A

Type: gauge
Label: redis_cluster_name
Sample value:

Errors
sipnode_ors_tlib_latency_msec

T-Library latency from Orchestration Service to SIP Cluster, measured in milliseconds.

Unit: milliseconds

Type: histogram
Label: le, ors
Sample value:

Latency
sipnode_ors_health_check

SIP Cluster Service to Orchestration Service health check.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
service_version_info

Displays the version of Voice SIP Cluster Service that is currently running. In the case of this metric, the labels provide the important information. The metric value is always 1 and does not provide any information.

Unit: N/A

Type: gauge
Label: version
Sample value: service_version_info{version="100.0.1000006"} 1

sipnode_treatment_not_applied

Number of unsuccessful treatments.

Unit: N/A

Type: counter
Label: tenant
Sample value:

Errors
sipnode_default_routing_total

Total number of default routed calls.

Unit: N/A

Type: counter
Label: tenant
Sample value:

Traffic
sipnode_envoy_proxy_status

Status of the Envoy proxy:

-1 – error
0 – disconnected
1 – connected

Unit: N/A

Type: gauge
Label:
Sample value: 1

Health
sipnode_config_node_status

Status of the config node connection:

0 – disconnected
1 – connected

Unit: N/A

Type: gauge
Label:
Sample value: 1

Health
sipnode_health_level

Health level of the SIP node (SIP Cluster Service):

-1 – fail
0 – starting
1 – degraded
2 – pass

Unit: N/A

Type: gauge
Label:
Sample value: 2

Traffic
sipnode_call_state_health_check

SIP Cluster Service to Call State Service health check.

Unit: N/A

Type: gauge
Label: memberId
Sample value:

Health
sips_hastate

Current HA state of SIP Server:

0 – Unknown
1 – backup
2 – primary

Unit: N/A

Type: gauge
Label:
Sample value: 2

sips_calls

Current number of calls.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_call_rate

Call rate.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_cpu_usage_sips

SIP Server CPU usage.

Unit: N/A

Type: gauge
Label:
Sample value:

Saturation
sips_cpu_usage_main

SIP Server main thread CPU usage.

Unit: N/A

Type: gauge
Label:
Sample value:

Saturation
sips_cpu_usage_cm

CPU usage of the call manager thread.

Unit: N/A

Type: gauge
Label:
Sample value:

Saturation
sips_calls_created

Total number of created calls.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_abandoned_calls

Total number of abandoned calls.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_rejected_calls

Total number of rejected calls.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dialogs_created

Total number of created SIP dialogs.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_call_recording_failed

Number of failed call recording sessions.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_urs_response_1_to_5_sec

Number of URS responses that took from 1 to 5 seconds.

Unit: N/A

Type: gauge
Label:
Sample value:

Latency
sips_urs_response_more_5_sec

Number of URS responses that took more than 5 seconds.

Unit: N/A

Type: gauge
Label:
Sample value:

Latency
sips_user_data_updates

Number of UserData updates.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_routing_timeouts

Number of routing timeouts.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_trequest_rate

T-Requests rate.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_treatment_rate

TApplyTreatment requests rate.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_userdata_rate

UserData change rate.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_sips_memory_usage

Memory usage of the SIP Server process.

Unit: N/A

Type: gauge
Label:
Sample value:

Saturation
sips_stat_fetch_total

Number of successful SIP Server statistic fetches.

Unit: N/A

Type: counter
Label:
Sample value:

Other
sips_sip_response_time_ms

SIP Server metric of response time, measured in milliseconds.

Unit: milliseconds

Type: histogram
Label: le
Sample value:

Latency
sips_trunk_in_service

Trunk devices that are in service.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Traffic
sips_trunk_ncallscreated

Number of created calls per trunk.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Traffic
sips_trunk_noos_detected

Number of trunks that are out of service.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_trunk_n4xx_received

Number of received 4xx messages.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_trunk_n5xx_received

Number of received 5xx messages.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_trunk_n6xx_received

Number of received 6xx messages.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_softswitch_in_service

Softswitch devices that are in service.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Traffic
sips_softswitch_ncallscreated

Number of created calls per softswitch device.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Traffic
sips_softswitch_noos_detected

Number of softswitch devices that are out of service.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_softswitch_n4xx_received

Number of received 4xx messages.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_softswitch_n5xx_received

Number of received 5xx messages.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_softswitch_n6xx_received

Number of received 6xx messages.

Unit: N/A

Type: gauge
Label: device_name, tenant
Sample value:

Errors
sips_msml_in_service

MSML devices that are in service.

Unit: N/A

Type: gauge
Label: device_name
Sample value:

Traffic
sips_msml_ncallscreated

Number of created calls per MSML device.

Unit: N/A

Type: gauge
Label: device_name
Sample value:

Traffic
sips_msml_noos_detected

Number of MSML devices that are out of service.

Unit: N/A

Type: gauge
Label: device_name
Sample value:

Errors
sips_msml_n4xx_received

Number of received 4xx messages.

Unit: N/A

Type: gauge
Label: device_name
Sample value:

Errors
sips_msml_n5xx_received

Number of received 5xx messages.

Unit: N/A

Type: gauge
Label: device_name
Sample value:

Errors
sips_msml_n6xx_received

Number of received 6xx messages.

Unit: N/A

Type: gauge
Label: device_name
Sample value:

Errors
sips_dp_state

Dial Plan Service state:

0 – Out-Of-Service
1 – In-Service

Unit: N/A

Type: gauge
Label:
Sample value: 1

Traffic
sips_dp_queue_size

Size of the request queue to Dial Plan Service.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_dp_avg_queue_time

Average queue time (msec) of requests to Dial Plan Service.

Unit: milliseconds

Type: gauge
Label:
Sample value:

Latency
sips_dp_connections

Number of connections to Dial Plan Service per URL.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_dp_active_connections

Number of active connections to Dial Plan Service.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_dp_req_rate

Request rate to Dial Plan Service.

Unit: N/A

Type: gauge
Label:
Sample value:

Traffic
sips_dp_400_errors

Dial Plan Service 400 type of errors.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_404_errors

Dial Plan Service 404 type of errors.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_4xx_errors

Dial Plan Service 4xx type of errors.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_500_errors

Dial Plan Service 500 type of errors.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_501_errors

Dial Plan Service 501 type of errors.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_5xx_errors

Dial Plan Service 5xx type of errors.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_timeouts

Dial Plan Service timeouts.

Unit: N/A

Type: gauge
Label:
Sample value:

Errors
sips_dp_average_response_latency

Dial Plan Service average response latency.

Unit:

Type: gauge
Label:
Sample value:

Latency
sips_sipproxy_in_service

SIP Proxy Service state:

0 – Out-Of-Service
1 – In-Service

Unit: N/A

Type: gauge
Label:
Sample value: 1

Traffic
trunk_config_synced_count

Number of trunks synchronized with SIP Server.

Unit: N/A

Type: gauge
Label:
Sample value:

trunk_config_cached_count

Number of trunks obtained from the config node.

Unit: N/A

Type: gauge
Label:
Sample value:

trunk_config_cfg_node_error_count

Number of failed attempts to read from the config node.

Unit: N/A

Type: counter
Label:
Sample value:

trunk_config_tlib_connection

Number of trunks with the T-Library connection.

Unit: N/A

Type: gauge
Label:
Sample value:
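
Several of the gauges listed above report a state as an integer. The following Python sketch collects the value-to-state tables given in the metric descriptions on this page and turns a scraped value into its documented name. The describe_state helper and the STATE_NAMES table are hypothetical conveniences, not part of the service.

```python
# Value-to-state tables copied from the metric descriptions on this page.
STATE_NAMES = {
    "sipnode_envoy_proxy_status": {-1: "error", 0: "disconnected", 1: "connected"},
    "sipnode_config_node_status": {0: "disconnected", 1: "connected"},
    "sipnode_health_level": {-1: "fail", 0: "starting", 1: "degraded", 2: "pass"},
    "sips_hastate": {0: "Unknown", 1: "backup", 2: "primary"},
    "sips_dp_state": {0: "Out-Of-Service", 1: "In-Service"},
    "sips_sipproxy_in_service": {0: "Out-Of-Service", 1: "In-Service"},
}


def describe_state(metric, value):
    """Translate a state gauge value into the name documented for it."""
    names = STATE_NAMES.get(metric)
    if names is None:
        raise KeyError(f"no state table for metric {metric!r}")
    # Gauge values arrive as floats; the documented states are integers.
    return names.get(int(value), f"unrecognized value {value}")
```

For example, describe_state("sipnode_health_level", 2.0) returns "pass".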

Alerts

The following alerts are defined for Voice SIP Cluster Service.

Alert Severity Description Based on Threshold
Too many Kafka pending events Critical Too many Kafka producer pending events for pod {{ $labels.pod }}.

Actions:

  • Ensure there are no issues with Kafka, {{ $labels.pod }} pod's CPU, and network.
kafka_producer_queue_depth Too many Kafka producer pending events for service {{ $labels.container }} (more than 100 in 5 minutes).


Dial Plan node is overloaded Critical Dial Plan node is overloaded as the response latency increases.

Actions:

  • Check that the inbound call rate to SIP Server is not too high.
  • Check the Dial Plan node CPU and memory usage.
  • Check the network connection between SIP Server and Dial Plan nodes.
sips_dp_average_response_latency Dial Plan node is overloaded as the response latency increases (more than 1000).


Dial Plan Queue Increase Critical The processing queue grows because Dial Plan requests are very large or because there is a connection issue with the Dial Plan node.

Actions:

  • Check SIP Server inbound call rate.
  • Check the connection between SIP Server and the Dial Plan node.
sips_dp_queue_size The processing queue size is greater than 10 requests for 1 minute.


SIP Proxy overloaded Critical SIP Proxy is overloaded.

Actions:

  • Check SIP Proxy nodes for CPU and memory usage.
  • If SIP Proxy nodes have acceptable CPU and memory usage, then check for errors or a "hang-up" state which could delay SIP Proxy in forwarding.
  • Check the SBC side for network delays.
sips_sip_response_time_ms_sum, sips_sip_response_time_ms_count Response time is greater than 20 milliseconds for 1 minute.
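
The SIP Proxy overloaded alert is based on the _sum and _count series of the sips_sip_response_time_ms histogram. This page does not show the alert expression itself, so the following Python sketch only assumes the usual approach: the average response time over a window is the increase in _sum divided by the increase in _count. The function names and the one-minute sampling are assumptions.

```python
def average_response_ms(sum_start, count_start, sum_end, count_end):
    """Average response time (ms) between two samples of _sum and _count.

    Returns None when no responses were observed in the window.
    """
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None
    return (sum_end - sum_start) / delta_count


def is_overloaded(sum_start, count_start, sum_end, count_end, threshold_ms=20.0):
    """True if the average response time in the window exceeds the threshold."""
    avg = average_response_ms(sum_start, count_start, sum_end, count_end)
    return avg is not None and avg > threshold_ms
```

With samples taken one minute apart, is_overloaded(1000.0, 100, 4000.0, 200) evaluates an average of 30 ms across 100 responses against the 20 ms threshold.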


SIP Node HealthCheck Fail Critical SIP Node health level fails for pod {{ $labels.pod }}.

Actions:

  • Check for failure of dependent services (Redis/Kafka/SIP Proxy/GVP/Dial Plan).
  • Check for Envoy proxy failure, then restart the pod.
sipnode_health_level SIP Node health level fails for pod {{ $labels.pod }} for 5 minutes.


Kafka not available Critical Kafka is not available for pod {{ $labels.pod }}.

Actions:

  • If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka.
  • If the alarm is triggered only for pod {{ $labels.pod }}, check if there is an issue with the pod.
kafka_producer_state Kafka is not available for pod {{ $labels.pod }} for 5 minutes.


Pod Status Error Warning Actions:
  • Restart the pod. Check if there are any issues with the pod after restart.
kube_pod_status_phase Pod {{ $labels.pod }} is in Failed, Unknown, or Pending state.


Pod Status NotReady Warning Pod {{ $labels.pod }} is in NotReady state.

Actions:

  • Restart the pod. Check if there are any issues with the pod after restart.
kube_pod_status_ready Pod {{ $labels.pod }} is in NotReady state for 5 minutes.


Container Restarted Repeatedly Critical Container {{ $labels.container }} was repeatedly restarted.

Actions:

  • Check if the new version of the image was deployed.
  • Check for issues with the Kubernetes cluster.
kube_pod_container_status_restarts_total Container {{ $labels.container }} was restarted 5 or more times within 15 minutes.


Ready Pods below 60% Critical The number of statefulset {{ $labels.statefulset}} pods in the Ready state has dropped below 60%.

Actions:

  • Check if the new version of the image was deployed.
  • Check for issues with the Kubernetes cluster.
kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas_current For the last 5 minutes, fewer than 60% of the currently available statefulset {{ $labels.statefulset}} pods have been in the Ready state.


Pods scaled up greater than 80% Critical The current number of replicas is more than 80% of the maximum number of replicas.

Actions:

  • Check if max replicas must be modified based on load.
kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas For 5 consecutive minutes, the number of replicas is more than 80% of the maximum number of replicas.


Pods less than Min Replicas Critical The current number of replicas is less than the minimum replicas that should be available. This might be because Kubernetes cannot deploy a new pod or pods are failing to be active/ready.

Actions:

  • If all services have the same issue, then check Kubernetes nodes and Consul health.
  • If the issue is only with the SIP Cluster service, then check pod logs or the deployment manifest/helm errors.
kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas For 5 consecutive minutes, the number of replicas is less than the minimum replicas that should be available.


Pod CPU greater than 80% Critical Critical CPU load for pod {{ $labels.pod }}.

Actions:

  • Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Collect the service logs for pod {{ $labels.pod }}; raise an investigation ticket.
container_cpu_usage_seconds_total, container_spec_cpu_period Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes.


Pod CPU greater than 65% Warning High CPU load for pod {{ $labels.pod }}.

Actions:

  • Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Collect the service logs for pod {{ $labels.pod }}; raise an investigation ticket.
container_cpu_usage_seconds_total, container_spec_cpu_period Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes.


Pod memory greater than 80% Critical Critical memory usage for pod {{ $labels.pod }}.

Actions:

  • Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Restart the service for pod {{ $labels.pod }}.
container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes.


Pod memory greater than 65% Warning High memory usage for pod {{ $labels.pod }}.

Actions:

  • Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached.
  • Check Grafana for abnormal load.
  • Collect the service logs for pod {{ $labels.pod }}; raise an investigation ticket.
container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes.


Redis not available Critical Redis is not available for pod {{ $labels.pod }}.

Actions:

  • If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis.
  • If the alarm is triggered only for pod {{ $labels.pod }}, check if there is an issue with the pod.
redis_state Redis is not available for pod {{ $labels.pod }} for 5 consecutive minutes.


Too many Kafka producer errors Critical Kafka responds with errors at pod {{ $labels.pod }}.

Actions:

  • For pod {{ $labels.pod }}, ensure there are no issues with Kafka.
kafka_producer_error_total More than 100 errors for 5 consecutive minutes.
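
Because kafka_producer_error_total is a counter, a threshold such as "more than 100 errors" has to be evaluated on the counter's increase over a window rather than on its raw value. The following Python sketch shows a simplified increase calculation that tolerates counter resets (for example, after a pod restart). It is an illustration only: it does not reproduce the extrapolation that the Prometheus increase() function applies, and the assumption that the threshold refers to a 5-minute increase is not confirmed by this page.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples.

    A drop between two samples is treated as a counter reset, so the new
    value is counted in full.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def too_many_errors(samples, threshold=100):
    """True if the counter grew by more than the threshold in the window."""
    return counter_increase(samples) > threshold
```

Here samples is the list of counter values scraped during the window, oldest first.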


SIP Server main thread consuming more than 65% CPU for 5 mins Warning Main thread consumes too much CPU.

Actions:

  • Collect SIP Server Main thread logs; that is, log files without an index in the file name (appname_date.log files). Raise an investigation ticket.
sips_cpu_usage_main Main thread consumes too much CPU (more than 65% for 5 consecutive minutes).


Calls activity drop Warning The number of active calls on a specific SIP Server has dropped noticeably and no new calls are arriving for processing.

Actions:

  • If a problematic SIP Server is primary, do a switchover, and then restart the former primary server.
  • If a problematic SIP Server is backup, restart the backup server.
  • Collect SIP Server Main thread logs; that is, log files without an index in the file name (appname_date.log files). Raise an investigation ticket.
sips_calls, sips_calls_created The absolute value of active calls on a specific SIP Server dropped by more than 30 calls in 2 minutes and no new calls are arriving at the SIP Server for processing.
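
The Calls activity drop condition combines two metrics: sips_calls must fall by more than 30 within 2 minutes, and sips_calls_created must show no new calls over the same period. The following Python sketch expresses that combination for two samples taken 2 minutes apart. The function name and parameters are hypothetical, and the real alert rule may be evaluated differently.

```python
def calls_activity_dropped(calls_before, calls_now,
                           created_before, created_now,
                           drop_threshold=30):
    """True if active calls fell by more than the threshold and no new
    calls were created between the two samples."""
    dropped = (calls_before - calls_now) > drop_threshold
    no_new_calls = created_now == created_before
    return dropped and no_new_calls
```

A drop caused by normal call completion while new calls keep arriving does not satisfy the condition, because sips_calls_created would still be increasing.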


Dial Plan Node Down Critical No Dial Plan nodes are reachable from SIP Server and all connections to Dial Plan nodes are down.

Actions:

  • Check the network connection between SIP Server and the Dial Plan node host.
  • Check the Dial Plan node CPU and memory usage.
sips_dp_active_connections All connections to Dial Plan nodes are down.


Dialplan Node problem Warning The Dial Plan node rejects requests with an error, or it does not respond and the requests time out.

Actions:

  • Check the network connection between SIP Server and the Dial Plan host.
  • Check that Dial Plan nodes are running.
sips_dp_timeouts During 1 minute, the Dial Plan node rejects more than 5 requests with an error or more than 5 requests time out because the Dial Plan node fails to respond.


Routing timeout counter growth Warning The trigger detects that routing timeouts are increasing.

Actions:

  • Check the URS_RESPONSE_MORE5SEC stat value. If it's increasing, then investigate why URS doesn't respond to SIP Server in time.
  • Check SIPS-to-URS network connectivity.
sips_routing_timeouts The absolute value of NROUTINGTIMEOUTS on a specific SIP Server increased by more than 20 in 2 minutes.


SIP trunk is out of service Critical SIP trunk is out of service.

Actions:

  • For Primary and Secondary trunks:
    • Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
    • Troubleshoot the SBC.
  • For Inter-SIP Server trunks:
    • Troubleshoot the SIP Server-to-SIP Server network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
sips_trunk_in_service SIP trunk is out of service for more than 1 minute.


Media service is out of service Critical Media service is out of service.

Actions:

  • Troubleshoot the SIP Server-to-Resource Manager (RM) network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
  • Troubleshoot RM; consider restarting RM.
  • After 5 minutes, redirect traffic to another site.
sips_msml_in_service Media service is out of service for more than 1 minute.


SIP softswitch is out of service Critical Actions:
  • Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
  • Troubleshoot the SBC.
sips_softswitch_in_service SIP softswitch is out of service.


SIP Proxy is out of service Critical Actions:
  • Troubleshoot the SIP Server-to-SIP Proxy nodes network connections. Collect network stats and escalate to the Network team to resolve network issues, if necessary.
  • Troubleshoot SIP Proxy nodes.
sips_sipproxy_in_service SIP Proxy is out of service.