Voice SIP Cluster Service metrics and alerts
Find the metrics that Voice SIP Cluster Service exposes and the alerts defined for it.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
Voice SIP Cluster Service | Supports both CRD and annotations | 11300 | http://<pod-ipaddress>:11300/metrics | 30 seconds |
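As a rough illustration of what a scraper receives from this endpoint, the sketch below builds the scrape URL from the table above and parses Prometheus text exposition output into a dictionary. The helper names (`metrics_url`, `parse_metrics`) and the sample metric names (`example_requests_total`, `example_queue_depth`) are hypothetical and are not taken from the service. The parser handles only simple lines without timestamps or escaped label values; in practice Prometheus itself does this work.

```python
import re

# Matches lines such as: name{label="value"} 12.5   (labels are optional)
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def metrics_url(pod_ip: str, port: int = 11300) -> str:
    """Build the scrape URL for a given pod IP address (port from the table)."""
    return f"http://{pod_ip}:{port}/metrics"


def parse_metrics(text: str) -> dict:
    """Parse simple exposition-format lines into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


# Hypothetical payload for illustration; real metric names come from the service.
SAMPLE = """\
# HELP example_requests_total Illustrative counter.
# TYPE example_requests_total counter
example_requests_total{code="200"} 42
example_queue_depth 3
"""

if __name__ == "__main__":
    print(metrics_url("10.0.0.5"))
    print(parse_metrics(SAMPLE))
```

Because the metrics update interval is 30 seconds, scraping more often than that is unlikely to yield fresher values.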
See details about the metrics and alerts in the sections below.
Metrics
Voice SIP Cluster Service exposes Genesys-defined, SIP Cluster Service–specific metrics as well as some standard Kafka metrics. You can query Prometheus directly to see all the metrics that the SIP Cluster Service exposes. The following metrics are likely to be particularly useful. Genesys does not commit to maintaining other currently available SIP Cluster Service metrics that are not documented on this page.
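One way to list what the service exposes is an instant query against the Prometheus HTTP API (`/api/v1/query`). The sketch below only builds the request URL and extracts metric names from a response body; the Prometheus base URL, the helper names, and the sample response are assumptions made for illustration. The `sipnode_` prefix comes from the table on this page.

```python
import json
from urllib.parse import urlencode


def build_query_url(prometheus_base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def metric_names(response_body: str) -> list:
    """Return the sorted, de-duplicated metric names found in a query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    names = {item["metric"].get("__name__", "") for item in payload["data"]["result"]}
    return sorted(names)


# Selects every series whose name starts with the sipnode_ prefix.
PROMQL = '{__name__=~"sipnode_.*"}'

# Hypothetical response body; the metric names here are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "sipnode_example_b", "pod": "pod-0"}, "value": [0, "1"]},
            {"metric": {"__name__": "sipnode_example_a", "pod": "pod-0"}, "value": [0, "7"]},
            {"metric": {"__name__": "sipnode_example_a", "pod": "pod-1"}, "value": [0, "9"]},
        ],
    },
})

if __name__ == "__main__":
    # The base URL is an assumption; use the address of your Prometheus server.
    print(build_query_url("http://prometheus.example:9090", PROMQL))
    print(metric_names(SAMPLE_RESPONSE))
```

The same selector pattern works for the other prefixes listed on this page, such as `sips_`, `kafka_`, and `trunk_`.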
Metric and description | Metric details | Indicator of |
---|---|---|
http_ HTTP client time from request to response, measured in seconds. | Unit: seconds Type: histogram | Latency |
http_ Number of received HTTP client responses. | Unit: N/A Type: counter | Traffic |
kafka_ Number of Kafka producer pending events. | Unit: N/A Type: gauge | Traffic |
kafka_ Age of the oldest producer pending event, measured in seconds. | Unit: seconds Type: gauge | Traffic |
kafka_ Number of Kafka producer errors. | Unit: N/A Type: counter | Errors |
log_ Total amount of log output in bytes. | Unit: bytes Type: counter | Traffic |
sipnode_ Number of processed requests. | Unit: N/A Type: counter | Traffic |
sipnode_ Number of pending requests. | Unit: N/A Type: gauge | Traffic |
sipnode_ Number of postponed requests. | Unit: N/A Type: gauge | Saturation |
sipnode_ Duration of the request processed by SIP Cluster Service, measured in seconds. | Unit: seconds Type: histogram | Traffic |
sipnode_ Call events streamed to Redis. | Unit: N/A Type: counter | Traffic |
sipnode_ Number of HA writes to Redis. | Unit: N/A Type: counter | Traffic |
sipnode_ Number of HA reads from Redis. | Unit: N/A Type: counter | Traffic |
sipnode_ Number of monitoring events submitted to Kafka. | Unit: N/A Type: counter | Traffic |
sipnode_ Total number of restored calls from Redis cache. | Unit: N/A Type: counter | Traffic |
sipnode_ Total number of SIP Server restarts. | Unit: N/A Type: counter | Errors |
sipnode_ Total number of SIP Cluster Service disconnections from SIP Server. | Unit: N/A Type: counter | Errors |
sipnode_ Current Redis connection state. | Unit: N/A Type: gauge | Errors |
sipnode_ T-Library latency from Orchestration Service to SIP Cluster, measured in milliseconds. | Unit: milliseconds Type: histogram | Latency |
sipnode_ SIP Cluster Service to Orchestration Service health check. | Unit: N/A Type: gauge | Traffic |
service_ Displays the version of Voice SIP Cluster Service that is currently running. In the case of this metric, the labels provide the important information. The metric value is always 1 and does not provide any information. | Unit: N/A Type: gauge | |
sipnode_ Number of unsuccessful treatments. | Unit: N/A Type: counter | Errors |
sipnode_ Total number of default routed calls. | Unit: N/A Type: counter | Traffic |
sipnode_ Status of the Envoy proxy: -1 – error | Unit: N/A Type: gauge | Health |
sipnode_ Status of the config node connection: 0 – disconnected | Unit: N/A Type: gauge | Health |
sipnode_ Health level of the SIP node (SIP Cluster Service): -1 – fail | Unit: N/A Type: gauge | Traffic |
sipnode_ SIP Cluster Service to Call State Service health check. | Unit: N/A Type: gauge | Health |
sips_ Current HA state of SIP Server: 0 – Unknown | Unit: N/A Type: gauge | |
sips_ Current number of calls. | Unit: N/A Type: gauge | Traffic |
sips_ Call rate. | Unit: N/A Type: gauge | Traffic |
sips_ SIP Server CPU usage. | Unit: N/A Type: gauge | Saturation |
sips_ SIP Server main thread CPU usage. | Unit: N/A Type: gauge | Saturation |
sips_ CPU usage of the call manager thread. | Unit: N/A Type: gauge | Saturation |
sips_ Total number of created calls. | Unit: N/A Type: gauge | Traffic |
sips_ Total number of abandoned calls. | Unit: N/A Type: gauge | Errors |
sips_ Total number of rejected calls. | Unit: N/A Type: gauge | Errors |
sips_ Total number of created SIP dialogs. | Unit: N/A Type: gauge | Traffic |
sips_ Number of failed call recording sessions. | Unit: N/A Type: gauge | Errors |
sips_ Number of URS responses from 1 to 5 seconds. | Unit: N/A Type: gauge | Latency |
sips_ Number of URS responses more than 5 seconds. | Unit: N/A Type: gauge | Latency |
sips_ Number of UserData updates. | Unit: N/A Type: gauge | Traffic |
sips_ Number of routing timeouts. | Unit: N/A Type: gauge | Errors |
sips_ T-Requests rate. | Unit: N/A Type: gauge | Traffic |
sips_ TApplyTreatment requests rate. | Unit: N/A Type: gauge | Traffic |
sips_ UserData change rate. | Unit: N/A Type: gauge | Traffic |
sips_ Memory usage of the SIP Server process. | Unit: N/A Type: gauge | Saturation |
sips_ Number of successful SIP Server statistic fetches. | Unit: N/A Type: counter | Other |
sips_ SIP Server metric of response time, measured in milliseconds. | Unit: milliseconds Type: histogram | Latency |
sips_ Trunk devices that are in service. | Unit: N/A Type: gauge | Traffic |
sips_ Number of created calls per trunk. | Unit: N/A Type: gauge | Traffic |
sips_ Number of trunks that are out of service. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 4xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 5xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 6xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Softswitch devices that are in service. | Unit: N/A Type: gauge | Traffic |
sips_ Number of created calls per softswitch device. | Unit: N/A Type: gauge | Traffic |
sips_ Number of softswitch devices that are out of service. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 4xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 5xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 6xx messages. | Unit: N/A Type: gauge | Errors |
sips_ MSML devices that are in service. | Unit: N/A Type: gauge | Traffic |
sips_ Number of created calls per MSML device. | Unit: N/A Type: gauge | Traffic |
sips_ Number of MSML devices that are out of service. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 4xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 5xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Number of received 6xx messages. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service state: 0 – Out-Of-Service | Unit: N/A Type: gauge | Traffic |
sips_ Size of the request queue to Dial Plan Service. | Unit: N/A Type: gauge | Traffic |
sips_ Average queue time (msec) of requests to Dial Plan Service. | Unit: milliseconds Type: gauge | Latency |
sips_ Number of connections to Dial Plan Service per URL. | Unit: N/A Type: gauge | Traffic |
sips_ Number of active connections to Dial Plan Service. | Unit: N/A Type: gauge | Traffic |
sips_ Request rate to Dial Plan Service. | Unit: N/A Type: gauge | Traffic |
sips_ Dial Plan Service 400 type of errors. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service 404 type of errors. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service 4xx type of errors. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service 500 type of errors. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service 501 type of errors. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service 5xx type of errors. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service timeouts. | Unit: N/A Type: gauge | Errors |
sips_ Dial Plan Service average response latency. | Unit: Type: gauge | Latency |
sips_ SIP Proxy Service state: 0 – Out-Of-Service | Unit: N/A Type: gauge | Traffic |
trunk_ Number of trunks synchronized with SIP Server. | Unit: N/A Type: gauge | |
trunk_ Number of trunks obtained from the config node. | Unit: N/A Type: gauge | |
trunk_ Number of failed attempts to read from the config node. | Unit: N/A Type: counter | |
trunk_ Number of trunks with the T-Library connection. | Unit: N/A Type: gauge | |
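The Type column determines how a value should be read: a gauge is meaningful as sampled, a counter is meaningful only as a change over time, and a histogram exposes `_sum` and `_count` series whose deltas give an average. The sketch below illustrates the last two cases on plain lists of numbers. The function names are hypothetical, and the reset handling is a simplified approximation of what Prometheus does in `increase()` (no extrapolation).

```python
from typing import Optional, Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across time-ordered counter samples, tolerating resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # A drop means the process restarted and the counter began again at zero.
            total += curr
    return total


def histogram_mean(sum_delta: float, count_delta: float) -> Optional[float]:
    """Average observation over a window, from the deltas of _sum and _count."""
    if count_delta <= 0:
        return None  # no observations in the window
    return sum_delta / count_delta


if __name__ == "__main__":
    # Counter samples with one reset between 15 and 3.
    print(counter_increase([10, 15, 3, 8]))
    # 1.5 seconds spent across 3 observations.
    print(histogram_mean(1.5, 3))
```

For the histograms on this page, the unit of the resulting average follows the Unit column: seconds for some metrics and milliseconds for others.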
Alerts
The following alerts are defined for Voice SIP Cluster Service.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
Too many Kafka pending events | Critical | Too many Kafka producer pending events for pod {{ $labels.pod }}. | kafka_producer_queue_depth | Too many Kafka producer pending events for service {{ $labels.container }} (more than 100 in 5 minutes). |
Dial Plan node is overloaded | Critical | Dial Plan node is overloaded as the response latency increases. | sips_dp_average_response_latency | Dial Plan node is overloaded as the response latency increases (more than 1000). |
Dial Plan Queue Increase | Critical | The processing queue grows because Dial Plan requests are very large or there is a connection issue with the Dial Plan node. | sips_dp_queue_size | The processing queue size is greater than 10 requests for 1 minute. |
SIP Proxy overloaded | Critical | SIP Proxy is overloaded. | sips_sip_response_time_ms_sum, sips_sip_response_time_ms_count | Response time is greater than 20 milliseconds for 1 minute. |
SIP Node HealthCheck Fail | Critical | SIP Node health level fails for pod {{ $labels.pod }}. | sipnode_health_level | SIP Node health level fails for pod {{ $labels.pod }} for 5 minutes. |
Kafka not available | Critical | Kafka is not available for pod {{ $labels.pod }}. | kafka_producer_state | Kafka is not available for pod {{ $labels.pod }} for 5 minutes. |
Pod Status Error | Warning | | kube_pod_status_phase | Pod {{ $labels.pod }} is in Failed, Unknown, or Pending state. |
Pod Status NotReady | Warning | Pod {{ $labels.pod }} is in NotReady state. | kube_pod_status_ready | Pod {{ $labels.pod }} is in NotReady state for 5 minutes. |
Container Restarted Repeatedly | Critical | Container {{ $labels.container }} was repeatedly restarted. | kube_pod_container_status_restarts_total | Container {{ $labels.container }} was restarted 5 or more times within 15 minutes. |
Ready Pods below 60% | Critical | The number of statefulset {{ $labels.statefulset }} pods in the Ready state has dropped below 60%. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas_current | For the last 5 minutes, fewer than 60% of the currently available statefulset {{ $labels.statefulset }} pods have been in the Ready state. |
Pods scaled up greater than 80% | Critical | The current number of replicas is more than 80% of the maximum number of replicas. | kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas | For 5 consecutive minutes, the number of replicas is more than 80% of the maximum number of replicas. |
Pods less than Min Replicas | Critical | The current number of replicas is less than the minimum replicas that should be available. This might be because Kubernetes cannot deploy a new pod or pods are failing to become active or ready. | kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas | For 5 consecutive minutes, the number of replicas is less than the minimum replicas that should be available. |
Pod CPU greater than 80% | Critical | Critical CPU load for pod {{ $labels.pod }}. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container {{ $labels.container }} CPU usage exceeded 80% for 5 minutes. |
Pod CPU greater than 65% | Warning | High CPU load for pod {{ $labels.pod }}. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container {{ $labels.container }} CPU usage exceeded 65% for 5 minutes. |
Pod memory greater than 80% | Critical | Critical memory usage for pod {{ $labels.pod }}. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes. |
Pod memory greater than 65% | Warning | High memory usage for pod {{ $labels.pod }}. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container {{ $labels.container }} memory usage exceeded 65% for 5 minutes. |
Redis not available | Critical | Redis is not available for pod {{ $labels.pod }}. | redis_state | Redis is not available for pod {{ $labels.pod }} for 5 consecutive minutes. |
Too many Kafka producer errors | Critical | Kafka responds with errors at pod {{ $labels.pod }}. | kafka_producer_error_total | More than 100 errors for 5 consecutive minutes. |
SIP Server main thread consuming more than 65% CPU for 5 mins | Warning | Main thread consumes too much CPU. | sips_cpu_usage_main | Main thread consumes too much CPU (more than 65% for 5 consecutive minutes). |
Calls activity drop | Warning | The number of active calls on a specific SIP Server has dropped noticeably and no new calls are arriving for processing. | sips_calls, sips_calls_created | The absolute value of active calls on a specific SIP Server dropped by more than 30 calls in 2 minutes and no new calls are arriving at the SIP Server for processing. |
Dial Plan Node Down | Critical | No Dial Plan nodes are reachable from SIP Server and all connections to Dial Plan nodes are down. | sips_dp_active_connections | All connections to Dial Plan nodes are down. |
Dialplan Node problem | Warning | Dial Plan node rejects requests with an error, or it does not respond and the requests time out. | sips_dp_timeouts | During 1 minute, the Dial Plan node rejects more than 5 requests with an error or more than 5 requests time out because the Dial Plan node fails to respond. |
Routing timeout counter growth | Warning | The trigger detects that routing timeouts are increasing. | sips_routing_timeouts | The absolute value of NROUTINGTIMEOUTS on a specific SIP Server increased by more than 20 in 2 minutes. |
SIP trunk is out of service | Critical | SIP trunk is out of service. | sips_trunk_in_service | SIP trunk is out of service for more than 1 minute. |
Media service is out of service | Critical | Media service is out of service. | sips_msml_in_service | Media service is out of service for more than 1 minute. |
SIP softswitch is out of service | Critical | | sips_softswitch_in_service | SIP softswitch is out of service. |
SIP Proxy is out of service | Critical | | sips_sipproxy_in_service | SIP Proxy is out of service. |
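Many of the thresholds above combine a condition with a hold time, such as "more than 80% of the maximum number of replicas for 5 consecutive minutes". The sketch below shows one way to reason about such a rule on a list of timestamped samples. It is not the actual alert rule, which this page does not show; the function names and sample data are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def replica_ratio(current_replicas: float, max_replicas: float) -> float:
    """Fraction of the maximum replica count currently in use."""
    return current_replicas / max_replicas


def condition_held(
    samples: Sequence[Tuple[float, float]],
    predicate: Callable[[float], bool],
    hold_seconds: float,
) -> bool:
    """True if predicate has held without interruption for hold_seconds.

    samples are (unix_timestamp, value) pairs ordered by time; the check is
    made as of the latest sample.
    """
    if not samples:
        return False
    pending_since = None
    for ts, value in samples:
        if predicate(value):
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None  # any false sample restarts the clock
    latest_ts = samples[-1][0]
    return pending_since is not None and latest_ts - pending_since >= hold_seconds


if __name__ == "__main__":
    # One sample per minute; the ratio exceeds 0.8 from t=60 onward.
    sustained = [(0, 0.70), (60, 0.85), (120, 0.90), (180, 0.90),
                 (240, 0.85), (300, 0.90), (360, 0.95)]
    # Same window, but a dip at t=120 restarts the hold period.
    interrupted = [(0, 0.90), (60, 0.90), (120, 0.50), (180, 0.90),
                   (240, 0.90), (300, 0.90), (360, 0.90)]
    over_80 = lambda ratio: ratio > 0.8
    print(condition_held(sustained, over_80, 300))
    print(condition_held(interrupted, over_80, 300))
```

The restart-on-false behavior is why a brief recovery clears a pending alert even when the metric has been above the threshold for most of the window.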