Cargo query
Page | Alert | Severity | AlertDescription | BasedOn | Threshold |
---|---|---|---|---|---|
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_FETCH_RESOURCE_TIMEOUT | MEDIUM | Number of VXMLi fetch timeouts exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | 1min |
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_PARSE_ERROR | WARNING | Number of VXMLi parse errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} | 1min |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerCPUreached80percent | HIGH | The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerMemoryUsage80percent | HIGH | The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins | container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RS or RS SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins | kube_pod_init_container_status_restarts_total | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RS pod status is Not ready for 30 mins; this is controlled through the override-value.yaml file. | kube_pod_status_ready | 30mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC50PercentFilled | HIGH | This trigger will flag an alarm when the RS PVC size is 50% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC80PercentFilled | CRITICAL | This trigger will flag an alarm when the RS PVC size is 80% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 5mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | RSQueueSizeCritical | HIGH | The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins | rsQueueSize | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RM or RM SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. | kube_pod_init_container_status_restarts_total | 15 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | MCPPortsExceeded | HIGH | All the MCP ports in the MCP LRG have been exhausted | gvp_rm_log_parser_eror_total | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RM pod status is Not ready for 30 mins; this is controlled by override-value.yaml. | kube_pod_status_ready | 30mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RM Service Down | CRITICAL | RM pods are not in the Ready state and the RM service is not available | kube_pod_container_status_running | 0 |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMConfigServerConnectionLost | HIGH | RM lost connection to GVP Configuration Server for 5mins. | gvp_rm_log_parser_warn_total | 5 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMInterNodeConnectivityBroken | HIGH | Inter-node connectivity between RM nodes is lost for 5mins. | gvp_rm_log_parser_warn_total | 5 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMMatchingIVRTenantNotFound | MEDIUM | Matching IVR profile tenant could not be found for 2mins | gvp_rm_log_parser_eror_total | 2mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMResourceAllocationFailed | MEDIUM | RM resource allocation failed for 1 min | gvp_rm_log_parser_eror_total | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMServiceDegradedTo50Percentage | HIGH | One of the RM containers is not in a running state for 5 mins | kube_pod_container_status_running | 5mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMSocketInterNodeError | HIGH | RM Inter node Socket Error for 5mins. | gvp_rm_log_parser_eror_total | 5mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal4XXErrorForINVITE | MEDIUM | The RM MIB counter stats are collected every 60 seconds; if the counter total4xxInviteSent increments from its previous value by 10 within 60 seconds, the trigger flags an alarm. | rmTotal4xxInviteSent | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal5XXErrorForINVITE | HIGH | The RM MIB counter stats are collected every 30 seconds; if the counter total5xxInviteSent increments from its previous value by 5 within 5 minutes, the trigger flags an alarm. | rmTotal5xxInviteSent | 5 mins |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | CPUThrottling | Critical | Containers are being throttled more than 1 time per second. | container_cpu_cfs_throttled_periods_total | 1 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_500_responces_java | Critical | Too many 500 responses. | gws_responses_total | 10 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_5xx_responces_count | Critical | Too many 5xx responses. | gws_responses_total | 60 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_cpu_usage | Warning | High container CPU usage. | container_cpu_usage_seconds_total | 300% |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_jvm_gc_pause_seconds_count | Critical | JVM garbage collection occurs too often. | jvm_gc_pause_seconds_count | 10 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_jvm_threads_deadlocked | Critical | Deadlocked JVM threads exist. | jvm_threads_deadlocked | 0 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | netstat_Tcp_RetransSegs | Warning | High number of TCP RetransSegs (retransmitted segments). | node_netstat_Tcp_RetransSegs | 2000 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | total_count_of_errors_during_context_initialization | Warning | Total count of errors during context initialization. | gws_context_error_total | 1200 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | total_count_of_errors_in_PSDK_connections | Warning | Total count of errors in PSDK connections. | psdk_conn_error_total | 3 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | DesiredPodsDontMatchSpec | Critical | The Workspace Service deployment doesn't have the desired number of replicas. | kube_deployment_status_replicas_available, kube_deployment_spec_replicas | Fired when the number of available replicas does not equal the configured number. |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_app_workspace_incoming_requests | Critical | High rate of incoming requests from Workspace Web Edition. | gws_app_workspace_incoming_requests | 10 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_500_responces_workspace | Critical | The Workspace Service has too many 500 responses. | gws_app_workspace_requests | 10 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_cpu_usage | Warning | High container CPU usage. | container_cpu_usage_seconds_total | 300% |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_nodejs_eventloop_lag_seconds | Critical | The Node.js event loop is too slow. | nodejs_eventloop_lag_seconds | 0.2 |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES-NODE-JS-DELAY-WARNING | Warning | Triggers if the NodeJS event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. | application_ccecp_nodejs_eventloop_lag_seconds | Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_ENQUEUE_LIMIT_REACHED | Info | GES is throttling callbacks to a given phone number. | CB_ENQUEUE_LIMIT_REACHED | Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_SUBMIT_FAILED | Info | GES has failed to submit a callback to ORS. | CB_SUBMIT_FAILED | Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_TTL_LIMIT_REACHED | Info | GES is throttling callbacks for a specific tenant. | CB_TTL_LIMIT_REACHED | Triggered when GES has started throttling callbacks within the past 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CPU_USAGE | Info | GES has high CPU usage for 1 minute. | ges_process_cpu_seconds_total | Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_DNS_FAILURE | Warning | A GES pod has encountered difficulty resolving DNS requests. | DNS_FAILURE | Triggered when GES encounters any DNS failures within the last 30 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_AUTH_DOWN | Warning | Connection to the Genesys Authentication Service is down. | GWS_AUTH_STATUS | Triggered when the connection to the Genesys Authentication Service is down for 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_CONFIG_DOWN | Warning | Connection to the GWS Configuration Service is down. | GWS_CONFIG_STATUS | Triggered when the connection to the GWS Configuration Service is down. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_ENVIRONMENT_DOWN | Warning | Connection to the GWS Environment Service is down. | GWS_ENV_STATUS | Triggered when the connection to the GWS Environment Service is down. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_INCORRECT_CLIENT_CREDENTIALS | Warning | The GWS client credentials provided to GES are incorrect. | GWS_INCORRECT_CLIENT_CREDENTIALS | Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_SERVER_ERROR | Warning | GES has encountered server or connection errors with GWS. | GWS_SERVER_ERROR | Triggered when there has been a GWS server error in the past 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HEALTH | Critical | One or more downstream components (PostGres, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this does not fire when Redis is down. | GES_HEALTH | Triggered when any component is down for any length of time. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_400_POD | Info | An individual GES pod is returning excessive HTTP 400 results. | ges_http_failed_requests_total, http_400_tolerance | Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_401_POD | Info | An individual GES pod is returning excessive HTTP 401 results. | ges_http_failed_requests_total, http_401_tolerance | Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_404_POD | Info | An individual GES pod is returning excessive HTTP 404 results. | ges_http_failed_requests_total, http_404_tolerance | Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_500_POD | Info | An individual GES pod is returning excessive HTTP 500 results. | ges_http_failed_requests_total, http_500_tolerance | Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_INVALID_CONTENT_LENGTH | Info | Fires if GES encounters any incoming requests that exceed the maximum content length of 10 MB on the internal port or 500 KB on the external, public-facing port. | INVALID_CONTENT_LENGTH, invalid_content_length_tolerance | Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_LOGGING_FAILURE | Warning | GES has failed to write a message to the log. | LOGGING_FAILURE | Triggered when there are any failures writing to the logs. Silenced after 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_MEMORY_USAGE | Info | GES has high memory usage for a period of 90 seconds. | ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes | Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NEXUS_ACCESS_FAILURE | Warning | GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. | NEXUS_ACCESS_FAILURE | Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_CRITICAL | Critical | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_WARNING | Warning | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_ORS_REDIS_DOWN | Critical | Connection to ORS_REDIS is down. | ORS_REDIS_STATUS | Triggered when the ORS_REDIS connection is down for 5 consecutive minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_PODS_RESTART | Critical | GES pods have been excessively crashing and restarting. | kube_pod_container_status_restarts_total | Triggered when there have been more than five pod restarts in the past 15 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_RBAC_CREATE_VQ_PROXY_ERROR | Info | Fires if there are issues with GES managing VQ Proxy Objects. | RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance | Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_SLOW_HTTP_RESPONSE_TIME | Warning | Fired if the average response time for incoming requests begins to lag. | ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count | Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UNCAUGHT_EXCEPTION | Warning | There has been an uncaught exception within GES. | UNCAUGHT_EXCEPTION | Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UP | Critical | Fires when fewer than two GES pods have been up for the last 15 minutes. | | Triggered when fewer than two GES pods are up for 15 consecutive minutes. |
Draft:PEC-DC/Current/DCPEGuide/DCMetrics | Memory usage is above 3000 Mb | Critical | Triggered when the memory usage on this pod is above 3000 Mb for 15 minutes. | nexus_process_resident_memory_bytes | For 15 minutes |
Draft:PEC-DC/Current/DCPEGuide/DCMetrics | Nexus error rate | Critical | Triggered when the error rate on this pod is greater than 20% for 15 minutes. | nexus_errors_total, nexus_request_total | For 15 minutes |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Database connections above 75 | HIGH | Triggered when the number of pod database connections is above 75. | Default number of connections: 75 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD DB errors | CRITICAL | Triggered when IWD experiences more than 2 errors within 1 minute during database operations. | Default number of errors: 2 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD error rate | CRITICAL | Triggered when the number of errors in IWD exceeds the threshold over a 15-minute period. | Default number of errors: 2 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Memory usage is above 3000 Mb | CRITICAL | Triggered when the pod memory usage is above 3000 MB. | Default memory usage: 3000 MB | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | 2500ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-Redis-Connection-Failed | HIGH | Triggered when the connection to redis fails for more than 1 minute. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-EXT-Ingress-Error-Rate | HIGH | Triggered when the Ingress error rate is above the specified threshold. | 20% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | cxc_api_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service for more than the specified time threshold. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CM-Redis-Connection-Failed | HIGH | Triggered when the connection to redis fails for more than 1 minute. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CoM-Redis-no-active-connections | HIGH | Triggered when CX Contact compliance has no active redis connection for 2 minutes | 2m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-Compliance-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-DM-LatencyHigh | HIGH | Triggered when the latency for dial manager is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-JS-LatencyHigh | HIGH | Triggered when the latency for job scheduler is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-LB-LatencyHigh | HIGH | Triggered when the latency for list builder is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-LM-LatencyHigh | HIGH | Triggered when the latency for list manager is above the defined threshold | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | cxc_list_manager_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service (list manager) for more than the specified time threshold. | 1m | |
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics | gcxi__cluster__info | | This alert indicates problems with the cluster states. Applicable only if you have two or more nodes in a cluster. | gcxi__cluster__info | |
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics | gcxi__projects__status | | If the value of gcxi__projects__status is greater than 0, this alarm is set, indicating that reporting is not functioning properly. | gcxi__projects__status | > 0 |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-errors | '''Specified by''': raa. '''Recommended value''': warning | A nonzero value indicates that errors have been logged during the scrape interval. | gcxi_raa_error_count | >0 |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-health | '''Specified by''': raa. '''Recommended value''': severe | A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | gcxi_raa_health_level | '''Specified by''': raa. '''Recommended value''': 30m |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-long-aggregation | '''Specified by''': raa. '''Recommended value''': warning | Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. | gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count | Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. '''Recommended value''': 300 |
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics | GcaOOMKilled | Critical | Triggered when a GCA pod is restarted because of OOMKilled. | kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason | 1 |
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics | GcaPodCrashLooping | Critical | Triggered when a GCA pod is crash looping. | kube_pod_container_status_restarts_total | The restart rate is greater than 0 for 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspFlinkJobDown | Critical | Triggered when the GSP Flink job is not running (the number of running jobs equals 0 or the metric is not available) | flink_jobmanager_numRunningJobs | For 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspNoTmRegistered | Critical | Triggered when there are no registered TaskManagers (or metric not available) | flink_jobmanager_numRegisteredTaskManagers | For 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspOOMKilled | Critical | Triggered when a GSP pod is restarted because of OOMKilled | kube_pod_container_status_restarts_total | 0 |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspUnknownPerson | High | Triggered when GSP encounters unknown person(s) | flink_ | For 5 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_configservers | Critical | Pulse DCU Collector is not connected to ConfigServer. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_dbservers | Critical | Pulse DCU Collector is not connected to DbServer. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_statservers | Critical | Pulse DCU Collector is not connected to Stat Server. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_snapshot_writing | Critical | Pulse DCU Collector does not write snapshots. | pulse_collector_snapshot_writing_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_cpu | Critical | Detected critical CPU usage by Pulse DCU Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_disk | Critical | Detected critical disk usage by Pulse DCU Pod. | kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_memory | Critical | Detected critical memory usage by Pulse DCU Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_nonrunning_instances | Critical | Triggered when Pulse DCU instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_configservers | Critical | Pulse DCU Stat Server is not connected to ConfigServer. | pulse_statserver_server_connected_seconds | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_ixnservers | Critical | Pulse DCU Stat Server is not connected to IxnServers. | pulse_statserver_server_connected_seconds | 2 |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_tservers | Critical | Pulse DCU Stat Server is not connected to T-Servers. | pulse_statserver_server_connected_number | 2 |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_failed_dn_registrations | Critical | Detected critical DN registration failures on Pulse DCU Stat Server. | pulse_statserver_dn_failed, pulse_statserver_dn_registered | 0.5% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_monitor_data_unavailable | Critical | Pulse DCU Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_too_frequent_restarts | Critical | Detected too frequent restarts of DCU Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_cpu | Critical | Detected critical CPU usage by Pulse LDS Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_memory | Critical | Detected critical memory usage by Pulse LDS Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_nonrunning_instances | Critical | Triggered when Pulse LDS instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_monitor_data_unavailable | Critical | Pulse LDS Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_no_connected_senders | Critical | Pulse LDS is not connected to upstream servers. | pulse_lds_senders_number | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_no_registered_dns | Critical | No DNs are registered on Pulse LDS. | pulse_lds_sender_registered_dns_number | for 30 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_too_frequent_restarts | Critical | Detected too frequent restarts of LDS Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_5xx | Critical | Detected critical 5xx errors per second for Pulse container. | http_server_requests_seconds_count | 15% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_cpu | Critical | Detected critical CPU usage by Pulse Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_hikari_cp | Critical | Detected critical Hikari connections pool usage by Pulse container. | hikaricp_connections_active, hikaricp_connections | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_memory | Critical | Detected critical memory usage by Pulse Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_pulse_health | Critical | Detected critical number of healthy Pulse containers. | pulse_health_all_Boolean | 50% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_running_instances | Critical | Triggered when Pulse instances are down. | kube_deployment_status_replicas_available, kube_deployment_status_replicas | 75% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_service_down | Critical | All Pulse instances are down. | up | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_too_frequent_restarts | Critical | Detected too frequent restarts of Pulse Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_cpu | Critical | Detected critical CPU usage by Pulse Permissions Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_memory | Critical | Detected critical memory usage by Pulse Permissions Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_running_instances | Critical | Triggered when Pulse Permissions instances are down. | kube_deployment_status_replicas_available, kube_deployment_status_replicas | 75% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_too_frequent_restarts | Critical | Detected too frequent restarts of Permissions Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:STRMS/Current/STRMSPEGuide/ServiceMetrics | streams_GWS_AUTH_DOWN | critical | Unable to connect to GWS auth service | gws_auth_down | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_BATCH_LAG_TIME | warning | Message handling exceeds 2 secs | | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_DOWN | critical | The number of running instances is 0 | sum(up) < 1 | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_ENDPOINT_CONNECTION_DOWN | warning | Unable to connect to a customer endpoint | endpoint_connection_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_ENGAGE_KAFKA_CONNECTION_DOWN | critical | Unable to connect to Engage Kafka | engage_kafka_main_connection_down | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_AUTH_DOWN | Critical | Unable to connect to GWS auth service | gws_auth_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_CONFIG_DOWN | critical | Unable to connect to GWS config service | gws_config_down | |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_ENV_DOWN | critical | Unable to connect to GWS environment service | gws_env_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_INIT_ERROR | critical | Aborted due to an initialization error (for example, KAFKA_FQDN is not defined) | application_streams_init_error > 0 | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_REDIS_DOWN | critical | Unable to connect to Redis | redis_connection_down | 10 seconds |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Http Errors Occurrences Exceeded Threshold | High | Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes | telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} | >500 in 5 minutes |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry CPU Utilization is Greater Than Threshold | High | Triggered when average CPU usage is more than 60% | node_cpu_seconds_total | >60% |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Dependency Status | Low | Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus | telemetry_dependency_status | <80 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry GAuth Time Alert | High | Triggered when there is no connection to the GAuth service | telemetry_gws_auth_req_time | >10000 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Healthy Pod Count Alert | High | Triggered when the number of healthy pods drops to critical level | kube_pod_container_status_ready | <2 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry High Network Traffic | High | Triggered when network traffic exceeds 10MB/second for 5 minutes | node_network_transmit_bytes_total, node_network_receive_bytes_total | >10MBps |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Memory Usage is Greater Than Threshold | High | Triggered when average memory usage is more than 60% | container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores | >60% |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_elasticsearch_health_status | critical | Triggered when there is no connection to ElasticSearch | ucsx_elasticsearch_health_status | 2 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_elasticsearch_slow_processing_time | critical | Triggered when Elasticsearch internal processing time > 500 ms | ucsx_elastic_search_sum, ucsx_elastic_search_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_cpu_utilization | warning | Triggered when average CPU usage is more than 80% | ucsx_performance | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_http_request_rate | warning | Triggered when the request rate is more than 120 requests per second on one UCS-X instance | ucsx_http_request_duration_count | 30 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_memory_usage | warning | Triggered when average memory usage is more than 800 MB | ucsx_memory | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_overloaded | warning | Triggered when overload protection rate is more than 0 | ucsx_overload_protection_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_slow_http_response | critical | Triggered when average http response time > 500 ms | ucsx_http_request_duration_sum, ucsx_http_request_duration_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_masterdb_health_status | warning | Triggered when there is no connection to master DB | ucsx_masterdb_health_status | 2 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_tenantdb_health_status | critical | Triggered when there is no connection to tenant DB | ucsx_tenantdb_health_status | 2 minutes |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Agent service fail | Critical | Actions: *Check if there is any problem with the pod, then restart the pod. | agent_health_level | Agent health level is Fail for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Config node fail | Warning | Actions: *Check if there is any problem with the pod and the config node. | http_client_response_count | Requests to the config node fail for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Container restarted repeatedly | Critical | Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Kafka events latency is too high | Warning | Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to that topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Kafka not available | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Failed | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status NotReady | Critical | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady status for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Pending | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Unknown | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Possible messages lost | Critical | Actions: *Check for Kafka or service overload and network degradation. | kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total | The number of sent requests is two times higher than the number received for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Redis not available | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | agent_redis_state, agent_stream_redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for the service. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka pending events | Critical | Actions: *Ensure there are no issues with Kafka or with the pod's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the pod (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Container restarted repeatedly | Critical | Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Kafka events latency is too high | Critical | Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to that topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Kafka not available | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Failed | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status NotReady | Critical | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady status for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Pending | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Unknown | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Redis not available | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | callthread_redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka pending events | Critical | Actions: *Ensure there are no issues with Kafka or with the service's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the service (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Container restarted repeatedly | Critical | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Failed | Warning | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod failed. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Not ready for 10 minutes | Critical | Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in NotReady state for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Pending state | Warning | Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for one pod, check the health of that pod. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Unknown state | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for one pod, check to see whether the image is correct and if the container is starting up. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Redis disconnected for 10 minutes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, then restart Redis. *If the alarm is triggered only for one pod, check to see if there is an issue with that pod. | redis_state | Redis is not available for the pod for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Redis disconnected for 5 minutes | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, then restart Redis. *If the alarm is triggered only for one pod, check to see if there is an issue with that pod. | redis_state | Redis is not available for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Aggregated service health failing for 5 minutes | Critical | Actions: *Check the dialplan dashboard for Aggregated Service Health errors and, in case of a Redis error, first check for any issues/crashes in the pod and then restart Redis. *In the case of an Envoy error, the dialplan container will be restarted by the liveness probe. If the issue still exists, collect the service logs and raise an investigation ticket. | dialplan_health_level | Dependent services or the Envoy sidecar is not available for 5 minutes in the pod. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | DialPlan processing time > 0.5 seconds | Warning | Actions: *If the alarm is generated for all dialplan pods, then Redis or network delay might be the most probable cause. *If the alarm is generated in a single dialplan pod, then it might be due to Envoy or a network issue. | dialplan_response_time | This warning alarm is raised for the pod when the latency for 95% of the dial plan messages is more than 0.5 seconds for a duration of 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | DialPlan processing time > 2 seconds | Critical | Actions: *If the alarm is generated for all dialplan pods, then Redis or network delay might be the most probable cause. *If the alarm is generated in a single dialplan pod, then it might be due to Envoy or a network issue. | dialplan_response_time | This critical alarm is raised for the pod when the latency for 95% of the dial plan messages is more than 2 seconds for a duration of 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod Failed | Warning | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod failed. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_limits | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_limits | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod Not ready for 10 minutes | Critical | Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in the NotReady state for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod Pending state | Warning | Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for one pod, check the health of that pod. | kube_pod_status_phase | The pod is in the Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Pod Unknown state | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for one pod, check whether the image is correct and if the container is starting up. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Redis disconnected for 10 minutes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis and then restart Redis. *If the alarm is triggered only for one pod, check to see if there is an issue with that pod. | redis_state | Redis is not available for the pod for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics | Redis disconnected for 5 minutes | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis and then restart Redis. *If the alarm is triggered only for one pod, check to see if there is an issue with that pod. | redis_state | Redis is not available for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Container restarted repeatedly | Critical | The container was restarted 5 or more times within 15 minutes. Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Kafka not available | Critical | Kafka is not available for the pod. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | kafka_producer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | For the past 5 minutes, the desired number of replicas is higher than the number of replicas currently available. Actions: *Check resources available for Kubernetes. Increase resources, if necessary. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas has been higher than the number of currently available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | No requests received | Critical | No requests have been received by the pod. Actions: *Make sure there are no issues with Orchestration Service and Tenant Service or the network to them. | sipfe_requests_total | increase(sipfe_requests_total{pod=~"sipfe-.+"}[5m]) <= 0 and increase(sipfe_requests_total{pod=~"sipfe-.+"}[10m]) > 100 |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod status Failed | Warning | The pod is in Failed state. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod status NotReady | Critical | The pod is in the NotReady state for 5 minutes. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in the NotReady state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod status Pending | Warning | The pod is in Pending state for 5 minutes. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pod status Unknown | Warning | The pod is in Unknown state for 5 minutes. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pods less than Min Replicas | Critical | The current number of replicas is lower than the minimum number of replicas that should be available. Actions: *Check if Kubernetes cannot deploy new pods or if pods are failing to become active/ready. | kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas | For the past 5 minutes, the current number of replicas is lower than the minimum number of replicas that should be available. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Pods scaled up greater than 80% | Critical | For the past 5 minutes, the desired number of replicas is greater than the number of replicas currently available. Actions: *Check resources available for Kubernetes. Increase resources, if necessary. | kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas | (kube_hpa_status_current_replicas{namespace="voice",hpa="sipfe-node-hpa"} * 100) / kube_hpa_spec_max_replicas{namespace="voice",hpa="sipfe-node-hpa"} > 80 for: 5m |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | SIP Cluster Service response latency is too high | Critical | Actions: *If the alarm is triggered for multiple pods, make sure there are no issues with the SIP Cluster Service (CPU, memory, or network overload). *If the alarm is triggered only for one pod, check if there is an issue with that pod (CPU, memory, or network overload). | sipfe_sip_node_request_duration_seconds_bucket | Latency for 95% of messages is more than 0.5 seconds for the service. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | SIP Node(s) is not available | Critical | No available SIP Nodes for the pod. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with SIP Nodes, and then restart SIP Nodes. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | sipfe_sip_nodes_total | No available SIP Nodes for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Too many failure responses sent | Critical | Too many failure responses are sent by the Front End service at the pod. Actions: *Make sure the received requests are valid. | sipfe_responses_total | More than 100 failure responses in 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Too many Kafka pending producer events | Critical | Actions: *Make sure there are no issues with Kafka or with the pod's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the pod (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Too many Kafka producer errors | Critical | Kafka responds with errors at the pod. Actions: *Make sure there are no issues with Kafka. | kafka_producer_error_total | More than 100 errors in 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Too many received requests without a response | Critical | Actions: *Collect the service logs for the pod; raise an investigation ticket. *Restart the service. | sipfe_requests_total | For too many requests, the Front End service at the pod did not send any response (more than 100 requests without a response, measured over 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics | Too many SIP Cluster Service error responses | Critical | SIP Cluster Service responds with errors at the pod. Actions: *If the alarm is triggered for multiple pods, make sure there are no issues with the SIP Cluster Service (CPU, memory, or network overload). *If the alarm is triggered only for one pod, check if there is an issue with that pod. | sipfe_sip_node_responses_total | More than 100 errors in 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Container restarted repeatedly | Critical | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Number of running strategies is critical | Critical | Too many active sessions. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check the number of voice, digital, and callback calls in the system. | orsnode_strategies | More than 600 strategies running for 5 consecutive seconds. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Number of running strategies is too high | Warning | Too many active sessions. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check the number of voice, digital, and callback calls in the system. | orsnode_strategies | More than 400 strategies running for 5 consecutive seconds. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod in Pending state | Warning | The pod is in Pending state. Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for one pod, check the health of that pod. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod in Unknown state | Warning | The pod is in Unknown state. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for one pod, check whether the image is correct and if the container is starting up. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod Not ready for 10 minutes | Critical | The pod is in the NotReady state. Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in the NotReady state for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Pod status Failed | Warning | The pod failed. Actions: *One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Redis disconnected for 10 minutes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | redis_state | Redis is not available for the pod for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics | Redis disconnected for 5 minutes | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | redis_state | Redis is not available for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Container restarted repeatedly | Critical | Actions: *One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Kafka events latency is too high | Warning | Actions: *If the alarm is triggered for multiple topics, make sure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to that topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Kafka not available | Critical | Kafka is not available for the pod. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod Failed | Warning | The pod failed. Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_limits | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_limits | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod Not ready for 10 minutes | Critical | Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in the NotReady state for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod Pending state | Warning | The pod is in Pending state. Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for one pod, check the health of that pod. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Pod Unknown state | Warning | The pod is in Unknown state. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for one pod, check whether the image is correct and if the container is starting up. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Redis disconnected for 10 minutes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | redis_state | Redis is not available for the pod for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Redis disconnected for 5 minutes | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | redis_state | Redis is not available for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one service, check if there is an issue with that service. | kafka_consumer_error_total | There were more than 3 Kafka consumer crashes within 5 minutes for the service. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one service, check if there is an issue with that service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one service, check if there is an issue with that service. | kafka_consumer_error_total | There were more than 10 request timeouts within 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Container restarted repeatedly | Critical | The container was repeatedly restarted. Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Number of Redis streams is too high | Warning | Too many active sessions. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check the number of voice, digital, and callback calls in the system. | rqnode_streams | More than 10000 active streams are running. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod failed | Warning | The pod failed. Actions: *One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod not ready for 10 minutes | Critical | The pod is in NotReady state. Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in NotReady state for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod Pending state | Warning | The pod is in the Pending state. Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for one pod, check the health of that pod. | kube_pod_status_phase | The pod is in the Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Pod Unknown state | Warning | The pod is in Unknown state. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for one pod, check whether the image is correct and if the container is starting up. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Redis disconnected for 10 minutes | Critical | Redis is not available for the pod. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check to see if there is an issue with that pod. | redis_state | Redis is not available for the pod for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics | Redis disconnected for 5 minutes | Warning | Redis is not available for the pod. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check to see if there is any issue with that pod. | redis_state | Redis is not available for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Calls activity drop | Warning | A noticeable reduction in the number of active calls on a specific SIP Server and no new calls are arriving for processing. Actions: *If the problematic SIP Server is primary, do a switchover, and then restart the former primary server. *If the problematic SIP Server is backup, restart the backup server. | sips_calls, sips_calls_created | The absolute value of active calls on a specific SIP Server dropped by more than 30 calls in 2 minutes and no new calls are arriving at the SIP Server for processing. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Container Restarted Repeatedly | Critical | The container was repeatedly restarted. Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Dial Plan Node Down | Critical | No Dial Plan nodes are reachable from SIP Server and all connections to Dial Plan nodes are down. Actions: *Check the network connection between SIP Server and the Dial Plan node host. *Check the Dial Plan node CPU and memory usage. | sips_dp_active_connections | All connections to Dial Plan nodes are down. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Dial Plan node is overloaded | Critical | Dial Plan node is overloaded as the response latency increases. Actions: *Check that the inbound call rate to SIP Server is not too high. *Check the Dial Plan node CPU and memory usage. *Check the network connection between SIP Server and Dial Plan nodes. | sips_dp_average_response_latency | Dial Plan node is overloaded as the response latency increases (more than 1000). |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Dial Plan Queue Increase | Critical | The processing queue grows when Dial Plan requests are very large or there is a connection issue with the Dial Plan node. Actions: *Check the SIP Server inbound call rate. *Check the connection between SIP Server and the Dial Plan node. | sips_dp_queue_size | The processing queue size is greater than 10 requests for 1 minute. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Dialplan Node problem | Warning | The Dial Plan node rejects requests with an error, or it does not respond and requests time out. Actions: *Check the network connection between SIP Server and the Dial Plan host. *Check that Dial Plan nodes are running. | sips_dp_timeouts | During 1 minute, the Dial Plan node rejects more than 5 requests with an error or more than 5 requests time out because the Dial Plan node fails to respond. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Kafka not available | Critical | Kafka is not available for the pod. Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | kafka_producer_state | Kafka is not available for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Media service is out of service | Critical | Media service is out of service. Actions: *Troubleshoot the SIP Server-to-Resource Manager (RM) network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary. *Troubleshoot RM, consider RM restart. *After 5 minutes, redirect traffic to another site. | sips_msml_in_service | Media service is out of service for more than 1 minute. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pod Status Error | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed, Unknown, or Pending state. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pod Status NotReady | Warning | The pod is in NotReady state. Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pods less than Min Replicas | Critical | The current number of replicas is less than the minimum replicas that should be available. This might be because Kubernetes cannot deploy a new pod or pods are failing to be active/ready. Actions: *If all services have the same issue, then check Kubernetes nodes and Consul health. *If the issue is specific to this service, check the health of its pods. | kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas | For 5 consecutive minutes, the number of replicas is less than the minimum replicas that should be available. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Pods scaled up greater than 80% | Critical | The current number of replicas is more than 80% of the maximum number of replicas. Actions: *Check if max replicas must be modified based on load. | kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas | For 5 consecutive minutes, the number of replicas is more than 80% of the maximum number of replicas. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Ready Pods below 60% | Critical | The number of statefulset pods in the Ready state has dropped below 60%. Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas_current | For the last 5 minutes, fewer than 60% of the currently available statefulset pods have been in the Ready state. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Redis not available | Critical | Redis is not available for the pod. Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with that pod. | redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Routing timeout counter growth | Warning | The trigger detects that routing timeouts are increasing. Actions: *Check the URS_RESPONSE_MORE5SEC stat value. If it's increasing, then investigate why URS doesn't respond to SIP Server in time. *Check SIPS-to-URS network connectivity. | sips_routing_timeouts | The absolute value of NROUTINGTIMEOUTS on a specific SIP Server increased by more than 20 in 2 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | SIP Node HealthCheck Fail | Critical | SIP Node health level fails for the pod. Actions: *Check for failure of dependent services (Redis/Kafka/SIP Proxy/GVP/Dial Plan). *Check for Envoy proxy failure, then restart the pod. | sipnode_health_level | SIP Node health level fails for the pod for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | SIP Proxy is out of service | Critical | Actions: *Troubleshoot the SIP Server-to-SIP Proxy nodes network connections. Collect network stats and escalate to the Network team to resolve network issues, if necessary. *Troubleshoot SIP Proxy nodes. | sips_sipproxy_in_service | SIP Proxy is out of service. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | SIP Proxy overloaded | Critical | SIP Proxy is overloaded. Actions: *Check SIP Proxy nodes for CPU and memory usage. *If SIP Proxy nodes have acceptable CPU and memory usage, then check for errors or a "hang-up" state which could delay SIP Proxy in forwarding. *Check the SBC side for network delays. | sips_sip_response_time_ms_sum, sips_sip_response_time_ms_count | Response time is greater than 20 milliseconds for 1 minute. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | SIP Server main thread consuming more than 65% CPU for 5 mins | Warning | Main thread consumes too much CPU. Actions: *Collect SIP Server Main thread logs; that is, log files without index in the file name (appname_date.log files). Raise an investigation ticket. | sips_cpu_usage_main | Main thread consumes too much CPU (more than 65% for 5 consecutive minutes). |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | SIP softswitch is out of service | Critical | Actions: *Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary. *Troubleshoot the SBC. | sips_softswitch_in_service | SIP softswitch is out of service. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | SIP trunk is out of service | Critical | SIP trunk is out of service. Actions: *For Primary and Secondary trunks: **Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary. **Troubleshoot the SBC. *For Inter-SIP Server trunks: troubleshoot the SIP Server-to-SIP Server network connection. | sips_trunk_in_service | SIP trunk is out of service for more than 1 minute. |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Too many Kafka pending events | Critical | Too many Kafka producer pending events for the pod. Actions: *Ensure there are no issues with Kafka, the pod's CPU, and the network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the service (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics | Too many Kafka producer errors | Critical | Kafka responds with errors at the pod. Actions: *Ensure there are no issues with Kafka. | kafka_producer_error_total | More than 100 errors for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Config node fail | Warning | The request to the config node failed. Action: *Check if there is any problem with the pod or the config node. | http_client_response_count | Requests to the config node fail for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Container restarted repeatedly | Critical | The container was repeatedly restarted. Actions: *Check to see if a new version of the image was deployed. Also check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | No sip-nodes available for 2 minutes | Critical | No sip-nodes are available for the pod. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with sip-nodes. *If the alarm is triggered only for one pod, check to see if there are any issues with that pod. | sipproxy_active_sip_nodes_count | No sip-nodes are available for the pod for 2 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod and raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod and raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod memory greater than 65% | Warning | The pod has high memory usage. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for the pod and raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod status failed | Warning | Actions: *Restart the pod and check to see if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod status NotReady | Critical | The pod is in NotReady state. Actions: *Restart the pod and check to see if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod status Pending | Warning | The pod is in Pending state. Actions: *Restart the pod and check to see if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Pod status Unknown | Warning | The pod is in Unknown state. Actions: *Restart the pod and check to see if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | SIP server response time too high | Warning | Actions: *If the alarm is triggered for multiple sipproxy-nodes, make sure there are no issues on the SIP Server. *If the alarm is triggered only for one sipproxy-node, check to see if there is an issue with the service on that node (CPU, memory, and so on). | sipproxy_response_latency_bucket | SIP response latency for more than 95% of messages forwarded to the SIP Server is more than 1 second for the sipproxy-node. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | sip-node capacity limit reached | Warning | The sip-node hit its capacity limit. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the sip-node. *If the alarm is triggered only for one pod, check to see if there are any issues with that pod. | sipproxy_sip_node_is_capacity_available | The sip-node hit its capacity limit for 3 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics | Too many Kafka pending events | Critical | Too many Kafka producer pending events for the pod. This alert means there are issues with SIP REGISTER processing on this voice-sipproxy. Actions: *Make sure there are no issues with Kafka or with the pod's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the service (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | ContainerRestartedRepeatedly | Critical | The Voicemail pod restarts repeatedly. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | PodStatusNotReadyfor10mins | Critical | The Voicemail pod is down. | kube_pod_status_ready | The Voicemail pod is down for more than 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | VoicemailConfigHealthFailedCritical | Critical | The Voicemail Service Config node is not available. | voicemail_config_node_status | The Voicemail Service Config node is not available for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | VoicemailConfigRequestFailureCritical | Critical | The Voicemail Service is unable to connect to the Config Node. | voicemail_config_request_failed_total | At least 6 requests failed per minute for the past 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | VoicemailEnvoyHealthFailedCritical | Critical | The Voicemail Service Envoy service is not available. | voicemail_envoy_proxy_status | The Voicemail Service Envoy service is not available for 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | VoicemailGWSHealthFailedCritical | Critical | The Voicemail Service GWS service is not available. | voicemail_gws_status | The Voicemail Service GWS service is not available for 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | VoicemailRedisConnectionDown | Critical | The Voicemail Service is unable to connect to the Redis cluster. | voicemail_redis_connection_failure | At least 6 requests failed per minute for the past 10 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | voicemail_node_cpu_usage_80 | Critical | Critical CPU load for the Voicemail pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_requests_cpu_cores | The Voicemail pod exceeded 80% CPU usage for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | voicemail_node_memory_usage_80 | Critical | Critical memory usage for the Voicemail pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | The Voicemail pod exceeded 80% memory usage for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics | voicemail_storage_failed_account | Outage | The Storage account is down and, as a result, the service will not be able to fetch the data. | voicemail_storage_failed_account | The Storage account is down. |
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics | webrtc-gateway-es | warning | Specifies that the Gateway Pod has lost connection to Elasticsearch | wrtc_system_error | Need input |
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics | webrtc-gateway-gauth | warning | Specifies that the Gateway Pod has lost connection to the Auth service | wrtc_system_error | Need input |
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics | webrtc-gateway-gws | warning | Specifies that the Gateway Pod has lost connection to the Environment Service | wrtc_system_error | Need input |
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics | webrtc-gateway-signins | warning | Specifies the number of sign-ins | wrtc_current_signins | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerCPUreached70percentForConfigserver | HIGH | The trigger will flag an alarm when the Configserver container CPU utilization goes beyond 70% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerMemoryUseOver1GBForConfigserver | HIGH | The trigger will flag an alarm when the Configserver container working memory has exceeded 1GB for 15 mins | container_memory_working_set_bytes | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerMemoryUseOver90PercentForConfigserver | HIGH | The trigger will flag an alarm when the Configserver container working memory use is over 90% of the limit for 15 mins | container_memory_working_set_bytes, kube_pod_container_resource_limits_memory_bytes | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerNotRunningForConfigserver | HIGH | This alert is triggered when the Configserver container has not been running for 15 minutes | kube_pod_container_status_running | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerNotRunningForServiceHandler | MEDIUM | This alert is triggered when the service-handler container has not been running for 15 minutes | kube_pod_container_status_running | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerRestartsOver4ForConfigserver | HIGH | This alert is triggered when the Configserver container restart count exceeds 4 within 15 mins | kube_pod_container_status_restarts_total | 15mins |
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics | ContainerRestartsOver4ForServiceHandler | MEDIUM | This alert is triggered when the service-handler container restart count exceeds 4 within 15 mins | kube_pod_container_status_running | 15mins |
GVP/Current/GVPPEGuide/GVP MCP Metrics | ContainerCPUreached70percentForMCP | HIGH | The trigger will flag an alarm when the MCP container CPU utilization goes beyond 70% for 5 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
GVP/Current/GVPPEGuide/GVP MCP Metrics | ContainerMemoryUseOver7GBForMCP | HIGH | The trigger will flag an alarm when the MCP container working memory has exceeded 7GB for 5 mins | container_memory_working_set_bytes | 15mins |
GVP/Current/GVPPEGuide/GVP MCP Metrics | ContainerMemoryUseOver90PercentForMCP | HIGH | The trigger will flag an alarm when the MCP container working memory use is over 90% of the limit for 5 mins | container_memory_working_set_bytes, kube_pod_container_resource_limits_memory_bytes | 15mins |
GVP/Current/GVPPEGuide/GVP MCP Metrics | ContainerRestartsOver2ForMCP | HIGH | The trigger will flag an alarm when the MCP container restart count exceeds 2 within 15 mins | kube_pod_container_status_restarts_total | 15mins |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_MEDIA_ERROR_CRITICAL | CRITICAL | Number of LMSIP media errors exceeded critical limit | gvp_mcp_log_parser_eror_total {LogID="33008",endpoint="mcplog"...} | 30mins |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_SDP_PARSE_ERROR | WARNING | Number of SDP parse errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="33006",endpoint="mcplog"...} | N/A |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_WEBSOCKET_CLIENT_OPEN_ERROR | HIGH | There are errors opening a session with a websocket client | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | N/A |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_WEBSOCKET_CLIENT_PROTOCOL_ERROR | HIGH | There are protocol errors with a websocket client | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | N/A |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_WEBSOCKET_TOKEN_CONFIG_ERROR | HIGH | There are errors getting information for Auth token with a websocket client | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | N/A |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_WEBSOCKET_TOKEN_CREATE_ERROR | HIGH | There are errors creating a JWT token with a websocket client | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | N/A |
GVP/Current/GVPPEGuide/GVP MCP Metrics | MCP_WEBSOCKET_TOKEN_FETCH_ERROR | HIGH | There are errors fetching Auth token with a websocket client | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | N/A |
GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_FETCH_RESOURCE_ERROR | MEDIUM | Number of VXMLi fetch errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | 1min |
GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_FETCH_RESOURCE_ERROR_4XX | WARNING | Number of VXMLi 4xx fetch errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40032",endpoint="mcplog"...} | 1min |
GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_FETCH_RESOURCE_TIMEOUT | MEDIUM | Number of VXMLi fetch timeouts exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | 1min |
GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_PARSE_ERROR | WARNING | Number of VXMLi parse errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} | 1min |
GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerCPUreached80percent | HIGH | The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerMemoryUsage80percent | HIGH | The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins | container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes | 15mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RS or RS SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins | kube_pod_init_container_status_restarts_total | 15mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RS pod status is Not Ready for 30 mins; this is controlled through the override-value.yaml file. | kube_pod_status_ready | 30mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC50PercentFilled | HIGH | This trigger will flag an alarm when the RS PVC size is 50% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 15mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC80PercentFilled | CRITICAL | This trigger will flag an alarm when the RS PVC size is 80% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 5mins |
GVP/Current/GVPPEGuide/Reporting Server Metrics | RSQueueSizeCritical | HIGH | The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins | rsQueueSize | 15mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RM or RM SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15 mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. | kube_pod_init_container_status_restarts_total | 15 mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | MCPPortsExceeded | HIGH | All the MCP ports in the MCP LRG have been exhausted | gvp_rm_log_parser_eror_total | 1min |
GVP/Current/GVPPEGuide/Resource Manager Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RM pod status is Not Ready for 30 mins; this is controlled by override-value.yaml. | kube_pod_status_ready | 30mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RM Service Down | CRITICAL | RM pods are not in the Ready state and the RM service is not available | kube_pod_container_status_running | 0 |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMConfigServerConnectionLost | HIGH | RM lost connection to the GVP Configuration Server for 5 mins. | gvp_rm_log_parser_warn_total | 5 mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMInterNodeConnectivityBroken | HIGH | Inter-node connectivity between RM nodes is lost for 5 mins. | gvp_rm_log_parser_warn_total | 5 mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMMatchingIVRTenantNotFound | MEDIUM | A matching IVR profile tenant could not be found for 2 mins | gvp_rm_log_parser_eror_total | 2mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMResourceAllocationFailed | MEDIUM | RM resource allocation failed for 1 min | gvp_rm_log_parser_eror_total | 1min |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMServiceDegradedTo50Percentage | HIGH | One of the RM containers is not in the running state for 5 mins | kube_pod_container_status_running | 5mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMSocketInterNodeError | HIGH | RM inter-node socket error for 5 mins. | gvp_rm_log_parser_eror_total | 5mins |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal4XXErrorForINVITE | MEDIUM | The RM MIB counter stats are collected every 60 seconds; if the MIB counter total4xxInviteSent increments from its previous value by 10 within 60 seconds, the trigger will flag an alarm. | rmTotal4xxInviteSent | 1min |
GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal5XXErrorForINVITE | HIGH | The RM MIB counter stats are collected every 30 seconds; if the MIB counter total5xxInviteSent increments from its previous value by 5 within 5 minutes, the trigger will flag an alarm. | rmTotal5xxInviteSent | 5 mins |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES-NODE-JS-DELAY-WARNING | Warning | Triggers if the base NodeJS event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. | application_ccecp_nodejs_eventloop_lag_seconds | Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_ENQUEUE_LIMIT_REACHED | Info | GES is throttling callbacks to a given phone number. | CB_ENQUEUE_LIMIT_REACHED | Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_SUBMIT_FAILED | Info | GES has failed to submit a callback to ORS. | CB_SUBMIT_FAILED | Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_TTL_LIMIT_REACHED | Info | GES is throttling callbacks for a specific tenant. | CB_TTL_LIMIT_REACHED | Triggered when GES has started throttling callbacks within the past 2 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CPU_USAGE | Info | GES has high CPU usage for 1 minute. | ges_process_cpu_seconds_total | Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_DNS_FAILURE | Warning | A GES pod has encountered difficulty resolving DNS requests. | DNS_FAILURE | Triggered when GES encounters any DNS failures within the last 30 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_AUTH_DOWN | Warning | Connection to the Genesys Authentication Service is down. | GWS_AUTH_STATUS | Triggered when the connection to the Genesys Authentication Service is down for 5 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_CONFIG_DOWN | Warning | Connection to the GWS Configuration Service is down. | GWS_CONFIG_STATUS | Triggered when the connection to the GWS Configuration Service is down. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_ENVIRONMENT_DOWN | Warning | Connection to the GWS Environment Service is down. | GWS_ENV_STATUS | Triggered when the connection to the GWS Environment Service is down. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_INCORRECT_CLIENT_CREDENTIALS | Warning | The GWS client credentials provided to GES are incorrect. | GWS_INCORRECT_CLIENT_CREDENTIALS | Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_SERVER_ERROR | Warning | GES has encountered server or connection errors with GWS. | GWS_SERVER_ERROR | Triggered when there has been a GWS server error in the past 5 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HEALTH | Critical | One or more downstream components (Postgres, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this alert does not fire when Redis is down. | GES_HEALTH | Triggered when any component is down for any length of time. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_400_POD | Info | An individual GES pod is returning excessive HTTP 400 results. | ges_http_failed_requests_total, http_400_tolerance | Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_401_POD | Info | An individual GES pod is returning excessive HTTP 401 results. | ges_http_failed_requests_total, http_401_tolerance | Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_404_POD | Info | An individual GES pod is returning excessive HTTP 404 results. | ges_http_failed_requests_total, http_404_tolerance | Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_500_POD | Info | An individual GES pod is returning excessive HTTP 500 results. | ges_http_failed_requests_total, http_500_tolerance | Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_INVALID_CONTENT_LENGTH | Info | Fires if GES encounters any incoming requests that have exceeded the maximum content length of 10 MB on the internal port and 500 KB for the external, public-facing port. | INVALID_CONTENT_LENGTH, invalid_content_length_tolerance | Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_LOGGING_FAILURE | Warning | GES has failed to write a message to the log. | LOGGING_FAILURE | Triggered when there are any failures writing to the logs. Silenced after 1 minute. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_MEMORY_USAGE | Info | GES has high memory usage for a period of 90 seconds. | ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes | Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NEXUS_ACCESS_FAILURE | Warning | GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. | NEXUS_ACCESS_FAILURE | Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_CRITICAL | Critical | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_WARNING | Warning | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_ORS_REDIS_DOWN | Critical | Connection to ORS_REDIS is down. | ORS_REDIS_STATUS | Triggered when the ORS_REDIS connection is down for 5 consecutive minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_PODS_RESTART | Critical | GES pods have been excessively crashing and restarting. | kube_pod_container_status_restarts_total | Triggered when there have been more than five pod restarts in the past 15 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_RBAC_CREATE_VQ_PROXY_ERROR | Info | Fires if there are issues with GES managing VQ Proxy Objects. | RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance | Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_SLOW_HTTP_RESPONSE_TIME | Warning | Fired if the average response time for incoming requests begins to lag. | ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count | Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UNCAUGHT_EXCEPTION | Warning | There has been an uncaught exception within GES. | UNCAUGHT_EXCEPTION | Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute. |
PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UP | Critical | Fires when fewer than two GES pods have been up for the last 15 minutes. | | Triggered when fewer than two GES pods are up for 15 consecutive minutes. |
PEC-DC/Current/DCPEGuide/DCMetrics | Memory usage is above 3000 Mb | Critical | Triggered when the memory usage on this pod is above 3000 MB for 15 minutes. | nexus_process_resident_memory_bytes | For 15 minutes |
PEC-DC/Current/DCPEGuide/DCMetrics | Nexus error rate | Critical | Triggered when the error rate on this pod is greater than 20% for 15 minutes. | nexus_errors_total, nexus_request_total | For 15 minutes |
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Database connections above 75 | HIGH | Triggered when the number of pod database connections is above 75. | | Default number of connections: 75 |
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD DB errors | CRITICAL | Triggered when IWD experiences more than 2 errors within 1 minute during database operations. | | Default number of errors: 2 |
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD error rate | CRITICAL | Triggered when the number of errors in IWD exceeds the threshold over a 15-minute period. | | Default number of errors: 2 |
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Memory usage is above 3000 Mb | CRITICAL | Triggered when the pod memory usage is above 3000 MB. | | Default memory usage: 3000 MB |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | | 2500ms for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-Redis-Connection-Failed | HIGH | Triggered when the connection to Redis fails for more than 1 minute. | | 1m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | | 300% for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-EXT-Ingress-Error-Rate | HIGH | Triggered when the Ingress error rate is above the specified threshold. | | 20% for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | | 70% for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | | 90% for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | | 5 for 5m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for the CX Contact deployment. | | 0 for 1m |
PEC-OU/Current/CXCPEGuide/APIAMetrics | cxc_api_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service for more than the specified time threshold. | | 1m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CM-Redis-Connection-Failed | HIGH | Triggered when the connection to Redis fails for more than 1 minute. | | 1m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | | 300% for 5m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | | 70% for 5m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | | 90% for 5m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | | 5 for 5m |
PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for the CX Contact deployment. | | 0 for 1m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CoM-Redis-no-active-connections | HIGH | Triggered when CX Contact compliance has no active Redis connection for 2 minutes. | | 2m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-Compliance-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | | 5000ms for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | | 300% for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | | 70% for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | | 90% for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | | 5 for 5m |
PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for the CX Contact deployment. | | 0 for 1m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | | 300% for 5m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-DM-LatencyHigh | HIGH | Triggered when the latency for dial manager is above the defined threshold. | | 5000ms for 5m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | | 70% for 5m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | | 90% for 5m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | | 1 for 5m |
PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | | 5 for 5m |
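Many of the CPU alerts listed above (for example, the ContainerCPUreached80percent rows) combine container_cpu_usage_seconds_total with container_spec_cpu_quota and container_spec_cpu_period to express usage as a fraction of the container's CPU limit. The following is a minimal sketch, assuming a reachable Prometheus server and a namespace label value of "gvp" (both hypothetical placeholders), of how such a condition could be checked through the Prometheus HTTP API; the exact expressions used by the product's alerting rules may differ.

<syntaxhighlight lang="python">
# Minimal sketch (not the product's actual alerting rule): evaluates a
# "CPU above 80% of the container limit" condition against the Prometheus
# HTTP API. The server URL and namespace label value are placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.local:9090"  # hypothetical

# Ratio of the observed CPU rate to the container CPU limit (quota/period),
# aggregated per pod; "> 0.80" keeps only pods above 80% of their limit.
# In a real Prometheus alerting rule, the "for N minutes" part of the
# threshold would normally be expressed with the rule's `for:` clause.
QUERY = (
    'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="gvp"}[5m]))'
    ' / sum by (pod) (container_spec_cpu_quota{namespace="gvp"}'
    '                 / container_spec_cpu_period{namespace="gvp"}) > 0.80'
)

def pods_over_cpu_limit():
    """Return the names of pods currently above 80% of their CPU limit."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
    resp.raise_for_status()
    samples = resp.json()["data"]["result"]
    return [s["metric"].get("pod", "<unknown>") for s in samples]

if __name__ == "__main__":
    for pod in pods_over_cpu_limit():
        print(f"ALERT: {pod} CPU usage is above 80% of its limit")
</syntaxhighlight>

In practice these conditions live in Prometheus alerting rules (with the sustained-duration requirement expressed in the rule's for: clause) rather than in ad-hoc scripts; the sketch is only meant to make the metric arithmetic behind the thresholds concrete.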