Cargo query

Page Alert Severity AlertDescription BasedOn Threshold
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_FETCH_RESOURCE_TIMEOUT MEDIUM Number of VXMLi fetch timeouts exceeded limit gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} 1min
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_PARSE_ERROR WARNING Number of VXMLi parse errors exceeded limit gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} 1min
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerCPUreached80percent HIGH The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerMemoryUsage80percent HIGH The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerRestartedRepeatedly CRITICAL The trigger will flag an alarm when the RS or RS SNMP container gets restarted 5 or more times within 15 mins kube_pod_container_status_restarts_total 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics InitContainerFailingRepeatedly CRITICAL The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins kube_pod_init_container_status_restarts_total 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics PodStatusNotReady CRITICAL The trigger will flag an alarm when the RS pod status is Not ready for 30 mins; this is controlled through the override-value.yaml file. kube_pod_status_ready 30mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics PVC50PercentFilled HIGH This trigger will flag an alarm when the RS PVC size is 50% filled kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics PVC80PercentFilled CRITICAL This trigger will flag an alarm when the RS PVC size is 80% filled kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes 5mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics RSQueueSizeCritical HIGH The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins rsQueueSize 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerCPUreached80percentForRM0 HIGH The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerCPUreached80percentForRM1 HIGH The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerMemoryUsage80percentForRM0 HIGH The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins container_memory_rss, kube_pod_container_resource_limits_memory_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerMemoryUsage80percentForRM1 HIGH The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins container_memory_rss, kube_pod_container_resource_limits_memory_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerRestartedRepeatedly CRITICAL The trigger will flag an alarm when the RM or RM SNMP container gets restarted 5 or more times within 15 mins kube_pod_container_status_restarts_total 15 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics InitContainerFailingRepeatedly CRITICAL The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. kube_pod_init_container_status_restarts_total 15 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics MCPPortsExceeded HIGH All the MCP ports in MCP LRG are exceeded gvp_rm_log_parser_eror_total 1min
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics PodStatusNotReady CRITICAL The trigger will flag an alarm when the RM pod status is Not ready for 30 mins; this is controlled by override-value.yaml. kube_pod_status_ready 30mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RM Service Down CRITICAL RM pods are not in ready state and RM service is not available kube_pod_container_status_running 0
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMConfigServerConnectionLost HIGH RM lost connection to GVP Configuration Server for 5mins. gvp_rm_log_parser_warn_total 5 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMInterNodeConnectivityBroken HIGH Inter-node connectivity between RM nodes is lost for 5mins. gvp_rm_log_parser_warn_total 5 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMMatchingIVRTenantNotFound MEDIUM Matching IVR profile tenant could not be found for 2mins gvp_rm_log_parser_eror_total 2mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMResourceAllocationFailed MEDIUM RM Resource allocation failed for 1mins gvp_rm_log_parser_eror_total 1min
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMServiceDegradedTo50Percentage HIGH One of the RM containers is not in a running state for 5mins kube_pod_container_status_running 5mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMSocketInterNodeError HIGH RM Inter node Socket Error for 5mins. gvp_rm_log_parser_eror_total 5mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMTotal4XXErrorForINVITE MEDIUM The RM MIB counter stats are collected every 60 seconds; if the MIB counter total4xxInviteSent increments from its previous value by 10 within 60 seconds, the trigger will flag an alarm. rmTotal4xxInviteSent 1min
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMTotal5XXErrorForINVITE HIGH The RM MIB counter stats are collected every 30 seconds; if the MIB counter total5xxInviteSent increments from its previous value by 5 within 5 minutes, the trigger will flag an alarm. rmTotal5xxInviteSent 5 mins
Draft:GWS/Current/GWSPEGuide/GWSMetrics CPUThrottling Critical Containers are being throttled more than 1 time per second. container_cpu_cfs_throttled_periods_total 1
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_500_responces_java Critical Too many 500 responses. gws_responses_total 10
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_5xx_responces_count Critical Too many 5xx responses. gws_responses_total 60
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_cpu_usage Warning High container CPU usage. container_cpu_usage_seconds_total 300%
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_jvm_gc_pause_seconds_count Critical JVM garbage collection occurs too often. jvm_gc_pause_seconds_count 10
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_jvm_threads_deadlocked Critical Deadlocked JVM threads exist. jvm_threads_deadlocked 0
Draft:GWS/Current/GWSPEGuide/GWSMetrics netstat_Tcp_RetransSegs Warning High number of TCP RetransSegs (retransmitted segments). node_netstat_Tcp_RetransSegs 2000
Draft:GWS/Current/GWSPEGuide/GWSMetrics total_count_of_errors_during_context_initialization Warning Total count of errors during context initialization. gws_context_error_total 1200
Draft:GWS/Current/GWSPEGuide/GWSMetrics total_count_of_errors_in_PSDK_connections Warning Total count of errors in PSDK connections. psdk_conn_error_total 3
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics DesiredPodsDontMatchSpec Critical The Workspace Service deployment doesn't have the desired number of replicas. kube_deployment_status_replicas_available, kube_deployment_spec_replicas Fired when the number of available replicas does not equal the configured number.
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_app_workspace_incoming_requests Critical High rate of incoming requests from Workspace Web Edition. gws_app_workspace_incoming_requests 10
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_high_500_responces_workspace Critical The Workspace Service has too many 500 responses. gws_app_workspace_requests 10
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_high_cpu_usage Warning High container CPU usage. container_cpu_usage_seconds_total 300%
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_high_nodejs_eventloop_lag_seconds Critical The Node.js event loop is too slow. nodejs_eventloop_lag_seconds 0.2
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES-NODE-JS-DELAY-WARNING Warning Triggers if the base NodeJS event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. application_ccecp_nodejs_eventloop_lag_seconds Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_ENQUEUE_LIMIT_REACHED Info GES is throttling callbacks to a given phone number. CB_ENQUEUE_LIMIT_REACHED Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_SUBMIT_FAILED Info GES has failed to submit a callback to ORS. CB_SUBMIT_FAILED Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_TTL_LIMIT_REACHED Info GES is throttling callbacks for a specific tenant. CB_TTL_LIMIT_REACHED Triggered when GES has started throttling callbacks within the past 2 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CPU_USAGE Info GES has high CPU usage for 1 minute. ges_process_cpu_seconds_total Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_DNS_FAILURE Warning A GES pod has encountered difficulty resolving DNS requests. DNS_FAILURE Triggered when GES encounters any DNS failures within the last 30 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_AUTH_DOWN Warning Connection to the Genesys Authentication Service is down. GWS_AUTH_STATUS Triggered when the connection to the Genesys Authentication Service is down for 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_CONFIG_DOWN Warning Connection to the GWS Configuration Service is down. GWS_CONFIG_STATUS Triggered when the connection to the GWS Configuration Service is down.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_ENVIRONMENT_DOWN Warning Connection to the GWS Environment Service is down. GWS_ENV_STATUS Triggered when the connection to the GWS Environment Service is down.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_INCORRECT_CLIENT_CREDENTIALS Warning The GWS client credentials provided to GES are incorrect. GWS_INCORRECT_CLIENT_CREDENTIALS Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_SERVER_ERROR Warning GES has encountered server or connection errors with GWS. GWS_SERVER_ERROR Triggered when there has been a GWS server error in the past 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HEALTH Critical One or more downstream components (PostgreSQL, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this alert does not fire when Redis is down. GES_HEALTH Triggered when any component is down for any length of time.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_400_POD Info An individual GES pod is returning excessive HTTP 400 results. ges_http_failed_requests_total, http_400_tolerance Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_401_POD Info An individual GES pod is returning excessive HTTP 401 results. ges_http_failed_requests_total, http_401_tolerance Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_404_POD Info An individual GES pod is returning excessive HTTP 404 results. ges_http_failed_requests_total, http_404_tolerance Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_500_POD Info An individual GES pod is returning excessive HTTP 500 results. ges_http_failed_requests_total, http_500_tolerance Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_INVALID_CONTENT_LENGTH Info Fires if GES encounters any incoming requests that exceed the maximum content length: 10 MB on the internal port and 500 KB for the external, public-facing port. INVALID_CONTENT_LENGTH, invalid_content_length_tolerance Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_LOGGING_FAILURE Warning GES has failed to write a message to the log. LOGGING_FAILURE Triggered when there are any failures writing to the logs. Silenced after 1 minute.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_MEMORY_USAGE Info GES has high memory usage for a period of 90 seconds. ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NEXUS_ACCESS_FAILURE Warning GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. NEXUS_ACCESS_FAILURE Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NOT_READY_CRITICAL Critical GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. kube_pod_container_status_ready Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NOT_READY_WARNING Warning GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. kube_pod_container_status_ready Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_ORS_REDIS_DOWN Critical Connection to ORS_REDIS is down. ORS_REDIS_STATUS Triggered when the ORS_REDIS connection is down for 5 consecutive minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_PODS_RESTART Critical GES pods have been excessively crashing and restarting. kube_pod_container_status_restarts_total Triggered when there have been more than five pod restarts in the past 15 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_RBAC_CREATE_VQ_PROXY_ERROR Info Fires if there are issues with GES managing VQ Proxy Objects. RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_SLOW_HTTP_RESPONSE_TIME Warning Fired if the average response time for incoming requests begins to lag. ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_UNCAUGHT_EXCEPTION Warning There has been an uncaught exception within GES. UNCAUGHT_EXCEPTION Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_UP Critical Fires when fewer than two GES pods have been up for the last 15 minutes. Triggered when fewer than two GES pods are up for 15 consecutive minutes.
Draft:PEC-DC/Current/DCPEGuide/DCMetrics Memory usage is above 3000 Mb Critical Triggered when the memory usage on this pod is above 3000 Mb for 15 minutes. nexus_process_resident_memory_bytes For 15 minutes
Draft:PEC-DC/Current/DCPEGuide/DCMetrics Nexus error rate Critical Triggered when the error rate on this pod is greater than 20% for 15 minutes. nexus_errors_total, nexus_request_total For 15 minutes
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts Database connections above 75 HIGH Triggered when the number of pod database connections is above 75. Default number of connections: 75
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts IWD DB errors CRITICAL Triggered when IWD experiences more than 2 errors within 1 minute during database operations. Default number of errors: 2
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts IWD error rate CRITICAL Triggered when the number of errors in IWD exceeds the threshold over a 15-minute period. Default number of errors: 2
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts Memory usage is above 3000 Mb CRITICAL Triggered when the pod memory usage is above 3000 MB. Default memory usage: 3000 MB
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-API-LatencyHigh HIGH Triggered when the latency for API responses is beyond the defined threshold. 2500ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-API-Redis-Connection-Failed HIGH Triggered when the connection to redis fails for more than 1 minute. 1m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-EXT-Ingress-Error-Rate HIGH Triggered when the Ingress error rate is above the specified threshold. 20% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics cxc_api_too_many_errors_from_auth HIGH Triggered when there are too many error responses from the auth service for more than the specified time threshold. 1m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-CM-Redis-Connection-Failed HIGH Triggered when the connection to redis fails for more than 1 minute. 1m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-CoM-Redis-no-active-connections HIGH Triggered when CX Contact compliance has no active redis connection for 2 minutes 2m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-Compliance-LatencyHigh HIGH Triggered when the latency for API responses is beyond the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-DM-LatencyHigh HIGH Triggered when the latency for dial manager is above the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-JS-LatencyHigh HIGH Triggered when the latency for job scheduler is above the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-LB-LatencyHigh HIGH Triggered when the latency for list builder is above the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-LM-LatencyHigh HIGH Triggered when the latency for list manager is above the defined threshold 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics cxc_list_manager_too_many_errors_from_auth HIGH Triggered when there are too many error responses from the auth service (list manager) for more than the specified time threshold. 1m
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics gcxi__cluster__info This alert indicates problems with the cluster states. Applicable only if you have two or more nodes in a cluster. gcxi__cluster__info
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics gcxi__projects__status If the value of gcxi__projects__status is greater than 0, this alarm is set, indicating that reporting is not functioning properly. gcxi__projects__status > 0
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics raa-errors '''Specified by''': raa.prometheusRule.alerts.raa-errors.labels.severity in values.yaml. '''Recommended value''': warning. A nonzero value indicates that errors have been logged during the scrape interval. gcxi_raa_error_count >0
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics raa-health '''Specified by''': raa.prometheusRule.alerts.labels.severity. '''Recommended value''': severe. A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. gcxi_raa_health_level '''Specified by''': raa.prometheusRule.alerts.health.for. '''Recommended value''': 30m
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics raa-long-aggregation '''Specified by''': raa.prometheusRule.alerts.longAggregation.labels.severity in values.yaml. '''Recommended value''': warning. Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. '''Recommended value''': 300
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics GcaOOMKilled Critical Triggered when a GCA pod is restarted because of OOMKilled. kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason 1
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics GcaPodCrashLooping Critical Triggered when a GCA pod is crash looping. kube_pod_container_status_restarts_total The restart rate is greater than 0 for 5 minutes
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspFlinkJobDown Critical Triggered when the GSP Flink job is not running (the number of running jobs equals 0 or the metric is not available) flink_jobmanager_numRunningJobs For 5 minutes
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspNoTmRegistered Critical Triggered when there are no registered TaskManagers (or the metric is not available) flink_jobmanager_numRegisteredTaskManagers For 5 minutes
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspOOMKilled Critical Triggered when a GSP pod is restarted because of OOMKilled kube_pod_container_status_restarts_total 0
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspUnknownPerson High Triggered when GSP encounters unknown person(s) flink_taskmanager_job_task_operator_tenant_error_total{error="unknown_person",service="gsp"} For 5 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_connected_configservers Critical Pulse DCU Collector is not connected to ConfigServer. pulse_collector_connection_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_connected_dbservers Critical Pulse DCU Collector is not connected to DbServer. pulse_collector_connection_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_connected_statservers Critical Pulse DCU Collector is not connected to Stat Server. pulse_collector_connection_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_snapshot_writing Critical Pulse DCU Collector does not write snapshots. pulse_collector_snapshot_writing_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_cpu Critical Detected critical CPU usage by Pulse DCU Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_disk Critical Detected critical disk usage by Pulse DCU Pod. kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes 90%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_memory Critical Detected critical memory usage by Pulse DCU Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_nonrunning_instances Critical Triggered when Pulse DCU instances are down. kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_connected_configservers Critical Pulse DCU Stat Server is not connected to ConfigServer. pulse_statserver_server_connected_seconds for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_connected_ixnservers Critical Pulse DCU Stat Server is not connected to IxnServers. pulse_statserver_server_connected_seconds 2
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_connected_tservers Critical Pulse DCU Stat Server is not connected to T-Servers. pulse_statserver_server_connected_number 2
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_failed_dn_registrations Critical Detected critical DN registration failures on Pulse DCU Stat Server. pulse_statserver_dn_failed, pulse_statserver_dn_registered 0.5%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_monitor_data_unavailable Critical Pulse DCU Monitor Agents do not provide data. pulse_monitor_check_duration_seconds, kube_statefulset_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_too_frequent_restarts Critical Detected too frequent restarts of DCU Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_critical_cpu Critical Detected critical CPU usage by Pulse LDS Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_critical_memory Critical Detected critical memory usage by Pulse LDS Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_critical_nonrunning_instances Critical Triggered when Pulse LDS instances are down. kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_monitor_data_unavailable Critical Pulse LDS Monitor Agents do not provide data. pulse_monitor_check_duration_seconds, kube_statefulset_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_no_connected_senders Critical Pulse LDS is not connected to upstream servers. pulse_lds_senders_number for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_no_registered_dns Critical No DNs are registered on Pulse LDS. pulse_lds_sender_registered_dns_number for 30 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_too_frequent_restarts Critical Detected too frequent restarts of LDS Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_5xx Critical Detected critical 5xx errors per second for Pulse container. http_server_requests_seconds_count 15%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_cpu Critical Detected critical CPU usage by Pulse Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_hikari_cp Critical Detected critical Hikari connections pool usage by Pulse container. hikaricp_connections_active, hikaricp_connections 90%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_memory Critical Detected critical memory usage by Pulse Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_pulse_health Critical Detected critical number of healthy Pulse containers. pulse_health_all_Boolean 50%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_running_instances Critical Triggered when Pulse instances are down. kube_deployment_status_replicas_available, kube_deployment_status_replicas 75%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_service_down Critical All Pulse instances are down. up for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_too_frequent_restarts Critical Detected too frequent restarts of Pulse Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_critical_cpu Critical Detected critical CPU usage by Pulse Permissions Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_critical_memory Critical Detected critical memory usage by Pulse Permissions Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_critical_running_instances Critical Triggered when Pulse Permissions instances are down. kube_deployment_status_replicas_available, kube_deployment_status_replicas 75%
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_too_frequent_restarts Critical Detected too frequent restarts of Permissions Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:STRMS/Current/STRMSPEGuide/ServiceMetrics streams_GWS_AUTH_DOWN critical Unable to connect to GWS auth service gws_auth_down 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_BATCH_LAG_TIME warning Message handling exceeds 2 secs 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_DOWN critical The number of running instances is 0 sum(up) < 1 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_ENDPOINT_CONNECTION_DOWN warning Unable to connect to a customer endpoint endpoint_connection_down 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_ENGAGE_KAFKA_CONNECTION_DOWN critical Unable to connect to Engage Kafka engage_kafka_main_connection_down 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_GWS_AUTH_DOWN Critical Unable to connect to GWS auth service gws_auth_down 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_GWS_CONFIG_DOWN critical Unable to connect to GWS config service gws_config_down
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_GWS_ENV_DOWN critical Unable to connect to GWS environment service gws_env_down 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_INIT_ERROR critical Aborted due to an initialization error, e.g., KAFKA_FQDN is not defined application_streams_init_error > 0 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_REDIS_DOWN critical Unable to connect to Redis redis_connection_down 10 seconds
Draft:TLM/Current/TLMPEGuide/TLMMetrics Http Errors Occurrences Exceeded Threshold High Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} >500 in 5 minutes
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry CPU Utilization is Greater Than Threshold High Triggered when average CPU usage is more than 60% node_cpu_seconds_total >60%
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry Dependency Status Low Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus telemetry_dependency_status <80
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry GAuth Time Alert High Triggered when there is no connection to the GAuth service telemetry_gws_auth_req_time >10000
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry Healthy Pod Count Alert High Triggered when the number of healthy pods drops to critical level kube_pod_container_status_ready <2
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry High Network Traffic High Triggered when network traffic exceeds 10MB/second for 5 minutes node_network_transmit_bytes_total, node_network_receive_bytes_total >10MBps
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry Memory Usage is Greater Than Threshold High Triggered when average memory usage is more than 60% container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores >60%
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_elasticsearch_health_status critical Triggered when there is no connection to ElasticSearch ucsx_elasticsearch_health_status 2 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_elasticsearch_slow_processing_time critical Triggered when Elasticsearch internal processing time > 500 ms ucsx_elastic_search_sum, ucsx_elastic_search_count 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_high_cpu_utilization warning Triggered when average CPU usage is more than 80% ucsx_performance 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_high_http_request_rate warning Triggered when the request rate is more than 120 requests per second on one UCS-X instance ucsx_http_request_duration_count 30 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_high_memory_usage warning Triggered when average memory usage is more than 800 MB ucsx_memory 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_overloaded warning Triggered when overload protection rate is more than 0 ucsx_overload_protection_count 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_slow_http_response critical Triggered when average http response time > 500 ms ucsx_http_request_duration_sum, ucsx_http_request_duration_count 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_masterdb_health_status warning Triggered when there is no connection to master DB ucsx_masterdb_health_status 2 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_tenantdb_health_status critical Triggered when there is no connection to tenant DB ucsx_tenantdb_health_status 2 minutes
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Agent service fail Critical Actions: *Check if there is any problem with the pod, then restart the pod. agent_health_level Agent health level is Fail for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Config node fail Warning Actions: *Check if there is any problem with the pod and the config node. http_client_response_count Requests to the config node fail for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Container restarted repeatedly Critical Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total The container was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Kafka events latency is too high Warning Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). kafka_consumer_latency_bucket Latency for more than 5% of messages is more than 0.5 seconds for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Kafka not available Critical Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. kafka_producer_state, kafka_consumer_state Kafka is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Max replicas is not sufficient for 5 mins Critical The desired number of replicas is higher than the current available replicas for the past 5 minutes. kube_statefulset_replicas, kube_statefulset_status_replicas The desired number of replicas is higher than the current available replicas for the past 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod CPU greater than 65% Warning High CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod memory greater than 65% Warning High memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod memory greater than 80% Critical Critical memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status Failed Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status NotReady Critical Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_ready The pod is in NotReady status for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status Pending Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status Unknown Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Possible messages lost Critical Actions: *Check Kafka and the service for overload or network degradation. kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total The number of sent requests is two times higher than the number received for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Redis not available Critical Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. agent_redis_state, agent_stream_redis_state Redis is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka consumer crashes Critical Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. kafka_consumer_error_total More than 3 Kafka consumer crashes in 5 minutes for the service.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka consumer failed health checks Warning Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. kafka_consumer_error_total Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka consumer request timeouts Warning Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. kafka_consumer_error_total More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka pending events Critical Actions: *Ensure there are no issues with Kafka or the pod's CPU and network. kafka_producer_queue_depth Too many Kafka producer pending events for the pod (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Container restarted repeatedly Critical Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total The container was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Kafka events latency is too high Critical Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). kafka_consumer_latency_bucket Latency for more than 5% of messages is more than 0.5 seconds for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Kafka not available Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. kafka_producer_state, kafka_consumer_state Kafka is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Max replicas is not sufficient for 5 mins Critical The desired number of replicas is higher than the current available replicas for the past 5 minutes. kube_statefulset_replicas, kube_statefulset_status_replicas The desired number of replicas is higher than the current available replicas for the past 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod CPU greater than 65% Warning High CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod memory greater than 65% Warning High memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod memory greater than 80% Critical Critical memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status Failed Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status NotReady Critical Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_ready The pod is in NotReady status for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status Pending Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status Unknown Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Redis not available Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. callthread_redis_state Redis is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka consumer crashes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. kafka_consumer_error_total More than 3 Kafka consumer crashes in 5 minutes for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka consumer failed health checks Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. kafka_consumer_error_total Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka consumer request timeouts Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. kafka_consumer_error_total More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka pending events Critical Actions: *Ensure there are no issues with Kafka or the service's CPU and network. kafka_producer_queue_depth Too many Kafka producer pending events for the service (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Container restarted repeatedly Critical Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod Failed Warning Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_status_phase Pod {pod} failed.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod Not ready for 10 minutes Critical Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. kube_pod_status_ready Pod {pod} is in NotReady state for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod Pending state Warning Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for the pod {pod}, check the health of the pod. kube_pod_status_phase Pod {pod} is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod Unknown state Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for the pod {pod}, check to see whether the image is correct and if the container is starting up. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Redis disconnected for 10 minutes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, then restart Redis. *If the alarm is triggered only for the pod {pod}, check to see if there is an issue with the pod. redis_state Redis is not available for the pod {pod} for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Redis disconnected for 5 minutes Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, then restart Redis. *If the alarm is triggered only for the pod {pod}, check to see if there is an issue with the pod. redis_state Redis is not available for pod {pod} for 5 minutes.
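For reference, the "Pod memory greater than 80%" row above can be approximated as a Prometheus alerting rule built from the two metrics it lists (container_memory_working_set_bytes and kube_pod_container_resource_requests_memory_bytes). The sketch below is illustrative only: the group name, alert name, namespace label, and annotation text are assumptions, not the rule shipped with the product.

 groups:
   - name: voice-config-service-memory.rules   # hypothetical group name
     rules:
       - alert: PodMemoryGreaterThan80Percent   # illustrative name mirroring the row above
         expr: |
           # working-set memory as a percentage of the container memory request
           (
             max by (namespace, pod, container) (container_memory_working_set_bytes{namespace="voice", container!="", container!="POD"})
             /
             max by (namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{namespace="voice"})
           ) * 100 > 80
         for: 5m
         labels:
           severity: critical
         annotations:
           summary: "Container {{ $labels.container }} memory usage exceeded 80% for 5 minutes."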
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Aggregated service health failing for 5 minutes Critical Actions: *Check the dialplan dashboard for Aggregated Service Health errors and, in case of a Redis error, first check for any issues/crashes in the pod and then restart Redis. *In the case of an Envoy error, the dialplan container will be restarted by the liveness probe. If the issue still exists, collect the service logs and raise an investigation ticket. dialplan_health_level Dependent services or the Envoy sidecar is not available for 5 minutes in the pod {pod}.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics DialPlan processing time > 0.5 seconds Warning Actions: *If the alarm is generated for all dialplan pods, then a Redis or network delay is the most probable cause. *If the alarm is generated in a single dialplan pod, then it might be due to Envoy or a network issue. dialplan_response_time When the latency for 95% of the dial plan messages is more than 0.5 seconds for a duration of 5 minutes, this warning alarm is raised for the pod {pod}.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics DialPlan processing time > 2 seconds Critical Actions: *If the alarm is generated for all dialplan pods, then a Redis or network delay is the most probable cause. *If the alarm is generated in a single dialplan pod, then it might be due to Envoy or a network issue. dialplan_response_time When the latency for 95% of the dial plan messages is more than 2 seconds for a duration of 5 minutes, this critical alarm is raised for the pod {pod}.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, kube_pod_container_resource_limits Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, kube_pod_container_resource_limits Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod Failed Warning Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_status_phase Pod {pod} failed.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_limits Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_limits Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod Not ready for 10 minutes Critical Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. kube_pod_status_ready Pod {pod} is in the NotReady state for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod Pending state Warning Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for the pod {pod}, check the health of the pod. kube_pod_status_phase Pod {pod} is in the Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Pod Unknown state Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for the pod {pod}, check whether the image is correct and if the container is starting up. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Redis disconnected for 10 minutes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis and then restart Redis. *If the alarm is triggered only for the pod {pod}, check to see if there is an issue with the pod. redis_state Redis is not available for the pod {pod} for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceDialPlanServiceMetrics Redis disconnected for 5 minutes Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis and then restart Redis. *If the alarm is triggered only for the pod {pod}, check to see if there is an issue with the pod. redis_state Redis is not available for the pod {pod} for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Container restarted repeatedly Critical Container {container} was restarted 5 or more times within 15 minutes. Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Kafka not available Critical Kafka is not available for pod {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. kafka_producer_state Kafka is not available for pod {pod} for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Max replicas is not sufficient for 5 mins Critical For the past 5 minutes, the desired number of replicas is higher than the number of replicas currently available. Actions: *Check resources available for Kubernetes. Increase resources, if necessary. kube_statefulset_replicas, kube_statefulset_status_replicas Desired number of replicas is higher than current available replicas for the past 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics No requests received Critical No requests are being received for pod {pod}. Actions: *For pod {pod}, make sure there are no issues with Orchestration Service and Tenant Service or the network to them. sipfe_requests_total increase(sipfe_requests_total{pod=~"sipfe-.+"}[5m]) <= 0 and increase(sipfe_requests_total{pod=~"sipfe-.+"}[10m]) > 100
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod}; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod}; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service for pod {pod}. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod status Failed Warning Pod {pod} is in Failed state. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod status NotReady Critical Pod {pod} is in the NotReady state for 5 minutes. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. kube_pod_status_ready Pod {pod} is in the NotReady state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod status Pending Warning Pod {pod} is in Pending state for 5 minutes. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pod status Unknown Warning Pod {pod} is in Unknown state for 5 minutes. Actions: *Restart the pod. Check to see if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pods less than Min Replicas Critical The current number of replicas is lower than the minimum number of replicas that should be available. Actions: *Check whether Kubernetes cannot deploy new pods or whether pods are failing to become active/ready. kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas For the past 5 minutes, the current number of replicas is lower than the minimum number of replicas that should be available.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Pods scaled up greater than 80% Critical For the past 5 minutes, the desired number of replicas is greater than the number of replicas currently available. Actions: *Check resources available for Kubernetes. Increase resources, if necessary. kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas (kube_hpa_status_current_replicas{namespace="voice",hpa="sipfe-node-hpa"} * 100) / kube_hpa_spec_max_replicas{namespace="voice",hpa="sipfe-node-hpa"} > 80 for: 5m
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics SIP Cluster Service response latency is too high Critical Actions: *If the alarm is triggered for multiple pods, make sure there are no issues with the SIP Cluster Service (CPU, memory, or network overload). *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod (CPU, memory, or network overload). sipfe_sip_node_request_duration_seconds_bucket Latency for 95% of messages is more than 0.5 seconds for service {service}.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics SIP Node(s) is not available Critical No available SIP Nodes for pod {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with SIP Nodes, and then restart SIP Nodes. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. sipfe_sip_nodes_total No available SIP Nodes for pod {pod} for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Too many failure responses sent Critical Too many failure responses are sent by the Front End service at pod {pod}. Actions: *For pod {pod}, make sure received requests are valid. sipfe_responses_total More than 100 failure responses in 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Too many Kafka pending producer events Critical Actions: *Make sure there are no issues with Kafka or the {pod} pod's CPU and network. kafka_producer_queue_depth Too many Kafka producer pending events for pod {pod} (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Too many Kafka producer errors Critical Kafka responds with errors at pod {pod}. Actions: *For pod {pod}, make sure there are no issues with Kafka. kafka_producer_error_total More than 100 errors in 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Too many received requests without a response Critical Actions: *Collect the service logs for pod {pod}; raise an investigation ticket. *Restart the service. sipfe_requests_total For too many requests, the Front End service at pod {pod} did not send any response (more than 100 requests without a response, measured over 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceFrontEndServiceMetrics Too many SIP Cluster Service error responses Critical SIP Cluster Service responds with errors at pod {pod}. Actions: *If the alarm is triggered for multiple pods, make sure there are no issues with the SIP Cluster Service (CPU, memory, or network overload). *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. sipfe_sip_node_responses_total More than 100 errors in 5 consecutive minutes.
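The "Pods scaled up greater than 80%" row above already shows its underlying expression. Wrapped as a Prometheus alerting rule it would look roughly like the sketch below; only the expression comes from the table, while the group name, labels, and annotation text are illustrative assumptions rather than the shipped configuration.

 groups:
   - name: voice-sipfe-scaling.rules            # hypothetical group name
     rules:
       - alert: PodsScaledUpGreaterThan80Percent
         expr: |
           # current HPA replicas as a percentage of the configured maximum (expression taken from the row above)
           (kube_hpa_status_current_replicas{namespace="voice",hpa="sipfe-node-hpa"} * 100)
             / kube_hpa_spec_max_replicas{namespace="voice",hpa="sipfe-node-hpa"} > 80
         for: 5m
         labels:
           severity: critical
         annotations:
           summary: "sipfe-node HPA has been above 80% of its maximum replicas for 5 minutes."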
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Container restarted repeatedly Critical Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Number of running strategies is critical Critical Too many active sessions. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check the number of voice, digital, and callback calls in the system. orsnode_strategies More than 600 strategies running in 5 consecutive seconds.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Number of running strategies is too high Warning Too many active sessions. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check the number of voice, digital, and callback calls in the system. orsnode_strategies More than 400 strategies running in 5 consecutive seconds.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod in Pending state Warning Pod {pod} is in Pending state. Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for the pod {pod}, check the health of the pod. kube_pod_status_phase Pod {pod} is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod in Unknown state Warning Pod {pod} is in Unknown state. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for pod {pod}, check whether the image is correct and if the container is starting up. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod Not ready for 10 minutes Critical Pod {pod} is in NotReady state. Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. kube_pod_status_ready Pod {pod} is in NotReady state for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Pod status Failed Warning Pod {pod} failed. Actions: *One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason. kube_pod_status_phase Pod {pod} is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Redis disconnected for 10 minutes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. redis_state Redis is not available for the pod {pod} for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceOrchestrationServiceMetrics Redis disconnected for 5 minutes Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. redis_state Redis is not available for the pod {pod} for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Container restarted repeatedly Critical Actions: *One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Kafka events latency is too high Warning Actions: *If the alarm is triggered for multiple topics, make sure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for topic {topic}, check if there is an issue with the service related to the topic (CPU, memory, or network overload). kafka_consumer_latency_bucket Latency for more than 5% of messages is more than 0.5 seconds for topic {topic}.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Kafka not available Critical Kafka is not available for pod {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. kafka_producer_state, kafka_consumer_state Kafka is not available for pod {pod} for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, kube_pod_container_resource_limits Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. container_cpu_usage_seconds_total, kube_pod_container_resource_limits Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod Failed Warning Pod {pod} failed. Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_status_phase Pod {pod} is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_limits Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_limits Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod Not ready for 10 minutes Critical Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. kube_pod_status_ready Pod {pod} is in the NotReady state for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod Pending state Warning Pod {pod} is in Pending state. Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for pod {pod}, check the health of the pod. kube_pod_status_phase Pod {pod} is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Pod Unknown state Warning Pod {pod} is in Unknown state. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for pod {pod}, check whether the image is correct and if the container is starting up. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Redis disconnected for 10 minutes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. redis_state Redis is not available for pod {pod} for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Redis disconnected for 5 minutes Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. redis_state Redis is not available for pod {pod} for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Too many Kafka consumer crashes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for {service}, check if there is an issue with the service. kafka_consumer_error_total There were more than 3 Kafka consumer crashes within 5 minutes for service {service}.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Too many Kafka consumer failed health checks Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for {service}, check if there is an issue with the service. kafka_consumer_error_total Health check failed more than 10 times in 5 minutes for the Kafka consumer for topic {topic}.
Draft:VM/Current/VMPEGuide/VoiceRegistrarServiceMetrics Too many Kafka consumer request timeouts Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for {service}, check if there is an issue with the service. kafka_consumer_error_total There were more than 10 request timeouts within 5 minutes for the Kafka consumer for topic {topic}.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Container restarted repeatedly Critical Container {container} was repeatedly restarted. Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Number of Redis streams is too high Warning Too many active sessions. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check the number of voice, digital, and callback calls in the system. rqnode_streams More than 10000 active streams running.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod failed Warning Pod {pod} failed. Actions: *One of the containers in the pod has entered a Failed state. Check the Kibana logs for the reason. kube_pod_status_phase Pod {pod} is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod not ready for 10 minutes Critical Pod {pod} is in NotReady state. Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. kube_pod_status_ready Pod {pod} is in NotReady state for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod Pending state Warning Pod {pod} is in the Pending state. Actions: *If the alarm is triggered for multiple services, make sure the Kubernetes nodes where the pod is running are alive in the cluster. *If the alarm is triggered only for the pod {pod}, check the health of the pod. kube_pod_status_phase Pod {pod} is in the Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Pod Unknown state Warning Pod {pod} is in Unknown state. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with the Kubernetes cluster. *If the alarm is triggered only for the pod {pod}, check whether the image is correct and if the container is starting up. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Redis disconnected for 10 minutes Critical Redis is not available for the pod {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for the pod {pod}, check to see if there is an issue with the pod. redis_state Redis is not available for the pod {pod} for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceRQServiceMetrics Redis disconnected for 5 minutes Warning Redis is not available for the pod {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for the pod {pod}, check to see if there is any issue with the pod. redis_state Redis is not available for the pod {pod} for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Calls activity drop Warning A noticeable reduction in the number of active calls on a specific SIP Server and no new calls are arriving for processing. Actions: *If the problematic SIP Server is primary, do a switchover, and then restart the former primary server. *If the problematic SIP Server is backup, restart the backup server. sips_calls, sips_calls_created The absolute value of active calls on a specific SIP Server dropped by more than 30 calls in 2 minutes and no new calls are arriving at the SIP Server for processing.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Container Restarted Repeatedly Critical Container {container} was repeatedly restarted. Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Dial Plan Node Down Critical No Dial Plan nodes are reachable from SIP Server and all connections to Dial Plan nodes are down. Actions: *Check the network connection between SIP Server and the Dial Plan node host. *Check the Dial Plan node CPU and memory usage. sips_dp_active_connections All connections to Dial Plan nodes are down.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Dial Plan node is overloaded Critical Dial Plan node is overloaded as the response latency increases. Actions: *Check that the inbound call rate to SIP Server is not too high. *Check the Dial Plan node CPU and memory usage. *Check the network connection between SIP Server and Dial Plan nodes. sips_dp_average_response_latency Dial Plan node is overloaded as the response latency increases (more than 1000).
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Dial Plan Queue Increase Critical The processing queue grows because Dial Plan requests are large or there is a connection issue with the Dial Plan node. Actions: *Check SIP Server inbound call rate. *Check the connection between SIP Server and the Dial Plan node. sips_dp_queue_size The processing queue size is greater than 10 requests for 1 minute.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Dialplan Node problem Warning The Dial Plan node rejects requests with an error or fails to respond, causing requests to time out. Actions: *Check the network connection between SIP Server and the Dial Plan host. *Check that Dial Plan nodes are running. sips_dp_timeouts During 1 minute, the Dial Plan node rejects more than 5 requests with an error or more than 5 requests time out because the Dial Plan node fails to respond.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Kafka not available Critical Kafka is not available for pod {pod}. Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. kafka_producer_state Kafka is not available for pod {pod} for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Media service is out of service Critical Media service is out of service. Actions: *Troubleshoot the SIP Server-to-Resource Manager (RM) network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary. *Troubleshoot RM, consider RM restart. *After 5 minutes, redirect traffic to another site. sips_msml_in_service Media service is out of service for more than 1 minute.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod}; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod}; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pod memory greater than 65% Warning High memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod}; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service for pod {pod}. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pod Status Error Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Failed, Unknown, or Pending state.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pod Status NotReady Warning Pod {pod} is in NotReady state. Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_ready Pod {pod} is in NotReady state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pods less than Min Replicas Critical The current number of replicas is less than the minimum replicas that should be available. This might be because Kubernetes cannot deploy a new pod or pods are failing to become active/ready. Actions: *If all services have the same issue, then check Kubernetes nodes and Consul health. *If the issue is specific to this service, check the health of the service pods. kube_hpa_status_current_replicas, kube_hpa_spec_min_replicas For 5 consecutive minutes, the number of replicas is less than the minimum replicas that should be available.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Pods scaled up greater than 80% Critical The current number of replicas is more than 80% of the maximum number of replicas. Actions: *Check if max replicas must be modified based on load. kube_hpa_status_current_replicas, kube_hpa_spec_max_replicas For 5 consecutive minutes, the number of replicas is more than 80% of the maximum number of replicas.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Ready Pods below 60% Critical The number of statefulset {statefulset} pods in the Ready state has dropped below 60%. Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas_current For the last 5 minutes, fewer than 60% of the currently available statefulset {statefulset} pods have been in the Ready state.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Redis not available Critical Redis is not available for pod {pod}. Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. redis_state Redis is not available for pod {pod} for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Routing timeout counter growth Warning The trigger detects that routing timeouts are increasing. Actions: *Check the URS_RESPONSE_MORE5SEC stat value. If it's increasing, then investigate why URS doesn't respond to SIP Server in time. *Check SIPS-to-URS network connectivity. sips_routing_timeouts The absolute value of NROUTINGTIMEOUTS on a specific SIP Server increased by more than 20 in 2 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics SIP Node HealthCheck Fail Critical SIP Node health level fails for pod {pod}. Actions: *Check for failure of dependent services (Redis/Kafka/SIP Proxy/GVP/Dial Plan). *Check for Envoy proxy failure, then restart the pod. sipnode_health_level SIP Node health level fails for pod {pod} for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics SIP Proxy is out of service Critical Actions: *Troubleshoot the SIP Server-to-SIP Proxy nodes network connections. Collect network stats and escalate to the Network team to resolve network issues, if necessary. *Troubleshoot SIP Proxy nodes. sips_sipproxy_in_service SIP Proxy is out of service.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics SIP Proxy overloaded Critical SIP Proxy is overloaded. Actions: *Check SIP Proxy nodes for CPU and memory usage. *If SIP Proxy nodes have acceptable CPU and memory usage, then check for errors or a "hang-up" state which could delay SIP Proxy in forwarding. *Check the SBC side for network delays. sips_sip_response_time_ms_sum, sips_sip_response_time_ms_count Response time is greater than 20 milliseconds for 1 minute.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics SIP Server main thread consuming more than 65% CPU for 5 mins Warning Main thread consumes too much CPU. Actions: *Collect SIP Server Main thread logs; that is, log files without index in the file name (appname_date.log files). Raise an investigation ticket. sips_cpu_usage_main Main thread consumes too much CPU (more than 65% for 5 consecutive minutes).
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics SIP softswitch is out of service Critical Actions: *Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary. *Troubleshoot the SBC. sips_softswitch_in_service SIP softswitch is out of service.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics SIP trunk is out of service Critical SIP trunk is out of service. Actions: *For Primary and Secondary trunks: **Troubleshoot the SIP Server-to-SBC network connection. Collect network stats and escalate to the Network team to resolve network issues, if necessary. **Troubleshoot the SBC. *For Inter-SIP Server trunks: troubleshoot the SIP Server-to-SIP Server network connection. sips_trunk_in_service SIP trunk is out of service for more than 1 minute.
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Too many Kafka pending events Critical Too many Kafka producer pending events for pod {pod}. Actions: *Ensure there are no issues with Kafka, the {pod} pod's CPU, and network. kafka_producer_queue_depth Too many Kafka producer pending events for service {service} (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceSIPClusterServiceMetrics Too many Kafka producer errors Critical Kafka responds with errors at pod {pod}. Actions: *For pod {pod}, ensure there are no issues with Kafka. kafka_producer_error_total More than 100 errors for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Config node fail Warning The request to the config node failed. Action: *Check if there is any problem with pod {pod} and the config node. http_client_response_count Requests to the config node fail for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Container restarted repeatedly Critical Container {container} was repeatedly restarted. Actions: *Check to see if a new version of the image was deployed. Also check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics No sip-nodes available for 2 minutes Critical No sip-nodes are available for the pod {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with sip-nodes. *If the alarm is triggered only for pod {pod}, check to see if there are any issues with the pod. sipproxy_active_sip_nodes_count No sip-nodes are available for the pod {pod} for 2 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod CPU greater than 65% Warning High CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod} and raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod} and raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container {container} CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod memory greater than 65% Warning Pod {pod} has high memory usage. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs for pod {pod} and raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod memory greater than 80% Critical Critical memory usage for pod {pod}. Actions: *Check whether the horizontal pod autoscaler has triggered and the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service for pod {pod}. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container {container} memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod status failed Warning Actions: *Restart the pod and check to see if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod status NotReady Critical Pod {pod} is in NotReady state. Actions: *Restart the pod and check to see if there are any issues with the pod after restart. kube_pod_status_ready Pod {pod} is in NotReady state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod status Pending Warning Pod {pod} is in Pending state. Actions: *Restart the pod and check to see if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Pod status Unknown Warning Pod {pod} is in Unknown state. Actions: *Restart the pod and check to see if there are any issues with the pod after restart. kube_pod_status_phase Pod {pod} is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics SIP server response time too high Warning Actions: *If the alarm is triggered for multiple sipproxy-nodes, make sure there are no issues on {SIP Server}. *If the alarm is triggered only for sipproxy-node {sipproxy-node}, check to see if there is an issue with the service (CPU, memory, or network overload). sipproxy_response_latency_bucket SIP response latency for more than 95% of messages forwarded to {SIP Server} is more than 1 second for sipproxy-node {sipproxy-node}.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics sip-node capacity limit reached Warning The sip-node {sip-node} hit the capacity limit on {pod}. Actions: *If the alarm is triggered for multiple services, make sure there are no issues with sip-node {sip-node}. *If the alarm is triggered only for pod {pod}, check if there is an issue with the pod. sipproxy_sip_node_is_capacity_available The sip-node {sip-node} hit the capacity limit on {pod} for 3 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceSIPProxyServiceMetrics Too many Kafka pending events Critical Too many Kafka producer pending events for pod {pod}. This alert means there are issues with SIP REGISTER processing on this voice-sipproxy. Actions: *Make sure there are no issues with Kafka or with the {pod} pod's CPU and network. kafka_producer_queue_depth Too many Kafka producer pending events for service {service} (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics ContainerRestartedRepeatedly Critical The Voicemail pod restarts repeatedly. kube_pod_container_status_restarts_total Container {container} was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics PodStatusNotReadyfor10mins Critical The Voicemail pod is down. kube_pod_status_ready The Voicemail pod is down for more than 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics VoicemailConfigHealthFailedCritical Critical Voicemail Service {service} Config node service is not available. voicemail_config_node_status Voicemail Service {service} Config node service is not available for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics VoicemailConfigRequestFailureCritical Critical Voicemail Service {service} is unable to connect to the Config Node. voicemail_config_request_failed_total At least 6 requests failed per minute for the past 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics VoicemailEnvoyHealthFailedCritical Critical Voicemail Service {service} Envoy service is not available. voicemail_envoy_proxy_status Voicemail Service {service} Envoy service is not available for 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics VoicemailGWSHealthFailedCritical Critical Voicemail Service {service} GWS service is not available. voicemail_gws_status Voicemail Service {service} GWS service is not available for 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics VoicemailRedisConnectionDown Critical Voicemail Service {service} is unable to connect to the Redis cluster. voicemail_redis_connection_failure At least 6 requests failed per minute for the past 10 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics voicemail_node_cpu_usage_80 Critical Critical CPU load for pod {pod}. container_cpu_usage_seconds_total, kube_pod_container_resource_requests_cpu_cores The Voicemail pod exceeded 80% CPU usage for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics voicemail_node_memory_usage_80 Critical Critical memory usage for pod {pod}. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes The Voicemail pod exceeded 80% memory usage for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceVoicemailServiceMetrics voicemail_storage_failed_account Outage The Storage account is down and, as a result, the service will not be able to fetch the data. voicemail_storage_failed_account The Storage account is down.
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics webrtc-gateway-es warning Specifies that the Gateway Pod has lost connection to ElasticSearch wrtc_system_error Need input
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics webrtc-gateway-gauth warning Specifies that the Gateway Pod has lost connection to Auth service wrtc_system_error Need input
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics webrtc-gateway-gws warning Specifies that the Gateway Pod has lost connection to the Environment Service wrtc_system_error Need input
Draft:WebRTC/Current/WebRTCPEGuide/WebRTC Metrics webrtc-gateway-signins warning Specifies the number of sign-ins wrtc_current_signins 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerCPUreached70percentForConfigserver HIGH The trigger will flag an alarm when the Configserver container CPU utilization goes beyond 70% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerMemoryUseOver1GBForConfigserver HIGH The trigger will flag an alarm when the Configserver container working memory has exceeded 1GB for 15 mins container_memory_working_set_bytes 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerMemoryUseOver90PercentForConfigserver HIGH The trigger will flag an alarm when the Configserver container working memory use is over 90% of the limit for 15 mins container_memory_working_set_bytes, kube_pod_container_resource_limits_memory_bytes 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerNotRunningForConfigserver HIGH This alert is triggered when the Configserver container has not been running for 15 minutes kube_pod_container_status_running 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerNotRunningForServiceHandler MEDIUM This alert is triggered when the service-handler container has not been running for 15 minutes kube_pod_container_status_running 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerRestartsOver4ForConfigserver HIGH This alert is triggered when the Configserver container restarts more than 4 times within 15 mins kube_pod_container_status_restarts_total 15mins
GVP/Current/GVPPEGuide/GVP Configuration Server Metrics ContainerRestartsOver4ForServiceHandler MEDIUM This alert is triggered when the service-handler container restarts more than 4 times within 15 mins kube_pod_container_status_running 15mins
GVP/Current/GVPPEGuide/GVP MCP Metrics ContainerCPUreached70percentForMCP HIGH The trigger will flag an alarm when the MCP container CPU utilization goes beyond 70% for 5 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
GVP/Current/GVPPEGuide/GVP MCP Metrics ContainerMemoryUseOver7GBForMCP HIGH The trigger will flag an alarm when the MCP container working memory has exceeded 7GB for 5 mins container_memory_working_set_bytes 15mins
GVP/Current/GVPPEGuide/GVP MCP Metrics ContainerMemoryUseOver90PercentForMCP HIGH The trigger will flag an alarm when the MCP container working memory use is over 90% of the limit for 5 mins container_memory_working_set_bytes, kube_pod_container_resource_limits_memory_bytes 15mins
GVP/Current/GVPPEGuide/GVP MCP Metrics ContainerRestartsOver2ForMCP HIGH The trigger will flag an alarm when the MCP container restarts more than 2 times within 15 mins kube_pod_container_status_restarts_total 15mins
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_MEDIA_ERROR_CRITICAL CRITICAL Number of LMSIP media errors exceeded critical limit gvp_mcp_log_parser_eror_total {LogID="33008",endpoint="mcplog"...} 30mins
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_SDP_PARSE_ERROR WARNING Number of SDP parse errors exceeded limit gvp_mcp_log_parser_eror_total {LogID="33006",endpoint="mcplog"...} N/A
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_WEBSOCKET_CLIENT_OPEN_ERROR HIGH There are errors opening a session with a websocket client gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} N/A
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_WEBSOCKET_CLIENT_PROTOCOL_ERROR HIGH There are protocol errors with a websocket client gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} N/A
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_WEBSOCKET_TOKEN_CONFIG_ERROR HIGH There are errors getting information for Auth token with a websocket client gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} N/A
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_WEBSOCKET_TOKEN_CREATE_ERROR HIGH There are errors creating a JWT token with a websocket client gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} N/A
GVP/Current/GVPPEGuide/GVP MCP Metrics MCP_WEBSOCKET_TOKEN_FETCH_ERROR HIGH There are errors fetching Auth token with a websocket client gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} N/A
GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_FETCH_RESOURCE_ERROR MEDIUM Number of VXMLi fetch errors exceeded limit gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} 1min
GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_FETCH_RESOURCE_ERROR_4XX WARNING Number of VXMLi 4xx fetch errors exceeded limit gvp_mcp_log_parser_eror_total {LogID="40032",endpoint="mcplog"...} 1min
GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_FETCH_RESOURCE_TIMEOUT MEDIUM Number of VXMLi fetch timeouts exceeded limit gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} 1min
GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_PARSE_ERROR WARNING Number of VXMLi parse errors exceeded limit gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} 1min
GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerCPUreached80percent HIGH The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerMemoryUsage80percent HIGH The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes 15mins
GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerRestartedRepeatedly CRITICAL The trigger will flag an alarm when the RS or RS SNMP container gets restarted 5 or more times within 15 mins kube_pod_container_status_restarts_total 15mins
GVP/Current/GVPPEGuide/Reporting Server Metrics InitContainerFailingRepeatedly CRITICAL The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins kube_pod_init_container_status_restarts_total 15mins
GVP/Current/GVPPEGuide/Reporting Server Metrics PodStatusNotReady CRITICAL The trigger will flag an alarm when the RS pod status is Not Ready for 30 mins; this is controlled through the override-value.yaml file. kube_pod_status_ready 30mins
GVP/Current/GVPPEGuide/Reporting Server Metrics PVC50PercentFilled HIGH This trigger will flag an alarm when the RS PVC size is 50% filled kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes 15mins
GVP/Current/GVPPEGuide/Reporting Server Metrics PVC80PercentFilled CRITICAL This trigger will flag an alarm when the RS PVC size is 80% filled kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes 5mins
GVP/Current/GVPPEGuide/Reporting Server Metrics RSQueueSizeCritical HIGH The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins rsQueueSize 15mins
GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerCPUreached80percentForRM0 HIGH The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerCPUreached80percentForRM1 HIGH The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerMemoryUsage80percentForRM0 HIGH The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins container_memory_rss, kube_pod_container_resource_limits_memory_bytes 15mins
GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerMemoryUsage80percentForRM1 HIGH The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins container_memory_rss, kube_pod_container_resource_limits_memory_bytes 15mins
GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerRestartedRepeatedly CRITICAL The trigger will flag an alarm when the RM or RM SNMP container gets restarted 5 or more times within 15 mins kube_pod_container_status_restarts_total 15 mins
GVP/Current/GVPPEGuide/Resource Manager Metrics InitContainerFailingRepeatedly CRITICAL The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. kube_pod_init_container_status_restarts_total 15 mins
GVP/Current/GVPPEGuide/Resource Manager Metrics MCPPortsExceeded HIGH All MCP ports in the MCP LRG have been exhausted gvp_rm_log_parser_eror_total 1min
GVP/Current/GVPPEGuide/Resource Manager Metrics PodStatusNotReady CRITICAL The trigger will flag an alarm when the RM pod status is Not Ready for 30 mins; this is controlled by override-value.yaml. kube_pod_status_ready 30mins
GVP/Current/GVPPEGuide/Resource Manager Metrics RM Service Down CRITICAL RM pods are not in the Ready state and the RM service is not available kube_pod_container_status_running 0
GVP/Current/GVPPEGuide/Resource Manager Metrics RMConfigServerConnectionLost HIGH RM has lost its connection to the GVP Configuration Server for 5 mins. gvp_rm_log_parser_warn_total 5 mins
GVP/Current/GVPPEGuide/Resource Manager Metrics RMInterNodeConnectivityBroken HIGH Inter-node connectivity between RM nodes has been lost for 5 mins. gvp_rm_log_parser_warn_total 5 mins
GVP/Current/GVPPEGuide/Resource Manager Metrics RMMatchingIVRTenantNotFound MEDIUM A matching IVR profile tenant could not be found for 2 mins gvp_rm_log_parser_eror_total 2mins
GVP/Current/GVPPEGuide/Resource Manager Metrics RMResourceAllocationFailed MEDIUM RM resource allocation failed for 1 min gvp_rm_log_parser_eror_total 1min
GVP/Current/GVPPEGuide/Resource Manager Metrics RMServiceDegradedTo50Percentage HIGH One of the RM containers is not in the running state for 5 mins kube_pod_container_status_running 5mins
GVP/Current/GVPPEGuide/Resource Manager Metrics RMSocketInterNodeError HIGH RM inter-node socket error for 5 mins. gvp_rm_log_parser_eror_total 5mins
GVP/Current/GVPPEGuide/Resource Manager Metrics RMTotal4XXErrorForINVITE MEDIUM The RM MIB counter stats are collected every 60 seconds; if the total4xxInviteSent counter increases from its previous value by 10 within 60 seconds, the trigger will flag an alarm. rmTotal4xxInviteSent 1min
GVP/Current/GVPPEGuide/Resource Manager Metrics RMTotal5XXErrorForINVITE HIGH The RM MIB counter stats are collected every 30 seconds; if the total5xxInviteSent counter increases from its previous value by 5 within 5 minutes, the trigger will flag an alarm. rmTotal5xxInviteSent 5 mins
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES-NODE-JS-DELAY-WARNING Warning Triggers if the base NodeJS event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. application_ccecp_nodejs_eventloop_lag_seconds Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_ENQUEUE_LIMIT_REACHED Info GES is throttling callbacks to a given phone number. CB_ENQUEUE_LIMIT_REACHED Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_SUBMIT_FAILED Info GES has failed to submit a callback to ORS. CB_SUBMIT_FAILED Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_TTL_LIMIT_REACHED Info GES is throttling callbacks for a specific tenant. CB_TTL_LIMIT_REACHED Triggered when GES has started throttling callbacks within the past 2 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CPU_USAGE Info GES has high CPU usage for 1 minute. ges_process_cpu_seconds_total Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_DNS_FAILURE Warning A GES pod has encountered difficulty resolving DNS requests. DNS_FAILURE Triggered when GES encounters any DNS failures within the last 30 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_AUTH_DOWN Warning Connection to the Genesys Authentication Service is down. GWS_AUTH_STATUS Triggered when the connection to the Genesys Authentication Service is down for 5 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_CONFIG_DOWN Warning Connection to the GWS Configuration Service is down. GWS_CONFIG_STATUS Triggered when the connection to the GWS Configuration Service is down.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_ENVIRONMENT_DOWN Warning Connection to the GWS Environment Service is down. GWS_ENV_STATUS Triggered when the connection to the GWS Environment Service is down.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_INCORRECT_CLIENT_CREDENTIALS Warning The GWS client credentials provided to GES are incorrect. GWS_INCORRECT_CLIENT_CREDENTIALS Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_SERVER_ERROR Warning GES has encountered server or connection errors with GWS. GWS_SERVER_ERROR Triggered when there has been a GWS server error in the past 5 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HEALTH Critical One or more downstream components (PostGres, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this does not fire when Redis is down. GES_HEALTH Triggered when any component is down for any length of time.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_400_POD Info An individual GES pod is returning excessive HTTP 400 results. ges_http_failed_requests_total, http_400_tolerance Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_401_POD Info An individual GES pod is returning excessive HTTP 401 results. ges_http_failed_requests_total, http_401_tolerance Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_404_POD Info An individual GES pod is returning excessive HTTP 404 results. ges_http_failed_requests_total, http_404_tolerance Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_500_POD Info An individual GES pod is returning excessive HTTP 500 results. ges_http_failed_requests_total, http_500_tolerance Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_INVALID_CONTENT_LENGTH Info Fires if GES encounters any incoming requests that exceed the maximum content length of 10 MB on the internal port or 500 KB on the external, public-facing port. INVALID_CONTENT_LENGTH, invalid_content_length_tolerance Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_LOGGING_FAILURE Warning GES has failed to write a message to the log. LOGGING_FAILURE Triggered when there are any failures writing to the logs. Silenced after 1 minute.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_MEMORY_USAGE Info GES has high memory usage for a period of 90 seconds. ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NEXUS_ACCESS_FAILURE Warning GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. NEXUS_ACCESS_FAILURE Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NOT_READY_CRITICAL Critical GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. kube_pod_container_status_ready Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NOT_READY_WARNING Warning GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. kube_pod_container_status_ready Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_ORS_REDIS_DOWN Critical Connection to ORS_REDIS is down. ORS_REDIS_STATUS Triggered when the ORS_REDIS connection is down for 5 consecutive minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_PODS_RESTART Critical GES pods have been excessively crashing and restarting. kube_pod_container_status_restarts_total Triggered when there have been more than five pod restarts in the past 15 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_RBAC_CREATE_VQ_PROXY_ERROR Info Fires if there are issues with GES managing VQ Proxy Objects. RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_SLOW_HTTP_RESPONSE_TIME Warning Fired if the average response time for incoming requests begins to lag. ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_UNCAUGHT_EXCEPTION Warning There has been an uncaught exception within GES. UNCAUGHT_EXCEPTION Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute.
PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_UP Critical Fires when fewer than two GES pods have been up for the last 15 minutes. Triggered when fewer than two GES pods are up for 15 consecutive minutes.
PEC-DC/Current/DCPEGuide/DCMetrics Memory usage is above 3000 Mb Critical Triggered when the memory usage on this pod is above 3000 Mb for 15 minutes. nexus_process_resident_memory_bytes For 15 minutes
PEC-DC/Current/DCPEGuide/DCMetrics Nexus error rate Critical Triggered when the error rate on this pod is greater than 20% for 15 minutes. nexus_errors_total, nexus_request_total For 15 minutes
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts Database connections above 75 HIGH Triggered when the number of pod database connections is above 75. Default number of connections: 75
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts IWD DB errors CRITICAL Triggered when IWD experiences more than 2 errors within 1 minute during database operations. Default number of errors: 2
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts IWD error rate CRITICAL Triggered when the number of errors in IWD exceeds the threshold for a 15-minute period. Default number of errors: 2
PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts Memory usage is above 3000 Mb CRITICAL Triggered when the pod memory usage is above 3000 MB. Default memory usage: 3000 MB
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-API-LatencyHigh HIGH Triggered when the latency for API responses is beyond the defined threshold. 2500ms for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-API-Redis-Connection-Failed HIGH Triggered when the connection to redis fails for more than 1 minute. 1m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-EXT-Ingress-Error-Rate HIGH Triggered when the Ingress error rate is above the specified threshold. 20% for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
PEC-OU/Current/CXCPEGuide/APIAMetrics cxc_api_too_many_errors_from_auth HIGH Triggered when there are too many error responses from the auth service for more than the specified time threshold. 1m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-CM-Redis-Connection-Failed HIGH Triggered when the connection to redis fails for more than 1 minute. 1m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-CoM-Redis-no-active-connections HIGH Triggered when CX Contact compliance has had no active Redis connection for 2 minutes. 2m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-Compliance-LatencyHigh HIGH Triggered when the latency for API responses is beyond the defined threshold. 5000ms for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-DM-LatencyHigh HIGH Triggered when the latency for dial manager is above the defined threshold. 5000ms for 5m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
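Each row above pairs an alert with the metrics it is based on and the threshold window that must be sustained before it fires. As a purely illustrative sketch (not the rules shipped with the product), the following shows how two representative Reporting Server rows, ContainerCPUreached80percent and PVC80PercentFilled, could be written as Prometheus alerting rules; the metric names, severities, and durations come from the table, while the rule group name, the exact PromQL ratio expressions, and the annotation text are assumptions.

groups:
  - name: gvp-reporting-server-example   # assumed group name, for illustration only
    rules:
      - alert: ContainerCPUreached80percent
        # CPU usage as a fraction of the container CPU limit (quota/period),
        # sustained for 15 minutes, matching the table row.
        expr: |
          sum by (pod, container) (rate(container_cpu_usage_seconds_total[5m]))
            /
          sum by (pod, container) (container_spec_cpu_quota / container_spec_cpu_period)
            > 0.80
        for: 15m
        labels:
          severity: HIGH
        annotations:
          summary: RS container CPU utilization has been above 80% for 15 minutes
      - alert: PVC80PercentFilled
        # Used bytes as a fraction of PVC capacity, sustained for 5 minutes.
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80
        for: 5m
        labels:
          severity: CRITICAL
        annotations:
          summary: RS PVC is more than 80% filled

The for: clause corresponds to the Threshold column: the expression must remain true for that long before the alert transitions from pending to firing.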

Retrieved from "https://all.docs.genesys.com/Special:CargoQuery (2024-05-15 02:47:52)"