Cargo query


Showing below up to 250 results in range #101 to #350.


Page | Alert | Severity | AlertDescription | BasedOn | Threshold
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_FETCH_RESOURCE_TIMEOUT MEDIUM Number of VXMLi fetch timeouts exceeded limit gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} 1min
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics NGI_LOG_PARSE_ERROR WARNING Number of VXMLi parse errors exceeded limit gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} 1min
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerCPUreached80percent HIGH The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerMemoryUsage80percent HIGH The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics ContainerRestartedRepeatedly CRITICAL The trigger will flag an alarm when the RS or RS SNMP container is restarted 5 or more times within 15 mins kube_pod_container_status_restarts_total 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics InitContainerFailingRepeatedly CRITICAL The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins kube_pod_init_container_status_restarts_total 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics PodStatusNotReady CRITICAL The trigger will flag an alarm when the RS pod status is NotReady for 30 mins; this is controlled through the override-value.yaml file. kube_pod_status_ready 30mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics PVC50PercentFilled HIGH This trigger will flag an alarm when the RS PVC size is 50% filled kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics PVC80PercentFilled CRITICAL This trigger will flag an alarm when the RS PVC size is 80% filled kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes 5mins
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics RSQueueSizeCritical HIGH The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins rsQueueSize 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerCPUreached80percentForRM0 HIGH The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerCPUreached80percentForRM1 HIGH The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerMemoryUsage80percentForRM0 HIGH The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins container_memory_rss, kube_pod_container_resource_limits_memory_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerMemoryUsage80percentForRM1 HIGH The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins container_memory_rss, kube_pod_container_resource_limits_memory_bytes 15mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics ContainerRestartedRepeatedly CRITICAL The trigger will flag an alarm when the RM or RM SNMP container is restarted 5 or more times within 15 mins kube_pod_container_status_restarts_total 15 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics InitContainerFailingRepeatedly CRITICAL The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. kube_pod_init_container_status_restarts_total 15 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics MCPPortsExceeded HIGH All the MCP ports in the MCP LRG are exceeded gvp_rm_log_parser_eror_total 1min
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics PodStatusNotReady CRITICAL The trigger will flag an alarm when the RM pod status is NotReady for 30 mins; this is controlled by override-value.yaml. kube_pod_status_ready 30mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RM Service Down CRITICAL RM pods are not in ready state and RM service is not available kube_pod_container_status_running 0
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMConfigServerConnectionLost HIGH RM lost connection to GVP Configuration Server for 5mins. gvp_rm_log_parser_warn_total 5 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMInterNodeConnectivityBroken HIGH Inter-node connectivity between RM nodes is lost for 5mins. gvp_rm_log_parser_warn_total 5 mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMMatchingIVRTenantNotFound MEDIUM Matching IVR profile tenant could not be found for 2mins gvp_rm_log_parser_eror_total 2mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMResourceAllocationFailed MEDIUM RM resource allocation failed for 1 min gvp_rm_log_parser_eror_total 1min
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMServiceDegradedTo50Percentage HIGH One of the RM containers is not in a running state for 5mins kube_pod_container_status_running 5mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMSocketInterNodeError HIGH RM Inter node Socket Error for 5mins. gvp_rm_log_parser_eror_total 5mins
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMTotal4XXErrorForINVITE MEDIUM The RM MIB counter stats are collected every 60 seconds; if the MIB counter total4xxInviteSent increases from its previous value by 10 within 60 seconds, the trigger will flag an alarm. rmTotal4xxInviteSent 1min
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics RMTotal5XXErrorForINVITE HIGH The RM MIB counter stats are collected every 30 seconds; if the MIB counter total5xxInviteSent increases from its previous value by 5 within 5 minutes, the trigger will flag an alarm. rmTotal5xxInviteSent 5 mins
Draft:GWS/Current/GWSPEGuide/GWSMetrics CPUThrottling Critical Containers are being throttled more than 1 time per second. container_cpu_cfs_throttled_periods_total 1
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_500_responces_java Critical Too many 500 responses. gws_responses_total 10
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_5xx_responces_count Critical Too many 5xx responses. gws_responses_total 60
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_cpu_usage Warning High container CPU usage. container_cpu_usage_seconds_total 300%
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_high_jvm_gc_pause_seconds_count Critical JVM garbage collection occurs too often. jvm_gc_pause_seconds_count 10
Draft:GWS/Current/GWSPEGuide/GWSMetrics gws_jvm_threads_deadlocked Critical Deadlocked JVM threads exist. jvm_threads_deadlocked 0
Draft:GWS/Current/GWSPEGuide/GWSMetrics netstat_Tcp_RetransSegs Warning High number of TCP RetransSegs (retransmitted segments). node_netstat_Tcp_RetransSegs 2000
Draft:GWS/Current/GWSPEGuide/GWSMetrics total_count_of_errors_during_context_initialization Warning Total count of errors during context initialization. gws_context_error_total 1200
Draft:GWS/Current/GWSPEGuide/GWSMetrics total_count_of_errors_in_PSDK_connections Warning Total count of errors in PSDK connections. psdk_conn_error_total 3
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics DesiredPodsDontMatchSpec Critical The Workspace Service deployment doesn't have the desired number of replicas. kube_deployment_status_replicas_available, kube_deployment_spec_replicas Fired when the number of available replicas does not equal the configured number.
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_app_workspace_incoming_requests Critical High rate of incoming requests from Workspace Web Edition. gws_app_workspace_incoming_requests 10
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_high_500_responces_workspace Critical The Workspace Service has too many 500 responses. gws_app_workspace_requests 10
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_high_cpu_usage Warning High container CPU usage. container_cpu_usage_seconds_total 300%
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics gws_high_nodejs_eventloop_lag_seconds Critical The Node.js event loop is too slow. nodejs_eventloop_lag_seconds 0.2
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES-NODE-JS-DELAY-WARNING Warning Triggers if the base NodeJS event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. application_ccecp_nodejs_eventloop_lag_seconds Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_ENQUEUE_LIMIT_REACHED Info GES is throttling callbacks to a given phone number. CB_ENQUEUE_LIMIT_REACHED Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_SUBMIT_FAILED Info GES has failed to submit a callback to ORS. CB_SUBMIT_FAILED Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CB_TTL_LIMIT_REACHED Info GES is throttling callbacks for a specific tenant. CB_TTL_LIMIT_REACHED Triggered when GES has started throttling callbacks within the past 2 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_CPU_USAGE Info GES has high CPU usage for 1 minute. ges_process_cpu_seconds_total Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_DNS_FAILURE Warning A GES pod has encountered difficulty resolving DNS requests. DNS_FAILURE Triggered when GES encounters any DNS failures within the last 30 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_AUTH_DOWN Warning Connection to the Genesys Authentication Service is down. GWS_AUTH_STATUS Triggered when the connection to the Genesys Authentication Service is down for 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_CONFIG_DOWN Warning Connection to the GWS Configuration Service is down. GWS_CONFIG_STATUS Triggered when the connection to the GWS Configuration Service is down.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_ENVIRONMENT_DOWN Warning Connection to the GWS Environment Service is down. GWS_ENV_STATUS Triggered when the connection to the GWS Environment Service is down.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_INCORRECT_CLIENT_CREDENTIALS Warning The GWS client credentials provided to GES are incorrect. GWS_INCORRECT_CLIENT_CREDENTIALS Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_GWS_SERVER_ERROR Warning GES has encountered server or connection errors with GWS. GWS_SERVER_ERROR Triggered when there has been a GWS server error in the past 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HEALTH Critical One or more downstream components (PostGres, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this does not fire when Redis is down. GES_HEALTH Triggered when any component is down for any length of time.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_400_POD Info An individual GES pod is returning excessive HTTP 400 results. ges_http_failed_requests_total, http_400_tolerance Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_401_POD Info An individual GES pod is returning excessive HTTP 401 results. ges_http_failed_requests_total, http_401_tolerance Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_404_POD Info An individual GES pod is returning excessive HTTP 404 results. ges_http_failed_requests_total, http_404_tolerance Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_HTTP_500_POD Info An individual GES pod is returning excessive HTTP 500 results. ges_http_failed_requests_total, http_500_tolerance Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_INVALID_CONTENT_LENGTH Info Fires if GES encounters any incoming requests that have exceeded the maximum content length of 10 MB on the internal port and 500 KB for the external, public-facing port. INVALID_CONTENT_LENGTH, invalid_content_length_tolerance Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_LOGGING_FAILURE Warning GES has failed to write a message to the log. LOGGING_FAILURE Triggered when there are any failures writing to the logs. Silenced after 1 minute.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_MEMORY_USAGE Info GES has high memory usage for a period of 90 seconds. ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NEXUS_ACCESS_FAILURE Warning GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. NEXUS_ACCESS_FAILURE Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NOT_READY_CRITICAL Critical GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. kube_pod_container_status_ready Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_NOT_READY_WARNING Warning GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. kube_pod_container_status_ready Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_ORS_REDIS_DOWN Critical Connection to ORS_REDIS is down. ORS_REDIS_STATUS Triggered when the ORS_REDIS connection is down for 5 consecutive minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_PODS_RESTART Critical GES pods have been excessively crashing and restarting. kube_pod_container_status_restarts_total Triggered when there have been more than five pod restarts in the past 15 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_RBAC_CREATE_VQ_PROXY_ERROR Info Fires if there are issues with GES managing VQ Proxy Objects. RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_SLOW_HTTP_RESPONSE_TIME Warning Fired if the average response time for incoming requests begins to lag. ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_UNCAUGHT_EXCEPTION Warning There has been an uncaught exception within GES. UNCAUGHT_EXCEPTION Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute.
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics GES_UP Critical Fires when fewer than two GES pods have been up for the last 15 minutes. Triggered when fewer than two GES pods are up for 15 consecutive minutes.
Draft:PEC-DC/Current/DCPEGuide/DCMetrics Memory usage is above 3000 Mb Critical Triggered when the memory usage on this pod is above 3000 Mb for 15 minutes. nexus_process_resident_memory_bytes For 15 minutes
Draft:PEC-DC/Current/DCPEGuide/DCMetrics Nexus error rate Critical Triggered when the error rate on this pod is greater than 20% for 15 minutes. nexus_errors_total, nexus_request_total For 15 minutes
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts Database connections above 75 HIGH Triggered when the number of pod database connections is above 75. Default number of connections: 75
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts IWD DB errors CRITICAL Triggered when IWD experiences more than 2 errors within 1 minute during database operations. Default number of errors: 2
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts IWD error rate CRITICAL Triggered when the number of errors in IWD exceeds the threshold over a 15-min period. Default number of errors: 2
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts Memory usage is above 3000 Mb CRITICAL Triggered when the pod memory usage is above 3000 MB. Default memory usage: 3000 MB
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-API-LatencyHigh HIGH Triggered when the latency for API responses is beyond the defined threshold. 2500ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-API-Redis-Connection-Failed HIGH Triggered when the connection to redis fails for more than 1 minute. 1m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-EXT-Ingress-Error-Rate HIGH Triggered when the Ingress error rate is above the specified threshold. 20% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics cxc_api_too_many_errors_from_auth HIGH Triggered when there are too many error responses from the auth service for more than the specified time threshold. 1m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-CM-Redis-Connection-Failed HIGH Triggered when the connection to redis fails for more than 1 minute. 1m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-CoM-Redis-no-active-connections HIGH Triggered when CX Contact compliance has no active redis connection for 2 minutes 2m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-Compliance-LatencyHigh HIGH Triggered when the latency for API responses is beyond the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold. 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-DM-LatencyHigh HIGH Triggered when the latency for dial manager is above the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-JS-LatencyHigh HIGH Triggered when the latency for job scheduler is above the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-LB-LatencyHigh HIGH Triggered when the latency for list builder is above the defined threshold. 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-CPUUsage HIGH Triggered when the CPU utilization of a pod is beyond the threshold 300% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-LM-LatencyHigh HIGH Triggered when the latency for list manager is above the defined threshold 5000ms for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-MemoryUsage HIGH Triggered when the memory utilization of a pod is beyond the threshold. 70% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-MemoryUsagePD HIGH Triggered when the memory usage of a pod is above the critical threshold. 90% for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodNotReadyCount HIGH Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodRestartsCount HIGH Triggered when the restart count for a pod is beyond the threshold. 1 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodRestartsCountPD HIGH Triggered when the restart count is beyond the critical threshold. 5 for 5m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics CXC-PodsNotReadyPD HIGH Triggered when there are no pods ready for CX Contact deployment. 0 for 1m
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics cxc_list_manager_too_many_errors_from_auth HIGH Triggered when there are too many error responses from the auth service (list manager) for more than the specified time threshold. 1m
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics gcxi__cluster__info This alert indicates problems with the cluster states. Applicable only if you have two or more nodes in a cluster. gcxi__cluster__info
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics gcxi__projects__status If the value of gcxi__projects__status is greater than 0, this alarm is set, indicating that reporting is not functioning properly. gcxi__projects__status > 0
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics raa-errors '''Specified by''': raa.prometheusRule.alerts.raa-errors.labels.severity in values.yaml. '''Recommended value''': warning A nonzero value indicates that errors have been logged during the scrape interval. gcxi_raa_error_count >0
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics raa-health '''Specified by''': raa.prometheusRule.alerts.labels.severity '''Recommended value''': severe A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. gcxi_raa_health_level Specified by: raa.prometheusRule.alerts.health.for '''Recommended value''': 30m
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics raa-long-aggregation '''Specified by''': raa.prometheusRule.alerts.longAggregation.labels.severity in values.yaml. '''Recommended value''': warning Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. '''Recommended value''': 300
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics GcaOOMKilled Critical Triggered when a GCA pod is restarted because of OOMKilled. kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason 1
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics GcaPodCrashLooping Critical Triggered when a GCA pod is crash looping. kube_pod_container_status_restarts_total The restart rate is greater than 0 for 5 minutes
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspFlinkJobDown Critical Triggered when the GSP Flink job is not running (the number of running jobs equals 0 or the metric is not available) flink_jobmanager_numRunningJobs For 5 minutes
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspNoTmRegistered Critical Triggered when there are no registered TaskManagers (or metric not available) flink_jobmanager_numRegisteredTaskManagers For 5 minutes
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspOOMKilled Critical Triggered when a GSP pod is restarted because of OOMKilled kube_pod_container_status_restarts_total 0
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics GspUnknownPerson High Triggered when GSP encounters unknown person(s) flink_taskmanager_job_task_operator_tenant_error_total{error="unknown_person",service="gsp"} For 5 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_connected_configservers Critical Pulse DCU Collector is not connected to ConfigServer. pulse_collector_connection_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_connected_dbservers Critical Pulse DCU Collector is not connected to DbServer. pulse_collector_connection_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_connected_statservers Critical Pulse DCU Collector is not connected to Stat Server. pulse_collector_connection_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_col_snapshot_writing Critical Pulse DCU Collector does not write snapshots. pulse_collector_snapshot_writing_status for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_cpu Critical Detected critical CPU usage by Pulse DCU Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_disk Critical Detected critical disk usage by Pulse DCU Pod. kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes 90%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_memory Critical Detected critical memory usage by Pulse DCU Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_nonrunning_instances Critical Triggered when Pulse DCU instances are down. kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_connected_configservers Critical Pulse DCU Stat Server is not connected to ConfigServer. pulse_statserver_server_connected_seconds for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_connected_ixnservers Critical Pulse DCU Stat Server is not connected to IxnServers. pulse_statserver_server_connected_seconds 2
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_connected_tservers Critical Pulse DCU Stat Server is not connected to T-Servers. pulse_statserver_server_connected_number 2
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_critical_ss_failed_dn_registrations Critical Detected critical DN registration failures on Pulse DCU Stat Server. pulse_statserver_dn_failed, pulse_statserver_dn_registered 0.5%
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_monitor_data_unavailable Critical Pulse DCU Monitor Agents do not provide data. pulse_monitor_check_duration_seconds, kube_statefulset_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics pulse_dcu_too_frequent_restarts Critical Detected too frequent restarts of DCU Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_critical_cpu Critical Detected critical CPU usage by Pulse LDS Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_critical_memory Critical Detected critical memory usage by Pulse LDS Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_critical_nonrunning_instances Critical Triggered when Pulse LDS instances are down. kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_monitor_data_unavailable Critical Pulse LDS Monitor Agents do not provide data. pulse_monitor_check_duration_seconds, kube_statefulset_replicas for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_no_connected_senders Critical Pulse LDS is not connected to upstream servers. pulse_lds_senders_number for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_no_registered_dns Critical No DNs are registered on Pulse LDS. pulse_lds_sender_registered_dns_number for 30 minutes
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics pulse_lds_too_frequent_restarts Critical Detected too frequent restarts of LDS Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_5xx Critical Detected critical 5xx errors per second for Pulse container. http_server_requests_seconds_count 15%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_cpu Critical Detected critical CPU usage by Pulse Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_hikari_cp Critical Detected critical Hikari connections pool usage by Pulse container. hikaricp_connections_active, hikaricp_connections 90%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_memory Critical Detected critical memory usage by Pulse Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_pulse_health Critical Detected critical number of healthy Pulse containers. pulse_health_all_Boolean 50%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_critical_running_instances Critical Triggered when Pulse instances are down. kube_deployment_status_replicas_available, kube_deployment_status_replicas 75%
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_service_down Critical All Pulse instances are down. up for 15 minutes
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics pulse_too_frequent_restarts Critical Detected too frequent restarts of Pulse Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_critical_cpu Critical Detected critical CPU usage by Pulse Permissions Pod. container_cpu_usage_seconds_total, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_critical_memory Critical Detected critical memory usage by Pulse Permissions Pod. container_memory_working_set_bytes, kube_pod_container_resource_limits 90%
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_critical_running_instances Critical Triggered when Pulse Permissions instances are down. kube_deployment_status_replicas_available, kube_deployment_status_replicas 75%
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics pulse_permissions_too_frequent_restarts Critical Detected too frequent restarts of Permissions Pod container. kube_pod_container_status_restarts_total 2 for 1 hour
Draft:STRMS/Current/STRMSPEGuide/ServiceMetrics streams_GWS_AUTH_DOWN critical Unable to connect to GWS auth service gws_auth_down 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_BATCH_LAG_TIME warning Message handling exceeds 2 secs 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_DOWN critical The number of running instances is 0 sum(up) < 1 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_ENDPOINT_CONNECTION_DOWN warning Unable to connect to a customer endpoint endpoint_connection_down 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_ENGAGE_KAFKA_CONNECTION_DOWN critical Unable to connect to Engage Kafka engage_kafka_main_connection_down 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_GWS_AUTH_DOWN Critical Unable to connect to GWS auth service gws_auth_down 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_GWS_CONFIG_DOWN critical Unable to connect to GWS config service gws_config_down
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_GWS_ENV_DOWN critical Unable to connect to GWS environment service gws_env_down 30 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_INIT_ERROR critical Aborted due to an initialization error (for example, KAFKA_FQDN is not defined) application_streams_init_error > 0 10 seconds
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics streams_REDIS_DOWN critical Unable to connect to Redis redis_connection_down 10 seconds
Draft:TLM/Current/TLMPEGuide/TLMMetrics Http Errors Occurrences Exceeded Threshold High Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} >500 in 5 minutes
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry CPU Utilization is Greater Than Threshold High Triggered when average CPU usage is more than 60% node_cpu_seconds_total >60%
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry Dependency Status Low Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus telemetry_dependency_status <80
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry GAuth Time Alert High Triggered when there is no connection to the GAuth service telemetry_gws_auth_req_time >10000
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry Healthy Pod Count Alert High Triggered when the number of healthy pods drops to critical level kube_pod_container_status_ready <2
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry High Network Traffic High Triggered when network traffic exceeds 10MB/second for 5 minutes node_network_transmit_bytes_total, node_network_receive_bytes_total >10MBps
Draft:TLM/Current/TLMPEGuide/TLMMetrics Telemetry Memory Usage is Greater Than Threshold High Triggered when average memory usage is more than 60% container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores >60%
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_elasticsearch_health_status critical Triggered when there is no connection to ElasticSearch ucsx_elasticsearch_health_status 2 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_elasticsearch_slow_processing_time critical Triggered when Elasticsearch internal processing time > 500 ms ucsx_elastic_search_sum, ucsx_elastic_search_count 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_high_cpu_utilization warning Triggered when average CPU usage is more than 80% ucsx_performance 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_high_http_request_rate warning Triggered when the request rate is more than 120 requests per second on one UCS-X instance ucsx_http_request_duration_count 30 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_high_memory_usage warning Triggered when average memory usage is more than 800 MB ucsx_memory 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_overloaded warning Triggered when overload protection rate is more than 0 ucsx_overload_protection_count 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_instance_slow_http_response critical Triggered when average http response time > 500 ms ucsx_http_request_duration_sum, ucsx_http_request_duration_count 5 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_masterdb_health_status warning Triggered when there is no connection to master DB ucsx_masterdb_health_status 2 minutes
Draft:UCS/Current/UCSPEGuide/UCSMetrics ucsx_tenantdb_health_status critical Triggered when there is no connection to tenant DB ucsx_tenantdb_health_status 2 minutes
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Agent service fail Critical Actions: *Check if there is any problem with the pod, then restart the pod. agent_health_level Agent health level is Fail for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Config node fail Warning Actions: *Check if there is any problem with the pod and the config node. http_client_response_count Requests to the config node fail for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Container restarted repeatedly Critical Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total The container was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Kafka events latency is too high Warning Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). kafka_consumer_latency_bucket Latency for more than 5% of messages is more than 0.5 seconds for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Kafka not available Critical Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. kafka_producer_state, kafka_consumer_state Kafka is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Max replicas is not sufficient for 5 mins Critical The desired number of replicas is higher than the current available replicas for the past 5 minutes. kube_statefulset_replicas, kube_statefulset_status_replicas The desired number of replicas is higher than the current available replicas for the past 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod CPU greater than 65% Warning High CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod memory greater than 65% Warning High memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod memory greater than 80% Critical Critical memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status Failed Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status NotReady Critical Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_ready The pod is in NotReady status for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status Pending Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Pod status Unknown Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Possible messages lost Critical Actions: *Check for Kafka and service overload, or network degradation. kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total Number of sent requests is two times higher than received for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Redis not available Critical Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. agent_redis_state, agent_stream_redis_state Redis is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka consumer crashes Critical Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. kafka_consumer_error_total More than 3 Kafka consumer crashes in 5 minutes for the service.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka consumer failed health checks Warning Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. kafka_consumer_error_total Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka consumer request timeouts Warning Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. kafka_consumer_error_total More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics Too many Kafka pending events Critical Actions: *Ensure there are no issues with Kafka or the pod's CPU and network. kafka_producer_queue_depth Too many Kafka producer pending events for the pod (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Container restarted repeatedly Critical Actions: *Check if the new version of the image was deployed. *Check for issues with the Kubernetes cluster. kube_pod_container_status_restarts_total The container was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Kafka events latency is too high Critical Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). kafka_consumer_latency_bucket Latency for more than 5% of messages is more than 0.5 seconds for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Kafka not available Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. kafka_producer_state, kafka_consumer_state Kafka is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Max replicas is not sufficient for 5 mins Critical The desired number of replicas is higher than the current available replicas for the past 5 minutes. kube_statefulset_replicas, kube_statefulset_status_replicas The desired number of replicas is higher than the current available replicas for the past 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod CPU greater than 65% Warning High CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for the pod. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod memory greater than 65% Warning High memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod memory greater than 80% Critical Critical memory usage for the pod. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status Failed Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status NotReady Critical Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_ready The pod is in NotReady status for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status Pending Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Pending state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Pod status Unknown Warning Actions: *Restart the pod. Check if there are any issues with the pod after restart. kube_pod_status_phase The pod is in Unknown state for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Redis not available Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. callthread_redis_state Redis is not available for the pod for 5 consecutive minutes.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka consumer crashes Critical Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. kafka_consumer_error_total More than 3 Kafka consumer crashes in 5 minutes for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka consumer failed health checks Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. kafka_consumer_error_total Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka consumer request timeouts Warning Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. kafka_consumer_error_total More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic.
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics Too many Kafka pending events Critical Actions: *Ensure there are no issues with Kafka or the service's CPU and network. kafka_producer_queue_depth Too many Kafka producer pending events for the service (more than 100 in 5 minutes).
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Container restarted repeatedly Critical Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_container_status_restarts_total The container was restarted 5 or more times within 15 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod CPU greater than 65% Warning High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod CPU greater than 80% Critical Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_cpu_usage_seconds_total, container_spec_cpu_period Container CPU usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod Failed Warning Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. kube_pod_status_phase The pod is in Failed state.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod memory greater than 65% Warning High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 65% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod memory greater than 80% Critical Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and if the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes Container memory usage exceeded 80% for 5 minutes.
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics Pod Not ready for 10 minutes Critical Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. kube_pod_status_ready The pod is in NotReady state for 10 minutes.
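
The BasedOn and Threshold columns above map fairly directly onto Prometheus alerting rules. As a rough illustration only, the sketch below shows how two representative rows (an 80% container-CPU alert evaluated over 15 minutes, and a repeated-restart alert) could be written; the group name, label selectors, and exact expressions are assumptions made for this example, not the rules shipped with any Genesys service.

groups:
  - name: example-container-alerts            # hypothetical group name, for illustration only
    rules:
      # Sketch of a ContainerCPUreached80percent-style rule: container CPU usage
      # above 80% of its configured limit, sustained for 15 minutes.
      - alert: ContainerCPUreached80percent
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container)
            / sum(container_spec_cpu_quota{container!=""}
                  / container_spec_cpu_period{container!=""}) by (pod, container)
          > 0.80
        for: 15m
        labels:
          severity: high
        annotations:
          summary: Container CPU utilization above 80% of its limit for 15 minutes
      # Sketch of a ContainerRestartedRepeatedly-style rule: 5 or more restarts
      # of a container within a 15-minute window.
      - alert: ContainerRestartedRepeatedly
        expr: increase(kube_pod_container_status_restarts_total[15m]) >= 5
        labels:
          severity: critical
        annotations:
          summary: Container restarted 5 or more times within 15 minutes

Most other rows follow the same pattern: an expression built from the metrics in the BasedOn column, a duration taken from the Threshold column, and a severity label matching the Severity column.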
