Cargo query
Page | Alert | Severity | AlertDescription | BasedOn | Threshold |
---|---|---|---|---|---|
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_FETCH_RESOURCE_TIMEOUT | MEDIUM | Number of VXMLi fetch timeouts exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | 1min |
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_PARSE_ERROR | WARNING | Number of VXMLi parse errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} | 1min |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerCPUreached80percent | HIGH | The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerMemoryUsage80percent | HIGH | The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins | container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RS or RS SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins | kube_pod_init_container_status_restarts_total | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RS pod status is NotReady for 30 mins; this is controlled through the override-value.yaml file. | kube_pod_status_ready | 30mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC50PercentFilled | HIGH | This trigger will flag an alarm when the RS PVC size is 50% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC80PercentFilled | CRITICAL | This trigger will flag an alarm when the RS PVC size is 80% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 5mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | RSQueueSizeCritical | HIGH | The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins | rsQueueSize | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RM or RM SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. | kube_pod_init_container_status_restarts_total | 15 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | MCPPortsExceeded | HIGH | All the MCP ports in MCP LRG are exceeded | gvp_rm_log_parser_eror_total | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RM pod status is NotReady for 30 mins; this is controlled through the override-value.yaml file. | kube_pod_status_ready | 30mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RM Service Down | CRITICAL | RM pods are not in ready state and RM service is not available | kube_pod_container_status_running | 0 |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMConfigServerConnectionLost | HIGH | RM lost connection to GVP Configuration Server for 5mins. | gvp_rm_log_parser_warn_total | 5 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMInterNodeConnectivityBroken | HIGH | Inter-node connectivity between RM nodes is lost for 5mins. | gvp_rm_log_parser_warn_total | 5 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMMatchingIVRTenantNotFound | MEDIUM | Matching IVR profile tenant could not be found for 2mins | gvp_rm_log_parser_eror_total | 2mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMResourceAllocationFailed | MEDIUM | RM resource allocation failed for 1 min | gvp_rm_log_parser_eror_total | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMServiceDegradedTo50Percentage | HIGH | One of the RM containers is not in the running state for 5 mins | kube_pod_container_status_running | 5mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMSocketInterNodeError | HIGH | RM Inter node Socket Error for 5mins. | gvp_rm_log_parser_eror_total | 5mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal4XXErrorForINVITE | MEDIUM | The RM MIB counter stats are collected every 60 seconds; if the counter total4xxInviteSent increments from its previous value by 10 within 60 seconds, the trigger will flag an alarm. | rmTotal4xxInviteSent | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal5XXErrorForINVITE | HIGH | The RM MIB counter stats are collected every 30 seconds; if the counter total5xxInviteSent increments from its previous value by 5 within 5 minutes, the trigger will flag an alarm. | rmTotal5xxInviteSent | 5 mins |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | CPUThrottling | Critical | Containers are being throttled more than 1 time per second. | container_cpu_cfs_throttled_periods_total | 1 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_500_responces_java | Critical | Too many 500 responses. | gws_responses_total | 10 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_5xx_responces_count | Critical | Too many 5xx responses. | gws_responses_total | 60 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_cpu_usage | Warning | High container CPU usage. | container_cpu_usage_seconds_total | 300% |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_jvm_gc_pause_seconds_count | Critical | JVM garbage collection occurs too often. | jvm_gc_pause_seconds_count | 10 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_jvm_threads_deadlocked | Critical | Deadlocked JVM threads exist. | jvm_threads_deadlocked | 0 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | netstat_Tcp_RetransSegs | Warning | High number of TCP RetransSegs (retransmitted segments). | node_netstat_Tcp_RetransSegs | 2000 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | total_count_of_errors_during_context_initialization | Warning | Total count of errors during context initialization. | gws_context_error_total | 1200 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | total_count_of_errors_in_PSDK_connections | Warning | Total count of errors in PSDK connections. | psdk_conn_error_total | 3 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | DesiredPodsDontMatchSpec | Critical | The Workspace Service deployment doesn't have the desired number of replicas. | kube_deployment_status_replicas_available, kube_deployment_spec_replicas | Fired when the number of available replicas does not equal the configured number. |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_app_workspace_incoming_requests | Critical | High rate of incoming requests from Workspace Web Edition. | gws_app_workspace_incoming_requests | 10 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_500_responces_workspace | Critical | The Workspace Service has too many 500 responses. | gws_app_workspace_requests | 10 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_cpu_usage | Warning | High container CPU usage. | container_cpu_usage_seconds_total | 300% |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_nodejs_eventloop_lag_seconds | Critical | The Node.js event loop is too slow. | nodejs_eventloop_lag_seconds | 0.2 |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES-NODE-JS-DELAY-WARNING | Warning | Triggers if the base Node.js event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. | application_ccecp_nodejs_eventloop_lag_seconds | Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_ENQUEUE_LIMIT_REACHED | Info | GES is throttling callbacks to a given phone number. | CB_ENQUEUE_LIMIT_REACHED | Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_SUBMIT_FAILED | Info | GES has failed to submit a callback to ORS. | CB_SUBMIT_FAILED | Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_TTL_LIMIT_REACHED | Info | GES is throttling callbacks for a specific tenant. | CB_TTL_LIMIT_REACHED | Triggered when GES has started throttling callbacks within the past 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CPU_USAGE | Info | GES has high CPU usage for 1 minute. | ges_process_cpu_seconds_total | Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_DNS_FAILURE | Warning | A GES pod has encountered difficulty resolving DNS requests. | DNS_FAILURE | Triggered when GES encounters any DNS failures within the last 30 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_AUTH_DOWN | Warning | Connection to the Genesys Authentication Service is down. | GWS_AUTH_STATUS | Triggered when the connection to the Genesys Authentication Service is down for 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_CONFIG_DOWN | Warning | Connection to the GWS Configuration Service is down. | GWS_CONFIG_STATUS | Triggered when the connection to the GWS Configuration Service is down. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_ENVIRONMENT_DOWN | Warning | Connection to the GWS Environment Service is down. | GWS_ENV_STATUS | Triggered when the connection to the GWS Environment Service is down. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_INCORRECT_CLIENT_CREDENTIALS | Warning | The GWS client credentials provided to GES are incorrect. | GWS_INCORRECT_CLIENT_CREDENTIALS | Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_SERVER_ERROR | Warning | GES has encountered server or connection errors with GWS. | GWS_SERVER_ERROR | Triggered when there has been a GWS server error in the past 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HEALTH | Critical | One or more downstream components (PostGres, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this does not fire when Redis is down. | GES_HEALTH | Triggered when any component is down for any length of time. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_400_POD | Info | An individual GES pod is returning excessive HTTP 400 results. | ges_http_failed_requests_total, http_400_tolerance | Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_401_POD | Info | An individual GES pod is returning excessive HTTP 401 results. | ges_http_failed_requests_total, http_401_tolerance | Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_404_POD | Info | An individual GES pod is returning excessive HTTP 404 results. | ges_http_failed_requests_total, http_404_tolerance | Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_500_POD | Info | An individual GES pod is returning excessive HTTP 500 results. | ges_http_failed_requests_total, http_500_tolerance | Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_INVALID_CONTENT_LENGTH | Info | Fires if GES encounters any incoming requests that exceed the maximum content length of 10 MB on the internal port and 500 KB for the external, public-facing port. | INVALID_CONTENT_LENGTH, invalid_content_length_tolerance | Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_LOGGING_FAILURE | Warning | GES has failed to write a message to the log. | LOGGING_FAILURE | Triggered when there are any failures writing to the logs. Silenced after 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_MEMORY_USAGE | Info | GES has high memory usage for a period of 90 seconds. | ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes | Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NEXUS_ACCESS_FAILURE | Warning | GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. | NEXUS_ACCESS_FAILURE | Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_CRITICAL | Critical | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_WARNING | Warning | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_ORS_REDIS_DOWN | Critical | Connection to ORS_REDIS is down. | ORS_REDIS_STATUS | Triggered when the ORS_REDIS connection is down for 5 consecutive minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_PODS_RESTART | Critical | GES pods have been excessively crashing and restarting. | kube_pod_container_status_restarts_total | Triggered when there have been more than five pod restarts in the past 15 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_RBAC_CREATE_VQ_PROXY_ERROR | Info | Fires if there are issues with GES managing VQ Proxy Objects. | RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance | Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_SLOW_HTTP_RESPONSE_TIME | Warning | Fired if the average response time for incoming requests begins to lag. | ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count | Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UNCAUGHT_EXCEPTION | Warning | There has been an uncaught exception within GES. | UNCAUGHT_EXCEPTION | Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UP | Critical | Fires when fewer than two GES pods have been up for the last 15 minutes. | | Triggered when fewer than two GES pods are up for 15 consecutive minutes. |
Draft:PEC-DC/Current/DCPEGuide/DCMetrics | Memory usage is above 3000 Mb | Critical | Triggered when the memory usage on this pod is above 3000 Mb for 15 minutes. | nexus_process_resident_memory_bytes | For 15 minutes |
Draft:PEC-DC/Current/DCPEGuide/DCMetrics | Nexus error rate | Critical | Triggered when the error rate on this pod is greater than 20% for 15 minutes. | nexus_errors_total, nexus_request_total | For 15 minutes |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Database connections above 75 | HIGH | Triggered when the number of pod database connections is above 75. | Default number of connections: 75 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD DB errors | CRITICAL | Triggered when IWD experiences more than 2 errors within 1 minute during database operations. | Default number of errors: 2 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD error rate | CRITICAL | Triggered when the number of errors in IWD exceeds the threshold for a 15-minute period. | Default number of errors: 2 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Memory usage is above 3000 Mb | CRITICAL | Triggered when the pod memory usage is above 3000 MB. | Default memory usage: 3000 MB | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | 2500ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-Redis-Connection-Failed | HIGH | Triggered when the connection to redis fails for more than 1 minute. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-EXT-Ingress-Error-Rate | HIGH | Triggered when the Ingress error rate is above the specified threshold. | 20% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | cxc_api_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service for more than the specified time threshold. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CM-Redis-Connection-Failed | HIGH | Triggered when the connection to redis fails for more than 1 minute. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CoM-Redis-no-active-connections | HIGH | Triggered when CX Contact compliance has no active redis connection for 2 minutes | 2m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-Compliance-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-DM-LatencyHigh | HIGH | Triggered when the latency for dial manager is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-JS-LatencyHigh | HIGH | Triggered when the latency for job scheduler is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-LB-LatencyHigh | HIGH | Triggered when the latency for list builder is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-LM-LatencyHigh | HIGH | Triggered when the latency for list manager is above the defined threshold | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | cxc_list_manager_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service (list manager) for more than the specified time threshold. | 1m | |
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics | gcxi__cluster__info | | This alert indicates problems with the cluster states. Applicable only if you have two or more nodes in a cluster. | gcxi__cluster__info | |
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics | gcxi__projects__status | | If the value of gcxi__projects__status is greater than 0, this alarm is set, indicating that reporting is not functioning properly. | gcxi__projects__status | > 0 |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-errors | '''Specified by''': raa. '''Recommended value''': warning | A nonzero value indicates that errors have been logged during the scrape interval. | gcxi_raa_error_count | >0 |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-health | '''Specified by''': raa. '''Recommended value''': severe | A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | gcxi_raa_health_level | '''Specified by''': raa. '''Recommended value''': 30m |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-long-aggregation | '''Specified by''': raa. '''Recommended value''': warning | Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. | gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count | Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. '''Recommended value''': 300 |
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics | GcaOOMKilled | Critical | Triggered when a GCA pod is restarted because of OOMKilled. | kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason | 1 |
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics | GcaPodCrashLooping | Critical | Triggered when a GCA pod is crash looping. | kube_pod_container_status_restarts_total | The restart rate is greater than 0 for 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspFlinkJobDown | Critical | Triggered when the GSP Flink job is not running (the number of running jobs equals 0 or the metric is not available) | flink_jobmanager_numRunningJobs | For 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspNoTmRegistered | Critical | Triggered when there are no registered TaskManagers (or the metric is not available) | flink_jobmanager_numRegisteredTaskManagers | For 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspOOMKilled | Critical | Triggered when a GSP pod is restarted because of OOMKilled | kube_pod_container_status_restarts_total | 0 |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspUnknownPerson | High | Triggered when GSP encounters unknown person(s) | flink_ | For 5 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_configservers | Critical | Pulse DCU Collector is not connected to ConfigServer. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_dbservers | Critical | Pulse DCU Collector is not connected to DbServer. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_statservers | Critical | Pulse DCU Collector is not connected to Stat Server. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_snapshot_writing | Critical | Pulse DCU Collector does not write snapshots. | pulse_collector_snapshot_writing_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_cpu | Critical | Detected critical CPU usage by Pulse DCU Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_disk | Critical | Detected critical disk usage by Pulse DCU Pod. | kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_memory | Critical | Detected critical memory usage by Pulse DCU Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_nonrunning_instances | Critical | Triggered when Pulse DCU instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_configservers | Critical | Pulse DCU Stat Server is not connected to ConfigServer. | pulse_statserver_server_connected_seconds | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_ixnservers | Critical | Pulse DCU Stat Server is not connected to IxnServers. | pulse_statserver_server_connected_seconds | 2 |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_tservers | Critical | Pulse DCU Stat Server is not connected to T-Servers. | pulse_statserver_server_connected_number | 2 |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_failed_dn_registrations | Critical | Detected critical DN registration failures on Pulse DCU Stat Server. | pulse_statserver_dn_failed, pulse_statserver_dn_registered | 0.5% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_monitor_data_unavailable | Critical | Pulse DCU Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_too_frequent_restarts | Critical | Detected too frequent restarts of DCU Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_cpu | Critical | Detected critical CPU usage by Pulse LDS Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_memory | Critical | Detected critical memory usage by Pulse LDS Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_nonrunning_instances | Critical | Triggered when Pulse LDS instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_monitor_data_unavailable | Critical | Pulse LDS Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_no_connected_senders | Critical | Pulse LDS is not connected to upstream servers. | pulse_lds_senders_number | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_no_registered_dns | Critical | No DNs are registered on Pulse LDS. | pulse_lds_sender_registered_dns_number | for 30 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_too_frequent_restarts | Critical | Detected too frequent restarts of LDS Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_5xx | Critical | Detected critical 5xx errors per second for Pulse container. | http_server_requests_seconds_count | 15% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_cpu | Critical | Detected critical CPU usage by Pulse Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_hikari_cp | Critical | Detected critical Hikari connections pool usage by Pulse container. | hikaricp_connections_active, hikaricp_connections | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_memory | Critical | Detected critical memory usage by Pulse Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_pulse_health | Critical | Detected critical number of healthy Pulse containers. | pulse_health_all_Boolean | 50% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_running_instances | Critical | Triggered when Pulse instances are down. | kube_deployment_status_replicas_available, kube_deployment_status_replicas | 75% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_service_down | Critical | All Pulse instances are down. | up | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_too_frequent_restarts | Critical | Detected too frequent restarts of Pulse Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_cpu | Critical | Detected critical CPU usage by Pulse Permissions Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_memory | Critical | Detected critical memory usage by Pulse Permissions Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_running_instances | Critical | Triggered when Pulse Permissions instances are down. | kube_deployment_status_replicas_available, kube_deployment_status_replicas | 75% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_too_frequent_restarts | Critical | Detected too frequent restarts of Permissions Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:STRMS/Current/STRMSPEGuide/ServiceMetrics | streams_GWS_AUTH_DOWN | critical | Unable to connect to GWS auth service | gws_auth_down | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_BATCH_LAG_TIME | warning | Message handling exceeds 2 secs | 30 seconds | |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_DOWN | critical | The number of running instances is 0 | sum(up) < 1 | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_ENDPOINT_CONNECTION_DOWN | warning | Unable to connect to a customer endpoint | endpoint_connection_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_ENGAGE_KAFKA_CONNECTION_DOWN | critical | Unable to connect to Engage Kafka | engage_kafka_main_connection_down | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_AUTH_DOWN | Critical | Unable to connect to GWS auth service | gws_auth_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_CONFIG_DOWN | critical | Unable to connect to GWS config service | gws_config_down | |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_ENV_DOWN | critical | Unable to connect to GWS environment service | gws_env_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_INIT_ERROR | critical | Aborted due to an initialization error, e.g., KAFKA_FQDN is not defined | application_streams_init_error > 0 | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_REDIS_DOWN | critical | Unable to connect to Redis | redis_connection_down | 10 seconds |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Http Errors Occurrences Exceeded Threshold | High | Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes | telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} | >500 in 5 minutes |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry CPU Utilization is Greater Than Threshold | High | Triggered when average CPU usage is more than 60% | node_cpu_seconds_total | >60% |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Dependency Status | Low | Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus | telemetry_dependency_status | <80 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry GAuth Time Alert | High | Triggered when there is no connection to the GAuth service | telemetry_gws_auth_req_time | >10000 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Healthy Pod Count Alert | High | Triggered when the number of healthy pods drops to critical level | kube_pod_container_status_ready | <2 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry High Network Traffic | High | Triggered when network traffic exceeds 10MB/second for 5 minutes | node_network_transmit_bytes_total, node_network_receive_bytes_total | >10MBps |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Memory Usage is Greater Than Threshold | High | Triggered when average memory usage is more than 60% | container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores | >60% |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_elasticsearch_health_status | critical | Triggered when there is no connection to ElasticSearch | ucsx_elasticsearch_health_status | 2 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_elasticsearch_slow_processing_time | critical | Triggered when Elasticsearch internal processing time > 500 ms | ucsx_elastic_search_sum, ucsx_elastic_search_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_cpu_utilization | warning | Triggered when average CPU usage is more than 80% | ucsx_performance | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_http_request_rate | warning | Triggered when the request rate is more than 120 requests per second on one UCS-X instance | ucsx_http_request_duration_count | 30 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_memory_usage | warning | Triggered when average memory usage is more than 800 MB | ucsx_memory | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_overloaded | warning | Triggered when overload protection rate is more than 0 | ucsx_overload_protection_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_slow_http_response | critical | Triggered when average http response time > 500 ms | ucsx_http_request_duration_sum, ucsx_http_request_duration_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_masterdb_health_status | warning | Triggered when there is no connection to master DB | ucsx_masterdb_health_status | 2 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_tenantdb_health_status | critical | Triggered when there is no connection to tenant DB | ucsx_tenantdb_health_status | 2 minutes |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Agent service fail | Critical | Actions: *Check if there is any problem with the pod, then restart the pod. | agent_health_level | Agent health level is Fail for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Config node fail | Warning | Actions: *Check if there is any problem with the pod and the config node. | http_client_response_count | Requests to the config node fail for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Container restarted repeatedly | Critical | Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Kafka events latency is too high | Warning | Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Kafka not available | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Failed | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status NotReady | Critical | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady status for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Pending | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Unknown | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Possible messages lost | Critical | Actions: *Check for Kafka and service overload and for network degradation. | kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total | The number of sent requests is two times higher than received for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Redis not available | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | agent_redis_state, agent_stream_redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for the service. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka pending events | Critical | Actions: *Ensure there are no issues with Kafka or with the pod's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the pod (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Container restarted repeatedly | Critical | Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Kafka events latency is too high | Critical | Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Kafka not available | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Failed | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status NotReady | Critical | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady status for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Pending | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Unknown | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Redis not available | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | callthread_redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka pending events | Critical | Actions: *Ensure there are no issues with Kafka or with the service's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the service (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Container restarted repeatedly | Critical | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Failed | Warning | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod has failed. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Not ready for 10 minutes | Critical | Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in NotReady state for 10 minutes. |
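Most thresholds in this table follow one pattern: a PromQL expression over the listed BasedOn metric(s) must hold for the stated duration before the alert fires. As a minimal sketch only, assuming a standard Prometheus rule file and cAdvisor/kube-state-metrics label conventions (the group name, severity label, and quota arithmetic below are illustrative assumptions, not the shipped Helm-chart rule), a CPU alert in the shape of ContainerCPUreached80percent could look roughly like this:

```yaml
# Hypothetical sketch of a "container CPU above 80% of its quota for 15 minutes" rule.
# Metric names come from the table above; everything else is an assumption.
groups:
  - name: example.container.cpu
    rules:
      - alert: ContainerCPUreached80percent
        expr: |
          sum by (pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
          sum by (pod, container) (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})
            > 0.80
        for: 15m
        labels:
          severity: HIGH
        annotations:
          summary: "Container CPU utilization has been above 80% of its quota for 15 minutes"
```

Alerts keyed to log-derived counters (for example, gvp_rm_log_parser_eror_total) would follow the same shape, typically comparing an increase() over the stated window against the listed threshold.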