Cargo query
Page | Alert | Severity | AlertDescription | BasedOn | Threshold |
---|---|---|---|---|---|
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_FETCH_RESOURCE_TIMEOUT | MEDIUM | Number of VXMLi fetch timeouts exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40026",endpoint="mcplog"...} | 1min |
Draft:GVP/Current/GVPPEGuide/GVP MCP Metrics | NGI_LOG_PARSE_ERROR | WARNING | Number of VXMLi parse errors exceeded limit | gvp_mcp_log_parser_eror_total {LogID="40028",endpoint="mcplog"...} | 1min |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerCPUreached80percent | HIGH | The trigger will flag an alarm when the RS container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerMemoryUsage80percent | HIGH | The trigger will flag an alarm when the RS container Memory utilization goes beyond 80% for 15 mins | container_memory_usage_bytes, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RS or RS SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RS init container fails 5 or more times within 15 mins | kube_pod_init_container_status_restarts_total | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RS pod status is NotReady for 30 mins; this is controlled through the override-value.yaml file. | kube_pod_status_ready | 30mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC50PercentFilled | HIGH | This trigger will flag an alarm when the RS PVC size is 50% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | PVC80PercentFilled | CRITICAL | This trigger will flag an alarm when the RS PVC size is 80% filled | kubelet_volume_stats_used_bytes, kubelet_volume_stats_capacity_bytes | 5mins |
Draft:GVP/Current/GVPPEGuide/Reporting Server Metrics | RSQueueSizeCritical | HIGH | The trigger will flag an alarm when RS JMS message queue size goes beyond 15000 (3GB approx. backlog) for 15 mins | rsQueueSize | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerCPUreached80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container CPU utilization goes beyond 80% for 15 mins | container_cpu_usage_seconds_total, container_spec_cpu_quota, container_spec_cpu_period | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM0 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerMemoryUsage80percentForRM1 | HIGH | The trigger will flag an alarm when the RM container Memory utilization goes beyond 80% for 15 mins | container_memory_rss, kube_pod_container_resource_limits_memory_bytes | 15mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | ContainerRestartedRepeatedly | CRITICAL | The trigger will flag an alarm when the RM or RM SNMP container gets restarted 5 or more times within 15 mins | kube_pod_container_status_restarts_total | 15 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | InitContainerFailingRepeatedly | CRITICAL | The trigger will flag an alarm when the RM init container fails 5 or more times within 15 mins. | kube_pod_init_container_status_restarts_total | 15 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | MCPPortsExceeded | HIGH | All the MCP ports in MCP LRG are exceeded | gvp_rm_log_parser_eror_total | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | PodStatusNotReady | CRITICAL | The trigger will flag an alarm when the RM pod status is NotReady for 30 mins; this is controlled through the override-value.yaml file. | kube_pod_status_ready | 30mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RM Service Down | CRITICAL | RM pods are not in ready state and RM service is not available | kube_pod_container_status_running | 0 |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMConfigServerConnectionLost | HIGH | RM lost connection to GVP Configuration Server for 5mins. | gvp_rm_log_parser_warn_total | 5 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMInterNodeConnectivityBroken | HIGH | Inter-node connectivity between RM nodes is lost for 5mins. | gvp_rm_log_parser_warn_total | 5 mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMMatchingIVRTenantNotFound | MEDIUM | Matching IVR profile tenant could not be found for 2mins | gvp_rm_log_parser_eror_total | 2mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMResourceAllocationFailed | MEDIUM | RM resource allocation failed for 1 min | gvp_rm_log_parser_eror_total | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMServiceDegradedTo50Percentage | HIGH | One of the RM containers is not in the running state for 5 mins | kube_pod_container_status_running | 5mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMSocketInterNodeError | HIGH | RM Inter node Socket Error for 5mins. | gvp_rm_log_parser_eror_total | 5mins |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal4XXErrorForINVITE | MEDIUM | The RM MIB counter stats are collected every 60 seconds; if the counter total4xxInviteSent increments from its previous value by 10 within 60 seconds, the trigger will flag an alarm. | rmTotal4xxInviteSent | 1min |
Draft:GVP/Current/GVPPEGuide/Resource Manager Metrics | RMTotal5XXErrorForINVITE | HIGH | The RM MIB counter stats are collected every 30 seconds; if the counter total5xxInviteSent increments from its previous value by 5 within 5 minutes, the trigger will flag an alarm. | rmTotal5xxInviteSent | 5 mins |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | CPUThrottling | Critical | Containers are being throttled more than 1 time per second. | container_cpu_cfs_throttled_periods_total | 1 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_500_responces_java | Critical | Too many 500 responses. | gws_responses_total | 10 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_5xx_responces_count | Critical | Too many 5xx responses. | gws_responses_total | 60 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_cpu_usage | Warning | High container CPU usage. | container_cpu_usage_seconds_total | 300% |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_high_jvm_gc_pause_seconds_count | Critical | JVM garbage collection occurs too often. | jvm_gc_pause_seconds_count | 10 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | gws_jvm_threads_deadlocked | Critical | Deadlocked JVM threads exist. | jvm_threads_deadlocked | 0 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | netstat_Tcp_RetransSegs | Warning | High number of TCP RetransSegs (retransmitted segments). | node_netstat_Tcp_RetransSegs | 2000 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | total_count_of_errors_during_context_initialization | Warning | Total count of errors during context initialization. | gws_context_error_total | 1200 |
Draft:GWS/Current/GWSPEGuide/GWSMetrics | total_count_of_errors_in_PSDK_connections | Warning | Total count of errors in PSDK connections. | psdk_conn_error_total | 3 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | DesiredPodsDontMatchSpec | Critical | The Workspace Service deployment doesn't have the desired number of replicas. | kube_deployment_status_replicas_available, kube_deployment_spec_replicas | Fired when the number of available replicas does not equal the configured number. |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_app_workspace_incoming_requests | Critical | High rate of incoming requests from Workspace Web Edition. | gws_app_workspace_incoming_requests | 10 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_500_responces_workspace | Critical | The Workspace Service has too many 500 responses. | gws_app_workspace_requests | 10 |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_cpu_usage | Warning | High container CPU usage. | container_cpu_usage_seconds_total | 300% |
Draft:GWS/Current/GWSPEGuide/WorkspaceMetrics | gws_high_nodejs_eventloop_lag_seconds | Critical | The Node.js event loop is too slow. | nodejs_eventloop_lag_seconds | 0.2 |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES-NODE-JS-DELAY-WARNING | Warning | Triggers if the base Node.js event loop lag becomes excessive. This indicates significant resource and performance issues with the deployment. | application_ccecp_nodejs_eventloop_lag_seconds | Triggered when the event loop lag is greater than 5 milliseconds for a period exceeding 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_ENQUEUE_LIMIT_REACHED | Info | GES is throttling callbacks to a given phone number. | CB_ENQUEUE_LIMIT_REACHED | Triggered when GES has begun throttling callbacks to a given number within the past 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_SUBMIT_FAILED | Info | GES has failed to submit a callback to ORS. | CB_SUBMIT_FAILED | Triggered when GES has failed to submit a callback to ORS in the past 2 minutes for any reason. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CB_TTL_LIMIT_REACHED | Info | GES is throttling callbacks for a specific tenant. | CB_TTL_LIMIT_REACHED | Triggered when GES has started throttling callbacks within the past 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_CPU_USAGE | Info | GES has high CPU usage for 1 minute. | ges_process_cpu_seconds_total | Triggered when the average CPU usage (measured by ges_process_cpu_seconds_total) is greater than 90% for 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_DNS_FAILURE | Warning | A GES pod has encountered difficulty resolving DNS requests. | DNS_FAILURE | Triggered when GES encounters any DNS failures within the last 30 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_AUTH_DOWN | Warning | Connection to the Genesys Authentication Service is down. | GWS_AUTH_STATUS | Triggered when the connection to the Genesys Authentication Service is down for 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_CONFIG_DOWN | Warning | Connection to the GWS Configuration Service is down. | GWS_CONFIG_STATUS | Triggered when the connection to the GWS Configuration Service is down. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_ENVIRONMENT_DOWN | Warning | Connection to the GWS Environment Service is down. | GWS_ENV_STATUS | Triggered when the connection to the GWS Environment Service is down. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_INCORRECT_CLIENT_CREDENTIALS | Warning | The GWS client credentials provided to GES are incorrect. | GWS_INCORRECT_CLIENT_CREDENTIALS | Triggered when GWS has had any issue with the GES client credentials in the last 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_GWS_SERVER_ERROR | Warning | GES has encountered server or connection errors with GWS. | GWS_SERVER_ERROR | Triggered when there has been a GWS server error in the past 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HEALTH | Critical | One or more downstream components (PostGres, Config Server, GWS, ORS) are down. '''Note:''' Because GES goes into a crash loop when Redis is down, this does not fire when Redis is down. | GES_HEALTH | Triggered when any component is down for any length of time. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_400_POD | Info | An individual GES pod is returning excessive HTTP 400 results. | ges_http_failed_requests_total, http_400_tolerance | Triggered when two or more HTTP 400 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_401_POD | Info | An individual GES pod is returning excessive HTTP 401 results. | ges_http_failed_requests_total, http_401_tolerance | Triggered when two or more HTTP 401 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_404_POD | Info | An individual GES pod is returning excessive HTTP 404 results. | ges_http_failed_requests_total, http_404_tolerance | Triggered when two or more HTTP 404 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_HTTP_500_POD | Info | An individual GES pod is returning excessive HTTP 500 results. | ges_http_failed_requests_total, http_500_tolerance | Triggered when two or more HTTP 500 results are returned from a pod within a 5-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_INVALID_CONTENT_LENGTH | Info | Fires if GES encounters any incoming requests that exceed the maximum content length of 10 MB on the internal port and 500 KB for the external, public-facing port. | INVALID_CONTENT_LENGTH, invalid_content_length_tolerance | Triggered when one instance of a message with an invalid length is received. Silenced after 2 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_LOGGING_FAILURE | Warning | GES has failed to write a message to the log. | LOGGING_FAILURE | Triggered when there are any failures writing to the logs. Silenced after 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_MEMORY_USAGE | Info | GES has high memory usage for a period of 90 seconds. | ges_nodejs_heap_space_size_used_bytes, ges_nodejs_heap_space_size_available_bytes | Triggered when memory usage (measured as a ratio of Used Heap Space vs Available Heap Space) is above 80% for a 90-second interval. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NEXUS_ACCESS_FAILURE | Warning | GES has been having difficulties contacting Nexus. This alert is only relevant for customers who leverage the Push Notification feature in Genesys Callback. | NEXUS_ACCESS_FAILURE | Triggered when GES has failed to connect or communicate with Nexus more than 30 times over the last hour. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_CRITICAL | Critical | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when more than 50% of GES pods have not been in a Ready state for 5 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_NOT_READY_WARNING | Warning | GES pods are not in the Ready state. Indicative of issues with the Redis connection or other problems with the Helm deployment. | kube_pod_container_status_ready | Triggered when 25% (or more) of GES pods have not been in a Ready state for 10 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_ORS_REDIS_DOWN | Critical | Connection to ORS_REDIS is down. | ORS_REDIS_STATUS | Triggered when the ORS_REDIS connection is down for 5 consecutive minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_PODS_RESTART | Critical | GES pods have been excessively crashing and restarting. | kube_pod_container_status_restarts_total | Triggered when there have been more than five pod restarts in the past 15 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_RBAC_CREATE_VQ_PROXY_ERROR | Info | Fires if there are issues with GES managing VQ Proxy Objects. | RBAC_CREATE_VQ_PROXY_ERROR, rbac_create_vq_proxy_error_tolerance | Triggered when there are at least 1000 instances of issues managing VQ Proxy objects within a 10-minute period. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_SLOW_HTTP_RESPONSE_TIME | Warning | Fired if the average response time for incoming requests begins to lag. | ges_http_request_duration_seconds_sum, ges_http_request_duration_seconds_count | Triggered when the average response time for incoming requests is above 1.5 seconds for a sustained period of 15 minutes. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UNCAUGHT_EXCEPTION | Warning | There has been an uncaught exception within GES. | UNCAUGHT_EXCEPTION | Triggered when GES encounters any uncaught exceptions. Silenced after 1 minute. |
Draft:PEC-CAB/Current/CABPEGuide/CallbackMetrics | GES_UP | Critical | Fires when fewer than two GES pods have been up for the last 15 minutes. | | Triggered when fewer than two GES pods are up for 15 consecutive minutes. |
Draft:PEC-DC/Current/DCPEGuide/DCMetrics | Memory usage is above 3000 Mb | Critical | Triggered when the memory usage on this pod is above 3000 Mb for 15 minutes. | nexus_process_resident_memory_bytes | For 15 minutes |
Draft:PEC-DC/Current/DCPEGuide/DCMetrics | Nexus error rate | Critical | Triggered when the error rate on this pod is greater than 20% for 15 minutes. | nexus_errors_total, nexus_request_total | For 15 minutes |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Database connections above 75 | HIGH | Triggered when the number of pod database connections is above 75. | Default number of connections: 75 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD DB errors | CRITICAL | Triggered when IWD experiences more than 2 errors within 1 minute during database operations. | Default number of errors: 2 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | IWD error rate | CRITICAL | Triggered when the number of errors in IWD exceeds the threshold for a 15-minute period. | Default number of errors: 2 | |
Draft:PEC-IWD/Current/IWDPEGuide/IWD metrics and alerts | Memory usage is above 3000 Mb | CRITICAL | Triggered when the pod memory usage is above 3000 MB. | Default memory usage: 3000 MB | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | 2500ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-API-Redis-Connection-Failed | HIGH | Triggered when the connection to redis fails for more than 1 minute. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-EXT-Ingress-Error-Rate | HIGH | Triggered when the Ingress error rate is above the specified threshold. | 20% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/APIAMetrics | cxc_api_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service for more than the specified time threshold. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CM-Redis-Connection-Failed | HIGH | Triggered when the connection to redis fails for more than 1 minute. | 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPGMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CoM-Redis-no-active-connections | HIGH | Triggered when CX Contact compliance has no active redis connection for 2 minutes | 2m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-Compliance-LatencyHigh | HIGH | Triggered when the latency for API responses is beyond the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold. | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/CPLMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-DM-LatencyHigh | HIGH | Triggered when the latency for dial manager is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/DMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-JS-LatencyHigh | HIGH | Triggered when the latency for job scheduler is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/JSMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-LB-LatencyHigh | HIGH | Triggered when the latency for list builder is above the defined threshold. | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LBMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-CPUUsage | HIGH | Triggered when the CPU utilization of a pod is beyond the threshold | 300% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-LM-LatencyHigh | HIGH | Triggered when the latency for list manager is above the defined threshold | 5000ms for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-MemoryUsage | HIGH | Triggered when the memory utilization of a pod is beyond the threshold. | 70% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-MemoryUsagePD | HIGH | Triggered when the memory usage of a pod is above the critical threshold. | 90% for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodNotReadyCount | HIGH | Triggered when the number of pods ready for a CX Contact deployment is less than or equal to the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodRestartsCount | HIGH | Triggered when the restart count for a pod is beyond the threshold. | 1 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodRestartsCountPD | HIGH | Triggered when the restart count is beyond the critical threshold. | 5 for 5m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | CXC-PodsNotReadyPD | HIGH | Triggered when there are no pods ready for CX Contact deployment. | 0 for 1m | |
Draft:PEC-OU/Current/CXCPEGuide/LMMetrics | cxc_list_manager_too_many_errors_from_auth | HIGH | Triggered when there are too many error responses from the auth service (list manager) for more than the specified time threshold. | 1m | |
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics | gcxi__cluster__info | | This alert indicates problems with the cluster states. Applicable only if you have two or more nodes in a cluster. | gcxi__cluster__info | |
Draft:PEC-REP/Current/GCXIPEGuide/GCXIMetrics | gcxi__projects__status | | If the value of gcxi__projects__status is greater than 0, this alarm is set, indicating that reporting is not functioning properly. | gcxi__projects__status | > 0 |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-errors | '''Specified by''': raa. '''Recommended value''': warning | A nonzero value indicates that errors have been logged during the scrape interval. | gcxi_raa_error_count | >0 |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-health | '''Specified by''': raa. '''Recommended value''': severe | A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | gcxi_raa_health_level | '''Specified by''': raa. '''Recommended value''': 30m |
Draft:PEC-REP/Current/GCXIPEGuide/RAAMetrics | raa-long-aggregation | '''Specified by''': raa. '''Recommended value''': warning | Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. | gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count | Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. '''Recommended value''': 300 |
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics | GcaOOMKilled | Critical | Triggered when a GCA pod is restarted because of OOMKilled. | kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason | 1 |
Draft:PEC-REP/Current/GIMPEGuide/GCAMetrics | GcaPodCrashLooping | Critical | Triggered when a GCA pod is crash looping. | kube_pod_container_status_restarts_total | The restart rate is greater than 0 for 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspFlinkJobDown | Critical | Triggered when the GSP Flink job is not running (the number of running jobs equals 0 or the metric is not available) | flink_jobmanager_numRunningJobs | For 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspNoTmRegistered | Critical | Triggered when there are no registered TaskManagers (or the metric is not available) | flink_jobmanager_numRegisteredTaskManagers | For 5 minutes |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspOOMKilled | Critical | Triggered when a GSP pod is restarted because of OOMKilled | kube_pod_container_status_restarts_total | 0 |
Draft:PEC-REP/Current/GIMPEGuide/GSPMetrics | GspUnknownPerson | High | Triggered when GSP encounters unknown person(s) | flink_ | For 5 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_configservers | Critical | Pulse DCU Collector is not connected to ConfigServer. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_dbservers | Critical | Pulse DCU Collector is not connected to DbServer. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_connected_statservers | Critical | Pulse DCU Collector is not connected to Stat Server. | pulse_collector_connection_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_col_snapshot_writing | Critical | Pulse DCU Collector does not write snapshots. | pulse_collector_snapshot_writing_status | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_cpu | Critical | Detected critical CPU usage by Pulse DCU Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_disk | Critical | Detected critical disk usage by Pulse DCU Pod. | kubelet_volume_stats_available_bytes, kubelet_volume_stats_capacity_bytes | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_memory | Critical | Detected critical memory usage by Pulse DCU Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_nonrunning_instances | Critical | Triggered when Pulse DCU instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_configservers | Critical | Pulse DCU Stat Server is not connected to ConfigServer. | pulse_statserver_server_connected_seconds | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_ixnservers | Critical | Pulse DCU Stat Server is not connected to IxnServers. | pulse_statserver_server_connected_seconds | 2 |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_connected_tservers | Critical | Pulse DCU Stat Server is not connected to T-Servers. | pulse_statserver_server_connected_number | 2 |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_critical_ss_failed_dn_registrations | Critical | Detected critical DN registration failures on Pulse DCU Stat Server. | pulse_statserver_dn_failed, pulse_statserver_dn_registered | 0.5% |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_monitor_data_unavailable | Critical | Pulse DCU Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/dcuMetrics | pulse_dcu_too_frequent_restarts | Critical | Detected too frequent restarts of DCU Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_cpu | Critical | Detected critical CPU usage by Pulse LDS Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_memory | Critical | Detected critical memory usage by Pulse LDS Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_critical_nonrunning_instances | Critical | Triggered when Pulse LDS instances are down. | kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_monitor_data_unavailable | Critical | Pulse LDS Monitor Agents do not provide data. | pulse_monitor_check_duration_seconds, kube_statefulset_replicas | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_no_connected_senders | Critical | Pulse LDS is not connected to upstream servers. | pulse_lds_senders_number | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_no_registered_dns | Critical | No DNs are registered on Pulse LDS. | pulse_lds_sender_registered_dns_number | for 30 minutes |
Draft:PEC-REP/Current/PulsePEGuide/ldsMetrics | pulse_lds_too_frequent_restarts | Critical | Detected too frequent restarts of LDS Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_5xx | Critical | Detected critical 5xx errors per second for Pulse container. | http_server_requests_seconds_count | 15% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_cpu | Critical | Detected critical CPU usage by Pulse Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_hikari_cp | Critical | Detected critical Hikari connections pool usage by Pulse container. | hikaricp_connections_active, hikaricp_connections | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_memory | Critical | Detected critical memory usage by Pulse Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_pulse_health | Critical | Detected critical number of healthy Pulse containers. | pulse_health_all_Boolean | 50% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_critical_running_instances | Critical | Triggered when Pulse instances are down. | kube_deployment_status_replicas_available, kube_deployment_status_replicas | 75% |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_service_down | Critical | All Pulse instances are down. | up | for 15 minutes |
Draft:PEC-REP/Current/PulsePEGuide/PulseMetrics | pulse_too_frequent_restarts | Critical | Detected too frequent restarts of Pulse Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_cpu | Critical | Detected critical CPU usage by Pulse Permissions Pod. | container_cpu_usage_seconds_total, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_memory | Critical | Detected critical memory usage by Pulse Permissions Pod. | container_memory_working_set_bytes, kube_pod_container_resource_limits | 90% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_critical_running_instances | Critical | Triggered when Pulse Permissions instances are down. | kube_deployment_status_replicas_available, kube_deployment_status_replicas | 75% |
Draft:PEC-REP/Current/PulsePEGuide/PulsePermissionsMetrics | pulse_permissions_too_frequent_restarts | Critical | Detected too frequent restarts of Permissions Pod container. | kube_pod_container_status_restarts_total | 2 for 1 hour |
Draft:STRMS/Current/STRMSPEGuide/ServiceMetrics | streams_GWS_AUTH_DOWN | critical | Unable to connect to GWS auth service | gws_auth_down | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_BATCH_LAG_TIME | warning | Message handling exceeds 2 secs | 30 seconds | |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_DOWN | critical | The number of running instances is 0 | sum(up) < 1 | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_ENDPOINT_CONNECTION_DOWN | warning | Unable to connect to a customer endpoint | endpoint_connection_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_ENGAGE_KAFKA_CONNECTION_DOWN | critical | Unable to connect to Engage Kafka | engage_kafka_main_connection_down | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_AUTH_DOWN | Critical | Unable to connect to GWS auth service | gws_auth_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_CONFIG_DOWN | critical | Unable to connect to GWS config service | gws_config_down | |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_GWS_ENV_DOWN | critical | Unable to connect to GWS environment service | gws_env_down | 30 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_INIT_ERROR | critical | Aborted due to an initialization error, e.g., KAFKA_FQDN is not defined | application_streams_init_error > 0 | 10 seconds |
Draft:STRMS/Current/STRMSPEGuide/STRMSMetrics | streams_REDIS_DOWN | critical | Unable to connect to Redis | redis_connection_down | 10 seconds |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Http Errors Occurrences Exceeded Threshold | High | Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes | telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} | >500 in 5 minutes |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry CPU Utilization is Greater Than Threshold | High | Triggered when average CPU usage is more than 60% | node_cpu_seconds_total | >60% |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Dependency Status | Low | Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus | telemetry_dependency_status | <80 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry GAuth Time Alert | High | Triggered when there is no connection to the GAuth service | telemetry_gws_auth_req_time | >10000 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Healthy Pod Count Alert | High | Triggered when the number of healthy pods drops to critical level | kube_pod_container_status_ready | <2 |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry High Network Traffic | High | Triggered when network traffic exceeds 10MB/second for 5 minutes | node_network_transmit_bytes_total, node_network_receive_bytes_total | >10MBps |
Draft:TLM/Current/TLMPEGuide/TLMMetrics | Telemetry Memory Usage is Greater Than Threshold | High | Triggered when average memory usage is more than 60% | container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores | >60% |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_elasticsearch_health_status | critical | Triggered when there is no connection to ElasticSearch | ucsx_elasticsearch_health_status | 2 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_elasticsearch_slow_processing_time | critical | Triggered when Elasticsearch internal processing time > 500 ms | ucsx_elastic_search_sum, ucsx_elastic_search_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_cpu_utilization | warning | Triggered when average CPU usage is more than 80% | ucsx_performance | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_http_request_rate | warning | Triggered when the request rate is more than 120 requests per second on one UCS-X instance | ucsx_http_request_duration_count | 30 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_high_memory_usage | warning | Triggered when average memory usage is more than 800 MB | ucsx_memory | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_overloaded | warning | Triggered when overload protection rate is more than 0 | ucsx_overload_protection_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_instance_slow_http_response | critical | Triggered when average http response time > 500 ms | ucsx_http_request_duration_sum, ucsx_http_request_duration_count | 5 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_masterdb_health_status | warning | Triggered when there is no connection to master DB | ucsx_masterdb_health_status | 2 minutes |
Draft:UCS/Current/UCSPEGuide/UCSMetrics | ucsx_tenantdb_health_status | critical | Triggered when there is no connection to tenant DB | ucsx_tenantdb_health_status | 2 minutes |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Agent service fail | Critical | Actions: *Check if there is any problem with the pod, then restart the pod. | agent_health_level | Agent health level is Fail for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Config node fail | Warning | Actions: *Check if there is any problem with the pod and the config node. | http_client_response_count | Requests to the config node fail for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Container restarted repeatedly | Critical | Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Kafka events latency is too high | Warning | Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Kafka not available | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Failed | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status NotReady | Critical | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady status for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Pending | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Pod status Unknown | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Possible messages lost | Critical | Actions: *Check for Kafka and service overload and for network degradation. | kafka_consumer_recv_messages_total, kafka_producer_sent_messages_total | The number of sent requests is two times higher than received for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Redis not available | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Redis. Restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | agent_redis_state, agent_stream_redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for the service. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, ensure there are no issues with Kafka. Restart Kafka. *If the alarm is triggered only for one container, check if there is an issue with the service. | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceAgentStateServiceMetrics | Too many Kafka pending events | Critical | Actions: *Ensure there are no issues with Kafka or with the pod's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the pod (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Container restarted repeatedly | Critical | Actions: *Check if a new version of the image was deployed. *Check for issues with the Kubernetes cluster. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Kafka events latency is too high | Critical | Actions: *If the alarm is triggered for multiple topics, ensure there are no issues with Kafka (CPU, memory, or network overload). *If the alarm is triggered only for one topic, check if there is an issue with the service related to the topic (CPU, memory, or network overload). | kafka_consumer_latency_bucket | Latency for more than 5% of messages is more than 0.5 seconds for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Kafka not available | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | kafka_producer_state, kafka_consumer_state | Kafka is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Max replicas is not sufficient for 5 mins | Critical | The desired number of replicas is higher than the current available replicas for the past 5 minutes. | kube_statefulset_replicas, kube_statefulset_status_replicas | The desired number of replicas is higher than the current available replicas for the past 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Failed | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Failed state. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status NotReady | Critical | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_ready | The pod is in NotReady status for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Pending | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Pending state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Pod status Unknown | Warning | Actions: *Restart the pod. Check if there are any issues with the pod after restart. | kube_pod_status_phase | The pod is in Unknown state for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Redis not available | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Redis, and then restart Redis. *If the alarm is triggered only for one pod, check if there is an issue with the pod. | callthread_redis_state | Redis is not available for the pod for 5 consecutive minutes. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer crashes | Critical | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | More than 3 Kafka consumer crashes in 5 minutes for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer failed health checks | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | Health check failed more than 10 times in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka consumer request timeouts | Warning | Actions: *If the alarm is triggered for multiple services, make sure there are no issues with Kafka, and then restart Kafka. *If the alarm is triggered only for one topic, check if there is an issue with the service. | kafka_consumer_error_total | More than 10 request timeouts appeared in 5 minutes for the Kafka consumer for the topic. |
Draft:VM/Current/VMPEGuide/VoiceCallStateServiceMetrics | Too many Kafka pending events | Critical | Actions: *Ensure there are no issues with Kafka or with the service's CPU and network. | kafka_producer_queue_depth | Too many Kafka producer pending events for the service (more than 100 in 5 minutes). |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Container restarted repeatedly | Critical | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_container_status_restarts_total | The container was restarted 5 or more times within 15 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod CPU greater than 65% | Warning | High CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod CPU greater than 80% | Critical | Critical CPU load for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_cpu_usage_seconds_total, container_spec_cpu_period | Container CPU usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Failed | Warning | Actions: *One of the containers in the pod has entered a failed state. Check the Kibana logs for the reason. | kube_pod_status_phase | The pod has failed. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod memory greater than 65% | Warning | High memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 65% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod memory greater than 80% | Critical | Critical memory usage for the pod. Actions: *Check whether the horizontal pod autoscaler has triggered and whether the maximum number of pods has been reached. *Check Grafana for abnormal load. *Restart the service. *Collect the service logs; raise an investigation ticket. | container_memory_working_set_bytes, kube_pod_container_resource_requests_memory_bytes | Container memory usage exceeded 80% for 5 minutes. |
Draft:VM/Current/VMPEGuide/VoiceConfigServiceMetrics | Pod Not ready for 10 minutes | Critical | Actions: *If this alarm is triggered, check whether the CPU is available for the pods. *Check whether the port of the pod is running and serving the request. | kube_pod_status_ready | The pod is in NotReady state for 10 minutes. |
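Most thresholds in this table follow one pattern: a PromQL expression over the listed BasedOn metric(s) must hold for the stated duration before the alert fires. As a minimal sketch only, assuming a standard Prometheus rule file and cAdvisor/kube-state-metrics label conventions (the group name, severity label, and quota arithmetic below are illustrative assumptions, not the shipped Helm-chart rule), a CPU alert in the shape of ContainerCPUreached80percent could look roughly like this:

```yaml
# Hypothetical sketch of a "container CPU above 80% of its quota for 15 minutes" rule.
# Metric names come from the table above; everything else is an assumption.
groups:
  - name: example.container.cpu
    rules:
      - alert: ContainerCPUreached80percent
        expr: |
          sum by (pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
          sum by (pod, container) (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})
            > 0.80
        for: 15m
        labels:
          severity: HIGH
        annotations:
          summary: "Container CPU utilization has been above 80% of its quota for 15 minutes"
```

Alerts keyed to log-derived counters (for example, gvp_rm_log_parser_eror_total) would follow the same shape, typically comparing an increase() over the stated window against the listed threshold.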