Difference between revisions of "TLM/Current/TLMPEGuide/TLMMetrics"

Latest revision as of 16:47, September 29, 2022

Find the metrics Telemetry Service exposes and the alerts defined for Telemetry Service.

Metrics

Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see System metrics.

The following standard Kubernetes metrics are likely to be most relevant.

Metric and description	Metric details	Indicator of
container_cpu_usage_seconds_total Cumulative CPU time consumed	Unit: seconds Type: Counter Label: pod="podId" Sample value: 7000	Monitoring the CPU usage
container_fs_reads_bytes_total Cumulative count of bytes read	Unit: bytes Type: Counter Label: pod="podId" Sample value: 900	Monitoring filesystem usage
container_network_receive_bytes_total Cumulative count of bytes received	Unit: bytes Type: Counter Label: pod="podId" Sample value: 3000	Monitoring incoming network
container_network_transmit_bytes_total Cumulative count of bytes transmitted	Unit: bytes Type: Counter Label: pod="podId" Sample value: 5000	Monitoring outgoing network
kube_pod_container_status_ready Describes whether the container's readiness check succeeded.	Unit: integer Type: Gauge Label: pod="podId" Sample value: 2	Monitoring healthy pods
kube_pod_container_status_restarts_total The number of container restarts per container	Unit: integer Type: Counter Label: pod="podId" Sample value: 0	Monitoring pod restarts
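
Most of the metrics above are cumulative counters, so a single sample value (such as 7000 for container_cpu_usage_seconds_total) says little by itself; the useful signal is how fast the counter grows between two scrapes. The following is a minimal Python sketch of that calculation. The function name, the sample timestamps, and the second sample value are illustrative assumptions, not part of the Telemetry Service or of any monitoring library.

```python
def counter_rate(earlier, later):
    """Per-second growth between two (timestamp_seconds, value) counter samples.

    Hypothetical helper for illustration only.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # The counter went backwards, which is assumed here to mean the
        # container restarted and the counter started again from zero.
        delta = v1
    return delta / (t1 - t0)


# Two assumed samples of container_cpu_usage_seconds_total, 60 seconds apart.
cpu_cores_used = counter_rate((0, 7000.0), (60, 7030.0))
print(cpu_cores_used)  # 30 CPU-seconds over 60 seconds: 0.5 cores on average
```

The same calculation applies to the byte counters (filesystem reads, network receive and transmit), where the result is bytes per second.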

Alerts

The following alerts are defined for Telemetry Service.

Alert	Severity	Description	Based on	Threshold
Telemetry CPU Utilization is Greater Than Threshold	High	Triggered when average CPU usage is more than 60%	node_cpu_seconds_total	>60%
Telemetry Memory Usage is Greater Than Threshold	High	Triggered when average memory usage is more than 60%	container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores	>60%
Telemetry High Network Traffic	High	Triggered when network traffic exceeds 10MB/second for 5 minutes	node_network_transmit_bytes_total, node_network_receive_bytes_total	>10MBps
Http Errors Occurrences Exceeded Threshold	High	Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes	telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"}	>500 in 5 minutes
Telemetry Dependency Status	Low	Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus	telemetry_dependency_status	<80
Telemetry Healthy Pod Count Alert	High	Triggered when the number of healthy pods drops to critical level	kube_pod_container_status_ready	<2
Telemetry GAuth Time Alert	High	Triggered when there is no connection to the GAuth service	telemetry_gws_auth_req_time	>10000
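
To make the label selector in the "Http Errors Occurrences Exceeded Threshold" row concrete, here is a small Python sketch of how such a condition could be evaluated. It assumes the per-event increases over the 5-minute window have already been computed and are passed in as a dictionary; the function names and the sample numbers are hypothetical, and the actual alert expression used by Telemetry Service is not shown in this document.

```python
import re

# Mirrors the selector eventName=~"http_error_.*", eventName!="http_error_404".
HTTP_ERROR = re.compile(r"http_error_.*")


def matches_selector(event_name):
    """True if the event name would be selected by the alert's label matchers."""
    # Prometheus regex matchers are fully anchored, hence fullmatch.
    return (
        HTTP_ERROR.fullmatch(event_name) is not None
        and event_name != "http_error_404"
    )


def http_error_alert_fires(increase_by_event, threshold=500):
    """Compare the summed 5-minute increase of selected events to the threshold."""
    total = sum(
        count for name, count in increase_by_event.items()
        if matches_selector(name)
    )
    return total > threshold


# Assumed 5-minute increases per eventName label value.
window = {
    "http_error_500": 400,
    "http_error_503": 101,
    "http_error_404": 300,  # excluded by the != matcher
    "request_ok": 9000,     # does not match the regex
}
print(http_error_alert_fires(window))  # True
```

In this example only the 500 and 503 events count, giving 501 errors, which exceeds the threshold of 500.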

@@ Line 1: / Line 1: @@
 {{ArticlePEServiceMetrics
-|IncludedServiceId=17df197d-45b4-4d49-b269-f44d5bdfe5a1
 |NoCRDsOrAnnotations=All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service.
 |CRD=n/a
+|Annotations=Annotations
+|Endpoint=/metrics
 |MetricsDefined=Yes
 |MetricsIntro=Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see [https://all.docs-content.genesys.com/Draft:PrivateEdition/Current/Operations/SystemMetrics System metrics].
@@ Line 15: / Line 16: @@
 |SampleValue=7000
 |UsedFor=Monitoring the CPU usage
-}}{{PEMetric
-|Metric=container_memory_working_set_bytes
-|Type=Gauge
-|Unit=bytes
-|Label=pod="podId"
-|MetricDescription=Current working set
-|SampleValue=4500
-|UsedFor=Monitoring memory usage
 }}{{PEMetric
 |Metric=container_fs_reads_bytes_total
@@ Line 64: / Line 57: @@
 |UsedFor=Monitoring pod restarts
 }}
-|AlertsDefined=No
+|AlertsDefined=Yes
+|PEAlert={{PEAlert
+|Alert=Telemetry CPU Utilization is Greater Than Threshold
+|Severity=High
+|AlertDescription=Triggered when average CPU usage is more than 60%
+|BasedOn=node_cpu_seconds_total
+|Threshold=>60%
+}}{{PEAlert
+|Alert=Telemetry Memory Usage is Greater Than Threshold
+|Severity=High
+|AlertDescription=Triggered when average memory usage is more than 60%
+|BasedOn=container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores
+|Threshold=>60%
+}}{{PEAlert
+|Alert=Telemetry High Network Traffic
+|Severity=High
+|AlertDescription=Triggered when network traffic exceeds 10MB/second for 5 minutes
+|BasedOn=node_network_transmit_bytes_total, node_network_receive_bytes_total
+|Threshold=>10MBps
+}}{{PEAlert
+|Alert=Http Errors Occurrences Exceeded Threshold
+|Severity=High
+|AlertDescription=Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes
+|BasedOn=telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"}
+|Threshold=>500 in 5 minutes
+}}{{PEAlert
+|Alert=Telemetry Dependency Status
+|Severity=Low
+|AlertDescription=Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus
+|BasedOn=telemetry_dependency_status
+|Threshold=<80
+}}{{PEAlert
+|Alert=Telemetry Healthy Pod Count Alert
+|Severity=High
+|AlertDescription=Triggered when the number of healthy pods drops to critical level
+|BasedOn=kube_pod_container_status_ready
+|Threshold=<2
+}}{{PEAlert
+|Alert=Telemetry GAuth Time Alert
+|Severity=High
+|AlertDescription=Triggered when there is no connection to the GAuth service
+|BasedOn=telemetry_gws_auth_req_time
+|Threshold=>10000
+}}
 }}

Service	CRD or annotations?	Port	Endpoint/Selector	Metrics update interval
Telemetry Service	n/a; Annotations		/metrics
Telemetry Service	All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service.
