Telemetry Service metrics and alerts

This topic is part of the Telemetry Service Private Edition Guide for the current version of Telemetry Service.

Metrics

Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about the standard system metrics you can use to monitor services, see System metrics.

The following standard Kubernetes metrics are likely to be most relevant.

Metric and description	Metric details	Indicator of
container_cpu_usage_seconds_total Cumulative CPU time consumed	Unit: seconds Type: Counter Label: pod="podId" Sample value: 7000	Monitoring the CPU usage
container_fs_reads_bytes_total Cumulative count of bytes read	Unit: bytes Type: Counter Label: pod="podId" Sample value: 900	Monitoring filesystem usage
container_network_receive_bytes_total Cumulative count of bytes received	Unit: bytes Type: Counter Label: pod="podId" Sample value: 3000	Monitoring incoming network
container_network_transmit_bytes_total Cumulative count of bytes transmitted	Unit: bytes Type: Counter Label: pod="podId" Sample value: 5000	Monitoring outgoing network
kube_pod_container_status_ready Describes whether the container's readiness check succeeded.	Unit: integer Type: Gauge Label: pod="podId" Sample value: 2	Monitoring healthy pods
kube_pod_container_status_restarts_total The number of container restarts per container	Unit: integer Type: Counter Label: pod="podId" Sample value: 0	Monitoring pod restarts
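
Most of the metrics above are counters, so their raw values only ever grow; what is usually monitored is the rate of change between two scrapes. The following Python sketch illustrates that calculation under stated assumptions: the function name `counter_rate` and the sample numbers are hypothetical, and treating a decrease as a counter reset is a common convention rather than anything specific to Telemetry Service.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Return the per-second rate of a cumulative counter between two samples.

    A drop in value is treated as a counter reset (for example, after a
    container restart), so the current value is taken as the increase.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = curr_value - prev_value
    if increase < 0:  # counter was reset between the two samples
        increase = curr_value
    return increase / elapsed


# Hypothetical samples of container_cpu_usage_seconds_total taken 60 s apart.
cpu_cores_used = counter_rate(7000.0, 0.0, 7030.0, 60.0)
print(cpu_cores_used)  # 0.5, meaning half a CPU core was busy on average
```

The same function applies to the byte counters in the table, where the result is bytes per second instead of CPU cores.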

Alerts

The following alerts are defined for Telemetry Service.

Alert	Severity	Description	Based on	Threshold
Telemetry CPU Utilization is Greater Than Threshold	High	Triggered when average CPU usage is more than 60%	node_cpu_seconds_total	>60%
Telemetry Memory Usage is Greater Than Threshold	High	Triggered when average memory usage is more than 60%	container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores	>60%
Telemetry High Network Traffic	High	Triggered when network traffic exceeds 10MB/second for 5 minutes	node_network_transmit_bytes_total, node_network_receive_bytes_total	>10MBps
Http Errors Occurrences Exceeded Threshold	High	Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes	telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"}	>500 in 5 minutes
Telemetry Dependency Status	Low	Triggered when there is no connection to one of the dependent services (GAuth, Config, or Prometheus)	telemetry_dependency_status	<80
Telemetry Healthy Pod Count Alert	High	Triggered when the number of healthy pods drops to critical level	kube_pod_container_status_ready	<2
Telemetry GAuth Time Alert	High	Triggered when there is no connection to the GAuth service	telemetry_gws_auth_req_time	>10000
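
Several alerts above combine a rate threshold with a duration, such as network traffic above 10MB/second for 5 minutes. The following Python sketch shows one way such a sustained condition could be evaluated from raw counter samples. It is an illustration only, not the rule definition shipped with Telemetry Service: the function name `sustained_above`, the sample data, and the assumption that 1 MB equals 10**6 bytes are all hypothetical, and counter resets are not handled.

```python
MB = 1_000_000  # assumption: 1 MB = 10**6 bytes


def sustained_above(samples, threshold_per_sec, window_sec):
    """Return True if the counter's rate stayed above the threshold
    for every scrape interval inside the trailing window.

    samples: list of (timestamp_sec, counter_value) tuples, oldest first.
    """
    if len(samples) < 2:
        return False
    end_ts = samples[-1][0]
    if end_ts - samples[0][0] < window_sec:
        return False  # not enough history to cover the whole window
    start_ts = end_ts - window_sec
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= start_ts:
            continue  # this interval ends before the window starts
        if t1 <= t0:
            raise ValueError("samples must be in increasing time order")
        rate = (v1 - v0) / (t1 - t0)
        if rate <= threshold_per_sec:
            return False
    return True


# Hypothetical byte counter scraped every 60 s, growing by 720 MB per scrape (12 MB/s).
busy = [(i * 60, i * 720 * MB) for i in range(6)]
# Same series, but the last interval grows by only 300 MB (5 MB/s).
quiet = busy[:-1] + [(300, busy[-2][1] + 300 * MB)]

print(sustained_above(busy, 10 * MB, 300))   # True
print(sustained_above(quiet, 10 * MB, 300))  # False
```

Requiring every interval in the window to exceed the threshold avoids firing on a single short burst, which is the usual reason for attaching a duration to an alert.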

Service	CRD or annotations?	Port	Endpoint/Selector	Metrics update interval
Telemetry Service	n/aAnnotations		/metrics
Telemetry Service	All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service.
