Difference between revisions of "TLM/Current/TLMPEGuide/TLMMetrics"

(Published)
 
 
(2 intermediate revisions by one other user not shown)
Line 1: Line 1:
  {{ArticlePEServiceMetrics
- |IncludedServiceId=17df197d-45b4-4d49-b269-f44d5bdfe5a1
  |NoCRDsOrAnnotations=All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service.
  |CRD=n/a
+ |Annotations=Annotations
+ |Endpoint=/metrics
  |MetricsDefined=Yes
  |MetricsIntro=Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see [https://all.docs-content.genesys.com/Draft:PrivateEdition/Current/Operations/SystemMetrics System metrics].
Line 15: Line 16:
  |SampleValue=7000
  |UsedFor=Monitoring the CPU usage
- }}{{PEMetric
- |Metric=container_memory_working_set_bytes
- |Type=Gauge
- |Unit=bytes
- |Label=pod="podId"
- |MetricDescription=Current working set
- |SampleValue=4500
- |UsedFor=Monitoring memory usage
  }}{{PEMetric
  |Metric=container_fs_reads_bytes_total
Line 64: Line 57:
  |UsedFor=Monitoring pod restarts
  }}
- |AlertsDefined=No
+ |AlertsDefined=Yes
+ |PEAlert={{PEAlert
+ |Alert=Telemetry CPU Utilization is Greater Than Threshold
+ |Severity=High
+ |AlertDescription=Triggered when average CPU usage is more than 60%
+ |BasedOn=node_cpu_seconds_total
+ |Threshold=>60%
+ }}{{PEAlert
+ |Alert=Telemetry Memory Usage is Greater Than Threshold
+ |Severity=High
+ |AlertDescription=Triggered when average memory usage is more than 60%
+ |BasedOn=container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores
+ |Threshold=>60%
+ }}{{PEAlert
+ |Alert=Telemetry High Network Traffic
+ |Severity=High
+ |AlertDescription=Triggered when network traffic exceeds 10MB/second for 5 minutes
+ |BasedOn=node_network_transmit_bytes_total, node_network_receive_bytes_total
+ |Threshold=>10MBps
+ }}{{PEAlert
+ |Alert=Http Errors Occurrences Exceeded Threshold
+ |Severity=High
+ |AlertDescription=Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes
+ |BasedOn=telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"}
+ |Threshold=>500 in 5 minutes
+ }}{{PEAlert
+ |Alert=Telemetry Dependency Status
+ |Severity=Low
+ |AlertDescription=Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus
+ |BasedOn=telemetry_dependency_status
+ |Threshold=<80
+ }}{{PEAlert
+ |Alert=Telemetry Healthy Pod Count Alert
+ |Severity=High
+ |AlertDescription=Triggered when the number of healthy pods drops to critical level
+ |BasedOn=kube_pod_container_status_ready
+ |Threshold=<2
+ }}{{PEAlert
+ |Alert=Telemetry GAuth Time Alert
+ |Severity=High
+ |AlertDescription=Triggered when there is no connection to the GAuth service
+ |BasedOn=telemetry_gws_auth_req_time
+ |Threshold=>10000
+ }}
  }}

Latest revision as of 16:47, September 29, 2022

This topic is part of the Telemetry Service Private Edition Guide for the Current version of Telemetry Service.

Find the metrics Telemetry Service exposes and the alerts defined for Telemetry Service.

Service: Telemetry Service
CRD or annotations?: n/a (CRD), Annotations
Port: (not specified)
Endpoint/Selector: /metrics
Metrics update interval: (not specified)

All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service.
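
As an illustration of reading samples from the /metrics endpoint, the following minimal sketch parses a single sample line. It assumes the endpoint serves the Prometheus text exposition format, which this page does not state explicitly. The function name parse_sample and the sample line in the usage note are hypothetical. The parser is deliberately simplified: it skips comment lines and does not match lines that carry a trailing timestamp or escaped quotes inside label values.

```python
import re

# `name{label="value",...} number`; the label block is optional.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels, value), or None if the line is not a plain sample.

    Hypothetical helper for illustration; not part of Telemetry Service.
    """
    line = line.strip()
    # Blank lines and `# HELP` / `# TYPE` comments carry no sample.
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)
```

For example, parse_sample('container_cpu_usage_seconds_total{pod="podId"} 7000') yields the metric name, the label dictionary {"pod": "podId"}, and the value 7000.0.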

See details about the Telemetry Service metrics and alerts in the sections that follow.

Metrics

Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see System metrics.

The following standard Kubernetes metrics are likely to be most relevant.

container_cpu_usage_seconds_total

Cumulative CPU time consumed

Unit: seconds
Type: Counter
Label: pod="podId"
Sample value: 7000
Indicator of: Monitoring the CPU usage

container_fs_reads_bytes_total

Cumulative count of bytes read

Unit: bytes
Type: Counter
Label: pod="podId"
Sample value: 900
Indicator of: Monitoring filesystem usage

container_network_receive_bytes_total

Cumulative count of bytes received

Unit: bytes
Type: Counter
Label: pod="podId"
Sample value: 3000
Indicator of: Monitoring incoming network

container_network_transmit_bytes_total

Cumulative count of bytes transmitted

Unit: bytes
Type: Counter
Label: pod="podId"
Sample value: 5000
Indicator of: Monitoring outgoing network

kube_pod_container_status_ready

Describes whether the container's readiness check succeeded.

Unit: integer
Type: Gauge
Label: pod="podId"
Sample value: 2
Indicator of: Monitoring healthy pods

kube_pod_container_status_restarts_total

The number of container restarts per container

Unit: integer
Type: Counter
Label: pod="podId"
Sample value: 0
Indicator of: Monitoring pod restarts
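
Most of the metrics listed above are cumulative counters, so a single sample value says little on its own; what usually matters is how fast the counter grows between two samples. The sketch below shows one way to compute that average rate. The function name counter_rate and the second sample value (7030) are hypothetical, and treating a decrease as a counter reset is an assumption about how restarts appear in the data.

```python
def counter_rate(earlier, later, interval_seconds):
    """Average per-second increase of a cumulative counter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = later - earlier
    if delta < 0:
        # Assumption: a decrease means the counter was reset (for example,
        # after a container restart), so the growth is counted from zero.
        delta = later
    return delta / interval_seconds


# container_cpu_usage_seconds_total going from 7000 to 7030 over 60 seconds
# corresponds to an average of 0.5 CPU cores in use.
cpu_cores_used = counter_rate(7000.0, 7030.0, 60.0)
```

The same helper applies to the byte counters: dividing the increase in container_network_receive_bytes_total by the sampling interval gives incoming bytes per second.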

Alerts

The following alerts are defined for Telemetry Service.

Telemetry CPU Utilization is Greater Than Threshold

Triggered when average CPU usage is more than 60%

Severity: High
Based on: node_cpu_seconds_total
Threshold: >60%

Telemetry Memory Usage is Greater Than Threshold

Triggered when average memory usage is more than 60%

Severity: High
Based on: container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores
Threshold: >60%

Telemetry High Network Traffic

Triggered when network traffic exceeds 10MB/second for 5 minutes

Severity: High
Based on: node_network_transmit_bytes_total, node_network_receive_bytes_total
Threshold: >10MBps

Http Errors Occurrences Exceeded Threshold

Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes

Severity: High
Based on: telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"}
Threshold: >500 in 5 minutes

Telemetry Dependency Status

Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus

Severity: Low
Based on: telemetry_dependency_status
Threshold: <80

Telemetry Healthy Pod Count Alert

Triggered when the number of healthy pods drops to critical level

Severity: High
Based on: kube_pod_container_status_ready
Threshold: <2

Telemetry GAuth Time Alert

Triggered when there is no connection to the GAuth service

Severity: High
Based on: telemetry_gws_auth_req_time
Threshold: >10000
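
To make the HTTP errors threshold concrete, the sketch below evaluates it over one 5-minute window. It is an illustration only, not the alert rule that ships with the service. It assumes telemetry_events is a cumulative counter keyed by its eventName label, and the function names and sample event counts are hypothetical. The selector is mirrored with a fully anchored match on http_error_.* that excludes http_error_404.

```python
import re

# Mirrors eventName=~"http_error_.*" (label regex matches are fully anchored).
HTTP_ERROR = re.compile(r"http_error_.*")


def http_errors_in_window(start_counts, end_counts):
    """Sum the increase of every matching event counter across the window.

    Both arguments map eventName -> counter value, sampled at the start and
    at the end of the window (hypothetical input shape).
    """
    total = 0.0
    for event_name, end_value in end_counts.items():
        if not HTTP_ERROR.fullmatch(event_name):
            continue
        if event_name == "http_error_404":
            continue  # mirrors eventName!="http_error_404"
        increase = end_value - start_counts.get(event_name, 0.0)
        total += max(increase, 0.0)  # ignore decreases caused by resets
    return total


def http_error_alert_fires(start_counts, end_counts, threshold=500):
    """True when more than `threshold` HTTP errors occurred in the window."""
    return http_errors_in_window(start_counts, end_counts) > threshold
```

With a start sample of {"http_error_500": 100, "http_error_404": 10} and an end sample of {"http_error_500": 700, "http_error_404": 900}, only the 600 additional http_error_500 events are counted, which is above the threshold of 500.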