Difference between revisions of "TLM/Current/TLMPEGuide/TLMMetrics"
(Published) |
|||
(2 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
{{ArticlePEServiceMetrics | {{ArticlePEServiceMetrics | ||
− | |||
|NoCRDsOrAnnotations=All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service. | |NoCRDsOrAnnotations=All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service. | ||
|CRD=n/a | |CRD=n/a | ||
+ | |Annotations=Annotations | ||
+ | |Endpoint=/metrics | ||
|MetricsDefined=Yes | |MetricsDefined=Yes | ||
|MetricsIntro=Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see [https://all.docs-content.genesys.com/Draft:PrivateEdition/Current/Operations/SystemMetrics System metrics]. | |MetricsIntro=Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see [https://all.docs-content.genesys.com/Draft:PrivateEdition/Current/Operations/SystemMetrics System metrics]. | ||
Line 15: | Line 16: | ||
|SampleValue=7000 | |SampleValue=7000 | ||
|UsedFor=Monitoring the CPU usage | |UsedFor=Monitoring the CPU usage | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
}}{{PEMetric | }}{{PEMetric | ||
|Metric=container_fs_reads_bytes_total | |Metric=container_fs_reads_bytes_total | ||
Line 64: | Line 57: | ||
|UsedFor=Monitoring pod restarts | |UsedFor=Monitoring pod restarts | ||
}} | }} | ||
− | |AlertsDefined= | + | |AlertsDefined=Yes |
+ | |PEAlert={{PEAlert | ||
+ | |Alert=Telemetry CPU Utilization is Greater Than Threshold | ||
+ | |Severity=High | ||
+ | |AlertDescription=Triggered when average CPU usage is more than 60% | ||
+ | |BasedOn=node_cpu_seconds_total | ||
+ | |Threshold=>60% | ||
+ | }}{{PEAlert | ||
+ | |Alert=Telemetry Memory Usage is Greater Than Threshold | ||
+ | |Severity=High | ||
+ | |AlertDescription=Triggered when average memory usage is more than 60% | ||
+ | |BasedOn=container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores | ||
+ | |Threshold=>60% | ||
+ | }}{{PEAlert | ||
+ | |Alert=Telemetry High Network Traffic | ||
+ | |Severity=High | ||
+ | |AlertDescription=Triggered when network traffic exceeds 10MB/second for 5 minutes | ||
+ | |BasedOn=node_network_transmit_bytes_total, node_network_receive_bytes_total | ||
+ | |Threshold=>10MBps | ||
+ | }}{{PEAlert | ||
+ | |Alert=Http Errors Occurrences Exceeded Threshold | ||
+ | |Severity=High | ||
+ | |AlertDescription=Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes | ||
+ | |BasedOn=telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} | ||
+ | |Threshold=>500 in 5 minutes | ||
+ | }}{{PEAlert | ||
+ | |Alert=Telemetry Dependency Status | ||
+ | |Severity=Low | ||
+ | |AlertDescription=Triggered when there is no connection to one of the dependent services - GAuth, Config, Prometheus | ||
+ | |BasedOn=telemetry_dependency_status | ||
+ | |Threshold=<80 | ||
+ | }}{{PEAlert | ||
+ | |Alert=Telemetry Healthy Pod Count Alert | ||
+ | |Severity=High | ||
+ | |AlertDescription=Triggered when the number of healthy pods drops to critical level | ||
+ | |BasedOn=kube_pod_container_status_ready | ||
+ | |Threshold=<2 | ||
+ | }}{{PEAlert | ||
+ | |Alert=Telemetry GAuth Time Alert | ||
+ | |Severity=High | ||
+ | |AlertDescription=Triggered when there is no connection to the GAuth service | ||
+ | |BasedOn=telemetry_gws_auth_req_time | ||
+ | |Threshold=>10000 | ||
+ | }} | ||
}} | }} |
Latest revision as of 16:47, September 29, 2022
Find the metrics Telemetry Service exposes and the alerts defined for Telemetry Service.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
Telemetry Service | Annotations (CRD: n/a) | | /metrics | |
All the Telemetry Service metrics are standard Kubernetes metrics as delivered by a standard Kubernetes metrics service. |
See details about the metrics and alerts in the sections that follow.
Metrics
Use standard Kubernetes metrics, as delivered by a standard Kubernetes metrics service (such as cAdvisor), to monitor the Telemetry Service. For information about standard system metrics to use to monitor services, see System metrics.
The following standard Kubernetes metrics are likely to be most relevant.
Metric and description | Metric details | Indicator of |
---|---|---|
container_ Cumulative CPU time consumed | Unit: seconds Type: Counter | Monitoring CPU usage |
container_ Cumulative count of bytes read | Unit: bytes Type: Counter | Monitoring filesystem usage |
container_ Cumulative count of bytes received | Unit: bytes Type: Counter | Monitoring incoming network traffic |
container_ Cumulative count of bytes transmitted | Unit: bytes Type: Counter | Monitoring outgoing network traffic |
kube_ Describes whether the container's readiness check succeeded | Unit: integer Type: Gauge | Monitoring healthy pods |
kube_ The number of container restarts per container | Unit: integer Type: Counter | Monitoring pod restarts |
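
Most of the metrics above are cumulative counters, so a single reading says little on its own; usage is normally derived from how fast the counter grows. The following is a minimal sketch of that calculation, not the Telemetry Service's or Prometheus's own implementation (Prometheus's `rate()` also extrapolates to the window boundaries, which this sketch does not). The function name `counter_rate` and the sample values are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: Sequence[Sample]) -> float:
    """Return the average per-second increase of a cumulative counter.

    A drop between consecutive samples is treated as a counter reset
    (for example, after a container restart); the new value is then
    counted as the increase since the reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


# Hypothetical readings of cumulative CPU seconds, taken 60 s apart.
cpu_samples = [(0.0, 7000.0), (60.0, 7030.0), (120.0, 7066.0)]
cpu_cores_used = counter_rate(cpu_samples)  # 66 CPU-seconds over 120 s
```

Applied to a CPU-seconds counter, the result is the average number of CPU cores in use over the window; applied to a bytes counter, it is bytes per second.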
Alerts
The following alerts are defined for Telemetry Service.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
Telemetry CPU Utilization is Greater Than Threshold | High | Triggered when average CPU usage is more than 60% | node_cpu_seconds_total | >60% |
Telemetry Memory Usage is Greater Than Threshold | High | Triggered when average memory usage is more than 60% | container_cpu_usage_seconds_total, kube_pod_container_resource_limits_cpu_cores | >60% |
Telemetry High Network Traffic | High | Triggered when network traffic exceeds 10 MB/second for 5 minutes | node_network_transmit_bytes_total, node_network_receive_bytes_total | >10MBps |
Http Errors Occurrences Exceeded Threshold | High | Triggered when the number of HTTP errors exceeds 500 responses in 5 minutes | telemetry_events{eventName=~"http_error_.*", eventName!="http_error_404"} | >500 in 5 minutes |
Telemetry Dependency Status | Low | Triggered when there is no connection to one of the dependent services (GAuth, Config, Prometheus) | telemetry_dependency_status | <80 |
Telemetry Healthy Pod Count Alert | High | Triggered when the number of healthy pods drops to a critical level | kube_pod_container_status_ready | <2 |
Telemetry GAuth Time Alert | High | Triggered when there is no connection to the GAuth service | telemetry_gws_auth_req_time | >10000 |
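
To illustrate how the selector and threshold of the HTTP errors alert combine, the sketch below mirrors the two label matchers from the table (`eventName=~"http_error_.*"` and `eventName!="http_error_404"`) and the threshold of more than 500 in 5 minutes. It is an assumption-laden illustration, not the actual alert rule: the function names and the per-window counts are hypothetical, and the input is assumed to be the increase of each `telemetry_events` series over one 5-minute window.

```python
import re
from typing import Iterable, Tuple

HTTP_ERROR_PATTERN = re.compile(r"http_error_.*")
EXCLUDED_EVENT = "http_error_404"
THRESHOLD = 500  # errors per 5-minute window, taken from the alerts table


def counts_as_http_error(event_name: str) -> bool:
    """Apply both label matchers from the alert's selector."""
    # Prometheus regex matchers are fully anchored, hence fullmatch().
    matches_regex = HTTP_ERROR_PATTERN.fullmatch(event_name) is not None
    return matches_regex and event_name != EXCLUDED_EVENT


def http_error_alert_fires(window_counts: Iterable[Tuple[str, int]]) -> bool:
    """Return True if matching events in one window exceed the threshold."""
    total = sum(count for name, count in window_counts
                if counts_as_http_error(name))
    return total > THRESHOLD


# Hypothetical event counts observed in one 5-minute window.
window = [
    ("http_error_500", 300),
    ("http_error_404", 900),   # excluded by the != matcher
    ("http_error_503", 201),
    ("http_request", 5000),    # does not match the regex
]
fires = http_error_alert_fires(window)  # 300 + 201 = 501, above 500
```

Note that 404 responses never contribute, however many occur, so a burst of requests for missing resources does not trigger this alert on its own.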