Difference between revisions of "PEC-REP/Current/GCXIPEGuide/RAAMetrics"
(Published) |
|||
Line 24: | Line 24: | ||
|MetricDescription=A health status metric extracted using a dedicated port on the monitor container. The metric value is a sum of values from two different health-checks: | |MetricDescription=A health status metric extracted using a dedicated port on the monitor container. The metric value is a sum of values from two different health-checks: | ||
− | * | + | *A database-based health-check. Results: |
− | * | + | **A result of '''2''' indicates that RAA is working or in maintenance (has received a STOP command from Genesys Info Mart, and will restart only after receiving the START command). |
+ | **A result of '''0''' indicates that RAA is not working according to this check. | ||
+ | *A local health-check. | ||
+ | **A result of '''1''' indicates that RAA is processing aggregation requests according to local health files. | ||
+ | **A result of '''0''' indicates that RAA is not processing aggregation requests according to local health files. | ||
|SampleValue=0 | |SampleValue=0 | ||
|UsedFor=Health check | |UsedFor=Health check | ||
Line 32: | Line 36: | ||
|Type=Counter | |Type=Counter | ||
|Label=cmd | |Label=cmd | ||
− | |MetricDescription=The number of commands received from Genesys Info mart since the previous scrape. Label reflects name of the command. The supported commands are: START, QUIT, EXIT, UPDATE_CONFIG, REAGGREGATE | + | |MetricDescription=The number of commands received from Genesys Info mart since the previous scrape. Label reflects the name of the command. The supported commands are: START, QUIT, EXIT, UPDATE_CONFIG, REAGGREGATE |
|SampleValue=10 | |SampleValue=10 | ||
|UsedFor=Traffic | |UsedFor=Traffic | ||
Line 38: | Line 42: | ||
|Metric=gcxi_raa_dispatch_count | |Metric=gcxi_raa_dispatch_count | ||
|Type=Counter | |Type=Counter | ||
− | |MetricDescription=The number | + | |MetricDescription=The number of dispatch events (moving aggregation requests from AGR_NOTIFICATION to PENDING_ARG) since the previous scrape. Dispatch events typically occur every 15 seconds. Such events are used for aggregation health check based on local files. |
|SampleValue=100 | |SampleValue=100 | ||
|UsedFor=Health check | |UsedFor=Health check | ||
Line 51: | Line 55: | ||
|Metric=gcxi_raa_relaunched_count | |Metric=gcxi_raa_relaunched_count | ||
|Type=Counter | |Type=Counter | ||
− | |MetricDescription=The number of times RAA was | + | |MetricDescription=The number of times RAA was relaunched since the previous scrape. The aggregation process can exit when an error occurs. Genesys Info Mart sends a START command every 15 minutes during the aggregation period, which causes RAA to relaunch. |
|SampleValue=1 | |SampleValue=1 | ||
|UsedFor=Error | |UsedFor=Error | ||
Line 92: | Line 96: | ||
|Type=Counter | |Type=Counter | ||
|Label=hierarchy, level, mediaType | |Label=hierarchy, level, mediaType | ||
− | |MetricDescription=The number of aggregations completed by RAA since the previous scrape. | + | |MetricDescription=The number of aggregations completed by RAA since the previous scrape. RAA groups the data by aggregation hierarchy name, materialized level (usually SUBHOUR, HOUR, DAY, MONTH), and media type (Online, Offline). |
|UsedFor=Traffic | |UsedFor=Traffic | ||
}}{{PEMetric | }}{{PEMetric | ||
Line 121: | Line 125: | ||
|Unit=seconds | |Unit=seconds | ||
|Label=table | |Label=table | ||
− | |MetricDescription=The number of records purged by RAA since the previous scrape. | + | |MetricDescription=The number of records purged by RAA since the previous scrape. RAA groups the data by purged table name. |
<br /> | <br /> | ||
Line 130: | Line 134: | ||
|Unit=milliseconds | |Unit=milliseconds | ||
|Label=table | |Label=table | ||
− | |MetricDescription=The total amount of time | + | |MetricDescription=The total amount of time spent on purging since the previous scrape. |
|UsedFor=Traffic | |UsedFor=Traffic | ||
}} | }} | ||
|AlertsDefined=Yes | |AlertsDefined=Yes | ||
− | |AlertsIntro= | + | |AlertsIntro=Various ''raa.prometheusRule.alerts.*'' parameters in the '''values.yaml''' file specify the severity of alerts and some thresholds. |
|PEAlert={{PEAlert | |PEAlert={{PEAlert | ||
− | |Alert=raa-health | + | |Alert=raa-health |
|Severity='''Specified by''': {{#replace:raa.prometheusRule.alerts.labels.severity|.|.<wbr/>}}<br/>'''Recommended value:''' severe | |Severity='''Specified by''': {{#replace:raa.prometheusRule.alerts.labels.severity|.|.<wbr/>}}<br/>'''Recommended value:''' severe | ||
|AlertDescription=A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | |AlertDescription=A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | ||
Line 142: | Line 146: | ||
|Threshold=Specified by: {{#replace:raa.prometheusRule.alerts.health.for|.|.<wbr/>}}<br/> '''Recommended value''': 30m | |Threshold=Specified by: {{#replace:raa.prometheusRule.alerts.health.for|.|.<wbr/>}}<br/> '''Recommended value''': 30m | ||
}}{{PEAlert | }}{{PEAlert | ||
− | |Alert=raa-errors | + | |Alert=raa-errors |
|Severity='''Specified by''': {{#replace:raa.prometheusRule.alerts.raa-errors.labels.severity|.|.<wbr/>}} in values.yaml. <br/>'''Recommended value''': warning | |Severity='''Specified by''': {{#replace:raa.prometheusRule.alerts.raa-errors.labels.severity|.|.<wbr/>}} in values.yaml. <br/>'''Recommended value''': warning | ||
− | |AlertDescription=A | + | |AlertDescription=A nonzero value indicates that errors have been logged during the scrape interval. |
|BasedOn=gcxi_raa_error_count | |BasedOn=gcxi_raa_error_count | ||
|Threshold=>0 | |Threshold=>0 |
Latest revision as of 19:45, April 7, 2022
Find the metrics RAA exposes and the alerts defined for RAA.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
RAA | PodMonitor and PrometheusRule | metrics: 9100, health: 9101 | Pod selector built from raa.statefulset.selector.matchLabels (details follow the table) | metrics: several seconds, health: up to 3 minutes |

Endpoint/Selector details: RAA forms the matched labels from the raa.statefulset.selector.matchLabels values specified in values.yaml. By default, they contain a single raa-app item whose value is the raa.serviceName variable. The raa.serviceName element is a concatenation of parameters.

....
statefulset:
  ## pod selector
  selector:
    matchLabels:
      raa-app: "{{ tpl $.Values.raa.serviceName $ }}"
  template:
    ## a map of pod specific labels to add to common labels
    labels:
      raa-app: "{{ tpl $.Values.raa.serviceName $ }}"
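The selector described above relies on the standard Kubernetes matchLabels rule: a pod is selected when every key/value pair in matchLabels also appears in the pod's labels. The following is a minimal sketch of that rule only, not RAA or Prometheus Operator code. The service name "gcxi-raa" and the extra "release" label are hypothetical values chosen for illustration; the real raa-app value is whatever raa.serviceName renders to in your deployment.

```python
def selector_matches(match_labels, pod_labels):
    """Return True when every matchLabels pair is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical rendered value of raa.serviceName (assumption for this sketch).
service_name = "gcxi-raa"

# Default selector from values.yaml: a single raa-app item.
match_labels = {"raa-app": service_name}

# Pod labels: the template adds the same raa-app label to the common labels.
pod_labels = {"raa-app": service_name, "release": "demo"}

if __name__ == "__main__":
    print(selector_matches(match_labels, pod_labels))
```

Because the statefulset template applies the same raa-app label that the selector expects, the pod is selected as long as both values render from the same raa.serviceName.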
See details about the metrics and alerts in the sections below.
Metrics
Metric and description | Metric details | Indicator of |
---|---|---|
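gcxi_ A health status metric extracted using a dedicated port on the monitor container. The metric value is the sum of the values from two different health checks:
* A database-based health check. A result of 2 indicates that RAA is working or in maintenance (it has received a STOP command from Genesys Info Mart and will restart only after receiving the START command). A result of 0 indicates that RAA is not working according to this check.
* A local health check. A result of 1 indicates that RAA is processing aggregation requests according to local health files. A result of 0 indicates that RAA is not processing aggregation requests according to local health files.
|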
Unit: Type: Gauge |
Health check |
gcxi_ The number of commands received from Genesys Info Mart since the previous scrape. The label reflects the name of the command. The supported commands are: START, QUIT, EXIT, UPDATE_CONFIG, REAGGREGATE |
Unit: Type: Counter |
Traffic |
gcxi_raa_dispatch_count The number of dispatch events (moving aggregation requests from AGR_NOTIFICATION to PENDING_AGR) since the previous scrape. Dispatch events typically occur every 15 seconds. These events are used for the aggregation health check that is based on local files. |
Unit: Type: Counter |
Health check |
gcxi_ The number of heartbeats since the previous scrape. A heartbeat normally occurs once every five minutes and is used for the health check that is based on local files. The label is the current RAA version. |
Unit: Type: Counter |
Health check |
gcxi_raa_relaunched_count The number of times RAA was relaunched since the previous scrape. The aggregation process can exit when an error occurs. Genesys Info Mart sends a START command every 15 minutes during the aggregation period, which causes RAA to relaunch. |
Unit: Type: Counter |
Error |
gcxi_ The number of times RAA launched since the previous scrape. |
Unit: Type: Counter |
Error |
gcxi_ The number of errors registered since the previous scrape. |
Unit: Type: Counter |
Error |
gcxi_ The number of fact change notifications received from Genesys Info Mart since the previous scrape. |
Unit: Type: Counter |
Latency |
gcxi_ The total amount of time attributed to changed fact periods in notifications received from Genesys Info Mart since the previous scrape. |
Unit: milliseconds Type: Counter |
Latency |
gcxi_ The total amount of time attributed to fact notification delays in notifications received from Genesys Info Mart since the previous scrape. Notification delay is calculated as the difference between the moment of notification and the start of the changed period. |
Unit: milliseconds Type: Counter |
Latency |
gcxi_ The number of aggregations completed by RAA since the previous scrape. RAA groups the data by aggregation hierarchy name, materialized level (usually SUBHOUR, HOUR, DAY, MONTH), and media type (Online, Offline). |
Unit: Type: Counter |
Traffic |
gcxi_ The total number of periods aggregated by RAA since the previous scrape. |
Unit: milliseconds Type: Counter |
Traffic |
gcxi_ The total duration of time-period aggregations completed by RAA since the previous scrape. |
Unit: milliseconds Type: Counter |
Traffic |
gcxi_ The total duration of delays for aggregations completed by RAA since the previous scrape. Aggregation delay is calculated as the difference between the moment the aggregation completes and the start of the aggregation range. |
Unit: milliseconds Type: Counter |
Latency |
gcxi_ The number of records purged by RAA since the previous scrape. RAA groups the data by purged table name.
|
Unit: seconds Type: Counter |
Traffic |
gcxi_ The total amount of time spent on purging since the previous scrape. |
Unit: milliseconds Type: Counter |
Traffic |
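Because the health status metric is a sum of two independent checks (the database-based check contributes 2 or 0, and the local check contributes 1 or 0), a scraped value can be split back into the two results. The following is a minimal sketch of that decoding. The function and type names are hypothetical and are not part of RAA; the only facts taken from the descriptions above are the per-check values.

```python
from typing import NamedTuple


class RaaHealth(NamedTuple):
    database_ok: bool  # database-based check: working or in maintenance
    local_ok: bool     # local check: processing aggregation requests


def decode_health_level(value: int) -> RaaHealth:
    """Split the summed health value (0..3) into its two check results."""
    if value not in (0, 1, 2, 3):
        raise ValueError(f"unexpected health value: {value}")
    # 2 comes only from the database-based check, 1 only from the local check.
    return RaaHealth(database_ok=bool(value & 2), local_ok=bool(value & 1))


if __name__ == "__main__":
    for sample in range(4):
        print(sample, decode_health_level(sample))
```

Under this reading, a value of 3 means both checks pass, 2 means only the database-based check passes, 1 means only the local check passes, and 0 means neither does.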
Alerts
Various raa.prometheusRule.alerts.* parameters in the values.yaml file specify the severity of alerts and some thresholds.
The following alerts are defined for RAA.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
raa-health | Specified by: raa.prometheusRule.alerts.labels.severity. Recommended value: severe | A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | gcxi_raa_health_level | Specified by: raa.prometheusRule.alerts.health.for. Recommended value: 30m |
raa-errors | Specified by: raa.prometheusRule.alerts.raa-errors.labels.severity in values.yaml. Recommended value: warning | A nonzero value indicates that errors have been logged during the scrape interval. | gcxi_raa_error_count | >0 |
raa-long-aggregation | Specified by: raa. Recommended value: warning | Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. | gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count | Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. Recommended value: 300 |
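The raa-long-aggregation condition divides a duration counter in milliseconds by a completion counter and compares the result with a threshold given in seconds, so the units have to be aligned before the comparison. The following is a minimal sketch of that arithmetic. The function name is hypothetical, the default of 300 is the recommended value from the table, and the exact expression in the shipped PrometheusRule is not shown in this document, so treat the unit conversion as an assumption.

```python
def long_aggregation_firing(duration_ms: float, count: float,
                            threshold_sec: float = 300.0) -> bool:
    """Return True when the average aggregation duration exceeds the threshold.

    duration_ms: increase of gcxi_raa_aggregated_duration_ms over the window
    count:       increase of gcxi_raa_aggregated_count over the same window
    """
    if count <= 0:
        # No completed aggregations in the window, so there is no average.
        return False
    average_ms = duration_ms / count
    # The threshold is configured in seconds; the counter is in milliseconds.
    return average_ms > threshold_sec * 1000.0


if __name__ == "__main__":
    # Two aggregations that took 900 000 ms in total average 450 s each.
    print(long_aggregation_firing(900_000, 2))
```

Because the metrics carry the hierarchy, level, and mediaType labels, the same comparison would apply separately to each label combination.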