RAA metrics and alerts
Find the metrics RAA exposes and the alerts defined for RAA.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
RAA | PodMonitor and PrometheusRule | metrics: 9100, health: 9101 |
RAA builds the selector's matched labels from the raa.statefulset.selector.matchLabels values specified in values.yaml. By default, these values contain a single raa-app entry whose value is the raa.serviceName variable. The raa.serviceName value is a concatenation of parameters. ....
statefulset:
  ## pod selector
  selector:
    matchLabels:
      raa-app: "{{ tpl $.Values.raa.serviceName $ }}"
  template:
    ## a map of pod specific labels to add to common labels
    labels:
      raa-app: "{{ tpl $.Values.raa.serviceName $ }}" |
metrics: several seconds, health: up to 3 minutes |
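The selector behavior described in the table can be illustrated with a minimal Python sketch. It assumes the standard Kubernetes matchLabels rule (a pod is selected only if it carries every key/value pair in the selector); the function name `matches` and the service name `gcxi-raa` are hypothetical and do not come from the RAA chart.

```python
def matches(selector_labels: dict, pod_labels: dict) -> bool:
    """Return True if the pod carries every label listed in the selector."""
    return all(pod_labels.get(key) == value for key, value in selector_labels.items())


# Hypothetical rendered value of raa.serviceName.
selector = {"raa-app": "gcxi-raa"}

# A pod may carry extra labels; only the selector's pairs have to match.
raa_pod = {"raa-app": "gcxi-raa", "app.kubernetes.io/instance": "demo"}
other_pod = {"raa-app": "some-other-service"}

print(matches(selector, raa_pod))    # True
print(matches(selector, other_pod))  # False
```

Because the same raa-app value is written to both the selector and the pod template labels, the RAA pods satisfy this check by construction.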
See details about the metrics and alerts in the following sections.
Metrics
Metric and description | Metric details | Indicator of |
---|---|---|
gcxi_ A health status metric extracted using a dedicated port on the monitor container. The metric value is a sum of values from two different health-checks:
|
Unit: Type: Gauge |
Health check |
gcxi_ The number of commands received from Genesys Info Mart since the previous scrape. The label reflects the name of the command. The supported commands are: START, QUIT, EXIT, UPDATE_CONFIG, REAGGREGATE. |
Unit: Type: Counter |
Traffic |
gcxi_ The number of dispatch events (moving aggregation requests from AGR_NOTIFICATION to PENDING_AGR) since the previous scrape. Dispatch events typically occur every 15 seconds. These events are used for the aggregation health check based on local files. |
Unit: Type: Counter |
Health check |
gcxi_ The number of heartbeats since the previous scrape. A heartbeat is normally performed once every five minutes and is used for the health check based on local files. The label is the current RAA version. |
Unit: Type: Counter |
Health check |
gcxi_ The number of times RAA was relaunched since the previous scrape. The aggregation process can exit when an error occurs. Genesys Info Mart sends a START command every 15 minutes during the aggregation period, which causes RAA to relaunch. |
Unit: Type: Counter |
Error |
gcxi_ The number of times RAA launched since the previous scrape. |
Unit: Type: Counter |
Error |
gcxi_ The number of errors registered since the previous scrape. |
Unit: Type: Counter |
Error |
gcxi_ The number of fact change notifications received from Genesys Info Mart since the previous scrape. |
Unit: Type: Counter |
Latency |
gcxi_ The total amount of time attributed to changed fact periods in notifications received from Genesys Info Mart since the previous scrape. |
Unit: milliseconds Type: Counter |
Latency |
gcxi_ The total amount of time attributed to fact notification delays in notifications received from Genesys Info Mart since the previous scrape. Notification delay is calculated as the difference between the moment of notification and the start of the changed period. |
Unit: milliseconds Type: Counter |
Latency |
gcxi_ The number of aggregations completed by RAA since the previous scrape. RAA groups the data by aggregation hierarchy name, materialized level (usually SUBHOUR, HOUR, DAY, MONTH), and media type (Online, Offline). |
Unit: Type: Counter |
Traffic |
gcxi_ The total number of periods aggregated by RAA since the previous scrape. |
Unit: milliseconds Type: Counter |
Traffic |
gcxi_ The total duration of the time periods covered by aggregations that RAA completed since the previous scrape. |
Unit: milliseconds Type: Counter |
Traffic |
gcxi_ The total duration of delays for aggregations completed by RAA since the previous scrape. Aggregation delay is calculated as the difference between the moment aggregation completes and the start of the aggregation range. |
Unit: milliseconds Type: Counter |
Latency |
gcxi_ The number of records purged by RAA since the previous scrape. RAA groups the data by purged table name.
|
Unit: seconds Type: Counter |
Traffic |
gcxi_ The total amount of time spent on purging since the previous scrape. |
Unit: milliseconds Type: Counter |
Traffic |
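The delay metrics in the table are running totals, so they are usually read together with the matching count metric. The following Python sketch illustrates the arithmetic for notification delay as defined above (moment of notification minus start of the changed period). The function names and sample timestamps are hypothetical; RAA performs this accounting internally, and this is only an illustration of the definition.

```python
from datetime import datetime, timedelta


def notification_delay_ms(notified_at: datetime, period_start: datetime) -> int:
    """Delay of one notification: notification moment minus start of the changed period."""
    return int((notified_at - period_start) / timedelta(milliseconds=1))


def summarize(notifications):
    """Return (count, total_delay_ms, average_delay_ms) for one scrape interval.

    `notifications` is a list of (notified_at, period_start) pairs.
    The average is None when no notifications were received.
    """
    delays = [notification_delay_ms(n, s) for n, s in notifications]
    count = len(delays)
    total = sum(delays)
    average = total / count if count else None
    return count, total, average


base = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    (base + timedelta(minutes=20), base),                          # 20-minute delay
    (base + timedelta(minutes=40), base + timedelta(minutes=30)),  # 10-minute delay
]
print(summarize(sample))  # (2, 1800000, 900000.0)
```

Dividing the increase of a delay total by the increase of the corresponding count over the same window gives the average delay per notification (or per aggregation) for that window.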
Alerts
Various raa.prometheusRule.alerts.* parameters in the values.yaml file specify the severity of alerts and some thresholds.
The following alerts are defined for RAA.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
raa-health | Specified by: raa. Recommended value: severe |
A zero value for a recent period (several scrape intervals) indicates that RAA is not operating. | gcxi_raa_health_level | Specified by: raa. Recommended value: 30m
|
raa-errors | Specified by: raa. Recommended value: warning |
A nonzero value indicates that errors have been logged during the scrape interval. | gcxi_raa_error_count | >0
|
raa-long-aggregation | Specified by: raa. Recommended value: warning |
Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold. | gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count | Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. Recommended value: 300 |
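The raa-long-aggregation condition can be sketched in Python as follows. This is a minimal illustration, not the actual rule expression: it assumes the inputs are increases of gcxi_raa_aggregated_duration_ms and gcxi_raa_aggregated_count over the same window, keyed by (hierarchy, level, mediaType). The function name `long_aggregations` and the hierarchy names in the sample are hypothetical. Note the unit conversion, because the metric is in milliseconds while thresholdSec is in seconds.

```python
def long_aggregations(duration_ms: dict, count: dict, threshold_sec: float = 300) -> dict:
    """Return label groups whose average aggregation duration exceeds the threshold.

    Keys of both dicts are (hierarchy, level, mediaType) tuples.
    Groups with no completed aggregations are skipped to avoid dividing by zero.
    """
    flagged = {}
    for labels, total_ms in duration_ms.items():
        completed = count.get(labels, 0)
        if completed == 0:
            continue
        average_sec = total_ms / completed / 1000.0  # milliseconds to seconds
        if average_sec > threshold_sec:
            flagged[labels] = average_sec
    return flagged


# Hypothetical sample values for one evaluation window.
duration_ms = {
    ("HIERARCHY_A", "SUBHOUR", "Online"): 1_200_000,  # 3 aggregations, 400 s each on average
    ("HIERARCHY_B", "DAY", "Offline"): 90_000,        # 2 aggregations, 45 s each on average
    ("HIERARCHY_C", "HOUR", "Online"): 0,             # nothing completed in this window
}
count = {
    ("HIERARCHY_A", "SUBHOUR", "Online"): 3,
    ("HIERARCHY_B", "DAY", "Offline"): 2,
    ("HIERARCHY_C", "HOUR", "Online"): 0,
}

print(long_aggregations(duration_ms, count))
# {('HIERARCHY_A', 'SUBHOUR', 'Online'): 400.0}
```

With the recommended threshold of 300 seconds, only the first group would trigger the alert in this sample.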