RAA metrics and alerts

This topic is part of the Genesys Customer Experience Insights Private Edition Guide, for the Current version of Reporting.


Find the metrics RAA exposes and the alerts defined for RAA.

Service: RAA
CRD or annotations? PodMonitor and PrometheusRule
Port: metrics: 9100, health: 9101
Endpoint/Selector: RAA forms matched labels from raa.statefulset.selector.matchLabels values, specified in values.yaml. The default contains a single raa-app item with raa.serviceName variable as a value. The element raa.serviceName is a concatenation of parameters.

....
  statefulset:
    ## pod selector
    selector:
      matchLabels:
        raa-app: "{{ tpl $.Values.raa.serviceName $ }}"

    template:
      ## a map of pod specific labels to add to common labels
      labels:
        raa-app: "{{ tpl $.Values.raa.serviceName $ }}"

Metrics update interval: metrics: several seconds, health: up to 3 minutes
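
Because the metrics port is scraped through a PodMonitor, it presumably serves the standard Prometheus text exposition format. The following is a minimal sketch of reading such output, not RAA code: the function name parse_samples and the sample text are hypothetical, and the parser handles only plain "name{labels} value" lines.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces: key="value".
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical scrape output, for illustration only.
EXAMPLE = """
# TYPE gcxi_raa_health_level gauge
gcxi_raa_health_level 3
gcxi_raa_command_count{cmd="START"} 1
"""

samples = parse_samples(EXAMPLE)
```
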


Metrics

Each entry lists the metric and its description, the metric details (unit, type, label, sample value), and what the metric is an indicator of.
gcxi_raa_health_level

A health status metric extracted using a dedicated port on the monitor container. The metric value is the sum of the values from two different health checks:

  • A database-based health check. Results:
    • A result of 2 indicates that RAA is working or is in maintenance (it has received a STOP command from Genesys Info Mart, and will restart only after receiving the START command).
    • A result of 0 indicates that RAA is not working according to this check.
  • A local health check. Results:
    • A result of 1 indicates that RAA is processing aggregation requests according to local health files.
    • A result of 0 indicates that RAA is not processing aggregation requests according to local health files.
Unit:

Type: Gauge
Label:
Sample value: 0

Indicator of: Health check

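
Because the two checks contribute different values (2 and 1), the sum in gcxi_raa_health_level can be split back into the individual results. The following is a minimal sketch, assuming the metric takes only the values 0 through 3 implied by the description above. The names RaaHealth and decode_health_level are hypothetical.

```python
from typing import NamedTuple


class RaaHealth(NamedTuple):
    database_ok: bool  # database-based check contributed 2 (working or in maintenance)
    local_ok: bool     # local health-file check contributed 1


def decode_health_level(value: int) -> RaaHealth:
    """Split a gcxi_raa_health_level sample into its two health-check results."""
    if value not in (0, 1, 2, 3):
        raise ValueError(f"unexpected gcxi_raa_health_level value: {value}")
    return RaaHealth(database_ok=value >= 2, local_ok=value % 2 == 1)
```

For example, a value of 2 would mean that the database-based check passes while the local health files show no aggregation activity.
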
gcxi_raa_command_count

The number of commands received from Genesys Info Mart since the previous scrape. The label reflects the name of the command. The supported commands are: START, QUIT, EXIT, UPDATE_CONFIG, REAGGREGATE.

Unit:

Type: Counter
Label: cmd
Sample value: 10

Indicator of: Traffic

gcxi_raa_dispatch_count

The number of dispatch events (moving aggregation requests from AGR_NOTIFICATION to PENDING_AGR) since the previous scrape. Dispatch events typically occur every 15 seconds. These events are used for the aggregation health check based on local files.

Unit:

Type: Counter
Label:
Sample value: 100

Indicator of: Health check

gcxi_raa_heartbeat_count

The number of heartbeats since the previous scrape. A heartbeat normally occurs once every five minutes, and is used for the health check based on local files. The label is the current RAA version.

Unit:

Type: Counter
Label: version
Sample value: 10

Indicator of: Health check

gcxi_raa_relaunched_count

The number of times RAA was relaunched since the previous scrape. The aggregation process can exit when an error occurs. Genesys Info Mart sends a START command every 15 minutes during the aggregation period, which causes RAA to relaunch.

Unit:

Type: Counter
Label:
Sample value: 1

Indicator of: Error

gcxi_raa_launched_count

The number of times RAA launched since the previous scrape.

Unit:

Type: Counter
Label: version
Sample value: 1

Indicator of: Error

gcxi_raa_error_count

The number of errors registered since the previous scrape.

Unit:

Type: Counter
Label:
Sample value: 1

Indicator of: Error

gcxi_raa_notification_count

The number of fact change notifications received from Genesys Info Mart since the previous scrape.

Unit:

Type: Counter
Label: fact
Sample value: 10

Indicator of: Latency

gcxi_raa_notification_period_ms

The total amount of time attributed to changed fact periods in notifications received from Genesys Info Mart since the previous scrape.

Unit: milliseconds

Type: Counter
Label: fact
Sample value:

Indicator of: Latency

gcxi_raa_notification_delay_ms

The total amount of time attributed to fact notification delays in notifications received from Genesys Info Mart since the previous scrape. Notification delay is calculated as the difference between the moment of notification and the start of the changed period.

Unit: milliseconds

Type: Counter
Label: fact
Sample value:

Indicator of: Latency

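
The notification totals become easier to read when divided by gcxi_raa_notification_count for the same fact label and scrape interval. The following is a minimal sketch of that calculation. The division is an assumption modeled on the ratio that the raa-long-aggregation alert uses for aggregations, not a documented formula. The function name and the fact label values are hypothetical.

```python
def average_notification_delay_ms(delay_ms_by_fact, count_by_fact):
    """Return the mean notification delay (ms) per fact label.

    delay_ms_by_fact: gcxi_raa_notification_delay_ms samples keyed by fact label.
    count_by_fact: gcxi_raa_notification_count samples keyed by fact label.
    Facts with no notifications in the interval are omitted from the result.
    """
    averages = {}
    for fact, count in count_by_fact.items():
        if count > 0:
            averages[fact] = delay_ms_by_fact.get(fact, 0.0) / count
    return averages


# Hypothetical samples from one scrape interval.
averages = average_notification_delay_ms(
    {"fact_a": 600000.0, "fact_b": 0.0},
    {"fact_a": 10, "fact_b": 0},
)
```

The same division can be applied to gcxi_raa_notification_period_ms to estimate the mean changed period per notification, under the same assumption.
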
gcxi_raa_aggregated_count

The number of aggregations completed by RAA since the previous scrape. RAA groups the data by aggregation hierarchy name, materialized level (usually SUBHOUR, HOUR, DAY, MONTH), and media type (Online, Offline).

Unit:

Type: Counter
Label: hierarchy, level, mediaType
Sample value:

Indicator of: Traffic

gcxi_raa_aggregated_period_ms

The total amount of time covered by the periods aggregated by RAA since the previous scrape.

Unit: milliseconds

Type: Counter
Label: hierarchy, level, mediaType
Sample value: 10

Indicator of: Traffic

gcxi_raa_aggregated_duration_ms

The total amount of time spent on aggregations completed by RAA since the previous scrape.

Unit: milliseconds

Type: Counter
Label: hierarchy, level, mediaType
Sample value:

Indicator of: Traffic

gcxi_raa_aggregated_delay_ms

The total duration of delays for aggregations completed by RAA since the previous scrape. Aggregation delay is calculated as the difference between the moment aggregation completes and the start of the aggregation range.

Unit: milliseconds

Type: Counter
Label: hierarchy, level, mediaType
Sample value:

Indicator of: Latency

gcxi_raa_purged_count

The number of records purged by RAA since the previous scrape. RAA groups the data by purged table name.

Unit:

Type: Counter
Label: table
Sample value:

Indicator of: Traffic

gcxi_raa_purged_duration_ms

The total amount of time spent on purging since the previous scrape.

Unit: milliseconds

Type: Counter
Label: table
Sample value:

Indicator of: Traffic


Alerts

Various raa.prometheusRule.alerts.* parameters in the values.yaml file specify the severity of alerts and some thresholds.

The following alerts are defined for RAA.

raa-health

Severity: Specified by raa.prometheusRule.alerts.labels.severity. Recommended value: severe
Description: A zero value for a recent period (several scrape intervals) indicates that RAA is not operating.
Based on: gcxi_raa_health_level
Threshold: Specified by raa.prometheusRule.alerts.health.for. Recommended value: 30m

raa-errors

Severity: Specified by raa.prometheusRule.alerts.raa-errors.labels.severity in values.yaml. Recommended value: warning
Description: A nonzero value indicates that errors have been logged during the scrape interval.
Based on: gcxi_raa_error_count
Threshold: >0

raa-long-aggregation

Severity: Specified by raa.prometheusRule.alerts.longAggregation.labels.severity in values.yaml. Recommended value: warning
Description: Indicates that the average duration of aggregation queries specified by the hierarchy, level, and mediaType labels is greater than the deadlock-threshold.
Based on: gcxi_raa_aggregated_duration_ms / gcxi_raa_aggregated_count
Threshold: Greater than the value (seconds) of raa.prometheusRule.alerts.longAggregation.thresholdSec in values.yaml. Recommended value: 300
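
The raa-long-aggregation condition divides a millisecond total by a count and compares the result with a threshold given in seconds. The following is a minimal sketch of that comparison, not the actual rule expression from the chart. It assumes the ratio is converted from milliseconds to seconds before the comparison. The function name is hypothetical.

```python
def is_long_aggregation(aggregated_duration_ms, aggregated_count, threshold_sec=300.0):
    """Return True if the average aggregation duration exceeds the threshold.

    aggregated_duration_ms: gcxi_raa_aggregated_duration_ms for one
        hierarchy/level/mediaType label set.
    aggregated_count: gcxi_raa_aggregated_count for the same label set.
    threshold_sec: value of longAggregation.thresholdSec (recommended: 300).
    """
    if aggregated_count <= 0:
        return False  # no aggregations completed, so there is no average
    average_sec = aggregated_duration_ms / aggregated_count / 1000.0  # assumed ms-to-s conversion
    return average_sec > threshold_sec
```

With the recommended threshold, two aggregations totaling 900000 ms average 450 seconds and would satisfy the condition, while two aggregations totaling 600000 ms average exactly 300 seconds and would not.
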