GSP metrics and alerts
Find the metrics GSP exposes and the alerts defined for GSP.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
GSP | PodMonitor | 9249 | Endpoint: /; Selector: matchLabels: app: {{ template "gsp.fullname" . }}, where the value of gsp.fullname depends on deployment parameters such as the Helm release name, .Values.fullnameOverride, and .Values.nameOverride. | 30 seconds |
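As a rough illustration of what a scrape of this endpoint returns, the following minimal sketch fetches the metrics page from one pod on port 9249 at path / and parses it. It assumes the endpoint serves the Prometheus text exposition format without trailing timestamps; the `host` argument and both function names are hypothetical and not part of GSP.

```python
from urllib.request import urlopen


def parse_exposition(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last token.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(host, port=9249, path="/"):
    """Fetch and parse metrics from one pod (host is a placeholder)."""
    with urlopen(f"http://{host}:{port}{path}", timeout=5) as resp:
        return parse_exposition(resp.read().decode("utf-8"))
```

Calling `scrape("<pod-ip>")` from inside the cluster would return a dictionary keyed by series name, including labels. In normal operation Prometheus performs this scrape itself every 30 seconds through the PodMonitor.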
See details about the metrics and alerts in the following sections.
Metrics
GSP exposes some standard Apache Flink and Kafka metrics as well as Genesys-defined metrics, which are exposed via the Flink API. Therefore, all GSP metrics start with the prefix flink_ but in some cases the values are calculated by GSP.
You can query Prometheus directly to see all the metrics Flink and the Flink Kafka connector expose through GSP.
- For full information about the standard Flink metrics, see the Apache Flink documentation.
- For full information about the Kafka metrics, see the Apache Kafka or Confluent Kafka documentation.
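One way to list the available metrics is through the Prometheus HTTP API, which returns all known metric names at `/api/v1/label/__name__/values`. The sketch below filters that list by the flink_ prefix. The `prometheus_url` argument and the function names are hypothetical; adjust them to your monitoring setup.

```python
import json
from urllib.request import urlopen


def flink_metric_names(names, prefix="flink_"):
    """Keep only metric names that start with the given prefix."""
    return sorted(n for n in names if n.startswith(prefix))


def list_flink_metrics(prometheus_url):
    """Ask Prometheus for all metric names, then filter by prefix."""
    url = f"{prometheus_url.rstrip('/')}/api/v1/label/__name__/values"
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return flink_metric_names(payload["data"])
```

For example, `list_flink_metrics("http://prometheus.example:9090")` would return the sorted flink_ metric names, where the URL is a placeholder for your Prometheus server.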
The following metrics are likely to be particularly useful. The naming convention is <flink_scope_prefix>_<GSP suffix>. Genesys does not commit to maintaining currently available GSP metrics that are not documented on this page.
Metric and description | Metric details | Indicator of |
---|---|---|
flink_ Number of invalid input records. | Type: Gauge | Error |
flink_ Number of running Flink jobs. If less than 1, there is a problem. | Type: Gauge | Error |
flink_ Number of messages exceeding the max.request.size Kafka option. | Type: Gauge; Sample value: 0 | Error |
flink_ Number of issues encountered, such as errors or warnings. | Type: Gauge | Error |
flink_ The last watermark received by this operator/task, in milliseconds since the Unix Epoch (00:00:00 UTC on 1 January 1970). | Unit: milliseconds; Type: Gauge | Latency |
flink_ The last watermark this operator has emitted, in milliseconds since the Unix Epoch. | Unit: milliseconds; Type: Gauge | Latency |
flink_ The maximum lag in terms of the number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers. | Type: Gauge | Latency |
flink_ The average number of records consumed per second. | Type: Gauge | Traffic |
flink_ Total number of EventCallCreated events GSP received since it started processing. | Type: Gauge | Traffic |
flink_ Number of EventCallCreated events per second (CPS). | Type: Gauge | Traffic |
flink_ Total number of CallThreads GSP received since it started processing. | Type: Gauge | Traffic |
flink_ Number of CallThreads per second (CTHPS). | Type: Gauge | Traffic |
flink_ Total number of EventOCSChainStartProcessing events GSP received since it started processing. | Type: Gauge | Traffic |
flink_ Number of EventOCSChainStartProcessing events per second (CPS). | Type: Gauge | Traffic |
flink_ The recent CPU usage for the JVM process. The value is a double in the [0.0,1.0] interval, where a value of 0.0 means that none of the CPUs were running threads from the JVM process, while a value of 1.0 means that all CPUs were actively running threads from the JVM 100% of the time during the recent period being observed. A negative value means usage data is not available. For more information, see https:/ | Type: Gauge | Saturation |
flink_ The total capacity of all buffers in the direct buffer pool. | Unit: bytes; Type: Gauge | Saturation |
flink_ The amount of memory used by the JVM for the direct buffer pool. | Unit: bytes; Type: Gauge | Saturation |
flink_ The maximum amount of non-heap memory that can be used for memory management. | Unit: bytes; Type: Gauge | Saturation |
flink_ The amount of non-heap memory currently used. | Unit: bytes; Type: Gauge | Saturation |
flink_ The maximum amount of heap memory that can be used for memory management. | Unit: bytes; Type: Gauge | Saturation |
flink_ The amount of heap memory currently used. | Unit: bytes; Type: Gauge | Saturation |
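Two of the metric families above are easier to read after a small conversion: watermarks are reported in milliseconds since the Unix Epoch, and memory is reported as separate used and maximum byte counts. The following minimal sketch shows both conversions. The function names are hypothetical, and treating a non-positive maximum as "not available" is an assumption.

```python
import time


def watermark_lag_seconds(watermark_ms, now_ms=None):
    """Seconds between a watermark (ms since the Unix Epoch) and now."""
    if now_ms is None:
        now_ms = time.time() * 1000
    return (now_ms - watermark_ms) / 1000.0


def memory_utilization(used_bytes, max_bytes):
    """Fraction of the maximum in use, or None if the maximum is not positive."""
    if max_bytes <= 0:
        # Assumption: a non-positive maximum means the limit is undefined.
        return None
    return used_bytes / max_bytes
```

A watermark lag that keeps growing suggests that event-time processing is falling behind, and a heap utilization that stays close to 1.0 suggests memory pressure.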
Alerts
The alerts are based on Flink and Kubernetes cluster metrics.
The following alerts are defined for GSP.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
GspFlinkJobDown | Critical | Triggered when the GSP Flink job is not running (the number of running jobs equals 0 or the metric is not available) | flink_jobmanager_numRunningJobs | For 5 minutes |
GspOOMKilled | Critical | Triggered when a GSP pod is restarted because of OOMKilled | kube_pod_container_status_restarts_total | 0 |
GspNoTmRegistered | Critical | Triggered when there are no registered TaskManagers (or the metric is not available) | flink_jobmanager_numRegisteredTaskManagers | For 5 minutes |
GspUnknownPerson | High | Triggered when GSP encounters unknown person(s) | flink_ | For 5 minutes |
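To illustrate what a "For 5 minutes" threshold means for an alert such as GspFlinkJobDown, the following minimal sketch evaluates a series of samples of flink_jobmanager_numRunningJobs. It is not the actual alert rule, which is not shown on this page. The function name and the sample layout are hypothetical, and a missing metric is modeled as `None`.

```python
def job_down_alert(samples, hold_seconds=300):
    """Return True if the job has been down for at least hold_seconds.

    samples: list of (timestamp_seconds, value) pairs, oldest first,
    where value is None when the metric is not available.
    """
    bad_since = None
    for ts, value in samples:
        if value is None or value == 0:
            # Condition holds: remember when the current bad streak began.
            if bad_since is None:
                bad_since = ts
        else:
            # Condition cleared: reset the streak.
            bad_since = None
    if bad_since is None:
        return False
    return samples[-1][0] - bad_since >= hold_seconds
```

With samples taken every 30 seconds, the alert in this sketch fires only once the value has been 0 or missing in every sample across a full 5-minute span; a single healthy sample resets the timer.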