Difference between revisions of "PEC-CAB/Current/CABPEGuide/Metrics"

Revision as of 23:36, August 5, 2021

Metrics and alerting

GES exposes default metrics about the state of the Node.js application; this includes CPU usage, memory usage, and the state of the Node.js runtime.

You’ll find helpful metrics in the GES Metrics subsection, which includes some basic metrics such as REST API usage, the number of created callbacks, call-in requests, and so on. These basic metrics are created as counters, which means that the values will monotonically increase over time from the beginning of a GES pod's lifespan. For more information about counters, see Metric Types in the Prometheus documentation.
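Because a counter starts again from zero when a pod restarts, raw counter values are rarely useful on their own; what matters is how much the counter grew. The following minimal Python sketch illustrates that idea. It is a simplification, not the Prometheus implementation (Prometheus also extrapolates to the edges of the time window), and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Return the total growth of a counter across successive samples.

    A drop between two samples is treated as a counter reset (for example,
    a pod restart), so the new value is counted as growth since the reset.
    """
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Counter restarted from zero between these two samples.
            total += curr
    return total


# Hypothetical scrapes of a callbacks-created counter.
# The pod restarts between the third and fourth scrape.
scrapes = [10, 14, 21, 2, 5]
print(counter_increase(scrapes))  # prints 16
```

This is why the sample expressions wrap counters in functions such as increase() or rate() rather than reading the raw value.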

You can develop a solid understanding of the performance of a given GES deployment or pod by watching how these metrics change over time. The sample Prometheus expressions show you how to use the basic metrics to gain valuable insights into your callback-related activity. For information about deploying dashboards and accessing sample implementations, see Grafana dashboards and Sample implementations.

Sample Prometheus expressions

For more information about querying in Prometheus, see Querying Prometheus.

Purpose	Prometheus snippet	Notes
Find the number of Callbacks Created within a given time range across all tenants.	sum(increase(ges_callbacks_created{tenant=~"$Tenant"}[$__range]))	The same type of expression can be used to track callbacks, call-ins, and other metrics.
Find the number of Callbacks Created per minute for a given tenant.	sum by (tenant) (rate(ges_callbacks_created{tenant=~"$Tenant"}[5m])) * 60	The same type of expression can be used to track callbacks, call-ins and other metrics.
Find the number of API failures per minute (across all tenants).	sum by (path, httpCode) (rate(ges_http_failed_requests_total{tenant=~"$Tenant"}[5m]) * 60)
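Find the API success rate over a selected time range.	1 - (sum(increase(ges_http_failed_requests_total{tenant=~"$Tenant"}[$__range])) / sum(increase(ges_http_requests_total{tenant=~"$Tenant"}[$__range])))	Apply increase to each series before sum; a range selector cannot be applied directly to the result of an aggregation.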
Find the 15-minute rolling average response time by endpoint.	sum by (method, route, code)(increase(ges_http_request_duration_seconds_sum{pod=~"$Pod"}[15m])) / sum by (method, route, code)(increase(ges_http_request_duration_seconds_count{pod=~"$Pod"}[15m]))
Find the 15-minute rolling average response time by pod.	sum by (pod)(increase(ges_http_request_duration_seconds_sum{pod=~"$Pod"}[15m])) / sum by (pod)(increase(ges_http_request_duration_seconds_count{pod=~"$Pod"}[15m]))
Find the number of HTTP 401 errors per minute.	sum(rate(ges_http_failed_requests_total{httpCode="401", pod=~"$Pod"}[5m]) * 60)	Change the value of the `httpCode` label to query other response codes.
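The average response time and success rate expressions in the table are both ratios of counter increases. As a rough illustration of the arithmetic only, the following Python sketch computes the same ratios from plain numbers. The function names and the input values are hypothetical and are not part of GES or Prometheus.

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Average seconds per request over a window.

    Mirrors increase(..._duration_seconds_sum) / increase(..._duration_seconds_count).
    Returns None when no requests were served in the window.
    """
    requests = count_end - count_start
    if requests == 0:
        return None
    return (sum_end - sum_start) / requests


def success_rate(failed_increase, total_increase):
    """Fraction of successful requests: 1 - failed / total.

    Returns None when there was no traffic in the window.
    """
    if total_increase == 0:
        return None
    return 1 - failed_increase / total_increase


# Hypothetical window: total time spent grew by 30 seconds over 100 requests.
print(average_duration(120.0, 150.0, 400, 500))  # prints 0.3

# Hypothetical window: 50 failed requests out of 200.
print(success_rate(50, 200))  # prints 0.75
```

Both helpers return None for a window with no traffic; the corresponding Prometheus division yields NaN (or no data) in that case, so dashboards might show gaps during quiet periods.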

Health metrics

Health metrics, that is, those metrics that report on the status of connections from GES to dependencies such as ORS, GWS, Redis, and Postgres, do not work like the metrics described above. Instead, they are implemented as a gauge that toggles between "0" and "1". For information about gauges, see the Prometheus Metric types documentation. When the connection to a service is down, the metric is "1". When the service is up, the metric is "0".

How alerts work

In a Kubernetes deployment, GES relies on Prometheus and Alertmanager to generate alerts. These alerts can then be forwarded to a service of your choice (for example, PagerDuty). For information about finding sample alerts, see Sample implementations.

While GES leverages Prometheus, GES also has internal functionality that manually triggers alerts when certain criteria are met. The internal alert is turned into a counter (see the Prometheus Metric types documentation) that is incremented each time the conditions to fire the alert are met. The counter is made available on the /metrics endpoint. Use a Prometheus rule to capture the metric data and fire the alert on Prometheus. The following example shows an alert used in an Azure deployment; note how the process watches the increase in instances of the alert being fired over time to trigger the Prometheus alert.

- alert: GES_RBAC_CREATE_VQ_PROXY_ERROR
  annotations:
    summary: "There are issues managing VQ proxy objects on {{ $labels.pod }}"
  labels:
    severity: info
    action: email
    service: GES
  expr: increase(RBAC_CREATE_VQ_PROXY_ERROR[10m]) > 5
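To make the threshold in the preceding rule concrete, the following Python sketch checks whether a counter grew by more than 5 within the last 10 minutes. It is only an approximation of what Prometheus evaluates: it assumes the counter did not reset inside the window and does no extrapolation. The function name and the sample data are hypothetical.

```python
def should_fire(samples, now, window_seconds=600, threshold=5):
    """Decide whether a counter grew by more than `threshold` in the window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Assumes no counter reset inside the window.
    """
    in_window = [value for ts, value in samples
                 if now - window_seconds <= ts <= now]
    if len(in_window) < 2:
        # Not enough data points to measure any growth.
        return False
    return in_window[-1] - in_window[0] > threshold


# Hypothetical scrapes, one every two minutes.
noisy = [(0, 3), (120, 3), (240, 5), (360, 8), (480, 9), (600, 10)]
quiet = [(0, 3), (120, 3), (240, 4), (360, 4), (480, 5), (600, 6)]

print(should_fire(noisy, now=600))  # prints True  (grew by 7)
print(should_fire(quiet, now=600))  # prints False (grew by 3)
```

Watching the growth over a window, rather than the absolute value, means that an error which occurred many times in the distant past does not keep the alert firing.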

Health alerts in GES work a little differently. They are gauges, rather than counters. The gauge toggles between "0" and "1"; "1" indicates that the service is down and "0" indicates that the service is up. Because GES has an automatic health check that runs every 15-20 seconds or so, the health alerts are fired by simply checking that a connection has been in the DOWN state for a given period of time. The following example shows the ORS_REDIS_DOWN alert.

- alert: GES_ORS_REDIS_DOWN
  expr: ORS_REDIS_STATUS > 0
  for: 5m
  labels:
    severity: critical
    action: page
    service: GES
  annotations:
    summary: "ORS REDIS Connection down for {{ $labels.pod }}"
    dashboard: "See GES Performance > Health and Liveliness to track ORS Redis Health over time"
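The for clause in the preceding rule means that the expression must stay true continuously for the whole period before the alert fires; a brief recovery restarts the clock. The following Python sketch models that behavior for a 0/1 health gauge. It is an illustration under simplified assumptions (evenly spaced readings, no missing data), and the function name and readings are hypothetical.

```python
def down_long_enough(readings, hold_seconds=300):
    """Return True if the gauge has been above 0 for at least `hold_seconds`.

    readings: list of (timestamp_seconds, gauge_value), oldest first.
    Any reading of 0 (service up) resets the start of the DOWN period.
    """
    down_since = None
    for ts, value in readings:
        if value > 0:
            if down_since is None:
                down_since = ts  # start of the current DOWN period
        else:
            down_since = None    # service recovered; reset the clock
    if down_since is None:
        return False
    return readings[-1][0] - down_since >= hold_seconds


# Hypothetical readings, one per minute.
sustained = [(0, 0), (60, 1), (120, 1), (180, 0), (240, 1), (300, 1),
             (360, 1), (420, 1), (480, 1), (540, 1)]
flapping = [(0, 1), (60, 1), (120, 0), (180, 1), (240, 1)]

print(down_long_enough(sustained))  # prints True  (down from 240 to 540)
print(down_long_enough(flapping))   # prints False (down for only 60 seconds)
```

Requiring a sustained DOWN state keeps a single failed health check from paging anyone.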

Grafana dashboards

You can deploy the Grafana dashboards, which are included with the Helm chart, when you deploy GES. Set the Helm value .Values.ges.grafana.enabled to true. This creates a config map that automatically deploys the dashboards.

In some cases, the dashboards might need adjustment to work with your Grafana version and overall Kubernetes setup. To make changes, unpack the Helm chart .tar.gz file and make the necessary changes to the grafana/ges-dashboard-configmap.yaml and grafana/ges-performance-dashboard.yaml files. Experienced users can edit the JSON in these files directly. Alternatively, you can use the Grafana web interface to set up the dashboard, export the JSON for the dashboard (following the Grafana dashboard export and import instructions), and then copy the JSON into the appropriate file. When you redeploy the Helm chart, Grafana picks up the new dashboards.

Sample implementations

You can find sample implementations of alerts in the provided helm charts, in the prometheus/alerts.yaml file.

Sample dashboards, embedded in config maps, can be found in the grafana/ges-dashboard.yaml and grafana/ges-performance-dashboard.yaml files. These are for the business logic and performance dashboards, respectively. You might need to make some adjustments to get the alerts and dashboards working; see Grafana dashboards.

@@ Line 15: / Line 15: @@
 You’ll find helpful metrics in the '''GES Metrics''' subsection, which includes some basic metrics such as REST API usage, the number of created callbacks, call-in requests, and so on. These basic metrics are created as counters, which means that the values will monotonically increase over time from the beginning of a GES pod's lifespan. For more information about counters, see [https://prometheus.io/docs/concepts/metric_types/ Metric Types] in the Prometheus documentation.
 You can develop a solid understanding of the performance of a given GES deployment or pod by watching how these metrics change over time. The {{Link-SomewhereInThisVersion|manual=CABPEGuide|topic=Metrics|anchor=SamplePromExpressions|display text=sample Prometheus expressions}} show you how to use the basic metrics to gain valuable insights into your callback-related activity.
+For information about deploying dashboards and accessing sample implementations, see {{Link-SomewhereInThisVersion|manual=CABPEGuide|topic=Metrics|anchor=GrafanaDashboards|display text=Grafana dashboards}} and {{Link-SomewhereInThisVersion|manual=CABPEGuide|topic=Metrics|anchor=SampleImplementation|display text=Sample implementations}}.
 |Status=No
 }}{{Section
@@ Line 89: / Line 91: @@
              dashboard: "See GES Performance > Health and Liveliness to track ORS Redis Health over time"
 </source>
+|Status=No
+}}{{Section
+|sectionHeading=Grafana dashboards
+|anchor=GrafanaDashboards
+|alignment=Vertical
+|structuredtext=You can deploy the Grafana dashboards, included with the helm chart, when you deploy GES. Simply set the Helm value <tt>.Values.ges.grafana.enabled</tt> to '''true'''. This creates a config map to automatically deploy the dashboard.
+In some cases, the dashboards might need adjustment to work appropriately with your Grafana version and overall Kubernetes setup. To make changes, unpack the helm chart .tar.gz file. Make the necessary upgrades to the <tt>grafana/ges-dashboard-configmap.yaml</tt> and <tt>grafana/ges-performance-dashboard.yaml</tt> files. Experienced users can make changes in the JSON files. Alternatively, you can use the web interface to set up the dashboard, export the JSON for the dashboard (following the [https://grafana.com/docs/grafana/latest/dashboards/export-import/ Grafana dashboard export and import instructions]), and then copy the JSON into the appropriate file. On a re-deploy of the Helm Charts, Grafana picks up the new dashboards.
 |Status=No
 }}{{Section
@@ Line 94: / Line 104: @@
 |anchor=SampleImplementation
 |alignment=Vertical
-|structuredtext=Sample implementations of alerts can be found in the provided helm charts, in the <code>prometheus/alerts.yaml</code> file.
+|structuredtext=You can find sample implementations of alerts in the provided helm charts, in the <tt>prometheus/alerts.yaml</tt> file.
-Sample dashboards, embedded in config maps, can be found in the <code>templates\ges-dashboard-configmap.yaml</code> and <code>templates/ges-performance-dashboard-configmap.yaml</code> files. These are for the business logic and performance dashboards respectively. Some work might be needed in order to have the alerts and dashboards work.
+Sample dashboards, embedded in config maps, can be found in the <tt>grafana\ges-dashboard.yaml</tt> and <tt>grafana/ges-performance-dashboard.yaml</tt> files. These are for the business logic and performance dashboards respectively. You might need to make some adjustments to get the alerts and dashboards working; see Grafana dashboards.
 |Status=No
 }}
 |PEPageType=21ecf3f4-ef12-4276-8872-1e0e3af9561e
 }}
