DAS metrics and alerts
Find the metrics DAS exposes and the alerts defined for DAS.
Service | CRD or annotations? | Port | Endpoint/Selector | Metrics update interval |
---|---|---|---|---|
DAS | ServiceMonitor | 8081 | selector: matchLabels: {{- include "das.serviceSelectorLabels" . \| nindent 6 }} Path: | 10 seconds |
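
To make the row above concrete, here is a minimal sketch of what a rendered ServiceMonitor for DAS might look like. It is an illustration under stated assumptions, not the manifest shipped with the DAS chart: the resource name and the `app: das` label are hypothetical stand-ins for whatever the `das.serviceSelectorLabels` template renders, and the scrape path is left unset because the source does not state it.

```yaml
# Illustrative sketch only; not the manifest shipped with the DAS Helm chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: das-example            # hypothetical name
spec:
  selector:
    matchLabels:
      app: das                 # hypothetical label; the chart renders das.serviceSelectorLabels here
  endpoints:
    - targetPort: 8081         # port from the table above (newer operator versions prefer a named `port`)
      interval: 10s            # metrics update interval from the table above
      # path is intentionally omitted: the source does not specify it
```

With an interval of 10 seconds, Prometheus collects roughly six samples per minute from each DAS pod, which is enough resolution for the 60s and 180s alert intervals listed later on this page.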
Metrics
The DAS service exposes the following metrics, among others:
Metric and description | Metric details | Indicator of |
---|---|---|
sdr_ Number of requests received since DAS started (provided for each CCID). | Unit: Type: Counter | |
sdr_ Number of requests rejected since DAS started (provided for each CCID). | Unit: Type: Counter | |
data_ Number of failed data table requests since DAS started (provided for each CCID). | Unit: Type: Counter | |
data_ Data table request latency in seconds since DAS started (provided for each CCID). | Unit: seconds Type: Histogram | |
business_ Number of failed business hours requests since DAS started. | Unit: Type: Counter | |
business_ Business hours request latency in seconds since DAS started (provided for each CCID). | Unit: seconds Type: Histogram | |
special_ Number of failed special days requests since DAS started. | Unit: Type: Counter | |
special_ Special days request latency in seconds since DAS started (provided for each CCID). | Unit: seconds Type: Histogram | |
external_ Number of failed external requests since DAS started. | Unit: Type: Counter | |
external_ Number of timed-out external requests since DAS started. | Unit: Type: Counter | |
external_ External request latency in seconds since DAS started. | Unit: seconds Type: Histogram | |
das_ HTTP request latency in seconds (provided for each request type and CCID). | Unit: seconds Type: Histogram | |
das_ Number of HTTP requests (provided for each request type and CCID). | Unit: Type: Counter | |
nginx_ Number of nginx-lua-prometheus errors. | Unit: Type: Counter | |
Alerts
The following alerts are defined for DAS.
Alert | Severity | Description | Based on | Threshold |
---|---|---|---|---|
CPUUtilization (Alarm: Pod CPU Usage) | CRITICAL | Triggered when a pod's CPU utilization exceeds the threshold. | | 75% (default interval: 180s) |
MemoryUtilization (Alarm: Pod Memory Usage) | CRITICAL | Triggered when a pod's memory utilization exceeds the threshold. | | 75% (default interval: 180s) |
containerRestartAlert (Alarm: Pod Restarts Count) | CRITICAL | Triggered when a pod's restart count exceeds the threshold. | | 5 (default interval: 180s) |
containerReadyAlert (Alarm: Pod Ready Count) | CRITICAL | Triggered when a pod's ready count is less than the threshold (1). | | 1 (default interval: 60s) |
AbsentAlert (Alarm: Deployment availability) | CRITICAL | Triggered when DAS pod metrics are unavailable. | | 1 (default interval: 60s) |
WorkspaceUtilization (Alarm: Azure Fileshare PVC Usage) | HIGH | Triggered when file share usage is greater than the threshold. | | 80% (default interval: 180s) |
Health (Alarm: Health Status) | CRITICAL | Triggered when DAS health status is 0. | | 0 (default interval: 60s) |
WorkspaceHealth (Alarm: Workspace Health Status) | CRITICAL | Triggered when DAS is not able to communicate with the workspace. | | 0 (default interval: 60s) |
PHPHealth (Alarm: PHP Health Status) | CRITICAL | Triggered when Designer/DAS experiences a PHP health check failure. | | 0 (default interval: 60s) |
ProxyHealth (Alarm: Proxy Health Status) | CRITICAL | Triggered when Designer/DAS experiences a proxy health check failure. | | 0 (default interval: 60s) |
HTTP5XXCount (Alarm: Application 5XX Error) | HIGH | Triggered when DAS exceeds the allowed 5xx error count threshold specified here. | | 10 (default interval: 180s) |
HTTP4XXCount (Alarm: Application 4XX Error) | HIGH | Triggered when DAS exceeds the allowed 4xx error count threshold specified here. | | 100 (default interval: 180s) |
PhpLatency (Alarm: DAS PHP Latency Alert) | HIGH | Triggered when the average time taken by a PHP request is greater than the threshold (in seconds) specified here. | | 10s (default interval: 180s) |
HTTPLatency (Alarm: DAS HTTP Latency Alert) | HIGH | Triggered when the average time taken by an HTTP request is greater than the threshold (in seconds) specified here. | | 10s (default interval: 180s) |
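
As an illustration of how an average-latency alert such as HTTPLatency could be expressed, the sketch below divides the rate of a histogram's `_sum` series by the rate of its `_count` series and compares the result with the 10-second threshold. This is a sketch under assumptions, not the rule shipped with DAS: the metric name `example_latency_seconds` is a hypothetical placeholder (the real DAS metric names are not reproduced here), the resource and group names are invented, and treating the 180s default interval as the rate window is an interpretation rather than something the source confirms.

```yaml
# Illustrative sketch only; not the alert rule shipped with DAS.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: das-latency-example              # hypothetical name
spec:
  groups:
    - name: das-example.rules            # hypothetical group name
      rules:
        - alert: HTTPLatency
          # Average latency over the window = total seconds spent / number of requests.
          # example_latency_seconds is a placeholder for a DAS latency histogram.
          expr: |
            sum(rate(example_latency_seconds_sum[180s]))
              /
            sum(rate(example_latency_seconds_count[180s]))
              > 10
          labels:
            severity: HIGH
          annotations:
            summary: Average HTTP request latency is above 10 seconds.
```

Dividing the two rates gives the mean latency over the window rather than a percentile, which matches the table's wording ("average time taken"). If no requests arrive in the window, the expression returns no sample and the alert does not fire.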