Go Generation (Sloth)

The following guide shows how to use go:generate to generate Sloth SLO specifications, as well as Prometheus alert groups, for a simple application's metrics.

Prerequisites

  • Go
  • Sloscribe
  • Sloth
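
If the tools are not already installed, one possible route is go install; this assumes both projects are installable as Go modules from the paths below, so check each project's README for the canonical instructions:

go install github.com/slosive/sloscribe@latest
go install github.com/slok/sloth/cmd/sloth@latest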

Generate a Sloth SLO specification using go:generate

The sample application structure is quite simple, see below: it is composed of main.go, containing the core application code, and metrics.go, containing the metrics defined by the application.

.
├── main.go
└── metrics.go

0 directories, 2 files

metrics.go

metrics.go defines two Prometheus counter metrics, tracking the total number of logins and the number of unsuccessful logins.

package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// @sloth.slo name chat-gpt-availability
	// @sloth.slo objective 95.0
	// @sloth.sli error_query sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[{{.window}}])) OR on() vector(0)
	// @sloth.sli total_query sum(rate(tenant_login_operations_total{client="chat-gpt"}[{{.window}}]))
	// @sloth.slo description 95% of logins to the chat-gpt app should be successful.
	// @sloth.alerting name ChatGPTAvailability
	metricTenantTotalLoginsCount = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: "chatgpt",
			Subsystem: "auth0",
			Name:      "tenant_login_operations_total",
		})
	metricTenantFailedLoginsCount = prometheus.NewCounter(
		prometheus.CounterOpts{
			Namespace: "chatgpt",
			Subsystem: "auth0",
			Name:      "tenant_failed_login_operations_total",
		})
)

metrics.go is also where we define the annotations required for the chat-gpt-availability SLO, which keeps track of how many users are able to log into the website.
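
Note that declaring the counters is not enough for them to be exposed; they also need to be registered. A minimal sketch, assuming the default Prometheus registry (this is not part of the original example):

// Register the counters with the default registry so the
// Prometheus client exposes them on the /metrics endpoint.
func init() {
	prometheus.MustRegister(
		metricTenantTotalLoginsCount,
		metricTenantFailedLoginsCount,
	)
}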

main.go

main.go is where we define the name of the Sloth service that owns the SLOs, via @sloth service chatgpt; it is also where we add the go:generate directive.

Note

The Sloth service name can be defined in metrics.go if it's not possible to use main.go, as sketched below.
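
For illustration, a sketch of that alternative; it assumes sloscribe reads the @sloth service annotation from a declaration comment in metrics.go, mirroring how it is attached to func main() below:

// metrics.go

package main

// @sloth service chatgpt
var (
	// metric declarations and @sloth.slo annotations as shown above
)

With the annotation kept in main.go, as in this guide, the file looks as follows: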

//go:generate sloscribe init --to-file

package main

// @sloth service chatgpt
func main() {
	// application code
}

Running go generate ./... in the terminal tells sloscribe to parse the project's directories for in-code annotations and generate the Sloth SLO specification at ./slo_definitions/chatgpt.yaml.

go generate ./...
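
Once the command completes, the generated specification sits alongside the source files:

.
├── main.go
├── metrics.go
└── slo_definitions
    └── chatgpt.yaml

1 directory, 3 files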

slo_definitions/chatgpt.yaml:

# Code generated by sloscribe: https://github.com/slosive/sloscribe.
# DO NOT EDIT.
version: prometheus/v1
service: chatgpt
slos:
  - name: chat-gpt-availability
    description: 95% of logins to the chat-gpt app should be successful.
    objective: 95
    sli:
      events:
        error_query: sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[{{.window}}])) OR on() vector(0)
        total_query: sum(rate(tenant_login_operations_total{client="chat-gpt"}[{{.window}}]))
    alerting:
      name: ChatGPTAvailability
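
Optionally, the specification can be checked before producing alert rules; Sloth ships a validate subcommand (treat the exact flags as an assumption and check sloth validate --help for your version):

sloth validate -i ./slo_definitions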

Generate Prometheus alert groups from the Sloth SLO specification

The Sloth SLO specification can be used to generate a Prometheus alert group file, rules.yml, which a Prometheus instance can use to monitor and alert on the SLOs.

sloth generate -i ./slo_definitions/chatgpt.yaml -o ./rules.yml

The resulting alert groups in rules.yml:

# Code generated by Sloth (v0.11.0): https://github.com/slok/sloth.
# DO NOT EDIT.

groups:
  - name: sloth-slo-sli-recordings-chatgpt-chat-gpt-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[5m])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[5m])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 5m
      - record: slo:sli_error:ratio_rate30m
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[30m])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[30m])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 30m
      - record: slo:sli_error:ratio_rate1h
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[1h])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[1h])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 1h
      - record: slo:sli_error:ratio_rate2h
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[2h])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[2h])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 2h
      - record: slo:sli_error:ratio_rate6h
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[6h])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[6h])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 6h
      - record: slo:sli_error:ratio_rate1d
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[1d])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[1d])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 1d
      - record: slo:sli_error:ratio_rate3d
        expr: |
          (sum(rate(tenant_failed_login_operations_total{client="chat-gpt"}[3d])) OR on() vector(0))
          /
          (sum(rate(tenant_login_operations_total{client="chat-gpt"}[3d])))
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 3d
      - record: slo:sli_error:ratio_rate30d
        expr: |
          sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}[30d])
          / ignoring (sloth_window)
          count_over_time(slo:sli_error:ratio_rate5m{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}[30d])
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_window: 30d
  - name: sloth-slo-meta-recordings-chatgpt-chat-gpt-availability
    rules:
      - record: slo:objective:ratio
        expr: vector(0.95)
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
      - record: slo:error_budget:ratio
        expr: vector(1-0.95)
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
      - record: slo:time_period:days
        expr: vector(30)
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
      - record: slo:current_burn_rate:ratio
        expr: |
          slo:sli_error:ratio_rate5m{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}
          / on(sloth_id, sloth_slo, sloth_service) group_left
          slo:error_budget:ratio{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
      - record: slo:period_burn_rate:ratio
        expr: |
          slo:sli_error:ratio_rate30d{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}
          / on(sloth_id, sloth_slo, sloth_service) group_left
          slo:error_budget:ratio{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
      - record: slo:period_error_budget_remaining:ratio
        expr: 1 - slo:period_burn_rate:ratio{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"}
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
      - record: sloth_slo_info
        expr: vector(1)
        labels:
          sloth_id: chatgpt-chat-gpt-availability
          sloth_mode: cli-gen-prom
          sloth_objective: "95"
          sloth_service: chatgpt
          sloth_slo: chat-gpt-availability
          sloth_spec: prometheus/v1
          sloth_version: v0.11.0
  - name: sloth-slo-alerts-chatgpt-chat-gpt-availability
    rules:
      - alert: ChatGPTAvailability
        expr: |
          (
              max(slo:sli_error:ratio_rate5m{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (14.4 * 0.05)) without (sloth_window)
              and
              max(slo:sli_error:ratio_rate1h{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (14.4 * 0.05)) without (sloth_window)
          )
          or
          (
              max(slo:sli_error:ratio_rate30m{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (6 * 0.05)) without (sloth_window)
              and
              max(slo:sli_error:ratio_rate6h{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (6 * 0.05)) without (sloth_window)
          )
        labels:
          sloth_severity: page
        annotations:
          summary: '{{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is over expected.'
          title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.
      - alert: ChatGPTAvailability
        expr: |
          (
              max(slo:sli_error:ratio_rate2h{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (3 * 0.05)) without (sloth_window)
              and
              max(slo:sli_error:ratio_rate1d{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (3 * 0.05)) without (sloth_window)
          )
          or
          (
              max(slo:sli_error:ratio_rate6h{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (1 * 0.05)) without (sloth_window)
              and
              max(slo:sli_error:ratio_rate3d{sloth_id="chatgpt-chat-gpt-availability", sloth_service="chatgpt", sloth_slo="chat-gpt-availability"} > (1 * 0.05)) without (sloth_window)
          )
        labels:
          sloth_severity: ticket
        annotations:
          summary: '{{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is over expected.'
          title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget burn rate is too fast.
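
The generated rules can be sanity-checked with promtool, the rule and configuration linter that ships with Prometheus releases:

promtool check rules rules.yml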

Add the Prometheus alert group to a Prometheus configuration

The rules.yml file from the previous step can then be referenced in the Prometheus instance configuration by adding its file name to the rule_files field.

# my global config
global:
  scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
  evaluation_interval: 5s # Evaluate rules every 5 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "exporter"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9301"]
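
To verify the full configuration and try it out locally, one option is promtool followed by the official Docker image; this sketch assumes the configuration above is saved as prometheus.yml next to rules.yml in the current directory:

promtool check config prometheus.yml

docker run -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$(pwd)/rules.yml:/etc/prometheus/rules.yml" \
  prom/prometheus

Relative paths in rule_files are resolved against the directory of the configuration file, so mounting both files into /etc/prometheus keeps the "rules.yml" reference working.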