Monitoring and Observability with Prometheus and Grafana 

Course & Training

Learn hands-on how to monitor, analyze, and visualize applications and systems with Prometheus and Grafana, including alerting, SLOs, and Kubernetes best practices.

A two-day intensive course on monitoring and observability of applications with Prometheus and on visualizing metrics with Grafana. Participants learn to install, configure, and use Prometheus effectively to monitor applications, and to build meaningful Grafana dashboards in a Kubernetes environment that help keep systems stable and performant. We cover **service discovery, recording rules, PromQL, alerting (Alertmanager & Grafana Alerting), SLO/SLA tracking, histograms/exemplars**, plus **HA & long-term storage** (Thanos/Cortex/Mimir) and **security, cost, and retention**.

In-House Course:

We are happy to run tailored courses for your team: on-site, remotely, or in our training rooms.

Request In-House Course

Content:


Prometheus is a powerful open-source monitoring and alerting system purpose-built for modern, distributed, and containerized applications. In this course, we show how to use Prometheus with Grafana effectively to monitor application health, identify performance issues, and ensure software quality.

**Course topics (hands-on focus):**

- **Intro & Architecture**
  - Data model (labels/series), pull model, TSDB
  - Components: Prometheus, exporters, Alertmanager, Pushgateway (when to use)
  - Deployment options: standalone, Prometheus Operator, kube-prometheus-stack (Helm)

- **Install & Configure**
  - Prometheus on Kubernetes (Helm/Operator) and Docker/Compose
  - **Service discovery** (Kubernetes, EC2, Consul) and **relabeling** patterns
  - Scrape config, jobs/targets, multi-cluster/namespace layouts

- **Instrumentation & Exporters**
  - App instrumentation best practices (counter/gauge/histogram/summary)
  - **Histograms & exemplars** for latency and trace correlation
  - Key exporters: **node_exporter**, **kube-state-metrics**, **cAdvisor**, **blackbox_exporter**, DB exporters
  - OpenTelemetry bridge (OTel Collector → Prometheus)

- **PromQL & Recording Rules**
  - Query basics, label matching, joins
  - **rate/irate**, histogram quantiles, Apdex/latency buckets
  - **Recording rules** & groups for performance and reuse
  - **SLO/SLA** metrics: error budget, availability & latency

- **Grafana Dashboards**
  - Data source config, time ranges, transformations
  - Dashboard design, panels, variables, library panels
  - **Exemplars** in Grafana, drill-downs, annotations
  - **Best practices** for SRE, infra & app monitoring

- **Alerting**
  - Prometheus alert rules, templating, severity design
  - **Alertmanager**: routing, inhibition, silence, receivers
  - **Grafana Alerting**: when to use; harmonizing with Alertmanager
  - **Runbooks** & annotations: from alert to action

- **Operations, Scale & Reliability**
  - Retention & TSDB tuning, WAL/compaction, capacity
  - **High availability**: sharding/HA pairs, **Thanos/Cortex/Mimir** for LTS & global query
  - Federation vs. remote write/read, multi-cluster strategies
  - Self-monitoring; watchdog alerts

- **Security & Compliance**
  - TLS, authN/Z (reverse proxy, OAuth proxy), network scoping
  - Multi-tenancy (Mimir/Cortex), tenant isolation via labels/namespaces
  - PII/Compliance: what not to put into metrics

- **Cost Control & Cardinality**
  - Detecting label explosion, cardinality checks
  - Metric hygiene: naming, labeling, intervals, downsampling/recording
  - Storage cost vs. resolution vs. retention: guardrails

- **Troubleshooting & Patterns**
  - Debugging slow queries, PromQL optimization
  - Exporter/target issues, scrape errors, stale series
  - Incident dashboards (Golden Signals, RED/USE)
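
To give a feel for the **SLO/SLA** topic listed above, here is a minimal sketch of the error-budget arithmetic for a request-based availability SLO. The function name `error_budget_report`, the 99.9% target, and the request counts are illustrative assumptions and not course material; in practice the two counts would typically come from PromQL queries over a chosen SLO window.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute availability and error-budget usage for one SLO window.

    Hypothetical helper for illustration only.
    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The error budget is the share of requests that is allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests
    budget_consumed = failed_requests / allowed_failures

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,        # 1.0 means the budget is used up
        "budget_remaining": 1.0 - budget_consumed,  # negative once the SLO is violated
    }


if __name__ == "__main__":
    # Illustrative numbers: 99.9% target, 1,000,000 requests, 250 failures.
    report = error_budget_report(0.999, 1_000_000, 250)
    for key, value in report.items():
        print(f"{key}: {value:.4f}")
```

With these example numbers, roughly a quarter of the error budget is consumed. The same ratio can be tracked continuously with recording rules instead of being computed ad hoc.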

**Hands-on labs (examples):**
- Lab 1: Helm deploy (kube-prometheus-stack), access & security
- Lab 2: Service discovery & relabeling — scrape only what matters
- Lab 3: PromQL drills (rates, histograms, joins, quantiles)
- Lab 4: Recording rules for SLOs + SLI dashboards in Grafana
- Lab 5: Alerting setup (rules + Alertmanager routing), runbook linking
- Lab 6: Blackbox checks (HTTP/TCP/ICMP) + incident dashboard
- Lab 7: Retention/cardinality tuning, self-monitoring & watchdog
- Lab 8: Thanos for long-term storage & HA querying

Scenarios and hands-on labs are based on Kubernetes and containerized applications.


Disclaimer: The actual course content may vary from the above, depending on the trainer, delivery format, duration, and mix of participants.

Whether we call it a training, course, workshop, or seminar, we want to meet participants where they are and give them the practical knowledge they need to apply the technology right after the training and to deepen it on their own.

Goal:

After the course, participants can use Prometheus and Grafana as monitoring and alerting systems in their projects: configure **service discovery & relabeling**, write **PromQL** confidently, define **recording rules & SLOs**, operate **alerts with Alertmanager/Grafana**, and plan **operations/scale** (HA, retention, cost, cardinality).


Form:

The course combines short input sessions, guided **live demos**, and practical **hands-on labs** in a Kubernetes cluster (Helm/Operator). We emphasize **realistic scenarios**, clear patterns, and directly applicable best practices.


Target Audience:

Software developers, DevOps/Platform engineers, SREs, and system administrators who want to monitor apps and infrastructure efficiently, establish **SLOs**, speed up **incident response**, and operate **Kubernetes**-based monitoring stacks professionally.


Requirements:

Basic Linux/CLI skills, foundational knowledge of containers/Kubernetes and web applications. Helpful: some exposure to metrics/logs and YAML/Helm.


Preparation:

Each participant receives a questionnaire and an installation guide. We provide a lab environment (Kubernetes cluster, **kube-prometheus-stack**, sample services). Optional: bring your own cloud access. Prerequisites are verified in advance.

Request In-House Course:

Request In-House Course

Waiting list for public courses:

Sign up for the waiting list to hear about further public course dates. Once enough people are on the waiting list, we will pick a date that suits as many of them as possible and schedule a new session. If you are signing up together with two colleagues, we can even plan a public course specifically for you.

Waiting List Request

(If you already have 3 or more participants, we will discuss your preferred date directly with you and announce the course.)

More about Prometheus & Grafana


Prometheus uses a dimensional data model (labels) and pull-based scraping with built-in **service discovery**. **PromQL** enables flexible queries, **recording rules** precompute frequently used or expensive expressions so that common queries run faster, and **Alertmanager** handles notifications. **Grafana** provides visualization and SRE-style dashboards and supports **exemplars**. For **long-term storage & HA**, systems like **Thanos/Cortex/Mimir** are commonly used.
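
To make the dimensional data model more concrete, the following is a minimal, standard-library-only sketch: each time series is identified by a metric name plus its label set, and a scrape target exposes the current values as plain text lines of the form `name{label="value"} value`. The `CounterRegistry` class is a hypothetical toy written for this illustration, not the API of any official client library. It skips details such as label-value escaping and `# HELP`/`# TYPE` lines.

```python
from typing import Dict, Tuple

# A series is identified by its metric name plus its complete, sorted label set.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class CounterRegistry:
    """Toy in-memory counter store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._values: Dict[SeriesKey, float] = {}

    def inc(self, name: str, labels: Dict[str, str], amount: float = 1.0) -> None:
        # Sorting the labels makes {a, b} and {b, a} map to the same series.
        key = (name, tuple(sorted(labels.items())))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        # Emit one text line per series, as a scrape target would expose it.
        lines = []
        for (name, labels), value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    registry = CounterRegistry()
    registry.inc("http_requests_total", {"method": "GET", "status": "200"})
    registry.inc("http_requests_total", {"status": "200", "method": "GET"})
    registry.inc("http_requests_total", {"method": "POST", "status": "500"})
    print(registry.render(), end="")
```

Every distinct label combination creates its own series, which is why the number of distinct label values (cardinality) has a direct effect on storage and query cost.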




History


Prometheus started at SoundCloud in 2012 and joined the CNCF in 2016 as its second hosted project, after Kubernetes. In combination with Grafana, it has become the de facto standard for metrics-based monitoring and SRE-led observability. The ecosystem (Operator, Thanos/Mimir/Cortex, OpenTelemetry integration) keeps evolving.