Description
Overview
We're looking for experienced Grafana power users to design expert-level evaluation tasks that test whether AI agents can use Grafana the way a real professional does. Your domain expertise is what makes these tasks authentic.
What You'll Do
- Design realistic, multi-step Grafana workflows: dashboards, alerting rules, data source configuration, panel setup, and cross-module operations
- Perform each workflow yourself on a hosted Grafana instance to produce a reference trajectory
- Write clear, specific task prompts with measurable outcomes that can be verified programmatically
- Implement programmatic graders that check whether each instruction was completed correctly
- Review AI agent attempts at your tasks, identify where and why they fail, and tag root causes
- Calibrate task difficulty so tasks are challenging but solvable, iterating on prompts and constraints based on model performance
Requirements
- 2+ years of daily, professional Grafana experience (SRE, Platform Engineering, Observability, or similar)
- Deep familiarity with PromQL, dashboard templating, alerting pipelines, and data source configuration (Prometheus, InfluxDB, etc.)
- Ability to articulate workflows clearly enough for programmatic verification
- Comfort writing basic grading scripts (Python; engineering support provided as needed)
Nice to Have
- Experience with Grafana API automation
- Kubernetes/infrastructure monitoring background
- Familiarity with AI evaluation or benchmarking
Time Commitment
- 10-15 hrs/week minimum during the project
- Fast turnaround expected; responsiveness matters
Interested in this position?
Apply directly on the company's website