Prometheus at Prezi: replacing 10 years of anti-patterns

What happens when you combine 10 years of organic architectural growth with a lack of common monitoring best practices? Bad on-call shifts, more frequent and longer outages, and a lack of technical flexibility.
In our effort towards uniformity and reliability, we redesigned our entire monitoring stack and built a new platform based on Prometheus. This talk, presented at PromCon EU 2019, recounts our journey from the old to the new.
After identifying our main pain points, we cover how we sold the idea of Prometheus to our organization, a task that is often underestimated. We then show how a data-driven approach let us validate that we had solved those issues. We explain the main components of the new platform, as well as how we changed our monitoring and alerting culture as a whole. Finally, we conclude by looking back at our achievements and at how the new platform enables us to move towards Kubernetes faster and more safely.