
"Why isn't my data ready?" — every on-call engineer knows this moment: jumping between dashboards, losing precious minutes while incidents escalate.
We built a unified telemetry lakehouse using OpenTelemetry and Apache Iceberg, cutting investigation time from 30+ minutes to ~8 seconds per query. But the real story is the collaboration model behind it — how platform engineers and domain experts co-designed a three-layer anomaly detection system that catches silent drift and cascading failures before they page anyone, and how we closed the loop with automated ticketing and remediation.
We'll share the bridge pattern for writing telemetry into open table formats, the correlation key model that joins signals across teams, and the cultural shifts that made cross-functional ownership actually work.
聽眾收穫:
Platform engineers, SREs, and data engineers who are tired of context-switching across disconnected observability tools. Attendees will walk away with:
All architecture, code, and runbooks are open-source and reproducible.
中階
中文