Topics

Browse posts by category and tag — every topic we cover, with the latest pieces under each.

Categories

ops 5 posts

Alerting for ML Model Drift: A Practical Setup

Drift alerting fails in one of two ways — it never fires, or it fires constantly until everyone mutes it. A concrete setup for alerts that fire when
LLM Cost & Latency Observability with OpenTelemetry

Token spend and tail latency are the two metrics that decide whether an LLM feature ships or gets killed. How to instrument both with OpenTelemetry so you
Closing the Eval-Prod Gap: Online Evaluation as Observability

Offline eval scores are green and production is worse. The gap is not a measurement error — it is structural. Here is how to instrument online evaluation
Embedding and Vector-Store Observability: The Unwatched Layer

RAG systems fail at the embedding and index layer long before the LLM does. Here is what to actually monitor: embedding drift, index staleness, recall
End-to-End Tracing for LLM Applications: What Belongs in a Span

Production LLM apps span multiple model calls, tool invocations, retrieval steps, and retries. A complete trace makes them debuggable; a sparse one

monitoring 2 posts

tooling 2 posts

MLOps 1 post

ML Model Monitoring Best Practices for Production Systems

A practitioner guide to the metrics, drift detection methods, alerting thresholds, and tooling that keep production ML reliable — without drowning your on-call in noise.