Introduction
NannyML is an open-source Python library built for post-deployment ML monitoring, with a focus on detecting data drift, concept drift, and performance degradation without access to ground truth labels.
NannyML is a key component in the model observability layer, enabling teams to monitor predictions, inputs, and pipeline behavior over time, even in cases where labels are delayed or unavailable. It bridges the gap between model training and model trust in real-world settings, helping teams detect issues early, understand root causes, and iterate safely.
Key Benefits of Using NannyML include:
Post-Deployment Performance Estimation: Estimates metrics like accuracy or F1-score without needing real-time ground truth, using Confidence-Based Performance Estimation (CBPE) for classification and Direct Loss Estimation (DLE) for regression (see the sketch after this list).
Data and Concept Drift Detection: Identifies changes in input feature distributions and prediction patterns, helping detect model degradation before it impacts users.
Feature Attribution for Drift: Pinpoints which input features have drifted most—crucial for debugging upstream data issues or shifts in user behavior.
Visual Monitoring Dashboards: Offers clear, actionable visualizations of drift, performance estimation, and pipeline behavior over time.
Python-Native and Pipeline-Ready: Easily integrates into existing ML pipelines built on Airflow, Kubeflow, or Prefect, and complements experiment-tracking and observability tools like MLflow or Arize Phoenix.
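Performance estimation follows a two-step fit/estimate pattern: fit on a reference period where outcomes are known, then estimate on production data where they are not. Below is a minimal sketch using NannyML's CBPE estimator; the synthetic car-loan dataset loader and its column names ('y_pred_proba', 'y_pred', 'repaid', 'timestamp') come from NannyML's example data and stand in for your own reference and analysis DataFrames.

```python
# Minimal sketch: label-free performance estimation with CBPE.
# The synthetic car-loan dataset and its column names are stand-ins;
# point the estimator at your own reference/analysis DataFrames in practice.
import nannyml as nml

# reference: a period with known outcomes used to calibrate the estimator;
# analysis: the production period to monitor (labels not required).
reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

estimator = nml.CBPE(
    problem_type='classification_binary',
    y_pred_proba='y_pred_proba',
    y_pred='y_pred',
    y_true='repaid',
    timestamp_column_name='timestamp',
    metrics=['roc_auc', 'f1'],
    chunk_size=5000,
)

estimator.fit(reference_df)                  # calibrate on labeled reference data
results = estimator.estimate(analysis_df)    # estimate performance without labels

results.plot().show()                        # estimated metrics per chunk, with alert thresholds
```

Estimates are produced per chunk (here, 5,000 rows at a time), so degradation shows up as a trend over time rather than a single number.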
Use Cases
NannyML is used to monitor:
User classification models for churn prediction, segmentation, or intent inference, especially where labels arrive late or are sparse (see the feature-drift sketch after this list).
Retrieval and reranking models in RAG pipelines, watching for changes in query distributions or embedding-space behavior.
Embedding generation workflows, identifying shifts in semantic inputs that may affect downstream performance or retrievability.
Time series and forecasting models, where seasonality or external events can drive sudden drift.
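For tabular use cases like the churn example above, a univariate drift check on the input features is usually the first monitoring step. The sketch below uses NannyML's UnivariateDriftCalculator; the feature names again come from the synthetic car-loan dataset, and the choice of Jensen-Shannon distance for continuous columns and chi-squared for categorical ones is one reasonable configuration, not the only one.

```python
# Minimal sketch: univariate drift detection on input features.
# Feature names follow NannyML's synthetic car-loan example; replace them
# with the input columns of your own model.
import nannyml as nml

reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

features = ['car_value', 'debt_to_income_ratio', 'salary_range', 'loan_length']

calculator = nml.UnivariateDriftCalculator(
    column_names=features,
    timestamp_column_name='timestamp',
    continuous_methods=['jensen_shannon'],   # applied to numeric columns
    categorical_methods=['chi2'],            # applied to categorical columns
    chunk_size=5000,
)

calculator.fit(reference_df)                 # learn baseline distributions per feature
results = calculator.calculate(analysis_df)  # drift statistics and alert flags per chunk

print(results.to_df().head())                # flat DataFrame: one drift score per feature per chunk
results.plot(kind='drift').show()
```

Because the results are broken down per feature, the same output doubles as the feature-level drift attribution described earlier: the columns with the largest or most frequently alerting drift scores are the first place to look for upstream data issues.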
NannyML integrates with data stacks through batch jobs, automated alerts, and model observability dashboards built with Superset, Grafana, or Metabase. It complements other evaluation tools like DeepChecks, Great Expectations, and TrustCall in a layered MLOps strategy. By adopting NannyML, teams can ensure that their models remain robust, monitored, and explainable in production environments—enabling faster issue detection, safer iteration, and greater trust in ML-powered decisions.
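As an illustration of how such a batch check could be wired into a scheduler, the sketch below wraps an already fitted NannyML calculator in a function that an Airflow or Prefect task might call on each new batch of data. The scan for 'alert' columns assumes the flattened layout of results.to_df(), and notify is a hypothetical placeholder for whatever alerting channel the team uses, not a NannyML API.

```python
# Hypothetical batch-job wrapper around a fitted NannyML calculator.
# notify() is a placeholder for your alerting channel; the 'alert' column
# scan assumes the flattened column layout of results.to_df().
import pandas as pd


def notify(message: str) -> None:
    # Placeholder: wire this to Slack, PagerDuty, email, etc.
    print(f"[ALERT] {message}")


def run_drift_check(calculator, latest_batch: pd.DataFrame) -> pd.DataFrame:
    """Run a fitted NannyML calculator on the newest batch and raise an alert
    if any chunk breaches its drift threshold."""
    results_df = calculator.calculate(latest_batch).to_df()

    # NannyML flags threshold breaches in boolean 'alert' columns.
    alert_cols = [c for c in results_df.columns if 'alert' in str(c).lower()]
    breached = results_df[alert_cols].any(axis=1)

    if breached.any():
        notify(f"Drift detected in {int(breached.sum())} of {len(results_df)} chunks.")

    return results_df
```

The returned DataFrame can then be written to the warehouse backing a Superset, Grafana, or Metabase dashboard, keeping the alerting and visualization layers decoupled.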