Cake AI Ops for Fine-Tuned Models


The Cake AI platform supports a structured and traceable workflow for managing fine-tuned models in production environments. The process begins with the creation or adoption of a fine-tuned model, which may be a fully fine-tuned base model or a parameter-efficient LoRA variant. Fine-tuning in the Cake platform is covered separately.
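For context, a LoRA variant is typically shipped as a small adapter that is applied on top of its base model at load time. The sketch below illustrates this with the Hugging Face transformers and peft libraries; the base model name and adapter path are placeholders, not Cake-specific values.

```python
# Minimal sketch: loading a fine-tuned LoRA adapter on top of its base model.
# The base model ID and adapter path are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"    # placeholder base model
adapter_path = "./checkpoints/my-lora-adapter"  # placeholder adapter directory

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Apply the parameter-efficient LoRA weights; optionally merge them for serving.
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()  # produces a standalone model for inference engines
```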

These models are immediately logged into an experiment tracking system, such as a Cake-deployed MLflow instance, ensuring reproducibility, preserving lineage, and enabling systematic comparison of fine-tuning experiments. This registration serves as the foundation for version control and performance traceability throughout the rest of the model's lifecycle. Instructions for registering models in Cake are located in the following document:

Adding Experiments to MLflow
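As a rough sketch of what registration looks like, the snippet below logs a fine-tuning run and registers the resulting artifact in the MLflow model registry. The tracking URI, experiment name, parameters, and paths are placeholders; the document above covers the Cake-specific settings.

```python
# Minimal sketch: tracking a fine-tuning run and registering the model in MLflow.
# The tracking URI, experiment name, and paths are illustrative placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.cake.internal:5000")  # placeholder URI
mlflow.set_experiment("fine-tuning/my-model")

with mlflow.start_run() as run:
    # Record the knobs and results of the fine-tuning experiment.
    mlflow.log_params({"base_model": "llama-3.1-8b", "lora_rank": 16, "epochs": 3})
    mlflow.log_metric("eval_loss", 0.42)
    # Attach the fine-tuned weights to the run as artifacts.
    mlflow.log_artifacts("./checkpoints/my-lora-adapter", artifact_path="model")

# Create a named, versioned registry entry pointing at the run's artifacts,
# giving later stages of the lifecycle a stable reference.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "my-fine-tuned-model")
```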

Following experiment tracking, the fine-tuned model can be deployed into an inference engine. One option is a Cake-deployed KubeRay cluster hosting a vLLM backend, a performance-optimized model server built primarily for large language models. KubeRay extends vLLM to make multi-node deployment straightforward. Instructions for using KubeRay are here:

Deploying a fine-tuned model across multiple nodes with KubeRay and vLLM
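To illustrate the general shape of such a deployment, the sketch below uses Ray Serve's LLM APIs to serve a model with the vLLM engine behind an OpenAI-compatible endpoint. It assumes a recent Ray release that ships the ray.serve.llm module and would run on the KubeRay cluster; the model name, artifact location, and parallelism settings are placeholders, and the linked document covers the Cake-specific details.

```python
# Minimal sketch (assumes a recent Ray release with the ray.serve.llm module):
# serve a fine-tuned model with the vLLM engine behind an OpenAI-compatible API.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-fine-tuned-model",                   # name clients will request
        model_source="s3://models/my-fine-tuned-model",   # placeholder artifact location
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    # Spread the model across GPUs/nodes; tune to the cluster's hardware.
    engine_kwargs=dict(tensor_parallel_size=2),
)

# Build and run the OpenAI-compatible app; KubeRay handles multi-node placement.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```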

Once deployed, the model can be proxied through LiteLLM, a thin proxy layer that abstracts the various Cake-hosted and externally hosted model APIs. LiteLLM also provides access control, usage routing, user tracing, and integration with various MCP tools. Instructions for registering models with LiteLLM are here:

Add a deployed model to LiteLLM
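As a rough sketch, the LiteLLM proxy exposes admin endpoints for adding models at runtime; the call below registers the deployment's OpenAI-compatible endpoint under a friendly model name. The URLs, keys, and admin route shown are assumptions for illustration only; follow the linked document for the Cake-specific procedure.

```python
# Minimal sketch (assumed LiteLLM admin route and payload shape):
# register the deployed model's OpenAI-compatible endpoint with the LiteLLM proxy.
import requests

LITELLM_URL = "http://litellm.cake.internal:4000"  # placeholder proxy address
ADMIN_KEY = "sk-admin-placeholder"                 # placeholder master key

payload = {
    "model_name": "my-fine-tuned-model",  # name callers will request
    "litellm_params": {
        # Route requests to the Ray/vLLM OpenAI-compatible endpoint.
        "model": "openai/my-fine-tuned-model",
        "api_base": "http://ray-serve.cake.internal:8000/v1",
        "api_key": "none",
    },
}

resp = requests.post(
    f"{LITELLM_URL}/model/new",
    json=payload,
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```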

Once proxied, a deployed model can be used in multiple ways. Chat UIs like Open WebUI can simply call the proxied model. Alternatively, the proxied model can be called from complex agentic workflows: through orchestration frameworks like LangFlow, as an MCP endpoint, or via agent APIs like LangGraph or A2A. These integrations enable use-case-specific coordination and functionality, such as multi-agent reasoning or dynamic memory retrieval via Cake-hosted vector databases like PgVector, Weaviate, and Milvus. Using LangFlow, MCP in Cake, and the various vector stores is discussed separately. Documentation on calling LiteLLM-proxied models via Open WebUI is here:

Connecting to LiteLLM proxied models via Open WebUI
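Because LiteLLM exposes an OpenAI-compatible interface, any client or framework that speaks that API can call the proxied model. A minimal sketch using the openai Python client is shown below; the proxy URL and virtual key are placeholders.

```python
# Minimal sketch: calling the LiteLLM-proxied model with the OpenAI client.
# The base URL and virtual key are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm.cake.internal:4000",  # LiteLLM proxy endpoint
    api_key="sk-virtual-key-placeholder",          # key issued by LiteLLM
)

response = client.chat.completions.create(
    model="my-fine-tuned-model",  # the name registered with the proxy
    messages=[{"role": "user", "content": "Summarize our deployment workflow."}],
)
print(response.choices[0].message.content)
```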

Operational observability and reliability are maintained through a robust monitoring stack. The Cake platform integrates LLM observability and tracing through tools like Langfuse, which allow developers to inspect request-level telemetry and debug individual model interactions. Langfuse also helps developers understand which steps in a complex agentic workflow succeeded, and it vastly simplifies debugging the steps that did not. Tracing a LiteLLM-proxied model with Langfuse is discussed here:

Tracing Calls to LiteLLM with Langfuse
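As a minimal sketch of the SDK-side pattern, the snippet below points LiteLLM's success and failure callbacks at Langfuse so that each completion call is traced; when tracing the central proxy instead, the equivalent callback is set in the proxy configuration, as the linked document describes. The Langfuse host, keys, and proxy address are placeholders.

```python
# Minimal sketch: trace LiteLLM completion calls with Langfuse.
# The Langfuse host, keys, and proxy address are illustrative placeholders.
import os
import litellm

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-placeholder"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-placeholder"
os.environ["LANGFUSE_HOST"] = "http://langfuse.cake.internal:3000"

# Send successful and failed calls to Langfuse for request-level telemetry.
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

response = litellm.completion(
    model="openai/my-fine-tuned-model",
    api_base="http://litellm.cake.internal:4000",  # route through the LiteLLM proxy
    api_key="sk-virtual-key-placeholder",
    messages=[{"role": "user", "content": "Hello from the tracing example."}],
)
print(response.choices[0].message.content)
```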

Metrics are collected using Prometheus and visualized in Grafana, where performance indicators such as latency, token usage, and throughput are displayed. Grafana Alert Manager monitors these metrics to generate alerts on anomalous behavior or degraded performance. Prometheus and Grafana are discussed here:

Monitoring your Ray Deployed Models with Prometheus and Grafana
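For a sense of how those metrics can also be consumed programmatically (alongside Grafana dashboards), the sketch below queries the Prometheus HTTP API for a latency indicator. The Prometheus address and the metric name are assumptions; the metric names actually exposed by the serving stack are covered in the document above.

```python
# Minimal sketch: pull a latency indicator from the Prometheus HTTP API.
# The Prometheus URL and metric name are illustrative assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.cake.internal:9090"  # placeholder address

# Example PromQL: p95 end-to-end request latency over the last 5 minutes.
query = (
    "histogram_quantile(0.95, "
    "sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))"
)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```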

Together with LLM evaluation and tracing, this observability stack forms the backbone of Cake's AI Ops system, enabling proactive issue detection and continuous model performance optimization.