What I've learned about AI evals, RAG, and MCP, in my own words.
Before we ship any LLM-powered feature, I want to know that it’s correct, safe, and fast enough for the user. That’s where evals come in. To me, evals are the set of checks we run on model outputs so we’re not guessing; we’re deciding with data.
I think about the pipeline in four stages: the prompt (what we send in), the model (what we call), the output (what we get back), and then the evals we run on that output. Evals aren’t a single number; they’re a mix of what matters for the product. For a recommendation engine, that might be relevance and diversity. For a chat feature, it could be helpfulness, safety, and whether the answer stays on topic. For anything user-facing, latency and consistency matter too. So I treat “does it respond in under X ms?” and “does it behave similarly for similar inputs?” as part of the eval story.
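To make the four stages concrete for myself, here’s a toy sketch in Python. The fake model, the word-overlap “relevance” score, and the 800 ms budget are all placeholders I made up for illustration, not real product metrics.

```python
import time

def fake_model(prompt: str) -> str:
    # Stand-in for the real LLM call (the "model" stage).
    return "You can request a refund within 30 days from the billing page."

def relevance(prompt: str, output: str) -> float:
    # Crude word-overlap score; a real eval might use a rubric or a judge model.
    p, o = set(prompt.lower().split()), set(output.lower().split())
    return len(p & o) / max(len(p), 1)

def evaluate(prompt: str) -> dict:
    start = time.perf_counter()
    output = fake_model(prompt)                      # prompt in, output back
    latency_ms = (time.perf_counter() - start) * 1000
    return {                                         # the eval stage: a mix of checks, not one number
        "relevance": relevance(prompt, output),
        "non_empty": bool(output.strip()),
        "latency_ok": latency_ms < 800,              # "does it respond in under X ms?"
    }

print(evaluate("How do I get a refund?"))
```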
I also care about who defines the eval. Product and eng need to agree on what “good” means before we scale. That means choosing a small set of golden examples or rubrics, running the model on them, and scoring. If we’re comparing two models or two prompts, evals tell us which one is better on our criteria, not just on generic benchmarks. I’ve learned to keep eval sets focused and representative of real use cases, and to revisit them when we change the product or the user.
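This is roughly what comparing two prompt variants on a small golden set looks like to me. The examples, both prompt templates, and the keyword-based stand-in for the model are invented; the point is that “which is better on our criteria” becomes a number per variant, not a vibe.

```python
# Tiny golden set agreed between product and eng (illustrative).
GOLDEN = [
    {"input": "Reset my password", "expected_topic": "account"},
    {"input": "My invoice is wrong", "expected_topic": "billing"},
]

PROMPT_A = "Classify the support request into a topic: {input}"
PROMPT_B = "You are a support triage bot. Reply with one topic word for: {input}"

def classify(prompt_template: str, text: str) -> str:
    # Stand-in for calling the model with the formatted prompt.
    return "account" if "password" in text.lower() else "billing"

def score(prompt_template: str) -> float:
    hits = sum(
        classify(prompt_template, ex["input"]) == ex["expected_topic"]
        for ex in GOLDEN
    )
    return hits / len(GOLDEN)

print({"prompt_a": score(PROMPT_A), "prompt_b": score(PROMPT_B)})
```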
In practice, I see evals as the gate before launch and the signal in production. We don’t ship until key eval metrics meet our bar, and we keep monitoring them after launch so we can catch regressions or drift. That’s how I connect “did we build the right thing?” to “is it still working for users?”
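The “gate” part can be as simple as a threshold check that blocks a release. This is a sketch with made-up metric names and numbers; in practice it would run in CI before launch and again on sampled production traffic to catch regressions or drift.

```python
import sys

# Hypothetical bar agreed with product; names and numbers are illustrative.
THRESHOLDS = {"relevance": 0.80, "safety": 0.99, "latency_ok_rate": 0.95}

def gate(metrics: dict) -> bool:
    failures = {k: round(v, 3) for k, v in metrics.items() if v < THRESHOLDS.get(k, 0.0)}
    if failures:
        print(f"Eval gate failed, holding the release: {failures}")
        return False
    print("Eval gate passed.")
    return True

# In CI this would read real eval results; these numbers are placeholders.
if not gate({"relevance": 0.86, "safety": 0.995, "latency_ok_rate": 0.97}):
    sys.exit(1)
```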
LLMs are powerful on their own, but they’re far more useful when they can read your data and use your tools. That’s the idea behind RAG and MCP, two concepts I keep in mind when we design AI features.
RAG (retrieval-augmented generation) is how I think about “grounding” the model. Instead of relying only on what’s in the weights, we retrieve relevant chunks from a knowledge base (docs, tickets, internal wikis) and pass them in as context. The model then generates an answer that’s informed by that context. For me, the product implications are clear: we need a good retrieval step (what to index, how to rank), a clear way to cite or show sources, and a fallback when retrieval returns nothing useful. I see RAG as the default pattern when the answer depends on data that changes or that we don’t want to bake into training.
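Here’s a minimal RAG sketch of that retrieve-then-generate flow, including the fallback path. The in-memory “knowledge base,” the overlap-based ranking, and the prompt format are placeholders; a real system would use embeddings, a vector index, and proper source citation.

```python
KNOWLEDGE_BASE = [
    {"id": "kb-1", "text": "Refunds are available within 30 days of purchase."},
    {"id": "kb-2", "text": "Enterprise plans include SSO and audit logs."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    # Rank chunks by naive word overlap with the query; keep the top k.
    q = set(query.lower().split())
    scored = [(len(q & set(d["text"].lower().split())), d) for d in KNOWLEDGE_BASE]
    return [d for score, d in sorted(scored, key=lambda x: -x[0]) if score > 0][:k]

def build_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return f"Answer using only the sources below and cite the [id].\n{context}\n\nQuestion: {query}"

def answer(query: str) -> str:
    chunks = retrieve(query)
    if not chunks:                      # fallback when retrieval returns nothing useful
        return "I couldn't find this in our docs; routing to a human."
    return build_prompt(query, chunks)  # in a real system, this prompt goes to the model

print(answer("How long do I have to get a refund?"))
```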
MCP (Model Context Protocol) is how I think about giving the model access to tools and live data. It’s a protocol that lets an LLM call out to servers that expose actions and resources: databases, APIs, file systems, third-party apps. So the model isn’t just generating text; it can run a query, fetch a file, or trigger a workflow. From a product perspective, that means we can ship assistants that actually do things: summarize your inbox, query your DB, or control your dev environment. I see MCP as the plumbing that makes “AI + tools” standard and composable, rather than every product building its own integration layer.
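For a feel of what exposing a tool looks like, here’s a sketch based on the FastMCP helper in the official MCP Python SDK (the `mcp` package). The ticket-lookup tool and its fake data are invented for illustration, and exact SDK details may differ by version.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("support-tools")

# Hypothetical stand-in for a real ticketing system.
FAKE_TICKETS = {"T-101": "Open: customer cannot log in after password reset."}

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return the current status of a support ticket."""
    return FAKE_TICKETS.get(ticket_id, "No ticket found with that id.")

if __name__ == "__main__":
    # An MCP client (e.g. Claude Desktop or an IDE) connects and can call lookup_ticket.
    mcp.run()
```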
Together, RAG and MCP shape how I think about building AI products: RAG for “use this body of knowledge,” MCP for “use these tools and data sources.” Both require clear product decisions: what to expose, how to scope access, how to handle errors and latency, and how to explain to the user what the system did. I’m tracking how products like Cursor and Claude use MCP to stay flexible and how teams combine RAG with tool use for more capable, trustworthy experiences.