Goodeye Labs

Beyond the Demo

Building Reliable AI with LLM Evaluations

Randy Olson, PhD

Co-Founder & CTO, Goodeye Labs

Portland AI Engineers

January 14, 2026

  • Welcome to Portland AI Engineers
  • Randy Olson, Co-Founder & CTO at Goodeye Labs
  • We build evaluation infrastructure for AI systems
  • Today's question: How do you actually know if your AI works?
Act 1 of 3

Act 1

The Problem

Why Traditional Approaches Fail

Coming Up

How hallucinations cost Cursor real revenue: A real-world cautionary tale
Why most teams discover failures from angry users: The reactive trap
Why assert can't save you: Determinism vs. stochastic outputs
  • Opening with a story that illustrates why this matters

A support bot hallucination cost Cursor real revenue

Cursor support bot hallucination screenshot

"Sam" was not a real person. It was Cursor's AI support bot.

It told multiple users about this policy.

The policy didn't exist. The AI hallucinated it.

  • April 2025: Cursor (the AI code editor many of you use) had its support chatbot go rogue
  • "Sam" was the name of the AI bot in the chat — not a human agent
  • Multiple users asked about logging in on different devices
  • The AI bot confidently claimed a single-device policy existed
  • Said users had to log out first to switch devices
  • Problem: That policy didn't exist — the AI completely hallucinated it

Hallucinations create non-deterministic failures

Non-deterministic failures
Some users got the hallucination, others didn't

Users argued with each other
"That's not true, I switch devices all the time"

CEO had to intervene
Public apology and damage control

The haunting question
How many users got wrong answers and never said anything?

  • Failures were non-deterministic: some users got hallucination, others didn't
  • Users posting on social media disagreed with each other
  • "That's not true, I switch devices all the time"
  • Chaos: nobody could distinguish real policy from AI invention
  • CEO had to personally intervene
  • Lost subscriptions and trust
  • The haunting question: How many users got wrong answers and never said anything?
  • How many quietly canceled? No way to know

Most teams discover failures from angry users

1

User complains

→
2

You investigate

→
3

You fix that case

→
?

Did you create a new problem?

Without systematic evaluation, you're flying blind.

  • I call this the reactive loop
  • Without systematic evaluation, you're stuck:
  • User complains → you investigate → you fix that case
  • No idea if problem is happening to other users right now
  • No idea if your fix created a new problem somewhere else
  • You're flying blind
  • Scary part: Cursor is an AI company building AI tools for developers
  • If they can't catch this, what chance do the rest of us have?

Why traditional testing breaks

Traditional Software

# Deterministic
assert add(2, 2) == 4 # Always passes
assert add(2, 2) == 4 # Always passes
assert add(2, 2) == 4 # Always passes

LLM Outputs

# Stochastic
llm("Write about AI")
# "AI is transforming..."
llm("Write about AI")
# "Artificial intelligence..."
llm("Write about AI")
# "The field of AI..."

Unit tests don't work when the output is different every time

  • Why not traditional software testing approaches?
  • We have unit tests for deterministic code
  • Why not LLMs?
  • LLMs are stochastic
  • Same input gives different outputs
  • Breaks the fundamental assumption of unit testing
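
To make the contrast concrete, here is a minimal runnable sketch (Python). The fake_llm function is a stand-in that simulates sampling so the example runs without an API key; swap in your real client to see the same effect.

import random

def add(a: int, b: int) -> int:
    return a + b

# Deterministic: the same input gives the same output on every run.
assert add(2, 2) == 4

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; sampling makes each call differ."""
    return random.choice([
        "AI is transforming...",
        "Artificial intelligence...",
        "The field of AI...",
    ])

# Stochastic: three identical calls can return three different strings,
# so there is no single golden answer to assert against.
print({fake_llm("Write about AI") for _ in range(3)})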

When AI must sound like you

The challenge:

Building AI products, shipping to production

Sharing insights with the community

No time to write every post manually

AI must sound like me, not generic LLM

One aspect of my voice: how I use em dashes vs. how AI uses them

My Style

I use em dashes occasionally—but only in paired parenthetical constructions—where I'm adding a side thought.

AI Style

This is amazing—truly revolutionary. The future is here—and it's incredible—beyond anything we imagined.

  • Concrete example from my own work
  • Building cutting-edge AI products, shipping to production
  • Committed to sharing insights with community—hence this talk
  • Don't have time to write every piece of content myself
  • Use AI to produce content more efficiently
  • KEY: It has to sound like me—readers notice when something's off
  • Even well-prompted AI doesn't match my voice consistently
  • Need quality control: evals provide continuous feedback to the AI writer
  • Part of my voice: how I use em dashes
  • I use them occasionally, but only in one specific way
  • Paired parenthetical constructions—like this—for side thoughts
  • AI writing overuses em dashes differently: emphasis, dramatic pauses, interruption

Simple rules don't work

em_dash_count = text.count("—")
assert em_dash_count <= 5, "Too many em dashes!"
8 em dashes, all parenthetical → Fine for my voice
2 em dashes, dramatic emphasis → Not my style

Counting doesn't capture the pattern.
You can't write a regex to distinguish "dramatic" from "parenthetical."

  • Try traditional approach: count em dashes, flag if greater than 5
  • Doesn't work
  • 8 em dashes, all paired parenthetical = totally fine for my voice
  • 2 em dashes, both dramatic emphasis = not my style at all
  • Counting doesn't capture the pattern
  • Regex can't distinguish between usage types reliably
  • Can't write regex to catch "dramatic vs. parenthetical" em dash usage
  • Same problem as earlier Cursor example—nuanced patterns break simple rules
  • Non-determinism and nuance break traditional testing
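
A small sketch of why the count-based check from this slide misfires (the sample strings are invented for illustration; em dashes are written as \u2014 escapes):

def too_many_em_dashes(text: str, limit: int = 5) -> bool:
    # The naive guardrail: flag on raw count alone.
    return text.count("\u2014") > limit

# 8 em dashes, all paired parenthetical: matches my voice, yet gets flagged.
parenthetical = (
    "I use em dashes\u2014but only in paired constructions\u2014for side thoughts. " * 4
)
# 2 em dashes, both dramatic emphasis: not my style, yet slips through.
dramatic = "This is amazing\u2014truly revolutionary. The future is here\u2014incredible."

print(too_many_em_dashes(parenthetical))  # True  -> false positive
print(too_many_em_dashes(dramatic))       # False -> false negative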

Better prompts help, but hit a wall

Version 1

Write in a professional tone

→
Version 2

Write in a professional tone.
Be concise.
Don't overuse em dashes.

→
Version 7

Write in a professional tone.
DO: [15 rules]
DON'T: [20 rules]
EXCEPTIONS: [...]

LLMs follow these lists... inconsistently

Long rule lists become noise. Edge cases multiply. You're playing whack-a-mole, not engineering.

Prompt engineering hits a wall. How do you even know if version 7 is better than version 2?

  • OK so unit tests and simple rules don't work—what about better prompts?
  • Week 1: "Write in a professional tone"
  • Week 4: Add more instructions as you find edge cases
  • Week 8: Long lists of DO THIS and DON'T DO THAT
  • Problem: LLMs follow these lists inconsistently
  • Long rule lists become noise—the model can't prioritize
  • Edge cases multiply faster than you can write rules
  • You're playing whack-a-mole, not engineering
  • Prompt engineering has its place, but it hits a wall
  • And here's the deeper problem: how do you even know if your changes are helping?
  • Without systematic evaluation, you're just guessing

LLMs can do almost anything except stay focused

LLMs are capable of practically infinite possibility

That's their strength. It's also their weakness.

Static Guardrails

Hard-coded rules and tests

Limits the AI must stay within

Necessary

+

Quality Signals

Contextual evaluations

Match your preferences, capture nuance

Often missing

Static guardrails alone aren't enough. You need quality signals too.

  • So here's the fundamental challenge
  • LLMs are capable of practically infinite possibility and creation
  • That's their strength—but also their weakness
  • They can do so much that they're not always focused on what you need
  • You need two types of controls:
  • Static guardrails: unit tests, type checks, assertions—hard-coded limits
  • We just saw these fail: the assert on em dash count didn't capture the pattern
  • Quality signals: contextual evals that match your preferences
  • Many teams have static guardrails, but miss quality signals
  • That's the gap we're going to close in Act 2 and 3
Act 2 of 3

Act 2

The First Solution

LLM-as-a-Judge

Coming Up

Can AI evaluate AI? The LLM-as-Judge approach
What generic evals actually catch (and miss): Live demo
The gap that determines success or failure: Domain context
  • Natural next step: use an LLM to evaluate another LLM's output
  • Let's see what that looks like

This talk: contextual evaluations, not generic benchmarks

Not This

Benchmark Evaluation

Generic tests like MMLU, HumanEval that measure general model capabilities

Measures breadth

This Talk

Contextual Evaluation

Does your AI work for YOUR specific use case on YOUR specific data?

Measures depth

We're not comparing models. We're asking: does this AI output meet YOUR bar?

  • Before I show you the solution, quick scope clarification
  • You've seen the problem: Cursor hallucinating policies, em dashes that counting can't catch
  • Now you might be thinking "okay, how do we evaluate this?"
  • NOT benchmark evaluation (MMLU, HumanEval, Chatbot Arena)
  • Those compare models across broad general tasks
  • We're asking a different question:
  • Does YOUR AI work for YOUR specific use case on YOUR data?
  • Contextual evaluation measures depth on the problems you're actually solving—grounded in your domain, your data, your standards
  • With that framing, let me show you the first approach teams try

Use an LLM to evaluate another LLM

AI Output

→

LLM Judge

→

Pass/Fail

LLM-as-a-Judge Prompt

"Evaluate if this text overuses em dashes in a way that signals AI-generated writing"

  • Use an LLM to evaluate another LLM's output
  • Create an eval with a simple prompt
  • No examples, no labeled data
  • Just a generic instruction to the judge
  • Vanilla LLM judge relying on model's general understanding of "overuse"
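
For reference, a generic judge is only a thin wrapper around a model call. A minimal sketch, assuming a hypothetical call_llm() helper in place of whatever SDK you actually use:

GENERIC_JUDGE_PROMPT = """Evaluate if this text overuses em dashes in a way that
signals AI-generated writing. Answer with exactly PASS or FAIL.

Text:
{text}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your model API (OpenAI, Anthropic, local, etc.)."""
    raise NotImplementedError("plug in your own client here")

def generic_judge(ai_output: str) -> bool:
    verdict = call_llm(GENERIC_JUDGE_PROMPT.format(text=ai_output))
    return verdict.strip().upper().startswith("PASS")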

Generic judges miss domain-specific nuance

Sample | Description | Result
A | No em dashes | Pass
B | 2 em dashes, trailing emphasis | Pass (wrong!)
C | 8 em dashes, all parenthetical | Fail (wrong!)
D | 6 em dashes, mixed | Fail
  • [DEMO: Run generic eval on 4 samples in Truesight]
  • Sample A (no em dashes): Pass ✓
  • Sample D (6 mixed em dashes): Fail ✓
  • Sample B (2 em dashes, trailing emphasis): Pass ✗ WRONG
  • Generic judge passes because COUNT is low
  • But both are dramatic emphasis, which is NOT my style
  • Sample C (8 em dashes, all parenthetical): Fail ✗ WRONG
  • Generic judge fails because COUNT is high
  • But all are paired parenthetical, exactly my style

Generic judges don't know YOUR standards

Generic LLM-as-Judge

"Does this overuse em dashes?"

Generic pattern detection
without domain context

vs

What We Actually Need

"Does this match Randy's specific em dash pattern?"

Task-specific quality assessment
tuned to my actual voice

  • Critical gap between generic and contextual evaluation
  • Generic asks: "Does this overuse em dashes?"
  • Generic = pattern detection without domain context
  • What we need: "Does this match Randy's specific em dash pattern?"
  • Task-specific quality assessment tuned to my actual voice
  • Think back to Cursor:
  • Generic eval: "does this answer the user's question?" Yes, it answered
  • But didn't check: "is this factually consistent with our actual policies?"
  • That's the domain-specific check they needed
Act 3 of 3

Act 3

The Solution

Contextual Evals with Domain Expertise

Coming Up

What 20 labeled examples actually change: Domain expertise in action
Same approach, much better results: Live demo
How to integrate evals without slowing down: Beyond one-time checks
  • How do we close this gap?
  • How do we encode YOUR domain expertise into the eval?
  • This is where the magic happens

20 examples is all you need

PASS: Paired parenthetical

"I use em dashes—but only this way—for side thoughts."

FAIL: Dramatic/emphasis

"This is incredible—truly amazing."

  • [VIDEO: Error analysis demo playing silently]
  • Now adding domain expertise: labeled examples from MY writing
  • Analyzed a mixture of my own writing and AI-generated versions of my writing
  • Made notes when it looked good and when it looked bad
  • 95%+ of my em dashes are paired parenthetical constructions
  • Other usage types basically never appear in my authentic writing
  • Adding real examples to the eval from my actual writing
  • Personally labeled: this pattern = pass, other patterns = fail
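
As data, the labeled examples can be as simple as a list of snippets with my verdicts attached (structure illustrative; the hand labels are what carry the domain expertise; em dashes written as \u2014 escapes):

# Roughly 20 snippets: a mix of my real writing and AI drafts of "my" writing,
# each labeled by me.
LABELED_EXAMPLES = [
    {
        "text": "I use em dashes\u2014but only this way\u2014for side thoughts.",
        "label": "PASS",
        "note": "paired parenthetical, matches my voice",
    },
    {
        "text": "This is incredible\u2014truly amazing.",
        "label": "FAIL",
        "note": "dramatic trailing emphasis, not my style",
    },
    # ...plus the remaining labeled snippets
]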

Show the judge what good looks like

AI Output

→
+ Domain Expertise

LLM Judge

→

Pass/Fail

LLM-as-a-Judge Prompt

"Evaluate if this text overuses em dashes in a way that signals AI-generated writing"

  • Same architecture as before: AI output → LLM Judge → Pass/Fail
  • But now the judge is grounded in my actual preferences
  • Same prompt as before—what's different is the labeled examples we just added
  • 20 examples from my real writing showing the judge what patterns to look for
  • I'm simplifying here—there's real data science behind grounding evals in expert judgment effectively
  • Techniques like few-shot selection, calibration, and alignment matter
  • But the core idea is simple: give the judge YOUR context through examples
  • Let's see what difference this makes
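
The simplest way to hand the judge that context is few-shot prompting with the labeled examples. A sketch only, reusing the hypothetical call_llm() and LABELED_EXAMPLES from the earlier sketches, and leaving out the example selection and calibration work mentioned above.

def build_contextual_prompt(ai_output: str) -> str:
    # Show the judge labeled examples of my voice before asking for a verdict.
    shots = "\n\n".join(
        f'Text: {ex["text"]}\nVerdict: {ex["label"]} ({ex["note"]})'
        for ex in LABELED_EXAMPLES
    )
    return (
        "Evaluate if this text overuses em dashes in a way that signals "
        "AI-generated writing, judged against the labeled examples of the "
        "author's voice below. Answer with exactly PASS or FAIL.\n\n"
        f"{shots}\n\nText: {ai_output}\nVerdict:"
    )

def contextual_judge(ai_output: str) -> bool:
    verdict = call_llm(build_contextual_prompt(ai_output))
    return verdict.strip().upper().startswith("PASS")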

Contextual evals catch what generic judges miss

Sample | Generic Result | Contextual Result
A (no em dashes) | Pass | Pass
B (2, trailing) | Pass | Fail
C (8, parenthetical) | Fail | Pass
D (6, mixed) | Fail | Fail
  • [DEMO: Show side-by-side comparison in Truesight]
  • Same samples, now with domain expertise encoded
  • Sample C (8 paired parenthetical em dashes):
  • Generic: "fail, high count"
  • Contextual: "pass, matches established voice pattern"
  • Sample B (2 trailing emphasis em dashes):
  • Generic: "pass, low count"
  • Contextual: "fail, wrong usage type"
  • Same approach, completely different results
  • The difference: domain expertise encoded through labeled examples

Integrate evals across the entire lifecycle

Development

Quality bars for your devs to tune AI features against

Pre-Deployment

Regression detection when you change prompts or switch models

Production

Continuous monitoring for drift before users notice

Guardrails

Block error modes before they reach users

AI agents can continuously check progress against evals, not just at the end

  • Once you have this eval built, you can use it everywhere:
  • Development: Quality bars for your devs to measure and tune AI features against
  • Whether you're building AI products OR using AI-assisted coding
  • AI coding agents can continuously check progress against evals after hard-coded checks pass
  • Makes sure it's building in the right direction to meet your preferences
  • Pre-deployment: Regression detection when you change prompts or switch models—no guesswork
  • Production: Continuous monitoring for drift before your users notice
  • Guardrails: Extra layer of protection—be certain error modes don't get through to users
  • Key insight: evals aren't just a one-time check. They're continuous quality signals.
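
At the pre-deployment stage, the regression check can be a plain pytest test over stored cases with a pass-rate threshold. A sketch under stated assumptions: voice_eval is a hypothetical module holding the contextual_judge sketch from earlier, and test_cases.json and the 0.9 bar are illustrative.

import json

from voice_eval import contextual_judge  # hypothetical module from the earlier sketch

def test_voice_eval_regression():
    # Labeled cases collected from real failures and real "good" outputs.
    with open("test_cases.json") as f:  # illustrative path
        cases = json.load(f)
    agreements = [
        contextual_judge(case["text"]) == (case["label"] == "PASS")
        for case in cases
    ]
    pass_rate = sum(agreements) / len(agreements)
    # Fail the build if agreement with the hand labels drops below the bar.
    assert pass_rate >= 0.9, f"Eval agreement regressed to {pass_rate:.0%}"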

Put the eval in the loop

The eval is in the loop. The agent can't produce failing output.

  • How this integrates into production:
  • [DEMO: Slack agent workflow]
  • Slack agent that writes content for me
  • Has access to my writing guidelines
  • Has access to the deployed Truesight eval endpoint we just built
  • Remember the guardrails stage from the previous slide? This demo shows exactly that pattern in action
  • Key point: eval isn't just a one-time quality check
  • It's integrated into the content generation loop
  • The agent can't produce output that fails my voice quality bar
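
The guardrail pattern in miniature: generate, judge, and retry with feedback until the draft clears the bar. A sketch only; draft_post stands in for the Slack agent's writing step, and contextual_judge (from the earlier sketch) stands in for the deployed Truesight eval endpoint.

def draft_post(brief: str, feedback: str = "") -> str:
    """Hypothetical writing step: the agent's LLM call, given my guidelines."""
    raise NotImplementedError("plug in the agent's generation call here")

def write_until_it_passes(brief: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        draft = draft_post(brief, feedback)
        # contextual_judge: the judge sketched earlier, or a call to the deployed eval endpoint.
        if contextual_judge(draft):
            return draft
        feedback = "Previous draft failed the voice eval; fix the em dash usage."
    raise RuntimeError("No draft passed the voice quality bar after retries")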

The same pattern applies everywhere

EdTech - AI Tutoring

Gap: Pedagogical scaffolding, curriculum alignment

Generic evals check "does this explain the concept?"

Contextual eval checks "does this address this student's gaps?"

HealthTech - AI Diagnosis

Gap: Patient-specific risk factors, protocol compliance

Generic evals check "is this medically plausible?"

Contextual eval checks "is this the right diagnosis for this patient?"

FinTech - AI Advice

Gap: Client suitability, compliance requirements

Generic evals check "is this reasonable advice?"

Contextual eval checks "does this match this client's goals and obligations?"

Generic measures breadth. Your application needs depth.

  • My example was brand voice, but the same pattern applies everywhere:
  • EdTech: Generic asks "does this explain the concept?"
  • But what you need is "does this address THIS student's specific gaps?"
  • HealthTech: Generic asks "is this medically plausible?"
  • But what you need is "is this the right diagnosis for THIS patient?"
  • FinTech: Generic asks "is this reasonable advice?"
  • But what you need is "does this match THIS client's actual goals and obligations?"
  • The gap between generic and contextual is the gap between breadth and depth
  • Generic measures breadth. Your application needs depth.

Start small & iterate

Myth: You need hundreds of test cases from day one

Reality:

  • 20-50 simple test cases drawn from real failures
  • Begin with manual checks you already run
  • Convert production failures into test cases
  • Your eval suite grows organically
  • Teams delay because they think they need hundreds of test cases from day one
  • You don't
  • 20-50 simple test cases drawn from real failures is a great start
  • Begin with manual checks you already run during development
  • Find new failure modes in production? Convert them into test cases
  • Your eval suite grows organically from real-world issues

Five principles for production-ready evals

1

Traditional tests break with stochastic systems

2

Generic LLM-as-Judge handles variability but misses domain nuance

3

Domain expertise closes the gap between what's measured and what matters

4

Integrate evals throughout dev, deploy, production, and guardrails

5

Start small with 20-50 examples drawn from real failures

  • Recap:
  • 1. Traditional unit tests don't work for stochastic systems
  • 2. Generic LLM-as-Judge handles variability but misses domain-specific nuance
  • 3. Domain expertise closes the gap between what's measured and what matters
  • 4. Evals integrate into every stage of development
  • 5. Start small with real failures and grow organically

Get started this week

Continuous improvement
1

Pick one failure mode you've already seen

2

Write 10 test cases for it

3

Define pass/fail criteria with an LLM judge

4

Add 10-20 labeled examples from your domain

5

Run it regularly in your workflow

  • You can start building evals today:
  • 1. Pick one failure mode you've already seen in your AI product (just one)
  • 2. Write 10 test cases for it
  • 3. Define pass/fail criteria and set up a basic LLM judge (open source or platform like Truesight)
  • 4. Add 10-20 labeled examples based on YOUR judgment
  • 5. Run it regularly in your dev workflow or CI/CD
  • Now you're catching that failure mode automatically

Tools to get you started

Truesight by Goodeye Labs

What I demoed today

goodeyelabs.com

Hamel Husain's LLM Evals FAQ

Comprehensive guide covering basics to advanced topics

hamel.dev/blog/posts/evals-faq

Anthropic's Guide to Agent Evals

Excellent practical advice, battle-tested

anthropic.com/engineering/demystifying-evals-for-ai-agents

Arize Phoenix

Open source LLM observability & evaluation

github.com/Arize-ai/phoenix

LangSmith

LangChain's tracing & evaluation platform

langchain.com/langsmith

  • Resources to get you started:
  • Truesight (what I demoed today) - featured prominently
  • Hamel Husain's LLM Evals FAQ (comprehensive guide from basics to advanced topics)
  • Anthropic's guide to agent evals (excellent practical advice, battle-tested)
  • Arize Phoenix and LangSmith (open source tools for observability and evaluation)
  • All resources have QR codes for easy access

Don't wait for perfect coverage.
Start small this week.

Goodeye Labs

Randy Olson, PhD

randalolson.com

goodeyelabs.com

linkedin.com/in/randalolson

@randal_olson

Scan to join the Truesight waitlist

Let's connect! I'd love to hear what you build.

  • Don't wait for hundreds of test cases or perfect coverage
  • Start small this week
  • The sooner you treat evals as core to AI development, the sooner you escape the reactive loop
  • Ship better systems with confidence
  • Let's connect on LinkedIn - QR code available for easy scanning
  • I'd love to hear what you build
  • Transition to Q&A: "Before we move to your questions, I want to remind you that all the contact information you need to take the next step is displayed here. Whether you're interested in implementation consulting, joining our beta program, or simply staying updated on our research, you can reach us at these channels. I'd love to hear what questions you have—and as you can see, everything you need to connect with me afterward is right here."
  • During Q&A: Keep this slide visible throughout. Reference it when appropriate: "That's an excellent question about implementation timeline, and you can reach our implementation team using the contact information you see here."
  • After Q&A: Return to this slide for final closing: "Thank you for these great questions. As you leave, please use the information on this slide to take the next step in implementing these solutions."
  • Common Q&A questions I'm prepared for:
  • Q: How much data to see improvement? A: Usually 15-30 examples gets 80% of value
  • Q: Works for technical/objective tasks? A: Yes, especially code review, compliance
  • Q: How to handle disagreement in labeling? A: Start with unambiguous cases, calibrate on edge cases
  • Q: What's the cost difference? A: Marginal (same API calls, different prompts)
  • Q: How to maintain evals as requirements change? A: Living documentation (update labels as understanding evolves)