Goodeye Labs

Beyond the Demo

Building Reliable AI with LLM Evaluations

Randy Olson, PhD

Co-Founder & CTO, Goodeye Labs

Portland AI Engineers

January 14, 2026

  • Welcome to Portland AI Engineers
  • Randy Olson, Co-Founder & CTO at Goodeye Labs
  • We build evaluation infrastructure for AI systems
  • Today's question: How do you actually know if your AI works?
Act 1 of 3

Act 1

The Problem

Why Traditional Approaches Fail

Coming Up

How hallucinations cost Cursor real revenue: A real-world cautionary tale
Why most teams discover failures from angry users: The reactive trap
Why assert can't save you: Determinism vs. stochastic outputs
  • Opening with a story that illustrates why this matters

A support bot hallucination cost Cursor real revenue

Cursor support bot hallucination screenshot

"Sam" was not a real person. It was Cursor's AI support bot.

It told multiple users about this policy.

The policy didn't exist. The AI hallucinated it.

  • April 2025: Cursor (the AI code editor many of you use) had its support chatbot go rogue
  • "Sam" was the name of the AI bot in the chat — not a human agent
  • Multiple users asked about logging in on different devices
  • The AI bot confidently claimed a single-device policy existed
  • Said users had to log out first to switch devices
  • Problem: That policy didn't exist — the AI completely hallucinated it

Hallucinations create non-deterministic failures

Non-deterministic failures
Some users got the hallucination, others didn't

Users argued with each other
"That's not true, I switch devices all the time"

CEO had to intervene
Public apology and damage control

The haunting question
How many users got wrong answers and never said anything?

  • Failures were non-deterministic: some users got hallucination, others didn't
  • Users posting on social media disagreed with each other
  • "That's not true, I switch devices all the time"
  • Chaos: nobody could distinguish real policy from AI invention
  • CEO had to personally intervene
  • Lost subscriptions and trust
  • The haunting question: How many users got wrong answers and never said anything?
  • How many quietly canceled? No way to know

Most teams discover failures from angry users

1

User complains

→
2

You investigate

→
3

You fix that case

→
?

Did you create a new problem?

Without systematic evaluation, you're flying blind.

  • I call this the reactive loop
  • Without systematic evaluation, you're stuck:
  • User complains → you investigate → you fix that case
  • No idea if problem is happening to other users right now
  • No idea if your fix created a new problem somewhere else
  • You're flying blind
  • Scary part: Cursor is an AI company building AI tools for developers
  • If they can't catch this, what chance do the rest of us have?

Why traditional testing breaks

Traditional Software

# Deterministic
assert add(2, 2) == 4 # Always passes
assert add(2, 2) == 4 # Always passes
assert add(2, 2) == 4 # Always passes

LLM Outputs

# Stochastic
llm("Write about AI")
# "AI is transforming..."
llm("Write about AI")
# "Artificial intelligence..."
llm("Write about AI")
# "The field of AI..."

Unit tests don't work when the output is different every time

  • Why not traditional software testing approaches?
  • We have unit tests for deterministic code
  • Why not LLMs?
  • LLMs are stochastic
  • Same input gives different outputs
  • Breaks the fundamental assumption of unit testing
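
To make the contrast concrete, here is a minimal runnable sketch (Python). The fake_llm function is a stand-in that simulates sampling so the example runs without an API key; swap in your real client to see the same effect.

import random

def add(a: int, b: int) -> int:
    return a + b

# Deterministic: the same input gives the same output on every run.
assert add(2, 2) == 4

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; sampling makes each call differ."""
    return random.choice([
        "AI is transforming...",
        "Artificial intelligence...",
        "The field of AI...",
    ])

# Stochastic: three identical calls can return three different strings,
# so there is no single golden answer to assert against.
print({fake_llm("Write about AI") for _ in range(3)})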

When AI must sound like you

The challenge:

Building AI products, shipping to production

Sharing insights with the community

No time to write every post manually

AI must sound like me, not generic LLM

One aspect of my voice: how I use em dashes vs. how AI uses them

My Style

I use em dashes occasionally—but only in paired parenthetical constructions—where I'm adding a side thought.

AI Style

This is amazing—truly revolutionary. The future is here—and it's incredible—beyond anything we imagined.

  • Concrete example from my own work
  • Building cutting-edge AI products, shipping to production
  • Committed to sharing insights with community—hence this talk
  • Don't have time to write every piece of content myself
  • Use AI to produce content more efficiently
  • KEY: It has to sound like me—readers notice when something's off
  • Even well-prompted AI doesn't match my voice consistently
  • Need quality control: evals provide continuous feedback to the AI writer
  • Part of my voice: how I use em dashes
  • I use them occasionally, but only in one specific way
  • Paired parenthetical constructions—like this—for side thoughts
  • AI writing overuses em dashes differently: emphasis, dramatic pauses, interruption

Simple rules don't work

em_dash_count = text.count("—")
assert em_dash_count <= 5, "Too many em dashes!"
8 em dashes, all parenthetical → Fine for my voice
2 em dashes, dramatic emphasis → Not my style

Counting doesn't capture the pattern.
You can't write a regex to distinguish "dramatic" from "parenthetical."

  • Try traditional approach: count em dashes, flag if greater than 5
  • Doesn't work
  • 8 em dashes, all paired parenthetical = totally fine for my voice
  • 2 em dashes, both dramatic emphasis = not my style at all
  • Counting doesn't capture the pattern
  • Regex can't distinguish between usage types reliably
  • Can't write regex to catch "dramatic vs. parenthetical" em dash usage
  • Same problem as earlier Cursor example—nuanced patterns break simple rules
  • Non-determinism and nuance break traditional testing
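
A small sketch of why the count-based check from this slide misfires (the sample strings are invented for illustration; em dashes are written as \u2014 escapes):

def too_many_em_dashes(text: str, limit: int = 5) -> bool:
    # The naive guardrail: flag on raw count alone.
    return text.count("\u2014") > limit

# 8 em dashes, all paired parenthetical: matches my voice, yet gets flagged.
parenthetical = (
    "I use em dashes\u2014but only in paired constructions\u2014for side thoughts. " * 4
)
# 2 em dashes, both dramatic emphasis: not my style, yet slips through.
dramatic = "This is amazing\u2014truly revolutionary. The future is here\u2014incredible."

print(too_many_em_dashes(parenthetical))  # True  -> false positive
print(too_many_em_dashes(dramatic))       # False -> false negative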

Better prompts help, but hit a wall

Version 1

Write in a professional tone

→
Version 2

Write in a professional tone.
Be concise.
Don't overuse em dashes.

→
Version 7

Write in a professional tone.
DO: [15 rules]
DON'T: [20 rules]
EXCEPTIONS: [...]

LLMs follow these lists... inconsistently

Long rule lists become noise. Edge cases multiply. You're playing whack-a-mole, not engineering.

Prompt engineering hits a wall. How do you even know if version 7 is better than version 2?

  • OK so unit tests and simple rules don't work—what about better prompts?
  • Week 1: "Write in a professional tone"
  • Week 4: Add more instructions as you find edge cases
  • Week 8: Long lists of DO THIS and DON'T DO THAT
  • Problem: LLMs follow these lists inconsistently
  • Long rule lists become noise—the model can't prioritize
  • Edge cases multiply faster than you can write rules
  • You're playing whack-a-mole, not engineering
  • Prompt engineering has its place, but it hits a wall
  • And here's the deeper problem: how do you even know if your changes are helping?
  • Without systematic evaluation, you're just guessing

LLMs can do almost anything except stay focused

LLMs are capable of practically infinite possibility

That's their strength. It's also their weakness.

Static Guardrails

Hard-coded rules and tests

Limits the AI must stay within

Necessary

+

Quality Signals

Contextual evaluations

Match your preferences, capture nuance

Often missing

Static guardrails alone aren't enough. You need quality signals too.

  • So here's the fundamental challenge
  • LLMs are capable of practically infinite possibility and creation
  • That's their strength—but also their weakness
  • They can do so much that they're not always focused on what you need
  • You need two types of controls:
  • Static guardrails: unit tests, type checks, assertions—hard-coded limits
  • We just saw these fail: the assert on em dash count didn't capture the pattern
  • Quality signals: contextual evals that match your preferences
  • Many teams have static guardrails, but miss quality signals
  • That's the gap we're going to close in Act 2 and 3
Act 2 of 3

Act 2

The First Solution

LLM-as-a-Judge

Coming Up

Can AI evaluate AI? The LLM-as-Judge approach
What generic evals actually catch (and miss): Live demo
The gap that determines success or failure: Domain context
  • Natural next step: use an LLM to evaluate another LLM's output
  • Let's see what that looks like

This talk: contextual evaluations, not generic benchmarks

Not This

Benchmark Evaluation

Generic tests like MMLU, HumanEval that measure general model capabilities

Measures breadth

This Talk

Contextual Evaluation

Does your AI work for YOUR specific use case on YOUR specific data?

Measures depth

We're not comparing models. We're asking: does this AI output meet YOUR bar?

  • Before I show you the solution, quick scope clarification
  • You've seen the problem: Cursor hallucinating policies, em dashes that counting can't catch
  • Now you might be thinking "okay, how do we evaluate this?"
  • NOT benchmark evaluation (MMLU, HumanEval, Chatbot Arena)
  • Those compare models across broad general tasks
  • We're asking a different question:
  • Does YOUR AI work for YOUR specific use case on YOUR data?
  • Contextual evaluation measures depth on the problems you're actually solving—grounded in your domain, your data, your standards
  • With that framing, let me show you the first approach teams try

Use an LLM to evaluate another LLM

AI Output

→

LLM Judge

→

Pass/Fail

LLM-as-a-Judge Prompt

"Evaluate if this text overuses em dashes in a way that signals AI-generated writing"

  • Use an LLM to evaluate another LLM's output
  • Create an eval with a simple prompt
  • No examples, no labeled data
  • Just a generic instruction to the judge
  • Vanilla LLM judge relying on model's general understanding of "overuse"
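
For reference, a generic judge is only a thin wrapper around a model call. A minimal sketch, assuming a hypothetical call_llm() helper in place of whatever SDK you actually use:

GENERIC_JUDGE_PROMPT = """Evaluate if this text overuses em dashes in a way that
signals AI-generated writing. Answer with exactly PASS or FAIL.

Text:
{text}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your model API (OpenAI, Anthropic, local, etc.)."""
    raise NotImplementedError("plug in your own client here")

def generic_judge(ai_output: str) -> bool:
    verdict = call_llm(GENERIC_JUDGE_PROMPT.format(text=ai_output))
    return verdict.strip().upper().startswith("PASS")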

Generic judges miss domain-specific nuance

Sample | Description | Result
A | No em dashes | Pass
B | 2 em dashes, trailing emphasis | Pass (wrong!)
C | 8 em dashes, all parenthetical | Fail (wrong!)
D | 6 em dashes, mixed | Fail
  • [DEMO: Run generic eval on 4 samples in Truesight]
  • Sample A (no em dashes): Pass ✓
  • Sample D (6 mixed em dashes): Fail ✓
  • Sample B (2 em dashes, trailing emphasis): Pass ✗ WRONG
  • Generic judge passes because COUNT is low
  • But both are dramatic emphasis, which is NOT my style
  • Sample C (8 em dashes, all parenthetical): Fail ✗ WRONG
  • Generic judge fails because COUNT is high
  • But all are paired parenthetical, exactly my style

Generic judges don't know YOUR standards

Generic LLM-as-Judge

"Does this overuse em dashes?"

Generic pattern detection
without domain context

vs

What We Actually Need

"Does this match Randy's specific em dash pattern?"

Task-specific quality assessment
tuned to my actual voice

  • Critical gap between generic and contextual evaluation
  • Generic asks: "Does this overuse em dashes?"
  • Generic = pattern detection without domain context
  • What we need: "Does this match Randy's specific em dash pattern?"
  • Task-specific quality assessment tuned to my actual voice
  • Think back to Cursor:
  • Generic eval: "does this answer the user's question?" Yes, it answered
  • But didn't check: "is this factually consistent with our actual policies?"
  • That's the domain-specific check they needed
Act 3 of 3

Act 3

The Solution

Contextual Evals with Domain Expertise

Coming Up

What 20 labeled examples actually change: Domain expertise in action
Same approach, much better results: Live demo
How to integrate evals without slowing down: Beyond one-time checks
  • How do we close this gap?
  • How do we encode YOUR domain expertise into the eval?
  • This is where the magic happens

20 examples is all you need

PASS: Paired parenthetical

"I use em dashes—but only this way—for side thoughts."

FAIL: Dramatic/emphasis

"This is incredible—truly amazing."

  • [VIDEO: Error analysis demo playing silently]
  • Now adding domain expertise: labeled examples from MY writing
  • Analyzed a mixture of my own writing and AI-generated versions of my writing
  • Made notes when it looked good and when it looked bad
  • 95%+ of my em dashes are paired parenthetical constructions
  • Other usage types basically never appear in my authentic writing
  • Adding real examples to the eval from my actual writing
  • Personally labeled: this pattern = pass, other patterns = fail
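
As data, the labeled examples can be as simple as a list of snippets with my verdicts attached (structure illustrative; the hand labels are what carry the domain expertise; em dashes written as \u2014 escapes):

# Roughly 20 snippets: a mix of my real writing and AI drafts of "my" writing,
# each labeled by me.
LABELED_EXAMPLES = [
    {
        "text": "I use em dashes\u2014but only this way\u2014for side thoughts.",
        "label": "PASS",
        "note": "paired parenthetical, matches my voice",
    },
    {
        "text": "This is incredible\u2014truly amazing.",
        "label": "FAIL",
        "note": "dramatic trailing emphasis, not my style",
    },
    # ...plus the remaining labeled snippets
]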

Show the judge what good looks like

AI Output

→
+ Domain Expertise

LLM Judge

→

Pass/Fail

LLM-as-a-Judge Prompt

"Evaluate if this text overuses em dashes in a way that signals AI-generated writing"

  • Same architecture as before: AI output → LLM Judge → Pass/Fail
  • But now the judge is grounded in my actual preferences
  • Same prompt as before—what's different is the labeled examples we just added
  • 20 examples from my real writing showing the judge what patterns to look for
  • I'm simplifying here—there's real data science behind grounding evals in expert judgment effectively
  • Techniques like few-shot selection, calibration, and alignment matter
  • But the core idea is simple: give the judge YOUR context through examples
  • Let's see what difference this makes
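
The simplest way to hand the judge that context is few-shot prompting with the labeled examples. A sketch only, reusing the hypothetical call_llm() and LABELED_EXAMPLES from the earlier sketches, and leaving out the example selection and calibration work mentioned above.

def build_contextual_prompt(ai_output: str) -> str:
    # Show the judge labeled examples of my voice before asking for a verdict.
    shots = "\n\n".join(
        f'Text: {ex["text"]}\nVerdict: {ex["label"]} ({ex["note"]})'
        for ex in LABELED_EXAMPLES
    )
    return (
        "Evaluate if this text overuses em dashes in a way that signals "
        "AI-generated writing, judged against the labeled examples of the "
        "author's voice below. Answer with exactly PASS or FAIL.\n\n"
        f"{shots}\n\nText: {ai_output}\nVerdict:"
    )

def contextual_judge(ai_output: str) -> bool:
    verdict = call_llm(build_contextual_prompt(ai_output))
    return verdict.strip().upper().startswith("PASS")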

Contextual evals catch what generic judges miss

Sample | Generic Result | Contextual Result
A (no em dashes) | Pass | Pass
B (2, trailing) | Pass | Fail
C (8, parenthetical) | Fail | Pass
D (6, mixed) | Fail | Fail
  • [DEMO: Show side-by-side comparison in Truesight]
  • Same samples, now with domain expertise encoded
  • Sample C (8 paired parenthetical em dashes):
  • Generic: "fail, high count"
  • Contextual: "pass, matches established voice pattern"
  • Sample B (2 trailing emphasis em dashes):
  • Generic: "pass, low count"
  • Contextual: "fail, wrong usage type"
  • Same approach, completely different results
  • The difference: domain expertise encoded through labeled examples

Integrate evals across the entire lifecycle

Development

Quality bars for your devs to tune AI features against

Pre-Deployment

Regression detection when you change prompts or switch models

Production

Continuous monitoring for drift before users notice

Guardrails

Block error modes before they reach users

AI agents can continuously check progress against evals, not just at the end

  • Once you have this eval built, you can use it everywhere:
  • Development: Quality bars for your devs to measure and tune AI features against
  • Whether you're building AI products OR using AI-assisted coding
  • AI coding agents can continuously check progress against evals after hard-coded checks pass
  • Makes sure it's building in the right direction to meet your preferences
  • Pre-deployment: Regression detection when you change prompts or switch models—no guesswork
  • Production: Continuous monitoring for drift before your users notice
  • Guardrails: Extra layer of protection—be certain error modes don't get through to users
  • Key insight: evals aren't just a one-time check. They're continuous quality signals.
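
At the pre-deployment stage, the regression check can be a plain pytest test over stored cases with a pass-rate threshold. A sketch under stated assumptions: voice_eval is a hypothetical module holding the contextual_judge sketch from earlier, and test_cases.json and the 0.9 bar are illustrative.

import json

from voice_eval import contextual_judge  # hypothetical module from the earlier sketch

def test_voice_eval_regression():
    # Labeled cases collected from real failures and real "good" outputs.
    with open("test_cases.json") as f:  # illustrative path
        cases = json.load(f)
    agreements = [
        contextual_judge(case["text"]) == (case["label"] == "PASS")
        for case in cases
    ]
    pass_rate = sum(agreements) / len(agreements)
    # Fail the build if agreement with the hand labels drops below the bar.
    assert pass_rate >= 0.9, f"Eval agreement regressed to {pass_rate:.0%}"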

Put the eval in the loop

The eval is in the loop. The agent can't produce failing output.

  • How this integrates into production:
  • [DEMO: Slack agent workflow]
  • Slack agent that writes content for me
  • Has access to my writing guidelines
  • Has access to the deployed Truesight eval endpoint we just built
  • Remember the guardrails stage from the previous slide? This demo shows exactly that pattern in action
  • Key point: eval isn't just a one-time quality check
  • It's integrated into the content generation loop
  • The agent can't produce output that fails my voice quality bar
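
The guardrail pattern in miniature: generate, judge, and retry with feedback until the draft clears the bar. A sketch only; draft_post stands in for the Slack agent's writing step, and contextual_judge (from the earlier sketch) stands in for the deployed Truesight eval endpoint.

def draft_post(brief: str, feedback: str = "") -> str:
    """Hypothetical writing step: the agent's LLM call, given my guidelines."""
    raise NotImplementedError("plug in the agent's generation call here")

def write_until_it_passes(brief: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        draft = draft_post(brief, feedback)
        # contextual_judge: the judge sketched earlier, or a call to the deployed eval endpoint.
        if contextual_judge(draft):
            return draft
        feedback = "Previous draft failed the voice eval; fix the em dash usage."
    raise RuntimeError("No draft passed the voice quality bar after retries")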

The same pattern applies everywhere

EdTech - AI Tutoring

Gap: Pedagogical scaffolding, curriculum alignment

Generic evals check "does this explain the concept?"

Contextual eval checks "does this address this student's gaps?"

HealthTech - AI Diagnosis

Gap: Patient-specific risk factors, protocol compliance

Generic evals check "is this medically plausible?"

Contextual eval checks "is this the right diagnosis for this patient?"

FinTech - AI Advice

Gap: Client suitability, compliance requirements

Generic evals check "is this reasonable advice?"

Contextual eval checks "does this match this client's goals and obligations?"

Generic measures breadth. Your application needs depth.

  • My example was brand voice, but the same pattern applies everywhere:
  • EdTech: Generic asks "does this explain the concept?"
  • But what you need is "does this address THIS student's specific gaps?"
  • HealthTech: Generic asks "is this medically plausible?"
  • But what you need is "is this the right diagnosis for THIS patient?"
  • FinTech: Generic asks "is this reasonable advice?"
  • But what you need is "does this match THIS client's actual goals and obligations?"
  • The gap between generic and contextual is the gap between breadth and depth
  • Generic measures breadth. Your application needs depth.

Start small & iterate

Myth: You need hundreds of test cases from day one

Reality:

  • 20-50 simple test cases drawn from real failures
  • Begin with manual checks you already run
  • Convert production failures into test cases
  • Your eval suite grows organically
  • Teams delay because they think they need hundreds of test cases from day one
  • You don't
  • 20-50 simple test cases drawn from real failures is a great start
  • Begin with manual checks you already run during development
  • Find new failure modes in production? Convert them into test cases
  • Your eval suite grows organically from real-world issues

Five principles for production-ready evals

1

Traditional tests break with stochastic systems

2

Generic LLM-as-Judge handles variability but misses domain nuance

3

Domain expertise closes the gap between what's measured and what matters

4

Integrate evals throughout dev, deploy, production, and guardrails

5

Start small with 20-50 examples drawn from real failures

  • Recap:
  • 1. Traditional unit tests don't work for stochastic systems
  • 2. Generic LLM-as-Judge handles variability but misses domain-specific nuance
  • 3. Domain expertise closes the gap between what's measured and what matters
  • 4. Evals integrate into every stage of development
  • 5. Start small with real failures and grow organically

Get started this week

Continuous improvement
1

Pick one failure mode you've already seen

2

Write 10 test cases for it

3

Define pass/fail criteria with an LLM judge

4

Add 10-20 labeled examples from your domain

5

Run it regularly in your workflow

  • You can start building evals today:
  • 1. Pick one failure mode you've already seen in your AI product (just one)
  • 2. Write 10 test cases for it
  • 3. Define pass/fail criteria and set up a basic LLM judge (open source or platform like Truesight)
  • 4. Add 10-20 labeled examples based on YOUR judgment
  • 5. Run it regularly in your dev workflow or CI/CD
  • Now you're catching that failure mode automatically

Tools to get you started

Truesight by Goodeye Labs

What I demoed today

goodeyelabs.com

Hamel Husain's LLM Evals FAQ

Comprehensive guide covering basics to advanced topics

hamel.dev/blog/posts/evals-faq

Anthropic's Guide to Agent Evals

Excellent practical advice, battle-tested

anthropic.com/engineering/demystifying-evals-for-ai-agents

Arize Phoenix

Open source LLM observability & evaluation

github.com/Arize-ai/phoenix

LangSmith

LangChain's tracing & evaluation platform

langchain.com/langsmith

  • Resources to get you started:
  • Truesight (what I demoed today) - featured prominently
  • Hamel Husain's LLM Evals FAQ (comprehensive guide from basics to advanced topics)
  • Anthropic's guide to agent evals (excellent practical advice, battle-tested)
  • Arize Phoenix and LangSmith (open source tools for observability and evaluation)
  • All resources have QR codes for easy access

Don't wait for perfect coverage.
Start small this week.

Goodeye Labs

Randy Olson, PhD

randalolson.com

goodeyelabs.com

linkedin.com/in/randalolson

@randal_olson

Scan to join the Truesight waitlist

Let's connect! I'd love to hear what you build.

  • Don't wait for hundreds of test cases or perfect coverage
  • Start small this week
  • The sooner you treat evals as core to AI development, the sooner you escape the reactive loop
  • Ship better systems with confidence
  • Let's connect on LinkedIn - QR code available for easy scanning
  • I'd love to hear what you build
  • Transition to Q&A: "Before we move to your questions, I want to remind you that all the contact information you need to take the next step is displayed here. Whether you're interested in implementation consulting, joining our beta program, or simply staying updated on our research, you can reach us at these channels. I'd love to hear what questions you have—and as you can see, everything you need to connect with me afterward is right here."
  • During Q&A: Keep this slide visible throughout. Reference it when appropriate: "That's an excellent question about implementation timeline, and you can reach our implementation team using the contact information you see here."
  • After Q&A: Return to this slide for final closing: "Thank you for these great questions. As you leave, please use the information on this slide to take the next step in implementing these solutions."
  • Common Q&A questions I'm prepared for:
  • Q: How much data to see improvement? A: Usually 15-30 examples gets 80% of value
  • Q: Works for technical/objective tasks? A: Yes, especially code review, compliance
  • Q: How to handle disagreement in labeling? A: Start with unambiguous cases, calibrate on edge cases
  • Q: What's the cost difference? A: Marginal (same API calls, different prompts)
  • Q: How to maintain evals as requirements change? A: Living documentation (update labels as understanding evolves)