Insights from Goodeye Labs

Expert insights on LLM evaluation and AI quality assessment

Evaluation Is the Foundation of Agent Harness Design

Ege Altan, PhD·March 25, 2026·8 min read

OpenAI and Anthropic independently arrived at similar patterns for designing effective agent harnesses. In both cases, evaluation is the foundation. The teams that get judgment right will build better AI products.

Read article →

Article

The Tufte Test: Teaching an AI Agent to Make Better Data Visualizations

Randy Olson, PhD·March 19, 2026·4 min read

AI agents make generic charts. I held one to Tufte's data visualization principles and had it keep improving until the chart carried the story. That same standard now ships as a public, reusable chart workflow.

Read article →

Experiment

The November 2025 AI Coding Surprise, Model by Model

Randy Olson, PhD·February 20, 2026·Interactive

In November 2025, AI coding tools went from halting and clumsy to surprisingly capable. We gave 22 models the same prompt and ran five replicates each to make the shift visible.

Explore experiment →

Article

The Return of the Data Scientists

Ege Altan, PhD·February 18, 2026·5 min read

Generating code is easy now, but more code has never meant better products. As production costs fall toward zero, the bottleneck shifts to judgment and taste. This is data science.

Read article →

Article

The AI Mirror Effect: Why Your AI Evaluations Need Domain Experts

Randy Olson, PhD·January 26, 2026·5 min read

The Anthropic Economic Index shows that the quality of what you put into AI almost perfectly predicts the quality of what you get out. If AI mirrors expertise, so must your AI evaluations.

Read article →

Presentation · Portland AI Engineers

Beyond the Demo: Building Reliable AI with LLM Evaluations

Randy Olson, PhD·January 14, 2026

Learn how to build reliable AI systems using LLM evaluations. This talk covers why traditional testing breaks with stochastic systems, how generic LLM-as-Judge approaches miss domain nuance, and practical steps to implement contextual evaluations that actually work.

View presentation →

Article

2025 Year in Review for LLM Evaluation: When the Scorecard Broke

Randy Olson, PhD·December 28, 2025·15 min read

In 2025, we discovered we'd been measuring memorization, not intelligence. Models scored 80-90% on static benchmarks but dropped to 60-70% on truly novel problems. This year exposed the fundamental crisis in AI evaluation and taught us what to build instead.

Read article →