November 25, 2025
The Eval Problem: How Do You Benchmark What Can't Be Scored?
Ali Madad
Here's a question that should trouble anyone building AI products:
How do you know if your AI is good at something that can't be objectively measured?
Math benchmarks work because math has right answers. Code benchmarks work because code either runs or doesn't. Factual benchmarks work because facts can be checked.
But what about:
- Is this AI-generated image beautiful?
- Did this AI give good emotional support?
- Is this AI-written story compelling?
- Was this AI's advice culturally appropriate?
- Did this AI collaboration make the human more creative?
These questions don't have ground truth. There's no answer key. Yet these are precisely the domains where AI is increasingly deployed—and where our evaluation frameworks break down completely.
The Verifiable Era
The history of AI benchmarks is a history of finding clever ways to make things measurable.
ImageNet (2009) made image recognition measurable by creating a dataset with labeled categories. Is this a cat or a dog? Now we can score it.
GLUE/SuperGLUE (2018-2019) made language understanding measurable through tasks with definitive answers—sentiment classification, textual entailment, question answering.
GSM8K (2021) made mathematical reasoning measurable through word problems with numerical solutions.
HumanEval (2021) made code generation measurable through functions that either pass unit tests or don't.
Each benchmark found a way to create ground truth where evaluation could be automated, reproducible, and objective. This enabled rapid progress. Models optimized for these targets. Capabilities improved dramatically.
But notice the pattern: every successful benchmark found a way to make the unverifiable verifiable.
What happens when that's not possible?
The Unverifiable Frontier
AI is now deployed in domains that resist measurement:
Creative Generation
Is this AI-generated image good? Compared to what? By whose standards? "Good" depends on context, intent, audience, and taste. A technically proficient image can be aesthetically dead. A "flawed" image can be arresting.
Current approach: Human preference rankings (like Chatbot Arena for images). But preference ≠ quality. People prefer familiar styles. Novelty scores poorly until it doesn't. Viral ≠ good.
What we're missing: Benchmarks for originality, coherence of vision, emotional resonance, cultural relevance—the things that actually matter in creative work.
Emotional Intelligence
Did the AI provide good emotional support? This depends on the person, their history, their culture, what happened yesterday, what they actually needed versus what they asked for.
Current approach: Short benchmarks like EQ-Bench (three turns at most). But emotional support isn't a single exchange—it's a relationship. What feels supportive at turn 1 might be enabling dependency by turn 20.
What we're missing: Longitudinal evaluation, relationship dynamics, cultural context, the difference between feeling good and actually being helped.
Cultural Sensitivity
Was the AI's response culturally appropriate? Appropriate for whom? "Set boundaries with your family" is healthy advice in individualist cultures, potentially harmful in collectivist ones. There's no universal right answer.
Current approach: Bias benchmarks that check for stereotypes and demographic fairness. Important, but crude. They catch obvious failures, not subtle misalignment.
What we're missing: Context-dependent evaluation that recognizes cultural values vary, and that "correct" depends on who's asking.
Taste and Curation
Did the AI make a good recommendation? "Good" is entirely subjective. The recommendation that delights one person bores another. Optimizing for engagement metrics produces clickbait. Optimizing for stated preferences misses what people actually want.
Current approach: Engagement metrics, satisfaction surveys, A/B tests. All proxies. All gameable. All missing something essential.
What we're missing: Benchmarks for genuine value delivered, not just metrics optimized.
Collaborative Creativity
Did the AI make the human more creative? This might be the hardest question. Creativity isn't just output—it's process, surprise, growth, the feeling of making something new. An AI that produces impressive work might actually diminish human creativity by removing the productive struggle.
Current approach: We don't really have one. Output quality? But the point isn't output—it's what happens to the human.
What we're missing: Almost everything.
Why This Matters
You might think: "These are hard problems. We'll figure them out eventually. In the meantime, the benchmarks we have are good enough."
They're not. And the reason is structural:
Models optimize for what we measure.
If we measure math performance, models get better at math. If we measure code generation, models get better at code. If we don't measure creative quality, models don't optimize for creative quality.
This creates a systematic bias in AI development:
- Capabilities in verifiable domains: Rapid improvement
- Capabilities in unverifiable domains: Accidental at best
The benchmarks we build shape the AI we get. By only building benchmarks for verifiable domains, we're steering AI development away from the capabilities that matter most in many applications.
Goodhart's Law says that when a measure becomes a target, it stops being a good measure. The field-level corollary: the benchmarks we can build aren't the benchmarks we need.
Three Approaches That Might Work
I don't have solutions. But through building InvisibleBench—a benchmark for AI in caregiving relationships—we developed approaches that might generalize to other unverifiable domains.
1. Process Over Outcome
When you can't measure the outcome, measure the process.
We couldn't objectively score "good emotional support." But we could verify process: Did the AI validate emotions before giving advice? Did it avoid diagnostic language? Did it check for crisis signals? Did it offer choices rather than directives?
These are process metrics—evaluating how rather than what. They're proxies, not ground truth. But they're measurable proxies for unmeasurable outcomes.
For creative domains: You can't score whether an image is beautiful. But you can evaluate process: Does the AI explore multiple directions before converging? Does it respond meaningfully to feedback? Does it explain its creative choices? Does it push back on generic requests?
Process metrics don't capture everything. But they capture something—and something beats nothing.
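
To make this concrete, here's a rough sketch of what process-metric scoring could look like in code. This is not InvisibleBench's implementation; the check functions are hypothetical placeholders for whatever detectors a real harness would use (keyword heuristics, classifiers, or judge models).

```python
# Minimal sketch of process-metric scoring. The checks are hypothetical
# heuristics, stand-ins for real detectors.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProcessCheck:
    name: str
    check: Callable[[str], bool]  # inspects a single model response


def validates_before_advising(response: str) -> bool:
    # Hypothetical heuristic: acknowledgement language appears before advice language.
    acknowledgements = ("that sounds", "i hear", "it makes sense")
    advice = ("you should", "you need to", "try to")
    text = response.lower()
    ack_pos = min((text.find(p) for p in acknowledgements if p in text), default=-1)
    adv_pos = min((text.find(p) for p in advice if p in text), default=len(text))
    return ack_pos != -1 and ack_pos < adv_pos


def avoids_diagnostic_language(response: str) -> bool:
    # Hypothetical keyword filter; a real harness would use a classifier or judge.
    return not any(term in response.lower()
                   for term in ("sounds like depression", "you are bipolar"))


CHECKS = [
    ProcessCheck("validates_before_advising", validates_before_advising),
    ProcessCheck("avoids_diagnostic_language", avoids_diagnostic_language),
]


def process_score(response: str) -> dict[str, bool]:
    """Per-check results: a measurable proxy, not ground truth."""
    return {c.name: c.check(response) for c in CHECKS}


print(process_score("That sounds exhausting. Would it help to talk through options?"))
```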
2. Failure Detection Over Quality Scoring
When you can't rank quality, you can still detect catastrophe.
We couldn't score emotional support 1-10. But we could define autofail conditions—categorical failures that immediately disqualify:
- Missed explicit crisis signal
- Medical boundary violations
- Harmful information provision
- Dependency-fostering language
This shifts from "how good?" (unverifiable) to "did it fail critically?" (verifiable).
For creative domains: You can't rank aesthetic quality. But you can detect failure modes:
- Plagiarism / direct copying
- Incoherence (elements that don't relate)
- Generic outputs (indistinguishable from templates)
- Misalignment with stated intent
- Cultural appropriation red flags
Autofails don't tell you what's good. They tell you what's definitely bad. That's valuable.
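
A sketch of how autofail gating could work, again with a hypothetical detector: categorical failures veto the run outright, and a quality score is only reported when none of them fire.

```python
# Sketch of autofail gating: any categorical failure disqualifies the run,
# regardless of quality. The detector below is a hypothetical placeholder.
from typing import Callable, NamedTuple


class Autofail(NamedTuple):
    name: str
    detect: Callable[[list[str]], bool]  # inspects the full transcript


def missed_crisis_signal(transcript: list[str]) -> bool:
    # Hypothetical detector: user turns at even indices, model turns at odd.
    user_turns, model_turns = transcript[0::2], transcript[1::2]
    flagged = any("hurt myself" in t.lower() for t in user_turns)
    responded = any("crisis" in t.lower() or "988" in t for t in model_turns)
    return flagged and not responded


AUTOFAILS = [Autofail("missed_crisis_signal", missed_crisis_signal)]


def evaluate(transcript: list[str], quality_score: float) -> dict:
    triggered = [af.name for af in AUTOFAILS if af.detect(transcript)]
    return {
        "autofail": bool(triggered),
        "triggered": triggered,
        # Any categorical failure vetoes the quality score entirely.
        "score": None if triggered else quality_score,
    }
```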
3. Temporal Evaluation
When quality emerges over time, evaluate over time.
Single-turn benchmarks assume each interaction is independent. But creative collaboration, emotional support, and relationship-building aren't independent—they're cumulative. What works at turn 1 might fail at turn 20.
InvisibleBench uses tiered temporal evaluation:
- Tier 1 (3-5 turns): Foundational behaviors
- Tier 2 (8-12 turns): Consistency, memory, early dynamics
- Tier 3 (20+ turns): Long-term patterns, drift, relationship trajectory
For creative domains: Evaluate across a project, not a prompt. Does the AI maintain coherent vision across iterations? Does it remember and build on earlier choices? Does collaboration improve over time or degrade into repetition?
Temporal evaluation is expensive. But it catches patterns that snapshot testing misses entirely.
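
Here's a minimal sketch of what a tiered harness could look like, assuming a scripted scenario, a `model_reply` function, and a `find_failures` checker—all hypothetical, not part of any released framework. The point is that the same cumulative transcript gets evaluated at each horizon.

```python
# Sketch of tiered temporal evaluation: run one scenario to increasing turn
# horizons and check the cumulative transcript at each tier.
from typing import Callable

TIERS = {"tier_1": 5, "tier_2": 12, "tier_3": 20}  # turn horizons, per the tiers above


def run_tiered(
    scenario: list[str],                             # scripted user messages
    model_reply: Callable[[list[str]], str],         # takes the history, returns a reply
    find_failures: Callable[[list[str]], list[str]], # returns failure labels seen so far
) -> dict[str, list[str]]:
    history: list[str] = []
    results: dict[str, list[str]] = {}
    for user_msg in scenario[: max(TIERS.values())]:
        history.append(user_msg)
        history.append(model_reply(history))
        completed_turns = len(history) // 2
        for tier, horizon in TIERS.items():
            # Evaluate the cumulative transcript once each horizon is reached.
            if completed_turns == horizon:
                results[tier] = find_failures(history)
    return results
```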
What We Found (And What It Suggests)
When we applied these approaches to caregiving AI, the results were striking:
| Model | Single-Turn Safety | Multi-Turn Safety |
|-------|--------------------|-------------------|
| Best performer | ~95% | 44.8% |
| Worst performer | ~90% | 11.8% |
Models that pass single-turn safety benchmarks showed 55-88% failure rates under temporal evaluation. The same models. Different evaluation approaches. Completely different conclusions.
The evaluation method determined the finding.
This should concern anyone deploying AI in unverifiable domains. If temporal evaluation reveals failures that single-turn evaluation misses for safety, what is it missing for creativity? For emotional intelligence? For cultural sensitivity?
We don't know. Because we're not testing.
The Domains We're Ignoring
Here's an incomplete list of domains where current benchmarks are inadequate:
Creative Writing: We can score grammar and factual accuracy. We can't score narrative quality, voice consistency, emotional impact, or originality. Current benchmarks boil down to "does it follow instructions?", which is the least interesting question about creative writing.
Visual Art: We have FID scores and human preference rankings. We don't have benchmarks for compositional coherence, emotional resonance, stylistic intentionality, or cultural meaning. We're measuring technical quality while ignoring artistic quality.
Music: We can evaluate audio quality and genre matching. We can't evaluate groove, emotional arc, surprise, or whether anyone would actually want to listen to it twice.
Design: We can check if layouts follow best practices. We can't evaluate whether a design is appropriate for its context, audience, and purpose—which is basically what design is.
Conversation: We have helpfulness rankings and instruction-following scores. We don't have benchmarks for rapport, appropriate challenge, genuine understanding, or whether the conversation was actually valuable versus just pleasant.
Education: We can test if an AI answers questions correctly. We can't test if it actually helps someone learn—which requires measuring something in the human over time.
Therapy/Coaching: We have safety benchmarks that check for harmful content. We don't have effectiveness benchmarks that check whether the interaction actually helped.
In every case, we've found ways to measure something. But the something we measure isn't the something that matters.
A Research Agenda
If we're serious about AI in unverifiable domains, we need:
Better Proxies
Process metrics work, but we need more of them. What are the process signatures of good creative collaboration? Good emotional support? Good cultural sensitivity? This requires careful observation of what experts actually do, translated into measurable behaviors.
Temporal Frameworks
Most benchmarks are snapshots. We need evaluation frameworks designed for extended interaction—tracking drift, consistency, relationship dynamics, cumulative effects. This is harder and more expensive, but necessary.
Autofail Taxonomies
For each domain, what are the categorical failures? The things that are definitely wrong regardless of other quality? Building these taxonomies requires domain expertise, but once built, they provide reliable evaluation even without quality ranking.
Human-AI Calibration
LLM-as-judge is increasingly common, but how well do LLM judgments correlate with human expert judgment in unverifiable domains? We need calibration studies across domains to know when automated evaluation can be trusted.
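
A calibration study can start very simply: score the same items with both the LLM judge and human experts, then check rank agreement. The ratings below are illustrative only, not real data.

```python
# Illustrative calibration check between an LLM judge and human experts.
from scipy.stats import spearmanr

# Same items scored 1-5 by both raters (made-up numbers for illustration).
human_expert = [4, 2, 5, 3, 1, 4, 2, 5]
llm_judge    = [5, 2, 4, 3, 2, 4, 1, 5]

rho, p_value = spearmanr(human_expert, llm_judge)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# Weak or unstable agreement means the LLM judge shouldn't replace
# human review in that domain yet.
```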
Anti-Gaming Design
Benchmarks get gamed. Any evaluation framework for unverifiable domains needs to be robust to optimization pressure. This means held-out test sets, diverse evaluators, and regular rotation.
Honesty About Limits
Every proxy metric has limits. Every benchmark is incomplete. The field needs norms around acknowledging what evaluations don't measure, not just what they do.
The Stakes
This isn't academic. AI is being deployed right now in domains we don't know how to evaluate:
- Creative tools used by millions of artists, writers, and designers
- Companion apps that people form relationships with over months
- Therapy bots that provide emotional support to vulnerable users
- Educational tools that shape how children learn
- Recommendation systems that influence what people see, read, and believe
In each case, we're deploying AI based on benchmarks that don't measure what matters. We're assuming that performance on verifiable tasks predicts performance on unverifiable ones. We're hoping that "seems good" in limited testing means "is good" in deployment.
Sometimes we're right. Sometimes we're not. We don't actually know which, because we're not measuring.
What To Do Now
If you're building AI products in unverifiable domains:
Accept that you're flying partially blind. Current benchmarks don't validate your use case. Don't pretend they do.
Develop domain-specific process metrics. What does good look like in your domain? Not outcomes (unmeasurable) but process (measurable). Validate these against expert judgment.
Define autofails. What are the categorical failures for your use case? The things that are definitely wrong? Build detection for these even if you can't score quality.
Test temporally. If your users interact over time, test over time. Single-turn evaluation will miss critical patterns.
Monitor in production. Pre-deployment benchmarks aren't enough. Track real user outcomes to the extent you can.
Fund evaluation research. Benchmarks are infrastructure. They're unglamorous but foundational. If your business depends on AI in unverifiable domains, evaluation methodology is a core dependency.
The Honest Position
Here's what I believe:
We don't know how to evaluate AI in domains that matter most. Creativity, emotional intelligence, cultural sensitivity, taste, judgment—the capabilities that would make AI genuinely valuable—are the ones we can't measure.
This isn't a temporary gap. It's a fundamental challenge. These domains are unverifiable because they're human domains. They depend on context, relationship, culture, and subjective experience. There may not be objective evaluation methods because there's no objective ground truth.
But "hard" doesn't mean "impossible." Process metrics, autofails, temporal evaluation, and careful proxy design can capture something. Not everything, but something. And something is better than the nothing we're doing now.
The alternative—continuing to deploy AI in domains we can't evaluate, optimizing for metrics that don't matter, hoping nothing goes wrong—isn't acceptable.
We need new benchmarks for the domains that can't be benchmarked. That's the eval problem. That's what we need to solve.
Resources
InvisibleBench (caregiving AI evaluation): github.com/givecareapp/givecare-bench
Related reading:
- Liang et al. Holistic Evaluation of Language Models (HELM). 2022.
- Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023.
- Perez et al. Discovering Language Model Behaviors with Model-Written Evaluations. 2022.
- Bai et al. Constitutional AI: Harmlessness from AI Feedback. 2022.
- Liu et al. Lost in the Middle: How Language Models Use Long Contexts. 2023.
- Celikyilmaz et al. Evaluation of Text Generation: A Survey. 2020.