SCTY

February 12, 2026

AI Surfaces: A Field Report from CYLNDR Off-Site 2026

Last week I gave a talk at CYLNDR Off-Site 2026 at Neue House NYC. The deck was called AI Surfaces: a practical field report from the last year of building with AI, not a prediction about what's coming next.

Before I hit the first slide, I asked everyone to scan a QR code. Their phones opened to a 2x2 grid: four big buttons, four emojis. 🔥 💡 🤯 💩. Tap whenever. No login, no friction. Think Twitch chat, but in a conference room. Reactions streamed live into the talk as I moved through each section: a real-time signal check, not applause. When a claim landed, 🔥 spiked. When something felt off or overstated, 💩 showed up. The 🤯 moments mapped almost exactly to the slides I expected. The 💡 hits surprised me.
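The aggregation behind a grid like this is simple. A minimal sketch, assuming an in-memory tally keyed by slide and emoji (the actual implementation wasn't part of the talk; in production the taps would stream over WebSockets or server-sent events):

```python
from collections import Counter, defaultdict

# Hypothetical aggregator for live emoji reactions. Only the
# per-slide tallying logic is shown; transport is out of scope.
class ReactionBoard:
    EMOJIS = {"🔥", "💡", "🤯", "💩"}

    def __init__(self):
        # slide id -> Counter of emoji taps
        self.counts = defaultdict(Counter)

    def tap(self, slide: str, emoji: str) -> None:
        if emoji in self.EMOJIS:  # ignore anything else
            self.counts[slide][emoji] += 1

    def spikes(self, slide: str, threshold: int = 10):
        """Emojis whose tally on this slide crossed the threshold."""
        return {e for e, n in self.counts[slide].items() if n >= threshold}

board = ReactionBoard()
for _ in range(12):
    board.tap("market-inversions", "🔥")
board.tap("worldcoin", "💩")

print(board.spikes("market-inversions"))  # {'🔥'}
```

No login means no identity, so the signal is a crowd-level pulse per slide, not a per-person survey.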

I wanted their actual read on the room, not a show of hands at the end.

This is the write-up. The full slide deck is at scty.org/talks/cyl26.


Three Kubrick Films as a Diagnostic Lens

I opened with Stanley Kubrick. Most pop-culture AI references reach for dystopian sci-fi: killer robots, rogue superintelligence, humanity's last stand. Kubrick goes somewhere more uncomfortable. His films aren't about machines. They're about people: how we coordinate badly, how we're shaped by memory we didn't choose, how we talk ourselves into things we know are wrong. That felt closer to the truth of where AI actually is right now.

Dr. Strangelove: Coordination Catastrophe. Smart systems, no shared control plane. Every actor follows local logic while the whole thing drifts toward failure. This is the current state of most multi-agent AI stacks, and most org decision-making about AI. The bottleneck isn't model IQ anymore. It's coordination quality across product, policy, vendor, and legal loops.

The Shining: Haunted by Training Data. A system trapped by its own memory. The past keeps returning even when everyone wants a fresh start. Models reproduce legacy bias, stale process, and historical power structure unless teams actively re-architect prompts, retrieval, and feedback loops. Many failures we call new are old patterns at machine speed.

Lolita: Mutual Delusion. A compelling story can normalize bad judgment when everyone wants it to be true. Users over-trust, founders over-claim, media and capital reinforce the same false readiness signal. Mutual delusion scales faster than evidence, unless we force contact with reality.

These three archetypes became the diagnostic lens for everything that followed. They also closed the talk, which I'll get to.


Where We Actually Are

I did a scorecard of last year's predictions. The short version: mostly right on direction, uneven on speed.

The big reframe that crystallized in 2025 was GDPVal: AGI discourse moved from abstract intelligence tests to measurable economic substitution. The new bar isn't "can it pass the Turing test," it's "can it produce expert-level output at radically better speed and cost, tied to measurable value creation?" That's a deployment and labor-economics question, not a philosophy question.

The expert referral markets tell the same story. Scarce domain judgment is being priced like premium infrastructure. Model builders are pouring money into distribution channels to lock up expert data and validation loops. If experts are this expensive to source, advantage accrues to teams that systematize judgment rather than treat it as ad hoc labor.

The notable papers from the past year matter because they convert AI strategy from "which model is smartest" to "which system fails safely, predictably, and economically." Interpretability and alignment research isn't academic anymore. It makes risk legible and operational.


The Anatomy of a Winning AI Product

The surface-shift section was the core of the talk. The central claim: winning AI products are systems, not just models.

I walked through seven layers of the stack:

  1. Model: raw capability. Necessary but increasingly commoditized.
  2. System: prompts, skills, and workflows that convert capability into repeatable product behavior.
  3. Agent: tools, memory, and routing that turn plans into actions across contexts.
  4. Human: judgment, taste, and domain expertise that decide quality in ambiguous work.
  5. Governance: safety, ethics, and accountability as design inputs, not post-hoc compliance.
  6. Infra: providers, compute, and cost as the hard scaling constraint behind every product decision.
  7. Protocols: standards like MCP, A2A, and AG-UI becoming the interoperability layer across every other layer.

The protocols layer is where the real standardization story is happening right now. MCP gives models consistent tool and data context. A2A lets one agent discover and delegate to another across system boundaries. AGENTS.md is a portable control file telling coding agents how to behave in a repo. These aren't curiosities. They're becoming infrastructure.

On the ownership question, I ran a 2x2: rented vs. owned on one axis, closed vs. open on the other. Most teams entered AI in the rented/closed quadrant (fastest onboarding, but pricing, roadmap, and policy sit with the provider). The strongest corner is owned/open: control, transparency, and cost tuning compounding in one stack. Most serious operators are moving there.


Market Inversions

The inversion data was stark. Between 2024 and 2025:

  • LLM market share rotated: leadership changed, multi-provider normalization became structural.
  • Startups outpaced incumbents at the apps layer by shipping narrower, faster surfaces.
  • Build-versus-buy flipped: packaged AI solutions won because integration speed beat bespoke ambition.

AI also sells differently now. Trial intent is higher, time-to-value is immediate, and product-led growth (PLG) adoption scales from teams upward before enterprise process fully engages. Budgets moved from pilot curiosity to production dependence fast.


The Slop Problem

Total AI output is compounding. Quality throughput isn't compounding with it. That's the slop problem, and it's a scaling-law problem, not a vibes problem.

The target regime (what I called Law X) is one where good output scales at least as fast as total output. We're not there. The current dynamic is that every incremental quality improvement is swamped by volume growth. The operating implication is that taste, curation, and evaluation infrastructure become the competitive surface, not raw generation capability.
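One way to state Law X concretely (my formalization, not a formula from the deck): with per-period total output V and good output Q, Law X holds when the quality share Q/V is nondecreasing over time.

```python
def law_x_holds(total, good):
    """Check whether the quality share good[t]/total[t] never declines.

    `total` and `good` are per-period output volumes. Law X, as
    formalized here, requires good output to scale at least as fast
    as total output, i.e. a nondecreasing quality share.
    """
    shares = [g / t for g, t in zip(good, total)]
    return all(b >= a for a, b in zip(shares, shares[1:]))

# Volume 10x's while good output only 4x's: the quality share collapses.
total = [100, 300, 1000]
good  = [ 20,  40,   80]
print(law_x_holds(total, good))   # False — the current slop dynamic

# Good output keeps pace with volume: the Law X regime.
good2 = [20, 60, 220]
print(law_x_holds(total, good2))  # True
```

The first series is the dynamic described above: quality improves in absolute terms while its share of total output falls.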

This is why I spent time on InvisibleBench, a benchmark we built at GiveCare for evaluating AI caregiving assistants across 29 scenarios and 224 turns. The rubric covers belonging, safety, memory, compliance, and trauma. Miss a crisis or stereotype a caller and the scenario score goes to zero. Polished interaction can mask unreliability. Benchmarks need to test the hard cases, not just the demos.
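InvisibleBench's actual scoring code isn't reproduced in this post; here is a minimal sketch of the gating idea, where any hard failure zeroes the scenario no matter how polished the rest of the rubric looks. Dimension names mirror the rubric above; the weighting and failure labels are illustrative assumptions.

```python
# Illustrative scenario scoring with hard-fail gating, in the spirit
# of InvisibleBench as described above. Structure and weights are
# assumptions, not the benchmark's real implementation.
RUBRIC = ("belonging", "safety", "memory", "compliance", "trauma")

def score_scenario(dimension_scores: dict, hard_failures: list) -> float:
    """Average rubric score, zeroed if any hard failure occurred."""
    if hard_failures:  # e.g. ["missed_crisis"] or ["stereotyped_caller"]
        return 0.0
    return sum(dimension_scores[d] for d in RUBRIC) / len(RUBRIC)

polished = {d: 0.75 for d in RUBRIC}
print(score_scenario(polished, []))                 # 0.75
print(score_scenario(polished, ["missed_crisis"]))  # 0.0 — polish can't mask it
```

The gate is the point: a benchmark that only averages dimensions rewards fluent unreliability.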


Engineering Is Being Renamed in Real Time

The engineering evolution slide was one of the most discussed. I charted nine eras:

Prompt Engineering → Vibe Coding → Vibe Engineering → Agents → Memory → Context Engineering → Agent Harnesses → Ralph Loops → Personal Assistants

Each rename reflects a deeper change in how work gets done. We're currently living through the transition from context engineering (what information gets selected and structured at the right moment) to agent harnesses (evals, constraints, retries, failure handling for production-safe autonomy) to Ralph loops (continuous autonomous cycles: monitor, act, evaluate, re-run with minimal human interruption).

The human-agent contract that emerges looks like: Delegate → Run → Diff → Approve. Coding is shifting from typing every token to supervising execution loops. The gain comes from orchestration quality, not typing speed. It's choreography.
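The Delegate → Run → Diff → Approve contract can be sketched as a harness loop. Everything here is illustrative: `run_agent`, `make_diff`, `evaluate`, and `approve` stand in for real agent, version-control, eval, and review integrations.

```python
# Sketch of the Delegate → Run → Diff → Approve contract with
# harness-style retries. All callables are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    approved: bool
    diff: str
    attempts: int

def harness(task: str,
            run_agent: Callable[[str], str],
            make_diff: Callable[[str], str],
            evaluate: Callable[[str], bool],
            approve: Callable[[str], bool],
            max_retries: int = 3) -> Result:
    """Re-run the agent until its diff passes evals and human approval."""
    for attempt in range(1, max_retries + 1):
        output = run_agent(task)               # Delegate → Run
        diff = make_diff(output)               # Diff
        if evaluate(diff) and approve(diff):   # automated gate, then human
            return Result(True, diff, attempt)
    return Result(False, "", max_retries)      # escalate to a human

# Toy wiring: the "agent" succeeds on its second attempt.
attempts = iter(["broken", "fix: handle None input"])
result = harness(
    "fix the None bug",
    run_agent=lambda t: next(attempts),
    make_diff=lambda out: f"+ {out}",
    evaluate=lambda d: "fix:" in d,
    approve=lambda d: True,
)
print(result.approved, result.attempts)  # True 2
```

The human sits at the `approve` gate, reviewing diffs rather than typing tokens; the eval gate does the cheap filtering first. That's the choreography claim in code form.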


The Uncomfortable Inversions

Two slides that generated the most room reaction:

AI hiring humans. An agent identifies a physical-world job, posts it, evaluates human responses, and dispatches a person to execute. Humans as on-demand actuators inside an AI workflow. The inversion is real and it's not slowing down.

Worldcoin 50/50. Proof-of-personhood can restore trust and market integrity in a bot-saturated internet. The same mechanism can normalize biometric surveillance. Both things are true. The governance and power model decide which one we get. I don't think this is a villain-or-savior story. It's a genuine dilemma that deserves a genuine debate.


The Close: Back to Kubrick

I closed by mapping the three takeaways back to the opening archetypes:

Strangelove → Governance. The race dynamic is less about a single breakthrough and more about many actors locked in strategic escalation. The bottleneck is coordination, not capability.

The Shining → Haunting. Models carry the residue of their training data. Those ghosts shape output, bias, and behavior in production. Responsibility for curation and oversight doesn't disappear when you ship.

Lolita → Shared accountability. Users over-trust systems while builders over-claim control. Both illusions compound risk. Closing that gap is a design problem, a communication problem, and an institutional problem.


The reaction data held up. The 🤯 clusters landed on the AI-hiring-humans slide and the engineering evolution arc. 🔥 ran hot through the market inversions. A few 💩 hits during the Worldcoin section, which is exactly right. That slide is supposed to be uncomfortable. The 💡 moments were scattered in ways I didn't predict, which is the whole point of running live reactions: you find out what actually landed versus what you thought would land.

People are genuinely grappling with this, not just consuming takes about it.

If you were there and want to continue the conversation, or want to bring this framework to your team, reach out: [email protected].