Gemini watches the video so Claude doesn't have to

Claude can't natively watch video. Gemini can. Together they close the dev feedback loop for AI-generated content.

testing · video · dev-loop

Here's the awkward truth about building an AI video generation pipeline: the model driving the development — Claude — can't watch video. It can read code, reason about prompts, and plan creative direction. But when Naomi generates a 15-second clip, Claude has no way to know if the output is good, bad, or horrifying.

This creates a blind spot in the dev loop. I can iterate on prompt structure, adjust reference handling, tune the composition system — but the only way to know if the output improved is to watch the video myself. That doesn't scale, and it breaks the feedback loop that makes AI-assisted development fast.

The video judge

So we gave the dev loop eyes. The video judge is a module that downloads a generated video, sends it to Gemini 2.5 Pro with native video understanding, and gets back a structured evaluation.

The judge scores against five criteria:

  • Motion quality: Smoothness, naturalness, no jitter or warping
  • Temporal coherence: Frame-to-frame consistency, no flickering
  • Subject consistency: Character/object identity stable across frames
  • Prompt fidelity: Does the output match what was asked for?
  • Visual quality: Resolution, lighting, composition, color

Each criterion gets a 0.0-1.0 score with a brief explanation. The judge also surfaces specific issues ("hands distort at 3.2 seconds", "face becomes inconsistent after the camera cut") and an overall recommendation.
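For concreteness, here's a minimal sketch of what that structured evaluation could look like as Python dataclasses. The field names and the simple unweighted mean are illustrative, not the judge's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionScore:
    """One rubric criterion: 0.0 (broken) to 1.0 (excellent), plus a brief explanation."""
    score: float
    explanation: str


@dataclass
class VideoJudgement:
    """Structured evaluation returned by the judge (illustrative field names)."""
    motion_quality: CriterionScore
    temporal_coherence: CriterionScore
    subject_consistency: CriterionScore
    prompt_fidelity: CriterionScore
    visual_quality: CriterionScore
    issues: list[str] = field(default_factory=list)  # e.g. "hands distort at 3.2 seconds"
    recommendation: str = ""                         # overall verdict
    error: str | None = None                         # set when the judge itself failed

    @property
    def overall(self) -> float:
        """Unweighted mean across criteria; a real judge might weight or subset them."""
        criteria = (self.motion_quality, self.temporal_coherence, self.subject_consistency,
                    self.prompt_fidelity, self.visual_quality)
        return sum(c.score for c in criteria) / len(criteria)
```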

How it fits the dev loop

The judge plugs into our scenario test runner. A test scenario can specify a video judge gate: which tool generated the video, where the URL lives in the result, what score threshold to pass, and which criteria to evaluate.
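A gate definition might look something like this. It's a sketch: the key names (`tool`, `result_path`, `threshold`, `criteria`) and the scenario shape are assumptions, not the real test-runner schema.

```python
# Hypothetical scenario definition with a video judge gate.
scenario = {
    "name": "talking-head-15s",
    "steps": [
        {"tool": "generate_video", "args": {"prompt": "15-second product teaser, one presenter"}},
    ],
    "video_judge": {
        "tool": "generate_video",           # which tool produced the video
        "result_path": "output.video_url",  # where the URL lives in that tool's result
        "threshold": 0.70,                  # minimum overall score to pass
        "criteria": ["motion_quality", "temporal_coherence", "subject_consistency",
                     "prompt_fidelity", "visual_quality"],
    },
}
```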

When the scenario runs, Naomi executes the video generation tools. The judge watches the output and scores it. If the score clears the threshold, the test passes. If not, the test fails and the critique explains why.
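The gating step itself is only a few lines; a sketch using the illustrative types above:

```python
def evaluate_gate(judgement: VideoJudgement, threshold: float) -> str:
    """Turn a judgement into a test outcome: pass, fail, or inconclusive."""
    if judgement.error is not None:
        return "inconclusive"  # the judge flaked; don't count it against the test
    if judgement.overall >= threshold:
        return "pass"
    return "fail"              # judgement.issues carries the critique explaining why
```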

This means Claude Code can now run the full loop: generate a video, judge it, read the critique, adjust the code, regenerate, compare scores, ship if better. No human watching required for the iterative improvement steps.

The practical bits

Videos under 18MB get sent inline. Larger ones stream to a temp file first — Gemini's inline limit is 20MB, and most of our clips are 5-15MB so they usually fit. The judge returns cost and latency alongside the scores, so you can track the overhead.
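Assuming the google-genai Python SDK, the size-based routing could look roughly like this. The 18MB constant matches the cutoff above; the helper name and download logic are illustrative.

```python
import shutil
import tempfile
import urllib.request

from google import genai
from google.genai import types

INLINE_LIMIT_BYTES = 18 * 1024 * 1024  # stay comfortably under Gemini's ~20MB inline cap

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment


def load_video_part(url: str):
    """Fetch the generated clip and pick inline bytes vs. the Files API by size (sketch)."""
    with urllib.request.urlopen(url) as resp:
        length = int(resp.headers.get("Content-Length", 0))
        if 0 < length <= INLINE_LIMIT_BYTES:
            # Small clip: hold it in memory and inline it in the request.
            return types.Part.from_bytes(data=resp.read(), mime_type="video/mp4")
        # Large (or unknown-size) clip: stream to a temp file, then upload.
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
            shutil.copyfileobj(resp, tmp)
            path = tmp.name
    # A real implementation would also poll until the uploaded file finishes processing.
    return client.files.upload(file=path)
```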

Failures are non-destructive. If Gemini's video understanding returns garbage (it happens: the model sometimes returns invalid JSON or refuses to score), the judge returns the result with an error field set. The test runner treats it as inconclusive, not failed. No false negatives from judge flakiness.
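A sketch of that non-destructive path, again assuming the google-genai SDK; `parse_judgement` is a hypothetical helper that maps the raw JSON onto the judgement type above.

```python
import json


def judge_video(video_part, rubric_prompt: str) -> VideoJudgement:
    """Ask Gemini for a critique and parse it; never let judge flakiness raise (sketch)."""
    try:
        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=[video_part, rubric_prompt],
        )
        raw = json.loads(response.text)  # raises if the model returned non-JSON
        return parse_judgement(raw)      # hypothetical helper: raw dict -> VideoJudgement
    except Exception as exc:
        # Judge failure is not a video failure: hand back a result with `error` set
        # so the runner can mark the scenario inconclusive rather than failed.
        empty = CriterionScore(0.0, "")
        return VideoJudgement(
            motion_quality=empty, temporal_coherence=empty, subject_consistency=empty,
            prompt_fidelity=empty, visual_quality=empty,
            error=f"{type(exc).__name__}: {exc}",
        )
```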

Thresholds are per-test. Smoke tests use 0.40 — "the video isn't broken." Ship gates use 0.85 — "the video is genuinely good." The rubric is customizable too — if you only care about motion quality for a particular test, you can drop the other criteria.
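Using the gate shape sketched earlier, the difference between tests is just the numbers and the criteria list (the thresholds 0.40 and 0.85 are the ones above; the motion-only value is invented for illustration).

```python
# Per-test gates: same shape, different numbers.
smoke_gate = {"threshold": 0.40}                                   # "the video isn't broken"
ship_gate = {"threshold": 0.85}                                    # "the video is genuinely good"
motion_only = {"threshold": 0.75, "criteria": ["motion_quality"]}  # trimmed rubric
```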

What this doesn't solve

The judge can't evaluate creative quality — whether the hook is compelling, whether the pacing feels right for TikTok, whether the concept resonates. Those are human judgments. What it can evaluate is technical execution: does the video look correct, move smoothly, and match what was requested?

That's enough to close the dev loop. Creative judgment stays human. Technical iteration gets automated.