Pixel truth meets semantic reasoning

Single-call video decomposition was brittle. Splitting the job between ffmpeg and Gemini made it reliable.

One of the core capabilities Naomi needs is the ability to watch a TikTok video and understand what makes it work. Not just "this is a motivational video" — but shot by shot: what the camera does, how the edit lands, where the text overlay appears, what the hook structure is, and how to recreate it with her own character.

We tried doing this in a single Gemini call. Send the full video, ask for a structured breakdown. It worked maybe 60% of the time. The other 40%, it confused timestamps, merged adjacent shots, hallucinated cuts that didn't exist, or missed soft transitions entirely.

The problem: asking one model to handle both the geometry of video (where do cuts happen?) and the semantics (why does this edit work?) in a single pass is asking it to be both a machine and a critic simultaneously.

Splitting the job

The new pipeline has four stages, each using the right tool for the right job.

Stage 1 is mechanical. ffmpeg runs scene detection with a pixel-difference threshold. No AI involved — just math. If the pixel content changes dramatically between frames, that's a cut. This gives us frame-accurate cut timestamps. It can't tell you why a cut exists, but it can tell you where every cut is, and it won't hallucinate a cut that isn't there.
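For the mechanical stage, ffmpeg's scene filter already does the whole job. A minimal sketch of what that could look like from Python is below; the 0.3 threshold and file name are illustrative placeholders, not the pipeline's actual settings.

```python
import re
import subprocess

def detect_cuts(video_path: str, threshold: float = 0.3) -> list[float]:
    """Run ffmpeg scene detection and return cut timestamps in seconds."""
    # Select frames whose scene-change score exceeds the threshold, and let
    # showinfo print their presentation timestamps (ffmpeg logs to stderr).
    vf = f"select='gt(scene,{threshold})',showinfo"
    result = subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", vf, "-f", "null", "-"],
        capture_output=True, text=True,
    )
    # showinfo lines look like: "... pts_time:4.8048 ..."
    return [float(m) for m in re.findall(r"pts_time:([\d.]+)", result.stderr)]

cuts = detect_cuts("tiktok.mp4")
print(cuts)  # frame-accurate hard-cut candidates, e.g. [1.23, 4.80, 7.41]
```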

Stage 2 is coordination. One Gemini 2.5 Pro call that sees the full video plus ffmpeg's cut list. Its job: decide which ffmpeg boundaries are real editorial cuts versus false positives (camera shake, flash frames), and identify soft transitions that ffmpeg missed (fades, dissolves). This is where semantic reasoning starts — but it's grounded in ffmpeg's pixel-truth data, not making timestamps from scratch.
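The post doesn't spell out the coordinator's output contract, but a small sketch makes "grounded in pixel truth" concrete: the model only passes judgment on timestamps ffmpeg already produced. The dataclasses and `build_shot_boundaries` below are illustrative names, not the pipeline's real schema.

```python
from dataclasses import dataclass

@dataclass
class BoundaryVerdict:
    timestamp: float    # an ffmpeg cut candidate, echoed back unchanged
    is_real_cut: bool   # False for camera shake, flash frames, etc.

@dataclass
class SoftTransition:
    start: float        # a fade or dissolve the pixel diff never saw
    end: float

def build_shot_boundaries(
    duration: float,
    verdicts: list[BoundaryVerdict],
    soft_transitions: list[SoftTransition],
) -> list[tuple[float, float]]:
    """Merge coordinator verdicts back into a frame-accurate shot list."""
    # Keep only boundaries the coordinator confirmed; every hard-cut timestamp
    # still comes from ffmpeg, so the model never invents one from scratch.
    cuts = sorted(
        [v.timestamp for v in verdicts if v.is_real_cut]
        + [t.end for t in soft_transitions]
    )
    edges = [0.0] + cuts + [duration]
    return list(zip(edges, edges[1:]))  # [(start, end), ...] one per shot
```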

Stage 3 is the workers. One Gemini 2.5 Flash call per shot, running in parallel with a concurrency limit of 6. Each worker sees a single clip — not the whole video, just its shot — and outputs structured analysis: subject, action, camera movement, composition, lighting, mood, text overlay. No cross-shot context, no timestamps. One clip in, one JSON out. This eliminates the timestamp confusion problem entirely, because each worker only knows about its own shot.
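One way to express the fan-out is an asyncio semaphore capped at 6, with each worker trapping its own failure so one bad shot can't sink the batch. This is a minimal sketch; `analyze_shot` here is a stub standing in for the real Gemini 2.5 Flash request.

```python
import asyncio

MAX_CONCURRENCY = 6

async def analyze_shot(clip_path: str) -> dict:
    # Stand-in for the real Gemini 2.5 Flash request: upload one clip, get
    # back subject, action, camera, composition, lighting, mood, text overlay.
    return {"subject": "...", "action": "...", "camera": "...", "text_overlay": None}

async def analyze_all_shots(clip_paths: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(path: str) -> dict:
        async with sem:                            # at most 6 calls in flight
            try:
                return await analyze_shot(path)    # one clip in, one JSON out
            except Exception:
                return {"analysis_failed": True}   # keep the rest of the batch

    # gather preserves input order, so results line up with shot order
    return await asyncio.gather(*(worker(p) for p in clip_paths))

shots = asyncio.run(analyze_all_shots(["shot_01.mp4", "shot_02.mp4"]))
```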

Stage 4 is synthesis. One Gemini 2.5 Pro call that sees all the worker outputs (text only — no video input, so it's cheap). It reasons about the narrative arc: hook, build, payoff. It identifies the visual style. And it writes a recreation brief — the theme, the creative idea, why it works, a caption, and a storyboard for recreating it with different content.
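Because Stage 4 takes text only, its input can be as simple as the worker JSONs serialized into a single prompt. A rough sketch under that assumption; the prompt wording and field list are illustrative, not the pipeline's actual prompt.

```python
import json

def build_synthesis_prompt(shot_analyses: list[dict]) -> str:
    """Text-only input for the Pro synthesis call: worker JSON, no video."""
    shots = "\n".join(
        f"Shot {i + 1}: {json.dumps(a)}" for i, a in enumerate(shot_analyses)
    )
    return (
        "Here is a per-shot breakdown of a short-form video.\n\n"
        f"{shots}\n\n"
        "Describe the narrative arc (hook, build, payoff) and the visual style, "
        "then write a recreation brief: theme, creative idea, why it works, "
        "a caption, and a storyboard for recreating it with different content."
    )
```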

The economics

Pro for coordination and synthesis — where reasoning quality matters. Flash for workers — where cost matters because there are N of them. A typical 30-second TikTok with 8 shots costs about $0.17 total: $0.07 for the coordinator, $0.08 for the workers, $0.02 for the synthesizer.
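As a back-of-envelope check using the post's own example numbers (the $0.08 worker total across 8 shots works out to roughly $0.01 per Flash call), the cost scales linearly with shot count. The figures below are just those examples, not published API pricing.

```python
def estimate_cost(
    num_shots: int,
    coordinator_usd: float = 0.07,   # one Pro call with video + cut list
    per_shot_usd: float = 0.01,      # one Flash call per shot
    synthesizer_usd: float = 0.02,   # one text-only Pro call
) -> float:
    """Fixed Pro overhead plus a Flash call per shot."""
    return coordinator_usd + num_shots * per_shot_usd + synthesizer_usd

print(round(estimate_cost(8), 2))  # 0.17 for the 8-shot example above
```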

Compare that to a single Pro call that might cost $0.10 but fails 40% of the time. Reliability at a similar price point.

Soft failures everywhere

Every stage is designed to fail gracefully. If the coordinator fails, we fall back to raw ffmpeg cuts. If a worker fails, we keep the rest and mark that shot as analysis_failed. If the synthesizer fails, we still have the per-shot analysis — just no narrative summary.
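Put together, the graceful degradation is little more than a chain of fallbacks, each stage handing the previous stage's output downstream. The helper names here (`coordinate`, `analyze_shots`, `synthesize`, plus the `detect_cuts` sketch from earlier) are stand-ins, not the pipeline's actual functions.

```python
def decompose(video_path: str) -> dict:
    cuts = detect_cuts(video_path)                 # Stage 1: always runs

    try:
        shot_list = coordinate(video_path, cuts)   # Stage 2: Pro coordinator
    except Exception:
        shot_list = cuts                           # fall back to raw ffmpeg cuts

    result = {"shots": analyze_shots(shot_list)}   # Stage 3 marks its own failures
                                                   # as analysis_failed per shot
    try:
        result["summary"] = synthesize(result["shots"])  # Stage 4: Pro synthesis
    except Exception:
        pass                                       # the per-shot analysis survives
    return result
```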

The lesson: don't ask one tool to be good at everything. Ask each tool to be good at one thing, and connect them with clean handoffs.