What Is Frame Drift?
When you use a diffusion model to stylize video, each frame is processed independently. The model has no memory of what the previous frame looked like. It only sees the current frame, your prompt, and whatever conditioning inputs you provide.
This means that even with an identical prompt and seed, subtle changes in the source frame — different lighting, a slight pose shift, a different background element — produce visibly different outputs. Over the length of a clip, these differences accumulate. Characters look like different people between cuts. Colors drift. Scene geometry contradicts itself.
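This sensitivity can be illustrated without a real diffusion model. The sketch below is a toy stand-in, not an actual diffusion pass: a seeded noise draw plays the role of the fixed prompt-and-seed, and a nonlinear iterative update plays the role of the denoising loop. Everything here (the `stylize` function, the logistic-map update, the constants) is illustrative, but it shows the mechanism: with prompt and seed pinned, a 0.005 difference in the input frame still produces clearly different outputs.

```python
import random

def stylize(pixel: float, seed: int, steps: int = 12) -> float:
    # Same seed => the exact same "noise schedule" for every frame,
    # mimicking an identical prompt and seed across the clip.
    rng = random.Random(seed)
    value = min(max(pixel + rng.gauss(0.0, 0.01), 0.001), 0.999)
    for _ in range(steps):
        # Nonlinear update standing in for iterative denoising;
        # it amplifies small differences in the source frame.
        value = 3.9 * value * (1.0 - value)
    return value

frame_a = 0.500   # a pixel from frame N
frame_b = 0.505   # the "same" pixel one frame later, slightly shifted
out_a = stylize(frame_a, seed=42)
out_b = stylize(frame_b, seed=42)
print(out_a, out_b)   # same seed, nearly identical input, different outputs
```

The function is fully deterministic for a fixed seed, and yet the two outputs disagree: the divergence comes entirely from the tiny change in the source frame, which is exactly the accumulation mechanism described above.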
Why Prompts Alone Don't Solve It
A common first attempt is to add more specificity to the prompt: describe the character, the lighting, the exact shade of the palette. This helps slightly, but it doesn't address the fundamental problem. Diffusion models are stochastic — they sample from a probability distribution. Two samples from the same distribution aren't identical.
The more you push stylization strength, the more latitude the model takes with your input — and the more drift you see.
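Both claims in this section can be seen in a few lines of standard-library Python. The blend model below (`output = (1 - strength) * source + strength * sample`) is a deliberate simplification of stylization strength, not how any particular model implements it, but it captures the relationship: two draws from the same distribution differ, and the higher the strength, the more the model's own sampling dominates the source frame, so the wider the spread of possible outputs.

```python
import random
import statistics

# Point one: two samples from the same distribution aren't identical.
rng = random.Random(1)
draw_1 = rng.gauss(0.5, 0.2)
draw_2 = rng.gauss(0.5, 0.2)
print(draw_1 == draw_2)  # False

# Point two: higher stylization strength => more latitude for the model.
def stylized(source: float, strength: float, rng: random.Random) -> float:
    sample = rng.gauss(0.5, 0.2)  # a fresh draw from the model's distribution
    return (1.0 - strength) * source + strength * sample

rng = random.Random(0)
source = 0.5
low_strength  = [stylized(source, 0.3, rng) for _ in range(1000)]
high_strength = [stylized(source, 0.9, rng) for _ in range(1000)]

# The spread of outputs grows with strength — more drift per frame.
print(statistics.stdev(low_strength) < statistics.stdev(high_strength))  # True
```

No amount of prompt detail changes this: the prompt shapes the distribution, but every frame is still an independent draw from it.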
The Three Anchors That Fix It
Solving drift requires conditioning the model on something more stable than text. ShotLock uses three mechanisms:
- Style Packs encode the visual language of your project as IP-Adapter and ControlNet conditioning — the same conditioning applied to every frame.
- Character Cards use reference photos of your subjects as IP-Adapter conditioning anchors. The model is pulled toward the identity in those photos on every frame.
- Scene Cards anchor the environment — lighting direction, atmosphere, spatial feel — so backgrounds don't reinvent themselves between cuts.
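The architecture these three anchors imply can be sketched in plain Python. ShotLock's real API is not shown in this document, so every name below (`Conditioning`, `stylize_frame`, the field names) is hypothetical; the point is structural: the anchors are built once per project, outside the frame loop, and the exact same conditioning is passed to every frame.

```python
from dataclasses import dataclass

# Hypothetical types — stand-ins for real embeddings, not ShotLock's API.
@dataclass(frozen=True)
class Conditioning:
    style_pack: str      # stand-in for IP-Adapter/ControlNet style conditioning
    character_card: str  # stand-in for identity reference embeddings
    scene_card: str      # stand-in for environment/lighting anchors

def stylize_frame(frame: str, cond: Conditioning) -> tuple:
    # A real implementation would run the diffusion pass here,
    # conditioned on the same three anchors for every frame.
    return (frame, cond)

# Anchors are computed ONCE per project...
cond = Conditioning("noir-pack", "lead-actor-refs", "alley-night")

# ...then reused unchanged across the whole clip.
frames = ["frame_000", "frame_001", "frame_002"]
outputs = [stylize_frame(f, cond) for f in frames]

# Every frame saw the identical conditioning object — nothing to drift toward.
print(all(c is cond for _, c in outputs))  # True
```

The design choice worth noting is that the conditioning lives outside the per-frame loop: the stable signal the model is pulled toward never varies, so frame-to-frame variation in the source can no longer compound into drift.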
What This Looks Like in Practice
With drift: a 5-second clip at 1 fps produces 5 frames where the lead character has a slightly different face shape, a different jacket color, and a background that shifts between takes. Cut that into a sequence and it looks like a slideshow of different people.
With ShotLock: the same clip produces 5 frames where the character looks like the same person, the palette is coherent, and the background maintains its spatial logic. The output looks like a sequence — because it was anchored like one.