How IP-Adapter Powers Character Consistency in AI Video

IP-Adapter lets you condition Stable Diffusion output on reference images rather than text alone. This is what makes character identity lock viable across an entire stylized clip.

The Problem with Text-Only Character Description

Describing a character in a prompt — "a woman with dark hair, wearing a leather jacket, blue eyes" — gives the model a constraint, but not a tight one. Countless faces in the model's training distribution match that description, and it will sample among them unpredictably from frame to frame.

What IP-Adapter Does

IP-Adapter (Image Prompt Adapter) conditions the diffusion process on reference images directly, bypassing the loose coupling of text descriptions. It works by extracting image embeddings from your reference photos and using them as additional context during the denoising process.
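Conceptually, the mechanism looks something like the sketch below: reference photos are encoded into embedding tokens, which are appended to the text tokens so the denoiser's cross-attention can attend to both. This is an illustrative toy, not IP-Adapter's actual code; the encoder stand-in, dimensions, and function names are all assumptions.

```python
import numpy as np

def encode_image(image: np.ndarray, dim: int = 768) -> np.ndarray:
    """Stand-in for a CLIP-style image encoder: one embedding token per image.
    (A real encoder is a neural network; this just returns a deterministic
    pseudo-random vector so the sketch is runnable.)"""
    rng = np.random.default_rng(abs(int(image.sum())) % (2**32))
    return rng.standard_normal((1, dim))

def build_conditioning(text_tokens: np.ndarray,
                       reference_images: list,
                       ip_weight: float = 0.8) -> np.ndarray:
    """Append weighted image tokens to the text conditioning sequence."""
    image_tokens = np.concatenate([encode_image(img) for img in reference_images])
    return np.concatenate([text_tokens, ip_weight * image_tokens], axis=0)

text = np.zeros((77, 768))   # typical CLIP text sequence length
refs = [np.ones((224, 224, 3)), np.full((224, 224, 3), 0.5)]
cond = build_conditioning(text, refs)
print(cond.shape)            # (79, 768): 77 text tokens + 2 image tokens
```

The key idea is that the image tokens sit alongside the text tokens as first-class conditioning, so the denoiser is steered by the reference pixels themselves, not a lossy textual paraphrase of them.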

The result: the model outputs images that look more like the reference, not just images that match a textual description of the reference. The face is the face. The jacket is the jacket.

How ShotLock Wires This Up

ShotLock's Character Cards store 2–5 reference photos per character. When you submit a render job, the WorkflowBuilder injects one LoadImage node per reference photo, batches them via ImageBatch nodes, and connects the batch to the IPAdapterApply node in the ComfyUI workflow.

Every frame of your clip is conditioned on all character references simultaneously. The IP-Adapter weight (set via the Style Pack) controls how strongly the reference anchors the output — higher weights mean tighter identity lock, lower weights give the style more room to breathe.
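A minimal sketch of that injection step might look like the following. The node class names (LoadImage, ImageBatch, IPAdapterApply) follow the article; the exact input keys ("image1", "weight", and so on), node IDs, and function name are assumptions, not ShotLock's real WorkflowBuilder schema.

```python
import json

def inject_character_refs(workflow: dict, photos: list,
                          ip_weight: float, next_id: int = 100) -> dict:
    # One LoadImage node per reference photo.
    load_ids = []
    for photo in photos:
        workflow[str(next_id)] = {"class_type": "LoadImage",
                                  "inputs": {"image": photo}}
        load_ids.append(str(next_id))
        next_id += 1

    # Chain ImageBatch nodes so all references merge into a single batch.
    batch_id = load_ids[0]
    for other in load_ids[1:]:
        workflow[str(next_id)] = {"class_type": "ImageBatch",
                                  "inputs": {"image1": [batch_id, 0],
                                             "image2": [other, 0]}}
        batch_id = str(next_id)
        next_id += 1

    # Feed the batch into IPAdapterApply with the Style Pack weight.
    workflow["ip_apply"] = {"class_type": "IPAdapterApply",
                           "inputs": {"image": [batch_id, 0],
                                      "weight": ip_weight}}
    return workflow

wf = inject_character_refs({}, ["ref_a.png", "ref_b.png", "ref_c.png"], 0.85)
print(json.dumps(wf, indent=2))
```

Chaining pairwise ImageBatch nodes is one straightforward way to express an N-image batch in a graph whose batch node takes two inputs; the last node in the chain is what IPAdapterApply consumes.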

Multi-Reference Batching

Using multiple reference images improves stability. A single reference from one angle might cause the model to over-index on that angle. Multiple references from different angles, lighting conditions, or expressions give the IP-Adapter a more robust representation of the character's identity.
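The intuition can be shown with a toy experiment: treat each photo as a noisy "view" of the same underlying identity embedding, and pooling several views cancels view-specific noise. This is purely illustrative of the statistics involved, not IP-Adapter's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
identity = rng.standard_normal(768)   # the "true" character embedding

def view(noise_scale: float = 0.6) -> np.ndarray:
    """One reference photo = identity plus angle/lighting-specific noise."""
    return identity + noise_scale * rng.standard_normal(768)

single = view()                                   # one reference photo
pooled = np.mean([view() for _ in range(5)], axis=0)  # five references

err_single = np.linalg.norm(single - identity)
err_pooled = np.linalg.norm(pooled - identity)
print(err_pooled < err_single)   # pooling lands closer to the true identity
```

Averaging five independent views shrinks the noise by roughly a factor of √5, which is the same reason a diverse reference set anchors identity more tightly than any single photo.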