This is a model study. I wanted to make fan art for one of my favorite bands, Cartographer, and I had a specific aesthetic in mind: true handmade stop-motion claymation, not CGI, not smooth animation. Think Wallace and Gromit energy with a punk rock set.
The model I used for this is wan-video/wan-2.7-i2v, an image-to-video model that also accepts an audio clip as input. That last part matters: it means the animation can respond to the actual music, not just a generic motion prompt. You give it a reference frame and a track, and it generates motion around both.
Start with a clean reference
Before touching the video model, I generated a clean reference image of the band using GPT Image 1.5: plain grey background, neutral lighting, just the characters. I always start here. A good clean reference gives you a locked-in visual identity to pull from (the faces, proportions, clothing, style), and you want all of that settled before you start throwing the characters into complex environments or motion.
Once the reference is solid, you can start imagining the actual shots. I generated a few tighter, music-video-style images to think through what individual angles could look like before writing any prompts for video generation.
Here is the honest part: the full stop-motion claymation look I was after never quite landed. The model kept trending toward smooth, polished motion rather than the choppy physical clay feel. The reference images turned out great. The animation itself is more claymation-adjacent than true stop-motion. Still looks cool, just not the Wallace and Gromit thing.
The prompts
These are the starting points, not the exact final prompts. I ran a lot of iterations from here. Two versions: a tight close-up on the singer and a wide band shot. Both share the same claymation instruction block I kept consistent throughout.
Close-up hero shot
Tight close-up hero shot of the female singer singing straight into camera, rocking out with subtle head banging, aggressive performance energy, hot pink microphone, gritty industrial warehouse, flickering green overhead light, wispy haze, occasional lens flare, fast handheld 90s punk rock music video energy, no cuts, one continuous shot.
Very important: this must look like true handmade stop-motion claymation, not live action and not smooth CGI. Tactile clay surface, visible sculpted imperfections, slight fingerprints, tiny dents, handmade hair strands, replacement-animation mouth shapes, miniature set feel, practical stop-motion lighting, subtle frame stepping, slight stop-motion jitter, staccato motion, choppy handmade animation cadence, camera movement feels handheld but also shot as stop-motion, high shutter feel, punchy gritty texture.
Wide band shot
Wide full-band performance shot, high energy rock band playing hard in a dark industrial warehouse, female singer front and center aggressively singing into a hot pink microphone, guitarist and bassist thrashing on either side, drummer pounding hard in the back, gritty 90s punk music video energy, fast handheld camera, slight push-ins and lateral sway, no cuts, one continuous shot, flickering neon green fluorescent light overhead, wet reflective floor, wispy haze, occasional lens flare, chaotic but readable stage blocking.
Very important: preserve true handmade stop-motion claymation in every frame. This must look like physical clay puppets on a miniature set, not live action and not polished CGI. Tactile clay texture, visible sculpted imperfections, slight fingerprints, tiny dents, handmade hair, replacement-animation mouth shapes, practical miniature lighting, subtle stop-motion jitter, choppy frame-by-frame motion, staccato animation cadence, high shutter feel, crunchy gritty texture. The whole band is rocking out hard with exaggerated stop-motion performance, head banging, stomping, leaning into instruments, energetic but still clearly claymation.
Negative prompt (applied to both)
No live-action realism, no smooth cinematic motion, no fluid skin deformation, no glossy 3D animation, no Pixar look, no rubbery motion interpolation, no overly polished CGI, no realistic human skin, no clean digital hair simulation, no stabilized camera, no slow dreamy motion.
Even with all of that, the model kept drifting away from the true Wallace and Gromit feel I was after. So I appended one more instruction to every prompt:
Animate as practical frame-by-frame stop-motion on a handmade miniature set at 10 to 12 fps, keeping the imperfect claymation look dominant over realism.
It still does not fully respect this. But the results are good enough that I kept going.
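Since every prompt here is built from the same pieces (a shot description, the shared claymation block, the appended frame-rate instruction, and one negative prompt), a small helper can keep them consistent across iterations. This is only a sketch of that idea: the `Shot` class, the function names, and the `prompt` / `negative_prompt` field names are assumptions made for illustration, not the model's documented input schema, and the prompt strings are shortened stand-ins for the full text above.

```python
from dataclasses import dataclass

# The final instruction appended to every prompt (quoted from the text above).
STYLE_SUFFIX = (
    "Animate as practical frame-by-frame stop-motion on a handmade miniature "
    "set at 10 to 12 fps, keeping the imperfect claymation look dominant over "
    "realism."
)

# Shortened for space; in practice this would hold the full negative prompt.
NEGATIVE_PROMPT = "No live-action realism, no smooth cinematic motion, no Pixar look."


@dataclass(frozen=True)
class Shot:
    name: str
    description: str  # shot-specific paragraph (framing, action, setting)
    style_block: str  # the "Very important: ..." claymation paragraph


def build_prompt(shot: Shot, suffix: str = STYLE_SUFFIX) -> str:
    """Join shot description, claymation block, and suffix in a fixed order."""
    parts = [shot.description, shot.style_block, suffix]
    return "\n\n".join(p.strip() for p in parts if p.strip())


def build_request(shot: Shot, negative: str = NEGATIVE_PROMPT) -> dict:
    # Hypothetical field names; check the model's input schema before use.
    return {"prompt": build_prompt(shot), "negative_prompt": negative}


close_up = Shot(
    name="close_up",
    description="Tight close-up hero shot of the female singer singing straight into camera.",
    style_block="Very important: this must look like true handmade stop-motion claymation.",
)
request = build_request(close_up)
print(request["prompt"])
```

Keeping the suffix and negative prompt in one place means a wording change reaches both the close-up and the wide shot on the next run, so differences between takes come from the shot description rather than from drift in the shared instructions.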
What worked and what did not
The character designs and the visual world came out well. The clay texture, the gritty warehouse, the punk energy. That part translated. The animation motion itself was the problem. Even with staccato and frame-stepping instructions in every prompt, the model defaulted to something too fluid. The stop-motion jitter mostly was not there.
The audio-to-motion connection was interesting. When I fed in an actual Cartographer track, the singer's movement had a loosely synced quality to the rhythm. Not perfectly timed, but clearly informed by it. More testing needed there.
Overall I think this model is great for narrow use cases. The problems were prompt adherence, reproducibility (even from identical reference start frames), and general instruction following. The biggest annoyance was that it would not respect my "no cuts" rule and kept trying to create shots I did not direct.
Iteration and editing are the real work
The single biggest thing that separates AI video that feels intentional from AI video that feels like slop is not the model. It is how much you work it. Generate a lot. Expect most of it to be unusable. Pick the best takes from across many runs and cut them together the same way you would cut footage you actually shot.
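The generate-a-lot, keep-a-little workflow is easier to manage with some bookkeeping. The sketch below plans a batch of takes per shot, records a hand-assigned rating for each, and pulls out the best few per shot for the edit. Everything in it is illustrative: the `Take` class, the 1 to 5 rating scale, and the thresholds are assumptions, and the ratings in the example are made-up values, not results from this project.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Take:
    shot: str
    index: int
    rating: Optional[int] = None  # 1-5, set by hand after watching; None = unreviewed


def plan_takes(shots: list[str], takes_per_shot: int) -> list[Take]:
    """Create one empty record per generation run."""
    return [Take(shot, i) for shot in shots for i in range(takes_per_shot)]


def select_best(takes: list[Take], per_shot: int = 2, min_rating: int = 4) -> dict:
    """Keep the highest-rated reviewed takes for each shot, dropping weak ones."""
    by_shot = defaultdict(list)
    for take in takes:
        if take.rating is not None and take.rating >= min_rating:
            by_shot[take.shot].append(take)
    return {
        shot: sorted(group, key=lambda t: t.rating, reverse=True)[:per_shot]
        for shot, group in by_shot.items()
    }


takes = plan_takes(["close_up", "wide"], takes_per_shot=4)
# Made-up ratings for illustration only.
for take, rating in zip(takes, [2, 5, 4, 1, 3, 2, 4, 2]):
    take.rating = rating

keepers = select_best(takes, per_shot=2, min_rating=4)
for shot, group in keepers.items():
    print(shot, [t.index for t in group])
```

The point of the threshold is that most runs fall below it; the edit is assembled only from what survives.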
And then treat it like footage. Color grade it. Add effects. Push the look. The raw output from these models is a starting point, not a finished product. The difference between a clip that looks cheap and one that holds up is usually what you did to it after generation. A lot of people stop short of that step.
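As one example of a post effect that could push smooth output toward the 10 to 12 fps cadence the prompts ask for, frames can be held so the picture only changes at the lower rate while playback stays at the original rate. This is an illustration of the general technique, not a description of what the final edit used, and the 24 fps source rate is an assumption. The hypothetical `held_frame_indices` function returns, for each output frame, the index of the source frame to show.

```python
def held_frame_indices(n_frames: int, source_fps: int, target_fps: int) -> list[int]:
    """Map each output frame to the source frame that should be displayed.

    The clip still plays at source_fps, but the image only updates
    target_fps times per second, which gives a stepped, held-frame cadence.
    """
    if not 0 < target_fps <= source_fps:
        raise ValueError("target_fps must be positive and no higher than source_fps")
    indices = []
    for i in range(n_frames):
        # Which low-rate step this output frame falls into.
        step = i * target_fps // source_fps
        # First source frame of that step (ceiling division), held until the next step.
        first = -(-step * source_fps // target_fps)
        indices.append(first)
    return indices


# Assumed 24 fps source, stepped down to a 12 fps look ("on twos").
print(held_frame_indices(8, 24, 12))  # [0, 0, 2, 2, 4, 4, 6, 6]
```

The resulting index list can drive whatever editor or script duplicates the frames. Holding frames only changes the cadence; it does not add the small positional jitter of real stop-motion.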
Here is the final edit on Instagram.