A flexible framework to bridge video understanding, generation and editing
We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
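One way to read the mixed cross-attention conditioning described above is as a single attention call whose context concatenates the expanded text tokens with tokens derived from the source-video latents. The NumPy sketch below illustrates that reading only. The function name `mixed_cross_attention`, the token shapes, and the omission of learned query/key/value projections are all assumptions made for illustration, not details of the actual model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixed_cross_attention(queries, text_tokens, source_tokens=None):
    """Hypothetical sketch: attend from video tokens to a mixed context.

    queries:       (Nq, d) tokens of the video being denoised
    text_tokens:   (Nt, d) embedded caption (e.g. the VLM-expanded one)
    source_tokens: (Ns, d) tokens from source-video latents, or None
                   for plain text-to-video generation

    Learned projections are omitted for brevity (an assumption).
    """
    if source_tokens is None:
        context = text_tokens
    else:
        # Mix conditions by concatenating along the sequence axis.
        context = np.concatenate([text_tokens, source_tokens], axis=0)

    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (Nq, Nt + Ns)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(6, 8))
    text = rng.normal(size=(4, 8))
    src = rng.normal(size=(5, 8))
    out, w = mixed_cross_attention(q, text, src)
    print(out.shape, w.shape)
```

Under this reading, passing `source_tokens=None` corresponds to text-to-video, while supplying source tokens corresponds to editing or mixed-condition generation, which is why one pipeline can cover all three modes.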
These examples showcase the model's ability to handle complex editing scenarios involving high motion, intricate local changes, and multi-element transformations while preserving temporal coherence.
Multi-element transformations combining appearance, lighting, and environmental changes
Change the man's black jacket to a long, tattered gray overcoat with frayed edges and a high collar, and replace the green wall with a peeling, faded blue wallpaper covered in faded, handwritten notes and faded red symbols.
Change the man's black jacket to a long, tattered gray coat lined with faintly glowing thread, and replace the cold blue light pulse with a sudden burst of warm amber light that illuminates floating golden dust and glowing geometric symbols instead of ash and sigils.
Change the woman's black workout attire to a vibrant crimson sports bra and matching high-waisted leggings, and replace the white towel with a long, flowing silk scarf in the same crimson hue that billows dramatically with each motion.
Change the red fox to a small, silver-furred Arctic fox with a frost-tipped tail, and replace the sun-dappled forest with a quiet, snow-laden boreal woodland at twilight, where the golden light becomes a cool blue moonlight and the firefly-like motes become glowing ice crystals that shimmer and drift in slow, silent spirals.
Change the woman's flowing black coat to a tailored, knee-length camel wool trench coat with a crisp white collar, and replace the autumn leaves and loose papers swirling around their ankles with falling cherry blossom petals drifting lazily in the breeze.
Change the man's orange prison jumpsuit to a tattered gray hospital gown, and replace the flickering greenish fluorescent bulb with a single, flickering red emergency light that casts a blood-red glow across the corridor, making the shadows appear as if stained with dried blood.
Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion
Change the woman's green jacket to a deep crimson cloak that billows dramatically with each step.
Change the character's armored suit from red-and-black to matte charcoal gray with glowing cyan circuitry accents.
Change the woman's white shirt to a blood-red silk blouse that clings to her form.
Change the woman's pink shirt to a deep crimson robe with gold embroidery.
Change the man's sharp dark suit and tie to a weathered, oil-stained mechanic's jumpsuit.
Change the woman's black top to a flowing, blood-red silk gown that clings and billows with her motion.
Precise object-level modifications while preserving surrounding context and motion
Change the real raccoon to a stuffed raccoon.
Change the firefighter's pizza to a steaming cup of coffee.
Change the character's light brown fur to deep obsidian-black fur with swirling icy blue ethereal mist.
Change the golden retriever to a black Labrador.
Change the man in the tailored suit to a woman in a bloodstained white lace gown.
Change the golden retriever to a sleek black Border Collie with white-tipped paws.
Change the man's crisp blue suit to a tailored charcoal-gray wool coat with a high collar.
Change the butterfly to a violet butterfly.
Left side: source video | Right side: edited result.
Add a scarf around the first fox's neck.
Add a tiny pirate hat on the parrot's head.
Add a red headband to the player's forehead.
Add a tiny crown to the hummingbird's head.
Add glowing neon stripes to the motorcycle.
Add a bow tie around the rabbit's neck.
Add a straw hat on the first duck's head.
Add pink gloves to the bikers.
Add safety goggles to the scientist.
Remove the meditation cushion from the scene.
Remove the two cubs from the scene.
Remove the two lizards from the scene, but keep the runner and the desert dune under a blazing noon sun.
Remove the black cat from the scene.
Remove the bicycle from the scene.
Remove the large brush from the artist's hand.
Remove the two deer from the scene, but keep the painter and the lake under a crisp autumn sky.
Remove the sunglasses from the woman.
Change the fox into a badger.
Change the cat to white.
Change the color of the dress to gold.
Change the color of the boat from purple to orange.
Replace the blue cargo ship with a green ferry.
Replace the human photographer with a robotic photographer.
Change the woman's white kimono to a deep indigo silk kimono with faint silver thread embroidery.
Change the barber's straw hat to a polished silver fedora and replace the golden afternoon light with cool, overcast twilight.
Change the time of day from twilight to early dawn, transforming the color palette from fading sunset hues to rising sunrise tones.
Change the warm, low glow from the left side to a cool, blue moonlight coming from above and slightly behind.
Change the hyenas' fur color from sandy to a vibrant, neon-pink, and replace the blazing sunset with a cool, twilight purple and teal sky.
Change the woman's dark blazer and white shirt to a sleek, charcoal-gray tactical suit.
Change the man's dark jacket to a long, tattered overcoat made of aged leather.
Change the glowing red gemstone to a cold, crystalline blue orb that emits a faint, steady pulse of icy light.
Change the warm, low glow from the left side to a cool, blue moonlight coming from above.
Change the woman's brown jacket to a sleek, metallic silver trench coat.
Change the young man's light-colored shirt to a faded, weathered olive-green military vest.
Change the woman's black top to a flowing, ivory silk robe with gold-thread embroidery.
Change the woman's red dress to a flowing white gown with faint gold embroidery.
Change the young woman's white top to a dark navy-blue tactical vest with subtle reflective strips.
Change the woman's white veil to a deep crimson velvet cloak.
Comparison of diffusion- and flow-based video editing methods on the FiVE benchmark using FiVE-Acc metrics.
| Method | FiVE-YN | FiVE-MC | FiVE-∪ | FiVE-∩ | FiVE-Acc ↑ |
|---|---|---|---|---|---|
| TokenFlow | 19.36 | 35.51 | 36.68 | 18.18 | 27.43 |
| DMT* | 34.78 | 62.06 | 62.98 | 33.86 | 48.42 |
| VidToMe | 20.03 | 33.50 | 36.20 | 17.34 | 26.77 |
| AnyV2V | 30.62 | 45.42 | 48.96 | 27.09 | 38.02 |
| VideoGrain† | 30.50 | 43.97 | 44.30 | 30.17 | 37.23 |
| Wan-Edit | 41.41 | 52.53 | 55.72 | 38.22 | 46.97 |
| Omni-Video-2 (Ours) | 63.77 | 83.30 | 85.99 | 61.08 | 73.53 |
Table 1. Quantitative comparison on FiVE benchmark.
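The reported FiVE-Acc values in Table 1 appear to equal the mean of the four component scores (YN, MC, ∪, ∩) to within rounding. This is an observation about the numbers in the table, not a definition stated on this page. The short Python check below copies the rows verbatim and measures that agreement; the helper names are hypothetical.

```python
# Rows copied from Table 1: (YN, MC, union, intersection, reported Acc).
TABLE_1 = {
    "TokenFlow": (19.36, 35.51, 36.68, 18.18, 27.43),
    "DMT*": (34.78, 62.06, 62.98, 33.86, 48.42),
    "VidToMe": (20.03, 33.50, 36.20, 17.34, 26.77),
    "AnyV2V": (30.62, 45.42, 48.96, 27.09, 38.02),
    "VideoGrain†": (30.50, 43.97, 44.30, 30.17, 37.23),
    "Wan-Edit": (41.41, 52.53, 55.72, 38.22, 46.97),
    "Omni-Video-2 (Ours)": (63.77, 83.30, 85.99, 61.08, 73.53),
}


def mean_of_components(row):
    """Average the four component scores, ignoring the reported Acc."""
    yn, mc, union, inter, _reported = row
    return (yn + mc + union + inter) / 4.0


def max_deviation(table):
    """Largest absolute gap between the component mean and reported Acc."""
    return max(abs(mean_of_components(r) - r[4]) for r in table.values())


def rank_by_acc(table):
    """Method names sorted by reported FiVE-Acc, best first."""
    return sorted(table, key=lambda name: table[name][4], reverse=True)


if __name__ == "__main__":
    print(f"max deviation: {max_deviation(TABLE_1):.4f}")
    for name in rank_by_acc(TABLE_1):
        print(f"{name:22s} {TABLE_1[name][4]:6.2f}")
```

If the mean-of-four reading holds, every deviation stays below one unit in the last reported decimal place, which is what the check measures.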