Omni-Video 2

A flexible framework to bridge video understanding, generation and editing

1Fudan University  |  2Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author      Project Lead

Introduction

We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
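The mixed conditioning described above can be illustrated with a toy, pure-Python sketch. It is not the Omni-Video 2 implementation: the function names (`softmax`, `attention_weights`, `mixed_cross_attention`), the token shapes, and the omission of learned query/key/value projections are all assumptions made for illustration. It only shows the idea that video tokens attend to one context built from the expanded text semantics and, when editing, the source latents as well.

```python
import math
from typing import List, Optional

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Vector, context: List[Vector]) -> Vector:
    """Scaled dot-product attention weights of one query over the context."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in context]
    return softmax(scores)


def mixed_cross_attention(
    queries: List[Vector],
    text_tokens: List[Vector],
    source_latents: Optional[List[Vector]] = None,
) -> List[Vector]:
    """Attend from video tokens to text tokens plus optional source latents.

    Hypothetical simplification: keys and values are the raw context tokens
    (no learned projections), and all tokens share one feature dimension.
    """
    # Text-to-video: context is text only. Editing: append source latents.
    context = list(text_tokens)
    if source_latents is not None:
        context += list(source_latents)

    dim = len(context[0])
    outputs = []
    for query in queries:
        weights = attention_weights(query, context)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(weights, context)) for d in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    video_tokens = [[1.0, 0.0], [0.0, 1.0]]    # noisy video tokens (queries)
    caption_tokens = [[1.0, 0.0]]              # expanded caption from the VLM
    source_tokens = [[0.0, 1.0], [0.5, 0.5]]   # source-video VAE latents

    print(mixed_cross_attention(video_tokens, caption_tokens))                 # text-to-video
    print(mixed_cross_attention(video_tokens, caption_tokens, source_tokens))  # editing
```

Passing `source_latents=None` corresponds to plain text-to-video in this sketch, while supplying them corresponds to video-to-video editing, which mirrors how one pipeline can cover both modes.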

Figure 1. Framework of Omni-Video 2

Advanced Video Editing

Showcasing our model's ability to handle complex editing scenarios with high motion, intricate local changes, and multi-element transformations while preserving temporal coherence.

Complex Edit (6 videos)

Multi-element transformations combining appearance, lighting, and environmental changes

Source Edited

Change the man's black jacket to a long, tattered gray overcoat with frayed edges and a high collar, and replace the green wall with a peeling, faded blue wallpaper covered in faded, handwritten notes and faded red symbols.

Source Edited

Change the man's black jacket to a long, tattered gray coat lined with faintly glowing thread, and replace the cold blue light pulse with a sudden burst of warm amber light that illuminates floating golden dust and glowing geometric symbols instead of ash and sigils.

Source Edited

Change the woman's black workout attire to a vibrant crimson sports bra and matching high-waisted leggings, and replace the white towel with a long, flowing silk scarf in the same crimson hue that billows dramatically with each motion.

Source Edited

Change the red fox to a small, silver-furred Arctic fox with a frost-tipped tail, and replace the sun-dappled forest with a quiet, snow-laden boreal woodland at twilight, where the golden light becomes a cool blue moonlight and the firefly-like motes become glowing ice crystals that shimmer and drift in slow, silent spirals.

Source Edited

Change the woman's flowing black coat to a tailored, knee-length camel wool trench coat with a crisp white collar, and replace the autumn leaves and loose papers swirling around their ankles with falling cherry blossom petals drifting lazily in the breeze.

Source Edited

Change the man's orange prison jumpsuit to a tattered gray hospital gown, and replace the flickering greenish fluorescent bulb with a single, flickering red emergency light that casts a blood-red glow across the corridor, making the shadows appear as if stained with dried blood.

High Motion (6 videos)

Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion

Source Edited

Change the woman's green jacket to a deep crimson cloak that billows dramatically with each step.

Source Edited

Change the character's armored suit from red-and-black to matte charcoal gray with glowing cyan circuitry accents.

Source Edited

Change the woman's white shirt to a blood-red silk blouse that clings to her form.

Source Edited

Change the woman's pink shirt to a deep crimson robe with gold embroidery.

Source Edited

Change the man's sharp dark suit and tie to a weathered, oil-stained mechanic's jumpsuit.

Source Edited

Change the woman's black top to a flowing, blood-red silk gown that clings and billows with her motion.

Diverse Local Edit (8 videos)

Precise object-level modifications while preserving surrounding context and motion

Source Edited

Change the real raccoon to a stuffed raccoon.

Source Edited

Change the firefighter's pizza to a steaming cup of coffee.

Source Edited

Change the character's light brown fur to deep obsidian-black fur with swirling icy blue ethereal mist.

Source Edited

Change the golden retriever to a black Labrador.

Source Edited

Change the man in the tailored suit to a woman in a bloodstained white lace gown.

Source Edited

Change the golden retriever to a sleek black Border Collie with white-tipped paws.

Source Edited

Change the man's crisp blue suit to a tailored charcoal-gray wool coat with a high collar.

Source Edited

Change the butterfly to a violet butterfly.

Basic Video Editing

Left side: source video | Right side: edited result.

Source Edited

Add a scarf around the first fox's neck.

Source Edited

Add a tiny pirate hat on the parrot's head.

Source Edited

Add a red headband to the player's forehead.

Source Edited

Add a tiny crown to the hummingbird's head.

Source Edited

Add glowing neon stripes to the motorcycle.

Source Edited

Add a bow tie around the rabbit's neck.

Source Edited

Add a straw hat on the first duck's head.

Source Edited

Add pink gloves to the bikers.

Source Edited

Add safety goggles to the scientist.

Source Edited

Remove the meditation cushion from the scene.

Source Edited

Remove the two cubs from the scene.

Source Edited

Remove the two lizards from the scene, but keep the runner and the desert dune under a blazing noon sun.

Source Edited

Remove the black cat from the scene.

Source Edited

Remove the bicycle from the scene.

Source Edited

Remove the large brush from the artist's hand.

Source Edited

Remove the two deer from the scene, but keep the painter and the lake under a crisp autumn sky.

Source Edited

Remove the sunglasses from the woman.

Source Edited

Change the fox into a badger.

Source Edited

Change the cat to white.

Source Edited

Change the color of the dress to gold.

Source Edited

Change the color of the boat from purple to orange.

Source Edited

Replace the blue cargo ship with a green ferry.

Source Edited

Replace the human photographer with a robotic photographer.

Source Edited

Change the woman's white kimono to a deep indigo silk kimono with faint silver thread embroidery.

Source Edited

Change the barber's straw hat to a polished silver fedora and replace the golden afternoon light with cool, overcast twilight.

Source Edited

Change the time of day from twilight to early dawn, transforming the color palette from fading sunset hues to rising sunrise tones.

Source Edited

Change the warm, low glow from the left side to a cool, blue moonlight coming from above and slightly behind.

Source Edited

Change the hyenas' fur color from sandy to a vibrant, neon-pink, and replace the blazing sunset with a cool, twilight purple and teal sky.

Source Edited

Change the woman's dark blazer and white shirt to a sleek, charcoal-gray tactical suit.

Source Edited

Change the man's dark jacket to a long, tattered overcoat made of aged leather.

Source Edited

Change the glowing red gemstone to a cold, crystalline blue orb that emits a faint, steady pulse of icy light.

Source Edited

Change the warm, low glow from the left side to a cool, blue moonlight coming from above.

Source Edited

Change the woman's brown jacket to a sleek, metallic silver trench coat.

Source Edited

Change the young man's light-colored shirt to a faded, weathered olive-green military vest.

Source Edited

Change the woman's black top to a flowing, ivory silk robe with gold-thread embroidery.

Source Edited

Change the woman's red dress to a flowing white gown with faint gold embroidery.

Source Edited

Change the young woman's white top to a dark navy-blue tactical vest with subtle reflective strips.

Source Edited

Change the woman's white veil to a deep crimson velvet cloak.

Performance Comparison

Comparison of diffusion- and flow-based video editing methods on the FiVE benchmark using FiVE-Acc metrics.

Method                FiVE-YN   FiVE-MC   FiVE-∪   FiVE-∩   FiVE-Acc ↑
TokenFlow               19.36     35.51    36.68    18.18        27.43
DMT*                    34.78     62.06    62.98    33.86        48.42
VidToMe                 20.03     33.50    36.20    17.34        26.77
AnyV2V                  30.62     45.42    48.96    27.09        38.02
VideoGrain†             30.50     43.97    44.30    30.17        37.23
Wan-Edit                41.41     52.53    55.72    38.22        46.97
Omni-Video-2 (Ours)     63.77     83.30    85.99    61.08        73.53

Table 1. Quantitative comparison on FiVE benchmark.
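
As a reading aid for Table 1, the short script below checks an observation about the reported numbers: in every row, FiVE-Acc matches the arithmetic mean of FiVE-YN, FiVE-MC, FiVE-∪ and FiVE-∩ to within 0.01. This page does not define the metric, so treat the averaging rule as an assumption inferred from the table. The helper names are hypothetical.

```python
# Rows copied from Table 1:
# (FiVE-YN, FiVE-MC, FiVE-∪, FiVE-∩, reported FiVE-Acc)
TABLE_1 = {
    "TokenFlow": (19.36, 35.51, 36.68, 18.18, 27.43),
    "DMT*": (34.78, 62.06, 62.98, 33.86, 48.42),
    "VidToMe": (20.03, 33.50, 36.20, 17.34, 26.77),
    "AnyV2V": (30.62, 45.42, 48.96, 27.09, 38.02),
    "VideoGrain†": (30.50, 43.97, 44.30, 30.17, 37.23),
    "Wan-Edit": (41.41, 52.53, 55.72, 38.22, 46.97),
    "Omni-Video-2 (Ours)": (63.77, 83.30, 85.99, 61.08, 73.53),
}


def mean_of_components(row):
    """Arithmetic mean of the four component scores (assumed aggregation)."""
    return sum(row[:4]) / 4


def max_deviation(table):
    """Largest gap between the component mean and the reported FiVE-Acc."""
    return max(abs(mean_of_components(row) - row[4]) for row in table.values())


def margin_over_best_baseline(table, ours="Omni-Video-2 (Ours)"):
    """FiVE-Acc gap between our row and the strongest other method."""
    best_baseline = max(row[4] for name, row in table.items() if name != ours)
    return round(table[ours][4] - best_baseline, 2)


if __name__ == "__main__":
    print("max |mean - reported|:", round(max_deviation(TABLE_1), 4))
    print("margin over best baseline:", margin_over_best_baseline(TABLE_1))
```

On these numbers, the strongest baseline by FiVE-Acc is DMT* at 48.42, so the reported margin for Omni-Video-2 is 25.11 points.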