Omni-Video

Democratizing Unified Video Understanding and Generation

1 Fudan University, 2 Shanghai Academy of Artificial Intelligence for Science

Equal Contribution, * Corresponding Author

Introduction

Recent breakthroughs in unified understanding and generation modeling have brought remarkable advances in image understanding, reasoning, generation, and editing. Yet current foundation models predominantly focus on images, leaving a gap in unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that serve as input to diffusion decoders, which in turn produce high-quality videos conditioned on these clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder, where the vision head produces visual tokens and the adapter maps these tokens into the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that quickly connects MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing, and understanding tasks.
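The data flow described above (MLLM hidden states, then a vision head, then an adapter, then the diffusion decoder's conditioning input) can be illustrated with a minimal NumPy sketch. This is not the Omni-Video implementation: the class names `VisionHead` and `Adapter`, all dimensions, the choice of a linear projection and a two-layer MLP, and the rule of taking the last few positions as visual tokens are assumptions made purely for illustration. The MLLM is replaced by random hidden states, and the diffusion decoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from Omni-Video.
MLLM_DIM = 64           # hidden size of the MLLM
VISUAL_DIM = 32         # width of the continuous visual tokens
COND_DIM = 48           # width of the diffusion decoder's conditioning space
NUM_VISUAL_TOKENS = 8   # number of visual tokens passed to the decoder


class Linear:
    """Plain affine layer: y = x @ W + b."""

    def __init__(self, d_in, d_out):
        self.w = rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))
        self.b = np.zeros(d_out)

    def __call__(self, x):
        return x @ self.w + self.b


class VisionHead:
    """Hypothetical vision head: projects MLLM hidden states to visual tokens."""

    def __init__(self):
        self.proj = Linear(MLLM_DIM, VISUAL_DIM)

    def __call__(self, hidden):
        # Assumption: the last NUM_VISUAL_TOKENS positions carry the visual tokens.
        return self.proj(hidden[-NUM_VISUAL_TOKENS:])


class Adapter:
    """Hypothetical adapter: two-layer MLP into the decoder's conditioning space."""

    def __init__(self):
        self.fc1 = Linear(VISUAL_DIM, COND_DIM)
        self.fc2 = Linear(COND_DIM, COND_DIM)

    def __call__(self, tokens):
        return self.fc2(np.tanh(self.fc1(tokens)))


def build_condition(hidden, head, adapter):
    """Map MLLM hidden states to the tensor a diffusion decoder would condition on."""
    visual_tokens = head(hidden)      # (NUM_VISUAL_TOKENS, VISUAL_DIM)
    return adapter(visual_tokens)     # (NUM_VISUAL_TOKENS, COND_DIM)


# Stand-in for the MLLM output on a 20-token sequence.
hidden_states = rng.normal(size=(20, MLLM_DIM))
cond = build_condition(hidden_states, VisionHead(), Adapter())
print(cond.shape)  # (8, 48)
```

In this sketch only the two small modules sit between the MLLM and the decoder, which is the sense in which the design is lightweight.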

Figure 1. Framework of Omni-Video

Text-to-Image Results

An image features a golden retriever playing in autumn leaves, tongue lolling. Crisp red and orange foliage blankets the ground, with a weathered wooden fence and distant misty mountains framing the scene.

An image depicts a dystopian, cyberpunk-inspired scene. In the foreground, a figure wearing an orange jacket with a blue and black helmet is standing, facing the viewer. The figure's helmet is adorned with a blue circular element. The buildings exhibit a dilapidated appearance, with visible signs of wear and rust.

A volcanic springs steaming in Icelandic winters. Azure pools steam among obsidian rocks where snowflakes vanish near bubbling geothermal vents.

A biplane soaring above cotton-cumulus clouds. Ragtime echoes from propeller buzz over patchwork landscapes where toy windmills spin on distant hillsides.

An image captures a woman wearing sunglasses and a jacket, standing on a boat and smiling. She is wearing a backpack, which is located on her back. The boat is floating on a river, and the water is visible in the background.

Antarctic icebergs sculpted by polar winds. Azure caverns glow in monumental ice cliffs floating through silver seas where penguins porpoise through waves.

A futuristic image featuring a woman with glowing yellow eyes. She is wearing a black and gold outfit, which is adorned with intricate patterns and lights. The woman has long hair, and her eyes are prominently glowing yellow. She is the main focus of the scene.

An image that captures the essence of retro chic. At the center of the frame is a woman exuding a sense of confidence and style.

A black and white photograph of a serene mountain lake surrounded by trees. The mountain can be seen in the distance, towering over the lake. The reflection of the mountain and its snowy peak can be seen clearly in the calm waters of the lake.

A large, well-lit bedroom with a bed positioned towards the right side of the room. The bed is adorned with a gold comforter, giving it a luxurious appearance. The room also features a chair on the left side and another chair closer to the center of the room. The overall atmosphere of the bedroom is elegant and inviting.

A large elephant standing in a field with trees and bushes. The elephant is positioned near the center of the scene, surrounded by a variety of trees and bushes. Some of the trees have no leaves, while others have a mix of green and brown leaves. The elephant appears to be grazing or exploring the area, possibly searching for food.

Text-to-Video Results

A video captures the majestic beauty of a mountain range, where the peaks are dusted with snow and the valleys are lush with greenery. The clouds, fluffy and white, are scattered across the sky, adding a sense of depth and dimension to the scene.

A beautiful girl with a powerful, confident stance at the peak of a mountain she has just climbed, looking out at the view. The style is triumphant and inspiring. The lighting is the clear, unfiltered light of high altitude. The overall impression is one of strength and achievement.

A rabbit hopping through a misty forest floor at dawn, its movements quiet and cautious. The style is atmospheric and mysterious, with the rabbit as a gentle guide through the scene. The lighting is the soft, diffused light of a foggy morning. The overall impression is one of magical, quiet exploration.

A video captures the process of pouring a liquid into a metal cup. The liquid appears to be a dark brown color, possibly a type of coffee or tea.

A video shows a man exploring a bustling, foreign marketplace, his face a mixture of curiosity and wonder. The style is adventurous and sensory. The lighting is a vibrant, chaotic mix of sunlight and stall lights. The overall impression is one of joyful cultural immersion.

A video shows a man's face breaking into a warm, genuine smile as he recognizes someone off-camera. The style is candid and heartwarming. The lighting is soft and natural. The overall impression is one of sudden, happy recognition.

A video shows massive, powerful ocean waves crashing against a dark, rugged cliff face, sending spray high into the air. The style is dramatic and awe-inspiring, emphasizing the raw power of the sea. The lighting is the moody, gray light of a stormy day. The overall impression is one of nature's untamable force.

A video shows a beautiful, intricate sea anemone, its tentacles waving gently in the current of a tide pool. The style is detailed and vibrant. The lighting is the clear, bright light of day. The overall impression is one of a living, breathing flower of the sea.

A video shows a baby's wide, fascinated eyes as they look up at a colorful mobile spinning above their crib. The style is wondrous and focused, capturing their developing senses. The lighting is soft and colorful, perhaps from the mobile itself. The overall impression is one of a whole new world opening up.

Video Editing Results

Add a sea.

Add a hat.

Add a hat.

Turn the mountain into trees.

Discard the path.

Eliminate the man.

Image Editing Results

Redesign this picture into tourism illustration

Add a red flower to the center of the cactus plant.

Add color to the photo

Wipe off the bird from the photo.

Change this photo into kyoto animation

Add more logs to the fire

Change the color of the water to turquoise

Change the color of the clouds to pink

Video Understanding Results

Omni-Video also handles video understanding: its multimodal large language model analyzes and interprets video content in detail.

Think Mode Results

Omni-Video's think mode enables step-by-step reasoning and analysis for complex video and image understanding tasks.