Omni-Video: Democratizing Unified Video Understanding and Generation

Introduction

Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, generation, and editing. Yet current foundation models predominantly focus on images, leaving a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that serve as the input to diffusion decoders, which in turn produce high-quality videos conditioned on these clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder, where the vision head produces visual tokens and the adapter maps these tokens into the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that quickly connects MLLMs and diffusion decoders using limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing, and understanding tasks.
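The data flow described above (MLLM hidden states, then a vision head, then an adapter, then the diffusion decoder's conditioning input) can be illustrated with a minimal shape-level sketch. Everything in it is an assumption made for illustration: the dimensions, the function names, the use of a single linear map for each module, and the stub that stands in for the diffusion decoder. It is not the report's implementation, and plain Python lists stand in for tensors.

```python
import random

# All dimensions are illustrative assumptions, not values from the report.
MLLM_HIDDEN = 8        # width of the MLLM's final hidden states
VISUAL_DIM = 6         # width of the continuous visual tokens
COND_DIM = 4           # width the diffusion decoder expects as conditioning
NUM_VISUAL_TOKENS = 3  # number of visual tokens produced per request


def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix standing in for a learned layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weight, tokens):
    """Apply the same linear map to every token vector in a sequence."""
    return [[sum(w * x for w, x in zip(row, token)) for row in weight]
            for token in tokens]


def vision_head(hidden_states, weight):
    """Hypothetical vision head: MLLM hidden states -> continuous visual tokens."""
    return apply_linear(weight, hidden_states)


def adapter(visual_tokens, weight):
    """Hypothetical adapter: visual tokens -> diffusion conditioning space."""
    return apply_linear(weight, visual_tokens)


def diffusion_decoder_stub(condition):
    """Placeholder for the diffusion decoder; it only reports the conditioning shape."""
    return {"num_condition_tokens": len(condition),
            "condition_dim": len(condition[0])}


rng = random.Random(0)

# Stand-in for the hidden states an MLLM would emit at its visual-token positions.
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(MLLM_HIDDEN)]
                 for _ in range(NUM_VISUAL_TOKENS)]

head_weight = make_linear(MLLM_HIDDEN, VISUAL_DIM, rng)
adapter_weight = make_linear(VISUAL_DIM, COND_DIM, rng)

visual_tokens = vision_head(hidden_states, head_weight)
condition = adapter(visual_tokens, adapter_weight)
summary = diffusion_decoder_stub(condition)
print(summary)
```

The point of the sketch is the interface: the MLLM and the diffusion decoder only meet through the two small modules in between, which is what makes the design lightweight to attach to existing models.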

Figure 1. Framework of Omni-Video

Text-to-Video Results (480p)

Text-to-Video Results (360p)

A video captures the majestic beauty of a mountain range, where the peaks are dusted with snow and the valleys are lush with greenery. The clouds, fluffy and white, are scattered across the sky, adding a sense of depth and dimension to the scene.

A beautiful girl with a powerful, confident stance at the peak of a mountain she has just climbed, looking out at the view. The style is triumphant and inspiring. The lighting is the clear, unfiltered light of high altitude. The overall impression is one of strength and achievement.

A rabbit hopping through a misty forest floor at dawn, its movements quiet and cautious. The style is atmospheric and mysterious, with the rabbit as a gentle guide through the scene. The lighting is the soft, diffused light of a foggy morning. The overall impression is one of magical, quiet exploration.

A video captures the process of pouring a liquid into a metal cup. The liquid appears to be a dark brown color, possibly a type of coffee or tea.

A video shows a man exploring a bustling, foreign marketplace, his face a mixture of curiosity and wonder. The style is adventurous and sensory. The lighting is a vibrant, chaotic mix of sunlight and stall lights. The overall impression is one of joyful cultural immersion.

A video shows a man's face breaking into a warm, genuine smile as he recognizes someone off-camera. The style is candid and heartwarming. The lighting is soft and natural. The overall impression is one of sudden, happy recognition.

A video shows massive, powerful ocean waves crashing against a dark, rugged cliff face, sending spray high into the air. The style is dramatic and awe-inspiring, emphasizing the raw power of the sea. The lighting is the moody, gray light of a stormy day. The overall impression is one of nature's untamable force.

A video shows a beautiful, intricate sea anemone, its tentacles waving gently in the current of a tide pool. The style is detailed and vibrant. The lighting is the clear, bright light of day. The overall impression is one of a living, breathing flower of the sea.

A video shows a baby's wide, fascinated eyes as they look up at a colorful mobile spinning above their crib. The style is wondrous and focused, capturing their developing senses. The lighting is soft and colorful, perhaps from the mobile itself. The overall impression is one of a whole new world opening up.

Video Editing Results (Group 1)

Add a person sitting on a bench.

Replace the panda with a human.

Replace the boat with a yacht.

Replace the kite with a bird.

Change the skier's jacket from green to yellow.

Add a hot air balloon floating above the clouds.

Video Editing Results (Group 2)

Add a sea.

Add a hat.

Turn the mountain into trees.

Discard the path.

Eliminate the man.

Text-to-Image Results

An image features a golden retriever playing in autumn leaves, tongue lolling. Crisp red and orange foliage blankets the ground, with a weathered wooden fence and distant misty mountains framing the scene.

An image depicts a dystopian, cyberpunk-inspired scene. In the foreground, a figure wearing an orange jacket with a blue and black helmet is standing, facing the viewer. The figure's helmet is adorned with a blue circular element. The buildings exhibit a dilapidated appearance, with visible signs of wear and rust.

A volcanic springs steaming in Icelandic winters. Azure pools steam among obsidian rocks where snowflakes vanish near bubbling geothermal vents.

A biplane soaring above cotton-cumulus clouds. Ragtime echoes from propeller buzz over patchwork landscapes where toy windmills spin on distant hillsides.

An image captures a woman wearing sunglasses and a jacket, standing on a boat and smiling. She is wearing a backpack, which is located on her back. The boat is floating on a river, and the water is visible in the background.

Antarctic icebergs sculpted by polar winds. Azure caverns glow in monumental ice cliffs floating through silver seas where penguins porpoise through waves.

A futuristic image featuring a woman with glowing yellow eyes. She is wearing a black and gold outfit, which is adorned with intricate patterns and lights. The woman has long hair, and her eyes are prominently glowing yellow. She is the main focus of the scene.

An image that captures the essence of retro chic. At the center of the frame is a woman exuding a sense of confidence and style.

A black and white photograph of a serene mountain lake surrounded by trees. The mountain can be seen in the distance, towering over the lake. The reflection of the mountain and its snowy peak can be seen clearly in the calm waters of the lake.

A large, well-lit bedroom with a bed positioned towards the right side of the room. The bed is adorned with a gold comforter, giving it a luxurious appearance. The room also features a chair on the left side and another chair closer to the center of the room. The overall atmosphere of the bedroom is elegant and inviting.

A large elephant standing in a field with trees and bushes. The elephant is positioned near the center of the scene, surrounded by a variety of trees and bushes. Some of the trees have no leaves, while others have a mix of green and brown leaves. The elephant appears to be grazing or exploring the area, possibly searching for food.

Image Editing Results

Source

Edited

Redesign this picture into tourism illustration

Add a red flower to the center of the cactus plant.

Add color to the photo

Wipe off the bird from the photo.

Change this photo into kyoto animation

Add more logs to the fire

Change the color of the water to turquoise

Change the color of the clouds to pink

Video Understanding Results

Omni-Video uses its multimodal large language model to analyze and interpret video content in detail.

Think Mode Results

Omni-Video's think mode applies step-by-step reasoning to complex video and image understanding tasks.