92 papers - avg viability 6.7
Recent work in generative video focuses on realism and interactivity, addressing key challenges in visual effects, human-object interaction, and autonomous systems. New frameworks such as EffectMaker and GenHOI streamline the creation of customized visual effects and improve hand-object interaction consistency, respectively, by combining multimodal models with advanced attention mechanisms. FAR-Drive pioneers closed-loop video generation for autonomous driving, supporting real-time interaction and consistency across multiple camera views, while AVControl enables efficient training of audio-visual controls, making it easier to integrate diverse modalities without extensive architectural changes. These developments raise the quality and realism of generated video and carry significant commercial implications in entertainment, gaming, and autonomous technologies, where demand for immersive, interactive experiences is growing rapidly. The field is clearly shifting toward more scalable and flexible solutions that prioritize user control and contextual relevance.
CounterVid enhances video-language models by generating counterfactual videos to reduce action and temporal hallucinations.
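CounterVid's exact pipeline isn't detailed here; as an illustration of the counterfactual idea, the sketch below builds action-swapped and temporally reversed caption negatives for probing a video-language model (all names are hypothetical, not CounterVid's API):

```python
# Minimal sketch of counterfactual probe construction (hypothetical names;
# not CounterVid's actual pipeline). Given a caption describing an ordered
# pair of actions, build two hard negatives: one with a wrong action
# (action-hallucination probe) and one with the temporal order reversed
# (temporal-hallucination probe).

from dataclasses import dataclass

@dataclass
class ProbeSet:
    positive: str       # caption matching the real video
    action_swap: str    # same structure, wrong action
    time_reverse: str   # same actions, wrong order

def build_probes(first: str, second: str, distractor: str) -> ProbeSet:
    template = "The person {a} and then {b}."
    return ProbeSet(
        positive=template.format(a=first, b=second),
        action_swap=template.format(a=distractor, b=second),
        time_reverse=template.format(a=second, b=first),
    )

probes = build_probes("opens the door", "sits down", "closes the window")
print(probes.positive)      # The person opens the door and then sits down.
print(probes.time_reverse)  # The person sits down and then opens the door.
```

A model that scores the positive caption above both negatives for the matching video is less likely to be hallucinating actions or their order.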
FAR-Drive is a closed-loop video generation framework for autonomous driving that ensures high fidelity and low latency.
Synthesize anatomically plausible and behaviorally rich facial expressions from natural language descriptions of Action Units, overcoming limitations of existing text-to-face models.
A novel framework for generating high-fidelity egocentric videos using sparse 3D hand joints for motion control.
MotionGrounder is a Diffusion Transformer framework enabling multi-object motion transfer with fine-grained control, grounding captions to specific objects in generated videos.
EffectMaker is a unified reasoning-generation framework for reference-based VFX customization, offering a scalable and flexible paradigm for generating custom effects.
A unified framework for conditional color grading of images that bridges words and colors, producing visually pleasing and stylistically coherent results aligned with human aesthetics.
A lightweight, extendable framework for efficient audio-visual control in video generation, enabling modular training of diverse modalities with minimal architectural changes.
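The modular-control idea can be sketched as follows, assuming an AVControl-style design in which each modality plugs into a frozen backbone through a small adapter (the class names and zero-initialization scheme here are illustrative assumptions, not the paper's actual architecture):

```python
# Sketch of pluggable per-modality adapters over a frozen backbone
# (hypothetical design; AVControl's actual architecture may differ).
# Each adapter's output projection is zero-initialized, so a freshly
# added control branch leaves the backbone's features untouched until
# it is trained -- no architectural changes to the backbone needed.

import numpy as np

class ModalityAdapter:
    def __init__(self, in_dim: int, hidden_dim: int):
        rng = np.random.default_rng(0)
        self.w_in = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.w_out = np.zeros((hidden_dim, hidden_dim))  # zero init: no-op at start

    def __call__(self, features: np.ndarray) -> np.ndarray:
        return np.maximum(features @ self.w_in, 0.0) @ self.w_out

class ControllableBackbone:
    """Stand-in for a frozen video backbone; adapters add residual control."""
    def __init__(self, dim: int):
        self.dim = dim
        self.adapters: dict[str, ModalityAdapter] = {}

    def add_modality(self, name: str, in_dim: int) -> None:
        self.adapters[name] = ModalityAdapter(in_dim, self.dim)

    def forward(self, x: np.ndarray, controls: dict[str, np.ndarray]) -> np.ndarray:
        out = x.copy()  # stands in for the frozen backbone computation
        for name, signal in controls.items():
            out = out + self.adapters[name](signal)
        return out

model = ControllableBackbone(dim=8)
model.add_modality("audio", in_dim=4)
x = np.ones((2, 8))
y = model.forward(x, {"audio": np.ones((2, 4))})
# Zero-initialized adapter: output equals the backbone features exactly.
assert np.allclose(y, x)
```

Because an untrained adapter is a no-op, new modalities can be trained one at a time without disturbing the backbone or previously trained controls.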
Generate controllable human motion videos from text using a cascaded text-to-skeleton and pose-conditioned diffusion model, with a new synthetic dataset to address the lack of training data.
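The cascaded design described above can be sketched as a two-stage pipeline; the stand-in functions below are hypothetical placeholders for the text-to-skeleton and pose-conditioned diffusion models, showing only the interface between stages:

```python
# Sketch of a cascaded text-to-motion-video pipeline (function names and
# shapes are illustrative assumptions, not the paper's implementation).
# Stage 1 maps text to a per-frame skeleton; stage 2 renders video
# conditioned on that skeleton.

import numpy as np

N_JOINTS = 17  # e.g. a COCO-style skeleton

def text_to_skeleton(prompt: str, n_frames: int) -> np.ndarray:
    """Stage 1 stand-in: text -> per-frame 2D joint positions, shape (T, J, 2)."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.uniform(0.0, 1.0, size=(n_frames, N_JOINTS, 2))

def pose_conditioned_video(skeleton: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: pose-conditioned diffusion -> RGB frames (T, H, W, 3)."""
    t = skeleton.shape[0]
    return np.zeros((t, 64, 64, 3), dtype=np.float32)

def generate(prompt: str, n_frames: int = 16) -> np.ndarray:
    return pose_conditioned_video(text_to_skeleton(prompt, n_frames))

video = generate("a person waves", n_frames=16)
assert video.shape == (16, 64, 64, 3)
```

Splitting the problem this way lets the skeleton stage be trained on (possibly synthetic) text-motion pairs, while the rendering stage only ever sees pose-video pairs.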
GenHOI enhances video generation models with object-consistent hand-object interaction by injecting reference object information, outperforming existing methods in in-the-wild scenarios.