MixDPO is a unified Direct Preference Optimization (DPO) method that jointly uses textual and visual preferences to fine-tune video-language models (VLMs). It improves VLM robustness against hallucinations, particularly in action recognition and temporal reasoning, by leveraging synthetic counterfactual video data.
In plain terms, MixDPO is a method that helps video-language AI models (VLMs) understand videos better and make fewer mistakes, especially when describing actions and their timing. It trains these models on both text descriptions and actual visual examples of what is right and wrong, making them more reliable.
Mixed-modality Direct Preference Optimization
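To make the "preference optimization" part concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, which MixDPO builds on. This is an illustrative implementation of the generic DPO objective only, not the actual MixDPO code; the function name `dpo_loss` and all arguments are hypothetical, and the log-probabilities would in practice come from a policy model and a frozen reference model scoring the preferred and rejected responses.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function used inside the DPO objective."""
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Each argument is a summed log-probability of a response under the
    policy (logp_*) or the frozen reference model (ref_logp_*).
    beta scales the implicit reward margin.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Maximize the probability that chosen beats rejected.
    return -math.log(sigmoid(margin))

# Illustrative use: when policy and reference agree, the loss is log(2);
# when the policy already prefers the chosen response, the loss is lower.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)          # margin = 0
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0)  # margin > 0
```

Under the "mixed-modality" idea described above, preference pairs of both kinds (text-based pairs, and pairs built from real versus synthetic counterfactual videos) would be pooled and averaged into one training objective, though the exact weighting scheme is not specified here.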