MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

1University of California, Santa Cruz, 2University of California, Santa Barbara, 3Microsoft
* Equal Contribution

†Correspondence to: xhe89@ucsc.edu, xwang366@ucsc.edu

Figure 1. MMWorld covers seven broad disciplines and 69 subdisciplines, focusing on the evaluation of multi-faceted reasoning beyond perception (e.g., explanation, counterfactual thinking, future prediction, domain expertise). On the right is a video sample from the Health & Medicine discipline.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models": interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline coverage, spanning disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, and more. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings, such as models exhibiting different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

Dataset Characteristics

Figure 2. Comparison between MMWorld and previous benchmarks for real-world video understanding across a variety of criteria. Multi-faceted reasoning includes Explanation (Explain.), Counterfactual Thinking (Counter.), Future Prediction (Future.), and Domain Expertise (Domain.). MMWorld is the first multi-discipline, multitask video understanding benchmark covering a wider range of reasoning questions, and it also includes first-party data annotations.

Question Types Distribution

Figure 3. The questions in MMWorld primarily evaluate seven understanding and reasoning abilities that models need in order to answer correctly.

Synthetic Data Generation Pipeline

Figure 4. Schematic diagram of the synthetic data generation pipeline in MMWorld. It starts with generating subdiscipline-specific queries, followed by video retrieval from YouTube-8M and YouTube. Keyframes are extracted for visual-based QA generation, and videos are transcribed using an ASR module for audio-based QA generation.
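To make the pipeline in Figure 4 concrete, below is a minimal Python sketch of how such a synthetic QA pipeline could be wired together. Every function in it (generate_queries, retrieve_videos, extract_keyframes, transcribe, and the two QA generators) is a hypothetical placeholder standing in for the paper's actual components, which are not reproduced here.

```python
# Hypothetical sketch of the MMWorld synthetic data pipeline described in
# Figure 4. All function bodies are placeholders; the authors' actual
# retrieval sources, models, and prompts are not shown here.

from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    modality: str  # "visual" or "audio"

def generate_queries(subdiscipline: str) -> list[str]:
    """Turn a subdiscipline (e.g. 'Cardiology') into search queries."""
    return [f"{subdiscipline} explained", f"{subdiscipline} tutorial"]

def retrieve_videos(query: str) -> list[str]:
    """Placeholder: search YouTube-8M / YouTube and return video IDs."""
    return []

def extract_keyframes(video_id: str) -> list[bytes]:
    """Placeholder: sample keyframes, e.g. via scene-change detection."""
    return []

def transcribe(video_id: str) -> str:
    """Placeholder: run an ASR model on the audio track."""
    return ""

def qa_from_frames(frames: list[bytes]) -> list[QAPair]:
    """Placeholder: prompt a QA generator with the extracted keyframes."""
    return []

def qa_from_transcript(transcript: str) -> list[QAPair]:
    """Placeholder: prompt a QA generator with the ASR transcript."""
    return []

def build_synthetic_split(subdisciplines: list[str]) -> list[QAPair]:
    """Chain the stages: queries -> videos -> visual and audio QA pairs."""
    pairs: list[QAPair] = []
    for sub in subdisciplines:
        for query in generate_queries(sub):
            for vid in retrieve_videos(query):
                pairs += qa_from_frames(extract_keyframes(vid))  # visual-based QA
                pairs += qa_from_transcript(transcribe(vid))     # audio-based QA
    return pairs
```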

Study on MLLM Performance at Different Difficulty Levels for Average Humans

Figure 5. Model performance at different difficulty levels for average humans. Difficulty levels are defined by the performance of three turkers per question: Easy (3/3 correct answers), Medium (2/3 correct), Hard (1/3 correct), and Expert (0/3 correct).
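The bucketing rule above is simple enough to state as code; the snippet below is a direct, hypothetical encoding of the Figure 5 definition (three annotators per question, level determined by the count of correct answers).

```python
# Direct encoding of the difficulty rule in Figure 5: bucket each question
# by how many of its three human annotators answered it correctly.

def difficulty_level(num_correct: int) -> str:
    assert 0 <= num_correct <= 3, "each question has exactly three annotators"
    return {3: "easy", 2: "medium", 1: "hard", 0: "expert"}[num_correct]
```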


Leaderboard

Below we show the current MMWorld leaderboard, based on results on the human-annotated dataset.

| Model | Art & Sports | Business | Science | Health & Medicine | Embodied Tasks | Tech & Engineering | Game | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Choice | 25.03 | 25.09 | 26.44 | 25.00 | 26.48 | 30.92 | 25.23 | 26.31 |
| GPT-4o | 47.87 | 91.14 | 73.78 | 83.33 | 62.94 | 75.53 | 80.32 | 62.54 |
| Claude 3.5 Sonnet | 54.58 | 63.87 | 59.85 | 54.51 | 30.99 | 58.87 | 59.44 | 54.54 |
| GPT-4V | 36.17 | 81.59 | 66.52 | 73.61 | 55.48 | 61.35 | 73.49 | 52.30 |
| Gemini 1.5 Pro | 37.12 | 76.69 | 62.81 | 76.74 | 43.59 | 69.86 | 66.27 | 51.02 |
| Video-LLaVA-7B | 35.91 | 51.28 | 56.30 | 32.64 | 63.17 | 58.16 | 49.00 | 44.60 |
| Video-Chat-7B | 39.53 | 51.05 | 30.81 | 46.18 | 40.56 | 39.36 | 44.98 | 40.11 |
| ChatUnivi-7B | 24.47 | 60.84 | 52.00 | 61.11 | 46.15 | 56.74 | 52.61 | 39.47 |
| mPLUG-Owl-7B | 29.16 | 64.10 | 47.41 | 60.07 | 23.78 | 41.84 | 62.25 | 38.94 |
| VideoChatGPT-7B | 26.84 | 39.16 | 36.45 | 53.12 | 36.60 | 41.49 | 36.55 | 33.27 |
| PandaGPT-7B | 25.33 | 42.66 | 39.41 | 38.54 | 35.43 | 41.84 | 40.16 | 32.48 |
| ImageBind-LLM-7B | 24.82 | 42.66 | 32.15 | 30.21 | 46.85 | 41.49 | 41.37 | 31.75 |
| X-Instruct-BLIP-7B | 21.08 | 15.85 | 22.52 | 28.47 | 18.41 | 22.34 | 26.10 | 21.36 |
| LWM-1M-JAX | 12.04 | 17.48 | 15.41 | 20.49 | 25.87 | 21.99 | 11.65 | 15.39 |
| Otter-7B | 17.12 | 18.65 | 9.33 | 6.94 | 13.29 | 15.96 | 15.26 | 14.99 |
| Video-LLaMA-2-13B | 6.15 | 21.21 | 22.22 | 31.25 | 15.38 | 19.15 | 24.90 | 14.03 |
Proprietary models: GPT-4o, Claude 3.5 Sonnet, GPT-4V, and Gemini 1.5 Pro; all other entries are open-source.

More models are coming.
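As a reading aid for the table above: the Average column appears to be a micro-average over all questions pooled, rather than an unweighted mean of the per-discipline scores, which is why it can sit below most per-discipline numbers when one large discipline (here Art & Sports) dominates the question count. The following Python sketch, under that assumption and with a hypothetical (discipline, is_correct) result schema, shows how such a leaderboard row could be computed.

```python
from collections import defaultdict

# Hypothetical sketch of how a leaderboard row could be derived from raw
# per-question results. The (discipline, is_correct) schema is assumed for
# illustration; it is not the released evaluation code.

def leaderboard_row(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: one (discipline, is_correct) pair per evaluated question."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, asked]
    for discipline, correct in results:
        totals[discipline][0] += int(correct)
        totals[discipline][1] += 1
    # Per-discipline accuracy, in percent.
    row = {d: 100.0 * c / n for d, (c, n) in totals.items()}
    # Micro-average: pool all questions, so larger disciplines weigh more.
    correct_all = sum(c for c, _ in totals.values())
    asked_all = sum(n for _, n in totals.values())
    row["Average"] = 100.0 * correct_all / asked_all
    return row
```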

BibTeX


@article{he2024mmworld,
  title   = {MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos},
  author  = {Xuehai He and Weixi Feng and Kaizhi Zheng and Yujie Lu and Wanrong Zhu and Jiachen Li and Yue Fan and Jianfeng Wang and Linjie Li and Zhengyuan Yang and Kevin Lin and William Yang Wang and Lijuan Wang and Xin Eric Wang},
  year    = {2024},
  journal = {arXiv preprint arXiv:2406.08407}
}