In the Eye of MLLM

Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

1State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University   
2College of AI, Tsinghua University
NeurIPS D&B 2025

Abstract

The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Egocentric videos, which directly capture user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent.

To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of 1,757 gaze-based QA pairs from 913 videos, generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues.

EgoGazeVQA Introduction

We propose EgoGazeVQA, the first MLLM benchmark that incorporates essential gaze signals for understanding user intent in egocentric settings. We present examples of Spatial, Temporal, and Causal Intent QA, demonstrating how gaze information improves MLLMs' performance. Radar charts compare model performance across different scenarios and activities, showing consistent gains with our gaze-guided prompting strategy.

Method

Construction Pipeline

Construction pipeline of EgoGazeVQA. We construct the benchmark in three stages. Stage 1: Egocentric video clips are processed to extract frame captions and gaze coordinates that capture user focus. Stage 2: An MLLM generates spatial/temporal-aware and intention-related Q&A pairs using a customized prompt. Stage 3: Human annotators manually review the generated Q&A pairs along several quality dimensions to ensure high-quality data.
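
As a rough illustration of how these stages fit together, the sketch below assembles a Stage 2 generation prompt from Stage 1 outputs. The dataclass, helper names, and prompt wording are hypothetical stand-ins for illustration, not the exact pipeline or prompts used to build EgoGazeVQA.

# Minimal sketch of Stages 1-2: per-frame captions and gaze coordinates are
# packed into a customized prompt for an MLLM to draft intent QA pairs.
# (Names, prompt text, and data layout are illustrative assumptions.)
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameRecord:
    timestamp: float              # seconds into the clip
    caption: str                  # frame caption produced in Stage 1
    gaze_xy: Tuple[float, float]  # normalized gaze coordinates in [0, 1]

def build_qa_generation_prompt(frames: List[FrameRecord]) -> str:
    """Stage 2: assemble the prompt asking an MLLM to write
    spatial-, temporal-, and intent-related QA pairs grounded in gaze."""
    lines = [
        "You are given per-frame captions and the camera wearer's gaze point.",
        "Write multiple-choice QA pairs about the wearer's spatial, temporal,",
        "and causal intent, grounded in what the gaze is fixating on.",
    ]
    for f in frames:
        lines.append(
            f"t={f.timestamp:.1f}s  gaze=({f.gaze_xy[0]:.2f}, {f.gaze_xy[1]:.2f})  caption: {f.caption}"
        )
    return "\n".join(lines)

# Toy usage; Stage 3 (human review of the generated pairs) happens offline.
frames = [
    FrameRecord(1.0, "a hand reaches toward a kettle on the stove", (0.41, 0.55)),
    FrameRecord(3.5, "the kettle is lifted over a mug", (0.62, 0.48)),
]
print(build_qa_generation_prompt(frames))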

Gaze-Guided Prompting Strategies

Gaze-guided prompting strategies in EgoGazeVQA. We experiment with three gaze-guided prompting strategies: (Left) Gaze as Textual Prompt - gaze coordinates are provided as text inputs to guide model responses; (Center) Gaze as Visual Prompt - gaze points are highlighted directly on video frames; (Right) Gaze Salience Maps as Prompt - heatmaps of gaze trajectories provide contextual cues for understanding spatial and temporal intent.
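
The snippet below sketches what each strategy could look like in code. The marker radius, Gaussian sigma, colors, and normalization are assumptions chosen for illustration, not the settings used in the paper.

# Illustrative implementations of the three gaze prompting strategies.
# (Marker size, sigma, and normalization choices are assumptions.)
import numpy as np
import cv2

def gaze_as_text(gaze_xy, width, height):
    """Gaze as Textual Prompt: state the gaze location in pixel coordinates."""
    x, y = int(gaze_xy[0] * width), int(gaze_xy[1] * height)
    return f"The wearer's gaze falls at pixel ({x}, {y}) of the {width}x{height} frame."

def gaze_as_visual_prompt(frame, gaze_xy, radius=12):
    """Gaze as Visual Prompt: draw the gaze point directly on the frame."""
    h, w = frame.shape[:2]
    center = (int(gaze_xy[0] * w), int(gaze_xy[1] * h))
    marked = frame.copy()
    cv2.circle(marked, center, radius, color=(0, 0, 255), thickness=3)
    return marked

def gaze_salience_map(frame_shape, gaze_track, sigma=25.0):
    """Gaze Salience Map: accumulate a Gaussian heatmap over the gaze trajectory."""
    h, w = frame_shape[:2]
    heat = np.zeros((h, w), dtype=np.float32)
    for gx, gy in gaze_track:
        heat[min(int(gy * h), h - 1), min(int(gx * w), w - 1)] += 1.0
    heat = cv2.GaussianBlur(heat, ksize=(0, 0), sigmaX=sigma)
    return heat / (heat.max() + 1e-8)  # normalized to [0, 1]

# Toy usage on a blank frame and a short gaze trajectory.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
track = [(0.40, 0.55), (0.45, 0.52), (0.62, 0.48)]
print(gaze_as_text(track[-1], 640, 480))
overlay = gaze_as_visual_prompt(frame, track[-1])
salience = gaze_salience_map(frame.shape, track)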

Key Contributions

  • First Gaze-Guided VQA Benchmark: EgoGazeVQA is the first egocentric video QA benchmark that leverages gaze information to evaluate MLLMs' understanding of human intentions in daily-life videos
  • Three Gaze-Guided Prompting Strategies: We introduce Gaze as Textual Prompt, Gaze as Visual Prompt, and Sequential Gaze Salience Maps to enhance MLLMs' ability to interpret spatial, temporal, and intent-related cues
  • Comprehensive Evaluation: Extensive experiments on 7 state-of-the-art MLLMs (GPT-4o mini, Gemini 2.0, Qwen2.5-VL, InternVL2.5, etc.) demonstrate significant performance improvements with gaze guidance
  • Fine-tuning Analysis: Empirical study showing how LoRA fine-tuning enhances models' ability to leverage gaze signals and how gaze estimation accuracy affects prompting effectiveness (see the sketch after this list)
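
As a rough sketch of the fine-tuning setup referenced in the last item, the snippet below wires LoRA adapters into a Hugging Face model with the peft library. The stand-in text-only checkpoint, target modules, and hyperparameters are placeholder assumptions, not the paper's configuration.

# LoRA fine-tuning sketch with Hugging Face peft (hyperparameters and the
# stand-in text-only checkpoint are assumptions; the paper fine-tunes MLLMs).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2.5-7B-Instruct"          # placeholder base checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,                                     # low-rank adapter dimension
    lora_alpha=32,                            # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # only the adapters are trainable
# From here, train on gaze-augmented QA prompts with a standard Trainer loop.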

Experimental Results

Main Results on EgoGazeVQA Benchmark

Method           Strategy      Spatial  Temporal  Causal  Average
Human            -             80.7     75.6      95.2    83.8
Qwen2.5-VL-72B   w/o gaze      57.1     45.2      79.3    60.5
Qwen2.5-VL-72B   Textual       60.0     50.7      84.2    65.0
Qwen2.5-VL-72B   Visual        59.7     48.1      84.1    63.9
Qwen2.5-VL-72B   Salience Map  64.3     50.3      84.3    66.3

All values are accuracy (%).

Key Findings: The best-performing configuration (Qwen2.5-VL-72B + Salience Map) achieves 66.3% average accuracy, a +5.8-point improvement over the gaze-free baseline (60.5%). However, a significant gap remains relative to human performance (83.8%), highlighting both the difficulty of the benchmark and the room for improvement in gaze-guided egocentric video understanding.
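
For readers checking the numbers, the reported averages are consistent with an unweighted mean over the three per-category accuracies (our reading of the table, not a stated formula):

# Sanity check of the reported averages, assuming an unweighted mean over
# the Spatial / Temporal / Causal accuracies.
baseline = (57.1 + 45.2 + 79.3) / 3      # w/o gaze      -> ~60.5
salience = (64.3 + 50.3 + 84.3) / 3      # salience map  -> 66.3
print(round(baseline, 1), round(salience, 1), round(salience - baseline, 1))
# 60.5 66.3 5.8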

Qualitative Results

Qualitative Examples

Visual examples from our EgoGazeVQA benchmark and gaze-guided prompting. The successful examples show how gaze signals substantially improve MLLMs' performance on tasks demanding precise spatial reasoning and intent interpretation, such as distinguishing closely situated objects in a kitchen and accurately inferring user focus in complex scenes. The challenging cases reveal limitations when drastic body motion or gaze saccades mislead the model.

Citation

@misc{peng2025eyemllmbenchmarkingegocentric,
    title={In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting}, 
    author={Taiying Peng and Jiacheng Hua and Miao Liu and Feng Lu},
    year={2025},
    eprint={2509.07447},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.07447}
}